✨ TL;DR
This paper benchmarks cloud and local large language models on two System Dynamics tasks: extracting causal loop diagrams and providing interactive coaching. The best local models match mid-tier cloud performance on diagram extraction (77%) but struggle with long-context error-fixing tasks, with backend implementation choices mattering more than quantization levels.
System Dynamics modeling requires AI assistants that can extract structured causal relationships from text and provide interactive coaching on model building. While cloud-based LLMs offer strong performance, practitioners need to understand whether locally-hosted open-source models can provide comparable assistance, especially given privacy, cost, and deployment constraints. Existing evaluations have not systematically compared cloud versus local LLM performance on domain-specific System Dynamics tasks, nor have they examined how technical implementation choices (backend frameworks, quantization levels, model architectures) affect practical performance on these specialized tasks.
The authors created two purpose-built benchmarks: the CLD Leaderboard with 53 tests for structured causal loop diagram extraction, and the Discussion Leaderboard for evaluating interactive model discussion, feedback explanation, and coaching capabilities. They systematically evaluated multiple LLM families including proprietary cloud APIs and locally-hosted open-source models. For local models, they conducted extensive parameter sweeps across inference backends (llama.cpp GGUF vs. mlx_lm MLX), quantization levels (Q3, Q4_K_M, MLX-3bit, MLX-4bit, MLX-6bit), model architectures (reasoning vs. instruction-tuned), and sampling parameters (temperature, top-p, top-k). All experiments were run on Apple Silicon hardware with models ranging from 67B to 123B parameters, with careful documentation of timing data and exclusion of stuck requests.
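As a rough illustration of what such a sweep involves, the grid of local configurations can be enumerated as below. This is a minimal sketch, not the authors' harness: the pairing of quantization levels with backends is inferred from the naming, and the sampling values and the `sweep_configs` helper are hypothetical placeholders.

```python
from itertools import product

# Assumption: GGUF quantizations run on llama.cpp and MLX quantizations on mlx_lm.
BACKEND_QUANTS = {
    "llama.cpp": ["Q3", "Q4_K_M"],
    "mlx_lm": ["MLX-3bit", "MLX-4bit", "MLX-6bit"],
}

# Hypothetical sampling values, chosen only to illustrate the grid.
TEMPERATURES = [0.0, 0.7]
TOP_PS = [0.9, 1.0]
TOP_KS = [20, 40]


def sweep_configs():
    """Return one dict per (backend, quantization, sampling) combination."""
    configs = []
    for backend, quants in BACKEND_QUANTS.items():
        for quant, temp, top_p, top_k in product(quants, TEMPERATURES, TOP_PS, TOP_KS):
            configs.append(
                {
                    "backend": backend,
                    "quantization": quant,
                    "temperature": temp,
                    "top_p": top_p,
                    "top_k": top_k,
                }
            )
    return configs


configs = sweep_configs()
print(f"{len(configs)} configurations to evaluate")
```

Keeping the backend-to-quantization mapping explicit avoids generating combinations that cannot run, such as a GGUF quantization on the MLX backend.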
What the paper shows.
Cloud models achieved 77-89% overall pass rates on CLD extraction tasks. The best local model (Kimi K2.5 GGUF Q3 zero-shot) reached 77% on CLD extraction, matching mid-tier cloud performance. On the Discussion Leaderboard, local models achieved 50-100% success on model-building steps and 47-75% on feedback-explanation tasks. However, they achieved only 0-50% on error-fixing tasks, which involve long-context prompts. The systematic parameter sweep across backends, quantization levels, and sampling parameters provided detailed performance profiles for 67B-123B parameter models running on Apple Silicon hardware.
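Per-category figures like those above are pass rates: tests passed divided by tests run in each category. The sketch below shows that arithmetic under stated assumptions; the `pass_rates` helper, the category names, and the sample records are all hypothetical and are not data from the paper.

```python
from collections import defaultdict


def pass_rates(results):
    """Compute the fraction of passed tests per category.

    `results` is an iterable of (category, passed) pairs, where `passed`
    is a bool. The record layout is an assumption made for this sketch.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for category, ok in results:
        totals[category] += 1
        if ok:
            passed[category] += 1
    return {category: passed[category] / totals[category] for category in totals}


# Illustrative records only, not results reported in the paper.
sample = [
    ("model_building", True),
    ("model_building", True),
    ("feedback_explanation", True),
    ("feedback_explanation", False),
    ("error_fixing", False),
    ("error_fixing", False),
]

rates = pass_rates(sample)
for category, rate in rates.items():
    print(f"{category}: {rate:.0%}")
```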
The study is limited to Apple Silicon hardware deployments, which may not generalize to other local deployment scenarios (NVIDIA GPUs, AMD, etc.). The benchmarks are specific to System Dynamics tasks and may not reflect performance on other domain-specific applications. Local models struggle significantly with long-context error-fixing tasks due to memory constraints, limiting their practical utility for complete interactive coaching workflows. The evaluation does not address fine-tuning potential, focusing only on zero-shot and few-shot performance. Timing data required cleaning to exclude stuck requests, indicating reliability issues with certain model-backend-prompt combinations that may affect production deployments.
✨ Generated by Claude · Apr 21, 2026 · Read the PDF for authoritative content.