Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion

Terry Leitch

✨ TL;DR

This paper benchmarks cloud and local large language models on two System Dynamics tasks: extracting causal loop diagrams and providing interactive coaching. The best local models match mid-tier cloud performance on diagram extraction (77%) but struggle with long-context error-fixing tasks, with backend implementation choices mattering more than quantization levels.

01 · Problem

System Dynamics modeling requires AI assistants that can extract structured causal relationships from text and provide interactive coaching on model building. While cloud-based LLMs offer strong performance, practitioners need to understand whether locally-hosted open-source models can provide comparable assistance, especially given privacy, cost, and deployment constraints. Existing evaluations have not systematically compared cloud versus local LLM performance on domain-specific System Dynamics tasks, nor have they examined how technical implementation choices (backend frameworks, quantization levels, model architectures) affect practical performance on these specialized tasks.

02 · Approach

The authors created two purpose-built benchmarks: the CLD Leaderboard with 53 tests for structured causal loop diagram extraction, and the Discussion Leaderboard for evaluating interactive model discussion, feedback explanation, and coaching capabilities. They systematically evaluated multiple LLM families including proprietary cloud APIs and locally-hosted open-source models. For local models, they conducted extensive parameter sweeps across inference backends (llama.cpp GGUF vs. mlx_lm MLX), quantization levels (Q3, Q4_K_M, MLX-3bit, MLX-4bit, MLX-6bit), model architectures (reasoning vs. instruction-tuned), and sampling parameters (temperature, top-p, top-k). All experiments were run on Apple Silicon hardware with models ranging from 67B to 123B parameters, with careful documentation of timing data and exclusion of stuck requests.
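The parameter sweep described above can be pictured as a grid over backend, quantization, architecture, and sampling settings. The following is a minimal sketch, not the authors' harness: the pairing of quantization levels with backends is inferred from the naming, and the temperature, top-p, and top-k values are illustrative placeholders rather than values reported in the paper.

```python
from itertools import product

# Assumed pairing of quantization levels to backends (inferred from the names).
BACKEND_QUANTS = {
    "llama.cpp": ["Q3", "Q4_K_M"],
    "mlx_lm": ["MLX-3bit", "MLX-4bit", "MLX-6bit"],
}
ARCHITECTURES = ["reasoning", "instruction-tuned"]

# Placeholder sampling values; the summary does not list the ones actually used.
TEMPERATURES = [0.0, 0.7]
TOP_PS = [0.9, 1.0]
TOP_KS = [20, 40]


def sweep_configs():
    """Yield one configuration dict per valid combination of settings."""
    for backend, quants in BACKEND_QUANTS.items():
        for quant, arch, temp, top_p, top_k in product(
            quants, ARCHITECTURES, TEMPERATURES, TOP_PS, TOP_KS
        ):
            yield {
                "backend": backend,
                "quantization": quant,
                "architecture": arch,
                "temperature": temp,
                "top_p": top_p,
                "top_k": top_k,
            }


configs = list(sweep_configs())
print(len(configs), "configurations")
```

Keying the quantization list by backend keeps invalid pairs (for example, an MLX quantization on llama.cpp) out of the grid.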

03 · Key insights

What the paper shows.

01. Backend framework choice has larger practical impact than quantization level: llama.cpp provides reliable JSON schema enforcement through grammar-constrained sampling but can hang on long-context prompts, while mlx_lm requires explicit prompt-level JSON instructions but handles long contexts better.

02. The best local model (Kimi K2.5 GGUF Q3) achieves 77% on CLD extraction, matching mid-tier cloud performance and reaching the lower bound of the 77-89% range of cloud models.

03. Local models show strong performance on model building steps (50-100%) and feedback explanation (47-75%) but fail dramatically on error-fixing tasks (0-50%), which are dominated by long-context prompts that expose memory limitations.

04. Model architecture type (reasoning vs. instruction-tuned) and deployment configuration matter significantly for domain-specific System Dynamics tasks, requiring systematic evaluation beyond standard NLP benchmarks.
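When a backend does not enforce a JSON schema during sampling, the prompt has to ask for JSON and the harness has to check what comes back. The sketch below illustrates that check under assumptions: the `relationships` / `from` / `to` / `polarity` schema, the `PROMPT_SUFFIX` wording, and the `parse_cld` helper are all hypothetical and are not taken from the paper's benchmark.

```python
import json

# Hypothetical prompt-level instruction for backends without grammar constraints.
PROMPT_SUFFIX = (
    'Respond with a single JSON object of the form '
    '{"relationships": [{"from": "...", "to": "...", "polarity": "+" or "-"}]}.'
)


def parse_cld(reply: str) -> list[tuple[str, str, str]]:
    """Extract and validate causal links from a model reply.

    Tolerates prose around the JSON object; raises ValueError on any
    structural problem so the caller can count the test as failed.
    """
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")

    # json.JSONDecodeError is a subclass of ValueError.
    data = json.loads(reply[start : end + 1])

    relationships = data.get("relationships")
    if not isinstance(relationships, list):
        raise ValueError("missing 'relationships' list")

    links = []
    for item in relationships:
        try:
            src, dst, polarity = item["from"], item["to"], item["polarity"]
        except (KeyError, TypeError) as exc:
            raise ValueError(f"malformed relationship: {item!r}") from exc
        if polarity not in ("+", "-"):
            raise ValueError(f"invalid polarity: {polarity!r}")
        links.append((src, dst, polarity))
    return links


reply = (
    "Here is the diagram:\n"
    '{"relationships": [{"from": "births", "to": "population", "polarity": "+"}]}'
)
print(parse_cld(reply))
```

With grammar-constrained sampling the structural checks would rarely fire, but keeping them makes the same scoring path usable for both backends.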

04 · Results

Cloud models achieved 77-89% overall pass rates on CLD extraction tasks. The best local model (Kimi K2.5 GGUF Q3 zero-shot) reached 77% on CLD extraction, matching mid-tier cloud performance. On the Discussion Leaderboard, local models achieved 50-100% success on model building steps and 47-75% on feedback explanation tasks. However, local models achieved only 0-50% on error-fixing tasks, which involve long-context prompts. The systematic parameter sweep across backends, quantization levels, and sampling parameters provided detailed performance profiles for 67B-123B parameter models running on Apple Silicon hardware.

05 · Limitations

The study is limited to Apple Silicon hardware deployments, which may not generalize to other local deployment scenarios (NVIDIA GPUs, AMD, etc.). The benchmarks are specific to System Dynamics tasks and may not reflect performance on other domain-specific applications. Local models struggle significantly with long-context error-fixing tasks due to memory constraints, limiting their practical utility for complete interactive coaching workflows. The evaluation does not address fine-tuning potential, focusing only on zero-shot and few-shot performance. Timing data required cleaning to exclude stuck requests, indicating reliability issues with certain model-backend-prompt combinations that may affect production deployments.
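The timing cleanup mentioned above can be sketched as a simple filter. This is an illustration under assumptions: the 600-second cutoff, the record layout, and the sample values are invented for the example and do not come from the paper.

```python
from statistics import median

# Assumed cutoff; the summary does not state how "stuck" was defined.
STUCK_THRESHOLD_S = 600.0


def clean_timings(records, threshold=STUCK_THRESHOLD_S):
    """Split out requests that never finished or exceeded the threshold.

    Returns the kept records and the number of records dropped.
    """
    kept = [
        r for r in records
        if r["seconds"] is not None and r["seconds"] < threshold
    ]
    return kept, len(records) - len(kept)


# Toy records for illustration only, not measured data.
records = [
    {"test": "t1", "seconds": 12.0},
    {"test": "t2", "seconds": 18.0},
    {"test": "t3", "seconds": 3600.0},  # ran far past the cutoff
    {"test": "t4", "seconds": None},    # never returned
]

kept, dropped = clean_timings(records)
median_seconds = median(r["seconds"] for r in kept)
print(f"median={median_seconds}s, dropped={dropped}")
```

Reporting the dropped count alongside the cleaned median keeps the reliability problem visible instead of hiding it inside the timing statistics.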

✨ Generated by Claude · Apr 21, 2026 · Read the PDF for authoritative content.
