Scalable AI Inference: Performance Analysis and Optimization of AI Model Serving

Hung Cuong Pham; Fatih Gedikli

✨ TL;DR

This paper evaluates and optimizes a BentoML-based AI inference system for scalable model serving, testing it under realistic workloads and applying multi-level optimizations to improve latency and throughput. The work bridges the gap between AI model research and practical deployment by providing concrete performance analysis and optimization strategies.

01 · Problem

AI research typically focuses on model design and algorithmic improvements, but the deployment and inference aspects remain underexplored despite being critical for real-world applications. There is a lack of systematic evaluation and optimization guidance for production AI inference systems, particularly regarding how they perform under realistic, varying workload conditions and what bottlenecks emerge in the serving pipeline.

02 · Approach

The study uses a pre-trained RoBERTa sentiment analysis model deployed via BentoML to establish baseline performance across three realistic workload scenarios. Traffic patterns following gamma and exponential distributions simulate steady, bursty, and high-intensity workloads. Performance metrics including latency percentiles and throughput are collected to identify bottlenecks. Optimization strategies are then applied at multiple levels of the serving stack (runtime, service, and deployment), and the optimized system is reevaluated under identical conditions with statistical comparison to quantify improvements.
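As a rough illustration of the workload generation described above, the sketch below draws request inter-arrival gaps from exponential and gamma distributions using only the Python standard library. The function name, the scenario rates and shape values, and the mapping of distributions to scenarios are all assumptions made for illustration; the summary does not state the parameters the paper used.

```python
import random
from itertools import accumulate

def interarrival_times(n, rate, shape=1.0, seed=0):
    """Return n inter-arrival gaps in seconds with mean 1 / rate.

    shape == 1.0 gives exponential gaps (Poisson arrivals).
    shape < 1.0 gives gamma gaps with higher variance at the same
    mean rate, which produces burstier traffic.
    """
    rng = random.Random(seed)
    if shape == 1.0:
        return [rng.expovariate(rate) for _ in range(n)]
    # A gamma distribution with shape k and scale theta has mean k * theta,
    # so theta = 1 / (rate * k) keeps the mean gap at 1 / rate.
    scale = 1.0 / (rate * shape)
    return [rng.gammavariate(shape, scale) for _ in range(n)]

# Hypothetical scenario parameters, not taken from the paper.
scenarios = {
    "steady": {"rate": 10.0, "shape": 1.0},
    "bursty": {"rate": 10.0, "shape": 0.25},
    "high_intensity": {"rate": 50.0, "shape": 1.0},
}

def arrival_timestamps(gaps):
    """Convert gaps into cumulative send times relative to the test start."""
    return list(accumulate(gaps))
```

A load generator could then sleep until each timestamp before sending the next request, so that all three scenarios are replayed identically against the baseline and the optimized service.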

03 · Key insights

What the paper shows.

01 Systematic performance evaluation under realistic, varied workload distributions is essential for understanding inference system behavior and identifying bottlenecks

02 Multi-level optimization across runtime, service, and deployment layers can significantly improve both latency and throughput metrics

03 Single-node K3s cluster deployment provides practical resilience during disruptions for scalable AI inference

04 Latency and throughput scaling characteristics differ under steady, bursty, and high-intensity workloads, requiring tailored optimization approaches

04 · Results

The study demonstrates practical optimization strategies for BentoML-based inference systems, showing measurable improvements in latency percentiles and throughput when optimizations are applied across multiple stack levels. The results quantify how the system scales under varying workload conditions and how deployment in a K3s cluster influences resilience, though specific numerical improvements are not detailed in the abstract.
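Since no numbers are reported here, the following is only a minimal sketch of how latency percentiles and throughput of the kind mentioned above could be computed from raw measurements. The helper names and the choice of the nearest-rank percentile definition are assumptions, not the paper's method.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(latencies_s, duration_s):
    """Summarize one test run: latency percentiles in seconds and
    throughput in completed requests per second."""
    return {
        "p50": percentile(latencies_s, 50),
        "p95": percentile(latencies_s, 95),
        "p99": percentile(latencies_s, 99),
        "throughput_rps": len(latencies_s) / duration_s,
    }
```

Running `summarize` on the baseline and on the optimized system for the same replayed workload gives directly comparable figures for each scenario.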

05 · Limitations

The evaluation is limited to a single pre-trained model (RoBERTa for sentiment analysis), which may not generalize to other model architectures or task types. The study focuses on a specific inference framework (BentoML) and deployment environment (K3s cluster), potentially limiting applicability to other serving platforms. The abstract does not provide specific numerical results or detailed comparison metrics, making it difficult to assess the magnitude of improvements achieved.

✨ Generated by Claude · Apr 25, 2026 · Read the PDF for authoritative content.
