Systems for AI – Optimized Inference and Serving

The rapid growth of large language models (LLMs) has transformed natural language processing and the many machine-learning-driven applications built on top of it. However, deploying and serving these models poses significant challenges. As LLMs continue to scale in size and complexity, the systems that support their inference must evolve with them. Optimized inference is crucial for delivering real-time responses at scale without compromising the quality of results; it involves reducing computational overhead, optimizing memory usage, and minimizing latency while preserving model accuracy.
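
One widely used way to cut redundant computation during autoregressive decoding is key-value (KV) caching: the keys and values of already-processed tokens are kept in memory so each step only projects the newest token. The sketch below illustrates the idea in plain NumPy; all names (toy_attention, decode_with_cache, d_model, the random projection W) are illustrative placeholders rather than any particular framework's API.

```python
# Minimal sketch of KV caching during autoregressive decoding.
# Names and shapes are hypothetical; this is not a specific library's API.
import numpy as np

d_model = 64  # toy hidden size

def toy_attention(q, K, V):
    """Single-head scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(d_model)   # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                   # (d_model,)

def decode_with_cache(num_steps, project_qkv):
    """Decode loop that appends to a KV cache so each step only
    computes projections for the newest token, not the whole prefix."""
    K_cache = np.empty((0, d_model))
    V_cache = np.empty((0, d_model))
    token = np.random.randn(d_model)
    outputs = []
    for _ in range(num_steps):
        q, k, v = project_qkv(token)     # work for one token only
        K_cache = np.vstack([K_cache, k])
        V_cache = np.vstack([V_cache, v])
        token = toy_attention(q, K_cache, V_cache)
        outputs.append(token)
    return outputs

# Example usage with random linear projections standing in for the model.
W = np.random.randn(3, d_model, d_model)
outs = decode_with_cache(8, lambda x: (W[0] @ x, W[1] @ x, W[2] @ x))
print(len(outs), outs[0].shape)          # 8 (64,)
```

Without the cache, every step would recompute keys and values for the entire prefix; production systems refine the same idea further, for example by paging or compressing the cache to keep memory usage under control.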

In parallel with optimized inference, serving LLMs to end users in real time introduces additional system-level challenges, including load balancing, fault tolerance, and resource management. Efficient serving systems must handle large volumes of concurrent requests, often under stringent latency requirements, while also providing mechanisms for model updates and versioning. Addressing these challenges calls for innovation in system design, such as specialized hardware acceleration, distributed computing frameworks, and scalable deployment architectures. By improving both the inference process and the infrastructure that serves these models, we aim to build systems that scale with the growing demand for AI-driven services while maintaining high performance and reliability.
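
To make the concurrency requirement concrete, the following sketch shows dynamic batching, a common serving-side technique that groups requests arriving close together so the accelerator runs fewer, larger forward passes. It is a minimal asyncio illustration under assumed parameters; run_model, MAX_BATCH, and MAX_WAIT_MS are hypothetical placeholders, not the API of any specific serving framework.

```python
# Minimal sketch of dynamic batching in a serving front end.
# run_model, MAX_BATCH, and MAX_WAIT_MS are illustrative assumptions.
import asyncio

MAX_BATCH = 8        # assumed upper bound on batch size
MAX_WAIT_MS = 5      # assumed time budget for filling a batch

def run_model(prompts):
    """Stand-in for a batched forward pass; returns one string per prompt."""
    return [f"completion for: {p}" for p in prompts]

async def handle_request(queue, prompt):
    """Called once per incoming request; awaits the batched result."""
    done = asyncio.get_running_loop().create_future()
    await queue.put((prompt, done))
    return await done

async def batching_loop(queue):
    """Groups concurrent requests so the accelerator sees larger batches."""
    while True:
        batch = [await queue.get()]                 # wait for the first request
        loop = asyncio.get_running_loop()
        deadline = loop.time() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH:               # fill until full or timed out
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        results = run_model([p for p, _ in batch])  # one batched model call
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)

async def main():
    queue = asyncio.Queue()
    batcher = asyncio.create_task(batching_loop(queue))
    replies = await asyncio.gather(
        *(handle_request(queue, f"req {i}") for i in range(20)))
    print(len(replies), replies[0])
    batcher.cancel()

asyncio.run(main())
```

The trade-off is explicit in the two constants: a larger MAX_WAIT_MS improves accelerator utilization by filling batches more fully, but it adds queueing delay to every request, which is exactly the tension that latency-sensitive serving systems must tune.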