Machine Learning System
Wenyue Hua
Jun 12, 2025

Abstract
Machine Learning Systems (MLSys) is an emerging field that sits at the intersection of systems engineering and machine learning, focusing on the practical challenges of deploying, scaling, and optimizing AI systems in real-world environments. MLSys addresses the significant obstacles in designing and implementing systems that support ML models in production, recognizing the radically different development and deployment profiles of modern ML methods compared to traditional software systems. The field encompasses hardware systems for ML, software systems for ML, and optimizations that go beyond predictive accuracy to consider factors like latency, throughput, energy efficiency, and cost-effectiveness.
Publications
LLM-based intelligent agents face significant deployment challenges, particularly related to resource management. Allowing unrestricted access to LLM or tool resources can lead to inefficient or even potentially harmful resource allocation and utilization for agents. Furthermore, the absence of proper scheduling and resource management mechanisms in current agent designs hinders concurrent processing and limits overall system efficiency. As the diversity and complexity of agents continue to grow, addressing these resource management issues becomes increasingly critical to LLM-based agent systems. To address these challenges, this paper proposes the architecture of AIOS (LLM-based AI Agent Operating System) under the context of managing LLM-based agents. It introduces a novel architecture for serving LLM-based agents by isolating resources and LLM-specific services from agent applications into an AIOS kernel. This AIOS kernel provides fundamental services (e.g., scheduling, context management, memory management, storage management, access control) and efficient management of resources (e.g., LLM and external tools) for runtime agents. To enhance usability, AIOS also includes an AIOS-Agent SDK, a comprehensive suite of APIs designed for utilizing functionalities provided by the AIOS kernel. Experimental results demonstrate that using AIOS can achieve up to 2.1x faster execution for serving agents built by various agent frameworks. The source code is available at this url.
Kai Mei, Xi Zhu, Wujiang Xu, Wenyue Hua, Mingyu Jin, Zelong Li, Shuyuan Xu, Ruosong Ye, Yingqiang Ge, Yongfeng Zhang
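
The AIOS abstract above describes moving scheduling and access control out of agent applications and into a shared kernel layer. A minimal sketch of that idea follows, under stated assumptions: `ToyKernel`, `Request`, `submit`, and `fake_llm` are hypothetical names invented for illustration and are not the AIOS kernel or the AIOS-Agent SDK API. The sketch only shows agents submitting requests to one queue that a single worker serves in FIFO order, with a per-agent permission check before anything is scheduled.

```python
import queue
import threading
from concurrent.futures import Future
from dataclasses import dataclass, field


@dataclass
class Request:
    agent_id: str
    resource: str  # e.g. "llm" or a tool name
    payload: str
    result: Future = field(default_factory=Future)


class ToyKernel:
    """Hypothetical stand-in for a kernel layer; not the AIOS API."""

    def __init__(self, handlers, permissions):
        self._handlers = handlers        # resource name -> callable
        self._permissions = permissions  # agent id -> set of allowed resources
        self._queue = queue.Queue()      # FIFO scheduling policy
        self.served = []                 # agent ids in the order they were served
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, agent_id, resource, payload):
        """Access control first, then enqueue; returns a Future for the result."""
        req = Request(agent_id, resource, payload)
        if resource not in self._permissions.get(agent_id, set()):
            req.result.set_exception(
                PermissionError(f"{agent_id} may not use {resource}")
            )
            return req.result
        self._queue.put(req)
        return req.result

    def _run(self):
        # A single worker serializes access to the shared resources.
        while True:
            req = self._queue.get()
            if req is None:  # shutdown sentinel
                break
            self.served.append(req.agent_id)
            try:
                req.result.set_result(self._handlers[req.resource](req.payload))
            except Exception as exc:
                req.result.set_exception(exc)

    def shutdown(self):
        self._queue.put(None)
        self._worker.join()


def fake_llm(prompt):
    # Placeholder for a real model call (assumption for this sketch).
    return f"echo: {prompt}"


kernel = ToyKernel(
    handlers={"llm": fake_llm},
    permissions={"agent_a": {"llm"}, "agent_b": {"llm"}, "agent_c": set()},
)
f1 = kernel.submit("agent_a", "llm", "plan a trip")
f2 = kernel.submit("agent_b", "llm", "summarize notes")
f3 = kernel.submit("agent_c", "llm", "not allowed")

print(f1.result(timeout=5))
print(f2.result(timeout=5))
print(type(f3.exception(timeout=5)).__name__)
kernel.shutdown()
```

Agents in this sketch never call the model directly; they only hold a `Future`, so the scheduling policy (here a plain FIFO queue) could be swapped without touching agent code. The abstract lists further kernel services, such as context, memory, and storage management, which this sketch does not attempt to model.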