Distributed and Operating Systems Research at TU Berlin

Distributed and Operating Systems is a research group at TU Berlin led by Professor Odej Kao. We work at the intersection of distributed systems, operating systems, artificial intelligence, and information systems.

Research

We design and evaluate systems and methods for operating modern data-intensive and AI-driven applications efficiently, reliably, and sustainably across cluster and cloud infrastructures. Our current research topics are:

🧠 Systems for AI: Optimized Training and Inference

We build systems that make training and inference workloads faster, more efficient, and easier to operate at scale. This includes work on performance profiling, automatic configuration and tuning of complex software stacks, and mechanisms to adapt to changing workloads and hardware characteristics. We are interested both in improving individual components and in end-to-end machine learning systems.

🕵 Observability and Reliability

Modern AI and data platforms generate massive streams of logs, metrics, and traces, while distributed ML training and serving must remain robust under failures and performance anomalies. We develop AIOps methods for anomaly detection, failure localization, and root-cause analysis, and we design mechanisms for making large-scale training and inference pipelines more fault tolerant and observable end-to-end.

🌱 Carbon-Aware Computing

Large-scale data processing and AI workloads consume significant amounts of energy. We explore how to schedule and place workloads in ways that reduce their carbon footprint, for example by shifting computations in time or across locations to better align with the availability of renewable energy. This includes co-simulation of energy and computing systems and the design of carbon-aware policies for distributed infrastructures.
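
To illustrate the idea of shifting computations in time, here is a minimal sketch, not one of our systems. It assumes a hypothetical per-slot forecast of grid carbon intensity (for example in gCO2/kWh per hour) and a job that must run uninterrupted for a fixed number of slots. The function name `best_start_slot` and its parameters are illustrative assumptions.

```python
from typing import Sequence


def best_start_slot(carbon_intensity: Sequence[float], duration: int) -> int:
    """Return the start index of the contiguous window of `duration` slots
    with the lowest total forecast carbon intensity (hypothetical helper)."""
    n = len(carbon_intensity)
    if duration <= 0 or duration > n:
        raise ValueError("duration must be between 1 and the forecast length")

    # Sum of the first window, then slide it one slot at a time.
    window = sum(carbon_intensity[:duration])
    best_sum, best_start = window, 0
    for start in range(1, n - duration + 1):
        window += carbon_intensity[start + duration - 1] - carbon_intensity[start - 1]
        if window < best_sum:  # strict '<' keeps the earliest window on ties
            best_sum, best_start = window, start
    return best_start


# Illustrative forecast values (made up, not measured data).
forecast = [400, 380, 200, 150, 180, 420]
start = best_start_slot(forecast, duration=2)
```

A realistic carbon-aware policy would additionally have to account for deadlines, forecast uncertainty, interruptible jobs, and placement across locations. This sketch covers only the simplest single-job, single-site case.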

🤖 AI for Robots and Physical Systems

We investigate how large language models can coordinate real-world robots in messy, human environments. Our work focuses on building symbolic world states from limited onboard sensors and using LLMs to generate, update, and execute code-level plans that call existing navigation and manipulation stacks, both for single robots and for multi-robot systems that must collaborate.

⚙️ Resource Allocation in Distributed Processing

Selecting and configuring resources for distributed data processing is difficult, as users rarely have full visibility into system behavior, workload dynamics, and infrastructure heterogeneity. We develop models and tools that help decide which resources to allocate and how many, how to place and co-locate tasks, and how to adapt resource allocations over time.