Distributed and Operating Systems is a research group at TU Berlin led by Professor Odej Kao. It works at the intersection of distributed systems, operating systems, artificial intelligence, and information systems, and focuses on AI-Ops as well as adaptive resource management for data-intensive applications.
We develop novel methods for automating aspects of the operation of IT systems. One of the central goals of our research is building autonomous intelligent units that support IT system operators in optimizing key business objectives (e.g., preventing Service Level Agreement (SLA) violations and improving quality of service (QoS) for customers). To achieve this, we develop artificial intelligence methods for the heterogeneous monitoring data (metrics, logs, traces, source code, customer input, postmortem reports) that expose the internal state of IT systems or the failures they experience. Specifically, we identify patterns of anomalous behavior, relate the anomalies to their root causes, and automatically decide on the most appropriate corrective actions based on the detected root causes.
We also develop methods, systems, and tools to make the implementation, testing, and operation of efficient and dependable data-intensive applications easier. Towards this goal, we work on adaptive resource management for distributed computing environments from small IoT devices to large clusters of virtual resources. Ultimately, we aim to realize systems that automatically adapt to workloads, computing environments, and application requirements. For research on adaptive resource management, there is an ongoing collaboration with the University of Glasgow.
Current research areas include: