Anomaly Detection, Localization, and Diagnosis

Digitalization is an integral part of today’s world. Our everyday life increasingly depends on web and mobile applications, and numerous industries such as transportation, manufacturing, health care, and education rely on digital services. Users expect high availability and reliability. However, systems that grow in size and complexity become increasingly prone to failures. We want to support human experts in operating highly complex, distributed systems. Therefore, we research the applicability of artificial intelligence and machine learning to detect anomalies, localize their root causes, and diagnose them. This process should narrow down the scope of troubleshooting, enable automation, and eventually reduce recovery time.

A problem needs to be detected before it can be resolved. Thus, anomaly detection is the essential first step towards supporting system operators. A variety of methods can be employed to model the system’s normal behavior and raise alarms when deviations from that norm occur. However, realizing robust and precise anomaly detection for modern computer systems is challenging. Aspects like dynamic vertical scaling, short software update cycles, and resource sharing through virtualization lead to rapid changes in system behavior. As a result, the modeled normal behavior quickly becomes outdated, which causes many false alarms. We have a long-standing tradition of addressing the problem of anomaly detection in IT system data. Our work covers diverse aspects of anomaly detection, including multiple data types (metrics, logs, and traces) and the combination of multiple modalities.
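To make the idea of modeling normal behavior and raising alarms on deviations concrete, the following is a minimal sketch of a sliding-window detector for a single metric. It is not one of our methods, only an illustration: the function name `detect_anomalies`, the window size, and the threshold of three standard deviations are assumptions chosen for the example.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(values, window=20, threshold=3.0):
    """Return the indices of values that deviate strongly from the recent past.

    "Normal behavior" is modeled as the mean and standard deviation of the
    last `window` observations (an assumption made for this sketch).
    """
    history = deque(maxlen=window)  # sliding window of recent observations
    anomalies = []
    for i, x in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Raise an alarm when the value leaves the band mu +/- threshold * sigma.
            if sigma > 0 and abs(x - mu) > threshold * sigma:
                anomalies.append(i)
        # Always update the window so the model follows changing behavior.
        history.append(x)
    return anomalies


# A stable metric with a single spike at index 20.
metric = [10.0, 10.5, 9.5, 10.2, 9.8] * 4 + [50.0] + [10.0] * 5
print(detect_anomalies(metric))
```

Because the window is updated continuously, the model adapts after a system change, but a fixed threshold of this kind is still likely to produce false alarms in rapidly changing environments.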

In highly distributed and dynamic systems such as microservice architectures, the number of interdependent components can reach hundreds or thousands, resulting in billions of monitoring metrics. An anomalous component can affect several other system parts, resulting in hundreds of alarms. For humans, identifying the root cause is time-consuming, partly repetitive, and error-prone. In our group, we study methods to localize the root cause of anomalies in large-scale, complex distributed systems.

Our overall goal is to support human operators by introducing automation into the handling of IT system problems. Whenever anomalies are detected and localized, resolving operations should be automatically selected and executed to restore an operational state of the system. However, system components can be affected by different anomalies, each of which requires a specific operation to be resolved. A deeper understanding of these specifics is necessary to select appropriate operations. We research the application of pattern analysis to identify typical anomaly patterns in the system metric data and match them with resolving operations.
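The localization problem described above, where one anomalous component triggers alarms in many dependent components, can be illustrated with a simple heuristic over a dependency graph. This is a sketch under stated assumptions and not our method: it assumes that anomalies propagate from a component to the components that depend on it, and the function name `root_cause_candidates` and the service names are hypothetical.

```python
def root_cause_candidates(depends_on, alarmed):
    """Return alarmed components whose own dependencies raised no alarm.

    depends_on: dict mapping a component to the components it calls.
    alarmed:    iterable of components that currently raise alarms.
    """
    alarmed = set(alarmed)
    return sorted(
        component
        for component in alarmed
        # If a dependency is also alarmed, this alarm is likely a symptom.
        if not any(dep in alarmed for dep in depends_on.get(component, ()))
    )


# Hypothetical microservice topology (names are illustrative only).
topology = {
    "frontend": ["cart", "auth"],
    "cart": ["db"],
    "auth": ["db"],
    "db": [],
}
print(root_cause_candidates(topology, {"frontend", "cart", "db"}))
```

In this example, the alarms of `frontend` and `cart` are explained by the alarm of `db`, so only `db` remains as a candidate. The heuristic returns no candidate if all alarmed components lie on a dependency cycle.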

Ongoing Research

We currently work on:

Publications