Metric-Based Anomaly Detection and Root Cause Analysis for Distributed Systems

Root Cause Analysis (RCA) is fundamental for diagnosing performance degradation and system failures in distributed systems. Effective RCA enables organizations to maintain operational efficiency and enhance system resilience. Anomaly Detection (AD) plays a critical role in RCA by identifying unusual patterns or deviations from expected behavior in performance metrics, which may indicate underlying issues such as hardware failures, software bugs, or network congestion. By leveraging AD techniques, organizations can be alerted early enough to prevent system failures and performance degradation, and can rapidly identify potential areas of concern, thereby accelerating the RCA process.

Root Cause Analysis: Many existing RCA approaches employ causal inference techniques that are tailored to time series data. These methods aim to construct a causality graph representing relationships between services or metrics. While these graphs can provide valuable insights, the process of discovering causal relationships is resource-intensive, scales poorly with the number of metrics, and often falls short in terms of accuracy. Other approaches rely solely on statistical methods for fast evaluation and processing, or combine constructed or preexisting topology graphs with metrics to locate potential root causes.
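To make the topology-based idea concrete, the following is a minimal sketch, not a method taken from any specific RCA approach. It assumes a hypothetical call graph (which service calls which) and a set of services already flagged as anomalous from their metrics, and applies one simple heuristic: an anomalous service is a candidate root cause if none of the services it depends on is anomalous. The service names and the function candidate_root_causes are illustrative assumptions.

```python
def candidate_root_causes(dependencies, anomalous):
    """Return anomalous services whose own dependencies all look healthy.

    dependencies: dict mapping a service to the list of services it calls
                  (hypothetical topology, assumed to be known in advance).
    anomalous:    set of services flagged by some anomaly detector.
    """
    candidates = []
    for service in sorted(anomalous):
        callees = dependencies.get(service, [])
        # If a callee is anomalous too, the fault more likely originates
        # further downstream, so this service is not reported.
        if not any(callee in anomalous for callee in callees):
            candidates.append(service)
    return candidates


# Illustrative topology: frontend -> checkout/catalog -> payment/db
topology = {
    "frontend": ["checkout", "catalog"],
    "checkout": ["payment", "db"],
    "catalog": ["db"],
    "payment": [],
    "db": [],
}
flagged = {"frontend", "checkout", "db"}
print(candidate_root_causes(topology, flagged))
```

In this example the anomaly in db propagates upstream to checkout and frontend, so only db is reported. A heuristic like this is cheap to evaluate, but it depends entirely on the topology being complete and the anomaly flags being accurate.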

Anomaly Detection: Various methods exist for AD, including clustering, change point detection, z-scoring, and more complex machine learning-based methods. Different anomaly types, e.g., outlier, level shift, or trend anomalies, require tailored approaches. With a focus on key performance indicators, AD methods can be applied to both univariate and multivariate time series data. The performance of AD methods is commonly evaluated in terms of recall and precision. High recall ensures that most anomalies are detected, while high precision minimizes false positives, preventing unnecessary processing of non-existent issues. Low precision is especially problematic when AD methods are sensitive to outliers that do not necessarily indicate meaningful issues in distributed system performance.
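As an illustration of z-scoring and of how recall and precision are computed, here is a minimal sketch using only the Python standard library. The trailing window size, the threshold of 3 standard deviations, the latency values, and the function names are assumptions chosen for the example, not recommended settings.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Flag indices whose value deviates strongly from the trailing window."""
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)  # sample standard deviation
        if sigma == 0:
            # z-score is undefined for a perfectly flat history; skip.
            continue
        if abs(values[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


def precision_recall(predicted, actual):
    """Compute precision and recall from predicted and true anomaly indices."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Illustrative request latencies in ms with one spike at index 7.
latency_ms = [100, 102, 99, 101, 100, 103, 98, 250, 101, 100, 99, 102]
detected = rolling_zscore_anomalies(latency_ms)
print(detected, precision_recall(detected, actual=[7]))
```

A detector like this handles isolated outliers, but it is not suited to level shifts or trends: once a shifted value enters the trailing window, the window mean and standard deviation adapt and later points are no longer flagged.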

Potential Research Questions for Theses:

Which anomaly detection techniques are best suited for which anomaly types (e.g., outliers, level shifts, trend anomalies) in distributed systems, and how can we maintain a high recall while minimizing false positives across these various anomaly types?
How can causal inference methods be enhanced or supported to effectively handle hidden variables and temporal distortions in metrics produced by distributed systems, thereby improving the accuracy and reliability of root cause analysis?
How can time series forecasting be used to proactively predict root causes and prevent future performance degradation in distributed systems?

Prerequisites:

Solid knowledge of distributed systems is recommended, along with an understanding of statistical methods and/or machine learning techniques. Experience with Python and data analysis will be helpful. Additionally, students are encouraged to read and understand the process of writing a thesis at our research group.

Start: Immediately

Contact: Anton Altenbernd (a.altenbernd@tu-berlin.de)