Metric-Based Anomaly Detection and Root Cause Analysis for Distributed Systems

Root Cause Analysis (RCA) is fundamental for diagnosing performance degradation and system failures in distributed systems. Effective RCA helps organizations maintain operational efficiency and improve system resilience. Anomaly Detection (AD) plays a critical role in RCA by identifying unusual patterns or deviations from expected behavior in performance metrics, which may indicate underlying issues such as hardware failures, software bugs, or network congestion. With AD techniques, organizations can be alerted early enough to prevent system failures and performance degradation, and can rapidly identify potential areas of concern, thereby accelerating the RCA process.

Root Cause Analysis: Many existing RCA approaches employ causal inference techniques tailored to time series data. These methods aim to construct a causality graph representing relationships between services or metrics. While these graphs can provide valuable insights, discovering causal relationships is resource-intensive, scales poorly with the number of metrics, and often falls short in accuracy. Other approaches rely solely on statistical methods for fast evaluation and processing, or combine constructed or preexisting topology graphs with metrics to locate potential root causes.
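
As a rough illustration of the purely statistical family of approaches, the sketch below ranks metrics by how strongly their mean shifts after a known anomaly onset, relative to their baseline variability. It is a minimal heuristic under stated assumptions, not a method from any specific RCA system: the function name, the metric names, and the sample values are all hypothetical, and a large shift indicates a candidate, not a proven cause.

```python
from statistics import mean, stdev


def rank_root_cause_candidates(metrics, anomaly_start):
    """Rank metrics by normalized mean shift after `anomaly_start`.

    `metrics` maps a (hypothetical) metric name to a list of samples;
    `anomaly_start` is the index at which the incident is assumed to begin.
    Returns (name, score) pairs, highest score first.
    """
    scores = {}
    for name, values in metrics.items():
        baseline = values[:anomaly_start]
        window = values[anomaly_start:]
        if len(baseline) < 2 or not window:
            continue  # not enough data to estimate baseline variability
        sigma = stdev(baseline)
        shift = abs(mean(window) - mean(baseline))
        if sigma > 0:
            scores[name] = shift / sigma
        else:
            # Constant baseline: any shift counts as an infinite deviation.
            scores[name] = float("inf") if shift > 0 else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Illustrative data only; the incident is assumed to start at index 6.
example_metrics = {
    "db.cpu_percent": [30, 32, 31, 29, 30, 31, 80, 85, 82],
    "api.latency_ms": [100, 105, 98, 102, 101, 99, 140, 150, 145],
    "cache.hit_rate": [0.90, 0.91, 0.89, 0.90, 0.92, 0.90, 0.90, 0.89, 0.91],
}
ranking = rank_root_cause_candidates(example_metrics, anomaly_start=6)
for name, score in ranking:
    print(f"{name}: {score:.1f}")
```

Such a ranking is cheap to compute and scales linearly with the number of metrics, but it ignores dependencies between services, which is where causality or topology graphs come in.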

Anomaly Detection: Various methods exist for AD, including clustering, change point detection, z-scoring, and more complex machine learning-based methods. Different anomaly types, e.g., outliers, level shifts, or trend anomalies, require tailored approaches. With a focus on key performance indicators, AD methods can be applied to both univariate and multivariate time series data. The performance of AD methods is evaluated with recall and precision. High recall ensures that most anomalies are detected, while high precision minimizes false positives, preventing unnecessary processing of non-existent issues. False positives are especially problematic when AD methods are sensitive to outliers that do not necessarily indicate meaningful issues in distributed system performance.
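
To make z-scoring and the precision/recall evaluation concrete, the following minimal sketch flags points of a univariate series whose z-score exceeds a threshold and compares them against labeled anomalies. The function names, the threshold of 2.5, and the latency values are assumptions chosen for illustration; real deployments typically use windowed or robust statistics, since a single outlier inflates the global standard deviation.

```python
from statistics import mean, stdev


def zscore_anomalies(series, threshold=2.5):
    """Return indices whose absolute z-score exceeds `threshold`."""
    mu = mean(series)
    sigma = stdev(series)
    if sigma == 0:
        return []  # constant series: nothing deviates
    return [i for i, x in enumerate(series) if abs(x - mu) / sigma > threshold]


def precision_recall(detected, actual):
    """Compute (precision, recall) from detected and labeled anomaly indices."""
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Illustrative latency samples with one labeled anomaly at index 8.
latency_ms = [100, 102, 98, 101, 99, 100, 103, 97, 250, 101, 100, 99]
labeled = [8]

detected = zscore_anomalies(latency_ms)
precision, recall = precision_recall(detected, labeled)
print(detected, precision, recall)
```

If a detector additionally flagged a harmless point, recall would stay at 1.0 while precision would drop, which is the false-positive cost described above.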

Potential Research Questions for Theses:

Prerequisites:

Start: Immediately

Contact: Anton Altenbernd (a.altenbernd@tu-berlin.de)