Anomaly Detection, Localization, and Diagnosis
Digitalization is an integral part of today’s world. Our everyday life increasingly depends on web and mobile applications, and numerous industries such as transportation, manufacturing, health care, and education rely on digital services. Users expect high availability and reliability. However, systems that grow in size and complexity become increasingly prone to failures. We want to support human experts in operating highly complex, distributed systems. Therefore, we research the applicability of artificial intelligence and machine learning to detect anomalies, localize their root causes, and diagnose them. This process should narrow the scope of troubleshooting, enable automation, and eventually reduce recovery time.

A problem needs to be detected before it can be resolved. Thus, anomaly detection is the essential first step towards supporting system operators. A variety of methods can be employed to model the system’s normal behavior and raise alarms when deviations from that norm occur. However, realizing robust and precise anomaly detection for modern computer systems is challenging. Aspects like dynamic vertical scaling, short software update cycles, or resource sharing through virtualization lead to rapid changes in system behavior. As a result, the modeled normal behavior quickly becomes outdated, which causes many false alarms. We have a long-standing tradition of addressing anomaly detection on IT system data. Our work covers diverse aspects of anomaly detection, including different data types (metrics, logs, and traces) as well as combinations of multiple modalities.
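To make the idea of modeling normal behavior and alarming on deviations concrete, the following is a minimal sketch and not one of the methods from our publications. It assumes a single numeric metric, a sliding window as the model of normal behavior, and a z-score threshold; the function name `detect_anomalies`, the parameters `window` and `threshold`, and the sample values are hypothetical.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates strongly from the recent window."""
    history = deque(maxlen=window)  # sliding window of recent "normal" values
    anomalies = []
    for i, value in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Skip the check when the window has zero spread (avoids division by zero).
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(i)
                continue  # keep the outlier out of the model of normal behavior
        history.append(value)
    return anomalies


# Hypothetical CPU load samples with one spike at index 6.
cpu_load = [0.30, 0.32, 0.31, 0.29, 0.33, 0.31, 0.95, 0.30, 0.32]
print(detect_anomalies(cpu_load))  # [6]
```

Because the window only holds recent values, the notion of "normal" follows gradual changes in the system. A fixed window and threshold are still a crude model, which is one reason why more robust methods are needed in practice.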
In highly distributed and dynamic systems such as microservice architectures, the number of interdependent components can reach hundreds or thousands, resulting in billions of monitoring metrics. An anomalous component can affect several other system parts, resulting in hundreds of alarms. Identifying the root cause is time-consuming, partly repetitive, and error-prone for humans. In our group, we study methods to localize the root cause of anomalies in large-scale, complex distributed systems.

Our overall goal is to support human operators by introducing automation into the handling of IT system problems. Whenever anomalies are detected and localized, resolving operations should be automatically selected and executed to restore an operational state of the system. However, system components can be affected by different anomalies, each of which requires a specific operation to be resolved. A deeper understanding of these specifics is necessary to enable the selection of appropriate operations. We research the application of pattern analysis to identify typical anomaly patterns in the system metric data and match them with resolving operations.
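The effect of one anomalous component triggering alarms in many dependent components can be illustrated with a small sketch. It is a simple heuristic under stated assumptions, not the approach of any publication listed below: we assume a known dependency graph and a set of alarmed components, and we report alarmed components whose own dependencies are all healthy. The function name `localize_root_causes` and the service names are hypothetical.

```python
def localize_root_causes(dependencies, alarmed):
    """Return alarmed components that have no alarmed dependency.

    `dependencies` maps each component to the components it depends on.
    """
    alarmed = set(alarmed)
    candidates = []
    for component in sorted(alarmed):
        upstream = dependencies.get(component, [])
        # An alarm with no alarmed dependency cannot be explained by
        # propagation, so the component is a root cause candidate.
        if not any(dep in alarmed for dep in upstream):
            candidates.append(component)
    return candidates


# Hypothetical microservice application: all four services raise alarms.
dependencies = {
    "frontend": ["cart", "catalog"],
    "cart": ["database"],
    "catalog": ["database"],
    "database": [],
}
alarmed = ["frontend", "cart", "catalog", "database"]
print(localize_root_causes(dependencies, alarmed))  # ['database']
```

In this example, four alarms are reduced to a single candidate. Real systems rarely provide a complete and static dependency graph, so this sketch only conveys the intuition behind narrowing down the scope of troubleshooting.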
Ongoing Research
We currently work on:
- Log-based Anomaly Detection, where we research new methods, construct different failure scenarios, and collect datasets with representative anomalies in logs.
- Log-based Root Cause Analysis, where we research new methods to determine the root cause of problems.
Publications
- Failure Identification from Unstable Log Data using Deep Learning. Jasmin Bogatinovski, Sasho Nedelkoski, Li Wu, Jorge Cardoso, Odej Kao. In 2022 20th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), To Appear. IEEE/ACM, May 2022.
- A2Log: Attentive Augmented Log Anomaly Detection. Thorsten Wittkopp, Alexander Acker, Sasho Nedelkoski, Jasmin Bogatinovski, Dominik Scheinert, Wu Fan, Odej Kao. In Proceedings of the 55th Hawaii International Conference on System Sciences. 2022.
- Robust and Transferable Anomaly Detection in Log Data using Pre-Trained Language Models. Harald Ott, Jasmin Bogatinovski, Alexander Acker, Sasho Nedelkoski, and Odej Kao. In 43rd International Conference on Software Engineering, To appear. ACM, 2021. [arXiv preprint]
- A Taxonomy of Anomalies in Log Data. Thorsten Wittkopp, Philipp Wiesner, Dominik Scheinert, Odej Kao. In International Conference on Service-Oriented Computing. 2021.
- Self-attentive Classification-based Anomaly Detection in Unstructured Logs. Sasho Nedelkoski, Jasmin Bogatinovski, Alexander Acker, Jorge Cardoso, and Odej Kao. In ICDM 2020: 20th IEEE International Conference on Data Mining, pages 1196–1201. IEEE, 2020. [arXiv preprint]
- Learning dependencies in distributed cloud applications to identify and localize anomalies. Dominik Scheinert, Alexander Acker, Lauritz Thamsen, Morgan K. Geldenhuys, and Odej Kao. In 43rd International Conference on Software Engineering, To appear. ACM, 2021.
- Self-Supervised Anomaly Detection from Distributed Traces. Jasmin Bogatinovski, Sasho Nedelkoski, Jorge Cardoso, and Odej Kao. In 2020 IEEE/ACM 13th International Conference on Utility and Cloud Computing (UCC), Leicester, UK, pages 342–347. 2020.
- Multi-Source Anomaly Detection in Distributed IT Systems. Jasmin Bogatinovski and Sasho Nedelkoski. In 18th International Conference on Service-Oriented Computing, To appear, Dubai, United Arab Emirates, December 2020. Springer. [arXiv preprint]
- Autoencoder-based condition monitoring and anomaly detection method for rotating machines. Sabtain Ahmad, Kevin Styp-Rekowski, Sasho Nedelkoski, and Odej Kao. In 2020 IEEE International Conference on Big Data, To appear. IEEE, 2020.
- Telesto: A graph neural network model for anomaly classification in cloud services. Dominik Scheinert and Alexander Acker. In 18th International Conference on Service-Oriented Computing, To appear. Springer, 2020.
- Anomaly detection and levels of automation for AI-supported system administration. Anton Gulenko, Odej Kao, and Florian Schmidt. In Annual International Symposium on Information Management and Big Data, pages 1–7. Springer, 2019.
- Anomaly detection and classification using distributed tracing and deep learning. Sasho Nedelkoski, Jorge Cardoso, and Odej Kao. In 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pages 241–250. IEEE/ACM, May 2019.
- Anomaly detection from system tracing data using multimodal deep learning. Sasho Nedelkoski, Jorge Cardoso, and Odej Kao. In 2019 IEEE 12th International Conference on Cloud Computing (CLOUD), pages 179–186. IEEE, July 2019.
- Unsupervised anomaly alerting for IoT-gateway monitoring using adaptive thresholds and half-space trees. René Wetzig, Anton Gulenko, and Florian Schmidt. In 2019 Sixth International Conference on Internet of Things: Systems, Management and Security (IOTSMS), pages 161–168. IEEE, October 2019.
- Detecting anomalous behavior of black-box services modeled with distance-based online clustering. Anton Gulenko, Florian Schmidt, Alexander Acker, Marcel Wallschläger, Odej Kao, and Feng Liu. In 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), pages 912–915. IEEE, 2018.
- Unsupervised anomaly event detection for cloud monitoring using online ARIMA. Florian Schmidt, Florian Suri-Payer, Anton Gulenko, Marcel Wallschläger, Alexander Acker, and Odej Kao. In 2018 IEEE/ACM International Conference on Utility and Cloud Computing (UCC), pages 71–76. IEEE, December 2018.
- IFTM - unsupervised anomaly detection for virtualized network function services. Florian Schmidt, Anton Gulenko, Marcel Wallschläger, Alexander Acker, Vincent Hennig, Feng Liu, and Odej Kao. In 2018 IEEE International Conference on Web Services (ICWS), pages 187–194. IEEE, 2018.
- Unsupervised anomaly event detection for VNF service monitoring using multivariate online ARIMA. Florian Schmidt, Florian Suri-Payer, Anton Gulenko, Marcel Wallschläger, Alexander Acker, and Odej Kao. In 2018 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), pages 278–283. IEEE, December 2018.
- Anomaly detection for black box services in edge clouds using packet size distribution. Marcel Wallschläger, Anton Gulenko, Florian Schmidt, Alexander Acker, and Odej Kao. In 2018 7th IEEE International Conference on Cloud Networking (CloudNet), pages 1–6. IEEE, October 2018.
- Patient-individual morphological anomaly detection in multi-lead electrocardiography data streams. Alexander Acker, Florian Schmidt, Anton Gulenko, Reinhard Kietzmann, and Odej Kao. In 2017 IEEE International Conference on Big Data (Big Data), pages 3841–3846. IEEE, 2017.
- Automated anomaly detection in virtualized services using deep packet inspection. Marcel Wallschläger, Anton Gulenko, Florian Schmidt, Odej Kao, and Feng Liu. Procedia Computer Science, pages 510–515, 2017.
- Evaluating machine learning algorithms for anomaly detection in clouds. Anton Gulenko, Marcel Wallschläger, Florian Schmidt, Odej Kao, and Feng Liu. In 2016 IEEE International Conference on Big Data (Big Data), pages 2716–2721. IEEE, 2016.
- A system architecture for real-time anomaly detection in large-scale NFV systems. Anton Gulenko, Marcel Wallschläger, Florian Schmidt, Odej Kao, and Feng Liu. Procedia Computer Science, 94:491–496, 2016.
- Performance diagnosis in cloud microservices using deep learning. Li Wu, Jasmin Bogatinovski, Sasho Nedelkoski, Johan Tordsson, and Odej Kao. In 18th International Conference on Service-Oriented Computing, To appear, Dubai, United Arab Emirates, December 2020. Springer.
- MicroRAS: Automatic recovery in the absence of historical failure data for microservice systems. Li Wu, Johan Tordsson, Alexander Acker, and Odej Kao. In 2020 IEEE/ACM 13th International Conference on Utility and Cloud Computing (UCC), pages 227–236. IEEE, 2020.
- MicroRCA: Root cause localization of performance issues in microservices. Li Wu, Johan Tordsson, Erik Elmroth, and Odej Kao. In NOMS 2020 IEEE/IFIP Network Operations and Management Symposium, pages 1–9. IEEE, 2020.