AIOps Systems
Telecommunication service and network operators face rising expectations regarding availability, performance, and guaranteed quality of service (QoS). The complexity of modern IT infrastructures has increased to a point where traditional IT administration procedures fail to holistically ensure the dependability of the systems. In addition, the number of internet-connected devices and the volume of mobile and general internet traffic are rapidly increasing. This results in highly distributed environments, which not only increase complexity through the sheer number of devices but also introduce new operational challenges and a paradoxical situation: a vulnerable infrastructure has a decisive impact on our everyday life, as it delivers crucial data for, e.g., autonomous driving, connected healthcare, and other critical processes.
We are developing frameworks to provide scalable systems for monitoring, hierarchical in-place data analytics, and predictive remediation workflows. We aim to increase the availability, resilience, and fault-tolerance of highly distributed and possibly critical environments.
To this end, we are researching how to apply and incorporate AIOps methods in those environments while complying with the imposed requirements. This includes, among other approaches, gradually automating administrative processes, developing methods for the profiling and scheduling of machine learning workflows, improving the lifecycle from model training to deployment, and utilizing decentralized peer-to-peer approaches to cope with the increasing scale of cloud, edge, and fog computing environments.
In a nutshell, we are conducting research on the design, operation and maintenance of AI systems that combine machine learning workflows and sensing capabilities in order to automatically detect anomalous situations and act accordingly.
Ongoing Research
We currently work on:
Cloud Testbed for Failure Injection: in this project, we construct a large-scale testbed and collect data for different failure types.
Publications
- Towards a Cognitive Compute Continuum: An Architecture for Ad-Hoc Self-Managed Swarms. Ferrer, Ana Juan and Becker, Soeren and Schmidt, Florian and Thamsen, Lauritz and Kao, Odej. CCGrid, 2021.
- Artificial Intelligence for IT Operations (AIOPS) Workshop White Paper. Jasmin Bogatinovski and Sasho Nedelkoski and Alexander Acker and Florian Schmidt and Thorsten Wittkopp and Soeren Becker and Jorge Cardoso and Odej Kao. arXiv:2101.06054, 2021.
- Towards AIOps in Edge Computing Environments. Becker Soeren, Schmidt Florian, Gulenko Anton, Acker Alexander and Kao, Odej. International Conference on Big Data, 2020.
- AI-governance and levels of automation for AIOps-supported system administration. Anton Gulenko, Alexander Acker, Odej Kao, and Feng Liu. In The 29th International Conference on Computer Communications and Networks, pages 1–6. IEEE, 2020.
- Multi-source distributed system data for AI-powered analytics. Sasho Nedelkoski, Jasmin Bogatinovski, Ajay Mandapati, Jorge Cardoso, and Odej Kao. In ESOCC 2020: European Conference On Service-Oriented And Cloud Computing, pages 161–176. Springer International Publishing, September 2020.
- Bitflow: An In Situ Stream Processing Framework. Gulenko, Anton and Acker, Alexander and Schmidt, Florian and Becker, Soeren and Kao, Odej. International Conference on Autonomic Computing and Self-Organizing Systems, 2020.
- Online density grid pattern analysis to classify anomalies in cloud and NFV systems. Alexander Acker, Florian Schmidt, Anton Gulenko, and Odej Kao. In 2018 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), CloudCom 2018, pages 290–295. IEEE, December 2018.
- A system architecture for real-time anomaly detection in large-scale NFV systems. Anton Gulenko, Marcel Wallschläger, Florian Schmidt, Odej Kao, and Feng Liu. Procedia Computer Science, 2016.
- High available deployment of cloud-based virtualized network functions. Saeed Haddadi Makhsous, Anton Gulenko, Odej Kao, and Feng Liu. In High Performance Computing & Simulation (HPCS), 2016 International Conference on, pages 468–475. IEEE, 2016.