
Machine Learning-Driven Co-Location of Data-Parallel Batch Jobs: From Kernel Signals to Symbiotic Scheduling

Assignment:

As the amount of data available to researchers in fields ranging from bioinformatics to physics to remote sensing continues to grow, scientific workflow management systems (SWMS) have become increasingly important. These systems play a critical role in creating and executing scalable data analysis pipelines and rely on resource management systems to schedule analysis jobs on high-performance computing (HPC) infrastructures.

HPC schedulers usually rely on co-locating applications on shared machines, but this often leads to interference in caches, memory bandwidth, I/O paths, and network stacks, making performance, latency, and energy use difficult to predict. At the same time, growing concerns about carbon emissions and rising energy costs make energy efficiency increasingly important, and improving it can reduce the need for expensive infrastructure overhauls or additional energy supply. To improve efficiency in practice, data analysis workflows could allow jobs to share resources whenever no loss of operating efficiency or increase in energy usage is expected. Making decisions at this level requires detailed, kernel-level insight into the effects of resource sharing.

The central question of this thesis is how kernel-visible signals can be used to detect resource contention among user-space applications, in particular HPC jobs, and how this information can guide co-location decisions so that jobs complement rather than disturb each other. Furthermore, using selected signals, the thesis will investigate how to predict the performance and energy impact of co-locating jobs on shared machines.
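To make the idea of a kernel-visible contention signal concrete, the following minimal sketch reads Linux Pressure Stall Information (PSI) from `/proc/pressure`. PSI is only one possible signal and is used here purely as an illustration; the thesis does not prescribe it. The helper names, the 10% threshold, and the sample text are assumptions made up for this example and are not measurements.

```python
from pathlib import Path


def parse_psi(text):
    """Parse PSI text into {"some": {"avg10": ..., "total": ...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, _, raw = field.partition("=")
            # "total" is cumulative stall time in microseconds;
            # the avg* fields are percentages over 10/60/300 seconds.
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


def read_psi(resource, root="/proc/pressure"):
    """Read PSI for "cpu", "memory", or "io" (needs a Linux kernel with PSI enabled)."""
    return parse_psi(Path(root, resource).read_text())


def is_contended(psi, threshold=10.0):
    """Hypothetical rule: flag contention when the 'some' avg10 share exceeds threshold."""
    return psi.get("some", {}).get("avg10", 0.0) > threshold


# Illustrative sample in the PSI file format; the numbers are invented.
SAMPLE = (
    "some avg10=12.50 avg60=3.20 avg300=0.80 total=123456\n"
    "full avg10=4.00 avg60=1.10 avg300=0.30 total=45678\n"
)

if __name__ == "__main__":
    print(is_contended(parse_psi(SAMPLE)))
```

On a machine with PSI enabled, `read_psi("memory")` would return the same structure for live data. A fixed threshold like the one above is a placeholder; replacing such hand-written rules with learned predictors is part of what the thesis is meant to explore.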

Tasks:

During this thesis, you will co-locate selected workloads on different platforms and analyze the resulting patterns in energy consumption and performance degradation in order to identify, quantify, and mitigate the effects of co-located jobs on heterogeneous compute clusters. You will develop and evaluate machine learning models that predict the performance and energy impact of co-locating jobs based on kernel-level signals.

The quality of the developed methods should be evaluated on a compute cluster with real-world workflows.

Research Questions:

Requirements:

Knowledge of operating system concepts and machine learning techniques, as well as advanced programming skills (e.g., Go, Rust, Python), is required.

Start: Immediately

Contact: Niklas Fomin (fomin ∂ tu-berlin.de)

Note: For undergraduate students, the scope of the topic can be expanded or narrowed after consultation with the staff member.

Resources: