Overview:
As the amount of data available to researchers in fields ranging from bioinformatics to physics to remote sensing continues to grow, the importance of scientific workflow management systems (SWMS) has increased dramatically. These systems play a critical role in creating and executing scalable data analysis pipelines and rely on resource management systems to schedule analysis jobs on high-performance computing (HPC) infrastructures.
HPC schedulers usually co-locate applications on shared machines, but this often leads to interference in caches, memory bandwidth, I/O paths, and network stacks, making performance, latency, and energy use difficult to predict. At the same time, given growing concerns about carbon emissions and rising energy costs, improving energy efficiency can relieve the need for expensive infrastructure overhauls or an increased energy supply. To improve practical efficiency, workflow and resource management systems need to be adapted so that jobs share resources only if no degradation in operating efficiency or energy usage is expected. Enabling decision making at this level requires detailed insight into the effects of resource sharing at the kernel level.
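One concrete kernel-visible signal of the kind described above is Linux pressure stall information (PSI), exposed under /proc/pressure/. The following is a minimal sketch, assuming a PSI-enabled kernel; the contention threshold and the sample text are illustrative, not taken from any real measurement:

```python
# Hedged sketch: parse a Linux PSI file (e.g. /proc/pressure/cpu) to obtain a
# kernel-visible contention signal. The threshold below is an illustrative
# assumption, not a recommended value.
def parse_psi(text):
    """Parse PSI lines like
    'some avg10=1.23 avg60=0.45 avg300=0.10 total=12345'
    into {'some': {'avg10': 1.23, ...}, ...}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return result

def cpu_contended(psi, threshold=10.0):
    # Treat more than `threshold` percent of the last 10 s with at least one
    # task stalled on CPU as a sign of contention.
    return psi["some"]["avg10"] > threshold

# Illustrative sample; on a real system one would read /proc/pressure/cpu.
sample = ("some avg10=12.50 avg60=3.10 avg300=0.80 total=987654\n"
          "full avg10=0.00 avg60=0.00 avg300=0.00 total=0")
print(cpu_contended(parse_psi(sample)))  # prints True
```

Similar files exist for memory and I/O pressure, so the same parser covers several of the shared resources named above.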
The central question of this thesis is how kernel-visible signals can be used to detect resource contention between user-space applications, in particular HPC jobs, and how such information can guide co-location decisions so that jobs complement rather than disturb each other. Building on selected signals, the thesis will furthermore investigate how to predict the performance and energy impact of co-locating jobs on shared machines.
Assignment:
This thesis lies at the intersection of high-performance computing, machine learning, and operating systems.
During this thesis you will analyze patterns in energy consumption and performance degradation when co-locating selected workloads on different platforms, in order to identify, quantify, and mitigate the effects of co-located jobs on heterogeneous compute clusters. You will develop and evaluate machine learning models that predict the performance and energy impact of co-locating jobs based on kernel-level signals. To equip schedulers with the ability to make informed co-location decisions, you will design and implement symbiotic scheduling algorithms and heuristics that use the developed models. You will implement your strategies as extensions of, for example, Snakemake, Nextflow, Kubernetes, or Slurm, or simulate promising approaches using WRENCH or the WFCommons tools.
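To illustrate the prediction task described above, the following is a minimal sketch that fits a linear model mapping kernel-level counters to a co-location slowdown factor. All data here is synthetic and the feature names are illustrative assumptions; a real model would be trained on measurements from the cluster:

```python
import numpy as np

# Hedged sketch: predict the slowdown (runtime ratio vs. running alone) of a
# co-located job pair from kernel-level features, using synthetic data.
rng = np.random.default_rng(0)
n = 200
# Illustrative per-pair features: cache-miss rate, memory-bandwidth share,
# iowait fraction (all assumed, normalized to [0, 1]).
X = rng.uniform(0, 1, size=(n, 3))
true_w = np.array([0.5, 1.2, 0.8])          # hidden ground-truth weights
slowdown = 1.0 + X @ true_w + rng.normal(0, 0.01, n)

# Ordinary least squares with an intercept column.
A = np.hstack([np.ones((n, 1)), X])
w, *_ = np.linalg.lstsq(A, slowdown, rcond=None)

def predict_slowdown(features):
    return w[0] + features @ w[1:]

print(predict_slowdown(np.array([0.2, 0.5, 0.1])))
```

A scheduler could then co-locate a pair only when the predicted slowdown stays below a configurable budget; in practice, non-linear models and energy counters would likely be needed, which is part of what the thesis would investigate.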
The quality of the developed methods should be evaluated on a compute cluster with real-world workflows.
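A symbiotic scheduling heuristic of the kind mentioned in the assignment can be sketched as a greedy pairing over model predictions. The job names and slowdown values below are purely illustrative assumptions:

```python
import itertools

# Hedged sketch of a symbiotic co-location heuristic: greedily pair jobs so
# that each chosen pair has the lowest predicted combined slowdown among the
# jobs still unassigned. Predictions are assumed to come from a model such as
# the one the thesis would develop.
def pair_jobs(jobs, predicted_slowdown):
    """jobs: list of job ids; predicted_slowdown: dict[(a, b)] -> factor,
    keyed by alphabetically sorted pairs."""
    remaining = set(jobs)
    pairs = []
    while len(remaining) > 1:
        a, b = min(itertools.combinations(remaining, 2),
                   key=lambda p: predicted_slowdown[tuple(sorted(p))])
        pairs.append(tuple(sorted((a, b))))
        remaining -= {a, b}
    return pairs, remaining  # `remaining` holds an unpaired job if count was odd

# Illustrative example with assumed predictions.
jobs = ["blast", "fft", "render", "sort"]
pred = {("blast", "fft"): 1.05, ("blast", "render"): 1.40, ("blast", "sort"): 1.30,
        ("fft", "render"): 1.35, ("fft", "sort"): 1.25, ("render", "sort"): 1.10}
print(pair_jobs(jobs, pred))
```

Greedy matching is only a baseline; an optimal pairing is a minimum-weight perfect matching, and evaluating such alternatives against real workloads is exactly the kind of comparison the cluster evaluation would cover.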
Research Questions:
Requirements:
Solid knowledge of operating system concepts and machine learning techniques as well as advanced programming skills (e.g., Go, Rust, Python) are required. A willingness to learn about unfamiliar systems, tools, and methods is expected. Your learning process will be highly rewarding.
Start: Immediately
Contact: Niklas Fomin (fomin ∂ tu-berlin.de)
Note: If applicable, we expect the results to be used for further research and for publication in cooperation with the student (if desired).
Resources: