A Framework for Incorporating Microarchitectural Contention into System Simulations

Overview:

As the amount of data available to researchers in fields ranging from bioinformatics to physics to remote sensing continues to grow, scientific workflow management systems (SWMS) have become increasingly important. These systems are used to create and execute scalable data analysis pipelines, and they rely on resource management systems to schedule analysis jobs on high-performance computing (HPC) infrastructure.

HPC schedulers usually rely on co-locating applications on shared machines, but this often leads to interference in caches, memory bandwidth, I/O paths, and network stacks, making performance, latency, and energy use difficult to predict. To improve practical efficiency, schedulers need to be adapted so that jobs share resources only if no loss in performance or energy efficiency is expected. Decisions at this level require detailed insight into the effects of resource sharing at the kernel level.

Simulation provides a controlled and reproducible environment to systematically study these low-level contention effects across diverse workload combinations and hardware configurations, without the cost and variability of real deployments. By embedding co-location models into a simulation framework, scheduler policies can be evaluated and optimized using predictive signals before being applied in production systems.

The central question of this thesis is how kernel-visible signals can be used to model and detect resource contention on shared compute resources. A selected set of signals will be implemented in simulation so that the performance and energy impact of co-locating jobs on shared machines can be studied in the future.
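As one illustration of what a kernel-visible signal can look like, the following minimal Python sketch parses Linux Pressure Stall Information (PSI) as exposed in files such as /proc/pressure/cpu, /proc/pressure/memory, and /proc/pressure/io. PSI is only an example chosen here, not a signal prescribed by this topic. The function names and the threshold of 10 percent are hypothetical.

```python
from typing import Dict


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse the contents of a Linux /proc/pressure/<resource> file.

    Each line has the form:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    """
    result: Dict[str, Dict[str, float]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def is_contended(psi: Dict[str, Dict[str, float]], threshold: float = 10.0) -> bool:
    """Flag contention when the 10-second 'some' stall share exceeds the threshold.

    The threshold (in percent) is an assumption made for illustration only.
    """
    return psi.get("some", {}).get("avg10", 0.0) > threshold


# Hypothetical sample input; on a real system, read /proc/pressure/memory instead.
sample = (
    "some avg10=12.50 avg60=4.20 avg300=1.00 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)
psi = parse_psi(sample)
print(psi["some"]["avg10"], is_contended(psi))
```

In a simulator, a signal like this would be produced by the contention model rather than read from the kernel, so that scheduler policies can react to it in the same way as on real hardware.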

Assignment:

This thesis lies at the intersection of high-performance computing and operating systems.

During this thesis, you will co-locate selected workloads on different platforms and analyze the resulting patterns in energy consumption and performance degradation in order to identify, quantify, and mitigate the effects of co-located jobs on heterogeneous compute clusters. You will develop accurate models of resource contention effects at the hardware and software level using low-level system metrics and integrate them into prominent simulation frameworks such as WRENCH, SimGrid, or CloudSim.

The quality of the developed methods should be evaluated by comparing simulation results against measurements from a real compute cluster running real-world workflows.

Research Questions:

How can the co-location of jobs be modeled such that their resource usages (CPU, memory, I/O, or network) complement each other, reducing interference, energy consumption, and communication overhead?
How should low-level job traces be formatted and processed so that machine learning models can accurately estimate resource contention?
How does interference on shared resources depend on heterogeneous hardware, and how can this dependence be modeled?
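
To make the notion of complementary resource usage in the first question concrete, here is a deliberately simple Python sketch of a proportional-share contention model. It is a toy assumption for illustration, not the model to be developed in the thesis and not part of the WRENCH, SimGrid, or CloudSim APIs. The resource names, the job demand vectors, and the function name are hypothetical. Each job declares its demand per resource as a fraction of node capacity; a resource whose total demand exceeds 1.0 slows down every job that uses it by that factor.

```python
from typing import Dict, List

# Hypothetical resource dimensions of a single shared node.
RESOURCES = ("cpu", "mem_bw", "io", "net")


def slowdowns(jobs: List[Dict[str, float]]) -> List[float]:
    """Estimate a per-job slowdown factor for a set of co-located jobs.

    Toy assumption: an oversubscribed resource is shared proportionally, and
    a job is slowed by its most oversubscribed resource.
    """
    total = {r: sum(job.get(r, 0.0) for job in jobs) for r in RESOURCES}
    result = []
    for job in jobs:
        factors = [max(1.0, total[r]) for r in RESOURCES if job.get(r, 0.0) > 0.0]
        result.append(max(factors, default=1.0))
    return result


# Hypothetical demand vectors (fractions of node capacity).
cpu_bound = {"cpu": 0.8, "mem_bw": 0.2}
io_bound = {"cpu": 0.1, "io": 0.7}
mem_bound = {"cpu": 0.3, "mem_bw": 0.9}

print(slowdowns([cpu_bound, io_bound]))   # complementary pair: no resource oversubscribed
print(slowdowns([cpu_bound, mem_bound]))  # overlapping pair: CPU and memory bandwidth oversubscribed
```

Under this toy model, a scheduler would prefer the first pairing. Real contention in caches and memory bandwidth is not linear in demand, which is one reason the thesis aims at models fitted to measured low-level metrics.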

Requirements:

Solid knowledge of operating system concepts and advanced programming skills in C++ and Python are required. You should be willing to learn about unfamiliar systems, tools, and methods; doing so will be highly rewarding.

Start: Immediately

Contact: Niklas Fomin (fomin ∂ tu-berlin.de)

Note: If applicable, we expect the results to be used for further research and publication, in cooperation with the student if desired.

Resources:

Papers: https://wrench-project.org/publications
GitHub: https://github.com/wrench-project/wrench