Assignment:
As the amount of data available to researchers in fields ranging from bioinformatics to physics to remote sensing continues to grow, the importance of scientific workflow management systems (SWMS) has increased dramatically. These systems play a critical role in creating and executing scalable data analysis pipelines and use resource management systems to schedule analysis jobs on high-performance computing (HPC) infrastructures.
HPC schedulers usually rely on co-locating applications on shared machines, but this often leads to interference in caches, memory bandwidth, I/O paths, and network stacks, making performance, latency, and energy use difficult to predict. To improve practical efficiency, schedulers could be adapted to let jobs share resources only if neither a loss in operating efficiency nor an increase in energy usage is expected. Decisions at this level require detailed insight into the effects of resource sharing at the kernel level.
Simulation provides a controlled and reproducible environment to systematically study these low-level contention effects across diverse workload combinations and hardware configurations, without the cost and variability of real deployments. By embedding co-location models into a simulation framework, scheduler policies can be evaluated and optimized using predictive signals before being applied in production systems.
The central question of this thesis is how kernel-visible signals can be used to model and detect resource contention on shared compute resources. A set of selected signals will be implemented in simulation so that the performance and energy impact of co-locating jobs on shared machines can be studied in the future.
Tasks:
During this thesis, you will analyze patterns in energy consumption and performance degradation by co-locating selected workloads on different platforms in order to identify, quantify, and mitigate the effects of co-located jobs on heterogeneous compute clusters. You will develop accurate models of resource contention effects at the hardware and software levels using low-level system metrics and integrate them into prominent simulation frameworks such as WRENCH, SimGrid, or CloudSim.
The quality of the developed methods should be evaluated by comparison against a real compute cluster running real-world workflows.
Research Questions:
Requirements:
Knowledge of operating system concepts and machine learning techniques, as well as advanced programming skills in C++ and Python, is required.
Start: Immediately
Contact: Niklas Fomin (fomin ∂ tu-berlin.de)
Note: For undergraduate students, the scope of the topic can be expanded or narrowed after consultation with the staff member.
Resources: