Overview:
Scientific workflows orchestrate complex, compute-intensive analyses of large datasets that often exceed the capacity of a single computer or even a single cluster. Offloading jobs across multiple clusters can alleviate resource constraints and shorten execution time while controlling cost.
However, growing concerns about carbon emissions and rising energy costs make energy efficiency a first-class objective: improving efficiency, shortening the makespan, or intelligently distributing computational load can avoid the need for expensive infrastructure overhauls or an increased energy supply. To achieve this in practice, workflow and resource management systems must be extended to react adaptively to fluctuations in energy availability and to the load patterns of the workload by choosing the most suitable data center, cluster, or machine for each workflow job. Making these decisions requires detailed insight into both the available infrastructures and the runtime behavior of the workloads, such as resource and energy consumption, which are inherently uncertain.
A central requirement is adaptability under dynamic and uncertain environments, including fluctuating energy availability and varying workload characteristics. To address this, online learning techniques continuously update model parameters at runtime as new observations become available, for example by refining energy consumption estimates under changing system loads. In parallel, reinforcement learning enables the system to learn decision policies through interaction with the environment, such as selecting execution sites or scheduling strategies that minimize makespan or energy usage over time, while balancing exploration of new configurations with exploitation of known efficient placements.
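The interplay of online parameter updates and exploration/exploitation described above can be sketched as a simple epsilon-greedy policy over candidate execution sites. Everything here is illustrative: the site names, the reward signal (negative observed energy), and the epsilon value are assumptions, not part of any specific workflow system.

```python
import random

class EpsilonGreedySiteSelector:
    """Toy epsilon-greedy policy for choosing an execution site.

    With probability epsilon a random site is explored; otherwise the
    site with the best running reward estimate is exploited. Rewards
    (e.g. negative measured energy) are folded in via an incremental
    mean update, which is the online-learning step.
    """

    def __init__(self, sites, epsilon=0.1, seed=None):
        self.sites = list(sites)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {s: 0 for s in self.sites}
        self.means = {s: 0.0 for s in self.sites}  # running mean reward per site

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.sites)
        return max(self.sites, key=lambda s: self.means[s])

    def update(self, site, reward):
        # Incremental mean: no history buffer needed, updates at runtime.
        self.counts[site] += 1
        self.means[site] += (reward - self.means[site]) / self.counts[site]
```

In a real system the reward would come from fine-grained telemetry after each job, and a contextual or full RL formulation would replace this stateless bandit; the sketch only shows the update-then-decide loop.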
Assignment:
This thesis lies at the intersection of high-performance computing, distributed systems, machine learning, and operating systems.
During this thesis, you will develop and evaluate online machine learning and reinforcement learning techniques that continuously learn to predict system metrics and energy consumption of data analysis jobs during runtime. Leveraging fine-grained telemetry, the models are designed to adapt to changing workload characteristics and infrastructure conditions, providing up-to-date and uncertainty-aware estimates throughout execution.
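A minimal building block for such uncertainty-aware online estimates is a running mean and variance per metric, updated as each new observation arrives (Welford's algorithm). The metric name and units below are assumptions for illustration only.

```python
import math

class OnlineEstimate:
    """Running mean and variance of one metric (e.g. per-task energy in
    joules), maintained incrementally via Welford's algorithm so the
    estimate stays current as workload characteristics drift."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def observe(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        # Sample standard deviation; infinite until two observations exist,
        # signalling maximal uncertainty.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else float("inf")
```

The standard deviation doubles as a crude uncertainty signal; a production estimator would add forgetting (e.g. exponential weighting) so old observations stop dominating after an infrastructure change.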
Building on these models, you will design algorithms and heuristics for distributing entire workflows or individual tasks across multiple data centers or clusters. The decision-making process is optimized using the learned predictions, enabling adaptive placement strategies that balance energy efficiency, execution time, and resource utilization under dynamic and heterogeneous conditions. You will implement your strategies as extensions of, e.g., Snakemake, Nextflow, Kubernetes, or Slurm, or simulate promising approaches using WRENCH or the WFCommons tools.
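One simple shape such a placement heuristic can take is a greedy score that trades off predicted runtime against predicted energy per cluster. The cluster attributes (speed, average power draw) and the linear cost model are simplifying assumptions for illustration, not a proposed method.

```python
def place_task(task_demand, clusters, alpha=0.5):
    """Greedy placement sketch: score each cluster by a weighted sum of
    predicted runtime and predicted energy, then pick the lowest score.

    task_demand: work in abstract compute units (assumed known/predicted)
    clusters:    dict name -> {"speed": units/s, "watts": avg power draw}
    alpha:       weight on time; (1 - alpha) weights energy
    """
    def score(c):
        runtime = task_demand / c["speed"]   # predicted execution time
        energy = runtime * c["watts"]        # predicted energy consumption
        return alpha * runtime + (1 - alpha) * energy

    return min(clusters, key=lambda name: score(clusters[name]))
```

In the thesis, the fixed per-cluster attributes would be replaced by the learned, uncertainty-aware predictions, and alpha would encode the operator's energy/time preference.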
The quality of the developed methods should be evaluated on a multi-cluster infrastructure with real-world workflows and compared to existing approaches from our own prior work.
Possible Research Questions and Directions:
Requirements:
Solid knowledge of operating system concepts and machine learning techniques, as well as advanced programming skills (e.g., Go, Rust, Python), are required. You need to be willing to learn about unfamiliar systems, tools, and methods; the learning process will be highly rewarding.
Start: Immediately
Contact: Niklas Fomin (fomin ∂ tu-berlin.de)
Note: If applicable, we expect the results to be used for further research and publication in cooperation with the student (if desired).
Resources: