Overview:
Scientific workflows orchestrate complex, compute-intensive analyses of large datasets that often exceed the capacity of a single computer or even a single cluster. Offloading jobs across multiple clusters can alleviate resource constraints and shorten execution time while controlling cost.
However, growing concerns about carbon emissions and rising energy costs make energy efficiency a first-class objective: improving efficiency, shortening the makespan, or intelligently distributing computational load can avoid the need for expensive infrastructure overhauls or an increased energy supply. To achieve this in practice, workflow and resource management systems must be extended to react adaptively to fluctuations in energy availability and to the load patterns of the workload by choosing the most suitable data center, cluster, or machine for each workflow job. Making these decisions requires detailed insight into both the available infrastructures and the runtime behavior of the workloads, such as resource and energy consumption, which are inherently uncertain.
A central requirement is adaptability under dynamic and uncertain environments, including fluctuating energy availability and varying workload characteristics. To address this, online learning techniques continuously update model parameters at runtime as new observations become available, for example by refining energy consumption estimates under changing system loads. In parallel, reinforcement learning enables the system to learn decision policies through interaction with the environment, such as selecting execution sites or scheduling strategies that minimize makespan or energy usage over time, while balancing exploration of new configurations with exploitation of known efficient placements.
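The interplay of online parameter updates and exploration/exploitation described above can be sketched as a simple epsilon-greedy policy over candidate execution sites. Everything here is illustrative: the site names, the reward signal (negative observed energy), and the epsilon value are assumptions, not part of any specific workflow system.

```python
import random

class EpsilonGreedySiteSelector:
    """Toy epsilon-greedy policy for choosing an execution site.

    With probability epsilon a random site is explored; otherwise the
    site with the best running reward estimate is exploited. Rewards
    (e.g. negative measured energy) are folded in via an incremental
    mean update, which is the online-learning step.
    """

    def __init__(self, sites, epsilon=0.1, seed=None):
        self.sites = list(sites)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {s: 0 for s in self.sites}
        self.means = {s: 0.0 for s in self.sites}  # running mean reward per site

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.sites)
        return max(self.sites, key=lambda s: self.means[s])

    def update(self, site, reward):
        # Incremental mean: no history buffer needed, updates at runtime.
        self.counts[site] += 1
        self.means[site] += (reward - self.means[site]) / self.counts[site]
```

In a real system the reward would come from fine-grained telemetry after each job, and a contextual or full RL formulation would replace this stateless bandit; the sketch only shows the update-then-decide loop.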
Assignment:
This thesis lies at the intersection of high-performance computing, distributed systems, machine learning, and operating systems.
During this thesis, you will develop and evaluate online machine learning and reinforcement learning techniques that continuously learn to predict system metrics and energy consumption of data analysis jobs during runtime. Leveraging fine-grained telemetry, the models are designed to adapt to changing workload characteristics and infrastructure conditions, providing up-to-date and uncertainty-aware estimates throughout execution.
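A minimal building block for such uncertainty-aware online estimates is a running mean and variance per metric, updated as each new observation arrives (Welford's algorithm). The metric name and units below are assumptions for illustration only.

```python
import math

class OnlineEstimate:
    """Running mean and variance of one metric (e.g. per-task energy in
    joules), maintained incrementally via Welford's algorithm so the
    estimate stays current as workload characteristics drift."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def observe(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        # Sample standard deviation; infinite until two observations exist,
        # signalling maximal uncertainty.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else float("inf")
```

The standard deviation doubles as a crude uncertainty signal; a production estimator would add forgetting (e.g. exponential weighting) so old observations stop dominating after an infrastructure change.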
Building on these models, you will design algorithms and heuristics for distributing entire workflows or individual tasks across multiple data centers or clusters. The decision-making process is optimized using the learned predictions, enabling adaptive placement strategies that balance energy efficiency, execution time, and resource utilization under dynamic and heterogeneous conditions. You will implement your strategies as extensions of, e.g., Snakemake, Nextflow, Kubernetes, or Slurm, or simulate promising approaches using WRENCH or the WFCommons tools.
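One simple shape such a placement heuristic can take is a greedy score that trades off predicted runtime against predicted energy per cluster. The cluster attributes (speed, average power draw) and the linear cost model are simplifying assumptions for illustration, not a proposed method.

```python
def place_task(task_demand, clusters, alpha=0.5):
    """Greedy placement sketch: score each cluster by a weighted sum of
    predicted runtime and predicted energy, then pick the lowest score.

    task_demand: work in abstract compute units (assumed known/predicted)
    clusters:    dict name -> {"speed": units/s, "watts": avg power draw}
    alpha:       weight on time; (1 - alpha) weights energy
    """
    def score(c):
        runtime = task_demand / c["speed"]   # predicted execution time
        energy = runtime * c["watts"]        # predicted energy consumption
        return alpha * runtime + (1 - alpha) * energy

    return min(clusters, key=lambda name: score(clusters[name]))
```

In the thesis, the fixed per-cluster attributes would be replaced by the learned, uncertainty-aware predictions, and alpha would encode the operator's energy/time preference.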
The quality of the developed methods should be evaluated on a multi-cluster infrastructure with real-world workflows and compared to existing approaches from our own prior work.
Possible Research Questions and Directions:
Requirements:
Solid knowledge of operating system concepts and machine learning techniques, as well as advanced programming skills (e.g., Go, Rust, Python), are required. You need to be willing to learn about unfamiliar systems, tools, and methods; the learning process will be highly rewarding.
Start: Immediately
Contact: Niklas Fomin (fomin ∂ tu-berlin.de)
Note: If applicable, we expect the results to be used for further research and publication in cooperation with the student (if desired).
Resources: