Learning Cross-Platform Power Models for Multi-Site Energy Profiling

Overview:

As the amount of data available to researchers in fields ranging from bioinformatics to physics to remote sensing continues to grow, the importance of scientific workflow management systems has increased dramatically. These systems play a critical role in creating and executing scalable data analysis pipelines on high-performance compute infrastructures.

However, growing concerns about carbon emissions and rising energy costs make energy efficiency increasingly important: improving it can reduce the need for expensive infrastructure overhauls or additional energy supply. To improve efficiency in practice, data analysis workflows could be adapted, software optimized, or entire pipelines assigned to different machines based on energy availability and load patterns. Making these decisions requires detailed insight into energy consumption.

In this thesis, you will investigate how rich compute telemetry such as performance counters, utilization signals, and power sensor readings enables data-driven power models that are both fine-grained and portable enough to compare machines across platforms. The core motivation is that reliable power estimation is a prerequisite for attributing energy to workloads in shared environments and for converting energy into carbon emissions using time-varying electricity carbon-intensity signals.
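
To illustrate the last point, the following minimal Python sketch converts a series of power estimates into carbon emissions using a time-varying carbon-intensity signal. The `Sample` type, the function names, and all numbers are hypothetical and chosen only for illustration; they are not part of the thesis setup.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One measurement interval (hypothetical structure, for illustration only)."""
    duration_s: float           # length of the interval in seconds
    power_w: float              # estimated average power over the interval, in watts
    intensity_g_per_kwh: float  # carbon intensity of electricity, in gCO2e per kWh


def energy_kwh(sample: Sample) -> float:
    """Energy of one interval: watts * seconds = joules; 3.6e6 joules = 1 kWh."""
    return sample.power_w * sample.duration_s / 3_600_000.0


def emissions_g(samples: list[Sample]) -> float:
    """Sum energy times carbon intensity over all intervals, in gCO2e."""
    return sum(energy_kwh(s) * s.intensity_g_per_kwh for s in samples)


# Illustrative values only: the same 200 W load during two hours
# with different carbon intensities.
samples = [
    Sample(duration_s=3600, power_w=200.0, intensity_g_per_kwh=400.0),
    Sample(duration_s=3600, power_w=200.0, intensity_g_per_kwh=100.0),
]
total = emissions_g(samples)
print(f"{total:.1f} gCO2e")
```

In this example, both hours consume the same energy (0.2 kWh each), but the first hour accounts for four times the emissions of the second, which is why the timing of a workload matters in addition to its energy use.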

Assignment:

This thesis lies at the intersection of high-performance computing, machine learning, and operating systems.

During this thesis you will develop and evaluate machine learning models to predict power consumption based on telemetry data on heterogeneous infrastructures. You will analyze patterns in energy consumption across different workloads and platforms to identify opportunities for optimization and enable intelligent decision-making for resource management systems.
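
As a rough idea of the simplest kind of model in this space, the sketch below fits a one-feature linear power model (idle power plus a utilization-proportional term) with ordinary least squares. This is only an assumed baseline for illustration, not the method to be developed in the thesis; the function names are hypothetical and the data is synthetic.

```python
def fit_linear_power_model(utilization, power_w):
    """Fit power = idle + slope * utilization by ordinary least squares.

    Returns (idle, slope) in watts. Assumes a single utilization feature
    in the range 0..1; real telemetry would have many more features.
    """
    n = len(utilization)
    if n < 2 or n != len(power_w):
        raise ValueError("need at least two paired samples")
    mean_u = sum(utilization) / n
    mean_p = sum(power_w) / n
    var_u = sum((u - mean_u) ** 2 for u in utilization)
    if var_u == 0:
        raise ValueError("utilization must vary to fit a slope")
    cov = sum((u - mean_u) * (p - mean_p) for u, p in zip(utilization, power_w))
    slope = cov / var_u
    idle = mean_p - slope * mean_u
    return idle, slope


def predict_power(idle, slope, utilization):
    """Predict power in watts for a given utilization."""
    return idle + slope * utilization


# Synthetic, exactly linear data for illustration (not measured values).
util = [0.0, 0.25, 0.5, 0.75, 1.0]
power = [50.0, 75.0, 100.0, 125.0, 150.0]

idle_w, slope_w = fit_linear_power_model(util, power)
print(f"idle = {idle_w:.1f} W, slope = {slope_w:.1f} W per unit utilization")
print(f"predicted power at 60% utilization: {predict_power(idle_w, slope_w, 0.6):.1f} W")
```

A model of this form is easy to fit per machine, but its coefficients are platform-specific, which is one reason richer telemetry and more expressive models are of interest for cross-platform comparison.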

The quality of the developed methods should be evaluated on a compute cluster with real-world workflows and compared to existing state-of-the-art approaches.

Research Questions:

Requirements:

Solid knowledge of operating system concepts and machine learning techniques, as well as advanced programming skills (e.g., Go, Rust, Python), is required. A willingness to learn about unfamiliar systems, tools, and methods is expected. Your learning process will be highly rewarding.

Start: Immediately

Contact: Niklas Fomin (fomin ∂ tu-berlin.de)

Note: Where applicable, we expect the results to be used for further research and publication in cooperation with the student, if the student wishes.

Resources: