Resource Allocation for Scientific Workflows in Heterogeneous Infrastructures
Our overall goal is to enhance the portability of data analysis workflows (DAWs), allowing scientists to focus on the domain-specific challenges in their DAWs.
We aim to achieve this by exploiting the characteristics of heterogeneous infrastructures to find an adaptive task-resource allocation.
First, we want to develop novel methods to automatically describe and discover heterogeneous components and topologies.
This knowledge is then used to dynamically create infrastructure-aware task execution profiles at workflow runtime.
Motivation
Scientific workflows consist of a large number of recurring tasks, and executing them can take several days.
Resource managers handle the task-resource assignment and ensure that CPU and memory constraints are respected.
However, they do not take other resource characteristics like CPU clock rates or memory latencies into account when managing heterogeneous clusters.
As a result, only simplistic black-box scheduling algorithms can be used, which consider neither task characteristics nor the heterogeneity of the infrastructure.
Approach
In a first step, we want to improve the knowledge a resource manager has about the existing infrastructure.
To this end, we want to develop novel methods to describe and profile heterogeneous infrastructures and networks.
In a second step, we want to use historical runtime data to model task-resource profiles and to predict the runtime of tasks.
With this more extensive knowledge, the task-resource assignment process can be improved and the workflow runtime reduced.
We implement our solutions in scientific workflow systems and evaluate them on real-world scientific workflows.
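To make the two steps above more concrete, the following is a minimal sketch and not the method from any of our publications. It assumes that task runtime grows roughly linearly with input size, that each node can be summarized by a single hypothetical `speed_factor` relative to a reference machine, and that a simple greedy earliest-finish heuristic stands in for the assignment step. All names (`Node`, `fit_runtime_model`, `predict_runtime`, `assign`) and the sample numbers are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    speed_factor: float   # hypothetical: speed relative to a reference machine (1.0)
    free_at: float = 0.0  # time in seconds at which the node becomes idle


def fit_runtime_model(history):
    """Fit runtime = a * input_size + b by least squares.

    `history` is a list of (input_size, runtime) pairs measured on the
    reference machine. The linear form is an assumption of this sketch.
    """
    n = len(history)
    mean_x = sum(x for x, _ in history) / n
    mean_y = sum(y for _, y in history) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in history)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in history)
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


def predict_runtime(model, input_size, node):
    """Scale the reference-machine prediction by the node's speed factor."""
    a, b = model
    return (a * input_size + b) / node.speed_factor


def assign(tasks, models, nodes):
    """Greedily place each task on the node with the earliest predicted finish.

    `tasks` is a list of (task_name, task_kind, input_size) tuples and
    `models` maps a task kind to its fitted (a, b) model.
    """
    plan = []
    for task_name, kind, size in tasks:
        def finish_on(node):
            return node.free_at + predict_runtime(models[kind], size, node)

        best = min(nodes, key=finish_on)
        finish = finish_on(best)
        best.free_at = finish  # the node is busy until the task finishes
        plan.append((task_name, best.name, finish))
    return plan


# Illustrative data only: three past runs of a hypothetical "align" task.
history = [(1, 12.0), (2, 22.0), (4, 42.0)]
models = {"align": fit_runtime_model(history)}
nodes = [Node("fast", speed_factor=2.0), Node("slow", speed_factor=1.0)]
tasks = [("t1", "align", 4), ("t2", "align", 2), ("t3", "align", 1)]

for task_name, node_name, finish in assign(tasks, models, nodes):
    print(f"{task_name} -> {node_name}, predicted finish at {finish:.1f} s")
```

In this sketch, the infrastructure knowledge from the first step enters only through `speed_factor`, and the historical runtime data from the second step enters only through `fit_runtime_model`. A real resource manager would need richer node profiles (for example CPU, memory, and I/O characteristics) and would have to handle task dependencies and prediction uncertainty.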
Publications
- Tarema: Adaptive Resource Allocation for Scalable Scientific Workflows in Heterogeneous Clusters. Jonathan Bader, Lauritz Thamsen, Svetlana Kulagina, Jonathan Will, Henning Meyerhenke, and Odej Kao. Proceedings of the 2021 IEEE International Conference on Big Data (Big Data). IEEE. 2021. [arXiv]
- Lotaru: Locally estimating runtimes of scientific workflow tasks in heterogeneous clusters. Jonathan Bader, Fabian Lehmann, Lauritz Thamsen, Jonathan Will, Ulf Leser, Odej Kao. Proceedings of the 34th International Conference on Scientific and Statistical Database Management. ACM. 2022. [arXiv]
- Reshi: Recommending resources for scientific workflow tasks on heterogeneous infrastructures. Jonathan Bader, Fabian Lehmann, Alexander Groth, Lauritz Thamsen, Dominik Scheinert, Jonathan Will, Ulf Leser, Odej Kao. 2022 IEEE International Performance, Computing, and Communications Conference (IPCCC). IEEE. 2022. [arXiv]
- Macaw: The machine learning magnetometer calibration workflow. Jonathan Bader, Kevin Styp-Rekowski, Leon Doehler, Soeren Becker, Odej Kao. 2022 IEEE International Conference on Data Mining Workshops (ICDMW). IEEE. 2022. [arXiv]
- Towards Advanced Monitoring for Scientific Workflows. Jonathan Bader, Joel Witzke, Soeren Becker, Ansgar Lößer, Fabian Lehmann, Leon Doehler, Anh Duc Vu, Odej Kao. 2022 IEEE International Conference on Big Data (Big Data). IEEE. 2022. [arXiv]
- Leveraging Reinforcement Learning for Task Resource Allocation in Scientific Workflows. Jonathan Bader, Nicolas Zunker, Soeren Becker, Odej Kao. 2022 IEEE International Conference on Big Data (Big Data). IEEE. 2022. [arXiv]
- How Workflow Engines Should Talk to Resource Managers: A Proposal for a Common Workflow Scheduling Interface. Fabian Lehmann, Jonathan Bader, Friedrich Tschirpke, Lauritz Thamsen, Ulf Leser. 2023 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid). IEEE/ACM. 2023. [arXiv]
- Lotaru: Locally predicting workflow task runtimes for resource management on heterogeneous infrastructures. Jonathan Bader, Fabian Lehmann, Lauritz Thamsen, Ulf Leser, Odej Kao. 2024. Elsevier. [Elsevier]
If you have any questions or are interested in collaborating with us on this topic, please get in touch with Jonathan!
Acknowledgments
This work was funded by the German Research Foundation (DFG), CRC 1404: “FONDA: Foundations of Workflows for Large-Scale Scientific Data Analysis”.