Resource Allocation for Distributed Processing
Choosing and configuring cluster resources for distributed data processing jobs can be a challenging task.
Even expert users often do not fully understand system and workload dynamics, in part because complete information on all the factors that influence the performance of processing jobs is rarely available.
At the same time, it is important to configure cluster resources so that jobs execute without significant bottlenecks while meeting constraints on execution time and resource usage.
We, therefore, work on resource allocation methods and tools that take such requirements into account and utilize profiling, monitoring, and performance modeling to select adequate sets of resources.
Moreover, co-locating processing tasks with complementary resource demands in shared infrastructures can further increase resource utilization and job throughput.
With our research, we aim to answer the following questions for different data processing workloads: What kinds of resources should be allocated for a job and its tasks? Which job should be run next when resources become available? Where should a specific task be placed in a particular infrastructure? Should certain tasks be co-located on shared resources?
To answer these questions, we use monitoring data, profiling runs, different performance models, as well as scoring and optimization methods.
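As a rough illustration of how profiling runs and a performance model can feed a resource selection step, consider the following minimal Python sketch. It is not the method of any particular publication listed on this page. The model form `runtime(n) = a + b / n`, the function names, the price, and the profiling numbers are all assumptions made for the example.

```python
def fit_scaleout_model(runs):
    """Least-squares fit of the assumed model runtime(n) = a + b / n.

    `runs` is a list of (nodes, runtime_seconds) pairs from profiling runs.
    """
    xs = [1.0 / n for n, _ in runs]
    ys = [float(t) for _, t in runs]
    m = len(runs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("need profiling runs at two or more distinct scale-outs")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict_runtime(model, nodes):
    """Predicted runtime in seconds for a given number of nodes."""
    a, b = model
    return a + b / nodes


def select_scaleout(model, candidates, runtime_target_s, price_per_node_hour):
    """Pick the cheapest candidate scale-out whose predicted runtime meets the target.

    Returns (nodes, predicted_runtime_s, cost) or None if no candidate qualifies.
    """
    best = None
    for n in candidates:
        t = predict_runtime(model, n)
        if t > runtime_target_s:
            continue  # violates the runtime constraint
        cost = n * price_per_node_hour * t / 3600.0
        if best is None or cost < best[2]:
            best = (n, t, cost)
    return best


if __name__ == "__main__":
    # Hypothetical profiling data: (nodes, runtime in seconds).
    profiling_runs = [(2, 700), (4, 400), (8, 250)]
    model = fit_scaleout_model(profiling_runs)
    choice = select_scaleout(model, range(1, 17), runtime_target_s=320,
                             price_per_node_hour=0.5)
    print("model (a, b):", model)
    print("selected (nodes, runtime_s, cost):", choice)
```

The sketch treats the runtime target as a hard constraint and cost as the objective. A real allocation method would also have to account for factors such as memory requirements, machine types, and interference from co-located jobs.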
Ongoing Research
We currently work on multiple topics in this area:
People
Publications
- Selecting Efficient Cluster Resources for Data Analytics: When and How to Allocate for In-Memory Processing? Jonathan Will, Lauritz Thamsen, Dominik Scheinert, Odej Kao. Proceedings of the 35th International Conference on Scientific and Statistical Database Management (SSDBM). ACM. 2023. [arXiv preprint]
- How Workflow Engines Should Talk to Resource Managers: A Proposal for a Common Workflow Scheduling Interface. Fabian Lehmann, Jonathan Bader, Friedrich Tschirpke, Lauritz Thamsen, Ulf Leser. 2023 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid). 2023. IEEE/ACM. [arXiv preprint]
- Leveraging Reinforcement Learning for Task Resource Allocation in Scientific Workflows. Jonathan Bader, Nicolas Zunker, Soeren Becker, Odej Kao. 2022 IEEE International Conference on Big Data (Big Data). 2022. IEEE. [arXiv preprint]
- Towards Advanced Monitoring for Scientific Workflows. Jonathan Bader, Joel Witzke, Soeren Becker, Ansgar Lößer, Fabian Lehmann, Leon Doehler, Anh Duc Vu, Odej Kao. 2022 IEEE International Conference on Big Data (Big Data). 2022. IEEE. [arXiv preprint]
- Macaw: The machine learning magnetometer calibration workflow. Jonathan Bader, Kevin Styp-Rekowski, Leon Doehler, Soeren Becker, Odej Kao. 2022 IEEE International Conference on Data Mining Workshops (ICDMW). 2022. IEEE. [arXiv preprint]
- Reshi: Recommending resources for scientific workflow tasks on heterogeneous infrastructures. Jonathan Bader, Fabian Lehmann, Alexander Groth, Lauritz Thamsen, Dominik Scheinert, Jonathan Will, Ulf Leser, Odej Kao. 2022 IEEE International Performance, Computing, and Communications Conference (IPCCC). IEEE. 2022. [arXiv preprint]
- Lotaru: Locally estimating runtimes of scientific workflow tasks in heterogeneous clusters. Jonathan Bader, Fabian Lehmann, Lauritz Thamsen, Jonathan Will, Ulf Leser, Odej Kao. Proceedings of the 34th International Conference on Scientific and Statistical Database Management. ACM. 2022. [arXiv preprint]
- AuctionWhisk: Using an auction‐inspired approach for function placement in serverless fog platforms. David Bermbach, Jonathan Bader, Jonathan Hasenburg, Tobias Pfandzelter, Lauritz Thamsen. Software: Practice and Experience 52 (5). Wiley. 2022. [Wiley Open Access]
- Ruya: Memory-Aware Iterative Optimization of Cluster Configurations for Big Data Processing. Jonathan Will, Jonathan Bader, Dominik Scheinert, Lauritz Thamsen, and Odej Kao. To appear in the Proceedings of the 2022 IEEE International Conference on Big Data (Big Data). IEEE. 2022. [arXiv preprint]
- Get Your Memory Right: The Crispy Resource Allocation Assistant for Large-Scale Data Processing. Jonathan Will, Jonathan Bader, Dominik Scheinert, Lauritz Thamsen, and Odej Kao. To appear in the Proceedings of the 2022 IEEE International Conference on Cloud Engineering (IC2E). IEEE. 2022. [arXiv preprint]
- Training Data Reduction for Performance Models of Data Analytics Jobs in the Cloud. Jonathan Will, Onur Arslan, Jonathan Bader, Dominik Scheinert, and Lauritz Thamsen. To appear in the Proceedings of the 2021 IEEE International Conference on Big Data (Big Data). Presented at the International Workshop on Benchmarking, Performance Tuning and Optimization for Big Data Applications (BPOD). IEEE. 2021. [arXiv preprint] [video]
- On the Potential of Execution Traces for Batch Processing Workload Optimization in Public Clouds. Dominik Scheinert, Alireza Alamgiralem, Jonathan Bader, Jonathan Will, Thorsten Wittkopp, Lauritz Thamsen. To appear in the Proceedings of the 2021 IEEE International Conference on Big Data (Big Data). Presented at the 5th International Workshop on Benchmarking, Performance Tuning and Optimization for Big Data Applications (BPOD). IEEE. 2021. [arXiv preprint]
- Tarema: Adaptive Resource Allocation for Scalable Scientific Workflows in Heterogeneous Clusters. Jonathan Bader, Lauritz Thamsen, Svetlana Kulagina, Jonathan Will, Henning Meyerhenke, and Odej Kao. To appear in the Proceedings of the 2021 IEEE International Conference on Big Data (Big Data). IEEE. 2021. [arXiv preprint]
- Enel: Context-Aware Dynamic Scaling of Distributed Dataflow Jobs using Graph Propagation. Dominik Scheinert, Houkun Zhu, Lauritz Thamsen, Morgan K. Geldenhuys, Jonathan Will, Alexander Acker, and Odej Kao. To appear in the Proceedings of the 40th IEEE International Performance Computing and Communications Conference (IPCCC). IEEE. 2021. [arXiv preprint]
- Bellamy: Reusing Performance Models for Distributed Dataflow Jobs Across Contexts. Dominik Scheinert, Lauritz Thamsen, Houkun Zhu, Jonathan Will, Alexander Acker, Thorsten Wittkopp, and Odej Kao. In the Proceedings of the 23rd IEEE International Conference on Cluster Computing (CLUSTER). IEEE. 2021. [arXiv preprint]
- C3O: Collaborative Cluster Configuration Optimization for Distributed Data Processing in Public Clouds. Jonathan Will, Lauritz Thamsen, Dominik Scheinert, Jonathan Bader, and Odej Kao. In the Proceedings of the 9th IEEE International Conference on Cloud Engineering (IC2E). IEEE. 2021. [arXiv preprint] [video]
- LEAF: Simulating Large Energy-Aware Fog Computing Environments. Philipp Wiesner and Lauritz Thamsen. In the Proceedings of the 2021 IEEE 5th International Conference on Fog and Edge Computing (ICFEC). IEEE. 2021. [Open Access] [video] [code]
- Let’s Wait Awhile: How Temporal Workload Shifting Can Reduce Carbon Emissions in the Cloud. Philipp Wiesner, Ilja Behnke, Dominik Scheinert, Kordian Gontarska, and Lauritz Thamsen. To appear in the Proceedings of the 22nd International Middleware Conference (Middleware). ACM. 2021. [Open Access] [code]
- LOS: Local-Optimistic Scheduling of Periodic Model Training For Anomaly Detection on Sensor Data Streams in Meshed Edge Networks. Soeren Becker, Florian Schmidt, Lauritz Thamsen, Ana Juan Ferrer, and Odej Kao. To appear in the Proceedings of the 2nd IEEE International Conference on Autonomic Computing and Self-Organizing Systems (ACSOS). IEEE. 2021.
- Mary, Hugo, and Hugo*: Learning to Schedule Distributed Data-Parallel Processing Jobs on Shared Clusters. Lauritz Thamsen, Jossekin Beilharz, Vinh Thuy Tran, Sasho Nedelkoski, and Odej Kao. In Concurrency and Computation: Practice and Experience (e5823). Wiley. 2020. [Open Access] [code]
- Towards Collaborative Optimization of Cluster Configurations for Distributed Dataflow Jobs. Jonathan Will, Jonathan Bader, and Lauritz Thamsen. In the Proceedings of the 2020 IEEE International Conference on Big Data (Big Data). Presented at the 4th International Workshop on Benchmarking, Performance Tuning and Optimization for Big Data Applications (BPOD). IEEE. 2020. [arXiv preprint] [video] [data]
- Hugo: A Cluster Scheduler that Efficiently Learns to Select Complementary Data-Parallel Jobs. Lauritz Thamsen, Ilya Verbitskiy, Sasho Nedelkoski, Vinh Thuy Tran, Vinícius Meyer, Miguel G. Xavier, Odej Kao, and César A. F. De Rose. In the Proceedings of the Euro-Par 2019 Workshops (Euro-Par). Presented at the 1st International Workshop on Parallel Programming Models in High-Performance Cloud. Springer. 2019. [Google Scholar] [code]
- CoBell: Runtime Prediction for Distributed Dataflow Jobs in Shared Clusters. Ilya Verbitskiy, Lauritz Thamsen, Thomas Renner, and Odej Kao. In the Proceedings of the 10th IEEE International Conference on Cloud Computing Technology and Science (CloudCom). IEEE. 2018. [Google Scholar]
- Learning Efficient Co-locations for Scheduling Distributed Dataflows in Shared Clusters. Lauritz Thamsen, Ilya Verbitskiy, Benjamin Rabier, and Odej Kao. In Services Transactions on Big Data (Vol. 4, No. 1). Services Society. 2018. [Open Access]
- Scheduling Stream Processing Tasks on Geo-Distributed Heterogeneous Resources. Gerrit Janßen, Ilya Verbitskiy, Thomas Renner, and Lauritz Thamsen. In the Proceedings of the 2018 IEEE International Conference on Big Data (IEEE BigData). Presented at the First International Workshop on the Internet of Things Data Analytics (IoTDA). IEEE. 2018. [Google Scholar]
- Adaptive Resource Management for Distributed Data Analytics Based on Container-level Cluster Monitoring. Thomas Renner, Lauritz Thamsen, and Odej Kao. In the Proceedings of the 6th International Conference on Data Science, Technology and Applications (DATA). SCITEPRESS. 2017. [Google Scholar]
- Ellis: Dynamically Scaling Distributed Dataflows to Meet Runtime Targets. Lauritz Thamsen, Ilya Verbitskiy, Jossekin Beilharz, Thomas Renner, Andreas Polze, and Odej Kao. In the Proceedings of the 9th IEEE International Conference on Cloud Computing Technology and Science (CloudCom). IEEE. 2017. [Google Scholar] [code]
- Scheduling Recurring Distributed Dataflow Jobs Based on Resource Utilization and Interference. Lauritz Thamsen, Benjamin Rabier, Florian Schmidt, Thomas Renner, and Odej Kao. In the Proceedings of the 6th IEEE BigData Congress. IEEE. 2017. [Google Scholar] [code]
- SMiPE: Estimating the Progress of Recurring Iterative Distributed Dataflows. Jannis Koch, Lauritz Thamsen, Florian Schmidt, and Odej Kao. In the Proceedings of the 18th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT). IEEE. 2017. [Google Scholar] [code]
- Selecting Resources for Distributed Dataflow Systems According to Runtime Targets. Lauritz Thamsen, Ilya Verbitskiy, Florian Schmidt, Thomas Renner, and Odej Kao. In the Proceedings of the 35th IEEE International Performance Computing and Communications Conference (IPCCC). IEEE. 2016. [Google Scholar] [code] [data]
- When to Use a Distributed Dataflow Engine: Evaluating the Performance of Apache Flink. Ilya Verbitskiy, Lauritz Thamsen, and Odej Kao. In the Proceedings of the IEEE International Conference on Cloud and Big Data Computing (CBDCom). IEEE. 2016. [Google Scholar]
- Continuously Improving the Resource Utilization of Iterative Parallel Dataflows. Lauritz Thamsen, Thomas Renner, and Odej Kao. In the Proceedings of the IEEE International Conference on Distributed Computing Systems Workshops (ICDCSW). Presented at the International Workshop on Big Data and Cloud Performance (DCPerf). IEEE. 2016. [Google Scholar]