Distributed dataflow systems like Apache Spark and Apache Hadoop enable data-parallel processing of large datasets on clusters.
Yet, selecting appropriate computational resources for dataflow jobs, so that they lead neither to bottlenecks nor to low resource utilization, is often challenging, even for expert users such as data engineers.
Some existing automated approaches to resource selection rely on the assumption that a job is recurring, either to learn from previous runs or to justify the cost of dedicated test runs. However, this assumption often does not hold, since many jobs are unique.
Other approaches profile the job, for example on small samples of the input dataset, to learn about its behavior, which is then extrapolated to processing the full dataset on a full cluster. The downside is the overhead in time and computational resources that this profiling incurs.
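To illustrate the profiling idea, here is a minimal sketch, assuming a PySpark job that is profiled on a small input sample and a naive linear extrapolation of the measured runtime; the input path, sample fraction, and job logic are hypothetical and serve only as an example.

```python
# Purely illustrative sketch of profiling-based resource selection:
# run the job on a small sample of the input and naively extrapolate
# the runtime linearly to the full dataset. Paths, the sample fraction,
# and the job logic are hypothetical.
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("profiling-sketch").getOrCreate()

SAMPLE_FRACTION = 0.01  # profile on roughly 1% of the input data

full_input = spark.read.parquet("hdfs:///data/events")  # hypothetical path
sample = full_input.sample(fraction=SAMPLE_FRACTION, seed=42)

# Run the job logic on the sample and measure its runtime.
start = time.time()
sample.groupBy("event_type").count().collect()
sample_runtime = time.time() - start

# Naive linear extrapolation to the full dataset; from such an estimate,
# one could derive how many nodes are needed to meet a runtime target.
estimated_full_runtime = sample_runtime / SAMPLE_FRACTION
print(f"Sample runtime: {sample_runtime:.1f}s, "
      f"estimated runtime on the full dataset: {estimated_full_runtime:.1f}s")

spark.stop()
```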
In this research task, you will explore a third approach, dynamic cluster resource allocation, which continuously adapts resource allocations on the fly. Allocation decisions are based on factors such as current resource availability and the workload’s current resource needs. The need to adapt the resource allocation of an ongoing data processing workload can have various origins, such as changes in resource “cost” ($/CO₂/availability) or the need to meet performance targets.
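As a starting point for the dynamic approach, the sketch below shows how Spark’s built-in dynamic executor allocation lets a running job grow and shrink its executor count at runtime; the concrete parameter values, input path, and job logic are assumptions used purely for illustration.

```python
# Minimal sketch of Spark's built-in dynamic executor allocation: the number
# of executors grows while tasks are pending and shrinks when executors idle.
# All parameter values, the input path, and the job logic are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dynamic-allocation-sketch")
    # Let Spark adapt the executor count at runtime instead of fixing it
    # up front via spark.executor.instances.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "50")
    # Release executors that have been idle for 60 seconds.
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    # Track shuffle files so executors can be removed safely without an
    # external shuffle service (available since Spark 3.0).
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)

# Example workload: Spark requests more executors while this aggregation has
# pending tasks and releases them again once the stage finishes and they idle.
events = spark.read.parquet("hdfs:///data/events")  # hypothetical path
events.groupBy("event_type").count().show()

spark.stop()
```

Note that this built-in mechanism reacts only to the job’s pending task backlog; the research task, in contrast, also considers adaptation triggers such as changing resource cost or performance targets.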
Required Skills:
The minimum required skills are knowledge of distributed computing systems and experience with data science. You will also need to read up on and understand the process of writing a thesis at our research group.
Start: Immediately
Contact: Jonathan Will (will ∂ tu-berlin.de)