Resource Configuration for Graph Workloads in Distributed Dataflow Systems

Distributed dataflow systems like Apache Spark and Apache Hadoop enable data-parallel processing of large datasets on clusters. Yet selecting appropriate computational resources for dataflow jobs, such that they neither cause bottlenecks nor leave resources underutilized, is often challenging, even for expert users such as data engineers.

Further, existing automated approaches to resource selection assume that a job is recurring, either to learn from previous runs or to justify the cost of dedicated test runs to learn from. However, this assumption often does not hold, since many jobs are unique. In particular, key characteristics of the dataset being processed can drastically influence the usage of resources such as RAM, CPU, and I/O.
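
For example, even simple structural properties of a graph dataset, such as vertex count, edge count, and degree distribution, already hint at how much memory a graph algorithm will need. A minimal sketch of extracting such characteristics, assuming Apache Spark with GraphX and a hypothetical input path:

```scala
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.sql.SparkSession

object GraphCharacteristics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("graph-dataset-characteristics")
      .getOrCreate()

    // Hypothetical input path; any whitespace-separated edge list works here.
    val graph = GraphLoader.edgeListFile(spark.sparkContext, "hdfs:///data/graph.txt")

    val numVertices = graph.numVertices
    val numEdges    = graph.numEdges
    val avgDegree   = 2.0 * numEdges / numVertices      // undirected interpretation
    val maxDegree   = graph.degrees.map(_._2).max()     // rough indicator of degree skew

    println(s"V=$numVertices E=$numEdges avgDeg=$avgDegree maxDeg=$maxDegree")
    spark.stop()
  }
}
```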

Your task is to investigate the relationship between the input dataset and a dataflow job's resource access patterns for graph algorithms, focusing primarily on memory usage.
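
As one possible starting point for the memory-focused part, a SparkListener can record the peak execution memory reported per task while a representative graph algorithm such as PageRank runs. A minimal sketch, again assuming Spark with GraphX; the listener class, input path, and tolerance value are illustrative:

```scala
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import org.apache.spark.sql.SparkSession

// Records the largest peak execution memory observed across all finished tasks.
class PeakMemoryListener extends SparkListener {
  @volatile var maxPeakBytes: Long = 0L
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    Option(taskEnd.taskMetrics).foreach { m =>
      maxPeakBytes = math.max(maxPeakBytes, m.peakExecutionMemory)
    }
  }
}

object MemoryProbe {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("graph-memory-probe").getOrCreate()
    val listener = new PeakMemoryListener
    spark.sparkContext.addSparkListener(listener)

    // Hypothetical input path; run PageRank as a representative graph workload.
    val graph = GraphLoader.edgeListFile(spark.sparkContext, "hdfs:///data/graph.txt")
    graph.pageRank(tol = 0.001).vertices.count()

    println(s"Max peak execution memory per task: ${listener.maxPeakBytes} bytes")
    spark.stop()
  }
}
```

Note that peakExecutionMemory covers only execution memory (e.g., for shuffles and aggregations), not cached data, so it captures just one facet of a job's memory footprint.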

Required Skills:

At a minimum, you should have knowledge of distributed computing systems and some experience with machine learning. Prior knowledge of graph algorithms is certainly useful here. You will also need to read and understand the process of writing a thesis at our research group.

Start: immediately

Contact: Jonathan Will (will ∂ tu-berlin.de)