Memory Usage Profiling for Data-parallel Batch Processing Jobs

Distributed dataflow systems like Spark and Flink enable data-parallel processing of large datasets on clusters.
Yet, selecting appropriate computational resources for dataflow jobs, such that they neither cause bottlenecks nor leave resources underutilized, is often challenging, even for expert users such as data engineers. Right-sizing the memory allocation is especially crucial for resource efficiency.

Some existing automated approaches to resource selection try to model a job's memory access patterns through profiling: the job is executed on small samples of the input dataset and on reduced hardware, and its memory usage is measured. The memory usage for an execution on the full input dataset is then estimated through extrapolation, and the amount of memory to allocate to the cluster is selected according to the job's exact needs.

While some parts of a job's memory allocation can be measured through the dataflow system's logs or APIs, a holistic view appears only possible by directly measuring the memory usage of the actual process and its child processes. However, measuring a job's exact memory needs at a given point in time is complicated by Java's reliance on automatic garbage collection. Objects that are no longer referenced may still occupy memory until a (full) garbage collection takes place, which contemporary garbage collectors defer for performance reasons. As a result, the job's actual memory needs are misrepresented.
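To illustrate what direct process-level measurement could look like, here is a minimal sketch that sums the resident set size (RSS) of a process and all of its descendants by reading /proc. This is Linux-specific and uses only the Python standard library; in an actual profiler, the root PID would be a Spark driver or executor process rather than the profiler itself, and note that RSS still includes garbage not yet collected by the JVM.

```python
# Sketch: combined resident set size (RSS) of a process tree via /proc (Linux).
# For a Spark job, root_pid would be the driver or executor process; here we
# simply measure our own process tree as a self-contained demonstration.
import os

def _parent_pid(pid):
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The command name (field 2) may contain spaces; it is enclosed in parens,
    # so we parse the remaining fields after the closing parenthesis.
    rest = data[data.rindex(")") + 2:].split()
    return int(rest[1])  # field 4 of /proc/[pid]/stat is the parent PID

def process_tree_rss_bytes(root_pid):
    page_size = os.sysconf("SC_PAGESIZE")
    # Build a parent -> children map over all currently visible processes.
    children = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        pid = int(entry)
        try:
            children.setdefault(_parent_pid(pid), []).append(pid)
        except (FileNotFoundError, ProcessLookupError):
            continue  # process exited while we were scanning
    # Collect the root process and all of its descendants.
    tree, stack = [], [root_pid]
    while stack:
        pid = stack.pop()
        tree.append(pid)
        stack.extend(children.get(pid, []))
    # Sum resident pages (field 2 of /proc/[pid]/statm) over the tree.
    total = 0
    for pid in tree:
        try:
            with open(f"/proc/{pid}/statm") as f:
                resident_pages = int(f.read().split()[1])
            total += resident_pages * page_size
        except (FileNotFoundError, ProcessLookupError):
            continue
    return total

if __name__ == "__main__":
    rss = process_tree_rss_bytes(os.getpid())
    print(f"Process tree RSS: {rss / 1024 / 1024:.1f} MiB")
```

Sampling this value periodically over a job's runtime would yield a memory-usage profile; interpreting the peaks in light of deferred garbage collection is precisely the open question of this task.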

In this research task, you will explore ways to profile the memory usage of a Spark job for the sake of eventually achieving a suitable memory allocation.

Required Skills:

The minimum skills required are knowledge of distributed computing systems and experience with data science. You should also read and understand the process of writing a thesis at our research group.

Start: Immediately

Contact: Jonathan Will (will ∂ tu-berlin.de)