The aim of this thesis is to predict the scaling behavior of a Spark batch data processing job on a distributed cluster using lightweight profiling techniques. The primary objective is to understand how a Spark job scales as resources are added and which factors affect its scalability.
Methodology:
The research will be conducted in two main stages.
First, a distributed cluster will be emulated on a single node (e.g. a laptop) by configuring Spark to use multiple local cores. The job will then be profiled on a representative sample of the input dataset to approximate real-world conditions. Profiling will be repeated under different configurations, e.g. different numbers of cores, while resource utilization metrics are continuously recorded. A sketch of such a profiling run is shown below.
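To illustrate this first stage, the following minimal sketch runs a job in Spark's local mode under several core counts and records coarse metrics. It assumes PySpark and psutil are installed; the word-count job, the file name sample.txt, and the helper profile_run are illustrative assumptions, not part of the proposal.

```python
# Sketch: profile a Spark job on a single node with varying core counts.
# The word-count job and the psutil-based metric sampling are illustrative
# assumptions; any representative batch job and profiler would work.
import time

import psutil  # assumed available for OS-level resource sampling
from pyspark.sql import SparkSession


def profile_run(num_cores: int, sample_path: str) -> dict:
    """Run the job with `num_cores` local cores and record basic metrics."""
    spark = (
        SparkSession.builder
        .master(f"local[{num_cores}]")  # emulate a cluster with local cores
        .appName(f"profiling-{num_cores}-cores")
        .getOrCreate()
    )
    psutil.cpu_percent(interval=None)  # set baseline for CPU measurement
    start = time.time()

    # Representative batch job on a sample of the input dataset.
    lines = spark.read.text(sample_path)
    distinct_words = (
        lines.selectExpr("explode(split(value, ' ')) AS word")
        .groupBy("word").count().count()
    )

    runtime = time.time() - start
    cpu = psutil.cpu_percent(interval=None)  # avg CPU % since the baseline
    mem = psutil.virtual_memory().percent    # memory utilization snapshot
    spark.stop()
    return {"cores": num_cores, "runtime_s": runtime,
            "cpu_percent": cpu, "mem_percent": mem,
            "distinct_words": distinct_words}


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(profile_run(n, "sample.txt"))
```

In practice, the metrics would be sampled continuously during the run rather than once at the end, but the overall structure of the experiment loop would stay the same.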
Second, machine learning algorithms will be used to identify patterns and trends in the collected data and to predict how the Spark job will scale on a larger cluster. For this, the performance and resource utilization data of the single-node configuration will be compared with that of actual distributed cluster configurations; see the sketch below.
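One possible instantiation of this second stage is sketched below: a simple parametric runtime model is fitted to the single-node profiling measurements and extrapolated to larger core counts. The functional form t(p) = a + b/p + c·log(p) and the example numbers are assumptions; other regression techniques could be substituted.

```python
# Sketch: fit a simple parametric scaling model to the profiling data and
# extrapolate to larger core counts. The model form (fixed overhead +
# parallelizable work + coordination cost) is an assumption.
import numpy as np

# Example measurements from stage one: core counts and runtimes in seconds.
# These numbers are purely illustrative.
cores = np.array([1, 2, 4, 8], dtype=float)
runtime = np.array([120.0, 64.0, 36.0, 22.0])

# Runtime model: t(p) = a + b/p + c*log(p)
#   a: fixed overhead, b: parallelizable work, c: coordination cost.
X = np.column_stack([np.ones_like(cores), 1.0 / cores, np.log(cores)])
coef, *_ = np.linalg.lstsq(X, runtime, rcond=None)


def predict(p: float) -> float:
    """Predict runtime for a target number of cores p."""
    return coef[0] + coef[1] / p + coef[2] * np.log(p)


for p in (16, 32, 64):
    print(f"predicted runtime on {p} cores: {predict(p):.1f}s")
```

A fitted model of this kind would then be validated against measurements from an actual distributed cluster, as described above.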
Required Skills:
The minimum requirements are knowledge of distributed computing systems and experience with machine learning. Candidates should also familiarize themselves with the process of writing a thesis at our research group.
Start: immediately
Contact: Jonathan Will (will ∂ tu-berlin.de)