Runtime Prediction and Resource Allocation for Distributed Dataflows

Aiming to support distributed dataflow users in meeting their performance expectations with minimal resources, we work on runtime prediction and resource allocation for systems like Spark and Flink.

Motivation

Many organizations need to analyze large datasets such as data collected by thousands of sensors or generated by millions of users. Even with the parallel processing capabilities that single nodes provide today, the size of some datasets requires the resources of many nodes. Consequently, distributed systems have been developed to manage and process large datasets with clusters of commodity resources.
A popular class of such systems are distributed dataflow systems like Flink, Spark, and Beam. These systems offer high-level programming abstractions built around data-parallel operators, provide efficient and fault-tolerant distributed runtime engines, and come with libraries of common data processing algorithms. They thereby make it significantly easier for users to develop scalable data-parallel programs that run on large sets of cluster resources.
However, users still need to select adequate resources for their jobs and carefully configure the systems for efficient distributed processing that meets performance expectations. Yet, even expert users often do not fully understand system and workload dynamics, since many factors determine the runtime behavior of jobs (e.g. programs, input datasets, system implementations and configurations, hardware architectures). In practice, users therefore heavily overprovision resources to make sure their jobs meet performance expectations, which induces significant overhead, yields low resource utilization, and thus causes unnecessary costs and energy consumption.

Approach

Instead of having users essentially guess adequate sets of resources and system configurations, resource management systems should support users with these tasks. Yet, to allocate resources for specific distributed dataflow jobs in line with given performance expectations, resource management systems require accurate performance models. Such performance models can be learned from a cluster's execution history, from dedicated profiling runs, or from a combination of both. They make it possible to predict a job's runtime for specific sets of resources. Based on these predictions, a minimal scale-out and set of resources can be chosen automatically in accordance with a user's runtime target. Furthermore, monitoring the actual job performance and dynamically adjusting the allocation based on performance data and models can be used to address runtime variance.
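As a minimal sketch of this idea in Python, assuming a simple parametric scale-out model (the model form, data points, and function names are illustrative, not the exact models our tools use): fit the model to runtimes observed at different scale-outs, then pick the smallest scale-out whose predicted runtime meets the target.

    import numpy as np
    from scipy.optimize import curve_fit

    def runtime_model(n, a, b, c):
        # Parametric scale-out model: fixed costs (a), a parallelizable
        # part that shrinks with the scale-out n (b/n), and coordination
        # overhead that grows with the scale-out (c * log(n)).
        return a + b / n + c * np.log(n)

    # (scale-out, runtime in seconds) pairs from previous executions
    # of a recurring job -- illustrative values only.
    scale_outs = np.array([4, 8, 12, 16, 24])
    runtimes = np.array([910.0, 520.0, 410.0, 360.0, 330.0])

    params, _ = curve_fit(runtime_model, scale_outs, runtimes)

    def minimal_scale_out(target_runtime, max_nodes=64):
        # Smallest scale-out whose predicted runtime meets the target.
        for n in range(1, max_nodes + 1):
            if runtime_model(n, *params) <= target_runtime:
                return n
        return None  # target not reachable within max_nodes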

As shown in the figure, the idea is to have users submit a job with a targeted runtime, but without a specific resource reservation. The resource reservation is then calculated using models that estimate the job's runtime for a specific scale-out. Such scale-out models can be trained either for an entire job or for each stage/iteration of a job. If fine-grained models exist for a job, it is possible to estimate a running job's progress towards its runtime target, based on the elapsed time and the scale-out models for the remaining stages/iterations, and to scale the job out dynamically if necessary (see the sketch below). The scale-out models can be trained on previous executions of recurring jobs, selecting similar executions for accurate runtime predictions. Ultimately, this allows users to concentrate on their programs while systems make more informed resource management decisions.
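To illustrate the dynamic adjustment, the following sketch reuses runtime_model from above; stage_models is a hypothetical list of fitted parameter tuples, one per stage. The predicted remaining runtime is the sum of the per-stage predictions, and the job is only scaled out if it would otherwise miss its target.

    def remaining_runtime(next_stage, scale_out, stage_models):
        # Predicted time for all stages that have not run yet, given
        # per-stage scale-out models (fitted parameter tuples).
        return sum(runtime_model(scale_out, *stage_models[s])
                   for s in range(next_stage, len(stage_models)))

    def rescale_decision(elapsed, next_stage, scale_out, stage_models,
                         target, max_nodes=64):
        # Keep the current allocation while the job is on track;
        # otherwise return the smallest scale-out that is predicted
        # to still meet the runtime target.
        if elapsed + remaining_runtime(next_stage, scale_out,
                                       stage_models) <= target:
            return scale_out
        for n in range(scale_out + 1, max_nodes + 1):
            if elapsed + remaining_runtime(next_stage, n,
                                           stage_models) <= target:
                return n
        return max_nodes  # best effort: target will likely be missed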

Results

We developed and evaluated a variety of tools for runtime prediction and resource allocation for distributed dataflow jobs.

In addition, we also published a dataset that contains 930 unique Spark job executions, along with an analysis of the data that suggests possibilities for a more collaborative approach to performance modeling and cluster configuration optimization.

Contact

If you have any questions or are interested in collaborating with us on this topic, please get in touch with Lauritz!

Acknowledgments

This work has been supported through grants by the German Research Foundation (DFG) as Stratosphere (DFG Research Unit FOR 1306) and FONDA (DFG Collaborative Research Center 1404) as well as by the German Federal Ministry of Education and Research (BMBF) as Berlin Big Data Center BBDC (BMBF grant 01IS14013A) and Berlin Institute for the Foundations of Learning and Data BIFOLD (BMBF grant 01IS18025A).