One of the main causes of performance problems in distributed data processing systems (from the original MapReduce to modern Spark and Flink) is "stragglers." Stragglers are parts of the input that take an unexpectedly long time to process, delaying the completion of the whole job and wasting resources that sit idle. Stragglers can arise from imbalanced data distribution, uneven processing complexity, hardware or network anomalies, and a variety of other factors.
Google Cloud Dataflow is the first system to address the problem of stragglers in a fully general way. By dynamically redistributing parts of already launched work from straggler workers onto idle workers to maximize utilization, Google Cloud Dataflow preserves data consistency while minimizing re-execution.
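The core rebalancing idea can be sketched as splitting a straggler's remaining work range and handing the tail to an idle worker. The sketch below is illustrative only; `WorkRange` and `try_split` are hypothetical names, not Dataflow's actual API.

```python
# Illustrative sketch of dynamic work rebalancing (hypothetical names,
# not Dataflow's API): a straggler's unprocessed remainder is split off
# and handed to an idle worker, so completed elements never re-run.

from dataclasses import dataclass

@dataclass
class WorkRange:
    start: int  # first element still to process
    stop: int   # end of the claimed range (exclusive)

    def try_split(self, fraction=0.5):
        """Give away the tail of the remaining work; keep the head."""
        remaining = self.stop - self.start
        if remaining < 2:
            return None  # too little left to be worth splitting
        split_at = self.start + max(1, int(remaining * fraction))
        tail = WorkRange(split_at, self.stop)
        self.stop = split_at  # shrink this worker's claim; nothing re-runs
        return tail

# A straggler has finished elements [0, 40) of [0, 100); 60 remain.
straggler = WorkRange(start=40, stop=100)
stolen = straggler.try_split()  # an idle worker takes the tail
# straggler now covers [40, 70); stolen covers [70, 100)
```

Because the split point always lies at or after the straggler's current position, already-processed elements stay with the original worker and are never duplicated, which is how consistency is preserved without re-execution.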
This talk describes the theory and practice behind Cloud Dataflow's approach to straggler elimination, as well as the associated non-obvious challenges, benefits, and implications of the technique.
I like distributed systems, functional programming, and academic music. (Bio from Twitter.)