IndexedRDD: Efficient fine-grained updates for RDDs

A session at Spark Summit 2015

Monday 15th June, 2015

3:30pm to 4:00pm

Spark's core abstraction is the RDD, an immutable distributed dataset. Spark requires immutability to enable dataset reuse, fault tolerance, and straggler mitigation. But new Spark applications like streaming aggregation and incremental graph processing seem to need mutation: a new tweet requires updating a user's tweet count; a new movie rating requires updating a small number of predictions. Existing solutions sacrifice either flexibility or efficiency. Bulk transformations are wasteful for small updates. Direct mutation sacrifices fault tolerance. Even complex solutions, such as storing data in a durable, atomically-updated external database, encounter problems with dataset reuse and complex dependency graphs. This talk will introduce IndexedRDD, our solution for fine-grained RDD updates that retains all of Spark's advantages. IndexedRDD uses a range of techniques from functional programming and versioned databases. We will describe its implementation, its solutions to GC overhead and memory constraints, and its performance.
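
To make the tension between immutability and small updates concrete, here is a minimal sketch of one well-known functional-programming technique: a persistent map with structural sharing. It is only an illustration, not IndexedRDD's implementation, and the abstract does not say which data structures IndexedRDD uses. The names Node, updated, and lookup, and the user ids and counts, are all hypothetical. The tree is an unbalanced binary search tree chosen for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Immutable tree node (hypothetical; for illustration only)."""
    key: int
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def updated(node, key, value):
    """Return a new tree with key set to value; the input tree is not modified.

    Only the nodes on the path from the root to the key are copied.
    Every other subtree is shared between the old and new versions.
    """
    if node is None:
        return Node(key, value)
    if key < node.key:
        return Node(node.key, node.value, updated(node.left, key, value), node.right)
    if key > node.key:
        return Node(node.key, node.value, node.left, updated(node.right, key, value))
    return Node(key, value, node.left, node.right)


def lookup(node, key, default=None):
    """Find the value stored under key, or return default if absent."""
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return default


# Version 1: tweet counts keyed by (made-up) user id.
v1 = None
for user_id, count in [(50, 3), (20, 7), (80, 1), (10, 4), (30, 2)]:
    v1 = updated(v1, user_id, count)

# A new tweet from user 80 yields version 2; version 1 stays readable as-is.
v2 = updated(v1, 80, lookup(v1, 80) + 1)
```

In this sketch the update copies only the root and the node for user 80, while the subtree holding users 10, 20, and 30 is the same object in both versions. Keeping old versions valid is what lets an immutable dataset still be reused or recomputed after an update.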

About the speaker

Ankur Dave

CS PhD student at UC Berkeley

Time 3:30pm to 4:00pm PST

Date Mon 15th June 2015
