Migrating Complex Data Aggregation from Hadoop to Spark

A session at Spark Summit 2015

Monday 15th June, 2015

5:30pm to 5:45pm PST

This talk describes our experience of moving from Hadoop MapReduce (MR) to Spark. Our initial implementation used a multi-stage aggregation framework in Hadoop MR to join, de-duplicate, and group 12TB of incoming data every 3 hours. We also needed to join other heterogeneous data sources and to implement algorithms such as HyperLogLog. Because Hadoop MR cluster size and cost grew steeply with scale for these use cases, we evaluated Spark. A Spark 1.1 cluster ran the aggregation pipeline within the SLA limits using 60% fewer nodes and at 30% lower cost than Hadoop MR. The HyperLogLog implementation ran 5-8x faster than Hadoop MR on the same number of nodes. Reaching this performance required optimization and tuning of serialization (Kryo), parallelism, partitioning, compression, batch size, and memory assignments. We used the Spark extension provided by Cascading to migrate the code to Spark.
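The abstract mentions HyperLogLog for approximate distinct counting. As an illustrative sketch only (not the speakers' pipeline code), a minimal HyperLogLog estimator can be written as follows; the precision parameter `p`, the SHA-1 hash, and the class name are assumptions for this example:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog cardinality estimator (illustrative sketch)."""

    def __init__(self, p=10):
        self.p = p                      # precision: 2**p registers
        self.m = 1 << p
        self.registers = [0] * self.m
        # bias-correction constant, valid for m >= 128
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 64-bit hash of the item (SHA-1 truncated to 8 bytes; an assumption here)
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                  # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64-p bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # harmonic mean of per-register estimates
        est = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        # small-range correction via linear counting
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * math.log(self.m / zeros)
        return int(est)

# Example: estimate the number of distinct keys in a stream
hll = HyperLogLog(p=10)
for i in range(10000):
    hll.add(f"user-{i}")
estimate = hll.count()  # within a few percent of 10000
```

With `p=10` (1024 registers) the standard error is roughly 1.04/sqrt(1024), about 3%, while the sketch itself occupies only a few kilobytes; this trade-off is what makes the algorithm attractive for large-scale aggregation jobs like the one described above.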

About the speakers

Ashish Kumar Singh

Big Data Developer at PubMatic

Puneet Kumar

Big Data Architect at PubMatic Inc
