Scaling Hadoop Applications

A session at ApacheCon North America 2011

Thursday 10th November, 2011

11:30am to 12:20pm (PST)

Apache Hadoop makes it extremely easy to develop parallel programs based on the MapReduce programming paradigm, by taking care of work decomposition, distribution, assignment, communication, monitoring, and the handling of intermittent failures. However, developing Hadoop applications that scale linearly to hundreds, or even thousands, of nodes requires an extensive understanding of Hadoop's architecture and internals, as well as of its hundreds of tunable configuration parameters. In this talk, I will illustrate common techniques for building scalable Hadoop applications, and pitfalls to avoid. I will explain the seven major causes of sublinear scalability in parallel programs in the context of Hadoop, with real-world examples drawn from my experience with hundreds of production applications at Yahoo! and elsewhere. I will conclude with a scalability checklist for Hadoop applications, and a methodical approach to identifying and eliminating scalability bottlenecks.
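The MapReduce paradigm the abstract refers to can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce phases, not actual Hadoop code; all function names here are hypothetical.

```python
from collections import defaultdict
from itertools import chain

def map_fn(line):
    """Map phase: emit a (word, 1) pair for every word in an input line."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    """Reduce phase: sum the partial counts collected for one key."""
    return (word, sum(counts))

def run_job(lines):
    """Play the role of the framework: decompose, shuffle, and reduce."""
    # Shuffle phase: group all intermediate (key, value) pairs by key,
    # the step Hadoop performs between its map and reduce tasks.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(line) for line in lines):
        groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

if __name__ == "__main__":
    data = ["hadoop scales", "hadoop maps and reduces"]
    print(run_job(data))  # → {'hadoop': 2, 'scales': 1, 'maps': 1, 'and': 1, 'reduces': 1}
```

In real Hadoop the user supplies only the equivalents of `map_fn` and `reduce_fn`; the framework distributes them across nodes and handles failures, which is exactly what makes the programming model easy and the scaling behavior subtle.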

About the speaker

Milind Bhandarkar

Parallel programmer, Hadoop evangelist, student of distributed systems. Currently Chief Architect, Greenplum Labs at EMC². Opinions are my own, of course! (bio from Twitter)
