Using Hadoop to do Agile Iterative ETL

A session at Strata New York 2012

  • Kevin Beyer

Wednesday 24th October, 2012

10:50am to 11:30am (EST)

Traditional ETL assumes you know the target schema and organization of the data. That used to be a realistic assumption, but in a big-data world, data is much bigger and lower-density, and new sources arrive and evolve much more quickly. Implicit in this is that you are storing data before you know how you are going to use it.

A naive answer to this is schema-on-read: just write data into Hadoop, and figure out what you have and how you want to assemble it when you need it. But this means that advanced developers with deep domain knowledge are needed any time anyone wants to pull anything from Hadoop. This sets the bar too high, and leads to complex, inflexible custom-coded integrations and jobs.
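To make the schema-on-read pain concrete, here is a minimal sketch (the field names and raw records are hypothetical, not from the session): every consumer must know the raw layout, parse it at query time, and handle missing or malformed fields themselves.

```python
import json

# Hypothetical raw event log landed in Hadoop as newline-delimited JSON.
raw_lines = [
    '{"user": "alice", "action": "click", "ts": 1350000000}',
    '{"user": "bob", "action": "view"}',  # fields may be missing
]

# Schema-on-read: structure is interpreted only at query time, so this
# parsing and defaulting logic is re-invented by every job that reads it.
def clicks_by_user(lines):
    counts = {}
    for line in lines:
        record = json.loads(line)
        if record.get("action") == "click":
            user = record.get("user", "unknown")
            counts[user] = counts.get(user, 0) + 1
    return counts

print(clicks_by_user(raw_lines))  # {'alice': 1}
```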

A new approach that we propose is ‘agile iterative ETL’. Hadoop makes this possible, since the data lands in its raw form and can be processed a first time and then revisited when additional detail or refinement is needed.

In other words:
1. land raw data in Hadoop,
2. lazily add metadata, and
3. iteratively construct and refine marts/cubes based on the metadata from step 2.

The big difference is that, once steps #1 and #2 are completed, a relatively unsophisticated user could drive #3. This approach can be used as a recipe for Hadoop developers looking to build a much more agile pipeline, and is heavily utilized in Platfora’s architecture.
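The three steps above can be sketched in miniature as follows. This is only an illustration of the recipe, not Platfora's implementation; the records, the metadata dictionary, and the mart function are all invented for the example. The key property is that step 2's metadata is the only place that knows the raw layout, so step 3 can be driven without re-touching the raw data.

```python
import json

# Step 1: land raw data untouched (a list stands in for files in HDFS).
raw = [
    '{"user": "alice", "action": "click", "amount": "3.50"}',
    '{"user": "bob", "action": "purchase", "amount": "12.00"}',
    '{"user": "alice", "action": "purchase", "amount": "7.25"}',
]

# Step 2: lazily add metadata describing only the fields needed so far;
# new fields can be described later without re-landing the data.
metadata = {
    "user":   {"type": str},
    "action": {"type": str},
    "amount": {"type": float},  # refinement: the raw feed stores a string
}

def parse(line):
    record = json.loads(line)
    return {k: meta["type"](record[k]) for k, meta in metadata.items()}

# Step 3: iteratively build a small mart from the typed records; a
# relatively unsophisticated user only needs the metadata, not the raw form.
def revenue_by_user(lines):
    mart = {}
    for rec in map(parse, lines):
        if rec["action"] == "purchase":
            mart[rec["user"]] = mart.get(rec["user"], 0.0) + rec["amount"]
    return mart

print(revenue_by_user(raw))  # {'bob': 12.0, 'alice': 7.25}
```

Refining the pipeline later (say, adding a `ts` field) means extending `metadata` and rebuilding the mart, rather than rewriting every job that touches the raw files.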

About the speakers

Ben Werther

CEO and Founder of Platfora. Revolutionizing BI and analytics for big data and Hadoop.

Kevin Beyer

Principal Architect, Platfora
