Wednesday 24th October, 2012
10:50am to 11:30am
Traditional ETL assumes you know the target schema and organization of the data up front. That used to be a realistic assumption, but in a big-data world, data is much bigger and lower in density, and new sources arrive and evolve much more quickly. Implicit in this is that you are storing data before you know how you are going to use it.
A naive answer to this is schema-on-read. Just write data into Hadoop, and figure out what you have and how you want to assemble it when you need it. But this means that advanced developers and lots of domain knowledge are needed any time anyone wants to pull anything from Hadoop. This sets the bar too high, and leads to complex and inflexible custom-coded integrations and jobs.
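To make the schema-on-read drawback concrete, here is a minimal sketch (with invented field names and records, not from the talk): the raw data carries no enforced structure, so every reader must re-encode the domain knowledge of what the fields mean.

```python
import json

# Hypothetical raw event log, landed as-is: newline-delimited JSON,
# no schema enforced at write time.
raw_lines = [
    '{"ts": "2012-10-24T10:51:00", "user": "ann", "action": "view"}',
    '{"ts": "2012-10-24T10:52:00", "user": "bob", "action": "click", "item": 7}',
]

# Schema-on-read: structure is interpreted only at query time, and each
# consumer must know the field names, types, and which fields are optional.
def parse_event(line):
    rec = json.loads(line)
    return {
        "ts": rec["ts"],
        "user": rec["user"],
        "action": rec["action"],
        "item": rec.get("item"),  # optional field -- the reader must know this
    }

events = [parse_event(line) for line in raw_lines]
print(events[1]["item"])  # -> 7
```

Every new question against the raw data repeats this kind of hand-written interpretation, which is what pushes the required skill level so high.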
A new approach that we propose is 'agile iterative ETL'. Hadoop makes this possible, since the data lands in its raw form and can be processed once, then revisited whenever additional detail or refinement is needed.
In other words:
1. land raw data in Hadoop,
2. lazily add metadata, and
3. iteratively construct and refine marts/cubes based on the metadata from step 2.
The big difference is that, once steps #1 and #2 are completed, a relatively unsophisticated user could drive #3. This approach can be used as a recipe for Hadoop developers looking to build a much more agile pipeline, and is heavily utilized in Platfora’s architecture.
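The three steps above can be sketched in miniature as follows. This is an illustration of the pattern only, not Platfora's implementation; the sample rows, the `metadata` dictionary, and `build_mart` are all invented for the example.

```python
from collections import Counter

# Step 1: land raw data untouched (here, plain delimited strings).
raw = [
    "2012-10-24,ann,view",
    "2012-10-24,bob,click",
    "2012-10-24,ann,click",
]

# Step 2: lazily add metadata -- a lightweight column description captured
# after landing, rather than a full schema designed up front.
metadata = {"columns": ["date", "user", "action"]}

# Step 3: iteratively build a mart from the metadata. Refining it later
# (say, describing a new column) means editing the metadata and re-running
# against the raw data -- not re-ingesting or re-coding the pipeline.
def build_mart(raw_rows, meta, measure):
    idx = meta["columns"].index(measure)
    return Counter(row.split(",")[idx] for row in raw_rows)

print(build_mart(raw, metadata, "action"))  # counts per action
```

Because step 3 is driven entirely by the metadata, a relatively unsophisticated user can ask new questions without touching the ingestion code, which is the agility the approach is after.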
CEO and Founder of Platfora. Revolutionizing BI and analytics for big data and Hadoop. (bio from Twitter)
Principal Architect, Platfora