Sessions at Strata 2012 about HBase

Tuesday 28th February 2012

  • Introduction to Apache Hadoop

    by Sarah Sproehnle

    This tutorial provides a solid foundation for understanding large-scale data processing with MapReduce and Hadoop, plus the associated ecosystem. It is intended for those who are new to Hadoop and want to understand where Hadoop is appropriate and how it fits with existing systems.

    The agenda will include:

    • The rationale for Hadoop
    • Understanding the Hadoop Distributed File System (HDFS) and MapReduce
    • Common Hadoop use cases including recommendation engines, ETL, time-series analysis and more
    • How Hadoop integrates with other systems such as relational databases and data warehouses
    • Overview of the other components in a typical Hadoop “stack” such as these Apache projects: Hive, Pig, HBase, Sqoop, Flume and Oozie
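
    The agenda above introduces HDFS and MapReduce at a conceptual level. As a hedged illustration only (not part of the tutorial materials; the path and file contents are invented), the sketch below writes and reads a small file through the HDFS Java API:

      // Minimal HDFS write/read sketch; illustrative only, path and text are made up.
      import java.io.BufferedReader;
      import java.io.InputStreamReader;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class HdfsHello {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();          // picks up the cluster's site configuration
          FileSystem fs = FileSystem.get(conf);              // HDFS when the default filesystem is a namenode
          Path path = new Path("/tmp/strata-demo.txt");      // hypothetical path

          FSDataOutputStream out = fs.create(path, true);    // create, overwriting any existing file
          out.writeBytes("hello from the Hadoop tutorial\n");
          out.close();

          BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)));
          System.out.println(in.readLine());                 // prints the line just written
          in.close();
          fs.close();
        }
      }

    The same API works against the local filesystem when no cluster is configured, which is a convenient way to experiment before the tutorial.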

    At 9:00am to 12:30pm, Tuesday 28th February

    In Ballroom CD, Santa Clara Convention Center

  • Developing applications for Apache Hadoop

    by Sarah Sproehnle

    This tutorial will explain how to leverage a Hadoop cluster to do data analysis using Java MapReduce, Apache Hive and Apache Pig. It is recommended that participants have experience with at least one programming language. Topics include (a minimal word-count sketch follows the list):

    • Why are Hadoop and MapReduce needed?
    • Writing a Java MapReduce program
    • Common algorithms applied to Hadoop such as indexing, classification, joining data sets and graph processing
    • Data analysis with Hive and Pig
    • Overview of writing applications that use Apache HBase
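
    As a hedged sketch of what the “Writing a Java MapReduce program” topic covers (the class names here are invented for illustration and are not taken from the tutorial), the classic word count emits (word, 1) pairs in the mapper and sums them per word in the reducer:

      // Word count with the org.apache.hadoop.mapreduce API; illustrative only.
      import java.io.IOException;
      import java.util.StringTokenizer;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;

      public class WordCount {
        public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          protected void map(LongWritable offset, Text line, Context context)
              throws IOException, InterruptedException {
            // Emit (word, 1) for every token in the input line.
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
              word.set(tokens.nextToken());
              context.write(word, ONE);
            }
          }
        }

        public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
          @Override
          protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
              throws IOException, InterruptedException {
            // Sum the 1s emitted for this word across all mappers.
            int sum = 0;
            for (IntWritable c : counts) {
              sum += c.get();
            }
            context.write(word, new IntWritable(sum));
          }
        }
      }

    A small driver class would configure a Job with input and output paths and these mapper and reducer classes before submitting it to the cluster; Hive and Pig generate comparable MapReduce jobs from higher-level queries and scripts.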

    At 1:30pm to 5:00pm, Tuesday 28th February

    In GA J, Santa Clara Convention Center

Thursday 1st March 2012

  • Petabyte Scale, Automated Support for Remote Devices

    by Kumar Palaniappan and Ron Bodkin

    NetApp is a fast-growing provider of storage technology. Its devices “phone home” regularly, sending unstructured auto-support log and configuration data back to centralized data centers. That data is collected, organized and analyzed so it can be used to provide timely support, improve sales and plan product improvements. The system currently ingests 5 TB of compressed data per week, a volume that is growing 40% per year.

    NetApp previously stored flat files on disk volumes and kept summary data in relational databases. It is now working with Think Big Analytics to deploy Hadoop, HBase and related technologies that ingest, organize, transform and present auto-support data. This will let business users make decisions and respond quickly, and will enable automated responses driven by predictive models. Key requirements include (a hedged HBase write sketch follows the list):

    • Query data in seconds, within 5 minutes of event occurrence.
    • Execute complex ad hoc queries to investigate issues and plan accordingly.
    • Build models to predict support issues and capacity limits to take action before issues arise.
    • Build models for cross-sale opportunities.
    • Expose data to applications through REST interfaces.
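
    As a hedged sketch only (the table name, column family and row-key layout below are invented for illustration and are not taken from NetApp’s actual design), an ingest step could write each parsed auto-support event into HBase keyed by device id plus a reversed timestamp, so a device’s newest events cluster together and become readable within minutes of arrival:

      // Illustrative HBase write path using the 2012-era client API; all names are hypothetical.
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.client.HTable;
      import org.apache.hadoop.hbase.client.Put;
      import org.apache.hadoop.hbase.util.Bytes;

      public class AutoSupportWriter {
        public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();        // reads hbase-site.xml from the classpath
          HTable table = new HTable(conf, "autosupport_events");   // hypothetical table name

          String deviceId = "device-0042";                         // hypothetical device
          long ts = System.currentTimeMillis();
          // Row key: device id plus reversed timestamp, so the newest events for a device sort first.
          byte[] rowKey = Bytes.toBytes(deviceId + ":" + (Long.MAX_VALUE - ts));

          Put put = new Put(rowKey);
          put.add(Bytes.toBytes("event"), Bytes.toBytes("type"), Bytes.toBytes("disk_warning"));
          put.add(Bytes.toBytes("event"), Bytes.toBytes("payload"), Bytes.toBytes("raw log excerpt"));
          table.put(put);                                          // the row is visible to readers immediately

          table.close();
        }
      }

    A common variation, not necessarily what NetApp chose, is to add a hashed prefix to the row key so that write load spreads across region servers.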

    In this session we look at the lessons learned while designing and implementing a system to:

    • Collect 1,000 messages of 20 MB (compressed) per minute.
    • Store 2 PB of incoming support events by 2015.
    • Provide low-latency access to support information and configuration changes in HBase, at scale, within 5 minutes of event arrival (a hedged scan sketch follows this list).
    • Support complex ad hoc queries that join diverse structured and unstructured data sets at large scale.
    • Operate efficiently at scale.
    • Integrate with a data warehouse in Oracle.
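
    As a hedged companion to the low-latency access point above (it reuses the invented table and key layout from the write sketch earlier, not the real schema), a reader could scan one device’s rows and restrict the scan to cells written in the last five minutes:

      // Illustrative low-latency read: recent events for one device; all names are hypothetical.
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.client.HTable;
      import org.apache.hadoop.hbase.client.Result;
      import org.apache.hadoop.hbase.client.ResultScanner;
      import org.apache.hadoop.hbase.client.Scan;
      import org.apache.hadoop.hbase.util.Bytes;

      public class RecentEventsReader {
        public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HTable table = new HTable(conf, "autosupport_events");   // same hypothetical table as the write sketch

          long now = System.currentTimeMillis();
          Scan scan = new Scan();
          scan.setStartRow(Bytes.toBytes("device-0042:"));         // first possible row for this device
          scan.setStopRow(Bytes.toBytes("device-0042;"));          // ';' sorts just after ':', ending the prefix
          scan.setTimeRange(now - 5 * 60 * 1000L, now);            // only cells stamped within the last 5 minutes

          ResultScanner scanner = table.getScanner(scan);
          for (Result row : scanner) {
            byte[] type = row.getValue(Bytes.toBytes("event"), Bytes.toBytes("type"));
            System.out.println(Bytes.toString(row.getRow()) + " -> " + Bytes.toString(type));
          }
          scanner.close();
          table.close();
        }
      }

    The time range is applied server-side against cell timestamps, so only recently written data is returned to the client.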

    At 4:00pm to 4:40pm, Thursday 1st March

    In Ballroom CD, Santa Clara Convention Center