Sessions at Chicago Data Summit about Hadoop with slides

Tuesday 26th April 2011

  • Data Processing with Hadoop: Scalable and Cost Effective

    by Doug Cutting

    Hadoop is a new paradigm for data processing that scales near linearly to petabytes of data. Commodity hardware running open source software provides unprecedented cost effectiveness. It is affordable to save large, raw datasets, unfiltered, in Hadoop's file system. Together with Hadoop's computational power, this facilitates operations such as ad hoc analysis and retroactive schema changes. An extensive open source tool-set is being built around these capabilities, making it easy to integrate Hadoop into many new application areas.

    At 1:45pm to 2:40pm, Tuesday 26th April

  • Apache HBase: An Introduction

    by Todd Lipcon

    Apache HBase is an open source distributed data-store capable of managing billions of rows of semi-structured data across large clusters of commodity hardware. HBase provides real-time random read-write access as well as integration with Hadoop MapReduce, Hive, and Pig for batch analysis. In this talk, Todd will provide an introduction to the capabilities and characteristics of HBase, comparing and contrasting it with traditional database systems. He will also introduce its architecture and data model, and present some example use cases.

    At 2:45pm to 3:30pm, Tuesday 26th April

  • Extending the Enterprise Data Warehouse with Hadoop

    by Jonathan Seidman and Rob Lancaster

    Hadoop provides the ability to extract business intelligence from extremely large, heterogeneous data sets that were previously impractical to store and process in traditional data warehouses. The challenge for many companies now is in bridging the gap between the data in the data warehouse and the data in Hadoop. In this talk we'll discuss some steps that Orbitz has taken to bridge this gap, including examples of how Hadoop and Hive are used to aggregate data from large data sets, and how that data can be combined with relational data to create new reports that provide actionable intelligence to business users.

    At 2:45pm to 3:30pm, Tuesday 26th April

  • Flume: An Introduction

    by Jonathan Hsieh

    Flume is an open-source, distributed, streaming log collection system designed for ingesting large quantities of data into large-scale data storage and analytics platforms such as Apache Hadoop. It has four goals in mind: Reliability, Scalability, Extensibility, and Manageability. Its horizontally scalable architecture offers fault-tolerant end-to-end delivery guarantees, supports low-latency event processing, provides a centralized management interface, and exposes metrics for ingest monitoring and reporting. It natively supports writing data to Hadoop's HDFS but also has a simple extension interface that allows it to write to other scalable data systems such as low-latency datastores or incremental search indexers.

    At 3:45pm to 4:30pm, Tuesday 26th April

  • Geo-based Content Processing Using HBase

    by Ravi Veeramachaneni

    NAVTEQ uses Cloudera Distribution including Apache Hadoop (CDH) and HBase with Cloudera Enterprise support to process and store location content data. With HBase and its distributed and column-oriented architecture, NAVTEQ is able to process large amounts of data in a scalable and cost-effective way.

    At 3:45pm to 4:30pm, Tuesday 26th April

  • Cloudera's Distribution including Apache Hadoop & Cloudera Enterprise

    by Charles Zedlewski

    This session will discuss what's new in the recently released CDH3 and Enterprise 3.5 products. We'll review how usage of Hadoop has evolved in the enterprise and how CDH3 and Enterprise 3.5 meet these new challenges with advances in functionality, performance, security and manageability.

    At 4:35pm to 5:15pm, Tuesday 26th April
