by Doug Cutting
Hadoop is a new paradigm for data processing that scales near linearly to petabytes of data. Commodity hardware running open source software provides unprecedented cost effectiveness. It is affordable to save large, raw datasets, unfiltered, in Hadoop's file system. Together with Hadoop's computational power, this facilitates operations such as ad hoc analysis and retroactive schema changes. An extensive open source tool-set is being built around these capabilities, making it easy to integrate Hadoop into many new application areas.
by Todd Lipcon
Apache HBase is an open source distributed data-store capable of managing billions of rows of semi-structured data across large clusters of commodity hardware. HBase provides real-time random read-write access as well as integration with Hadoop MapReduce, Hive, and Pig for batch analysis. In this talk, Todd will provide an introduction to the capabilities and characteristics of HBase, comparing and contrasting it with traditional database systems. He will also introduce its architecture and data model, and present some example use cases.
by Jonathan Seidman and Rob Lancaster
Hadoop provides the ability to extract business intelligence from extremely large, heterogeneous data sets that were previously impractical to store and process in traditional data warehouses. The challenge for many companies now is in bridging the gap between the data in the data warehouse and the data in Hadoop. In this talk we'll discuss some steps that Orbitz has taken to bridge this gap, including examples of how Hadoop and Hive are used to aggregate data from large data sets, and how that data can be combined with relational data to create new reports that provide actionable intelligence to business users.
Flume is an open-source, distributed, streaming log collection system designed for ingesting large quantities of data into large-scale data storage and analytics platforms such as Apache Hadoop. It has four goals in mind: Reliability, Scalability, Extensibility, and Manageability. Its horizontal scalable architecture offers fault-tolerant end-to-end delivery guarantees, support for low-latency event processing, provides a centralized management interface , and exposes metrics for ingest monitoring and reporting. It natively supports writing data to Hadoop's HDFS but also has a simple extension interface that allows it to write to other scalable data systems such as low-latency datastores or incremental search indexers.
by Ravi Veeramachaneni
NAVTEQ uses Cloudera Distribution including Apache Hadoop (CDH) and HBase with Cloudera Enterprise support to process and store location content data. With HBase and its distributed and column-oriented architecture, NAVTEQ is able to process large amounts of data in a scalable and cost-effective way.
by Charles Zedlewski
This session will discuss what's new in the recently released CDH3 and Enterprise 3.5 products. We'll review how usage of Hadoop has evolving in the enterprise and how CDH3 and Enterprise 3.5 meet these new challenges with advances in functionality, performance, security and manageability.