by Arun Murthy
Apache Hadoop has made giant strides since the last Hadoop Summit: the community has released hadoop-1.0 after nearly six years and is now on the cusp of Hadoop.next (think of it as hadoop-2.0). With the next generation of MapReduce out in 0.23.0 and 0.23.1, a new set of features has been requested in the community. This talk covers that next set of features, such as preemption, web services, and near-real-time analysis, and how we are working to tackle them in the near future. We will also cover the roadmap and timelines for next-generation MapReduce, along with the release schedule for Apache Hadoop.
by Adam Gray
Big Data technologies such as Hadoop, NoSQL, and scalable object stores are an ideal fit for the elasticity and scalability of a cloud deployment. In this talk we will take a look at several common architectural patterns that are being used today on the AWS cloud to take advantage of these synergies while overcoming some of its inherent limitations. This will include the use of Amazon S3 as a common store for multiple transient Hadoop clusters and how this affects disaster recovery, job scheduling, and software upgrades. We’ll also look at how to build a dynamically scalable Hadoop data warehouse that uses Amazon S3, HDFS, and Amazon DynamoDB to create a three-tiered architecture. Finally, we’ll explore how Amazon Elastic MapReduce is being used to perform sophisticated analytics on data stored in Amazon DynamoDB.
by Andrew Ryan
The Hadoop Distributed Filesystem, or HDFS, provides the storage layer to a variety of critical services at Facebook. The HDFS Namenode is often singled out as a particularly weak aspect of the design of HDFS, because it represents a single point of failure within an otherwise redundant system. To address this weakness, Facebook has been developing a highly available Namenode, known as Avatarnode. The objective of this study was to determine how much effect Avatarnode would have on overall service reliability and durability. To analyze this, we categorized, by root cause, the last two years of operational incidents in the Data Warehouse and Messages services at Facebook, a total of 66 incidents. We were able to show that approximately 10% of each service's incidents would have been prevented had Avatarnode been in place. Avatarnode would have prevented none of our incidents that involved data loss; all of the most severe data-loss incidents were a result of human error or software bugs. Our conclusion is that Avatarnode will improve the reliability of services that use HDFS, but that the HDFS Namenode represents only a small portion of overall operational incidents in services that use HDFS as a storage layer.
by Nish Parikh
Temporal dynamics play an important role not only in stock markets, traffic, weather, and network systems, but also in web search, recommender systems, and eCommerce. In eCommerce, temporal variations are great indicators of the business. Performing temporal data mining tasks like burst detection, event prediction, and temporal similarity search on large-scale data is challenging. Most challenges relate to de-noising large volumes of data and to approximating algorithms that are naively quadratic. eBay is one of the leading online marketplaces, with ~100M users, 100M+ items, and 100M+ daily queries. We talk about how we used Hadoop to generate a seven-year, noise-free corpus of eCommerce queries at eBay. We describe the use of MapReduce over Hadoop for burst detection, which enables us to find daily trends and spikes in demand. Year-over-year burst patterns are analyzed to detect seasonality and predict events. We also discuss the power of Hadoop in scaling temporal similarity search over billions of queries. The mined information has uses in temporal recommender systems, query suggestion systems, and pop-up commerce. References: N. Parikh, N. Sundaresan. "Scalable and Near Real-Time Burst Detection from eCommerce Queries." KDD '08. N. Sundaresan. "Popup Commerce: Towards Building Transient and Thematic Stores."
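The burst-detection step described above can be sketched in miniature. The following is a toy moving-average detector, not eBay's actual algorithm: the `detect_bursts` name, window size, and threshold are all illustrative assumptions.

```python
from collections import deque

def detect_bursts(counts, window=7, threshold=3.0):
    """Flag days whose query count exceeds `threshold` times the
    trailing moving average -- a toy stand-in for burst detection."""
    history = deque(maxlen=window)
    bursts = []
    for day, count in enumerate(counts):
        if len(history) == window:
            baseline = sum(history) / window
            if baseline > 0 and count > threshold * baseline:
                bursts.append(day)
        history.append(count)
    return bursts

# A flat query series with one demand spike on day 9:
daily = [100, 105, 98, 102, 99, 101, 103, 100, 97, 600, 104]
print(detect_bursts(daily))  # the spike day is flagged
```

In a MapReduce setting, each mapper would emit (query, day, count) tuples and a reducer would run a detector like this per query, which is what makes the computation embarrassingly parallel per key.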
by Nathan Marz
Storm is a distributed and fault-tolerant realtime computation system, doing for realtime computation what Hadoop did for batch computation. Storm can be used together with Hadoop to make a potent realtime analytics stack. Although Hadoop is not a realtime system, it can be used to support a realtime system. Building a batch/realtime stack requires solving a lot of sub-problems:
- Getting data to both Hadoop and Storm
- Exporting views of the Hadoop data into a readable index
- Using an appropriate queuing broker to feed Storm
- Choosing an appropriate database to serve the realtime indexes updated via Storm
- Syncing the views produced independently by Hadoop and Storm when doing queries
Come learn how we've solved these problems at Twitter to do complex analytics in realtime.
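The sub-problems above imply that query time must merge a complete-but-stale batch view with a fresh realtime delta. A minimal sketch of that merge step, with hypothetical view names and counts (not Twitter's actual code):

```python
def merge_views(batch_view, realtime_view):
    """Serve a query by combining the batch view (complete but stale)
    with the realtime view (covers only data since the last batch run)."""
    merged = dict(batch_view)
    for key, delta in realtime_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

batch = {"#hadoop": 1200, "#storm": 300}   # counts up to the last batch run
realtime = {"#storm": 45, "#summit": 7}    # counts accumulated since then
print(merge_views(batch, realtime))        # "#storm" combines both views
```

The design point is that the batch layer can recompute its view from scratch, so any realtime approximation error is bounded by the interval between batch runs.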
How YapMap built a new type of conversational search technology using Hadoop as the backbone. While adoption of Hadoop for big data analytics is becoming very common, substantially fewer companies are using Hadoop as a system of record to directly support their primary business. We’ll discuss our experiences using Hadoop in this second way. We’ll take you through our system architecture and how Hadoop ecosystem components can work side by side with traditional application technologies. We’ll cover:
Building a distributed targeted crawler using Zookeeper as a coordinated locking service
How we rely on HBase’s atomicity and optimistic locking to coordinate processing pipelines
Using Map Reduce, HBase regions and bi-level sharding for truly distributed index creation
Running diskless index servers that pull multiple reduce outputs directly from the distributed file system into memory to form a single index
Using Mahout to aid in user exploration
How we integrated the use of Zookeeper, HBase and MapReduce with application focused technologies including JavaEE6, CDI, SQL, Protobuf and RabbitMQ.
Deployment decisions, including switching from CDH to unsupported MapR, using InfiniBand, building a six-node research cluster for $1500, using desktop drives & white boxes, utilizing huge pages, GC hell, etc.
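Of the points above, the bi-level sharding idea for distributed index creation can be illustrated with a small sketch: hash once to pick a shard group, then hash again (salted) to pick the sub-shard within that group. The function name and the group/shard counts are illustrative assumptions, not YapMap's actual scheme.

```python
import hashlib

def shard_for(doc_id, groups=4, shards_per_group=8):
    """Bi-level sharding sketch: the first hash selects a shard group,
    a second salted hash selects the sub-shard inside that group."""
    h1 = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    group = h1 % groups
    h2 = int(hashlib.md5((doc_id + "#sub").encode()).hexdigest(), 16)
    sub = h2 % shards_per_group
    return group, sub

g, s = shard_for("thread-42")
assert shard_for("thread-42") == (g, s)  # assignment is deterministic
print(g, s)
```

Two hash levels let reducers build each sub-shard's index independently, while the group level keeps related sub-shards co-located for serving.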
by Fabrice Bonan
Enterprises are faced with ever increasing amounts of data, and problems related to volume and usage demands have forced IT managers and developers to seek out new solutions. Fortunately, this has resulted in an explosion of innovation in massively parallel processing and non-relational data storage. Apache Hadoop, an open source software platform, has quickly become the technology of choice for large organizations in need of sophisticated analysis and transformation of petabytes of structured and complex data. Fabrice Bonan will discuss how users can access, manipulate and store huge volumes of data in Hadoop and benefit from high-performance, cost-optimized data integration with ultimate scalability. During this session, attendees will learn how to:
- Leverage the explosive growth of data
- Deploy and tap into the powerful architecture of Hadoop
- Process massive data volumes through a combination of Hadoop and open source architecture
Given its ability to analyze structured, unstructured, and “multi-structured” data, Hadoop is an increasingly viable option for analytics and business intelligence within the enterprise. Dramatically more scalable and cost-effective than traditional data warehousing technologies, Hadoop is also increasingly used to perform new kinds of analytics that were previously impossible. When it comes to Big Data, retailers are at the forefront of leveraging large volumes of nuanced information about customers, to improve the effectiveness of promotional campaigns, refine pricing models, and lower overall customer acquisition costs. Retailers compete fiercely for consumers’ attention, time, and money, and effective use of analytics can result in sustained competitive advantage. Forward-thinking retailers can now take advantage of all data sources to construct a complete picture of a customer. This invariably consists of both structured data (customer and inventory records, spreadsheets, etc.) and unstructured data (clickstream logs, email archives, customer feedback and comment fields, etc.). This allows, for example, online retailers with structured, transactional sales data to connect that data with unstructured comments from product reviews, providing insight into how reviews affect consumers’ propensity to purchase a particular product. This session will examine several real-world customer use cases applying combined analysis of structured and unstructured data.
by Lei Chang
Greenplum Database (GPDB) is an industry-leading massively parallel processing (MPP) database offering, providing high performance, scalable, and mission-critical analytic processing. In the Greenplum Unified Analytics Platform (UAP), GPDB and Hadoop fuse the co-processing of structured and unstructured data under a single user interface to empower data science teams. At the bottom layers of GPDB and Hadoop, however, are distinct storage systems: GPDB is based on a local Posix compliant file system and most Hadoop systems are based on HDFS. Having two distinct storage systems increases both the capital and operational costs for supporting unified analytics. One possible solution to this problem is to allow GPDB to run natively on HDFS. This talk will give an introduction to how GPDB lives on HDFS by using a pluggable storage layer in the kernel of the GPDB engine. It also introduces the features we added to HDFS to support the full transactional semantics of GPDB, and how we increased the concurrency of HDFS access from C clients. Detailed experimental performance results will be presented to compare GPDB on HDFS (GOH) to other state-of-the-art big data processing frameworks. We will also discuss our experiences building GOH and the opportunities opened up by the fusion of an MPP database and Hadoop.
Hadoop 1.0 is a significant milestone: the most stable and robust Hadoop release to date, tested in production against a variety of applications. It offers improved performance, support for HBase, disk-fail-in-place, WebHDFS, etc. over previous releases. The next major release, Hadoop 0.23, offers several significant HDFS improvements, including the new append pipeline, federation, wire compatibility, NameNode HA, and further performance improvements. We describe how to take advantage of the new features and their benefits. We also discuss some of the misconceptions and myths about HDFS. The second half of the talk describes our plans for HDFS over the next year, including improvements such as snapshots, disaster recovery, RAID, performance, and scaling.
by Jonathan Hsieh and Jeff Bean
Apache HBase is a rapidly-evolving random-access distributed data store built on top of Apache Hadoop’s HDFS and Apache ZooKeeper. Drawing from real-world support experiences, this talk provides administrators insight into improving HBase’s availability and recovering from situations where HBase is not available. We share tips on the common root causes of unavailability, explain how to diagnose them, and prescribe measures for ensuring maximum availability of an HBase cluster. We discuss new features that improve recovery time such as distributed log splitting as well as supportability improvements. We will also describe utilities including new failure recovery tools that we have developed and contributed that can be used to diagnose and repair rare corruption problems on live HBase systems.
by Michael Radwin
The Merchant Lookup Service at Intuit enables users and products to look up business details by:
Business name (including partial name & misspellings)
Business location (street address, latitude and longitude)
Business type (category, SIC)
User location (IP,GPS-enabled device location)
This powerful service enables auto-suggest, auto-complete, and auto-correct within products. The project aims at providing a more complete, canonical business profile by bringing together data and metadata from the various information providers as well as merchants from Intuit's small business customer base. The Business Directory Service is available as a web service that can be integrated into desktop, web, and mobile applications. It is available through a REST API whose response times are minimized because the data is indexed in Solr and distributed. The backend is powered by HBase, which stores this comprehensive, deduplicated, canonical merchant information. Hundreds of millions of records, with duplicates arising from sparse, manually entered information from Intuit's small business customers as well as from different information providers, are de-duplicated through a series of Hadoop jobs, resulting in a canonical set of merchants. The deduping pipeline has various components, including a Reader, an Index Generator, various Matchers, a Score Combiner, and a Merchant Splicer.
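A dedup pipeline of this shape (blocking, matching, score combining) can be sketched as follows. The blocking key, matcher weights, threshold, and sample records are all hypothetical, not Intuit's actual components:

```python
import re

def normalize(name):
    return re.sub(r"[^a-z0-9]", "", name.lower())

def blocking_key(record):
    # Cheap grouping key: candidate duplicates land in the same block,
    # so matchers only compare records within a block (as in a reducer).
    return normalize(record["name"])[:4] + "|" + record["zip"]

def match_score(a, b):
    score = 0.0
    if normalize(a["name"]) == normalize(b["name"]):
        score += 0.6
    if a["zip"] == b["zip"]:
        score += 0.2
    if a.get("phone") and a.get("phone") == b.get("phone"):
        score += 0.2
    return score

def dedupe(records, threshold=0.8):
    blocks = {}
    for r in records:
        blocks.setdefault(blocking_key(r), []).append(r)
    pairs = []
    for group in blocks.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                if match_score(group[i], group[j]) >= threshold:
                    pairs.append((group[i]["name"], group[j]["name"]))
    return pairs

records = [
    {"name": "Joe's Pizza",   "zip": "94107", "phone": "555-0101"},
    {"name": "Joes Pizza",    "zip": "94107", "phone": "555-0101"},
    {"name": "Joseph's Deli", "zip": "94107", "phone": None},
]
print(dedupe(records))  # the two spellings of Joe's Pizza match
```

Blocking is what keeps the naively quadratic pairwise comparison tractable at hundreds of millions of records: each MapReduce reducer only compares within its block.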
by Jimmy Lin
The success of data-driven solutions to difficult problems, along with the dropping costs of storing and processing massive amounts of data, has led to growing interest in large-scale machine learning. We present a case study of Twitter's integration of machine learning tools into its existing Hadoop-based, Pig-centric analytics platform. Prior to this work, the platform had already demonstrated value in "traditional" data warehousing and business intelligence tasks. Here, we focus on extensions to provide predictive analytics capabilities that incorporate machine learning, focused specifically on supervised classification. In our deployed solution, common machine learning tasks such as data sampling, feature generation, training, and testing can be accomplished directly in Pig, via carefully crafted loaders, storage functions, and user-defined functions. This means that machine learning is just another Pig script, which allows seamless integration with existing infrastructure for data management, scheduling, and monitoring in a production environment, as well as access to rich libraries of user-defined functions and the output of other scripts. This talk is based on a forthcoming paper in SIGMOD 2012.
Trending use cases have pointed out the complementary nature of Hadoop and existing data management systems—emphasizing the importance of leveraging SQL, engineering, and operational skills, as well as incorporating novel uses of MapReduce to improve distributed analytic processing. Many vendors have provided interfaces between SQL systems and Hadoop but have not been able to semantically integrate these technologies while Hive, Pig and SQL processing islands proliferate. This session will discuss how Teradata is working with Hortonworks to optimize the use of Hadoop within the Teradata Analytical Ecosystem to ingest, store, and refine new data types, as well as exciting new developments to bridge the gap between Hadoop and SQL to unlock deeper insights from data in Hadoop. The use of Teradata Aster as a tightly integrated SQL-MapReduce® Discovery Platform for Hadoop environments will also be discussed.
by Mike Brown
This session provides details on how comScore uses Hadoop to process over 30 billion internet and mobile events per day (over 2TB compressed per day) to understand and report on web behavior. This will include the methods used to ensure scalability and provide high uptime for this mission-critical data. The talk will highlight the use of Hadoop to determine and calculate the metrics used in its flagship Media Metrix product. Details on how disparate information sources can be quickly and efficiently combined and analyzed will be addressed. The talk will also detail how algorithms running on top of Hadoop provide deep insight into user behavior and can be used to develop broad insights about it. The session also touches on comScore's best demonstrated practices for large-scale processing with Hadoop and the ability to combine profile information on its panelists and their web activities to develop insights about internet usage.
by Michael Zeller and Ulrich Rueckert
While Hadoop provides an excellent platform for data aggregation and general analytics, it also can provide the right platform for advanced predictive analytics against vast amounts of data, preferably with low latency and in real-time. This drives the business need for comprehensive solutions that combine the aspects of big data with an agile integration of data mining models. Facilitating this convergence is the Predictive Model Markup Language (PMML), a vendor-independent standard to represent and exchange data mining models that is supported by all major data mining vendors and open source tools. This presentation will outline the benefits of the PMML standard as key element of data science best practices and its application in the context of distributed processing. In a live demonstration, we will showcase how Datameer and the Zementis Universal PMML Plug-in take advantage of a highly parallel Hadoop architecture to efficiently derive predictions from very large volumes of data. In this session, the audience will learn: How to leverage predictive analytics in the context of big data — Introduction to the Predictive Model Markup Language (PMML) open standard for data mining — How to reduce cost and complexity of predictive analytics
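To make the PMML idea concrete, here is a toy sketch that loads a drastically simplified PMML-style linear regression model and scores a record with it. The document below is illustrative only (no namespaces, no DataDictionary); a real PMML consumer such as the Universal PMML Plug-in handles the full standard.

```python
import xml.etree.ElementTree as ET

# A toy PMML-style linear regression model (simplified: only the
# elements this sketch reads; field names are hypothetical).
PMML = """
<PMML>
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.5">
      <NumericPredictor name="age" coefficient="0.2"/>
      <NumericPredictor name="income" coefficient="0.001"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

def load_model(pmml_text):
    root = ET.fromstring(pmml_text)
    table = root.find(".//RegressionTable")
    intercept = float(table.get("intercept"))
    coeffs = {p.get("name"): float(p.get("coefficient"))
              for p in table.findall("NumericPredictor")}
    return intercept, coeffs

def score(model, row):
    intercept, coeffs = model
    # intercept + sum of coefficient * field value
    return intercept + sum(c * row[name] for name, c in coeffs.items())

model = load_model(PMML)
print(score(model, {"age": 30, "income": 50000}))
```

Because the model is a self-contained document, the same artifact can be scored inside each mapper in parallel, which is the property the Hadoop integration exploits.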
The HDFS NameNode is a robust and reliable service as seen in practice in production at Yahoo and other customers. However, the NameNode does not have automatic failover support. A hot failover solution called HA NameNode is currently under active development (HDFS-1623). This talk will cover the architecture, design and setup. We will also discuss the future direction for HA NameNode.
by Vishal Malik
Existing storage tiering solutions are mostly hardware-based and very expensive. RAID costs are very high, and policy-based tiering is not transparent to the user and is done mostly at the hardware level. The iMStor Platform that we have developed enables easy-to-manage storage systems with policy-based management at the user level, controlling what type of data lives in which Hadoop-based storage engine, ranging from HBase and MongoDB to CouchDB, Riak, or Redis. iMStor provides a software-based storage engine, built on the Hadoop stack, for doing all the intelligent work that today is done in hardware and on non-Hadoop platforms.
by Oscar Boykin
Scala is a functional programming language on the JVM. Hadoop uses a functional programming model to represent large-scale distributed computation. Scala is thus a very natural match for Hadoop. We will present Scalding, which is built on top of Cascading. Scalding brings an API very similar to Scala's collection API, allowing users to write jobs as they might locally and run those jobs at scale. Scalding differs from Pig in that there is a single syntax for your entire job. There is no concept of UDFs; instead, users can inline functions, or use their own Java or Scala classes directly in the Scalding job. Twitter uses Scalding for data analysis and machine learning, particularly in cases where we need more than SQL-like queries on the logs, for instance fitting models and matrix processing. It scales beautifully from simple, grep-like jobs all the way up to jobs with hundreds of map-reduce pairs. This talk will present the Scalding DSL and show some example jobs for common use cases.
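Scalding itself is Scala, but the collection-like shape of a Scalding job can be suggested in a few lines of Python: a flatMap followed by a group-and-count, i.e. the classic word count. The Scalding fragment in the comment is an approximation of the fields-based API, not code from the talk.

```python
from collections import Counter
from itertools import chain

def flat_map(func, items):
    # The analogue of flatMap on a Scala collection: apply func to each
    # item and concatenate the resulting sequences.
    return chain.from_iterable(func(x) for x in items)

lines = ["hadoop does batch", "storm does realtime", "hadoop scales"]

# Roughly the same shape as a Scalding word-count pipeline:
#   TextLine(input).flatMap('line -> 'word) { _.split(" ") }
#                  .groupBy('word) { _.size }
counts = Counter(flat_map(lambda line: line.split(), lines))
print(counts["hadoop"], counts["does"])
```

The point of the comparison is that the same code shape runs on an in-memory collection locally and as a cascade of map-reduce steps at scale; only the runtime changes.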
by Eric C Yang and Kan Zhang
To deploy the whole Hadoop stack and manage its full life cycle is no small feat. This is partly due to the interdependency among services and partly due to the large number of nodes in a Hadoop cluster. For example, when upgrading an HDFS cluster, one needs to make sure NameNode is upgraded successfully before starting up the DataNodes. Such cross-node dependencies are difficult to specify in existing configuration management systems like Puppet. Hadoop clusters can have thousands of nodes and a desirable configuration management system needs to handle multiple such clusters. For example, a use case may be decommissioning a set of nodes from one cluster and adding them to another. This calls for scalable cluster status monitoring and action status tracking. In this session, we present a highly scalable configuration management solution that supports arbitrary state-based cross-node dependencies.
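The NameNode-before-DataNodes ordering described above is a topological sort over a service dependency DAG. A minimal sketch of that ordering step, with a hypothetical dependency map (not the system's actual configuration format):

```python
from graphlib import TopologicalSorter

# Cross-node dependencies as a DAG: each service maps to the set of
# services that must be up (or upgraded) before it may start.
deps = {
    "datanode":     {"namenode"},
    "hbase-master": {"namenode", "zookeeper"},
    "regionserver": {"hbase-master", "datanode"},
}

order = list(TopologicalSorter(deps).static_order())
print(order)  # every service appears after all of its prerequisites
```

A real configuration manager would additionally wait on per-node status reports between stages; the DAG only fixes the order in which those stages may begin.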
by Ron Bodkin and Mike Peterson
Neustar is a fast growing provider of enterprise services in telecommunications, online advertising, Internet infrastructure, and advanced technology. Neustar has engaged Think Big Analytics to leverage Hadoop to expand their data analysis capacity. This session describes how Hadoop has expanded their data warehouse capacity, agility for data analysis, reduced costs, and enabled new data products. We look at the challenges and opportunities in capturing 100′s of TB’s of compact binary network data, ad hoc analysis, integration with a scale out relational database, more agile data development, and building new products integrating multiple big data sets.
I'll describe what I see as the most current and exciting big data challenges at NASA, across a number of application domains:
- Planetary science
- Next-generation Earth science decadal missions
- Radio astronomy and the next-generation instruments, including the Square Kilometre Array
- Snow hydrology and climate impacts
The session will focus on defining the problem space, suggesting how technologies like Apache Hadoop can be leveraged, and where Apache Hadoop fits against the Apache OODT technology, pioneered originally by NASA and now another Apache big data technology growing at the foundation. In addition, I will present what I feel to be the key next steps, architecturally, that the Hadoop community can take to better integrate with and apply to the realm of science data systems.
by Abe Taha
Hadoop is fast becoming an integral part of BI infrastructures to empower everyone in the business to affordably analyze large amounts of multi-structured data not previously possible with traditional OLAP or RDBMS technologies. This session will cover best practices for implementing Big Data Analytics directly on Hadoop. Topics will include architecture, integration and data exchange; self-service data exploration and analysis; openness and extensibility; the roles of Hive and the Hive metastore; integration with BI dashboards, reporting and visualization tools; using existing SAS and SPSS models and advanced analytics; ingesting data with Sqoop, Flume, and ETL.
by Benjamin Reed
Zookeeper is an important member of the Hadoop ecosystem. It is widely used in industry to coordinate distributed systems. A common use case of Zookeeper is to dynamically maintain membership and other configuration metadata for its users. Zookeeper itself is a replicated distributed system. Unfortunately, the membership and all other configuration parameters of Zookeeper are static: they're loaded during boot and cannot be altered. Operators resort to "rolling restart", a manually intensive and error-prone method of changing the configuration that has caused data loss and inconsistency in production. Automatic reconfiguration functionality has been requested by operators since 2008 (ZOOKEEPER-107). Several previous proposals were found incorrect and rejected. We implemented a new reconfiguration protocol in Zookeeper and are currently integrating it into the codebase. It fully automates configuration changes: the set of Zookeeper servers, their roles, addresses, etc. can be changed dynamically, without service interruption and while maintaining data consistency. By leveraging the properties already provided by Zookeeper, our protocol is considerably simpler than the state of the art. Our protocol also encompasses the clients: clients are rebalanced across servers in the new configuration, while keeping the extent of migration proportional to the change in membership.
by Stephen O'Sullivan and Jeremy King
Use case:
- Research project to evaluate Hadoop for a business case, then moving the results into production
- Became the store for all web site performance / platform stats for operational analytics
- 8 different groups started using the cluster, and it grew from 8 to 12 nodes (with a total of 64TB) within 7 months
- Started to do more business analytics, but had to start moving data around as space became an issue
- Business funds the purchase of a 250+ node cluster with a total of 1.2PB of space…
- …and this cluster will grow by 200+ nodes by the end of the year
How Hadoop will be used on the next-generation @WalmartLabs / walmart.com platform:
- Storing all the fine-grained data generated by walmart.com
- Realtime analytics for site performance and alerting
- Hybrid model, both with data warehouses (that are using current BI tools) and with NoSQL
We use “SaasBase Analytics” to incrementally process large heterogeneous data sets into pre-aggregated, indexed views, stored in HBase to be queried in realtime. The requirement we started from was to make large amounts of data available in near realtime (minutes) to large numbers of users for a large number of different queries that take milliseconds to execute. This set our problem apart from classical solutions such as Hive and Pig. In this talk I'll go through the design of the solution and the strategies (and hacks) used to achieve low latency and scalability, from the theoretical model to the entire process of ETL to warehousing and queries.
by Tim Mallalieu
There are a number of excellent databases that have proven invaluable when working with "big data", including Cassandra, Riak, DynamoDB, MongoDB, and HBase. So, how do you decide which is the right solution for you? This talk will start by briefly discussing some of the theory involved with distributed databases: from the oft-cited (and almost-as-often misunderstood) CAP theorem, to vector clocks and the difficulties of eventual consistency, and much more. We will then compare how mainstream distributed databases make use of these concepts, and the tradeoffs they incur. Unfortunately, due to the nature of these tradeoffs, there is no one-size-fits-all solution. So, we'll discuss what classes of problems each of these systems is appropriate for, and include some real-world benchmarks, to help you make an informed decision about your distributed systems.
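Of the concepts mentioned, vector clocks are compact enough to sketch: each replica increments its own counter on a write, and two clocks are ordered only when one dominates the other componentwise; otherwise the writes are concurrent and must be reconciled. A minimal illustration (toy code, not any particular database's implementation):

```python
def vc_increment(clock, node):
    """Return a copy of the clock with `node`'s counter bumped."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def vc_compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent'."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"

v1 = vc_increment({}, "A")   # write on replica A
v2 = vc_increment(v1, "B")   # descends from v1 via replica B
v3 = vc_increment(v1, "C")   # sibling write on replica C
print(vc_compare(v1, v2))    # before
print(vc_compare(v2, v3))    # concurrent -- a conflict to reconcile
```

The "concurrent" case is exactly where eventual consistency gets hard: the database can detect the conflict, but some policy (last-write-wins, sibling merge, application logic) must resolve it.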
NetApp collects 500 TB per year of unstructured data from devices that phone home, sending unstructured auto-support log and configuration data back to centralized data centers. NetApp needed to scale their architecture, to provide timely support data, to extend their data warehouse, and to be able to do ad hoc analysis and build predictive models for device support, product analysis, and cross-sales. This presentation presents the challenges and a solution architecture based on Apache Hadoop, Apache HBase, Apache Flume, Apache Avro, and Hive, as well as Pig and Solr. The presentation reviews lessons from the implementation.
13th–14th June 2012