Facebook has one of the largest Apache Hadoop data warehouses in the world, primarily queried through Apache Hive for offline data processing and analytics. However, the need for realtime analytics and end-user access has led to the development of several new systems built using Apache HBase. This talk will cover specific use cases and the work done at Facebook around building large scale, low latency and high throughput realtime services with Hadoop and HBase. This includes several significant contributions to existing projects as well as the release of new open source projects.
by Michael Sun
We successfully adopted Hadoop as the web analytics platform at CBS Interactive, processing one billion weblogs daily from hundreds of web site properties. I will first introduce Lumberjack, the extraction, transformation and loading (ETL) framework we built on Python and streaming, which is under review for open-source release. I will then talk about web metrics processing on Hadoop, focusing on weblog harvesting, parsing, dimension look-up, sessionization, and loading into a database. Since migrating processing from a proprietary platform to Hadoop, we have achieved robustness, fault tolerance and scalability, and a significant reduction in the processing time needed to meet our SLA (over six hours so far).
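The sessionization step named in the abstract above can be illustrated with a minimal Python sketch. This is not Lumberjack code: the function name, the (user_id, timestamp) event shape, and the 30-minute inactivity gap are all assumptions chosen for illustration. The idea is to sort events per user and start a new session whenever the gap between consecutive events exceeds the timeout.

```python
from itertools import groupby
from operator import itemgetter

# Assumed inactivity timeout; the abstract does not state the real value.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Assign a per-user session index to each (user_id, timestamp) event.

    Yields (user_id, session_index, timestamp) tuples, ordered by user
    and then by time. Timestamps are plain numbers of seconds.
    """
    # Sort by user, then by time, so each user's events are contiguous.
    ordered = sorted(events, key=itemgetter(0, 1))
    for user_id, user_events in groupby(ordered, key=itemgetter(0)):
        session_index = 0
        previous_ts = None
        for _, ts in user_events:
            # A long pause since the previous event opens a new session.
            if previous_ts is not None and ts - previous_ts > gap:
                session_index += 1
            yield (user_id, session_index, ts)
            previous_ts = ts


if __name__ == "__main__":
    sample = [("u1", 0), ("u1", 600), ("u2", 100), ("u1", 4000)]
    for row in sessionize(sample):
        print(row)
```

In a streaming job the sort would normally be done by the framework's shuffle phase, with the loop over one user's events running in the reducer; the in-memory sort here only keeps the sketch self-contained.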
It’s increasingly clear that Big Data is not just about volume, but also about the variety, complexity and velocity of enterprise information. Integrating data with insights from unstructured information such as documents, call logs, and web content is essential to driving sustainable business value. Aggregating and analyzing unstructured content is challenging because human expression is diverse, varies by location, and changes over time. To understand the causes of data trends, you need advanced text analytic capabilities. Furthermore, you need a system that provides direct, real-time access to discover hidden insights. In this session, you will learn how unified information access (UIA) uniquely completes the picture by integrating Big Data directly with unstructured content and advanced text analytics, and making it directly accessible to business users.
by Jim Haas
Our need for better scalability in processing weblogs is illustrated by the change in requirements: processing 1 billion web events a day (and growing), up from 250 million. The Data Warehouse group at CBSi has been transitioning core processes to re-architected Hadoop processes for two years. We will cover strategies used for successfully transitioning core ETL processes to big data capabilities and present a how-to guide for re-architecting a mission-critical Data Warehouse environment while it’s running.
by Y Masatani
NTT DATA has been providing Hadoop professional services for enterprise customers for years. In this talk we will categorize Hadoop integration cases based on our experience and illustrate archetypal design practices for deploying Hadoop clusters into existing infrastructure and services. We will also present enhancement cases motivated by customer demand, including GPUs for big math, HDFS-capable storage systems, etc.
by Josh Lospinoso
In this session we will look at Reveal, a statistical network analysis library built on Hadoop that uses relational event history analysis to grapple with the complexity, temporal causality, and uncertainty associated with dynamically evolving, growing, and changing networks. There is a broad range of applications for this work, from finance to social network analysis to network security.
Attend this session and walk away armed with solutions to the most common customer problems. Learn proactive configuration tweaks and best practices to keep your cluster free of fetch failures, job tracker hangs, and other common issues.
by Lance Riedel
Jive is using Flume to deliver the content of a social web (250M messages/day) to HDFS and HBase. Flume’s flexible architecture allows us to stream data to our production data center as well as to an Amazon Web Services data center. We periodically build and merge Lucene indices with Hadoop jobs and deploy these to Katta to provide near-real-time search results. This talk will explore our infrastructure and the decisions we’ve made to handle a fast-growing set of real-time data feeds. We will further explore other uses for Flume throughout Jive, including log collection and our distributed event bus.
by Matt Aslett
Who is contributing to the Hadoop ecosystem, what are they contributing, and why? Who are the vendors that are supplying Hadoop-related products and services and what do they want from Hadoop? How is the expanding ecosystem benefiting or damaging the Apache Hadoop project? What are the emerging alternatives to Hadoop and what chance do they have? In this session, the 451 Group will seek to answer these questions based on their latest research and present their perspective of where Hadoop fits in the total data management landscape.
by Charles Zedlewski and Eli Collins
Many people refer to Apache Hadoop as their system of choice for big data management but few actually use just Apache Hadoop. Hadoop has become a proxy for a much larger system which has HDFS storage at its core. The Apache Hadoop based “big data stack” has changed dramatically over the past 24 months and will change even more over the next 24 months. This session will explore the trends in the evolution of the Hadoop stack, change in architecture and changes in the kinds of use cases that are supported. It will also review the role of interoperability and cohesion in the Apache Hadoop stack and the role of Apache Bigtop in this regard.
Hadoop is making its way into the enterprise, as organizations look to extract valuable information and intelligence from the mountains of data in their storage environments. The way in which this data is analyzed and stored is changing, and Hadoop has become a critical part of this transformation. In this session, Vanessa will cover the trends we are seeing in the enterprise with regard to Hadoop adoption and how it’s being used, as well as predictions on where we see Hadoop, and Big Data in general, going as we enter 2012.
by Steven Noels
Lily is a repository made for the age of Data, and combines CDH, HBase and Solr in a powerful, high-level, developer-friendly backing store for content-centric applications with the ambition to scale. In this session, we highlight why we chose HBase as the foundation for Lily, and how Lily will allow users not only to store, index and search vast quantities of data, but also to track audience behaviour and generate recommendations, all in real time.
by Soundar Velu
Raptor combines Hadoop & HBase with machine learning models for adaptive data segmentation, partitioning, bucketing, and filtering to enable ad-hoc queries and real-time analytics.
Raptor has intelligent optimization algorithms that switch query execution between HBase and MapReduce. Raptor can create per-block dynamic bloom filters for adaptive filtering. A policy manager allows optimized indexing and autosharding.
This session will address how Raptor has been used in prototype systems for predictive trading, time-series analytics, smart customer care solutions, and a generalized analytics solution that can be hosted in the cloud.
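For readers unfamiliar with the Bloom filters mentioned in the Raptor abstract, the following is a minimal, generic Python sketch of the data structure. It is not Raptor's implementation, and it does not model the per-block or dynamic behaviour described above; the class name, bit-array size, and hash count are illustrative assumptions. A Bloom filter answers "definitely not present" or "possibly present", which lets a query engine skip blocks that cannot contain a key.

```python
import hashlib


class BloomFilter:
    """A small fixed-size Bloom filter (illustrative, not tuned)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per slot, packed into bytes.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means the key was never added; True may be a false positive.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


if __name__ == "__main__":
    block_filter = BloomFilter()
    block_filter.add("row-42")
    print(block_filter.might_contain("row-42"))
```

The filter never produces false negatives, so a negative answer is safe to act on; the false-positive rate depends on the bit-array size, the number of hashes, and how many keys have been added.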
by Jeremy Glesner and Richard Clayton
Security in a distributed environment is a growing concern for most industries. Few face security challenges like the Defense Community, who must balance complex security constraints with timeliness and accuracy. We propose to briefly discuss the security paradigms defined in DCID 6/3 by NSA for secure storage and access of data (the “Protection Level” system). In addition, we will describe the implications of each level on the Hadoop architecture and various patterns organizations can implement to meet these requirements within the Hadoop ecosystem. We conclude with our “wish list” of features essential to meet the federal security requirements.
by Jean-Pierre Dijcks
Analyzing new and diverse digital data streams can reveal new sources of economic value, provide fresh insights into customer behavior and identify market trends early on. But this influx of new data can create challenges for IT departments. To derive real business value from Big Data, you need the right tools to capture and organize a wide variety of data types from different sources, and to be able to easily analyze it within the context of all your enterprise data. Attend this session to learn how Oracle’s end-to-end value chain for Big Data can help you unlock the value of Big Data.
At bitly we study behaviour on the internet by capturing clicks on shortened URLs. This link traffic comes in many forms yet, when studying human behaviour, we’re only interested in using ‘organic’ traffic: the traffic patterns caused by actual humans clicking on links that have been shared on the social web. To extract these patterns, we employ Python/Numpy, streaming Hadoop and some Machine Learning to create a model of organic traffic patterns based on bitly’s click logs. This model lets us extract the traffic we’re interested in from the variety of patterns generated by inorganic entities following bitly links.
by Jacob Rapp
Hadoop is a popular framework for web 2.0 and enterprise businesses that are challenged to store, process and analyze large amounts of data as part of their business requirements. Hadoop’s framework brings a new set of challenges related to the compute infrastructure and underlying network architectures. This session reviews the state of Hadoop enterprise environments, discusses fundamental and advanced Hadoop concepts, and reviews benchmarking analysis and projections for big data growth as they relate to data center and cluster designs. The session also discusses network architecture tradeoffs and the advantages of close integration between compute and networking.
HDFS HA has been a highly sought-after feature for years. Through collaboration between Cloudera, Facebook, Yahoo!, and others, a high availability system for the HDFS NameNode is actively being worked on. This talk will discuss the architecture and setup of this system.
by Eric Sammer, Gregory Baker and Karthik Ranganathan
This session will be a panel discussion with experienced Hadoop Operations practitioners from several different organizations. We’ll discuss the role, the challenges and how both these will change in the coming years.
by Tony Baer
As Big Data has captured attention as one of “the next big things” in enterprise IT, most of the spotlight has focused on early adopters. But what is the state of Big Data adoption across the enterprise mainstream? Ovum recently surveyed 150 global organizations in a variety of vertical industries that have revenue of $500 million or more and manage large enterprise data warehouses. We will share the findings from the research in this session. We will reveal similarities in awareness, readiness, and business drivers when compared.
Much of Hadoop adoption thus far has been for use cases such as processing log files, text mining, and storing masses of file data — all very necessary, but largely not exciting. In this presentation, Michael Cutler presents a selection of methodologies, primarily using Mahout, that will enable you to derive real insight into your data (mined in Hadoop) and build a recommendation engine focused on the implicit data collected from your users.
by Daniel Abadi
As Hadoop rapidly becomes the universal standard for scalable data analysis and processing, it is increasingly important to understand its strengths and weaknesses for particular application scenarios in order to avoid inefficiency pitfalls. For example, Hadoop has great potential to perform scalable graph analysis if it is used correctly. Recent benchmarking has shown that simple implementations can be 1300 times less efficient than a more optimal Hadoop-centered implementation. In this talk, Daniel Abadi will give an overview of a recent research project at Yale University that investigates how to perform sub-graph pattern matching within a Hadoop-centered system that is three orders of magnitude faster than a simpler approach. Daniel will highlight how the cleaning, transforming, and parallel processing strengths of Hadoop are combined with storage optimized for graph data analysis. He will then discuss further changes that are needed in the core Hadoop framework to take performance to the next level.
by Arvind Prabhakar and Guy Harrison
As Hadoop graduates from pilot project to a mission critical component of the enterprise IT infrastructure, integrating information held in Hadoop and in Enterprise RDBMS becomes imperative.
We’ll look at key scenarios driving Hadoop and RDBMS integration and review technical options. In particular, we’ll deep dive into the Apache Sqoop project, which expedites data movement between Hadoop and any JDBC database, and provides a framework that allows developers and vendors to create connectors optimized for specific targets such as Oracle, Netezza, etc.
by Bob Gourley
Federal, state and local governments and the development community surrounding them are busy creating solutions leveraging the Apache Foundation Hadoop capabilities. This session will highlight the top five, picked by an all-star panel of judges. Who will take home the coveted Government Big Data Solutions Award for 2011? This presentation will also highlight key Big Data mission needs in the federal space and provide other insights which can fuel solutions in the sector.
by Aaron Kimball and Garrett Wu
WibiData is a collaborative data mining and predictive modeling platform for large-scale, multi-structured, user-centric data. It leverages HBase to combine batch analysis and real time access within the same system, and integrates with existing BI, reporting and analysis tools. WibiData offers a set of libraries for common user-centric analytic tasks, and more advanced data mining libraries for personalization, recommendation, and other predictive modeling applications. Developers can write re-usable libraries that are also accessible to data scientists and analysts alongside the WibiData libraries. In this talk, we will provide a technical overview of WibiData, and show how we used it to build FoneDoktor, a mobile app that collects data about device performance and app resource usage to offer personalized battery and performance improvement recommendations directly to users.
by Aaron Beppu
How can you rank product search results when you have very little data about how past shoppers have interacted with the products? Through large scale analysis of its clickstream data, Etsy is automatically discovering product attributes (things like materials, prices, or text features) which signal that a search result is particularly relevant (or irrelevant) to a given query. This attribute-level approach makes it possible to appropriately rank products in search results, even if those products are brand new and one-of-a-kind. This presentation discusses Etsy’s efforts to predict relevance in product search, in which Hadoop is a central component.
by Charlie Lougheed and Doug Meil
Explorys, founded in 2009 in partnership with the Cleveland Clinic, is one of the largest clinical repositories in the United States with 10 million lives under contract.
HBase and Hadoop are at the center of Explorys. The Explorys healthcare platform is based upon a massively parallel computing model that enables subscribers to search and analyze patient populations, treatment protocols, and clinical outcomes. Already spanning billions of anonymized clinical records, Explorys provides uniquely powerful and HIPAA compliant solutions for accelerating life saving discovery.
by Greg Rokita and Krishnan Parasuraman
Hadoop has rapidly emerged as a viable platform for Big Data analytics. Many experts believe Hadoop will subsume many of the data warehousing tasks presently done by traditional relational systems. In this session, you will learn about the similarities and differences of Hadoop and parallel data warehouses, and typical best practices. Edmunds will discuss how they increased delivery speed, reduced risk, and achieved faster reporting by combining ELT and ETL. For example, Edmunds ingests raw data into Hadoop and HBase then reprocesses the raw data in Netezza. You will also learn how Edmunds uses prototyping to work on nearly raw data with the company’s Analytics Team using Netezza.
by David “Sunny” Sundstrom
SGI has been a leading commercial vendor of Hadoop clusters since 2008. Leveraging SGI’s experience with high performance clusters at scale, SGI has delivered individual Hadoop clusters of up to 4000 nodes. In this presentation, through the discussion of representative customer use cases, you’ll explore major design considerations for performance and power optimization, how integrated Hadoop solutions leveraging CDH, SGI Rackable clusters, and SGI Management Center best meet customer needs, and how SGI envisions the needs of enterprise customers evolving as Hadoop continues to move into mainstream adoption.
8th–9th November 2011