Hadoop World 2011 schedule

Tuesday 8th November 2011

  • General Session: Keynote Speakers

    by Hugh E Williams, Larry Feinsmith and Mike Olson

    At 8:30am to 10:00am, Tuesday 8th November

  • Building Realtime Big Data Services at Facebook with Hadoop and HBase

    by Jonathan Gray

    Facebook has one of the largest Apache Hadoop data warehouses in the world, primarily queried through Apache Hive for offline data processing and analytics. However, the need for realtime analytics and end-user access has led to the development of several new systems built using Apache HBase. This talk will cover specific use cases and the work done at Facebook around building large scale, low latency and high throughput realtime services with Hadoop and HBase. This includes several significant contributions to existing projects as well as the release of new open source projects.

    At 10:15am to 11:05am, Tuesday 8th November

  • Building Web Analytics Processing on Hadoop at CBS Interactive

    by Michael Sun

    We successfully adopted Hadoop as the web analytics platform at CBS Interactive, processing one billion weblogs daily from hundreds of web site properties. After introducing Lumberjack, the extraction, transformation and loading (ETL) framework we built on Python and Hadoop Streaming (currently under review for open-source release), I will talk about web metrics processing on Hadoop, focusing on weblog harvesting, parsing, dimension look-up, sessionization, and loading into a database. Since migrating from a proprietary platform to Hadoop, we have gained robustness, fault tolerance and scalability, and have reduced processing time against our SLA by more than six hours so far. (A simplified streaming sketch follows this entry.)

    At 10:15am to 11:05am, Tuesday 8th November
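
    Lumberjack itself is not yet released, so as a rough illustration of the streaming approach described above, here is a minimal Hadoop Streaming mapper in Python that parses weblog lines and keys each event by visitor and site for downstream sessionization. The field layout and session key are invented for the example, not CBS Interactive’s actual format.

      #!/usr/bin/env python
      # weblog_mapper.py: illustrative Hadoop Streaming mapper.
      # Parses tab-separated weblog lines and emits (cookie|site) as the
      # key so the reduce phase can sessionize events per visitor.
      import sys

      for line in sys.stdin:
          fields = line.rstrip("\n").split("\t")
          if len(fields) < 4:
              continue  # skip malformed records
          timestamp, cookie, site, url = fields[:4]
          sys.stdout.write("%s|%s\t%s\t%s\n" % (cookie, site, timestamp, url))

    Such a mapper would be launched with the standard hadoop-streaming jar, paired with a reducer that groups each visitor’s events into sessions.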

  • Completing the Big Data Picture: Understanding Why and Not Just What

    by Sid Probstein

    It’s increasingly clear that Big Data is not just about volume, but also about the variety, complexity and velocity of enterprise information. Integrating data with insights from unstructured information such as documents, call logs, and web content is essential to driving sustainable business value. Aggregating and analyzing unstructured content is challenging because human expression is diverse, varies by location, and changes over time. To understand the causes of data trends, you need advanced text analytics. Furthermore, you need a system that provides direct, real-time access to discover hidden insights. In this session, you will learn how unified information access (UIA) uniquely completes the picture by integrating Big Data with unstructured content and advanced text analytics, and by making it directly accessible to business users.

    At 10:15am to 11:05am, Tuesday 8th November

  • Hadoop in a Mission Critical Environment

    by Jim Haas

    Our need for better scalability in weblog processing is illustrated by the change in requirements: processing 250 million versus 1 billion web events a day (and growing). The data warehouse group at CBSi has been transitioning core processes to re-architected Hadoop processes for two years. We will cover strategies used for successfully transitioning core ETL processes to big data capabilities, and present a how-to guide for re-architecting a mission-critical data warehouse environment while it’s running.

    At 10:15am to 11:05am, Tuesday 8th November

  • Hadoop's Life in Enterprise Systems

    by Y Masatani

    NTT DATA has been providing Hadoop professional services to enterprise customers for years. In this talk we will categorize Hadoop integration cases based on our experience and illustrate archetypal design practices for deploying Hadoop clusters into existing infrastructure and services. We will also present enhancements motivated by customer demand, including GPUs for heavy numerical computation and HDFS-compatible storage systems.

    At 10:15am to 11:05am, Tuesday 8th November

  • Building Relational Event History Model with Hadoop

    by Josh Lospinoso

    In this session we will look at Reveal, a statistical network analysis library built on Hadoop that uses relational event history analysis to grapple with the complexity, temporal causality, and uncertainty of dynamically evolving networks. There is a broad range of applications for this work, from finance to social network analysis to network security.

    At 11:15am to 12:05pm, Tuesday 8th November

  • Hadoop Troubleshooting 101

    by Kathleen Ting

    Attend this session and walk away armed with solutions to the most common customer problems. Learn proactive configuration tweaks and best practices to keep your cluster free of fetch failures, JobTracker hangs, and other common issues. (An illustrative configuration tweak follows this entry.)

    At 11:15am to 12:05pm, Tuesday 8th November
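
    As an illustration of the kind of proactive tweak the abstract mentions (not necessarily from the speaker’s own list), reduce-side fetch failures are often addressed by tuning the shuffle in mapred-site.xml; the values below are examples only.

      <!-- mapred-site.xml: example shuffle tuning for fetch failures -->
      <property>
        <name>tasktracker.http.threads</name>
        <!-- threads serving map output to reducers (default 40) -->
        <value>80</value>
      </property>
      <property>
        <name>mapred.reduce.parallel.copies</name>
        <!-- parallel map-output fetches per reducer (default 5) -->
        <value>10</value>
      </property>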

  • Storing and Indexing Social Media Content in the Hadoop Ecosystem

    by Lance Riedel

    Jive is using Flume to deliver the content of a social web (250M messages/day) to HDFS and HBase. Flume’s flexible architecture allows us to stream data to our production data center as well as to the Amazon Web Services data center. We periodically build and merge Lucene indices with Hadoop jobs and deploy them to Katta to provide near-real-time search results. This talk will explore our infrastructure and the decisions we’ve made to handle a fast-growing set of real-time data feeds. We will also explore other uses of Flume throughout Jive, including log collection and our distributed event bus. (An illustrative Flume configuration sketch follows this entry.)

    At 11:15am to 12:05pm, Tuesday 8th November
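
    The session predates today’s Flume NG, but as a rough sketch of the fan-out topology described above, a present-day Flume NG properties file might look like the following; the agent, channel, table and path names are all invented.

      # flume.properties: one source fanned out to HDFS and HBase sinks
      jive.sources = social
      jive.channels = c-hdfs c-hbase
      jive.sinks = k-hdfs k-hbase

      # Avro RPC source receiving the social-content stream
      jive.sources.social.type = avro
      jive.sources.social.bind = 0.0.0.0
      jive.sources.social.port = 4141
      jive.sources.social.channels = c-hdfs c-hbase

      jive.channels.c-hdfs.type = memory
      jive.channels.c-hbase.type = memory

      # Sink 1: raw events into HDFS for batch jobs
      jive.sinks.k-hdfs.type = hdfs
      jive.sinks.k-hdfs.channel = c-hdfs
      jive.sinks.k-hdfs.hdfs.path = hdfs://namenode/flume/social

      # Sink 2: the same events into an HBase table for serving
      jive.sinks.k-hbase.type = hbase
      jive.sinks.k-hbase.channel = c-hbase
      jive.sinks.k-hbase.table = messages
      jive.sinks.k-hbase.columnFamily = raw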

  • The Blind Men and the Elephant

    by Matt Aslett

    Who is contributing to the Hadoop ecosystem, what are they contributing, and why? Who are the vendors that are supplying Hadoop-related products and services and what do they want from Hadoop? How is the expanding ecosystem benefiting or damaging the Apache Hadoop project? What are the emerging alternatives to Hadoop and what chance do they have? In this session, the 451 Group will seek to answer these questions based on their latest research and present their perspective of where Hadoop fits in the total data management landscape.

    At 11:15am to 12:05pm, Tuesday 8th November

  • The Hadoop Stack - Then, Now and In The Future

    by Eli Collins and Charles Zedlewski

    Many people refer to Apache Hadoop as their system of choice for big data management but few actually use just Apache Hadoop. Hadoop has become a proxy for a much larger system which has HDFS storage at its core. The Apache Hadoop based “big data stack” has changed dramatically over the past 24 months and will change even more over the next 24 months. This session will explore the trends in the evolution of the Hadoop stack, change in architecture and changes in the kinds of use cases that are supported. It will also review the role of interoperability and cohesion in the Apache Hadoop stack and the role of Apache Bigtop in this regard.

    At 11:15am to 12:05pm, Tuesday 8th November

  • Hadoop Trends & Predictions

    by Vanessa Alvarez

    Hadoop is making its way into the enterprise as organizations look to extract valuable information and intelligence from the mountains of data in their storage environments. The way this data is analyzed and stored is changing, and Hadoop has become a critical part of that transformation. In this session, Vanessa will cover the trends we are seeing in enterprise Hadoop adoption and how it’s being used, as well as predictions on where Hadoop, and Big Data in general, are going as we enter 2012.

    At 1:15pm to 2:05pm, Tuesday 8th November

  • Lily: Smart Data at Scale, Made Easy

    by Steven Noels

    Lily is a repository made for the age of data: it combines CDH, HBase and Solr into a powerful, high-level, developer-friendly backing store for content-centric applications with the ambition to scale. In this session, we highlight why we chose HBase as the foundation for Lily, and how Lily allows users not only to store, index and search vast quantities of data, but also to track audience behaviour and generate recommendations, all in real time.

    At 1:15pm to 2:05pm, Tuesday 8th November

  • Raptor - Real-time Analytics on Hadoop

    by Soundar Velu

    Raptor combines Hadoop and HBase with machine-learning models for adaptive data segmentation, partitioning, bucketing, and filtering to enable ad-hoc queries and real-time analytics.

    Raptor has intelligent optimization algorithms that switch query execution between HBase and MapReduce. It can create per-block dynamic bloom filters for adaptive filtering, and a policy manager allows optimized indexing and autosharding.

    This session will address how Raptor has been used in prototype systems for predictive trading, time-series analytics, smart customer-care solutions, and a generalized analytics solution that can be hosted in the cloud. (A toy illustration of per-block bloom filters follows this entry.)

    At 1:15pm to 2:05pm, Tuesday 8th November
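
    Raptor’s per-block dynamic bloom filters are not public, but the underlying idea is easy to sketch: build a small filter per data block so a query can skip blocks that definitely do not contain a key. A toy Python version (sizes and hash choices arbitrary):

      # Toy per-block Bloom filter for adaptive filtering (illustrative).
      import hashlib

      class BloomFilter(object):
          def __init__(self, size_bits=1024, num_hashes=3):
              self.size = size_bits
              self.num_hashes = num_hashes
              self.bits = 0

          def _positions(self, key):
              for i in range(self.num_hashes):
                  digest = hashlib.md5(("%d:%s" % (i, key)).encode()).hexdigest()
                  yield int(digest, 16) % self.size

          def add(self, key):
              for pos in self._positions(key):
                  self.bits |= 1 << pos

          def might_contain(self, key):
              return all(self.bits & (1 << pos) for pos in self._positions(key))

      # One filter per block: consult filters first, scan only surviving blocks.
      blocks = [["alice", "bob"], ["carol", "dave"]]
      filters = []
      for block in blocks:
          f = BloomFilter()
          for key in block:
              f.add(key)
          filters.append(f)

      print([i for i, f in enumerate(filters) if f.might_contain("carol")])
      # -> [1]: block 0 is skipped; Bloom filters never give false negatives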

  • Security Considerations for Hadoop Deployments

    by Richard Clayton and Jeremy Glesner

    Security in a distributed environment is a growing concern for most industries. Few face security challenges like the Defense Community, who must balance complex security constraints with timeliness and accuracy. We propose to briefly discuss the security paradigms defined in DCID 6/3 by NSA for secure storage and access of data (the “Protection Level” system). In addition, we will describe the implications of each level on the Hadoop architecture and various patterns organizations can implement to meet these requirements within the Hadoop ecosystem. We conclude with our “wish list” of features essential to meet the federal security requirements.

    At 1:15pm to 2:05pm, Tuesday 8th November

  • Unlocking the Value of Big Data with Oracle

    by Jean-Pierre Dijcks

    Analyzing new and diverse digital data streams can reveal new sources of economic value, provide fresh insights into customer behavior and identify market trends early on. But this influx of new data can create challenges for IT departments. To derive real business value from Big Data, you need the right tools to capture and organize a wide variety of data types from different sources, and to be able to easily analyze it within the context of all your enterprise data. Attend this session to learn how Oracle’s end-to-end value chain for Big Data can help you unlock the value of Big Data.

    At 1:15pm to 2:05pm, Tuesday 8th November

  • Building a Model of Organic Link Traffic

    by Brian David Eoff

    At bitly we study behaviour on the internet by capturing clicks on shortened URLs. This link traffic comes in many forms, yet when studying human behaviour we are only interested in ‘organic’ traffic: the patterns caused by actual humans clicking on links shared on the social web. To extract these patterns, we employ Python/NumPy, Hadoop Streaming and some machine learning to create a model of organic traffic patterns based on bitly’s click logs. This model lets us separate the traffic we’re interested in from the variety of patterns generated by inorganic entities following bitly links. (A simplified sketch of the classification idea follows this entry.)

    At 2:15pm to 3:05pm, Tuesday 8th November
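
    The abstract names Python/NumPy and Hadoop Streaming; as a heavily simplified sketch of the classification step (features and weights invented, not bitly’s actual model), a logistic model over per-link traffic features might look like this:

      # Toy organic-vs-inorganic click classifier (illustrative only).
      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      # Per-link features: [burstiness, referrer entropy, unique-agent ratio]
      X = np.array([
          [0.9, 0.2, 0.1],   # spiky, homogeneous traffic: likely a bot
          [0.3, 0.8, 0.7],   # smooth, diverse traffic: likely human
      ])
      w = np.array([-2.5, 1.8, 2.0])   # pretend these were learned offline
      b = 0.1

      print(sigmoid(X.dot(w) + b))  # probability each link's traffic is organic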

  • Hadoop Network and Compute Architecture Considerations

    by Jacob Rapp

    Hadoop is a popular framework for web 2.0 and enterprise businesses that are challenged to store, process and analyze large amounts of data as part of their business requirements. Hadoop’s framework brings a new set of challenges related to the compute infrastructure and the underlying network architecture. This session reviews the state of Hadoop enterprise environments, discusses fundamental and advanced Hadoop concepts, and reviews benchmarking analysis and projections for big data growth as they relate to data center and cluster designs. The session also discusses network architecture tradeoffs and the advantages of close integration between compute and networking.

    At 2:15pm to 3:05pm, Tuesday 8th November

  • HDFS Name Node High Availability

    by Suresh Srinivas and Aaron T. Myers

    HDFS HA has been a highly sought-after feature for years. Through collaboration between Cloudera, Facebook, Yahoo!, and others, a high-availability system for the HDFS NameNode is actively being developed. This talk will discuss the architecture and setup of this system.

    At 2:15pm to 3:05pm, Tuesday 8th November

  • Life in Hadoop Ops - Tales From the Trenches

    by Gregory Baker, Eric Sammer and Karthik Ranganathan

    This session will be a panel discussion with experienced Hadoop operations practitioners from several different organizations. We’ll discuss the role, its challenges, and how both will change in the coming years.

    At 2:15pm to 3:05pm, Tuesday 8th November

  • The State of Big Data Adoption in the Enterprise

    by Tony Baer

    As Big Data has captured attention as one of “the next big things” in enterprise IT, most of the spotlight has focused on early adopters. But what is the state of Big Data adoption across the enterprise mainstream? Ovum recently surveyed 150 global organizations in a variety of vertical industries that have revenue of $500 million or more and manage large enterprise data warehouses. In this session we will share the findings from that research, revealing the similarities in awareness, readiness, and business drivers across these industries.

    At 2:15pm to 3:05pm, Tuesday 8th November

  • Data Mining in Hadoop, Making Sense Of It in Mahout!

    by Michael Cutler

    Much of Hadoop adoption thus far has been for use cases such as processing log files, text mining, and storing masses of file data — all very necessary, but largely not exciting. In this presentation, Michael Cutler presents a selection of methodologies, primarily using Mahout, that will enable you to derive real insight from your data (mined in Hadoop) and build a recommendation engine based on the implicit data collected from your users. (A toy sketch of the co-occurrence idea follows this entry.)

    At 3:30pm to 4:20pm, Tuesday 8th November
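
    Mahout’s recommenders are Java, but the co-occurrence idea behind implicit-feedback recommendation is compact enough to sketch in a few lines of Python (the data is invented):

      # Toy item-based recommender: score unseen items by how often they
      # co-occur with items the user has already interacted with.
      from collections import defaultdict
      from itertools import combinations

      histories = {
          "u1": {"a", "b", "c"},
          "u2": {"a", "b"},
          "u3": {"b", "c", "d"},
      }

      cooc = defaultdict(int)
      for items in histories.values():
          for x, y in combinations(sorted(items), 2):
              cooc[(x, y)] += 1
              cooc[(y, x)] += 1

      def recommend(user, k=2):
          seen = histories[user]
          scores = defaultdict(int)
          for (x, y), n in cooc.items():
              if x in seen and y not in seen:
                  scores[y] += n
          return sorted(scores, key=scores.get, reverse=True)[:k]

      print(recommend("u2"))  # -> ['c', 'd']: 'c' co-occurs most with a and b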

  • Hadoop and Graph Data Management: Challenges and Opportunities

    by Daniel Abadi

    As Hadoop rapidly becomes the universal standard for scalable data analysis and processing, it is increasingly important to understand its strengths and weaknesses in particular application scenarios in order to avoid inefficiency pitfalls. For example, Hadoop has great potential to perform scalable graph analysis if it is used correctly: recent benchmarking has shown that simple implementations can be 1300 times less efficient than an optimized Hadoop-centered implementation. In this talk, Daniel Abadi will give an overview of a recent research project at Yale University that investigates how to perform sub-graph pattern matching within a Hadoop-centered system that is three orders of magnitude faster than a simpler approach. He will highlight how the cleaning, transforming, and parallel-processing strengths of Hadoop are combined with storage optimized for graph data analysis, and will then discuss further changes needed in the core Hadoop framework to take performance to the next level.

    At 3:30pm to 4:20pm, Tuesday 8th November

  • Integrating Hadoop with Enterprise RDBMS Using Apache SQOOP and Other Tools

    by Guy Harrison and Arvind Prabhakar

    As Hadoop graduates from pilot project to mission-critical component of the enterprise IT infrastructure, integrating information held in Hadoop with that held in enterprise RDBMSs becomes imperative.

    We’ll look at key scenarios driving Hadoop and RDBMS integration and review the technical options. In particular, we’ll take a deep dive into the Apache Sqoop project, which expedites data movement between Hadoop and any JDBC database, and also provides a framework that allows developers and vendors to create connectors optimized for specific targets such as Oracle, Netezza, etc. (An illustrative import command follows this entry.)

    At 3:30pm to 4:20pm, Tuesday 8th November
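
    As a typical example of the kind of import Sqoop expedites (all connection details below are placeholders), a table can be pulled from MySQL into HDFS with parallel map tasks:

      # Illustrative Sqoop import; -P prompts for the database password.
      sqoop import \
        --connect jdbc:mysql://db.example.com/sales \
        --username analyst -P \
        --table orders \
        --target-dir /data/sales/orders \
        --num-mappers 4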

  • The Hadoop Award for Government Excellence

    by Bob Gourley

    Federal, state and local governments and the development community surrounding them are busy creating solutions that leverage Apache Hadoop. This session will highlight the top five solutions, as picked by an all-star panel of judges. Who will take home the coveted Government Big Data Solutions Award for 2011? The presentation will also highlight key Big Data mission needs in the federal space and provide other insights that can fuel solutions in the sector.

    At 3:30pm to 4:20pm, Tuesday 8th November

  • WibiData: Building Personalized Applications with HBase

    by Garrett Wu and Aaron Kimball

    WibiData is a collaborative data mining and predictive modeling platform for large-scale, multi-structured, user-centric data. It leverages HBase to combine batch analysis and real time access within the same system, and integrates with existing BI, reporting and analysis tools. WibiData offers a set of libraries for common user-centric analytic tasks, and more advanced data mining libraries for personalization, recommendation, and other predictive modeling applications. Developers can write re-usable libraries that are also accessible to data scientists and analysts alongside the WibiData libraries. In this talk, we will provide a technical overview of WibiData, and show how we used it to build FoneDoktor, a mobile app that collects data about device performance and app resource usage to offer personalized battery and performance improvement recommendations directly to users.

    At 3:30pm to 4:20pm, Tuesday 8th November

  • Data Mining for Product Search Ranking

    by Aaron Beppu

    How can you rank product search results when you have very little data about how past shoppers have interacted with the products? Through large-scale analysis of its clickstream data, Etsy is automatically discovering product attributes (things like materials, prices, or text features) that signal a search result is particularly relevant (or irrelevant) to a given query. This attribute-level approach makes it possible to rank products appropriately in search results, even if those products are brand new and one-of-a-kind. This presentation discusses Etsy’s efforts to predict relevance in product search, with Hadoop as a central component. (A toy illustration of attribute-level relevance follows this entry.)

    At 4:30pm to 5:20pm, Tuesday 8th November
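
    As a toy illustration of attribute-level relevance (not Etsy’s actual method), clickstream counts can be aggregated into a click-through rate per query-attribute pair:

      # Toy attribute-level relevance from a clickstream (data invented).
      from collections import defaultdict

      # (query, product attribute, clicked?) tuples from search logs
      events = [
          ("ring", "material:silver", 1),
          ("ring", "material:silver", 0),
          ("ring", "material:plastic", 0),
          ("ring", "material:plastic", 0),
      ]

      shown = defaultdict(int)
      clicked = defaultdict(int)
      for query, attr, click in events:
          shown[(query, attr)] += 1
          clicked[(query, attr)] += click

      for key in sorted(shown):
          print(key, clicked[key] / float(shown[key]))
      # ('ring', 'material:plastic') 0.0  : attribute signals irrelevance
      # ('ring', 'material:silver') 0.5   : attribute signals relevance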

  • From Big Data to Lives Saved: HBase in Healthcare

    by Charlie Lougheed and Doug Meil

    Explorys, founded in 2009 in partnership with the Cleveland Clinic, operates one of the largest clinical repositories in the United States, with 10 million lives under contract.

    HBase and Hadoop are at the center of Explorys. The Explorys healthcare platform is based upon a massively parallel computing model that enables subscribers to search and analyze patient populations, treatment protocols, and clinical outcomes. Already spanning billions of anonymized clinical records, Explorys provides uniquely powerful, HIPAA-compliant solutions for accelerating life-saving discovery.

    At 4:30pm to 5:20pm, Tuesday 8th November

  • Hadoop and Netezza Deployment Models and Case Study

    by Krishnan Parasuraman and Greg Rokita

    Hadoop has rapidly emerged as a viable platform for Big Data analytics. Many experts believe Hadoop will subsume many of the data warehousing tasks presently done by traditional relational systems. In this session, you will learn about the similarities and differences between Hadoop and parallel data warehouses, and typical best practices. Edmunds will discuss how it increased delivery speed, reduced risk, and achieved faster reporting by combining ELT and ETL: for example, Edmunds ingests raw data into Hadoop and HBase, then reprocesses it in Netezza. You will also learn how Edmunds’ analytics team uses Netezza to prototype against nearly raw data.

    At 4:30pm to 5:20pm, Tuesday 8th November

  • I Want to Be BIG - Lessons Learned at Scale

    by David “Sunny” Sundstrom

    SGI has been a leading commercial vendor of Hadoop clusters since 2008 and, leveraging its experience with high-performance clusters at scale, has delivered individual Hadoop clusters of up to 4,000 nodes. Through the discussion of representative customer use cases, this presentation will explore major design considerations for performance and power optimization, how integrated Hadoop solutions leveraging CDH, SGI Rackable clusters, and SGI Management Center best meet customer needs, and how SGI envisions the needs of enterprise customers evolving as Hadoop continues to move into mainstream adoption.

    At 4:30pm to 5:20pm, Tuesday 8th November
