Sessions at Strata 2012 about Hadoop

Tuesday 28th February 2012

  • Big Data Without the Heavy Lifting

    by Chris Deptula and James Dixon

    The big data world is chaotic, built on technology that is still in its infancy. Learn how to tame this chaos, integrate it with your existing data environments (RDBMS, analytic databases, applications), manage the workflow, orchestrate jobs, improve productivity, and make big data technologies accessible to a much wider spectrum of developers, analysts and data scientists. Learn how you can leverage Hadoop and NoSQL stores via an intuitive, graphical big data IDE, eliminating the need for deep developer skills such as writing Hadoop MapReduce jobs, Pig scripts, or NoSQL queries.

    At 9:00am to 12:30pm, Tuesday 28th February

    In Ballroom F, Santa Clara Convention Center

  • Hadoop Data Warehousing with Hive

    by Dean Wampler and Jason Rutherglen

    In this hands-on tutorial, you’ll learn how to install and use Hive for Hadoop-based data warehousing. You’ll also learn some tricks of the trade and how to handle known issues.

    Using the Hive Tutorial Tools

    We’ll email instructions to you before the tutorial so you can come prepared with the necessary tools installed and ready to go. This prior preparation will let us use the whole tutorial time to learn Hive’s query language and other important topics. At the beginning of the tutorial we’ll show you how to use these tools.

    Writing Hive Queries

    We’ll spend most of the tutorial using a series of hands-on exercises with actual Hive queries, so you can learn by doing. We’ll go over all the main features of Hive’s query language, HiveQL, and how Hive works with data in Hadoop.
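
    To give a flavor of the HiveQL we will be writing, here is a small, hypothetical sketch that issues a query from Java over Hive’s JDBC driver. It assumes the 2012-era driver class org.apache.hadoop.hive.jdbc.HiveDriver, a Hive server listening on localhost:10000, and an invented page_views table; none of this comes from the tutorial materials themselves.

      // Hypothetical sketch: issuing HiveQL from Java via Hive's JDBC driver.
      // Assumes the pre-HiveServer2 driver class and a made-up "page_views" table.
      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.ResultSet;
      import java.sql.Statement;

      public class HiveQuerySketch {
        public static void main(String[] args) throws Exception {
          Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
          Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
          Statement stmt = con.createStatement();
          // A typical HiveQL aggregation: hits per URL for one day's partition.
          ResultSet rs = stmt.executeQuery(
              "SELECT url, COUNT(*) AS hits FROM page_views " +
              "WHERE ds = '2012-02-28' GROUP BY url ORDER BY hits DESC LIMIT 10");
          while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getString(2));
          }
          con.close();
        }
      }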

    Advanced Techniques

    Hive is very flexible about the formats of data files, the “schema” of records and so forth. We’ll discuss options for customizing these and other aspects of your Hive and data cluster setup. We’ll briefly examine how you can write Java user defined functions (UDFs) and other plugins that extend Hive for data formats that aren’t supported natively.
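
    To make the plugin discussion concrete, below is a minimal, hypothetical Java UDF of the simplest kind Hive supports: a class extending org.apache.hadoop.hive.ql.exec.UDF with an evaluate() method. The class and function names are invented for illustration.

      // Hypothetical sketch of a simple Hive UDF: lower-cases and trims a string.
      // Registered in Hive with, for example:
      //   ADD JAR my-udfs.jar;
      //   CREATE TEMPORARY FUNCTION normalize AS 'example.NormalizeUDF';
      package example;

      import org.apache.hadoop.hive.ql.exec.UDF;
      import org.apache.hadoop.io.Text;

      public final class NormalizeUDF extends UDF {
        public Text evaluate(Text input) {
          if (input == null) {
            return null;                        // Hive passes NULLs through as null
          }
          return new Text(input.toString().trim().toLowerCase());
        }
      }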

    Hive in the Hadoop Ecosystem

    We’ll conclude with a discussion of Hive’s place in the Hadoop ecosystem, such as how it compares to other available tools. We’ll discuss installation and configuration issues that ensure the best performance and ease of use in a real production cluster. In particular, we’ll discuss how to create Hive’s separate “metadata” store in a traditional relational database, such as MySQL. We’ll offer tips on data formats and layouts that improve performance in various scenarios.
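
    As a rough illustration of the metastore setup mentioned above, the sketch below sets the standard JDO connection properties that point Hive’s metastore at a MySQL database. In practice these normally live in hive-site.xml rather than in code, and the host, database name and credentials here are placeholders.

      // Hypothetical sketch: the metastore connection properties for a MySQL-backed
      // Hive metastore, set programmatically for illustration (normally hive-site.xml).
      import org.apache.hadoop.hive.conf.HiveConf;

      public class MetastoreConfigSketch {
        public static HiveConf mysqlBackedMetastore() {
          HiveConf conf = new HiveConf();
          conf.set("javax.jdo.option.ConnectionURL",
                   "jdbc:mysql://metastore-host:3306/hive_metastore");   // placeholder host/db
          conf.set("javax.jdo.option.ConnectionDriverName", "com.mysql.jdbc.Driver");
          conf.set("javax.jdo.option.ConnectionUserName", "hive");        // placeholder credentials
          conf.set("javax.jdo.option.ConnectionPassword", "hive_password");
          return conf;
        }
      }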

    At 9:00am to 12:30pm, Tuesday 28th February

  • Introduction to Apache Hadoop

    by Sarah Sproehnle

    This tutorial provides a solid foundation for those seeking to understand large scale data processing with MapReduce and Hadoop, plus its associated ecosystem. This session is intended for those who are new to Hadoop and are seeking to understand where Hadoop is appropriate and how it fits with existing systems.

    The agenda will include:

    • The rationale for Hadoop
    • Understanding the Hadoop Distributed File System (HDFS) and MapReduce
    • Common Hadoop use cases including recommendation engines, ETL, time-series analysis and more
    • How Hadoop integrates with other systems like Relational Databases and Data Warehouses
    • Overview of the other components in a typical Hadoop “stack” such as these Apache projects: Hive, Pig, HBase, Sqoop, Flume and Oozie

    At 9:00am to 12:30pm, Tuesday 28th February

    In Ballroom CD, Santa Clara Convention Center

  • Developing applications for Apache Hadoop

    by Sarah Sproehnle

    This tutorial will explain how to leverage a Hadoop cluster to do data analysis using Java MapReduce, Apache Hive and Apache Pig. It is recommended that participants have experience with some programming language. Topics include:

    • Why are Hadoop and MapReduce needed?
    • Writing a Java MapReduce program (a minimal sketch follows this list)
    • Common algorithms applied to Hadoop such as indexing, classification, joining data sets and graph processing
    • Data analysis with Hive and Pig
    • Overview of writing applications that use Apache HBase
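
    As a minimal illustration of the “Writing a Java MapReduce program” item above, here is the classic word count written against the org.apache.hadoop.mapreduce API. The input and output paths are supplied on the command line; nothing here is taken from the tutorial materials themselves.

      // Minimal word-count sketch using the Hadoop "new" MapReduce API.
      import java.io.IOException;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class WordCount {
        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();
          @Override
          protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
              if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);        // emit (word, 1) for each token
              }
            }
          }
        }

        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          @Override
          protected void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
              sum += v.get();                     // add up the 1s for this word
            }
            context.write(key, new IntWritable(sum));
          }
        }

        public static void main(String[] args) throws Exception {
          Job job = new Job(new Configuration(), "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenizerMapper.class);
          job.setCombinerClass(SumReducer.class);
          job.setReducerClass(SumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }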

    At 1:30pm to 5:00pm, Tuesday 28th February

    In GA J, Santa Clara Convention Center

  • Survival Analysis for Cache Time-to-Live Optimization

    by Rob Lancaster

    We examine the effectiveness of a statistical technique known as survival analysis for optimizing the time-to-live of entries in a hotel rate cache. We describe how we collect and prepare nearly a billion records per day using MongoDB and Hadoop. Finally, we show how this analysis is improving the operation of our hotel rate cache.
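
    For readers unfamiliar with the technique, the following is generic survival-analysis background rather than material from the talk: if T denotes the time until a cached rate goes stale, the survival function and its Kaplan-Meier estimate are

      % Generic background, not from the talk: t_i are observed staleness times,
      % d_i the number of entries going stale at t_i, n_i the number still "at risk".
      S(t) = \Pr(T > t), \qquad
      \hat{S}(t) = \prod_{t_i \le t} \left( 1 - \frac{d_i}{n_i} \right)

    which is the kind of distributional estimate a cache time-to-live could be tuned against.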

    At 3:30pm to 4:00pm, Tuesday 28th February

    In Ballroom AB, Santa Clara Convention Center

Wednesday 29th February 2012

  • The Apache Hadoop Ecosystem

    by Doug Cutting

    Apache Hadoop forms the kernel of an operating system for Big Data. This ecosystem of interdependent projects enables institutions to affordably explore ever vaster quantities of data. The platform is young, but it is strong and vibrant, built to evolve.

    At 8:50am to 9:00am, Wednesday 29th February

    In Mission City Ballroom, Santa Clara Convention Center

  • Guns, Drugs and Oil: Attacking Big Problems with Big Data

    by Mike Olson

    Tools for attacking big data problems originated at consumer internet companies, but the number and variety of big data problems have spread across industries and around the world. I’ll present a brief summary of some of the critical social and business problems that we’re attacking with the open source Apache Hadoop platform.

    At 9:20am to 9:30am, Wednesday 29th February

    In Mission City Ballroom, Santa Clara Convention Center

  • RHadoop, R meets Hadoop

    by Antonio Piccolboni

    RHadoop is an open source project spearheaded by Revolution Analytics to give data scientists access to Hadoop’s scalability from their favorite language, R. RHadoop comprises three packages:

    • rhdfs provides file-level manipulation for HDFS, the Hadoop file system
    • rhbase provides access to HBase, the Hadoop database
    • rmr allows writing MapReduce programs in R

    rmr allows R developers to program in the MapReduce framework and gives all developers an alternative way to implement MapReduce programs, one that strikes a delicate compromise between power and usability. It lets you write general MapReduce programs while offering the full power and ecosystem of an existing, established programming language. It doesn’t force you to replace the R interpreter with a special run-time: it is just a library. You can write logistic regression in half a page and even understand it. It feels and behaves almost like the usual R iteration and aggregation primitives. It comprises a handful of functions with a modest number of arguments and sensible defaults that combine in many useful ways. But there is no way to prove that an API works: one can only show examples of what it makes possible, and we will do that, covering a few from machine learning and statistics. Finally, we will discuss how to get involved.

    At 10:40am to 11:20am, Wednesday 29th February

    In GA J, Santa Clara Convention Center

  • The Future of Hadoop: Becoming an Enterprise Standard

    by Eric Baldeschwieler

    During the last 12 months, Apache Hadoop has received an enormous amount of attention for its ability to transform the way organizations capitalize on their data in a cost-effective manner. The technology has evolved to a point where organizations of all sizes and industries are testing its power as a potential solution to their own data management challenges.

    However, there are still technology and knowledge gaps hindering the adoption of Apache Hadoop as an enterprise standard. Among these gaps are the complexity of the system, the scarcity of technical content to assist with its usage, and the intensive developer and data scientist skills required to use it properly. With virtually every Fortune 500 company constructing its Hadoop strategy today, many in the IT community are wondering what the future of Hadoop will look like.

    In this session, Hortonworks CEO Eric Baldeschwieler will look at the current state of Apache Hadoop, examine how the ecosystem is working together to close the existing technology and knowledge gaps, and present a roadmap for the future of the project.

    At 10:40am to 11:20am, Wednesday 29th February

    In Ballroom CD, Santa Clara Convention Center

  • Hadoop + JavaScript: what we learned

    by Asad Khan

    In this session we will discuss two key aspects of using JavaScript in the Hadoop environment. The first is how enabling JavaScript support on Hadoop lets us reach a much broader set of developers. The JavaScript fluent API, which works on top of other languages like Pig Latin, lets developers define MapReduce jobs in a style that is much more natural, even to those who are unfamiliar with the Hadoop environment.

    The second is how to enable simple experiences directly through an HTML5-based interface. The lightweight web interface gives developers the same experience they would get on the server, and provides a zero-installation experience across all client platforms. It also allowed us to use the browsers’ HTML5 support to provide basic data visualization for quick data analysis and charting.

    During the session we will also share how we used other open source projects like Rhino to enable JavaScript on top of Hadoop.

    At 2:20pm to 3:00pm, Wednesday 29th February

    In Ballroom CD, Santa Clara Convention Center

  • Getting the Most from Your Hadoop Big Data Cluster

    by Rohit Valia

    The Hadoop framework is an established solution for big data management and analysis. In practice, Hadoop applications vary significantly, and your data center infrastructure is shared by multiple lines of business and widely differing workloads.

    This session looks at the requirements for a multi-tenant big data cluster: one where different lines of business, different projects, and multiple applications can be run with assured SLAs, resulting in higher utilization and ROI for these clusters.

    This session is sponsored by Platform Computing

    At 4:00pm to 4:40pm, Wednesday 29th February

    In Ballroom G, Santa Clara Convention Center

  • Hadoop Plugin for MongoDB: The Elephant in the Room

    by Steve Francia

    Learn how to integrate MongoDB with Hadoop for large-scale distributed data processing. Using tools like MapReduce, Pig and Streaming, you will learn how to do analytics and ETL on large datasets, with the ability to load and save data against MongoDB. With Hadoop MapReduce, Java and Scala programmers will find a native solution for using MapReduce to process their data with MongoDB. Programmers of all kinds will find a new way to work with ETL, using Pig to extract and analyze large datasets and persist the results to MongoDB. Python and Ruby programmers can rejoice as well in a new way to write native Mongo MapReduce using the Hadoop Streaming interfaces.
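
    As one hedged illustration of the MapReduce path described above, the sketch below reads a MongoDB collection as job input and counts documents per value of an invented “status” field, writing plain text to HDFS. It assumes the mongo-hadoop connector’s MongoInputFormat and MongoConfigUtil classes; the connection URI, collection and field names are placeholders rather than anything from the session.

      // Hypothetical sketch: MongoDB collection as MapReduce input via the
      // mongo-hadoop connector; counts documents per "status" (invented field).
      import java.io.IOException;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.bson.BSONObject;
      import com.mongodb.hadoop.MongoInputFormat;
      import com.mongodb.hadoop.util.MongoConfigUtil;

      public class MongoStatusCount {
        public static class StatusMapper extends Mapper<Object, BSONObject, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          @Override
          protected void map(Object id, BSONObject doc, Context ctx)
              throws IOException, InterruptedException {
            Object status = doc.get("status");           // "status" is an invented field name
            if (status != null) {
              ctx.write(new Text(status.toString()), ONE);
            }
          }
        }

        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          @Override
          protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
              throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
              sum += v.get();
            }
            ctx.write(key, new IntWritable(sum));
          }
        }

        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          MongoConfigUtil.setInputURI(conf, "mongodb://localhost:27017/demo.events"); // placeholder URI
          Job job = new Job(conf, "mongo status count");
          job.setJarByClass(MongoStatusCount.class);
          job.setInputFormatClass(MongoInputFormat.class);
          job.setMapperClass(StatusMapper.class);
          job.setReducerClass(SumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileOutputFormat.setOutputPath(job, new Path(args[0]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }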

    At 4:00pm to 4:40pm, Wednesday 29th February

    In GA J, Santa Clara Convention Center

  • Analyzing Hadoop Source Code with Hadoop

    by Stefan Groschupf

    Using Hadoop-based business intelligence analytics, this session examines the Hadoop source code and its development over time, surfacing some interesting and fun facts to share with the audience. The talk illustrates text and related analytics, run with Hadoop on Hadoop itself, to reveal the true hidden secrets of the elephant.

    This entertaining session highlights the value of data correlation across multiple datasets and the visualization of those correlations to reveal hidden data relationships.

    At 4:50pm to 5:30pm, Wednesday 29th February

    In GA J, Santa Clara Convention Center

Thursday 1st March 2012

  • Hadoop Analytics in Financial Services

    by Stefan Groschupf

    Once social media and web companies discovered Hadoop as the good-enough solution for any data analytics problem that did not fit into MySQL, Hadoop began a rapid rise in the financial industry. The reasons the financial industry is adopting Hadoop so quickly are very different from those in other industries. Banks are typically not engineering-driven organizations, and practices like agile development, shared root keys, or crontab scheduling are no-gos in a bank, yet standard around Hadoop.

    This entertaining talk, aimed at bankers and other financial services managers with technical experience, as well as engineers, discusses four business intelligence platform deployments on Hadoop:

    1. Long-term storage and analytics of transactions, and the huge cost savings Hadoop can provide;

    2. Identifying cross-sell and up-sell opportunities by analyzing web log files in combination with customer profiles;

    3. Value-at-risk analytics; and

    4. Understanding the SLA issues and identifying problems in a service-oriented architecture spanning thousands of nodes.

    This session discusses the different use cases and the challenges to overcome in building and using BI on Hadoop.

    At 11:30am to 12:10pm, Thursday 1st March

    In Ballroom CD, Santa Clara Convention Center

  • Using Map/Reduce to Speed Analysis of Video Surveillance

    by JP Morgenthal

    In video surveillance, hundreds of hours of video recordings are culled from multiple cameras. Within this video are hours of recordings that do not change from one minute to the next, one hour to the next and, in some cases, one day to the next. Identifying information in this video that is interesting, and that can be shared, analyzed and viewed by a larger community, is a time-consuming task that often requires human intervention assisted by digital processing tools.

    Using Map/Reduce, we can harness parallel processing and clusters of graphics processors to identify and tag useful periods of time for faster analysis. The result is an aggregate video file containing metadata tags that link back to the start of those scenes in the original file; in essence, an index into hundreds of thousands of hours of recording that can be reviewed, shared and analyzed by a much larger group of individuals.

    This session will review examples of where this is being done in the real world and discuss how to develop a Hadoop workflow that breaks a video down into scenes, which are analyzed by map tasks to determine interest and then reduced into a single index file containing 30 seconds of recording around each scene. Moreover, the file will contain the necessary metadata to jump back to the start point in the original and allow the viewer to watch the scene in the context of the entire recording.
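
    As a purely structural sketch of the map/reduce split described above (the real scene detection and GPU work are far more involved), mappers might score fixed-length segments and reducers assemble a per-camera index. Every type, threshold and record layout below is invented; a real job would also need a driver and an input format such as KeyValueTextInputFormat.

      // Illustrative structure only: mappers keep segments scored as "interesting",
      // reducers collect their offsets into one index record per camera.
      import java.io.IOException;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;

      public class SurveillanceIndexSketch {
        public static class SegmentMapper extends Mapper<Text, Text, Text, Text> {
          // Key: cameraId, value: "segmentStartSeconds,featureScore" produced upstream.
          @Override
          protected void map(Text cameraId, Text segment, Context ctx)
              throws IOException, InterruptedException {
            String[] parts = segment.toString().split(",");
            double score = Double.parseDouble(parts[1]);
            if (score > 0.8) {                          // invented "interesting" threshold
              ctx.write(cameraId, new Text(parts[0]));  // keep the segment's start offset
            }
          }
        }

        public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
          @Override
          protected void reduce(Text cameraId, Iterable<Text> offsets, Context ctx)
              throws IOException, InterruptedException {
            StringBuilder index = new StringBuilder();
            for (Text offset : offsets) {
              index.append(offset).append(';');         // tags linking back into the source video
            }
            ctx.write(cameraId, new Text(index.toString()));
          }
        }
      }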

    At 1:30pm to 2:10pm, Thursday 1st March

    In Ballroom CD, Santa Clara Convention Center

  • Petabyte Scale, Automated Support for Remote Devices

    by Kumar Palaniappan and Ron Bodkin

    NetApp is a fast-growing provider of storage technology. Its devices “phone home” regularly, sending unstructured auto-support log and configuration data back to centralized data centers. This data is used to provide timely support, to improve sales, and to plan product improvements. To allow this, the data is collected, organized, and analyzed. The system currently ingests 5 TB of compressed data per week, a volume growing 40% per year. NetApp was previously storing flat files on disk volumes and keeping summary data in relational databases. Now NetApp is working with Think Big Analytics, deploying Hadoop, HBase and related technologies to ingest, organize, transform and present auto-support data. This will enable business users to make decisions and respond in a timely fashion, and will enable automated responses based on predictive models. Key requirements include:

    • Query data in seconds within 5 minutes of event occurrence.
    • Execute complex ad hoc queries to investigate issues and plan accordingly.
    • Build models to predict support issues and capacity limits to take action before issues arise.
    • Build models for cross-sale opportunities.
    • Expose data to applications through REST interfaces.

    In this session we look at the lessons learned while designing and implementing a system to:

    • Collect 1000 messages of 20MB compressed per minute.
    • Store 2 PB of incoming support events by 2015.
    • Provide low-latency access to support information and configuration changes in HBase at scale, within 5 minutes of event arrival (see the sketch after this list).
    • Support complex ad hoc queries that join multiple data sets, accessing diverse structured and unstructured large-scale data.
    • Operate efficiently at scale.
    • Integrate with a data warehouse in Oracle.
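
    As a rough illustration of the low-latency HBase access called out in the list above, the sketch below writes an event row keyed by device id plus a reversed timestamp (so a device’s newest events sort first) and reads it back with a point Get. The table name, column family and row-key scheme are invented, not NetApp’s actual schema.

      // Hypothetical sketch of low-latency HBase access with the 2012-era client API.
      // Table name, column family and row-key scheme are invented for illustration.
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.client.Get;
      import org.apache.hadoop.hbase.client.HTable;
      import org.apache.hadoop.hbase.client.Put;
      import org.apache.hadoop.hbase.client.Result;
      import org.apache.hadoop.hbase.util.Bytes;

      public class AutoSupportEventSketch {
        public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HTable table = new HTable(conf, "autosupport_events");

          // Row key: deviceId plus reversed timestamp so the newest events sort first.
          long ts = System.currentTimeMillis();
          byte[] rowKey = Bytes.toBytes("device-42#" + (Long.MAX_VALUE - ts));

          Put put = new Put(rowKey);
          put.add(Bytes.toBytes("e"), Bytes.toBytes("payload"), Bytes.toBytes("{\"event\":\"...\"}"));
          table.put(put);                               // ingest path: one write per incoming event

          Result latest = table.get(new Get(rowKey));   // serving path: millisecond-scale point read
          System.out.println(Bytes.toString(
              latest.getValue(Bytes.toBytes("e"), Bytes.toBytes("payload"))));
          table.close();
        }
      }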

    At 4:00pm to 4:40pm, Thursday 1st March

    In Ballroom CD, Santa Clara Convention Center