Sessions at Strata 2012 in Ballroom CD

Tuesday 28th February 2012

  • Introduction to Apache Hadoop

    by Sarah Sproehnle

    This tutorial provides a solid foundation for those seeking to understand large-scale data processing with MapReduce and Hadoop, plus its associated ecosystem. This session is intended for those who are new to Hadoop and are seeking to understand where Hadoop is appropriate and how it fits with existing systems.

    The agenda will include:

    • The rationale for Hadoop
    • Understanding the Hadoop Distributed File System (HDFS) and MapReduce (see the sketch after this agenda)
    • Common Hadoop use cases including recommendation engines, ETL, time-series analysis and more
    • How Hadoop integrates with other systems like Relational Databases and Data Warehouses
    • Overview of the other components in a typical Hadoop “stack” such as these Apache projects: Hive, Pig, HBase, Sqoop, Flume and Oozie
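
    To give a concrete feel for the MapReduce item on this agenda, here is a minimal word-count pair written in Python for Hadoop Streaming. It is an illustrative sketch, not part of the course material; Streaming only requires reading lines on stdin and writing tab-separated key/value pairs on stdout, so the same file can be tested locally with shell pipes.

    ```python
    #!/usr/bin/env python
    # Minimal word count for Hadoop Streaming (illustrative sketch, not course material).
    # Local test: cat input.txt | python wordcount.py map | sort | python wordcount.py reduce
    import sys

    def mapper():
        """Emit (word, 1) for every word of every input line."""
        for line in sys.stdin:
            for word in line.split():
                print(f"{word.lower()}\t1")

    def reducer():
        """Sum counts per word; the shuffle delivers keys already sorted."""
        current, total = None, 0
        for line in sys.stdin:
            word, count = line.rstrip("\n").split("\t")
            if word != current and current is not None:
                print(f"{current}\t{total}")
                total = 0
            current = word
            total += int(count)
        if current is not None:
            print(f"{current}\t{total}")

    if __name__ == "__main__":
        role = sys.argv[1] if len(sys.argv) > 1 else "map"
        mapper() if role == "map" else reducer()
    ```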

    At 9:00am to 12:30pm, Tuesday 28th February

    In Ballroom CD, Santa Clara Convention Center

  • The Two Most Important Algorithms in Predictive Modeling Today

    by Mike Bowles and Jeremy Howard

    When doing predictive modelling, there are two situations in which you might find yourself:

    1. You need to fit a well-defined parameterised model to your data, so you require a learning algorithm which can find those parameters on a large data set without over-fitting.
    2. You just need a “black box” which can predict your dependent variable as accurately as possible, so you need a learning algorithm which can automatically identify the structure, interactions, and relationships in the data.

    For case (1), lasso and elastic-net regularized generalized linear models are a set of modern algorithms which meet all these needs. They are fast, work on huge data sets, and avoid over-fitting automatically. They are available in the “glmnet” package in R.
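
    The session itself uses R’s glmnet; purely as a hedged illustration of what fitting a regularized linear model looks like in code, here is a rough Python analogue using scikit-learn’s ElasticNetCV (my own stand-in, not the speakers’ material), which likewise picks its regularization strength by cross-validation to guard against over-fitting:

    ```python
    # Illustrative only: scikit-learn's elastic net stands in for R's glmnet here.
    import numpy as np
    from sklearn.linear_model import ElasticNetCV

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 50))                  # toy design matrix
    y = X[:, 0] * 3.0 - X[:, 1] * 2.0 + rng.normal(scale=0.5, size=1000)

    # l1_ratio sweeps between ridge-like (near 0) and lasso (1); the strength
    # alpha is chosen by cross-validation, which is what keeps the model from
    # over-fitting even with many candidate features.
    model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5)
    model.fit(X, y)

    print("chosen alpha:", model.alpha_)
    print("nonzero coefficients:", np.sum(model.coef_ != 0))
    ```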

    For case (2), ensembles of decision trees (often known as “Random Forests”) have been the most successful general-purpose algorithm in modern times. For instance, most Kaggle competitions have at least one top entry that heavily uses this approach. This algorithm is very simple to understand, and is fast and easy to apply. It is available in the “randomForest” package in R.
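
    Again as a hedged sketch rather than the speakers’ material, the same “black box” idea in Python, with scikit-learn’s RandomForestRegressor standing in for R’s randomForest package:

    ```python
    # Illustrative only: scikit-learn's random forest stands in for R's randomForest.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 10))
    # A nonlinear target with an interaction the forest must discover on its own.
    y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=2000)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    forest = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=0)
    forest.fit(X_train, y_train)

    print("R^2 on held-out data:", forest.score(X_test, y_test))
    print("feature importances:", np.round(forest.feature_importances_, 3))
    ```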

    Mike and Jeremy will explain in simple terms, using no complex math, how these algorithms work, and will show through numerous examples how to apply them in R. They will also provide advice on how to choose between these algorithms, how to prepare the data, and how to use the trained models in practice.

    At 1:30pm to 5:00pm, Tuesday 28th February

    In Ballroom CD, Santa Clara Convention Center

Wednesday 29th February 2012

  • The Future of Hadoop: Becoming an Enterprise Standard

    by Eric Baldeschwieler

    During the last 12 months, Apache Hadoop has received an enormous amount of attention for its ability to transform the way organizations capitalize on their data in a cost-effective manner. The technology has evolved to a point where organizations of all sizes and industries are testing its power as a potential solution to their own data management challenges.

    However, there are still technology and knowledge gaps hindering adoption of Apache Hadoop as an enterprise standard. Among these gaps are the complexity of the system, the lack of technical content to assist with its usage, and the intensive developer and data scientist skills it requires to be used properly. With virtually every Fortune 500 company constructing its Hadoop strategy today, many in the IT community are wondering what the future of Hadoop will look like.

    In this session, Hortonworks CEO Eric Baldeschwieler will look at the current state of Apache Hadoop, examine how the ecosystem is evolving by working together to close the existing technological and knowledge gaps, and present a roadmap for the future of the project.

    At 10:40am to 11:20am, Wednesday 29th February

    In Ballroom CD, Santa Clara Convention Center

  • I Didn't Know You Could Do All that with Hadoop

    by Jack Norris

    Hadoop is gaining momentum, with most companies having already deployed it in some fashion or testing it in the lab. But many aspects of Hadoop are not fully understood and appreciated, including: how Hadoop can easily be leveraged by non-programmers, how to use Hadoop to quickly outperform complex models, how to easily integrate Hadoop into existing environments, and the two-step process for using legacy applications with Hadoop.

    During the session, Ted Dunning will show that, counterintuitive as it may seem, as data size increases simple algorithms perform better than complex models do on small data. This can greatly simplify the development and deployment of Hadoop applications, and the talk will include several examples of machine learning deployments across multiple industries.

    This session will also cover recent developments that make Hadoop accessible to rank-and-file users, extending access beyond programmers so that standard applications can be used to view and manipulate data. This session will provide detailed descriptions of the following:

    1. Getting data into and out of the Hadoop cluster as quickly as possible
    2. Allowing real-time components to easily access cluster data
    3. Using well-known and understood standard tools to access cluster data
    4. Making Hadoop easier to use and operate
    5. Leveraging existing code in map-reduce settings
    6. Integrating map-reduce systems into existing analytic systems

    At 11:30am to 12:10pm, Wednesday 29th February

    In Ballroom CD, Santa Clara Convention Center

  • Collaborative Filtering using MapReduce

    by Sam Shah

    Collaborative filtering is a method of making predictions about a user’s interests based on the preferences of many other users. It’s used to make recommendations on many Internet sites, including LinkedIn. For instance, there’s a “Viewers of this profile also viewed” module on a user’s profile that shows other covisited pages. This “wisdom of the crowd” recommendation platform, built atop Hadoop, exists across many entities on LinkedIn, including jobs, companies, etc., and is a significant driver of engagement.

    During this talk, I will build a complete, scalable item-to-item collaborative filtering MapReduce flow in front of the audience. We’ll then get into some performance optimizations, model improvements, and practical considerations: a few simple tweaks can result in an order of magnitude performance improvement and a substantial increase in clickthroughs from the naive approach. This simple covisitation method gets us more than 80% of the way to the more sophisticated algorithms we have tried.
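
    The actual flow is built live during the talk; as a toy, single-process sketch of the covisitation idea it is based on (my own illustration, not LinkedIn’s code), the two MapReduce stages look roughly like this:

    ```python
    # Toy item-to-item collaborative filtering via covisitation counting.
    from collections import defaultdict
    from itertools import combinations

    # Stand-in for page-view logs: (viewer, item_viewed) pairs.
    views = [
        ("u1", "A"), ("u1", "B"), ("u1", "C"),
        ("u2", "A"), ("u2", "B"),
        ("u3", "B"), ("u3", "C"),
    ]

    # Stage 1 (map + reduce): group the items each user visited.
    items_by_user = defaultdict(set)
    for user, item in views:              # mapper: emit (user, item)
        items_by_user[user].add(item)     # reducer: collect items per user

    # Stage 2 (map + reduce): for every pair of items co-visited by a user,
    # emit the pair and count how many users co-visited it.
    covisits = defaultdict(int)
    for items in items_by_user.values():
        for a, b in combinations(sorted(items), 2):   # mapper: emit ((a, b), 1)
            covisits[(a, b)] += 1                     # reducer: sum the counts

    # "Viewers of this item also viewed": most co-visited partners first.
    for (a, b), count in sorted(covisits.items(), key=lambda kv: -kv[1]):
        print(f"{a} <-> {b}: co-visited by {count} viewers")
    ```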

    This is a practical talk that is accessible to all.

    At 1:30pm to 2:10pm, Wednesday 29th February

    In Ballroom CD, Santa Clara Convention Center

  • Hadoop + JavaScript: what we learned

    by Asad Khan

    In this session we will discuss two key aspects of using JavaScript in the Hadoop environment. The first is how we can reach a much broader set of developers by enabling JavaScript support on Hadoop. The JavaScript fluent API, which works on top of other languages like Pig Latin, lets developers define MapReduce jobs in a style that feels much more natural, even to those who are unfamiliar with the Hadoop environment.

    The second is how to enable simple experiences directly through an HTML5-based interface. The lightweight web interface gives developers the same experience they would get on the server, and provides a zero-installation experience across all client platforms. It also allowed us to use the browsers’ HTML5 support to provide basic data visualization for quick data analysis and charting.

    During the session we will also share how we used other open source projects like Rhino to enable JavaScript on top of Hadoop.

    At 2:20pm to 3:00pm, Wednesday 29th February

    In Ballroom CD, Santa Clara Convention Center

  • Architecting Virtualized Infrastructure for Big Data

    by Richard McDougall

    This session will teach participants how to architect big data systems that leverage virtualization and platform as a service.

    We will walk through a layered approach to building a unified analytics platform using virtualization, provisioning tools and platform as a service. We will show how virtualization can be used to simplify deployment and provisioning of Hadoop, SQL and NoSQL databases. We will describe the workload patterns of Hadoop and the infrastructure design implications. We will discuss the current and future role of PaaS to make it easy to deploy Java, SQL, R, and Python jobs against big-data sets.

    At 4:00pm to 4:40pm, Wednesday 29th February

    In Ballroom CD, Santa Clara Convention Center

  • Aggregating and serving local places data and ads at Citygrid

    by Kin Lane and Ana Martinez

    Join us for an in-depth architectural review of the latest infrastructure built by Citygrid to process and serve the local places data available via Citygrid APIs.

    We will present how Hadoop is used to process large amounts of inbound data from disparate sources and to solve the complex problem of matching for places.

    We will also discuss how Hadoop is used to generate the Solr and MongoDB indexes used for serving.

    We will describe the function of the places, content and ad APIs and SDKs, and the characteristics of their underlying data, in the context of real world use cases.

    We will focus on some of the limitations of Lucene and Solr for geographic search, and discuss some of the most recent developments we are exploring for our next generation APIs.

    Finally, we will give a preview of Citygrid’s next generation real-time event processing system, inspired by Twitter’s Rainbird and built on top of Cassandra.

    At 4:50pm to 5:30pm, Wednesday 29th February

    In Ballroom CD, Santa Clara Convention Center

Thursday 1st March 2012

  • Mining the Eventbrite Social Graph for Recommending Events

    by Vipul Sharma

    Recommendation systems have become critical for delivering relevant and personalized content to your users. Such systems not only drive revenue and generate significant user engagement for web companies but also serve as a great discovery tool for users. Facebook’s news feed, LinkedIn’s “People You May Know” and Eventbrite’s event recommendations are some great examples of recommendation systems.

    During this talk we will share the architecture and design of Eventbrite’s data platform and recommendation engine. We will describe how we mined a massive social graph of 18M users and 6B first-degree connections to provide relevant event recommendations. We will provide details of our data platform, which supports processing more than 2 TB of social graph data daily. We intend to describe how Hadoop is becoming the most important tool for data mining, and to discuss how machine learning is changing in the presence of Hadoop and big data.

    We hope to provide enough details that folks can learn from our experiences while building their data platform and recommendation systems.

    At 10:40am to 11:20am, Thursday 1st March

    In Ballroom CD, Santa Clara Convention Center

  • Hadoop Analytics in Financial Services

    by Stefan Groschupf

    Once social media and web companies discovered Hadoop as the good-enough solution for any data analytics problem that did not fit into MySQL, Hadoop began a rapid rise in the financial industry. The reasons the financial industry is adopting Hadoop so quickly are very different from those in other industries. Banks are typically not engineering-driven organizations, and practices like agile development, shared root keys or crontab scheduling are no-gos in a bank but standard around Hadoop.

    This entertaining talk, aimed at bankers and other financial services managers with technical experience as well as at engineers, discusses four business intelligence platform deployments on Hadoop:

    1. Long-term storage and analytics of transactions, and the huge cost savings Hadoop can provide;

    2. Identifying cross-sell and up-sell opportunities by analyzing web log files in combination with customer profiles;

    3. Value-at-risk analytics; and

    4. Understanding the SLA issues and identifying problems in a big, thousands-of-nodes service-oriented architecture.

    This session discusses the different use cases and the challenges to overcome in building and using BI on Hadoop.

    At 11:30am to 12:10pm, Thursday 1st March

    In Ballroom CD, Santa Clara Convention Center

  • Using Map/Reduce to Speed Analysis of Video Surveillance

    by JP Morgenthal

    In video surveillance, hundreds of hours of video recordings are culled from multiple cameras. Within this video are hours of recordings that do not change from one minute to the next, one hour to the next and, in some cases, one day to the next. Identifying information in this video that is interesting and can be shared, analyzed and viewed by a larger community is a time-consuming task that often requires human intervention assisted by digital processing tools.

    Using Map/Reduce we can harness parallel processing and clusters of graphics processors to identify and tag useful periods of time for faster analysis. The result is an aggregate video file that contains metadata tags linking back to the start of those scenes in the original file. In essence, this creates an index into hundreds of thousands of hours of recording that can be reviewed, shared and analyzed by a much larger group of individuals.

    This session will review examples where this is being done in the real world and discuss how to develop a Hadoop workflow that breaks a video down into scenes, which are analyzed by map tasks to determine interest and then reduced into a single index file containing 30 seconds of recording around each scene. Moreover, the file will contain the necessary metadata to jump back to the corresponding start point in the original and allow the viewer to watch the scene in the context of the entire recording.
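
    As a hedged, toy sketch of the kind of flow described here (my own illustration with made-up activity scores, not the presenter’s pipeline), the map and reduce roles might look like this:

    ```python
    from collections import defaultdict

    # Toy stand-in for surveillance footage: per-camera lists of
    # (timestamp_seconds, activity_score) samples. In a real pipeline the score
    # would come from frame differencing or an object detector on the GPU cluster.
    samples = {
        "cam-01": [(0, 0.01), (30, 0.02), (60, 0.71), (90, 0.65), (120, 0.02)],
        "cam-02": [(0, 0.00), (30, 0.55), (60, 0.03)],
    }

    THRESHOLD = 0.5   # assumed cutoff for an "interesting" period
    CLIP_LEN = 30     # seconds of context kept around each interesting sample

    def map_phase(camera, readings):
        """Mapper: emit (camera, clip) for every interesting period,
        keeping the offset into the original recording as metadata."""
        for ts, score in readings:
            if score >= THRESHOLD:
                yield camera, {"start": max(0, ts - CLIP_LEN // 2),
                               "end": ts + CLIP_LEN // 2,
                               "source_offset": ts}

    def reduce_phase(camera, clips):
        """Reducer: collapse a camera's clips into one index entry whose
        metadata links back to the start of each scene in the original file."""
        return {"camera": camera, "clips": sorted(clips, key=lambda c: c["start"])}

    grouped = defaultdict(list)
    for camera, readings in samples.items():
        for cam, clip in map_phase(camera, readings):    # map
            grouped[cam].append(clip)                    # shuffle/group by key
    index = [reduce_phase(cam, clips) for cam, clips in grouped.items()]  # reduce
    print(index)
    ```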

    At 1:30pm to 2:10pm, Thursday 1st March

    In Ballroom CD, Santa Clara Convention Center

  • Beyond Map/Reduce: Getting Creative with Parallel Processing

    by Ed Kohlwey

    While Map/Reduce is an excellent environment for some parallel computing tasks, there are many ways to use a cluster beyond Map/Reduce. Within the last year, YARN and NextGen Map/Reduce have been contributed to the Hadoop trunk, Mesos has been released as an open source project, and a variety of new parallel programming environments have emerged, such as Spark, Giraph, Golden Orb, Accumulo, and others.

    We will discuss the features of YARN and Mesos, and talk about obvious yet relatively unexplored uses of these cluster schedulers as simple work queues. Examples will be provided in the context of machine learning. Next, we will provide an overview of the Bulk-Synchronous-Parallel model of computation, and compare and contrast the implementations that have emerged over the last year. We will also discuss two other alternative environments: Spark, an in-memory version of Map/Reduce which features a Scala-based interpreter; and Accumulo, a BigTable-style database that implements a novel model for parallel computation and was recently released by the NSA.
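
    As a hedged toy illustration of the Bulk-Synchronous-Parallel model mentioned above (a single-process sketch in the spirit of Pregel/Giraph, not any of these projects’ actual APIs), here is the classic “propagate the maximum value” example, where each superstep is a compute phase followed by a synchronisation barrier:

    ```python
    # Toy BSP computation: every vertex tries to learn the largest value in the
    # graph. Each superstep is: read messages, update local state, send messages;
    # delivery is deferred to the next superstep, which plays the barrier role.
    graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}  # adjacency
    values = {"a": 3, "b": 6, "c": 2, "d": 1}

    # Superstep 0: every vertex announces its value to its neighbours.
    messages = {v: [] for v in graph}
    for vertex, neighbours in graph.items():
        for n in neighbours:
            messages[n].append(values[vertex])

    superstep = 1
    while any(messages.values()):                # run until all vertices go quiet
        outbox = {v: [] for v in graph}
        for vertex, inbox in messages.items():   # per-vertex compute phase
            if not inbox:
                continue                         # inactive this superstep
            best = max(inbox)
            if best > values[vertex]:            # improved: update and re-broadcast
                values[vertex] = best
                for n in graph[vertex]:
                    outbox[n].append(best)
        messages = outbox                        # synchronisation barrier
        superstep += 1

    print(values, "after", superstep, "supersteps")
    ```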

    At 2:20pm to 3:00pm, Thursday 1st March

    In Ballroom CD, Santa Clara Convention Center

  • Petabyte Scale, Automated Support for Remote Devices

    by Kumar Palaniappan and Ron Bodkin

    NetApp is a fast-growing provider of storage technology. Its devices “phone home” regularly, sending unstructured auto-support log and configuration data back to centralized data centers. This data is used to provide timely support, to improve sales, and to plan product improvements. To allow this, data is collected, organized, and analyzed. The system currently ingests 5 TB of compressed data per week, a volume that is growing 40% per year.

    NetApp was previously storing flat files on disk volumes and keeping summary data in relational databases. Now NetApp is working with Think Big Analytics, deploying Hadoop, HBase and related technologies to ingest, organize, transform and present auto-support data. This will enable business users to make decisions and provide timely responses, and will enable automated responses based on predictive models. Key requirements include:

    • Query data in seconds within 5 minutes of event occurrence.
    • Execute complex ad hoc queries to investigate issues and plan accordingly.
    • Build models to predict support issues and capacity limits to take action before issues arise.
    • Build models for cross-sale opportunities.
    • Expose data to applications through REST interfaces.

    In this session we look at the lessons learned while designing and implementing a system to:

    • Collect 1000 messages of 20MB compressed per minute.
    • Store 2 PB of incoming support events by 2015.
    • Provide low latency access to support information and configuration changes in HBase at scale within 5 minutes of event arrival (see the sketch after this list).
    • Support complex ad hoc queries that join multiple diverse structured and unstructured large-scale data sets.
    • Operate efficiently at scale.
    • Integrate with a data warehouse in Oracle.
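
    The talk does not spell out the schema or client stack, so the following is only a hypothetical sketch of the low-latency HBase access pattern listed above, using the Python happybase client and a made-up device-id + reversed-timestamp row key; the table, column family, and host names are all assumptions:

    ```python
    import time
    import happybase  # assumed client library; the talk does not name one

    # Hypothetical row-key design: device id plus a reversed timestamp, so a
    # device's most recent auto-support events sort first and can be read with
    # a short prefix scan within minutes of arrival.
    def row_key(device_id, event_ts):
        reversed_ms = 2**63 - int(event_ts * 1000)
        return f"{device_id}:{reversed_ms:020d}".encode()

    connection = happybase.Connection("hbase-thrift-host")   # assumed host name
    events = connection.table("autosupport_events")          # assumed table name

    # Ingest side: store one incoming event with its raw payload and parsed fields.
    now = time.time()
    events.put(row_key("filer-1234", now), {
        b"raw:payload": b"<compressed auto-support bundle>",
        b"meta:event_type": b"disk_failure_warning",
        b"meta:received_at": str(now).encode(),
    })

    # Serving side: the latest events for a device come back with a prefix scan.
    for key, data in events.scan(row_prefix=b"filer-1234:", limit=10):
        print(key, data[b"meta:event_type"])
    ```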

    At 4:00pm to 4:40pm, Thursday 1st March

    In Ballroom CD, Santa Clara Convention Center

  • Big Analytics Beyond the Elephants

    by Paul Brown

    Scientists dealt with big data and big analytics for at least a decade before the business world precipitated buzzwords like ‘Big Data’, ‘Data Tsunami’ and ‘the Industrial Revolution of data’ out of its strange marketing broth and came to realize it had the same problems. Both the scientific world and the commercial world share requirements for a high-performance informatics platform supporting the collection, curation, collaboration, exploration, and analysis of massive datasets.

    In this talk we will sketch the design of SciDB, explain how it differs from Hadoop-based systems, SQL DBMS products, and NoSQL platforms, and explain why that matters. We will present benchmarking data and a computational genomics use case that showcase SciDB’s massively scalable parallel analytics.

    SciDB is an emerging open source analytical database that runs on a commodity hardware grid or in the cloud. SciDB natively supports:

    • An array data model – a flexible, compact, extensible data model for rich, highly dimensional data (see the sketch after this list)

    • Massively scalable math – non-embarrassingly parallel operations, like linear algebra on matrices too large to fit in memory, as well as transparently scalable R, MatLab, and SAS-style analytics without requiring code for data distribution or parallel computation

    • Versioning and Provenance – Data is updated, but never overwritten. The raw data, the derived data, and the derivation are kept for reproducibility, what-if modeling, back-testing, and re-analysis

    • Uncertainty support – data carry error bars, probability distributions or confidence metrics that can be propagated through calculations

    • Smart storage – compact storage for both dense and sparse data that is efficient for location-based, time-series, and instrument data
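
    SciDB’s own query languages are not shown in this abstract, so as a loose conceptual analogue only (NumPy standing in for SciDB’s array engine, with assumed toy dimensions), here is the difference between a flat relational layout and the “dimensions plus attributes” array model named above:

    ```python
    # Conceptual analogue only: NumPy arrays stand in for SciDB's array model.
    import numpy as np

    n_times, n_sensors = 24, 4      # assumed toy dimensions: hour-of-day x sensor

    # Relational framing: one (time, sensor, value) row per reading.
    rows = [(t, s, float(t * n_sensors + s))
            for t in range(n_times) for s in range(n_sensors)]

    # Array framing: the dimensions become the coordinates themselves and each
    # cell holds the attribute, so coordinates are not stored per cell and
    # slicing or aggregating along a dimension needs no join.
    readings = np.empty((n_times, n_sensors))
    for t, s, value in rows:
        readings[t, s] = value

    hourly_mean = readings.mean(axis=1)   # aggregate along the sensor dimension
    window = readings[8:12, :]            # a time-window slice of the array
    print(hourly_mean.shape, window.shape)
    ```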

    At 4:50pm to 5:30pm, Thursday 1st March

    In Ballroom CD, Santa Clara Convention Center