Sessions at Strata 2012 in GA J


Tuesday 28th February 2012

  • Designing Data Visualizations Workshop

    by Noah Iliinsky

    We will discuss how to figure out what story to tell, select the right data, and pick appropriate layout and encodings. The goal is to learn how to create a visualization that conveys appropriate knowledge to a specific audience (which may include the designer).

    We’ll briefly discuss tools, including pencil and paper. No prior technology or graphic design experience is necessary. An awareness of some basic user-centered design concepts will be helpful.

    Understanding of your specific data or data types will help immensely. Please do bring data sets to play with.

    At 9:00am to 12:30pm, Tuesday 28th February

    In GA J, Santa Clara Convention Center

  • Developing applications for Apache Hadoop

    by Sarah Sproehnle

    This tutorial will explain how to leverage a Hadoop cluster to do data analysis using Java MapReduce, Apache Hive and Apache Pig. It is recommended that participants have experience with a programming language. Topics include the following (a minimal Java MapReduce sketch follows the list):

    • Why are Hadoop and MapReduce needed?
    • Writing a Java MapReduce program
    • Common algorithms applied to Hadoop such as indexing, classification, joining data sets and graph processing
    • Data analysis with Hive and Pig
    • Overview of writing applications that use Apache HBase
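
    As a rough illustration of the “Writing a Java MapReduce program” topic (a sketch of the standard org.apache.hadoop.mapreduce API, not material from the tutorial itself; paths are placeholders), a minimal word-count job looks like this:

      // Minimal word-count job against the org.apache.hadoop.mapreduce API.
      // Illustrative only; input/output paths come from the command line.
      import java.io.IOException;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class WordCount {

        // Map: emit (word, 1) for every token in a line of input.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
              if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
              }
            }
          }
        }

        // Reduce: sum the counts for each word.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          @Override
          protected void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
              sum += value.get();
            }
            context.write(key, new IntWritable(sum));
          }
        }

        public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenMapper.class);
          job.setCombinerClass(SumReducer.class);   // same sum logic works as a combiner
          job.setReducerClass(SumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }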

    At 1:30pm to 5:00pm, Tuesday 28th February

    In GA J, Santa Clara Convention Center

Wednesday 29th February 2012

  • RHadoop, R meets Hadoop

    by Antonio Piccolboni

    RHadoop is an open source project spearheaded by Revolution Analytics to give data scientists access to Hadoop’s scalability from their favorite language, R. RHadoop comprises three packages:

    • rhdfs provides file-level manipulation for HDFS, the Hadoop file system
    • rhbase provides access to HBase, the Hadoop database
    • rmr allows developers to write MapReduce programs in R

    rmr allows R developers to program in the MapReduce framework, and gives all developers an alternative way to implement MapReduce programs that strikes a delicate compromise between power and usability. It lets you write general MapReduce programs, offering the full power and ecosystem of an existing, established programming language. It doesn’t force you to replace the R interpreter with a special run-time; it is just a library. You can write logistic regression in half a page and even understand it. It feels and behaves almost like the usual R iteration and aggregation primitives. It consists of a handful of functions with a modest number of arguments and sensible defaults that combine in many useful ways. But there is no way to prove that an API works: one can only show examples of what it makes possible, and we will do that, covering a few from machine learning and statistics. Finally, we will discuss how to get involved.

    At 10:40am to 11:20am, Wednesday 29th February

    In GA J, Santa Clara Convention Center

  • Monitoring Apache Hadoop - a big data problem?

    by Henry Robinson

    It’s arguably the case that failure, in any form, is the most significant reason that distributed systems are difficult. Without the possibility of failure many traditional challenges in distributed computing, such as replication or leader election, become much simpler, much more performant, or both. But failure, of machines, processes and people, remains an unavoidable reality.

    In this talk I want to demonstrate three things. First, why failure makes distributed systems design hard. Second, why understanding the root cause of a failure or outage is vital to operators of large distributed systems. Third, why doing that root cause analysis is itself difficult – because of the problems with understanding causality in distributed systems – and how we’ve had some success at Cloudera treating it as a big-data problem.

    I’ll explain the role that failure plays in distributed systems design quickly, by showing how complex operations become trivially simple when the possibility of failure is removed.

    I’ll motivate the problem of root-cause analysis by showing how, once we know what caused an incident, bugs can be diagnosed after the fact and repeat occurrences avoided, supported by anonymised examples that we have seen at Cloudera.

    The key to understanding failures is knowing what event caused what – the causal relationship between incidents. Unfortunately, Hadoop and other systems do a poor job of sharing causal relationships, and doing so in general is fundamentally hard due to the lack of perfectly synchronised clocks.

    In lieu of knowing the causal relationships between components, we have to try and infer them from correlations that we see between disparate signals, from log files to user actions to operating system monitoring. This data is readily available, but huge! The challenge is to have this data help us in forming hypotheses about causal links, which we can then validate. This can be cast as a big data analysis problem of searching for the most likely causal relationships between millions of seemingly independent events. I’ll show two ways we can attack this problem: by visualisation and by algorithm.

    Finally I’ll show how the community can help this effort, by building tracing tools that make some causal relationships explicit, and therefore drastically cut down the amount of searching we have to do.
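
    The abstract does not spell out Cloudera’s implementation, but the flavour of “searching for likely causal relationships” can be sketched with a toy: bucket timestamped events from different sources into time windows, count which event types co-occur, and treat the strongest correlations as hypotheses to validate. A purely illustrative, in-memory sketch (not Cloudera’s approach; the event names are made up):

      // Toy sketch: count co-occurrences of event types within a fixed time window
      // to surface candidate (not proven) causal relationships. Illustrative only.
      import java.util.ArrayList;
      import java.util.HashMap;
      import java.util.List;
      import java.util.Map;

      public class CooccurrenceSketch {

        // A monitoring event: where it came from, what kind it is, and when it happened.
        record Event(String source, String type, long timestampMillis) {}

        // Count how often pairs of event types land in the same time window.
        static Map<String, Integer> cooccurrences(List<Event> events, long windowMillis) {
          Map<Long, List<Event>> byWindow = new HashMap<>();
          for (Event e : events) {
            byWindow.computeIfAbsent(e.timestampMillis() / windowMillis, w -> new ArrayList<>()).add(e);
          }
          Map<String, Integer> counts = new HashMap<>();
          for (List<Event> window : byWindow.values()) {
            for (int i = 0; i < window.size(); i++) {
              for (int j = i + 1; j < window.size(); j++) {
                String pair = window.get(i).type() + " ~ " + window.get(j).type();
                counts.merge(pair, 1, Integer::sum);
              }
            }
          }
          return counts;
        }

        public static void main(String[] args) {
          List<Event> events = List.of(
              new Event("datanode-17", "DISK_ERROR", 1_000),
              new Event("namenode", "REPLICATION_STORM", 2_500),
              new Event("jobtracker", "TASK_FAILURE", 3_000));
          // Events falling in the same 5-second window are counted as co-occurring.
          cooccurrences(events, 5_000).forEach((pair, n) -> System.out.println(pair + " -> " + n));
        }
      }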

    At 11:30am to 12:10pm, Wednesday 29th February

    In GA J, Santa Clara Convention Center

  • How Crunch Makes Writing, Testing and Running of MapReduce Pipelines Easy, Efficient and Even Fun!

    by Josh Wills

    Tools like Pig, Hive, and Cascading ease the burden of writing MapReduce pipelines by defining tuple-oriented data models and providing support for filtering, joining and aggregating those records. However, many data sets do not naturally fit the tuple model, such as images, time series, audio files and seismograms. To process data in these binary formats, developers often fall back to writing MapReduce jobs against the low-level Java APIs.

    In this session, Cloudera Data Scientist Josh Wills will share insights and “how to” tricks about Crunch, a Java library that aims to make writing, testing and running MapReduce pipelines that run over any type of data easy, efficient and even fun. Crunch’s design is modeled after Google’s FlumeJava library and focuses on a small set of simple primitive operations and lightweight user-defined functions that can be combined to create complex, multi-stage pipelines. At runtime, Crunch compiles the pipeline into a sequence of MapReduce jobs and manages their execution on the Hadoop cluster.
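
    For a feel of the programming model, here is a word-count pipeline written against the public Apache Crunch packaging of the library (an illustrative sketch, not code from the talk; paths are placeholders): a lightweight user-defined function plus a built-in aggregation, with Crunch planning the underlying MapReduce jobs.

      // Sketch of a Crunch pipeline; Crunch compiles this graph into MapReduce jobs.
      import org.apache.crunch.DoFn;
      import org.apache.crunch.Emitter;
      import org.apache.crunch.PCollection;
      import org.apache.crunch.PTable;
      import org.apache.crunch.Pipeline;
      import org.apache.crunch.impl.mr.MRPipeline;
      import org.apache.crunch.types.writable.Writables;

      public class WordCountPipeline {
        public static void main(String[] args) throws Exception {
          Pipeline pipeline = new MRPipeline(WordCountPipeline.class);

          // Read text, split into words with a lightweight user-defined function.
          PCollection<String> lines = pipeline.readTextFile(args[0]);
          PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
            @Override
            public void process(String line, Emitter<String> emitter) {
              for (String word : line.split("\\s+")) {
                emitter.emit(word);
              }
            }
          }, Writables.strings());

          // Built-in aggregation; the planner decides how many jobs are needed.
          PTable<String, Long> counts = words.count();
          pipeline.writeTextFile(counts, args[1]);
          pipeline.done();
        }
      }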

    At 2:20pm to 3:00pm, Wednesday 29th February

    In GA J, Santa Clara Convention Center

  • How to develop Big Data Pipelines for Hadoop

    by Mark Pollack

    Hadoop is not an island. To deliver a complete Big Data solution, a data pipeline needs to be developed that incorporates and orchestrates many diverse technologies.

    A Hadoop focused data pipeline not only needs to coordinate the running of multiple Hadoop jobs (MapReduce, Hive, or Pig), but also encompass real-time data acquisition and the analysis of reduced data sets extracted into relational/NoSQL databases or dedicated analytical engines.

    Using an example of real-time weblog processing, in this session we will demonstrate how the open source Spring Batch and Spring Integration projects can be used to build manageable and robust pipeline solutions around Hadoop.
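
    As a hedged sketch of this kind of wiring (plain Spring Batch plus the Hadoop client API; the Spring for Apache Hadoop project offers more direct support than this, and the class below is hypothetical), a batch step can submit a Hadoop job while Spring Batch handles sequencing, restarts and monitoring:

      // Hypothetical Tasklet that submits a Hadoop MapReduce job as one step of a
      // Spring Batch pipeline. Sketch only; Spring for Apache Hadoop provides
      // richer, purpose-built integration than this.
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.springframework.batch.core.StepContribution;
      import org.springframework.batch.core.scope.context.ChunkContext;
      import org.springframework.batch.core.step.tasklet.Tasklet;
      import org.springframework.batch.repeat.RepeatStatus;

      public class WeblogJobTasklet implements Tasklet {

        private final String inputPath;   // e.g. HDFS directory of collected weblogs
        private final String outputPath;  // e.g. HDFS directory for reduced results

        public WeblogJobTasklet(String inputPath, String outputPath) {
          this.inputPath = inputPath;
          this.outputPath = outputPath;
        }

        @Override
        public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext)
            throws Exception {
          Job job = Job.getInstance(new Configuration(), "weblog-aggregation");
          job.setJarByClass(WeblogJobTasklet.class);
          // Mapper/reducer classes would be set here; omitted in this sketch.
          FileInputFormat.addInputPath(job, new Path(inputPath));
          FileOutputFormat.setOutputPath(job, new Path(outputPath));

          if (!job.waitForCompletion(true)) {
            throw new IllegalStateException("Hadoop job failed: " + job.getJobID());
          }
          return RepeatStatus.FINISHED;  // let Spring Batch move on to the next step
        }
      }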

    At 2:20pm to 3:00pm, Wednesday 29th February

    In GA J, Santa Clara Convention Center

  • Hadoop Plugin for MongoDB: The Elephant in the Room

    by Steve Francia

    Learn how to integrate MongoDB with Hadoop for large-scale distributed data processing. Using tools like MapReduce, Pig and Streaming, you will learn how to do analytics and ETL on large datasets, with the ability to load and save data against MongoDB. With Hadoop MapReduce, Java and Scala programmers will find a native solution for using MapReduce to process their data with MongoDB. Programmers of all kinds will find a new way to work with ETL, using Pig to extract and analyze large datasets and persist the results to MongoDB. Python and Ruby programmers can rejoice as well in a new way to write native Mongo MapReduce using the Hadoop Streaming interfaces.
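
    As a sketch of the Java MapReduce route (the MongoInputFormat and MongoConfigUtil class names are my recollection of the mongo-hadoop connector and should be treated as assumptions; the URIs and field names are placeholders), a job can read documents straight out of a MongoDB collection:

      // Sketch: Hadoop MapReduce job reading documents from MongoDB via the
      // mongo-hadoop connector. Connector class names are assumptions from memory.
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.bson.BSONObject;

      import com.mongodb.hadoop.MongoInputFormat;
      import com.mongodb.hadoop.util.MongoConfigUtil;

      public class MongoPageviews {

        // Each MongoDB document arrives as a BSONObject keyed by its _id.
        public static class PageMapper extends Mapper<Object, BSONObject, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);

          @Override
          protected void map(Object id, BSONObject doc, Context context)
              throws java.io.IOException, InterruptedException {
            Object page = doc.get("page");
            if (page != null) {
              context.write(new Text(page.toString()), ONE);
            }
          }
        }

        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          MongoConfigUtil.setInputURI(conf, "mongodb://localhost/logs.events");

          Job job = Job.getInstance(conf, "mongo-pageviews");
          job.setJarByClass(MongoPageviews.class);
          job.setMapperClass(PageMapper.class);
          // A summing reducer would normally be set here; the connector also offers
          // a MongoOutputFormat for writing results back into a collection.
          job.setInputFormatClass(MongoInputFormat.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileOutputFormat.setOutputPath(job, new Path("/tmp/pageviews"));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }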

    At 4:00pm to 4:40pm, Wednesday 29th February

    In GA J, Santa Clara Convention Center

  • Analyzing Hadoop Source Code with Hadoop

    by Stefan Groschupf

    Using Hadoop-based business intelligence analytics, this session looks at the Hadoop source code and its development over time, surfacing some interesting and fun facts to share with the audience. The talk illustrates text and related analytics with Hadoop, on Hadoop, to reveal the true hidden secrets of the elephant.

    This entertaining session highlights the value of data correlation across multiple datasets and the visualization of those correlations to reveal hidden data relationships.

    At 4:50pm to 5:30pm, Wednesday 29th February

    In GA J, Santa Clara Convention Center

Thursday 1st March 2012

  • Apache Cassandra: NoSQL Applications in the Enterprise Today

    by Jonathan Ellis

    This session will shed light on real-world use cases for NoSQL databases by providing case studies from enterprise production users taking advantage of the massively scalable and highly-available architecture of Apache Cassandra.

    • Netflix – See how, with Cassandra, Netflix achieved cloud-enabled business agility, capacity and application flexibility, and never worried about running out of space or power.
    • Backupify – Cassandra enables reliable, redundant and scalable data storage, eliminating downtime and ensuring they can back up customer data around the clock.
    • Ooyala – The elastically scalable Cassandra database allows Ooyala to absorb and leverage massive amounts of digital video data by simply adding nodes, growing the cluster to hundreds or thousands as needed.
    • Constant Contact – Cassandra enables Constant Contact to massively scale an operationally simple application that was deployed in 3 months for $250k, compared to 9 months and $2.5 million if they had used a traditional RDBMS.

    At the end of this session you will have a good understanding of the types of requirements Cassandra can satisfy: a carefully thought-out architecture designed to manage all forms of modern data, one that scales to meet the demands of “big data” management, offers linear scale-out performance, and delivers the high availability that almost every online, 24×7 application needs.

    At 10:40am to 11:20am, Thursday 1st March

    In GA J, Santa Clara Convention Center

  • Storm: distributed and fault-tolerant realtime computation

    by Nathan Marz

    Storm makes it easy to write and scale complex realtime computations on a cluster of computers, doing for realtime processing what Hadoop did for batch processing. Storm guarantees that every message will be processed. And it’s fast — you can process millions of messages per second with a small cluster. Best of all, you can write Storm topologies using any programming language. Twitter relies upon Storm for much of its analytics.

    After being open-sourced, Storm instantly attracted a large community. It is by far the most watched JVM project on GitHub and the mailing list is active with over 300 users.

    Storm has a wide range of use cases, from stream processing to continuous computation to distributed RPC. In this talk I’ll introduce Storm and show how easy it is to use for realtime computation.
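
    To give a flavour of how a topology is wired together, here is a sketch against the backtype.storm API of the pre-Apache releases; SentenceSpout, SplitBolt and CountBolt are hypothetical stand-ins for real spout and bolt implementations.

      // Sketch of wiring a Storm topology; the spout and bolt classes are hypothetical.
      import backtype.storm.Config;
      import backtype.storm.LocalCluster;
      import backtype.storm.topology.TopologyBuilder;
      import backtype.storm.tuple.Fields;

      public class WordCountTopology {
        public static void main(String[] args) throws Exception {
          TopologyBuilder builder = new TopologyBuilder();

          // Source of tuples (e.g. a stream of sentences from a queue).
          builder.setSpout("sentences", new SentenceSpout(), 2);

          // Split each sentence into words; shuffle tuples evenly across tasks.
          builder.setBolt("split", new SplitBolt(), 4).shuffleGrouping("sentences");

          // Group by word so each counter task sees all tuples for a given word.
          builder.setBolt("count", new CountBolt(), 4).fieldsGrouping("split", new Fields("word"));

          Config conf = new Config();
          conf.setDebug(false);

          // Local mode for development; in production this would be StormSubmitter.submitTopology.
          LocalCluster cluster = new LocalCluster();
          cluster.submitTopology("word-count", conf, builder.createTopology());
        }
      }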

    At 11:30am to 12:10pm, Thursday 1st March

    In GA J, Santa Clara Convention Center

  • Analytics from 330 million smartphones

    by Sean Byrnes

    Flurry’s analytics and advertising platform for mobile tracks data on over 330 million devices per month. We operate a 500 node Hadoop and HBase cluster to mine and manage all this data. This talk will go over some of the lessons learned, architecture choices, and advantages of running this big data platform. Some of the covered topics include:

    • Mining data for marketing
    • Lessons learned from operating a large production Hadoop cluster
    • Fault tolerance in the Flurry architecture
    • Algorithms for estimating demographics across applications

    At 1:30pm to 2:10pm, Thursday 1st March

    In GA J, Santa Clara Convention Center

  • Connecting Millions of Mobile Devices to the Cloud

    by James Phillips

    Mobile devices are ideal data capture and presentation points. They offer boundless opportunities for data collection and the presentation of temporally- and spatially-relevant data. The most compelling mobile applications will require aggregation, analysis and transformation of data from many devices and users. But intermittent network connectivity and constrained processing, storage, bandwidth and battery resources present significant obstacles. Highlighted with real-world applications, this session will cover challenges and approaches to device data collection; device-device and device-cloud data synchronization; and cloud-based data aggregation, analysis and transformation.

    At 2:20pm to 3:00pm, Thursday 1st March

    In GA J, Santa Clara Convention Center

  • Open Source Ceph Storage– Scaling from Gigabytes to Exabytes with Intelligent Nodes

    by Sage Weil

    As the size and performance requirements of storage systems have increased, file system designers have looked to new architectures to facilitate system scalability.

    Ceph’s architecture consists of an object store, block storage and a POSIX-compliant file system. It is the most significant storage system to have been accepted into the Linux kernel, and it has both kernel and userland implementations. The CRUSH algorithm provides controlled, scalable, decentralized placement of replicated data. In addition, Ceph has a highly scalable metadata layer. Ceph offers compatibility with S3, Swift and Google Storage and is a drop-in replacement for HDFS (and other file systems).

    Ceph is unique because it is massively scalable, to the exabyte level. The storage system is self-managing and self-healing, which means limited system administrator involvement. It runs on commodity hardware, has no single point of failure, leverages intelligent storage nodes, and it is open source.

    This talk will describe the Ceph architecture and then focus on the current status and future of the project. This will include a discussion of Ceph’s integration with OpenStack, the file system, RBD clients in the Linux kernel, RBD support for virtual block devices in Qemu/KVM and libvirt, and current engineering challenges.
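
    One concrete consequence of the S3 compatibility mentioned above is that a stock S3 client can be pointed at a Ceph RADOS Gateway. A sketch using the AWS SDK for Java (endpoint, bucket and credentials are placeholders; this illustrates the compatibility claim rather than anything from the talk):

      // Sketch: using a standard S3 client against a Ceph RADOS Gateway endpoint.
      // Endpoint, bucket and credentials are placeholders; path-style addressing is
      // often needed when the gateway has no wildcard DNS.
      import com.amazonaws.auth.BasicAWSCredentials;
      import com.amazonaws.services.s3.AmazonS3Client;
      import com.amazonaws.services.s3.model.S3ObjectSummary;

      public class CephS3Example {
        public static void main(String[] args) {
          AmazonS3Client s3 = new AmazonS3Client(
              new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
          // Point the client at the Ceph object gateway instead of Amazon S3.
          s3.setEndpoint("http://radosgw.example.com");

          s3.createBucket("demo-bucket");
          for (S3ObjectSummary summary : s3.listObjects("demo-bucket").getObjectSummaries()) {
            System.out.println(summary.getKey() + " (" + summary.getSize() + " bytes)");
          }
        }
      }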

    At 4:00pm to 4:40pm, Thursday 1st March

    In GA J, Santa Clara Convention Center

  • Mapping social media networks (with no coding) using NodeXL

    by Marc Smith

    Networks are a data structure commonly found across all social media services that allow populations to author collections of connections. The Social Media Research Foundation’s (http://www.smrfoundation.org) free and open NodeXL project (http://nodexl.codeplex.com) makes analysis of social media networks accessible to most users of the Excel spreadsheet application. With NodeXL, networks become as easy to create as pie charts. Applying the tool to a range of social media networks has already revealed the variations present in online social spaces. A review of the tool and images of Twitter, flickr, YouTube, and email networks will be presented.

    We now live in a sea of tweets, posts, blogs, and updates coming from a significant fraction of the people in the connected world. Our personal and professional relationships are now made up as much of texts, emails, phone calls, photos, videos, documents, slides, and game play as of face-to-face interactions. Social media can be a bewildering stream of comments, a daunting fire hose of content. With better tools and a few key concepts from the social sciences, the social media swarm of favorites, comments, tags, likes, ratings, and links can be brought into clearer focus to reveal key people, topics and sub-communities. As more social interactions move through machine-readable data sets, new insights into and illustrations of human relationships and organizations become possible. But new forms of data require new tools to collect, analyze, and communicate insights.

    At 4:50pm to 5:30pm, Thursday 1st March

    In GA J, Santa Clara Convention Center