Sessions at Hadoop Summit 2012 about Analytics

Wednesday 13th June 2012

  • Realtime analytics with Storm and Hadoop

    by Nathan Marz

    Storm is a distributed and fault-tolerant realtime computation system, doing for realtime computation what Hadoop did for batch computation. Storm can be used together with Hadoop to make a potent realtime analytics stack. Although Hadoop is not a realtime system, it can be used to support a realtime system. Building a batch/realtime stack requires solving a lot of sub-problems:
    Getting data to both Hadoop and Storm
    Exporting views of the Hadoop data into a readable index
    Using an appropriate queuing broker to feed Storm
    Choosing an appropriate database to serve the realtime indexes updated via Storm
    Syncing the views produced independently by Hadoop and Storm when doing queries
    Come learn how we've solved these problems at Twitter to do complex analytics in realtime.

    At 10:30am to 11:10am, Wednesday 13th June
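
    The batch/realtime split described above comes down to a query-time merge of two views. A minimal sketch, assuming simple per-key counts (the view names and data are illustrative, not Twitter's actual code):

```python
# Merge a batch view (rebuilt periodically by Hadoop from the master dataset)
# with a realtime view (incremented by Storm for events not yet absorbed by
# a batch run). Queries read both layers and combine the results.

def query(key, batch_view, realtime_view):
    """Return the combined count for `key` across both layers."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

batch_view = {"url-a": 1000, "url-b": 250}   # from the last Hadoop run
realtime_view = {"url-a": 12}                # Storm's recent increments

print(query("url-a", batch_view, realtime_view))  # 1012
```

    When a new batch run absorbs the recent events, the corresponding realtime entries are discarded, which is the "syncing the views" sub-problem the abstract mentions.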

  • Searching Conversations using Hadoop: More than Just Analytics

    by Jacques Nadeau

    How YapMap built a new type of conversational search technology using Hadoop as the backbone. While adoption of Hadoop for big data analytics is becoming very common, substantially fewer companies are using Hadoop as a system of record to directly support their primary business. We’ll discuss our experiences using Hadoop in this second way. We’ll take you through our system architecture and how Hadoop ecosystem components can work side by side with traditional application technologies. We’ll cover:
    Building a distributed targeted crawler using Zookeeper as a coordinated locking service
    How we rely on HBase’s atomicity and optimistic locking to coordinate processing pipelines
    Using Map Reduce, HBase regions and bi-level sharding for truly distributed index creation
    Running diskless index servers that pull multiple reduce outputs directly from the distributed file system into memory to form a single index
    Using Mahout to aid in user exploration
    How we integrated the use of Zookeeper, HBase and MapReduce with application focused technologies including JavaEE6, CDI, SQL, Protobuf and RabbitMQ.
    Deployment decisions including switching from CDH to unsupported MapR, using InfiniBand, building a six-node research cluster for $1500, using desktop drives & white boxes, utilizing huge pages, GC hell, etc.

    At 10:30am to 11:10am, Wednesday 13th June
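
    The HBase atomicity pattern listed above (checkAndPut-style optimistic locking) can be illustrated with an in-memory stand-in; this is a hedged sketch of the compare-and-swap idea, not YapMap's actual code:

```python
import threading

class Row:
    """In-memory stand-in for an HBase row supporting checkAndPut semantics."""

    def __init__(self):
        self._cells = {}
        self._lock = threading.Lock()

    def check_and_put(self, column, expected, new_value):
        # Write only if the current value matches `expected`; this is the
        # optimistic-locking primitive pipeline stages can use to claim work
        # without a separate coordination service.
        with self._lock:
            if self._cells.get(column) == expected:
                self._cells[column] = new_value
                return True
            return False

row = Row()
print(row.check_and_put("status", None, "claimed"))  # True: first worker wins
print(row.check_and_put("status", None, "claimed"))  # False: already claimed
```

    In real HBase the check and the write happen atomically server-side, so two workers racing to claim the same pipeline item cannot both succeed.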

  • Unified Big Data Architecture: Integrating Hadoop within an Enterprise Analytical Ecosystem

    by Priyank Patel

    Trending use cases have pointed out the complementary nature of Hadoop and existing data management systems—emphasizing the importance of leveraging SQL, engineering, and operational skills, as well as incorporating novel uses of MapReduce to improve distributed analytic processing. Many vendors have provided interfaces between SQL systems and Hadoop but have not been able to semantically integrate these technologies while Hive, Pig and SQL processing islands proliferate. This session will discuss how Teradata is working with Hortonworks to optimize the use of Hadoop within the Teradata Analytical Ecosystem to ingest, store, and refine new data types, as well as exciting new developments to bridge the gap between Hadoop and SQL to unlock deeper insights from data in Hadoop. The use of Teradata Aster as a tightly integrated SQL-MapReduce® Discovery Platform for Hadoop environments will also be discussed.

    At 11:25am to 12:05pm, Wednesday 13th June

  • 30B Events a Day with Hadoop: Analyzing 90% of the Worldwide Web Population

    by Mike Brown

    This session provides details on how comScore uses Hadoop to process over 30 billion internet and mobile events per day (over 2TB compressed) to understand and report on web behavior. This will include the methods used to ensure scalability and high uptime for this mission-critical data. The talk will highlight the use of Hadoop to determine and calculate the metrics used in its flagship MediaMetrix product. Details on how disparate information sources can be quickly and efficiently combined and analyzed will be addressed. The talk will also detail how algorithms running on top of Hadoop can be used to develop deep insights into user behavior. The session also touches on comScore's best demonstrated practices for large-scale processing with Hadoop and its ability to combine profile information on its panelists with their Web activities to develop insights about internet usage.

    At 1:30pm to 2:10pm, Wednesday 13th June

  • Agile Deployment of Predictive Analytics on Hadoop: Faster Insights through Open Standards

    by Michael Zeller and Ulrich Rueckert

    While Hadoop provides an excellent platform for data aggregation and general analytics, it can also provide the right platform for advanced predictive analytics against vast amounts of data, preferably with low latency and in real-time. This drives the business need for comprehensive solutions that combine the aspects of big data with an agile integration of data mining models. Facilitating this convergence is the Predictive Model Markup Language (PMML), a vendor-independent standard to represent and exchange data mining models that is supported by all major data mining vendors and open source tools. This presentation will outline the benefits of the PMML standard as a key element of data science best practices and its application in the context of distributed processing. In a live demonstration, we will showcase how Datameer and the Zementis Universal PMML Plug-in take advantage of a highly parallel Hadoop architecture to efficiently derive predictions from very large volumes of data. In this session, the audience will learn:
    How to leverage predictive analytics in the context of big data
    An introduction to the Predictive Model Markup Language (PMML), the open standard for data mining
    How to reduce the cost and complexity of predictive analytics

    At 1:30pm to 2:10pm, Wednesday 13th June

  • Using Hadoop to Expand Data Warehousing

    by Ron Bodkin and Mike Peterson

    Neustar is a fast-growing provider of enterprise services in telecommunications, online advertising, Internet infrastructure, and advanced technology. Neustar has engaged Think Big Analytics to leverage Hadoop to expand their data analysis capacity. This session describes how Hadoop has expanded their data warehouse capacity, increased agility for data analysis, reduced costs, and enabled new data products. We look at the challenges and opportunities in capturing hundreds of terabytes of compact binary network data, ad hoc analysis, integration with a scale-out relational database, more agile data development, and building new products integrating multiple big data sets.

    At 1:30pm to 2:10pm, Wednesday 13th June

  • Blueprint for Integrating Big Data Analytics and BI

    by Abe Taha

    Hadoop is fast becoming an integral part of BI infrastructures to empower everyone in the business to affordably analyze large amounts of multi-structured data not previously possible with traditional OLAP or RDBMS technologies. This session will cover best practices for implementing Big Data Analytics directly on Hadoop. Topics will include architecture, integration and data exchange; self-service data exploration and analysis; openness and extensibility; the roles of Hive and the Hive metastore; integration with BI dashboards, reporting and visualization tools; using existing SAS and SPSS models and advanced analytics; ingesting data with Sqoop, Flume, and ETL.

    At 2:25pm to 3:05pm, Wednesday 13th June

  • Low Latency 'OLAP' with Hadoop and HBase

    by Andrei Dragomir

    We use “SaasBase Analytics” to incrementally process large heterogeneous data sets into pre-aggregated, indexed views, stored in HBase to be queried in realtime. The requirement we started from was to make large amounts of data available in near realtime (minutes) to large numbers of users for large numbers of (different) queries that take milliseconds to execute. This sets our problem apart from classical solutions such as Hive and Pig. In this talk I'll go through the design of the solution and the strategies (and hacks) used to achieve low latency and scalability, from the theoretical model through the entire process of ETL to warehousing and queries.

    At 2:25pm to 3:05pm, Wednesday 13th June

  • Performing Network & Security Analytics with Hadoop

    by Travis Dawson

    This session shows how Hadoop enables deep analytics over massive amounts of network data, and how to extract information and value using Hadoop at the core of a complete analytics system. Narus, a division of Boeing, helps customers unlock the value of their networks with dynamic network traffic intelligence and analysis of information on IP traffic and flow data. This session provides details on how real-time traffic capture and analysis integrates with Hadoop to perform extremely complex analytics over vast quantities of data in a demanding environment to produce actionable information. The uses for these analytics range from simple network analysis to providing complex security detection and mitigation analysis. Terabytes of forensic data of network traffic are processed to isolate suspicious patterns of behavior, allowing further analysis to pinpoint malicious traffic and operators to take action.

    At 3:35pm to 4:15pm, Wednesday 13th June

  • A proof of concept with Hadoop: storage and analytics of electrical time-series

    by Marie-Luce Picard and Bruno Jacquin

    Worldwide, smart-grid projects are being launched, motivated by economic constraints, regulatory aspects, and social or environmental needs. In France, a recent law implies the future deployment of about 35 million meters. Scalable solutions for storing and processing huge amounts of metering data are needed: relational database technologies as well as Big Data approaches like Hadoop can be considered. What are the main difficulties for scalable and efficient storage when different types of queries are to be implemented and run?
    operational and analytics needs
    variable levels of complexity with variable scopes and frequencies
    variable acceptable latencies
    We will describe current work on storing and mining large amounts of metering data (about 1,800 billion metering measurement records, an annual volume of about 120 TB of raw data) using a Hadoop-based solution. We will focus on:
    physical and logical architecture
    time-series data modelling, and the impact of compression types as well as HDFS block sizes
    Hive and Pig for analytical queries, and HBase for point queries
    Finally, we will discuss the added value of using Hadoop as a component in a future information system for utilities.

    At 4:30pm to 5:10pm, Wednesday 13th June
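
    One concrete piece of the time-series modelling question above is row-key design for HBase point queries. A minimal sketch, where the bucket count and field widths are illustrative assumptions, not the project's actual schema:

```python
# Compose an HBase-style row key for meter readings: a small hash bucket to
# spread sequential meter writes across regions, then meter id and day so a
# meter-day scan covers one contiguous key range.

def row_key(meter_id: int, day: str) -> bytes:
    bucket = meter_id % 16  # illustrative bucket count
    return f"{bucket:02d}|{meter_id:010d}|{day}".encode()

print(row_key(12345, "2012-06-13"))  # b'09|0000012345|2012-06-13'
```

    Zero-padding the fields keeps the lexicographic ordering of keys consistent with the numeric ordering of meters and dates, which is what makes range scans work.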

  • Big Data Real-Time Applications: A Reference Architecture for Search, Discovery and Analytics

    by Justin Makeig

    At 4:30pm to 5:10pm, Wednesday 13th June

  • Hadoop and Vertica: The Data Analytics Platform at Twitter

    by Bill Graham

    Twitter's Data Analytics Platform uses a number of technologies, including Hadoop, Pig, Vertica, MySQL and ZooKeeper, to process hundreds of terabytes of data per day. Hadoop and Vertica are key components of the platform. The two systems are complementary, but their inherent differences create integration challenges. This talk will give an overview of the overall system architecture focusing on integration details, job coordination and resource management. Attendees will learn about the pitfalls we encountered and the solutions we developed while evolving the platform.

    At 4:30pm to 5:10pm, Wednesday 13th June

  • Hadoop in the Public Sector

    by Tom Plunkett

    In this time of change, Federal, State, and Local governments are turning to Hadoop and related big data projects for assistance in both providing new services to customers as well as cost cutting. This presentation will discuss several high profile projects using Hadoop in the public sector. The U.S. Naval Air Systems Command uses Hadoop for aircraft maintenance information. Tennessee Valley Authority uses Hadoop to store massive amounts of power utility sensor data. Pacific Northwest National Laboratory, a Department of Energy national lab, has applied Hadoop to bioinformatics analysis. These and other public sector examples will be discussed.

    At 4:30pm to 5:10pm, Wednesday 13th June

Thursday 14th June 2012

  • Experiences in Streaming Analytics at Petabyte (or larger) Scale

    by Stephen Sorkin

    How do you keep up with the velocity and variety of data streaming in and get analytics on it even before persistence and replication in Hadoop? In this talk, we’ll look at common architectural patterns being used today at companies such as Expedia, Groupon and Zynga that take advantage of Splunk to provide real-time collection, indexing and analysis of machine-generated big data with reliable event delivery to Hadoop. We’ll also describe how to use Splunk’s advanced search language to access data stored in Hadoop and rapidly analyze, report on and visualize results.

    At 10:30am to 11:10am, Thursday 14th June

  • Analytical Queries with Hive: SQL Windowing and Table Functions

    by Harish Butani

    Hive Query Language (HQL) is excellent for productivity and enables reuse of SQL skills, but falls short in advanced analytic queries. Hive's Map & Reduce scripts mechanism lacks the simplicity of SQL, and specifying new analyses is cumbersome. We developed SQLWindowing for Hive (SQW) to overcome these issues. SQW introduces both windowing and table functions to the Hive user. SQW appears as an HQL extension, with table functions and windowing clauses interspersed with HQL. This means the user stays within a SQL-like interface while having these capabilities available. SQW has been published as an open source project. It is available as both a CLI and an embeddable jar with a simple query API. There are pre-built windowing functions for ranking, aggregation, navigation and linear regression, and table functions for time-series analysis, allocations, and data densification. Functions can be chained for more complex analysis. Under the covers, MR mechanics are used to partition and order data. The fundamental interface is the table function, whose core job is to operate on data partitions. Function implementations are isolated from MR mechanics and focus purely on computation logic. Groovy scripting can be used for core implementation and for parameterizing behavior. Writing functions typically involves extending one of the existing abstract functions.

    At 11:25am to 12:05pm, Thursday 14th June
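
    To make the windowing idea concrete, here is what a rank over a partition computes, sketched in plain Python rather than SQW's actual HQL syntax (names and data are illustrative):

```python
from itertools import groupby
from operator import itemgetter

def rank_within_partition(rows, partition_key, order_key):
    """Rank rows within each partition by order_key, descending; akin in
    spirit to RANK() OVER (PARTITION BY partition_key ORDER BY order_key DESC),
    ignoring tie handling for brevity."""
    ranked = []
    rows = sorted(rows, key=itemgetter(partition_key))
    for _, group in groupby(rows, key=itemgetter(partition_key)):
        ordered = sorted(group, key=itemgetter(order_key), reverse=True)
        for rank, row in enumerate(ordered, start=1):
            ranked.append({**row, "rank": rank})
    return ranked

sales = [
    {"region": "east", "rep": "a", "amount": 90},
    {"region": "east", "rep": "b", "amount": 120},
    {"region": "west", "rep": "c", "amount": 70},
]
result = rank_within_partition(sales, "region", "amount")
print([(r["rep"], r["rank"]) for r in result])  # [('b', 1), ('a', 2), ('c', 1)]
```

    SQW's contribution is expressing exactly this partition-then-compute step inside HQL, with the partitioning and ordering handled by MapReduce under the covers.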

  • Combining Hadoop and RDBMS for Large-Scale Big Data Analytics

    by Shilpa Lawande and Mingsheng Hong

    When working with structured, semi-structured, and unstructured data, there is often a tendency to try and force one tool – either Hadoop or a traditional DBMS – to do all the work. At Vertica, we’ve found that there are reasons to use Hadoop for some analytics projects, and Vertica for others, and the magic comes in knowing when to use which tool and how these two tools can work together. Join us as we walk through some of the use cases for using Hadoop with a purpose-built analytics platform for an effective, combined analytics solution.

    At 1:30pm to 2:10pm, Thursday 14th June

  • It's Not About the "Big" in Big Data

    by Stefan Groschupf

    The compelling aspects of Big Data have little to do with data size. MPP databases already accommodate large datasets. It’s really the variety and velocity of big data that is driving new use cases. With big data analytics, we can now turn analysts loose on all structured and unstructured data, in its raw form, without the need to pre-model anything. This removes any boundaries on what analysts can analyze and dramatically shortens the time to insight, especially as new data sources are added. This session compares and contrasts the architecture and advantages of analytics running natively on Hadoop versus traditional BI and Hive solutions through use cases such as fraud detection, rogue trader identification, competitive analysis and customer behavior analytics.

    At 3:35pm to 4:15pm, Thursday 14th June

  • Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr

    by Grant Ingersoll

    Understanding and accessing large volumes of content often requires a multi-faceted approach that goes well beyond the basics of simple batch processing jobs. In many cases, one needs both ad hoc, real time access to the content as well as the ability to discover interesting information based on a variety of features such as recommendations, summaries and other interesting insights. Furthermore, analyzing how users interact with the content can both further enhance the quality of the system and deliver much needed insight into both the users and the content for the business. In this talk, we'll discuss how to build a platform that enables large scale search, discovery and analytics over a wide variety of content, utilizing tools like Solr, Hadoop, Mahout and others.

    At 3:35pm to 4:15pm, Thursday 14th June

  • Paypal Behavioral Analytics on Hadoop

    by Anil Madan

    Online and offline commerce are changing and converging, and technology is dramatically influencing how consumers connect, shop and pay. With over 100 million active accounts in 190 markets and 25 currencies around the world, PayPal is at the forefront of payments innovation. PayPal continues to disrupt traditional means of money exchange and is the faster, safer way to pay and get paid anytime, anywhere and any way. Learn how PayPal is changing the game with its behavioral, risk and relevance analytics to not only generate advanced insights but also personalize cross-channel experiences. Understand how we leverage hybrid instrumentation techniques to build data pipelines for measurement and reporting.

    At 3:35pm to 4:15pm, Thursday 14th June

  • Spark and Shark: High-Speed In-Memory Analytics over Hadoop and Hive Data

    by Matei Zaharia

    Spark is an open source cluster computing framework that can outperform Hadoop by 30x by storing datasets in memory across jobs. Shark is a port of Apache Hive onto Spark, which provides a similar speedup for SQL queries, allowing interactive exploration of data in existing Hive warehouses. This talk will cover how both Spark and Shark are being used at various companies to accelerate big data analytics, the architecture of the systems, and where they are heading. In particular, we will show how both systems are used for large-scale machine learning, where the ability to keep data in memory across iterations yields substantial speedups, and for interactive data mining, from Shark’s SQL interface or Spark’s Scala-based console. We will also discuss an upcoming extension, Spark Streaming, that adds support for low-latency stream processing in Spark, giving users a unified interface for batch and online analytics.

    At 4:30pm to 5:10pm, Thursday 14th June
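
    The in-memory reuse that gives Spark its edge over chained MapReduce jobs can be shown with a toy comparison: load a dataset once, then run several passes over the cached copy instead of re-reading input for every job (all names here are illustrative, not Spark's API):

```python
def load():
    # Stands in for reading input from HDFS; in Spark this cost is paid once
    # when a dataset is cached in memory, not once per job.
    return list(range(1000))

cached = load()  # keep the working set in memory across iterations

def pass_over(data, f):
    return sum(f(x) for x in data)

total_evens = pass_over(cached, lambda x: x % 2 == 0)  # pass 1: count evens
total_squares = pass_over(cached, lambda x: x * x)     # pass 2: same cached data
print(total_evens, total_squares)  # 500 332833500
```

    Iterative workloads such as machine learning repeat this pattern tens or hundreds of times, which is where avoiding a disk round-trip per iteration yields the large speedups described above.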

  • SQL-H: A New Way to Enable SQL Analytics on Hadoop

    by Sushil Thomas

    Most connectors to Hadoop focus on providing data connectivity between HDFS and associated systems. While this provides access to HDFS files it requires that the analyst, data scientist, or administrator recreate schemas and track data changes in the associated systems. Now, Teradata Aster’s SQL-H™ provides integration with Apache HCatalog (Hadoop Catalog) for the business analysts and data scientists working with Hadoop-based data sets. Schemas created with varied Hadoop-related projects are automatically mapped into Aster Database to perform interactive SQL analysis. In this session we will discuss how SQL-H and Aster’s 50+ analytical functions benefit business analysts wanting to do analysis of Hadoop data much more easily.

    At 4:30pm to 5:10pm, Thursday 14th June