Sessions at Hadoop Summit 2012 about Big Data


Wednesday 13th June 2012

  • Apache Hadoop MapReduce: What's Next?

    by Arun Murthy

    Apache Hadoop has made giant strides since the last Hadoop Summit: the community has released hadoop-1.0 after nearly six years and is now on the cusp of Hadoop.next (think of it as hadoop-2.0). With the next generation of MapReduce out in releases 0.23.0 and 0.23.1, a new set of features has been requested in the community. This talk will cover those features, such as preemption, web services, and near-real-time analysis, and how we are working to tackle them in the near future. It will also cover the roadmap and timelines for next-generation MapReduce, along with the release schedule for Apache Hadoop.

    At 10:30am to 11:10am, Wednesday 13th June

  • Big Data Architecture in the AWS Cloud

    by Adam Gray

    Big Data technologies such as Hadoop, NoSQL, and scalable object stores are an ideal fit for the elasticity and scalability of a cloud deployment. In this talk we will take a look at several common architectural patterns that are being used today on the AWS cloud to take advantage of these synergies while overcoming some of its inherent limitations. This will include the use of Amazon S3 as a common store for multiple transient Hadoop clusters and how this affects disaster recovery, job scheduling, and software upgrades. We’ll also look at how to build a dynamically scalable Hadoop data warehouse that uses Amazon S3, HDFS, and Amazon DynamoDB to create a three-tiered architecture. Finally, we’ll explore how Amazon Elastic MapReduce is being used to perform sophisticated analytics on data stored in Amazon DynamoDB.

    At 10:30am to 11:10am, Wednesday 13th June

  • Tackling Big Data with Hadoop and Open Source Integration

    by Fabrice Bonan

    Enterprises are faced with ever-increasing amounts of data, and problems related to volume and usage demands have forced IT managers and developers to seek out new solutions. Fortunately, this has resulted in an explosion of innovation in massively parallel processing and non-relational data storage. Apache Hadoop, an open source software platform, has quickly become the technology of choice for large organizations in need of sophisticated analysis and transformation of petabytes of structured and complex data. Fabrice Bonan will discuss how users can access, manipulate and store huge volumes of data in Hadoop and benefit from high-performance, cost-optimized data integration with ultimate scalability. During this session, attendees will learn how to:
    - Leverage the explosive growth of data
    - Deploy and tap into the powerful architecture of Hadoop
    - Process massive data volumes through a combination of Hadoop and open source architecture

    At 10:30am to 11:10am, Wednesday 13th June

  • Analyzing Multi-Structured Data with Hadoop

    by Justin Borgman

    Given its ability to analyze structured, unstructured, and “multi-structured” data, Hadoop is an increasingly viable option for analytics and business intelligence within the enterprise. Dramatically more scalable and cost-effective than traditional data warehousing technologies, Hadoop is also increasingly used to perform new kinds of analytics that were previously impossible. When it comes to Big Data, retailers are at the forefront of leveraging large volumes of nuanced information about customers, to improve the effectiveness of promotional campaigns, refine pricing models, and lower overall customer acquisition costs. Retailers compete fiercely for consumers’ attention, time, and money, and effective use of analytics can result in sustained competitive advantage. Forward-thinking retailers can now take advantage of all data sources to construct a complete picture of a customer. This invariably consists of both structured data (customer and inventory records, spreadsheets, etc.) and unstructured data (clickstream logs, email archives, customer feedback and comment fields, etc.). This allows, for example, online retailers with structured, transactional sales data to connect that data with unstructured comments from product reviews, providing insight into how reviews affect consumers’ propensity to purchase a particular product. This session will examine several real-world customer use cases applying combined analysis of structured and unstructured data.

    At 11:25am to 12:05pm, Wednesday 13th June

  • Greenplum Database on HDFS

    by Lei Chang

    Greenplum Database (GPDB) is an industry-leading massively parallel processing (MPP) database offering, providing high-performance, scalable, and mission-critical analytic processing. In the Greenplum Unified Analytics Platform (UAP), GPDB and Hadoop fuse the co-processing of structured and unstructured data under a single user interface to empower data science teams. At the bottom layers of GPDB and Hadoop, however, are distinct storage systems: GPDB is based on a local POSIX-compliant file system, while most Hadoop systems are based on HDFS. Having two distinct storage systems increases both the capital and operational costs of supporting unified analytics. One possible solution to this problem is to allow GPDB to run natively on HDFS. This talk will give an introduction to how GPDB lives on HDFS by using a pluggable storage layer in the kernel of the GPDB engine. It also introduces the features we added to HDFS to support the full transactional semantics of GPDB, and how we increased the concurrency of HDFS access from C clients. Detailed experimental performance results will be presented to compare GPDB on HDFS (GOH) to other state-of-the-art big data processing frameworks. We will also discuss our experiences building GOH and the opportunities opened up by the fusion of an MPP database and Hadoop.

    At 11:25am to 12:05pm, Wednesday 13th June

  • The Merchant Lookup Service at Intuit

    by Michael Radwin

    The Merchant Lookup Service at Intuit enables users and products to look up business details by:
    - Business name (including partial names and misspellings)
    - Business location (street address, latitude and longitude)
    - Business type (category, SIC code)
    - User location (IP, GPS-enabled device location)
    This powerful service enables auto-suggest, auto-complete and auto-correct within products. The project aims to provide a more complete, canonical business profile by bringing together data and metadata from the various information providers as well as merchants from Intuit’s small-business customer base. The Business Directory Service is available as a web service that can be integrated into desktop, web and mobile applications. It is exposed through a REST API whose response times are minimized because the data is indexed in Solr and distributed. The backend is powered by HBase, which stores this comprehensive, deduplicated, canonical merchant information. Hundreds of millions of records, with duplicates arising from sparse, manually entered information from Intuit’s small-business customers as well as overlapping records from different information providers, are de-duplicated through a series of Hadoop jobs, resulting in a canonical set of merchants. The deduping pipeline has components such as a Reader, an Index Generator, various Matchers, a Score Combiner and a Merchant Splicer.

    At 11:25am to 12:05pm, Wednesday 13th June
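    The abstract above names the pipeline stages (Reader, Matchers, Score Combiner, Merchant Splicer) but gives no code. The core blocking-and-matching idea behind any such dedup pipeline can be sketched in a few lines of Python; all field names, the blocking key, and the similarity threshold below are illustrative assumptions, not Intuit's actual implementation:

```python
from difflib import SequenceMatcher
from collections import defaultdict

def block_key(record):
    """Coarse blocking: first 4 letters of the normalized name plus ZIP code."""
    name = "".join(c for c in record["name"].lower() if c.isalnum())
    return (name[:4], record.get("zip", ""))

def is_match(a, b, threshold=0.85):
    """Pairwise matcher: fuzzy name similarity, only ever applied within a block."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio() >= threshold

def dedupe(records):
    """Group candidate duplicates by block key, cluster matches, keep one canonical record."""
    blocks = defaultdict(list)
    for r in records:
        blocks[block_key(r)].append(r)
    canonical = []
    for group in blocks.values():
        clusters = []
        for r in group:
            for c in clusters:
                if is_match(r, c[0]):
                    c.append(r)
                    break
            else:
                clusters.append([r])
        # canonical record: the most complete (longest) name in each cluster
        canonical.extend(max(c, key=lambda x: len(x["name"])) for c in clusters)
    return canonical

records = [
    {"name": "Joe's Coffee Shop", "zip": "94103"},
    {"name": "Joes Coffee Shop",  "zip": "94103"},
    {"name": "Blue Bottle",       "zip": "94103"},
]
print(len(dedupe(records)))  # 2
```

In the real pipeline each stage would be a Hadoop job, with block keys serving as reduce keys so that pairwise matching never has to cross block boundaries.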

  • Unified Big Data Architecture: Integrating Hadoop within an Enterprise Analytical Ecosystem

    by Priyank Patel

    Trending use cases have pointed out the complementary nature of Hadoop and existing data management systems—emphasizing the importance of leveraging SQL, engineering, and operational skills, as well as incorporating novel uses of MapReduce to improve distributed analytic processing. Many vendors have provided interfaces between SQL systems and Hadoop but have not been able to semantically integrate these technologies while Hive, Pig and SQL processing islands proliferate. This session will discuss how Teradata is working with Hortonworks to optimize the use of Hadoop within the Teradata Analytical Ecosystem to ingest, store, and refine new data types, as well as exciting new developments to bridge the gap between Hadoop and SQL to unlock deeper insights from data in Hadoop. The use of Teradata Aster as a tightly integrated SQL-MapReduce® Discovery Platform for Hadoop environments will also be discussed.

    At 11:25am to 12:05pm, Wednesday 13th June

  • 30B Events a Day with Hadoop: Analyzing 90% of the Worldwide Web Population

    by Mike Brown

    This session provides details on how comScore uses Hadoop to process over 30 billion internet and mobile events per day (over 2TB compressed) to understand and report on web behavior. This will include the methods used to ensure scalability and provide high uptime for this mission-critical data. The talk will highlight the use of Hadoop to determine and calculate the metrics used in its flagship MediaMetrix product. Details on how disparate information sources can be quickly and efficiently combined and analyzed will be addressed. The talk will also detail how algorithms running on top of Hadoop provide deep insight into individual user behavior that can be developed into broad audience-level insights. The session also touches on comScore’s best demonstrated practices for large-scale processing with Hadoop, including combining profile information on its panelists with their Web activities to develop insights about internet usage.

    At 1:30pm to 2:10pm, Wednesday 13th June

  • Agile Deployment of Predictive Analytics on Hadoop: Faster Insights through Open Standards

    by Michael Zeller and Ulrich Rueckert

    While Hadoop provides an excellent platform for data aggregation and general analytics, it can also provide the right platform for advanced predictive analytics against vast amounts of data, preferably with low latency and in real time. This drives the business need for comprehensive solutions that combine the aspects of big data with an agile integration of data mining models. Facilitating this convergence is the Predictive Model Markup Language (PMML), a vendor-independent standard for representing and exchanging data mining models that is supported by all major data mining vendors and open source tools. This presentation will outline the benefits of the PMML standard as a key element of data science best practices and its application in the context of distributed processing. In a live demonstration, we will showcase how Datameer and the Zementis Universal PMML Plug-in take advantage of a highly parallel Hadoop architecture to efficiently derive predictions from very large volumes of data. In this session, the audience will learn:
    - How to leverage predictive analytics in the context of big data
    - The Predictive Model Markup Language (PMML), an open standard for data mining
    - How to reduce the cost and complexity of predictive analytics

    At 1:30pm to 2:10pm, Wednesday 13th June
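    To make PMML's role concrete: a PMML document is just XML describing a trained model, so any consumer that understands the schema can score records against it. The toy below evaluates a linear RegressionModel using only Python's standard library; the element names follow the PMML RegressionModel structure, but the model itself is fabricated for illustration and namespaces are omitted for brevity:

```python
import xml.etree.ElementTree as ET

# A toy PMML-style document: one RegressionTable holding an intercept
# plus a coefficient per input field (namespaces omitted).
PMML = """
<PMML version="4.1">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.0">
      <NumericPredictor name="age" coefficient="0.5"/>
      <NumericPredictor name="income" coefficient="0.01"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

def score(pmml_text, row):
    """Evaluate the linear model: intercept + sum(coefficient * feature value)."""
    root = ET.fromstring(pmml_text)
    table = root.find(".//RegressionTable")
    y = float(table.get("intercept"))
    for p in table.findall("NumericPredictor"):
        y += float(p.get("coefficient")) * row[p.get("name")]
    return y

print(score(PMML, {"age": 40.0, "income": 5000.0}))  # 1.0 + 20.0 + 50.0 = 71.0
```

A production scoring engine such as the Zementis plug-in does the same thing at scale: parse the model once, then apply it to every record of a MapReduce split.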

  • iMStor - Hadoop Storage based Tiering Platform

    by Vishal Malik

    Existing storage tiering solutions are mostly hardware-based and very expensive: RAID costs are high, and policy-based tiering is done mostly at the hardware level and is not transparent to the user. The iMStor platform we have developed makes storage systems easy to manage, with policy-based management at the user level to control what type of data lives in which Hadoop-based storage engine, ranging from HBase and MongoDB to CouchDB, Riak or Redis. iMStor provides a software storage engine, built on the Hadoop stack, that does the intelligent work which today is done in hardware and in non-Hadoop platforms.

    At 1:30pm to 2:10pm, Wednesday 13th June

  • Using Hadoop to Expand Data Warehousing

    by Ron Bodkin and Mike Peterson

    Neustar is a fast-growing provider of enterprise services in telecommunications, online advertising, Internet infrastructure, and advanced technology. Neustar has engaged Think Big Analytics to leverage Hadoop to expand their data analysis capacity. This session describes how Hadoop has expanded their data warehouse capacity, increased their agility for data analysis, reduced costs, and enabled new data products. We look at the challenges and opportunities in capturing hundreds of terabytes of compact binary network data, ad hoc analysis, integration with a scale-out relational database, more agile data development, and building new products that integrate multiple big data sets.

    At 1:30pm to 2:10pm, Wednesday 13th June

  • Big Data Challenges at NASA

    by Chris Mattmann

    I'll describe what I see as the most current and exciting big data challenges at NASA, across a number of application domains: planetary science; next-generation Earth science decadal missions; radio astronomy, including next-generation instruments such as the Square Kilometre Array; and snow hydrology and climate impacts. The session will focus on defining the problem space, suggesting how technologies like Apache Hadoop can be leveraged, and positioning Apache Hadoop against Apache OODT, a technology pioneered originally by NASA and now another Apache big data technology growing at the foundation. In addition, I will present what I feel are the key next steps, architecturally, that the Hadoop community can take to better integrate with and apply to the realm of science data systems.

    At 2:25pm to 3:05pm, Wednesday 13th June

  • Blueprint for Integrating Big Data Analytics and BI

    by Abe Taha

    Hadoop is fast becoming an integral part of BI infrastructures to empower everyone in the business to affordably analyze large amounts of multi-structured data not previously possible with traditional OLAP or RDBMS technologies. This session will cover best practices for implementing Big Data Analytics directly on Hadoop. Topics will include architecture, integration and data exchange; self-service data exploration and analysis; openness and extensibility; the roles of Hive and the Hive metastore; integration with BI dashboards, reporting and visualization tools; using existing SAS and SPSS models and advanced analytics; ingesting data with Sqoop, Flume, and ETL.

    At 2:25pm to 3:05pm, Wednesday 13th June

  • Dynamic Reconfiguration of Apache ZooKeeper

    by Benjamin Reed

    ZooKeeper is an important member of the Hadoop ecosystem. It is widely used in industry to coordinate distributed systems. A common use case of ZooKeeper is to dynamically maintain membership and other configuration metadata for its users. ZooKeeper itself is a replicated distributed system. Unfortunately, the membership and all other configuration parameters of ZooKeeper are static: they are loaded during boot and cannot be altered. Operators resort to a “rolling restart”, a manually intensive and error-prone method of changing the configuration that has caused data loss and inconsistency in production. Automatic reconfiguration functionality has been requested by operators since 2008 (ZOOKEEPER-107). Several previous proposals were found incorrect and rejected. We implemented a new reconfiguration protocol in ZooKeeper and are currently integrating it into the codebase. It fully automates configuration changes: the set of ZooKeeper servers, their roles, addresses, etc. can be changed dynamically, without service interruption and while maintaining data consistency. By leveraging the properties already provided by ZooKeeper, our protocol is considerably simpler than the state of the art. Our protocol also encompasses the clients: clients are rebalanced across servers in the new configuration, while keeping the extent of migration proportional to the change in membership.

    At 2:25pm to 3:05pm, Wednesday 13th June

  • Hadoop at @WalmartLabs (from Proof of Concept to Production and beyond)

    by Stephen O'Sullivan and Jeremy King

    Use case: a research project to evaluate Hadoop for a business case, and moving the results into production.
    - Became the store for all web site performance / platform stats for operational analytics
    - 8 different groups started using the cluster, and it grew from 8 to 12 nodes (64TB in total) within 7 months
    - Started to do more business analytics, but had to move data around as space became an issue
    - The business funded a 250+ node cluster with a total of 1.2PB of space...
    - ...and this cluster will grow by another 200+ nodes by the end of the year
    How Hadoop will be used on the next-generation @WalmartLabs / walmart.com platform:
    - Storing all the fine-grained data generated by walmart.com
    - Real-time analytics for site performance and alerting
    - A hybrid model, both with data warehouses (using current BI tools) and with NoSQL

    At 2:25pm to 3:05pm, Wednesday 13th June

  • Low Latency 'OLAP' with Hadoop and HBase

    by Andrei Dragomir

    We use “SaasBase Analytics” to incrementally process large heterogeneous data sets into pre-aggregated, indexed views, stored in HBase to be queried in real time. The requirement we started from was to make large amounts of data available in near real time (minutes) to large numbers of users for large numbers of (different) queries that take milliseconds to execute. This sets our problem apart from classical solutions such as Hive and Pig. In this talk I'll go through the design of the solution and the strategies (and hacks) used to achieve low latency and scalability, from the theoretical model through the entire process of ETL to warehousing and queries.

    At 2:25pm to 3:05pm, Wednesday 13th June
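    The essence of the approach described above is trading ETL-time work for query-time speed: raw events are folded incrementally into pre-aggregated views keyed by (metric, dimension, time bucket), so a query becomes a single key lookup rather than a scan. A minimal in-memory sketch; the composite-key layout is an assumption modeled on common HBase row-key designs, not SaasBase's actual schema:

```python
from collections import defaultdict

# Pre-aggregated view: composite key (metric, dimension value, hour bucket) -> running sum,
# mimicking an HBase row-key design of the form "metric|dim|yyyymmddhh".
view = defaultdict(float)

def ingest(event):
    """Incrementally fold one raw event into every aggregate it contributes to."""
    hour = event["ts"] // 3600 * 3600
    view[("pageviews", event["site"], hour)] += 1
    view[("revenue", event["site"], hour)] += event.get("amount", 0.0)

def query(metric, site, hour):
    """Point query against the materialized view: one key lookup, no scan."""
    return view[(metric, site, hour)]

for e in [{"ts": 7200, "site": "a.com", "amount": 2.5},
          {"ts": 7260, "site": "a.com"},
          {"ts": 7300, "site": "b.com", "amount": 1.0}]:
    ingest(e)

print(query("pageviews", "a.com", 7200))  # 2.0
print(query("revenue", "a.com", 7200))    # 2.5
```

The same shape maps onto HBase directly: `ingest` becomes batched increments on counter columns, and `query` a single-row `get`, which is what keeps latency in milliseconds.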

  • Unleash Insights On All Data with Microsoft Big Data

    by Tim Mallalieu

    Do you plan to extract insights from mountains of data, including unstructured data that is growing faster than ever? Attend this session to learn about Microsoft’s Big Data solution that unlocks insights on all your data, including structured and unstructured data of any size. Accelerate your analytics with a Hadoop service that offers deep integration with Microsoft BI and the ability to enrich your models with publicly available data from outside your firewall. Come and see how Microsoft is broadening access to Hadoop through dramatically simplified deployment, management and programming, including full support for JavaScript.

    At 2:25pm to 3:05pm, Wednesday 13th June

  • Why HBase? Or How I learned to stop worrying and love consistency

    by Shaneal Manek

    There are a number of excellent databases that have proven invaluable when working with “big data”, including Cassandra, Riak, DynamoDB, MongoDB, and HBase. So how do you decide which is the right solution for you? This talk will start by briefly discussing some of the theory involved in distributed databases, from the oft-cited (and almost-as-often misunderstood) CAP theorem, to vector clocks and the difficulties of eventual consistency, and much more. We will then compare how mainstream distributed databases make use of these concepts, and the tradeoffs they incur. Unfortunately, due to the nature of these tradeoffs, there is no one-size-fits-all solution. So we'll discuss what classes of problems each of these systems is appropriate for, and include some real-world benchmarks, to help you make an informed decision about your distributed systems.

    At 2:25pm to 3:05pm, Wednesday 13th June
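    One of the concepts the talk promises to demystify, vector clocks, is small enough to show in full. Each replica keeps a per-node counter map; comparing two maps tells you whether one version causally precedes the other or whether they are concurrent, the case an eventually consistent store must surface or resolve. A minimal sketch (node names are illustrative):

```python
def vc_merge(a, b):
    """Element-wise max: the clock a replica holds after receiving another's version."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

def vc_compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent' (a true conflict)."""
    keys = set(a) | set(b)
    le = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    ge = all(a.get(k, 0) >= b.get(k, 0) for k in keys)
    if le and ge:
        return "equal"
    if le:
        return "before"
    if ge:
        return "after"
    return "concurrent"

# Two replicas accepted writes independently: neither clock dominates the other,
# so the store has two concurrent versions of the value on its hands.
v1 = {"node_a": 2, "node_b": 0}
v2 = {"node_a": 1, "node_b": 1}
print(vc_compare(v1, v2))             # concurrent
print(vc_merge(v1, v2))               # node_a -> 2, node_b -> 1
print(vc_compare({"node_a": 1}, v1))  # before
```

Dynamo-style systems (Riak, early DynamoDB designs) hand the "concurrent" case back to the application; strongly consistent systems like HBase avoid it by serializing writes through a single region server.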

  • Architecting a Business-Critical Enterprise Application

    by Kumar Palaniappan

    NetApp collects 500 TB per year of unstructured data from devices that phone home, sending unstructured auto-support log and configuration data back to centralized data centers. NetApp needed to scale their architecture to provide timely support data, to extend their data warehouse, and to be able to do ad hoc analysis and build predictive models for device support, product analysis, and cross-sales. This presentation covers the challenges and a solution architecture based on Apache Hadoop, Apache HBase, Apache Flume, Apache Avro, and Hive, as well as Pig and Solr. The presentation reviews lessons from the implementation.

    At 3:35pm to 4:15pm, Wednesday 13th June

  • Hadoop & Cloud @Netflix: Taming the Social Data Firehose

    by Mohammad Sabah

    In the last two years, Netflix has moved from a data center architecture to a completely cloud-oriented one. We now leverage Amazon EC2 for cloud computing and Amazon S3 for storage. At the same time, Hadoop has become central for carrying out research and analysis in data science. Using Hadoop alleviates our concerns about data sparsity, sampling bias and memory constraints, and lets us use all the data at our disposal. In this talk, we will present several efficient and scalable in-house MapReduce implementations of algorithms for search retrieval, personalized recommendation, and video auto-tagging. We will discuss how we augment Netflix with social and audience data like Twitter, Facebook, Wikipedia, Nielsen and Rentrak, and why Hadoop is suitable for handling their scale and schema. We will highlight an application showing how we set up an automated pipeline to generate interesting visualizations, trends and actionable insights from Netflix and third-party data. Outside of training models, we use Hadoop in data preparation, feature engineering, feature selection and model evaluation. We baseline our performance against alternative approaches and report significant lifts in time and space obtained by using Hadoop.

    At 3:35pm to 4:15pm, Wednesday 13th June

  • Hadoop Plugin for MongoDB: The Elephant in the Room

    by Brendan McAdams and Meghan Gill

    Learn how to integrate MongoDB with Hadoop for large-scale distributed data processing. Using tools like MapReduce, Pig and Streaming, you will learn how to do analytics and ETL on large datasets, with the ability to load and save data against MongoDB. With Hadoop MapReduce, Java and Scala programmers will find a native solution for using MapReduce to process their data with MongoDB. Programmers of all kinds will find a new way to work with ETL, using Pig to extract and analyze large datasets and persist the results to MongoDB. Python and Ruby programmers can rejoice as well in a new way to write native Mongo MapReduce using the Hadoop Streaming interfaces.

    At 3:35pm to 4:15pm, Wednesday 13th June
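    The Streaming interface mentioned above has a very small contract: the mapper and reducer are ordinary processes that read and write tab-separated key/value lines, and Hadoop sorts by key between the two phases. A minimal Python pair, with the shuffle simulated in-process; word counting stands in here for a real MongoDB-backed job:

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit one (word, 1) pair per token, tab-separated as Streaming expects."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(pairs):
    """Reduce phase: input arrives sorted by key, so counts can be summed per group."""
    parsed = (p.split("\t") for p in pairs)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(v) for _, v in group)}"

# Simulate the sort-by-key shuffle that Hadoop performs between the phases.
mapped = sorted(mapper(["MongoDB and Hadoop", "hadoop streaming"]))
print(list(reducer(mapped)))  # and=1, hadoop=2, mongodb=1, streaming=1
```

In a real deployment these two functions would read stdin and write stdout, and the MongoDB connector would supply the input splits and persist the reducer's output back to a collection.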

  • How Klout is changing the landscape of social media with Hadoop and BI

    by David Mariani and Denny Lee

    In this age of Big Data, data volumes grow exceedingly larger while the technical problems and business scenarios become more complex. Compounding these complexities, data consumers are demanding faster analysis to common business questions asked of their Big Data. This session provides concrete examples of how to address this challenge. We will highlight the use of Big Data technologies—including Hadoop and Hive —with classic BI systems such as SQL Server Analysis Services.
    Session takeaways:
    • Understand the architectural components surrounding Hadoop, Hive, Classic BI, and the Tier-1 BI ecosystem
    • Get strategies for addressing the technical issues when working with extremely large cubes
    • See how to address the technical issues when working with Big Data systems from the DBA perspective

    At 3:35pm to 4:15pm, Wednesday 13th June

  • Scaling Apache ZooKeeper to the next generation applications

    by Mahadev Konar

    Apache ZooKeeper has become a de facto standard for distributed coordination. Its design has proven flexible enough to be applied to a variety of needs of distributed applications: it has been used for leader election, service discovery, status monitoring, dynamic configuration, etc. Recently new use cases have come up in which ZooKeeper is used as a discovery service with thousands of clients; two examples are Hadoop NameNode HA and YARN HA. This has led to a new set of requirements that need to be addressed. There is a need for session-less, read-only client creation to address the startup latency issues of thousands of clients. Such scale also creates a need to reduce the memory footprint of watch management in ZooKeeper. In this talk we will discuss the various new use cases that are coming up in Apache ZooKeeper and the work being done in the community to address these issues. We will also discuss the future roadmap for ZooKeeper.

    At 3:35pm to 4:15pm, Wednesday 13th June

  • A new generation of data transfer tools for Hadoop: Sqoop 2

    by Kathleen Ting and Bilung Lee

    Apache Sqoop (incubating) was created to efficiently transfer big data between Hadoop related systems (such as HDFS, Hive, and HBase) and structured data stores (such as relational databases, data warehouses, and NoSQL systems). The popularity of Sqoop in enterprise systems confirms that Sqoop does bulk transfer admirably. In the meantime, we have encountered many new challenges that have outgrown the abilities of the current infrastructure. To fulfill more data integration use cases as well as become easier to manage and operate, a new generation of Sqoop, also known as Sqoop 2, is currently undergoing development to address several key areas, including ease of use, ease of extension, and security. This session will talk about Sqoop 2 from both the development and operations perspectives.

    At 4:30pm to 5:10pm, Wednesday 13th June
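    Sqoop itself is driven from the command line rather than a programming API, but the mechanism behind its bulk transfer is easy to illustrate: pick a split column, carve its [min, max] range into one slice per mapper, and let each mapper run a bounded query so the slices can be imported in parallel. A toy re-creation using sqlite3 as the stand-in RDBMS; the table and column names are invented, and real Sqoop handles uneven ranges, type-aware splitting, and fault tolerance far more carefully:

```python
import sqlite3

# Toy source table standing in for an RDBMS with 100 rows.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)", [(i, i * 1.5) for i in range(1, 101)])

def make_splits(conn, table, key, num_mappers):
    """Mimic Sqoop's --split-by: carve [min(key), max(key)] into one range per mapper."""
    lo, hi = conn.execute(f"SELECT MIN({key}), MAX({key}) FROM {table}").fetchone()
    step = (hi - lo + 1) // num_mappers or 1
    ranges = [(b, b + step - 1) for b in range(lo, hi + 1, step)][:num_mappers]
    ranges[-1] = (ranges[-1][0], hi)  # last split absorbs any remainder
    return ranges

def import_split(conn, table, key, lo, hi):
    """Each 'mapper' issues a bounded query, so splits are independent and parallelizable."""
    return conn.execute(
        f"SELECT * FROM {table} WHERE {key} BETWEEN ? AND ?", (lo, hi)).fetchall()

splits = make_splits(db, "orders", "id", 4)
rows = [r for s in splits for r in import_split(db, "orders", "id", *s)]
print(len(splits), len(rows))  # 4 100
```

Sqoop 2 keeps this split model but moves connector configuration and job execution into a managed server, which is where the ease-of-use, extensibility, and security work discussed in the session comes in.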

  • A proof of concept with Hadoop: storage and analytics of electrical time-series

    by Marie-Luce Picard and Bruno Jacquin

    Worldwide, smart-grid projects are being launched, motivated by economic constraints, regulatory aspects, and social or environmental needs. In France, a recent law implies the future deployment of about 35 million meters. Scalable solutions for storing and processing huge amounts of metering data are needed: relational database technologies as well as Big Data approaches like Hadoop can be considered. What are the main difficulties of scalable, efficient storage when different types of queries must be implemented and run?
    - operational and analytics needs
    - variable levels of complexity, with variable scopes and frequencies
    - variable acceptable latencies
    We will describe current work on storing and mining large amounts of metering data (about 1,800 billion metering measurement records, an annual volume of about 120 TB of raw data) using a Hadoop-based solution. We will focus on:
    - physical and logical architecture
    - time-series data modelling, and the impact of compression types and HDFS block sizes
    - Hive and Pig for analytical queries, and HBase for point queries
    Finally, we will discuss the added value of using Hadoop as a component in a future information system for utilities.

    At 4:30pm to 5:10pm, Wednesday 13th June

  • Bayesian Counters

    by Alex Kozlov

    Processing of large data requires new approaches to data mining: low, close-to-linear complexity and stream processing. In traditional data mining, the practitioner is usually presented with a static dataset, perhaps with a timestamp attached, from which to infer a model for predicting future or held-out observations. In stream processing, the problem is instead often posed as extracting as much information as possible from the current data within a limited time window, to convert it into an actionable model. In this talk I present an approach based on HBase counters for mining over streams of data, which allows for massively distributed processing and data mining. I will consider the overall design goals as well as HBase schema design dilemmas for speeding up the knowledge extraction process. I will also demo efficient implementations of Naive Bayes, Nearest Neighbor, and Bayesian learning on top of Bayesian Counters.

    At 4:30pm to 5:10pm, Wednesday 13th June
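    The counter-based approach lends itself to a compact sketch: if every observation is recorded as a handful of atomic increments (as HBase's increment operations allow), a Naive Bayes model is implicitly maintained at all times and can be evaluated straight from the counts. An in-memory Python analogue; the feature names, labels, and add-one smoothing are illustrative choices, not the talk's exact design:

```python
from collections import defaultdict
import math

# Counter tables as they might live in HBase: each observation is just
# a few atomic increments, so the "model" is always up to date.
class_counts = defaultdict(int)    # count(class)
feature_counts = defaultdict(int)  # count(class, feature, value)

def observe(features, label):
    """Stream ingestion: increment the counters this observation touches."""
    class_counts[label] += 1
    for f, v in features.items():
        feature_counts[(label, f, v)] += 1

def predict(features):
    """Naive Bayes evaluated directly from raw counters, with add-one smoothing."""
    total = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for label, n in class_counts.items():
        lp = math.log(n / total)
        for f, v in features.items():
            lp += math.log((feature_counts[(label, f, v)] + 1) / (n + 2))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

for fs, y in [({"os": "ios"}, "mobile"), ({"os": "android"}, "mobile"),
              ({"os": "windows"}, "desktop"), ({"os": "linux"}, "desktop"),
              ({"os": "ios"}, "mobile")]:
    observe(fs, y)

print(predict({"os": "ios"}))  # mobile
```

Because increments commute, many writers can update the same counters concurrently without coordination, which is what makes the scheme attractive for massively distributed stream mining.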

  • Big Data Real-Time Applications: A Reference Architecture for Search, Discovery and Analytics

    by Justin Makeig


    At 4:30pm to 5:10pm, Wednesday 13th June

  • Deployment and Management of Hadoop Clusters with AMBARI

    by Matt Foley

    Deploying, configuring, and managing large Hadoop and HBase clusters can be quite complex. Just upgrading one Hadoop component on a 2000-node cluster can take a lot of time and expertise, and there have been few tools specialized for Hadoop cluster administrators. AMBARI is an Apache incubator project to deliver monitoring and management functionality for Hadoop clusters. This session presents the AMBARI tools for cluster management, specifically: cluster pre-configuration and validation; Hadoop software deployment, installation, and smoke testing; Hadoop configuration and reconfiguration; and a basic set of management operations, including starting and stopping services and adding and removing nodes. In providing these capabilities, AMBARI seeks to integrate with (rather than replace) existing open-source packaging and deployment technology available in most data centers, such as Puppet and Chef, Yum, Apt, and Zypper.

    At 4:30pm to 5:10pm, Wednesday 13th June

  • Hadoop and Vertica: The Data Analytics Platform at Twitter

    by Bill Graham

    Twitter's Data Analytics Platform uses a number of technologies, including Hadoop, Pig, Vertica, MySQL and ZooKeeper, to process hundreds of terabytes of data per day. Hadoop and Vertica are key components of the platform. The two systems are complementary, but their inherent differences create integration challenges. This talk will give an overview of the overall system architecture, focusing on integration details, job coordination and resource management. Attendees will learn about the pitfalls we encountered and the solutions we developed while evolving the platform.

    At 4:30pm to 5:10pm, Wednesday 13th June

  • Hadoop in the Public Sector

    by Tom Plunkett

    In this time of change, Federal, State, and Local governments are turning to Hadoop and related big data projects for assistance in both providing new services to customers as well as cost cutting. This presentation will discuss several high profile projects using Hadoop in the public sector. The U.S. Naval Air Systems Command uses Hadoop for aircraft maintenance information. Tennessee Valley Authority uses Hadoop to store massive amounts of power utility sensor data. Pacific Northwest National Laboratory, a Department of Energy national lab, has applied Hadoop to bioinformatics analysis. These and other public sector examples will be discussed.

    At 4:30pm to 5:10pm, Wednesday 13th June