Your current filters are…
by Nish Parikh
Temporal dynamics not only play an important role in stock markets, traffic, weather and network systems but also in web search, recommender systems and eCommerce. In eCommerce, temporal variations are great indicators of the business. Performing temporal data mining tasks like burst detection, event prediction and temporal similarity search on large scale data is challenging. Most challenges are related to de-noising large volume data and approximating algorithms which if applied naively are quadratic. eBay is one of the leading online marketplaces with ~100M users, 100+M items and 100+M daily queries. We talk about how we used Hadoop to generate a 7 year noise free corpus of eCommerce queries at eBay. We describe the use of MapReduce over Hadoop for burst detection. Burst detection  enables us to find daily trends and spikes in demand. Year over year burst patterns are analyzed to detect seasonality and predict events. We also discuss the power of Hadoop in scaling up temporal similarity search over billions of queries. The mined information has uses in Temporal Recommender Systems, Query Suggestion Systems and Pop-up Commerce.  N Parikh, N Sundaresan. Scalable and near real-time burst detection from eCommerce queries. KDD`08  N Sundaresan. Popup Commerce, Towards Building Transient and Thematic Stores
by Nathan Marz
Storm is a distributed and fault-tolerant realtime computation system, doing for realtime computation what Hadoop did for batch computation. Storm can be used together with Hadoop to make a potent realtime analytics stack. Although Hadoop is not a realtime system, it can be used to support a realtime system. Building a batch/realtime stack requires solving a lot of sub-problems: – Getting data to both Hadoop and Storm – Exporting views of the Hadoop data into a readable index – Using an appropriate queuing broker to feed Storm – Choosing an appropriate database to serve the realtime indexes updated via Storm – Syncing the views produced independently by Hadoop and Storm when doing queries Come learn how we`ve solved these problems at Twitter to do complex analytics in realtime.
How YapMap built a new type of conversational search technology using Hadoop as the backbone. While adoption of Hadoop for big data analytics is becoming very common, substantially fewer companies are using Hadoop as a system of record to directly support their primary business. We’ll discuss our experiences using Hadoop in this second way. We’ll take you through our system architecture and how Hadoop ecosystem components can work side by side with traditional application technologies. We’ll cover:
Building a distributed targeted crawler using Zookeeper as a coordinated locking service
How we rely on HBase’s atomicity and optimistic locking to coordinate processing pipelines
Using Map Reduce, HBase regions and bi-level sharding for truly distributed index creation
Running diskless index servers that pull multiple reduce outputs directly from the distributed file system into memory to form a single index
Using Mahout to aid in user exploration
How we integrated the use of Zookeeper, HBase and MapReduce with application focused technologies including JavaEE6, CDI, SQL, Protobuf and RabbitMQ.
Deployment decisions including switching from CDH to unsupported MapR, using Inifinband, building a six node research cluster for $1500, using desktop drives & white boxes, utilizing huge pages, gc hell, etc.
by Fabrice Bonan
Enterprises are faced with ever increasing amounts of data, and problems related to volume and usage demands have forced IT managers and developers to seek out new solutions. Fortunately, this has resulted in an explosion of innovation in massively parallel processing and non-relational data storage. Apache Hadoop, an open source software platform, has quickly become the technology of choice for large organizations in need of sophisticated analysis and transformation of petabytes of structured and complex data. Fabrice Bonan will discuss how users can access, manipulate and store huge volumes of data in Hadoop and benefit from high-performance, cost-optimized data integration with ultimate scalability. During this session, attendees will learn how to:
- Leverage the explosive growth of data
- Deploy and tap into the powerful architecture of Hadoop
- Process massive data volumes through a combination of Hadoop and open source architecture
Given its ability to analyze structured, unstructured, and “multi-structured” data, Hadoop is an increasingly viable option for analytics and business intelligence within the enterprise. Dramatically more scalable and cost-effective than traditional data warehousing technologies, Hadoop is also increasingly used to perform new kinds of analytics that were previously impossible. When it comes to Big Data, retailers are at the forefront of leveraging large volumes of nuanced information about customers, to improve the effectiveness of promotional campaigns, refine pricing models, and lower overall customer acquisition costs. Retailers compete fiercely for consumers’ attention, time, and money, and effective use of analytics can result in sustained competitive advantage. Forward-thinking retailers can now take advantage of all data sources to construct a complete picture of a customer. This invariably consists of both structured data (customer and inventory records, spreadsheets, etc.) and unstructured data (clickstream logs, email archives, customer feedback and comment fields, etc.). This allows, for example, online retailers with structured, transactional sales data to connect that data with unstructured comments from product reviews, providing insight into how reviews affect consumers’ propensity to purchase a particular product. This session will examine several real-world customer use cases applying combined analysis of structured and unstructured data.
Hadoop 1.0 is a significant milestone in being the most stable and robust Hadoop release tested in production against a variety of applications. It offers improved performance, support for HBase, disk-fail-in-place, Webhdfs, etc over previous releases. The next major release, Hadoop 0.23 offers several significant HDFS improvements including new append-pipeline, federation, wire compatibility, NameNode HA, further performance improvements, etc. We describe how to take advantages of the new features and their benefits. We also discuss some of the misconceptions and myths about HDFS. The second half of the talk describes our plans for HDFS over the next year. This includes improvements such as Snapshots, Disaster recovery, RAID, performance, scaling, etc.
by Jimmy Lin
The success of data-driven solutions to difficult problems, along with the dropping costs of storing and processing massive amounts of data, has led to growing interest in large-scale machine learning. We presents a case study of Twitter`s integration of machine learning tools into its existing Hadoop-based, Pig-centric analytics platform. Prior to this work, the platform has already demonstrated value in “traditional” data warehousing and business intelligence tasks. Here, we focus on extensions to provide predictive analytics capabilities that incorporate machine learning, focused specifically on supervised classification. In our deployed solution, common machine learning tasks such as data sampling, feature generation, training, and testing can be accomplished directly in Pig, via carefully crafted loaders, storage functions, and user-defined functions. This means that machine learning is just another Pig script, which allows seamless integration with existing infrastructure for data management, scheduling, and monitoring in a production environment, as well as access to rich libraries of user-defined functions and the output of other scripts. This talk is based on a forthcoming paper in SIGMOD 2012.
Trending use cases have pointed out the complementary nature of Hadoop and existing data management systems—emphasizing the importance of leveraging SQL, engineering, and operational skills, as well as incorporating novel uses of MapReduce to improve distributed analytic processing. Many vendors have provided interfaces between SQL systems and Hadoop but have not been able to semantically integrate these technologies while Hive, Pig and SQL processing islands proliferate. This session will discuss how Teradata is working with Hortonworks to optimize the use of Hadoop within the Teradata Analytical Ecosystem to ingest, store, and refine new data types, as well as exciting new developments to bridge the gap between Hadoop and SQL to unlock deeper insights from data in Hadoop. The use of Teradata Aster as a tightly integrated SQL-MapReduce® Discovery Platform for Hadoop environments will also be discussed.
by Mike Brown
This session provides details on how comScore uses Hadoop to process over 30 billion (over 2TB compressed per day) internet and mobile events per day to understand and report on web behavior. This will include the methods to ensure scalability and provide high uptime for this mission critical data. The talk will highlight the use of Hadoop to determine and calculate the metrics used in it’s flagship MediaMetrix product. Details on how disparate information sources can be quickly and efficiently combined and analyzed will be addressed. The talk will also detail how algorithms running on top of Hadoop provide deep insight into user behavior and can be used to develop broad insights into user behavior. The session also touches on comScores best demonstrated practices for large scale processing with Hadoop and the ability to combine profile information on its panelists and their Web activities to develop insights about internet usage.
by Michael Zeller and Ulrich Rueckert
While Hadoop provides an excellent platform for data aggregation and general analytics, it also can provide the right platform for advanced predictive analytics against vast amounts of data, preferably with low latency and in real-time. This drives the business need for comprehensive solutions that combine the aspects of big data with an agile integration of data mining models. Facilitating this convergence is the Predictive Model Markup Language (PMML), a vendor-independent standard to represent and exchange data mining models that is supported by all major data mining vendors and open source tools. This presentation will outline the benefits of the PMML standard as key element of data science best practices and its application in the context of distributed processing. In a live demonstration, we will showcase how Datameer and the Zementis Universal PMML Plug-in take advantage of a highly parallel Hadoop architecture to efficiently derive predictions from very large volumes of data. In this session, the audience will learn: How to leverage predictive analytics in the context of big data — Introduction to the Predictive Model Markup Language (PMML) open standard for data mining — How to reduce cost and complexity of predictive analytics
The HDFS NameNode is a robust and reliable service as seen in practice in production at Yahoo and other customers. However, the NameNode does not have automatic failover support. A hot failover solution called HA NameNode is currently under active development (HDFS-1623). This talk will cover the architecture, design and setup. We will also discuss the future direction for HA NameNode.
by Vishal Malik
Existing storage tiering solutions are mostly hardware based and very expensive. RAID costs are very high and policy based tiering is not transparent to the user and is done mostly at hardware level. iMStor Platform that we have developed allows easy to manage storage systems with policy based management at the user level to control what type of data lives in what type of Hadoop based storage engine ranging from HBase, MongoDB to CouchDB, Riak or Redis. iMStor provides software based storage engine based on Hadoop stack for doing all the intelligent work which today is done in hardware and non-Hadoop based platforms.
by Oscar Boykin
Scala is a functional programming language on the JVM. Hadoop uses a functional programming model to represent large-scale distributed computation. Scala is thus a very natural match for Hadoop. We will present Scalding, which is built on top of Cascading. Scalding brings an API very similar to Scala`s collection API to allow users to write jobs as they might locally and run those Jobs at scale. Scalding differs from Pig in that there is a single syntax for your entire job. There is no concept of UDFs, instead, users can inline functions, or use their own java or scala classes directly in the Scalding job. Twitter uses Scalding for data analysis and machine learning, particularly in cases where we need more than sql-like queries on the logs, for instance fitting models and matrix processing. It scales beautifully from simple, grep-like jobs all the way up to jobs with hundreds of map-reduce pairs. This talk will present the Scalding DSL and show some example jobs for common use cases.
by Eric C Yang and Kan Zhang
To deploy the whole Hadoop stack and manage its full life cycle is no small feat. This is partly due to the interdependency among services and partly due to the large number of nodes in a Hadoop cluster. For example, when upgrading an HDFS cluster, one needs to make sure NameNode is upgraded successfully before starting up the DataNodes. Such cross-node dependencies are difficult to specify in existing configuration management systems like Puppet. Hadoop clusters can have thousands of nodes and a desirable configuration management system needs to handle multiple such clusters. For example, a use case may be decommissioning a set of nodes from one cluster and adding them to another. This calls for scalable cluster status monitoring and action status tracking. In this session, we present a highly scalable configuration management solution that supports arbitrary state-based cross-node dependencies.
by Ron Bodkin and Mike Peterson
Neustar is a fast growing provider of enterprise services in telecommunications, online advertising, Internet infrastructure, and advanced technology. Neustar has engaged Think Big Analytics to leverage Hadoop to expand their data analysis capacity. This session describes how Hadoop has expanded their data warehouse capacity, agility for data analysis, reduced costs, and enabled new data products. We look at the challenges and opportunities in capturing 100′s of TB’s of compact binary network data, ad hoc analysis, integration with a scale out relational database, more agile data development, and building new products integrating multiple big data sets.
I`ll describe what I see as the most current and exciting big data challenges at NASA, across a number of application domains: * Planetary Science * Next generation Earth science decadal missions * Radio Astronomy and the next generation instruments including the Square Kilometre Array * Snow Hydrology and Climate Impacts The session will focus on defining the problem space, suggesting how technologies like Apache Hadoop can be leveraged, and where Apache Hadoop fits against the Apache OODT technology, pioneered originally by NASA and now another Apache big data technology growing at the foundation. In addition, I will present what I feel to be the key next steps architecturally that the Hadoop community can take to better integrate and apply to the realm of science data systems.
by Abe Taha
Hadoop is fast becoming an integral part of BI infrastructures to empower everyone in the business to affordably analyze large amounts of multi-structured data not previously possible with traditional OLAP or RDBMS technologies. This session will cover best practices for implementing Big Data Analytics directly on Hadoop. Topics will include architecture, integration and data exchange; self-service data exploration and analysis; openness and extensibility; the roles of Hive and the Hive metastore; integration with BI dashboards, reporting and visualization tools; using existing SAS and SPSS models and advanced analytics; ingesting data with Sqoop, Flume, and ETL.
by Benjamin Reed
Zookeeper is an important member of the Hadoop ecosystem. It is widely used in industry to coordinate distributed systems. A common use-case of Zookeeper is to dynamically maintain membership and other configuration metadata for its users. Zookeeper itself is a replicated distributed system. Unfortunately, the membership and all other configuration parameters of Zookeeper are static — they’re loaded during boot and cannot be altered. Operators resort to “rolling restart”, a manually intensive and error-prone method of changing the configuration that has caused data loss and inconsistency in production. Automatic reconfiguration functionality has been requested by operators since 2008 (Zookeeper-107). Several previous proposals were found incorrect and rejected. We implemented a new reconfiguration protocol in Zookeeper and are currently integrating it into the codebase. It fully automates configuration changes: the set of Zookeeper servers, their roles, addresses, etc. can be changed dynamically, without service interruption and while maintaining data consistency. By leveraging the properties already provided by Zookeeper our protocol is considerably simpler than state of the art. Our protocol also encompasses the clients — clients are rebalanced across servers in the new configuration, while keeping the extent of migration proportional to the change in membership.
by Stephen O'Sullivan and Jeremy King
Use case – Research project to look at Hadoop for a business case, and moving the results -Became the store for all web site performance / platform stats for operation analytics – 8 different groups start using the the cluster and it went from 8 to 12 nodes (with a total of 64TB) within (7 months) – Started to do more business analytics, but had to start moving data around as space becomes an issue – Business funds a 250+ node cluster is purchased with a total of 1.2PB of space… – …And this cluster will increase by 200+ nodes by the end of the year How Hadoop will be used on the next generation @WalmartLabs / walmart.com platform. – Storing all the fine grain data generated by walmart.com – Realtime analytics for site performance and alerting – Hybrid model * with Data Warehouses (that are using current BI tools) * with noSQL
We use “SaasBase Analytics” to incrementally process large heterogeneous data sets into pre-aggregated, indexed views, stored in HBase to be queried in realtime. The requirement we started from was to get large amounts of data available in near realtime (minutes) to large amounts of users for large amounts of (different) queries that take milliseconds to execute. This set our problem apart from classical solutions such as Hive and PIG. In this talk I`ll go through the design of the solution and the strategies (and hacks) to achieve low latency and scalability from theoretical model to the entire process of ETL to warehousing and queries.
by Tim Mallalieu
There are a number of excellent databases that have proven invaluable when working with `big data` – including Cassandra, Riak, DynamoDB, MongoDB, and HBase. So, how do you decide which is the right solution for you? This talk will start by briefly discussing some of the theory involved with distributed databases – from the oft-cited (and almost-as-often misunderstood) CAP theorem, to vector clocks and the difficulties of eventual consistency, and much more. We will then compare how mainstream distributed databases make use of these concepts, and the tradeoffs they incur. Unfortunately, due to the nature of these tradeoffs, there is no one-size-fits all solution. So, we`ll discuss what classes of problems each of these systems is appropriate for – and include some real world benchmarks – to help you make an informed decisions about your distributed systems.
NetApp collects 500 TB per year of unstructured data from devices that phone home, sending unstructured auto-support log and configuration data back to centralized data centers. NetApp needed to scale their architecture, to provide timely support data, to extend their data warehouse, and to be able to do ad hoc analysis and build predictive models for device support,product analysis and cross-sales. This presentation presents the challenges and a solution architecture based on Apache Hadoop, Apache HBase, Apache Flume, Apache AVRO, HIVE as well as PIG and Solr. The presentation reviews lessons from implementation.
In the last two years, Netflix has moved from a data center architecture to a completely cloud-oriented one. We now leverage Amazon EC2 for cloud computing and Amazon S3 for storage. At the same time, Hadoop has become central for carrying out research and analysis in data science. Using Hadoop alleviates our concerns about data sparsity, sampling bias and memory constraints and we can use as much data as is available at our disposal. In this talk, we will present several efficient and scalable in-house MapReduce implementations of algorithms for search retrieval, personalized recommendation, and video auto-tagging. We will discuss how we augment Netflix with social and audience data like Twitter, Facebook, Wikipedia, Nielsen and Rentrak, and why Hadoop is suitable for handling their scale and schema. We will highlight an application on how we set up an automated pipeline to generate interesting visualizations, trends and actionable insights from Netflix and third party data. Outside of training models, we use Hadoop in data preparation, feature engineering, feature selection and model evaluation. We baseline our performance with alternative approaches and report significant lifts in time and space obtained by using Hadoop.
Learn how to integrate MongoDB with Hadoop for large-scale distributed data processing. Using tools like MapReduce, Pig and Streaming you will learn how to do analytics and ETL on large datasets with the ability to load and save data against MongoDB. With Hadoop MapReduce, Java and Scala programmers will find a native solution for using MapReduce to process their data with MongoDB. Programmers of all kinds will find a new way to work with ETL using Pig to extract and analyze large datasets and persist the results to MongoDB. Python and Ruby Programmers can rejoice as well in a new way to write native Mongo MapReduce using the Hadoop Streaming interfaces.
by David Mariani and Denny Lee
In this age of Big Data, data volumes grow exceedingly larger while the technical problems and business scenarios become more complex. Compounding these complexities, data consumers are demanding faster analysis to common business questions asked of their Big Data. This session provides concrete examples of how to address this challenge. We will highlight the use of Big Data technologies—including Hadoop and Hive —with classic BI systems such as SQL Server Analysis Services.
• Understand the architectural components surrounding Hadoop, Hive, Classic BI, and the Tier-1 BI ecosystem
• Get strategies for addressing the technical issues when working with extremely large cubes
• See how to address the technical issues when working with Big Data systems from the DBA perspective
by Travis Dawson
This session shows how Hadoop enables deep analytics over massive amounts of network data, and how to extract information and value using Hadoop at the core of a complete analytics system. Narus, a division of Boeing, helps customers unlock the value of their networks with dynamic network traffic intelligence and analysis of information on IP traffic and flow data. This session provides details on how real-time traffic capture and analysis integrates with Hadoop to perform extremely complex analytics over vast quantities of data in a demanding environment to produce actionable information. The uses for these analytics range from simple network analysis to providing complex security detection and mitigation analysis. Terabytes of forensic data of network traffic are processed to isolate suspicious patterns of behavior, allowing further analysis to pinpoint malicious traffic and operators to take action.
Apache ZooKeeper has become a de facto standard for distributed coordination. _Its design has proven to be flexible enough that it can be applied to a variety of needs of distributed applications. It has been used for leader election, service discovery, status monitoring, dynamic configuration etc. Recently new use cases have come up where ZooKeeper is being used as a discovery service with thousands of clients. Couple of examples include Hadoop Namenode HA and Yarn HA. This has led to a new set of requirements that need to be addressed. There is a need for session-less read-only client creation to address startup latency issues of thousands of clients . Also, such scale creates a need for reducing memory footprint of watch management in ZooKeeper. In this talk we will discuss the various new use cases that are coming up in Apache ZooKeeper and the work that is being done in the community to address these issues. We will also discuss the future roadmap for ZooKeeper.
by Kathleen Ting and Bilung Lee
Apache Sqoop (incubating) was created to efficiently transfer big data between Hadoop related systems (such as HDFS, Hive, and HBase) and structured data stores (such as relational databases, data warehouses, and NoSQL systems). The popularity of Sqoop in enterprise systems confirms that Sqoop does bulk transfer admirably. In the meantime, we have encountered many new challenges that have outgrown the abilities of the current infrastructure. To fulfill more data integration use cases as well as become easier to manage and operate, a new generation of Sqoop, also known as Sqoop 2, is currently undergoing development to address several key areas, including ease of use, ease of extension, and security. This session will talk about Sqoop 2 from both the development and operations perspectives.
by Marie-Luce Picard and Bruno Jacquin
Worldwide, smart-grid projects are being launched and motivated by economic constraints, regulatory aspects, social or environmental needs. In France, a recent law implies the future deployment of about 35 millions of meters. Scalable solutions for storing and processing huge amount of metering data are needed: relational database technologies as well as Big Data approach like Hadoop can be considered. What are the main difficulties for a scalable and efficient storage when different types of queries are to be implemented and run ? – operational and analytics needs – variable levels of complexity with variable scopes and frequencies, – variable acceptable latencies. We will describe a current work for storing and mining large amounts of metering data (about 1800 billions of records for metering measurements, annual volume of about 120 To of raw data) using a Hadoop based solution. We will focus on : – physical and logical architecture, – time-series data modelling, impact of types of compression, as well as HDFS block sizes, – Hive and Pig for analytical queries and HBase for point queries. Finally we will discuss on what can be the added value of using Hadoop as a component in a future Information System for utilities.
13th–14th June 2012