This tutorial provides a solid foundation for those seeking to understand large-scale data processing with MapReduce, Hadoop and the associated ecosystem. The session is intended for those who are new to Hadoop and want to understand where Hadoop is appropriate and how it fits with existing systems.
The agenda will include:
This tutorial will explain how to leverage a Hadoop cluster to do data analysis using Java MapReduce, Apache Hive and Apache Pig. It is recommended that participants have experience with some programming language. Topics include:
by Sam Shah
Collaborative filtering is a method of making predictions about a user’s interests based on the preferences of many other users. It’s used to make recommendations on many Internet sites, including LinkedIn. For instance, there’s a “Viewers of this profile also viewed” module on a user’s profile that shows other covisited pages. This “wisdom of the crowd” recommendation platform, built atop Hadoop, exists across many entities on LinkedIn, including jobs, companies, etc., and is a significant driver of engagement.
During this talk, I will build a complete, scalable item-to-item collaborative filtering MapReduce flow in front of the audience. We’ll then get into some performance optimizations, model improvements, and practical considerations: a few simple tweaks can result in an order of magnitude performance improvement and a substantial increase in clickthroughs from the naive approach. This simple covisitation method gets us more than 80% of the way to the more sophisticated algorithms we have tried.
This is a practical talk that is accessible to all.
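As a rough illustration of the covisitation idea in this abstract, the following Python sketch mimics a map, shuffle and reduce flow in memory. It is not the speaker's implementation: the function names, the (user, item) input format and the ranking by raw covisit count are all assumptions made for the example, and none of the optimizations mentioned in the talk are included.

```python
from collections import defaultdict
from itertools import permutations


def map_sessions(views):
    """Group (user, item) view events by user (stands in for the shuffle)."""
    by_user = defaultdict(set)
    for user, item in views:
        by_user[user].add(item)
    return by_user


def emit_pairs(by_user):
    """Emit ((item_a, item_b), 1) for every ordered pair one user viewed."""
    for items in by_user.values():
        for a, b in permutations(sorted(items), 2):
            yield (a, b), 1


def reduce_counts(pairs):
    """Sum the emitted counts per item pair."""
    counts = defaultdict(int)
    for key, n in pairs:
        counts[key] += n
    return counts


def top_k(counts, k=2):
    """For each item, keep the k items most often covisited with it."""
    per_item = defaultdict(list)
    for (a, b), n in counts.items():
        per_item[a].append((n, b))
    return {
        a: [b for _, b in sorted(scored, key=lambda s: (-s[0], s[1]))[:k]]
        for a, scored in per_item.items()
    }
```

Calling `top_k(reduce_counts(emit_pairs(map_sessions(views))))` on a list of (user, item) tuples returns, for each item, the items most often viewed by the same users. In a real Hadoop job each function would correspond to a separate map or reduce stage.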
by Asad Khan
The second one is how to enable simple experiences directly through an HTML5-based interface. The lightweight web interface gives developers the same experience as they would get on the server, and provides a zero-installation experience across all client platforms. This also allowed us to use HTML5 support in browsers to provide basic data visualization for quick data analysis and charting.
Learn how to integrate MongoDB with Hadoop for large-scale distributed data processing. Using tools like MapReduce, Pig and Streaming, you will learn how to do analytics and ETL on large datasets, with the ability to load data from and save data to MongoDB. With Hadoop MapReduce, Java and Scala programmers will find a native solution for using MapReduce to process their data with MongoDB. Programmers of all kinds will find a new way to work with ETL using Pig to extract and analyze large datasets and persist the results to MongoDB. Python and Ruby programmers can rejoice as well in a new way to write native Mongo MapReduce using the Hadoop Streaming interfaces.
In video surveillance, hundreds of hours of video recordings are culled from multiple cameras. Within this video are hours of recordings that do not change from one minute to the next, one hour to the next and, in some cases, one day to the next. Identifying information that is interesting and that can be shared, analyzed and viewed by a larger community from this video is a time-consuming task that often requires human intervention assisted by digital processing tools.
Using Map/Reduce, we can harness parallel processing and clusters of graphics processors to identify and tag useful periods of time for faster analysis. The result is an aggregate video file that contains metadata tags that link back to the start of those scenes in the original file. In essence, this creates an index into hundreds of thousands of hours of recording that can be reviewed, shared and analyzed by a much larger group of individuals.
This session will review examples where this is being done in the real world and discuss the process for developing a Hadoop process that can break a video down into scenes that are analyzed by maps to determine interest and then reduced into a single index file that contains 30 seconds of recording around that scene. Moreover, the file will contain the necessary metadata to jump back into the original at the start point and allow the viewer to view the scene in context of the entire recording.
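The scene-to-index flow described above can be sketched in plain Python as a map step that filters scenes by interest and a reduce step that builds index entries. This is only an illustration under stated assumptions: the `Scene` fields, the `motion` score, the 0.5 threshold and the choice to start each 30-second clip 15 seconds before the scene start are all hypothetical, and no actual video decoding or Hadoop API is involved.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Scene:
    source: str    # identifier of the original recording (hypothetical)
    start: float   # scene start offset in seconds
    end: float     # scene end offset in seconds
    motion: float  # hypothetical activity score in [0, 1]


def map_scene(scene, threshold=0.5):
    """Map: emit (source, scene) only for scenes judged interesting."""
    if scene.motion >= threshold:
        yield scene.source, scene


def reduce_index(source, scenes, clip_seconds=30.0):
    """Reduce: one index entry per scene, with a clip window around it."""
    entries = []
    for scene in sorted(scenes, key=lambda s: s.start):
        # Assumed policy: begin the clip half a window before the scene.
        clip_start = max(0.0, scene.start - clip_seconds / 2)
        entries.append({
            "source": source,
            "scene_start": scene.start,  # jump-back point in the original
            "clip_start": clip_start,
            "clip_end": clip_start + clip_seconds,
        })
    return entries


def build_index(scenes, threshold=0.5):
    """Run map, group by source (the shuffle), then reduce into one index."""
    grouped = defaultdict(list)
    for scene in scenes:
        for key, value in map_scene(scene, threshold):
            grouped[key].append(value)
    index = []
    for source in sorted(grouped):
        index.extend(reduce_index(source, grouped[source]))
    return index
```

Each entry keeps `scene_start` so that a viewer could jump back into the original recording at the start of the scene, as the abstract describes.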
by Ed Kohlwey
While Map/Reduce is an excellent environment for some parallel computing tasks, there are many ways to use a cluster beyond Map/Reduce. Within the last year, YARN and NextGen Map/Reduce have been contributed to the Hadoop trunk, Mesos has been released as an open source project, and a variety of new parallel programming environments have emerged, such as Spark, Giraph, Golden Orb, Accumulo, and others.
We will discuss the features of YARN and Mesos, and talk about obvious yet relatively unexplored uses of these cluster schedulers as simple work queues. Examples will be provided in the context of machine learning. Next, we will provide an overview of the Bulk-Synchronous-Parallel model of computation, and compare and contrast the implementations that have emerged over the last year. We will also discuss two other alternative environments: Spark, an in-memory version of Map/Reduce which features a Scala-based interpreter; and Accumulo, a BigTable-style database that implements a novel model for parallel computation and was recently released by the NSA.
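For readers unfamiliar with the Bulk-Synchronous-Parallel model mentioned above, here is a minimal single-process Python sketch of its superstep structure: local computation on each vertex, message exchange, then a barrier. It propagates the maximum value through an undirected graph. The function name and the dictionary-based graph format are assumptions for illustration, and the code does not use the API of Giraph, Golden Orb or any other framework.

```python
def bsp_max(graph, values):
    """Propagate the largest value to every vertex of a connected graph.

    graph:  dict mapping each vertex to a list of neighbours (every
            neighbour must also appear as a key)
    values: dict mapping each vertex to its starting number
    """
    values = dict(values)

    # Superstep 0: every vertex sends its value to its neighbours.
    inbox = {v: [] for v in graph}
    for v, neighbours in graph.items():
        for n in neighbours:
            inbox[n].append(values[v])
    supersteps = 1

    # Later supersteps run until no messages are in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v in graph:
            if not inbox[v]:
                continue  # no messages: the vertex stays idle
            best = max(inbox[v])
            if best > values[v]:
                values[v] = best
                for n in graph[v]:
                    outbox[n].append(best)
        # Barrier: messages sent in this superstep are only visible
        # to their recipients in the next one.
        inbox = outbox
        supersteps += 1

    return values, supersteps
```

In a distributed implementation the vertices would be partitioned across workers and the barrier would be a cluster-wide synchronization point. Here a single loop stands in for both.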
28th February to 1st March 2012