by Vipul Sharma
Recommendation systems have become critical for delivering relevant, personalized content to users. Such systems not only drive revenue and significant user engagement for web companies but also serve as a great discovery tool for users. Facebook’s news feed, LinkedIn’s “People You May Know” and Eventbrite’s event recommendations are some great examples of recommendation systems.
During this talk we will share the architecture and design of Eventbrite’s data platform and recommendation engine. We will describe how we mined a massive social graph of 18M users and 6B first-degree connections to provide relevant event recommendations, and we will provide details of our data platform, which processes more than 2 TB of social graph data daily. We intend to describe how Hadoop is becoming the most important tool for data mining, and also to discuss how machine learning is changing in the presence of Hadoop and big data.
We hope to provide enough details that folks can learn from our experiences while building their data platform and recommendation systems.
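The core idea of friend-graph mining for recommendations can be illustrated with a toy sketch. This is not Eventbrite's actual pipeline; the data, function names, and the single-machine simulation of the map and reduce phases are all illustrative assumptions, standing in for jobs that would run over the full social graph on Hadoop.

```python
from collections import Counter

# Toy stand-ins for the social graph and attendance data (assumptions,
# not real Eventbrite data structures).
FRIENDS = {"alice": ["bob", "carol"], "bob": ["alice"], "carol": ["alice"]}
ATTENDING = {"bob": ["concert", "meetup"], "carol": ["concert"], "alice": ["meetup"]}

def map_phase(user, friends):
    """Map step: emit a (user, event) pair for every event a friend attends."""
    for friend in friends:
        for event in ATTENDING.get(friend, []):
            yield user, event

def reduce_phase(pairs, already_attending):
    """Reduce step: count friend attendance per event and rank, dropping
    events the user already attends."""
    counts = Counter(e for _, e in pairs if e not in already_attending)
    return [event for event, _ in counts.most_common()]

def recommend(user):
    pairs = list(map_phase(user, FRIENDS.get(user, [])))
    return reduce_phase(pairs, set(ATTENDING.get(user, [])))
```

At scale, the same mapper and reducer logic would be distributed as Hadoop jobs, with the shuffle phase grouping the emitted pairs by user.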
Once social media and web companies discovered Hadoop as the good-enough solution for any data analytics problem that did not fit into MySQL, Hadoop began a rapid rise in the financial industry. The reasons the financial industry is adopting Hadoop so quickly are very different from those in other industries. Banks are typically not engineering-driven organizations, and terms like agile development, shared root keys or crontab scheduling are no-gos in a bank but standard around Hadoop.
This entertaining talk, for bankers and other financial-services managers with technical experience as well as for engineers, discusses four business intelligence platform deployments on Hadoop:
1. Long-term storage and analytics of transactions, and the huge cost savings Hadoop can provide;
2. Identifying cross-sell and up-sell opportunities by analyzing web log files in combination with customer profiles;
3. Value-at-risk analytics; and
4. Understanding SLA issues and identifying problems in a service-oriented architecture spanning thousands of nodes.
This session discusses the different use cases and the challenges to overcome in building and using BI on Hadoop.
In video surveillance, hundreds of hours of video recordings are culled from multiple cameras. Within this video are hours of recordings that do not change from one minute to the next, one hour to the next and, in some cases, one day to the next. Identifying information in this video that is interesting and that can be shared, analyzed and viewed by a larger community is a time-consuming task that often requires human intervention assisted by digital processing tools.
Using Map/Reduce we can harness parallel processing and clusters of graphical processors to identify and tag useful periods of time for faster analysis. The result is an aggregate video file that contains metadata tags linking back to the start of those scenes in the original file. In essence, this creates an index into hundreds of thousands of hours of recording that can be reviewed, shared and analyzed by a much larger group of individuals.
This session will review examples where this is being done in the real world and discuss how to develop a Hadoop pipeline that breaks a video down into scenes, analyzes them in map tasks to determine interest, and then reduces them into a single index file containing 30 seconds of recording around each scene. Moreover, the file will contain the necessary metadata to jump back to the start point in the original and allow the viewer to see the scene in the context of the entire recording.
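The map-then-reduce structure described above can be sketched in a few lines. This is a minimal single-machine illustration, not the speakers' implementation: the inter-frame difference scores, threshold, and clip length are assumed inputs, standing in for the per-scene analysis the map tasks would perform on a cluster.

```python
def detect_scenes(frame_diffs, threshold=0.5, fps=30):
    """Map step (sketch): flag timestamps where the inter-frame difference
    score exceeds a threshold, i.e. something changed in the picture."""
    return [i / fps for i, d in enumerate(frame_diffs) if d > threshold]

def build_index(scene_times, clip_seconds=30):
    """Reduce step (sketch): merge nearby detections into one index entry
    per scene, each pointing back at a start offset in the original file."""
    index = []
    for t in scene_times:
        if not index or t - index[-1]["start"] > clip_seconds:
            index.append({"start": t, "length": clip_seconds})
    return index
```

In a real job, each mapper would score one chunk of video and the reducer would merge the flagged timestamps into the single index file.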
by Ed Kohlwey
While Map/Reduce is an excellent environment for some parallel computing tasks, there are many ways to use a cluster beyond Map/Reduce. Within the last year, YARN and NextGen Map/Reduce have been contributed to the Hadoop trunk, Mesos has been released as an open source project, and a variety of new parallel programming environments have emerged, such as Spark, Giraph, Golden Orb, Accumulo, and others.
We will discuss the features of YARN and Mesos, and talk about obvious yet relatively unexplored uses of these cluster schedulers as simple work queues. Examples will be provided in the context of machine learning. Next, we will provide an overview of the Bulk-Synchronous-Parallel model of computation, and compare and contrast the implementations that have emerged over the last year. We will also discuss two other alternative environments: Spark, an in-memory version of Map/Reduce which features a Scala-based interpreter; and Accumulo, a BigTable-style database that implements a novel model for parallel computation and was recently released by the NSA.
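To make the Bulk-Synchronous-Parallel model concrete, here is a toy superstep loop in the style of Pregel-like systems such as Giraph. It propagates the maximum value through a graph; the graph, the function name, and the sequential simulation are all illustrative assumptions, not any framework's API.

```python
def bsp_max(graph, values):
    """Toy BSP sketch: in each superstep, every vertex sends its value to
    its neighbours, then adopts the largest value it has seen; the run
    halts when a superstep changes nothing (all vertices 'vote to halt')."""
    changed = True
    while changed:                               # one iteration == one superstep
        changed = False
        inbox = {v: [] for v in graph}
        for v, neighbours in graph.items():      # communication phase
            for n in neighbours:
                inbox[n].append(values[v])
        for v in graph:                          # computation phase (barrier passed)
            best = max([values[v]] + inbox[v])
            if best != values[v]:
                values[v] = best
                changed = True
    return values
```

Real BSP frameworks distribute the vertices across workers and enforce the barrier between supersteps; the per-vertex logic, however, looks much like the computation phase above.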
NetApp is a fast growing provider of storage technology. Its devices “phone home” regularly, sending unstructured auto-support log and configuration data back to centralized data centers. This data is used to provide timely support, to improve sales, and to plan product improvements. To allow this, data is collected, organized, and analyzed. The system currently ingests 5 TB of compressed data per week, which is growing 40% per year. NetApp was previously storing flat files on disk volumes and keeping summary data in relational databases. Now NetApp is working with Think Big Analytics, deploying Hadoop, HBase and related technologies to ingest, organize, transform and present auto-support data. This will enable business users to make decisions and provide timely response, and will enable automated response based on predictive models. Key requirements include:
In this session we look at the lessons learned while designing and implementing a system to:
by Paul Brown
Scientists dealt with big data and big analytics for at least a decade before the business world precipitated buzzwords like ‘Big Data’, ‘Data Tsunami’ and ‘the Industrial Revolution of Data’ out of the strange broth of its marketing and came to realize it had the same problems. Both the scientific world and the commercial world share the requirement for a high-performance informatics platform supporting the collection, curation, collaboration, exploration, and analysis of massive datasets.
In this talk we will sketch the design of SciDB, explain how it differs from Hadoop-based systems, SQL DBMS products, and NoSQL platforms, and explain why those differences matter. We will present benchmarking data and a computational genomics use case that showcase SciDB’s massively scalable parallel analytics.
SciDB is an emerging open source analytical database that runs on a commodity hardware grid or in the cloud. SciDB natively supports:
• An array data model – a flexible, compact, extensible data model for rich, high-dimensional data
• Massively scalable math – non-embarrassingly-parallel operations like linear algebra on matrices too large to fit in memory, as well as transparently scalable R, MATLAB, and SAS-style analytics without requiring code for data distribution or parallel computation
• Versioning and Provenance – Data is updated, but never overwritten. The raw data, the derived data, and the derivation are kept for reproducibility, what-if modeling, back-testing, and re-analysis
• Uncertainty support – data carry error bars, probability distributions or confidence metrics that can be propagated through calculations
• Smart storage – compact storage for both dense and sparse data that is efficient for location-based, time-series, and instrument data
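A coordinate store gives a feel for why sparse arrays can be kept compactly. The class below is a minimal sketch, loosely in the spirit of an array data model; the class name, the `window` method, and its semantics are illustrative assumptions, not SciDB's actual API or storage format.

```python
class SparseArray:
    """Toy 2-D sparse array: only non-empty cells are stored, keyed by
    their (row, col) coordinates."""

    def __init__(self):
        self.cells = {}                      # (row, col) -> value

    def set(self, i, j, value):
        self.cells[(i, j)] = value

    def get(self, i, j, default=0):
        return self.cells.get((i, j), default)

    def window(self, i0, i1, j0, j1):
        """Select the sub-array in [i0, i1) x [j0, j1), re-origined at
        (0, 0) — a hypothetical stand-in for subarray-style operators."""
        out = SparseArray()
        for (i, j), v in self.cells.items():
            if i0 <= i < i1 and j0 <= j < j1:
                out.set(i - i0, j - j0, v)
        return out
```

Storage proportional to the number of non-empty cells, rather than to the full extent of the dimensions, is what makes this layout efficient for sparse location-based and instrument data.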
28th February to 1st March 2012