Learn how to leverage data exhaust, the digital byproduct of our online activities, to solve problems and discover insights about the world around you. We will walk through a real-world example that combines several datasets and statistical techniques to make predictions about attendees at O'Reilly Strata.
by Sam Shah
How do you go about building a product around data using Hadoop? This talk will present how LinkedIn builds and maintains features such as People You May Know. We will present our open-source architecture for doing so, as well as the knowledge we've gained in the process.
by Rod Cope
Hadoop and HBase make it easy to store terabytes of data, but how do you scale your search mechanism to sift through these mountains of bits and retrieve large result sets in a matter of milliseconds? Careful use of the Solr search server, based on Lucene, made these requirements come to life in our production environment. Come learn how we query terabytes of data in a highly available system.
by Doug Cutting
Apache Avro provides an expressive, efficient standard for representing large data sets. Avro data is programming-language neutral and MapReduce-friendly. The hope is that it can replace gzipped, CSV-like formats as a dominant format for data.
by Isabel Drost
With growing amounts of digital data at the fingertips of software developers, the need for a scalable, easy-to-use framework is tremendous. This talk introduces Apache Mahout, a project with the goal of implementing scalable machine learning algorithms for the masses.
1st–3rd February 2011