by Doug Cutting
Apache Hadoop forms the kernel of an operating system for Big Data. This ecosystem of interdependent projects enables institutions to affordably explore ever vaster quantities of data. The platform is young, but it is strong and vibrant, built to evolve.
by Mike Olson
Tools for attacking big data problems originated at consumer internet companies, but the number and variety of big data problems have spread across industries and around the world. I’ll present a brief summary of some of the critical social and business problems that we’re attacking with the open source Apache Hadoop platform.
RHadoop is an open source project spearheaded by Revolution Analytics to give data scientists access to Hadoop’s scalability from their favorite language, R. RHadoop consists of three packages.
rmr lets R developers program in the MapReduce framework, and offers all developers an alternative way to implement MapReduce programs that strikes a delicate compromise between power and usability. It lets you write general MapReduce programs, with the full power and ecosystem of an existing, established programming language. It doesn’t force you to replace the R interpreter with a special run-time: it is just a library. You can write logistic regression in half a page and even understand it. It feels and behaves almost like the usual R iteration and aggregation primitives. It consists of a handful of functions with a modest number of arguments and sensible defaults that combine in many useful ways. But there is no way to prove that an API works: one can only show examples of what it makes possible, and we will do that with a few from machine learning and statistics. Finally, we will discuss how to get involved.
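As a rough illustration of the programming model described above, the following sketch runs a map step, a group-by-key step, and a reduce step inside a single process. It is written in Python rather than R, and the names `mapreduce`, `count_words`, and `total` are hypothetical; it does not reproduce rmr's actual API.

```python
from collections import defaultdict


def mapreduce(records, map_fn, reduce_fn):
    """Run map, group-by-key, and reduce in one process.

    Hypothetical helper for illustration only; not part of rmr or Hadoop.
    """
    groups = defaultdict(list)
    for record in records:
        # map_fn yields zero or more (key, value) pairs per input record.
        for key, value in map_fn(record):
            groups[key].append(value)
    # reduce_fn collapses all values collected for one key into one result.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def count_words(line):
    """Map step: emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def total(key, values):
    """Reduce step: sum the counts gathered for one word."""
    return sum(values)


word_counts = mapreduce(["big data", "Big clusters"], count_words, total)
```

The caller supplies only the two small functions; grouping and iteration stay inside the library call, which is the kind of division of labor the abstract describes.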
During the last 12 months, Apache Hadoop has received an enormous amount of attention for its ability to transform the way organizations capitalize on their data in a cost-effective manner. The technology has evolved to a point where organizations of all sizes and industries are testing its power as a potential solution to their own data management challenges.
However, there are still technology and knowledge gaps hindering adoption of Apache Hadoop as an enterprise standard. Among these gaps are the complexity of the system, the scarcity of technical content to assist with its use, and the intensive developer and data scientist skills required to use it properly. With virtually every Fortune 500 company constructing its Hadoop strategy today, many in the IT community are wondering what the future of Hadoop will look like.
In this session, Hortonworks CEO Eric Baldeschwieler will look at the current state of Apache Hadoop, explain how the ecosystem is evolving by working together to close the existing technological and knowledge gaps, and present a roadmap for the future of the project.
by Asad Khan
In this session we will discuss two key aspects of using JavaScript in the Hadoop environment. The first is how we can reach a much broader set of developers by enabling JavaScript support on Hadoop. The JavaScript fluent API, which works on top of other languages like PigLatin, lets developers define MapReduce jobs in a style that is much more natural, even to those who are unfamiliar with the Hadoop environment.
The second is how to enable simple experiences directly through an HTML5-based interface. The lightweight web interface gives developers the same experience they would get on the server, and provides a zero-installation experience across all client platforms. It also allows us to use HTML5 support in browsers to provide basic data visualization for quick data analysis and charting.
During the session we will also share how we used other open source projects like Rhino to enable JavaScript on top of Hadoop.
by Rohit Valia
The Hadoop framework is an established solution for big data management and analysis. In practice, Hadoop applications vary significantly. Your data center infrastructure is used by multiple lines of business and multiple differing workloads.
This session looks at the requirements for a multi-tenant big data cluster: one where different lines of business, different projects, and multiple applications can run with assured SLAs, resulting in higher utilization and ROI for these clusters.
This session is sponsored by Platform Computing
Learn how to integrate MongoDB with Hadoop for large-scale distributed data processing. Using tools like MapReduce, Pig, and Streaming, you will learn how to do analytics and ETL on large datasets, with the ability to load and save data against MongoDB. With Hadoop MapReduce, Java and Scala programmers will find a native solution for using MapReduce to process their data with MongoDB. Programmers of all kinds will find a new way to work with ETL, using Pig to extract and analyze large datasets and persist the results to MongoDB. Python and Ruby programmers can rejoice as well in a new way to write native Mongo MapReduce using the Hadoop Streaming interfaces.
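To give a sense of what a Streaming job looks like from Python, here is a minimal sketch of a mapper and reducer that follow the plain Hadoop Streaming convention of tab-separated key/value lines, with the reducer receiving its input sorted by key. The comma-separated input layout and the `run_locally` helper are assumptions made for illustration; the MongoDB connector's own input and output handling is not shown.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'key<TAB>1' for each record, keyed on the first comma-separated field.

    The comma-separated layout is an assumption for this example.
    """
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if fields[0]:
            yield f"{fields[0]}\t1"


def reducer(sorted_lines):
    """Sum the counts for each key; input lines must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_locally(lines):
    """Hypothetical helper: imitate map, sort, and reduce without a cluster."""
    return list(reducer(sorted(mapper(lines))))


result = run_locally(["us,alice\n", "uk,bob\n", "us,carol\n"])
```

In an actual Streaming job the mapper and reducer would typically be separate scripts driven by standard input and output, with the framework performing the sort between them.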
Using Hadoop-based business intelligence analytics, this session examines the Hadoop source code and its development over time, and shares some interesting and fun facts with the audience. The talk illustrates text and related analytics with Hadoop on Hadoop to reveal the hidden secrets of the elephant.
This entertaining session highlights the value of data correlation across multiple datasets and the visualization of those correlations to reveal hidden data relationships.
Santa Clara, United States
28th February to 1st March 2012