by James Dixon and Chris Deptula
The big data world is chaotic, built on technology still in its infancy. Learn how to tame this chaos, integrate big data within your existing data environments (RDBMSs, analytic databases, applications), manage the workflow, orchestrate jobs, improve productivity, and make big data technologies accessible to a much wider spectrum of developers, analysts, and data scientists. Learn how you can actually leverage Hadoop and NoSQL stores via an intuitive, graphical big data IDE, eliminating the need for deep developer skills such as Hadoop MapReduce, Pig scripting, or NoSQL queries.
In this hands-on tutorial, you’ll learn how to install and use Hive for Hadoop-based data warehousing. You’ll also learn some tricks of the trade and how to handle known issues.
Using the Hive Tutorial Tools
We’ll email instructions to you before the tutorial so you can come prepared with the necessary tools installed and ready to go. This prior preparation will let us use the whole tutorial time to learn Hive’s query language and other important topics. At the beginning of the tutorial we’ll show you how to use these tools.
Writing Hive Queries
We’ll spend most of the tutorial using a series of hands-on exercises with actual Hive queries, so you can learn by doing. We’ll go over all the main features of Hive’s query language, HiveQL, and how Hive works with data in Hadoop.
Hive is very flexible about the formats of data files, the “schema” of records and so forth. We’ll discuss options for customizing these and other aspects of your Hive and data cluster setup. We’ll briefly examine how you can write Java user defined functions (UDFs) and other plugins that extend Hive for data formats that aren’t supported natively.
Hive in the Hadoop Ecosystem
We’ll conclude with a discussion of Hive’s place in the Hadoop ecosystem, such as how it compares to other available tools. We’ll discuss installation and configuration issues that ensure the best performance and ease of use in a real production cluster. In particular, we’ll discuss how to create Hive’s separate “metadata” store in a traditional relational database, such as MySQL. We’ll offer tips on data formats and layouts that improve performance in various scenarios.
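As a sketch of the metastore setup mentioned above, the standard approach is to point Hive's `hive-site.xml` at the relational database. The property names below are standard Hive configuration keys, but the host, database name, and credentials are placeholders:

```xml
<!-- hive-site.xml: store Hive metadata in MySQL instead of the
     default embedded Derby database.
     Host, schema name, and credentials are placeholders. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive_password</value>
</property>
```

The MySQL JDBC driver jar must also be on Hive's classpath for this configuration to work.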
This tutorial provides a solid foundation for those seeking to understand large scale data processing with MapReduce and Hadoop, plus its associated ecosystem. This session is intended for those who are new to Hadoop and are seeking to understand where Hadoop is appropriate and how it fits with existing systems.
The agenda will include:
This tutorial will explain how to leverage a Hadoop cluster to do data analysis using Java MapReduce, Apache Hive and Apache Pig. It is recommended that participants have experience with some programming language. Topics include:
We examine the effectiveness of a statistical technique known as survival analysis to optimize the cache time-to-live for hotel rates in a hotel rate cache. We describe how we collect and prepare nearly a billion records per day utilizing MongoDB and Hadoop. Finally, we show how this analysis is improving the operation of our hotel rate cache.
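The abstract above describes the technique at a high level; as a minimal, illustrative sketch (not the presenters' actual pipeline), survival analysis for a rate cache can be reduced to a Kaplan–Meier estimate of how long a cached rate stays valid, from which a TTL is read off at a target validity level. The data and the 0.8 threshold below are invented:

```python
# Kaplan-Meier sketch: how long does a cached hotel rate stay valid?
# Each observation is (age_when_rechecked, still_valid). A rate found
# invalid at age t is an "event"; one still valid at age t is
# right-censored (it survived at least t).

def kaplan_meier(observations):
    """Return [(t, survival_prob)] at each event time t."""
    events = sorted({t for t, valid in observations if not valid})
    survival, prob = [], 1.0
    for t in events:
        at_risk = sum(1 for age, _ in observations if age >= t)
        died = sum(1 for age, valid in observations
                   if age == t and not valid)
        prob *= 1.0 - died / at_risk
        survival.append((t, prob))
    return survival

def choose_ttl(survival, min_validity):
    """Largest event time at which >= min_validity of rates survive."""
    ttl = 0
    for t, p in survival:
        if p >= min_validity:
            ttl = t
    return ttl

# Toy data: (age in minutes when rechecked, was the rate still correct?)
obs = [(10, False), (10, True), (20, False), (30, True),
       (30, False), (40, True), (50, False), (60, True)]
curve = kaplan_meier(obs)
print(choose_ttl(curve, 0.8))  # -> 10
```

At production scale, the map/reduce version of this computes the at-risk and event counts per time bucket in parallel before combining them into the survival curve.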
by Doug Cutting
Apache Hadoop forms the kernel of an operating system for Big Data. This ecosystem of interdependent projects enables institutions to affordably explore ever vaster quantities of data. The platform is young, but it is strong and vibrant, built to evolve.
by Mike Olson
Tools for attacking big data problems originated at consumer internet companies, but the number and variety of big data problems have spread across industries and around the world. I’ll present a brief summary of some of the critical social and business problems that we’re attacking with the open source Apache Hadoop platform.
RHadoop is an open source project spearheaded by Revolution Analytics to grant data scientists access to Hadoop's scalability from their favorite language, R. RHadoop comprises three packages.
rmr allows R developers to program in the MapReduce framework, and offers all developers an alternative way to implement MapReduce programs, one that strikes a delicate compromise between power and usability. It lets you write general MapReduce programs with the full power and ecosystem of an existing, established programming language. It doesn't force you to replace the R interpreter with a special run-time; it is just a library. You can write logistic regression in half a page and even understand it. It feels and behaves almost like the usual R iteration and aggregation primitives. It comprises a handful of functions with a modest number of arguments and sensible defaults that combine in many useful ways. But there is no way to prove that an API works: one can only show examples of what it allows you to do, and we will do that, covering a few from machine learning and statistics. Finally, we will discuss how to get involved.
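rmr itself is an R library; as a language-neutral illustration of the map/reduce programming model it wraps, here is a minimal in-memory sketch of the same primitives in Python (the `mapreduce` helper is an invented stand-in for the pattern, not rmr's API):

```python
from itertools import groupby
from operator import itemgetter

def mapreduce(data, map_fn, reduce_fn):
    """In-memory sketch of the mapreduce primitive:
    map_fn(record) -> iterable of (key, value) pairs;
    reduce_fn(key, values) -> (key, result)."""
    pairs = [kv for record in data for kv in map_fn(record)]
    pairs.sort(key=itemgetter(0))  # the "shuffle": group equal keys
    return [reduce_fn(k, [v for _, v in group])
            for k, group in groupby(pairs, key=itemgetter(0))]

# Word count, the canonical example
docs = ["big data", "big clusters", "data data"]
result = mapreduce(
    docs,
    lambda line: [(w, 1) for w in line.split()],
    lambda word, counts: (word, sum(counts)),
)
print(result)  # [('big', 2), ('clusters', 1), ('data', 3)]
```

The point of rmr's design, as the abstract says, is that the same two user-supplied functions are all you write; the library handles distribution across the cluster.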
During the last 12 months, Apache Hadoop has received an enormous amount of attention for its ability to transform the way organizations capitalize on their data in a cost effective manner. The technology has evolved to a point where organizations of all sizes and industries are testing its power as a potential solution to their own data management challenges.
However, there are still technology and knowledge gaps hindering adoption of Apache Hadoop as an enterprise standard. Among these gaps are the complexity of the system, the lack of technical content available to assist with its usage, and the intensive developer and data scientist skills it requires to be used properly. With virtually every Fortune 500 company constructing their Hadoop strategy today, many in the IT community are wondering what the future of Hadoop will look like.
In this session, Hortonworks CEO Eric Baldeschwieler will look at the current state of Apache Hadoop, how the ecosystem is evolving by working together to close the existing technological and knowledge gaps, and present a roadmap for the future of the project.
by Asad Khan
The second one is how to enable simple experiences directly through an HTML5-based interface. The lightweight web interface gives developers the same experience they would get on the server, and provides a zero-installation experience across all client platforms. It also let us use the browsers' HTML5 support to offer basic data visualization for quick data analysis and charting.
by Rohit Valia
The Hadoop framework is an established solution for big data management and analysis. In practice, Hadoop applications vary significantly, and your data center infrastructure is shared by multiple lines of business running multiple differing workloads.
This session looks at the requirements for a multi-tenant big data cluster: one where different lines of businesses, different projects, and multiple applications can be run with assured SLAs, resulting in higher utilization and ROI for these clusters.
This session is sponsored by Platform Computing.
Learn how to integrate MongoDB with Hadoop for large-scale distributed data processing. Using tools like MapReduce, Pig, and Streaming, you will learn how to do analytics and ETL on large datasets, with the ability to load and save data against MongoDB. With Hadoop MapReduce, Java and Scala programmers will find a native solution for using MapReduce to process their data with MongoDB. Programmers of all kinds will find a new way to work with ETL, using Pig to extract and analyze large datasets and persist the results to MongoDB. Python and Ruby programmers can rejoice as well in a new way to write native Mongo MapReduce using the Hadoop Streaming interfaces.
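To make the Streaming model concrete: a Hadoop Streaming job runs plain programs that read lines on stdin and write `key\tvalue` lines on stdout. The sketch below shows the mapper/reducer shape with an invented record format (city, nightly rate); a real job would parse documents exported from MongoDB and submit the scripts via the streaming jar:

```python
# Hadoop Streaming sketch: mapper and reducer as ordinary functions.
# Record format (city, nightly price) is invented for illustration;
# a real job would parse JSON/BSON documents from MongoDB.

def mapper(lines):
    """Emit (city, price) for each tab-separated input record."""
    for line in lines:
        city, price = line.strip().split("\t")
        yield city, float(price)

def reducer(pairs):
    """Average price per city. Streaming hands the reducer its input
    already sorted by key, so equal keys arrive adjacent."""
    current, total, n = None, 0.0, 0
    for city, price in pairs:
        if city != current and current is not None:
            yield current, total / n
            total, n = 0.0, 0
        current = city
        total += price
        n += 1
    if current is not None:
        yield current, total / n

records = ["NYC\t200", "SFO\t250", "NYC\t100", "SFO\t350"]
shuffled = sorted(mapper(records))  # stand-in for the framework's sort
averages = dict(reducer(shuffled))
print(averages)  # {'NYC': 150.0, 'SFO': 300.0}
```

In a real deployment the two functions would be split into two executables passed to the streaming jar's `-mapper` and `-reducer` options, with the sort-and-shuffle done by Hadoop rather than by `sorted()`.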
Using Hadoop-based business intelligence analytics, this session examines the Hadoop source code and its development over time, surfacing interesting and fun facts to share with the audience. This talk will illustrate text and related analytics with Hadoop, on Hadoop, to reveal the true hidden secrets of the elephant.
This entertaining session highlights the value of data correlation across multiple datasets and the visualization of those correlations to reveal hidden data relationships.
Once social media and web companies discovered Hadoop as the good-enough solution for any data analytics problem that did not fit into MySQL, Hadoop was on a rapid rise in the financial industry. The reasons the financial industry is adopting Hadoop so quickly are very different from those in other industries. Banks typically are not engineering-driven organizations, and practices like agile development, shared root keys, or crontab scheduling are no-gos in a bank but standard in the Hadoop world.
This entertaining talk, for bankers and other financial services managers with technical experience as well as engineers, discusses four business intelligence platform deployments on Hadoop:
1. Long-term storage and analytics of transactions, and the huge cost savings Hadoop can provide;
2. Identifying cross-sell and up-sell opportunities by analyzing web log files in combination with customer profiles;
3. Value-at-risk analytics; and
4. Understanding the SLA issues and identifying problems in a large, thousands-of-nodes service-oriented architecture.
This session discusses the different use cases and the challenges to overcome in building and using BI on Hadoop.
In video surveillance, hundreds of hours of video recordings are culled from multiple cameras. Within this video are hours of recordings that do not change from one minute to the next, one hour to the next, and in some cases, one day to the next. Identifying information that is interesting and that can be shared, analyzed, and viewed by a larger community from this video is a time-consuming task that often requires human intervention assisted by digital processing tools.
Using Map/Reduce we can harness parallel processing and clusters of graphics processors to identify and tag useful periods of time for faster analysis. The result is an aggregate video file that contains metadata tags that link back to the start of those scenes in the original file. In essence, this creates an index into hundreds of thousands of hours of recording that can be reviewed, shared, and analyzed by a much larger group of individuals.
This session will review real-world examples where this is being done and discuss the process for developing a Hadoop process that breaks a video down into scenes, which are analyzed by map tasks to determine interest and then reduced into a single index file containing 30 seconds of recording around each scene. Moreover, the file will contain the metadata needed to jump back to the start point in the original, allowing the viewer to view the scene in the context of the entire recording.
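The flow described above can be sketched as a map step that scores fixed-length segments for "interest" and a reduce step that merges the interesting ones into an index of 30-second clips pointing back into the original. The scoring input and threshold below are invented placeholders; a real job would derive scores from frame differencing or similar motion detection:

```python
# Scene-indexing sketch: map scores segments, reduce builds the index.
# Motion scores and the 0.5 threshold are invented for illustration.

CLIP_SECONDS = 30
THRESHOLD = 0.5

def map_segment(segment):
    """segment: (camera_id, start_second, motion_score).
    Emit (camera_id, start_second) only for interesting periods."""
    camera, start, score = segment
    if score >= THRESHOLD:
        yield camera, start

def reduce_index(camera, starts):
    """Merge tagged starts into non-overlapping 30-second clip
    windows, each recording its offset into the original file."""
    clips, last_end = [], -1
    for start in sorted(starts):
        if start >= last_end:
            clips.append({"offset": start, "length": CLIP_SECONDS})
            last_end = start + CLIP_SECONDS
    return camera, clips

segments = [("cam1", 0, 0.1), ("cam1", 40, 0.9),
            ("cam1", 55, 0.8), ("cam1", 300, 0.7)]
tagged = [kv for seg in segments for kv in map_segment(seg)]
camera, index = reduce_index("cam1", [s for _, s in tagged])
print(index)  # [{'offset': 40, 'length': 30}, {'offset': 300, 'length': 30}]
```

The `offset` values are exactly the metadata the session describes: pointers from the condensed index file back into the original recording.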
NetApp is a fast growing provider of storage technology. Its devices “phone home” regularly, sending unstructured auto-support log and configuration data back to centralized data centers. This data is used to provide timely support, to improve sales, and to plan product improvements. To allow this, data is collected, organized, and analyzed. The system currently ingests 5 TB of compressed data per week, which is growing 40% per year. NetApp was previously storing flat files on disk volumes and keeping summary data in relational databases. Now NetApp is working with Think Big Analytics, deploying Hadoop, HBase and related technologies to ingest, organize, transform and present auto-support data. This will enable business users to make decisions and provide timely response, and will enable automated response based on predictive models. Key requirements include:
In this session we look at the lessons learned while designing and implementing a system to:
28th February to 1st March 2012