by Todd Lipcon
Optimizing MapReduce job performance is often seen as something of a black art. In order to maximize performance, developers need to understand the inner workings of the MapReduce execution framework and how they are affected by various configuration parameters and MR design patterns. The talk will illustrate the underlying mechanics of job and task execution, including the map side sort/spill, the shuffle, and the reduce side merge, and then explain how different job configuration parameters and job design strategies affect the performance of these operations. Though the talk will cover internals, it will also provide practical tips, guidelines, and rules of thumb for better job performance. The talk is primarily targeted towards developers directly using the MapReduce API, though will also include some tips for users of higher level frameworks.
This presentation will share a new approach to computing a histogram from a data stream in a distributed environment for a continuous attribute. Based on the PiD algorithm and the map/reduce framework, we will demonstrate an approach which combines histograms that is associative and thus can leverage the combiner function refinement of the map/reduce framework. This approach is approximate, and experiments indicate the error is always less than 1%. We tested this for different distributions and can demonstrate through this approach, that performance does not suffer from distributing computations. We will also demonstrate that this approach does not need any parameters and can be applied without any user interaction requirements. A use case is data import where all data is passed (once) anyway and histograms can be generated automatically. The number of bins can be a preset to a constant.
by Josh Wills
Branch-and-bound is a widely used technique for efficiently searching for solutions to combinatorial optimization problems. In this session, we will introduce BranchReduce, an open-source Java library for performing distributed branch-and-bound on a Hadoop cluster under YARN. Applications only need to write code that is specific to their optimization problem (namely the branching rule, the lower bound computation, and the upper bound computation), and BranchReduce handles deploying the application to the cluster, managing the execution, and periodically rebalancing the search space across the machines. We will give an overview of how BranchReduce works and then walk through an example that solves a scheduling problem with a near-linear speedup over a single machine implementation.
by Richard Cole
Amazon Elastic MapReduce is one of the largest operators of Hadoop in the world, with over one million Hadoop clusters run on the Amazon Web Services infrastructure over the last year. Since its launch over three years ago, the Amazon Elastic MapReduce team has helped users of all sizes manage the wide variety of Hadoop failure conditions, hardware and network issues, and data errors that make operating Hadoop clusters so challenging. These failures and/or performance degradations have kindled the team to develop a number of tools and best practices to help its customers more efficiently operate and troubleshoot their Hadoop clusters. This talk will detail the team’s findings, including a number of Hadoop best practices and general troubleshooting tips.
by Farhan Hussain and Saad Patel
Passenger dissatisfaction is a growing concern for major airlines. According to an American Customer Satisfaction Index report in 2011, from among 47 industries in a study of customer satisfaction, U.S. airlines fell to last place. In a low-margin sector like airlines, customer satisfaction matters a lot to yield profits. This can have a direct impact on investor confidence and stock prices for these publicly traded airlines. Independent studies usually use surveys for customer satisfaction data. These surveys often are the core piece of sample data used to draw conclusions for that entire customer base. Surveys are predefined sets of questions, often eliciting deterministic responses. By using real data from passenger feedback sites, we can sample a larger group of passengers, versus a randomly selected sample or in some cases, a biased survey. The data we collect is then correlated with real airline statistics to draw conclusions about what leaves passengers dissatisfied with an airline. This large data set is analyzed using Hadoop. These findings provide valuable data, which can be used to improve the overall customer experience, resulting in better profit margins and investor confidence for the airline industry.
by Avery Ching
While MapReduce provides a simple programing model for big data, native graph processing has also become immensely popular, primarily due to the rise of growing social networks such as Facebook, LinkedIn, and Twitter. Evidence of this trend can be found in the use of the Pregel/BSP (bulk synchronous processing) approach in projects such as Apache Hama, Bagel, Goldenorb, Phoebus, and the focus of this talk, Apache Giraph. Graph databases such as Neo4j and Bigdata and graph query languages such as Gremlin and Cypher are early indications of the eventuality that a graph processing platform analogous to the big data software stack will arise. Apache Giraph was incubated in September of 2011 and the project has grown into a dedicated group of “Giraphers” that have made many significant improvements and a 0.1 release in February 2012. Notably, the graph distribution of Giraph was rewritten to be more flexible (random and/or locality based) and major memory improvements were made to support larger graphs. In this talk we will present the current state of Giraph in a series of expanded graph processing benchmarks at the scale of over a billion edges in the graph and discuss future directions for the project.
by Bruce Tolley
Because MapReduce collocates data with the compute note, the data access is local (data locality), which conserves network bandwidth. Lack of network bandwidth and inefficient server I/O can become a bottleneck and cause compute nodes to become idle which wastes the processing power available in with Westmere and Romley class servers can increase performance. Tolley will review technology options for building high performance 10GbE clusters including tuning parameters to optimize server I/O with multicore servers, and present test results conducted with Hadoop ecosystem technology partners. The session will provide the information needed to achieve faster throughput and higher bandwidth, while supporting scale-out data centers. In particular, some of the latest Layer 2/3 congestion management features will be discussed which show how to match network performance through the server to the specific requirements of the customer application (sometimes called flow affinity or flow steering).
The MapReduce programming model lets developers without experience with parallel and distributed systems utilize the resources of a large, multi-CPU system. The Oracle RDBMS has had support for the MapReduce paradigm for years through SQL analytics, user defined pipelined table functions and aggregation objects. Apache Hadoop implements the MapReduce model. In this session, we describe a prototype of Oracle in-database Hadoop implementation — which leverages the database resident Java VM, the parallel execution engine, cluster architecture (RAC), pipelined table functions, — and lets you execute Hadoop applications directly in the database against data already stored in Oracle database. The major advantages of our implementation include: (1) no data shipping to an external Hadoop cluster (2) source compatibility with Hadoop, (3) minimal dependency on the Apache Hadoop infrastructure, (4) seamless integration of MapReduce functionality in Oracle SQL (5) better parallelism and efficiency due to data pipelining (i.e., table functions) and no intermediate materialization.
by Hitesh Shah
Hadoop YARN is the next generation computing platform in Apache Hadoop with support for programming paradigms besides MapReduce. In the world of Big Data, one cannot solve all the problems wholly using the Map Reduce programming model. Typical installations run separate programming models like MR, MPI, graph-processing frameworks on individual clusters. Running fewer larger clusters is cheaper than running more small clusters. Therefore,_leveraging YARN to allow both MR and non-MR applications to run on top of a common cluster becomes more important from an economical and operational point of view. This talk will cover the different APIs and RPC protocols that are available for developers to implement new application frameworks on top of YARN. We will also go through a simple application which demonstrates how one can implement their own Application Master, schedule requests to the YARN resource-manager and then subsequently use the allocated resources to run user code on the NodeManagers.