by Andrew Ryan
The Hadoop Distributed Filesystem, or HDFS, provides the storage layer to a variety of critical services at Facebook. The HDFS Namenode is often singled out as a particularly weak aspect of the design of HDFS, because it represents a single point of failure within an otherwise redundant system. To address this weakness, Facebook has been developing a highly available Namenode, known as Avatarnode. The objective of this study was to determine how much effect Avatarnode would have on overall service reliability and durability. To analyze this, we categorized, by root cause, the last two years of operational incidents in the Data Warehouse and Messages services at Facebook, a total of 66 incidents. We were able to show that approximately 10% of each service's incidents would have been prevented had Avatarnode been in place. Avatarnode would have prevented none of our incidents that involved data loss, and all of the most severe data loss incidents were a result of human error or software bugs. Our conclusion is that Avatarnode will improve the reliability of services that use HDFS, but that the HDFS Namenode represents only a small portion of overall operational incidents in services that use HDFS as a storage layer.
by Jonathan Hsieh and Jeff Bean
Apache HBase is a rapidly-evolving random-access distributed data store built on top of Apache Hadoop’s HDFS and Apache ZooKeeper. Drawing from real-world support experiences, this talk provides administrators insight into improving HBase’s availability and recovering from situations where HBase is not available. We share tips on the common root causes of unavailability, explain how to diagnose them, and prescribe measures for ensuring maximum availability of an HBase cluster. We discuss new features that improve recovery time such as distributed log splitting as well as supportability improvements. We will also describe utilities including new failure recovery tools that we have developed and contributed that can be used to diagnose and repair rare corruption problems on live HBase systems.
The HDFS NameNode is a robust and reliable service, as seen in production at Yahoo and at other customers. However, the NameNode does not have automatic failover support. A hot failover solution called HA NameNode is currently under active development (HDFS-1623). This talk will cover its architecture, design, and setup. We will also discuss the future direction for HA NameNode.
by Greg Bruno
The California Gold Rush ended in 1855, but today it feels like we are at the cusp of a new Gold Rush of sorts. Only this time the prize is Big Data, and IT departments are flocking to it seeking their fortune. Some will succeed wildly — while others will fail miserably. In this session we will describe a reference architecture for Hadoop that will take your Big Data project from proof-of-concept to full-scale deployment while avoiding the missteps and mistakes that could get in the way of your project. Our reference architecture is based on industry standard Apache Hadoop, and built on a rock-solid deployment and management infrastructure derived from the Rocks cluster management software. Greg will share his expertise in big infrastructure deployment and management, showing you how to design for deployment from day one. Let us be your guide as you explore the Big Data frontier, and we will lead you to success. With the right methods, and the right tools for the job, your Hadoop project will be pure gold.
by Josh Wills
Branch-and-bound is a widely used technique for efficiently searching for solutions to combinatorial optimization problems. In this session, we will introduce BranchReduce, an open-source Java library for performing distributed branch-and-bound on a Hadoop cluster under YARN. Applications only need to write code that is specific to their optimization problem (namely the branching rule, the lower bound computation, and the upper bound computation), and BranchReduce handles deploying the application to the cluster, managing the execution, and periodically rebalancing the search space across the machines. We will give an overview of how BranchReduce works and then walk through an example that solves a scheduling problem with a near-linear speedup over a single machine implementation.
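To make the three problem-specific pieces named in the abstract concrete (the branching rule, the lower bound, and the upper bound), here is a minimal single-machine sketch in Python. It is not BranchReduce's API (BranchReduce is a Java library and its interfaces are not shown in this abstract), and the toy problem, minimizing the makespan of integer-length jobs on identical machines, is an assumption chosen for illustration rather than the scheduling problem from the talk. The names `branch`, `lower_bound`, `upper_bound`, and `solve` are hypothetical.

```python
import heapq


def lower_bound(loads, durations, next_job):
    """Optimistic makespan: no schedule below this node can do better."""
    remaining = sum(durations[next_job:])
    machines = len(loads)
    # Ceiling of the average load; valid because durations are integers.
    average = -(-(sum(loads) + remaining) // machines)
    return max(max(loads), average)


def upper_bound(loads, durations, next_job):
    """Feasible makespan: greedily place each remaining job on the least-loaded machine."""
    loads = list(loads)
    for d in durations[next_job:]:
        loads[loads.index(min(loads))] += d
    return max(loads)


def branch(loads, durations, next_job):
    """Branching rule: try the next job on every machine, skipping symmetric duplicates."""
    seen = set()
    for i in range(len(loads)):
        child = list(loads)
        child[i] += durations[next_job]
        key = tuple(sorted(child))  # identical machines, so order is irrelevant
        if key not in seen:
            seen.add(key)
            yield key


def solve(durations, machines):
    """Best-first branch-and-bound; returns the optimal makespan."""
    durations = sorted(durations, reverse=True)
    root = (0,) * machines
    best = upper_bound(root, durations, 0)  # incumbent: best feasible value so far
    frontier = [(lower_bound(root, durations, 0), 0, root)]
    while frontier:
        bound, next_job, loads = heapq.heappop(frontier)
        if bound >= best:
            continue  # prune: this subtree cannot beat the incumbent
        for child in branch(loads, durations, next_job):
            best = min(best, upper_bound(child, durations, next_job + 1))
            lb = lower_bound(child, durations, next_job + 1)
            if lb < best:
                heapq.heappush(frontier, (lb, next_job + 1, child))
    return best


if __name__ == "__main__":
    print(solve([3, 3, 2, 2, 2], 2))
```

In this sketch only the first three functions depend on the problem; the loop in `solve` is generic. That loop (plus distribution and rebalancing of the frontier across machines) is the part the abstract says the library takes over.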
13th–14th June 2012