Your current filters are…
by Joanthan Hsieh and Jeff Bean
Apache HBase is a rapidly-evolving random-access distributed data store built on top of Apache Hadoop’s HDFS and Apache ZooKeeper. Drawing from real-world support experiences, this talk provides administrators insight into improving HBase’s availability and recovering from situations where HBase is not available. We share tips on the common root causes of unavailability, explain how to diagnose them, and prescribe measures for ensuring maximum availability of an HBase cluster. We discuss new features that improve recovery time such as distributed log splitting as well as supportability improvements. We will also describe utilities including new failure recovery tools that we have developed and contributed that can be used to diagnose and repair rare corruption problems on live HBase systems.
We use “SaasBase Analytics” to incrementally process large heterogeneous data sets into pre-aggregated, indexed views, stored in HBase to be queried in realtime. The requirement we started from was to get large amounts of data available in near realtime (minutes) to large amounts of users for large amounts of (different) queries that take milliseconds to execute. This set our problem apart from classical solutions such as Hive and PIG. In this talk I`ll go through the design of the solution and the strategies (and hacks) to achieve low latency and scalability from theoretical model to the entire process of ETL to warehousing and queries.
There are a number of excellent databases that have proven invaluable when working with `big data` – including Cassandra, Riak, DynamoDB, MongoDB, and HBase. So, how do you decide which is the right solution for you? This talk will start by briefly discussing some of the theory involved with distributed databases – from the oft-cited (and almost-as-often misunderstood) CAP theorem, to vector clocks and the difficulties of eventual consistency, and much more. We will then compare how mainstream distributed databases make use of these concepts, and the tradeoffs they incur. Unfortunately, due to the nature of these tradeoffs, there is no one-size-fits all solution. So, we`ll discuss what classes of problems each of these systems is appropriate for – and include some real world benchmarks – to help you make an informed decisions about your distributed systems.
by Kathleen Ting and Bilung Lee
Apache Sqoop (incubating) was created to efficiently transfer big data between Hadoop related systems (such as HDFS, Hive, and HBase) and structured data stores (such as relational databases, data warehouses, and NoSQL systems). The popularity of Sqoop in enterprise systems confirms that Sqoop does bulk transfer admirably. In the meantime, we have encountered many new challenges that have outgrown the abilities of the current infrastructure. To fulfill more data integration use cases as well as become easier to manage and operate, a new generation of Sqoop, also known as Sqoop 2, is currently undergoing development to address several key areas, including ease of use, ease of extension, and security. This session will talk about Sqoop 2 from both the development and operations perspectives.
by Alex Kozlov
Processing of large data requires new approaches to data mining: low, close to linear, complexity and stream processing. While in the traditional data mining the practitioner is usually presented with a static dataset, which might have just a timestamp attached to it, to infer a model for predicting future/takeout observations, in stream processing the problem is often posed as extracting as much information as possible on the current data to convert them to an actionable model within a limited time window. In this talk I present an approach based on HBase counters for mining over streams of data, which allows for massively distributed processing and data mining. I will consider overall design goals as well as HBase schema design dilemmas to speed up knowledge extraction process. I will also demo efficient implementations of Naive Bayes, Nearest Neighbor and Bayesian Learning on top of Bayesian Counters.
by Plamen Jeliazkov and Konstantin Shvachko
HDFS is based on the decoupled namespace from data architecture. Its namespace operations are performed on a designated server NameNode and data is subsequently streamed from/to data servers DataNodes. While the data layer of HDFS is highly distributed, the namespace is maintained by a single NameNode, making it a SPOF and a bottleneck for its scalability and availability. HBase is a scalable metadata store, which can be used for storing objects composing the files directly in it, but this would lack the ability to separate namespace operations from data streaming. Giraffa is an experimental file system, which uses HBase to maintain the file system namespace in a distributed way and serves data directly from DataNodes. Giraffa is built from the existing in HDFS and HBase components. Giraffa is intended to maintain very large namespaces. HBase automatically partitions large tables into horizontal slices Regions. The partitioning is dynamic, so that if a region grows too big or becomes too small the table is automatically repartitioned. The partitioning is based on row ordering. In order to optimize the access to the file system objects Giraffa preserves the locality of objects adjacent in the namespace tree. The presentation will explain the Giraffa architecture, the principles behind row key definitions for namespace partitioning, and will address the atomic rename problem.
This application serves as a tutorial for the use of Big Data in the cloud. We start with the commoncrawl.org crawl of approximately 5 Billion web pages. We use a Map Reduce program to scour the commoncrawl corpus for web pages that contain mentions of a brand or keyword of interest, say, `Citibank`, and additionally, have a `Follow me on twitter` link. We harvest this twitter handle, and store it in HBase. Once we have harvested about 5000 twitter handles, we write and run a program to subscribe to the twitter streaming API for public status updates of these folks. As the twitter status updates pour in, we use a natural language processing library to evaluate the sentiment of these tweets, and store the sentiment score back in HBase. Finally, we use a program written in R, and the rhbase connector to do a real time statistical evaluation of the sentiment expressed by the twitterverse towards this brand or keyword. This presentation includes full details on installing and operating all necessary software in the cloud.
by Mark Davis
Semantic zooming involves providing the right type of information depending on the resolution of viewer. A canonical example is the map viewer, where country outlines are visible at one level and, as the user zooms in, provinces and roadways become increasingly visible. High-performance zooming technologies are critically dependent on the efficient materialization of views from the data resources and, for big data resources like sensor data, econometrics, social networks, biological databases, and networking performance data, they are impeded by the scale of the data and the need to preprocess the information into aggregate views in advance, reducing the granularity and timeliness of the insights that can be obtained from the zooming technology. Through parallelization, however, semantic zooming that operates directly on the data becomes possible. In this highly visual presentation and demo, we will show our ZettaZoom visualization engine that provides a protocol for Hadoop and HBase marshaling of data signals into visual representations that preserve the relationships present within the data, enabling semantic zooming over massive data collections.
13th–14th June 2012