by Plamen Jeliazkov and Konstantin Shvachko
HDFS is built on an architecture that decouples the namespace from the data. Namespace operations are performed on a designated server, the NameNode, while data is streamed from and to data servers, the DataNodes. Although the data layer of HDFS is highly distributed, the namespace is maintained by a single NameNode, making it a single point of failure and a bottleneck for scalability and availability. HBase is a scalable metadata store and could store the objects composing files directly, but that approach would lose the separation of namespace operations from data streaming. Giraffa is an experimental file system that uses HBase to maintain the file system namespace in a distributed way and serves data directly from DataNodes. Giraffa is built from existing HDFS and HBase components and is intended to maintain very large namespaces. HBase automatically partitions large tables into horizontal slices called Regions. The partitioning is dynamic: if a Region grows too big or becomes too small, the table is automatically repartitioned. The partitioning is based on row ordering, so to optimize access to file system objects Giraffa defines row keys that preserve the locality of objects adjacent in the namespace tree. The presentation will explain the Giraffa architecture and the principles behind row key definitions for namespace partitioning, and will address the atomic rename problem.
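To make the locality idea concrete, here is a minimal sketch of a locality-preserving row key, assuming the namespace table is keyed by normalized full path. This is illustrative only and is not Giraffa's actual key format, which the presentation itself covers.

```python
# Sketch: because HBase sorts rows lexicographically, keying the namespace
# table by full path places every entry under a directory in one contiguous
# key range, so a whole subtree tends to land in the same Region.
# (Illustrative scheme only, not Giraffa's actual row key definition.)

def row_key(path: str) -> bytes:
    """Normalize a path and use it as the HBase row key."""
    parts = [p for p in path.split("/") if p]
    return ("/" + "/".join(parts)).encode()

# All descendants of /user sort between /user and its next sibling,
# so a prefix scan over b"/user/" enumerates the directory's subtree.
keys = sorted(row_key(p) for p in ["/user/a", "/tmp", "/user", "/user/a/f1", "/var"])
```

A scan bounded by the directory's key prefix then serves a `listStatus`-style operation without touching unrelated Regions.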
This application serves as a tutorial for the use of Big Data in the cloud. We start with the commoncrawl.org crawl of approximately 5 billion web pages. We use a MapReduce program to scour the commoncrawl corpus for web pages that mention a brand or keyword of interest, say `Citibank`, and additionally have a `Follow me on twitter` link. We harvest this Twitter handle and store it in HBase. Once we have harvested about 5000 Twitter handles, we write and run a program that subscribes to the Twitter streaming API for the public status updates of these users. As the status updates pour in, we use a natural language processing library to evaluate the sentiment of each tweet and store the sentiment score back in HBase. Finally, we use a program written in R, together with the rhbase connector, to do a real-time statistical evaluation of the sentiment expressed by the twitterverse towards this brand or keyword. This presentation includes full details on installing and operating all necessary software in the cloud.
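The harvesting step above can be sketched as a map function over crawled pages. The names (`BRAND`, `HANDLE_RE`, `map_page`) are hypothetical, and the HBase write is stood in for by the caller collecting the emitted pairs; the tutorial's actual code is not reproduced here.

```python
import re

# Hedged sketch of the MapReduce harvesting step: emit (handle, url) for
# pages that mention the brand and link to a Twitter profile. Names are
# illustrative; in the real pipeline the pairs would be written to HBase.
BRAND = "Citibank"
HANDLE_RE = re.compile(r"twitter\.com/([A-Za-z0-9_]{1,15})")

def map_page(url, html):
    """Map step: yield (twitter_handle, page_url) for matching pages."""
    if BRAND.lower() in html.lower():
        for handle in HANDLE_RE.findall(html):
            yield handle.lower(), url

page = '<p>Bank with Citibank! <a href="http://twitter.com/citi">Follow me on twitter</a></p>'
pairs = list(map_page("http://example.com", page))
```

In a real job the reduce step would deduplicate handles before the streaming-API subscriber picks them up.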
by Mark Davis
Semantic zooming involves providing the right type of information depending on the resolution of the viewer. A canonical example is the map viewer, where country outlines are visible at one level and, as the user zooms in, provinces and roadways become increasingly visible. High-performance zooming technologies depend critically on the efficient materialization of views from the underlying data. For big data resources like sensor data, econometrics, social networks, biological databases, and networking performance data, they are impeded by the scale of the data and by the need to preprocess the information into aggregate views in advance, which reduces the granularity and timeliness of the insights the zooming technology can provide. Through parallelization, however, semantic zooming that operates directly on the data becomes possible. In this highly visual presentation and demo, we will show our ZettaZoom visualization engine, which provides a protocol for marshaling data signals from Hadoop and HBase into visual representations that preserve the relationships present within the data, enabling semantic zooming over massive data collections.
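The ZettaZoom protocol itself is not spelled out in the abstract; the sketch below only illustrates the general level-of-detail idea behind semantic zooming, where a signal is averaged into buckets whose width doubles each time the user zooms out one step. The function and parameter names are hypothetical.

```python
def coarsen(points, zoom, max_zoom=10):
    """Level-of-detail aggregation (illustrative, not ZettaZoom's protocol):
    average a series into buckets that double in width per zoom-out step;
    at max_zoom the raw points are returned unchanged."""
    width = 2 ** (max_zoom - zoom)
    if width <= 1:
        return list(points)
    return [sum(points[i:i + width]) / len(points[i:i + width])
            for i in range(0, len(points), width)]

series = list(range(16))
overview = coarsen(series, zoom=7, max_zoom=10)  # width 8 -> 2 buckets
detail = coarsen(series, zoom=10, max_zoom=10)   # raw points
```

Parallelizing such per-bucket aggregation across Regions is what lets the views be materialized on demand instead of precomputed.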
13th–14th June 2012