Your current filters are…
Virtual Machines are a mainstay in the enterprise. Apache Hadoop is normally run on bare machines. This talk walks through the convergence and the use of virtual machines for running ApacheHadoop. We describe the results from various tests and benchmarks which show that the overhead of using VMs is small. This is a small price to pay for the advantages offered by virtualization. The second half of talk compares multi-tenancy with VMs versus multi-tenancy of with Hadoop`s Capacity scheduler. We follow on with a comparison of resource management in V-Sphere and the finer grained resource management and scheduling in NextGen MapReduce. NextGen MapReduce supports a general notion of a container (such as a process, jvm, virtual machine etc) in which tasks are run;. We compare the role of such first class VM support in Hadoop.
by Shyam Sarkar
The Dodd-Frank Act signifies the biggest US regulatory change in several decades. According to experts, Dodd-Frank will have a substantial influence over an estimated 8,500 investment managers, all the 10-12 US exchanges and alternative execution networks. This presentation describes implementation perspective about the new regulations that specifically relate to the central clearing of OTC derivatives, and the repercussions for confirmation/settlement flows, real-time reporting and risk management. A trading platform and repository will need direct access to exchanges. This eliminates layers of risk by removing redundant data keying and duplication. Straight-through processing facilitates integration of front-to-back office systems and has the additional benefit of helping prevent illegal market manipulative practices providing the necessary audit trail. Big Data Analytics with cloud based Hadoop, HBase, Hive along with BI tools will be necessary for straight-through processing and realtime reporting.
How do you keep up with the velocity and variety of data streaming in and get analytics on it even before persistence and replication in Hadoop? In this talk, we’ll look at common architectural patterns being used today at companies such as Expedia, Groupon and Zynga that take advantage of Splunk to provide real-time collection, indexing and analysis of machine-generated big data with reliable event delivery to Hadoop. We’ll also describe how to use Splunk’s advanced search language to access data stored in Hadoop and rapidly analyze, report on and visualize results.
by Tony Baer
With Hadoop entering the mainstream, can — and should — it benefit from best practices from the world of Data Warehousing. Should the same ground rules developed for capacity-constrained internal enterprise DWs apply to Hadoop data stores designed for scale out, or for harvesting data from the Internet? We will pinpoint 3 key areas: data quality, privacy & confidentiality, and lifecycle management, addressing issues such as: 1. Does it make sense to apply traditional data cleansing practices to Hadoop data? Or will removing “errors” remove the possibility for discovering new insights? 2. Do different standards for privacy protection apply when harvesting sources such as social media that are already public? Should enterprises track their customers on Facebook or Twitter? 3. Will Hadoop make conventional data archiving practices obsolete? Is it cost effective to “move” petabytes of data offline? Just because the Googles & Yahoos of the world retain all their data, should mainstream enterprises? Should Hadoop be considered the new tape?
by Todd Lipcon
Optimizing MapReduce job performance is often seen as something of a black art. In order to maximize performance, developers need to understand the inner workings of the MapReduce execution framework and how they are affected by various configuration parameters and MR design patterns. The talk will illustrate the underlying mechanics of job and task execution, including the map side sort/spill, the shuffle, and the reduce side merge, and then explain how different job configuration parameters and job design strategies affect the performance of these operations. Though the talk will cover internals, it will also provide practical tips, guidelines, and rules of thumb for better job performance. The talk is primarily targeted towards developers directly using the MapReduce API, though will also include some tips for users of higher level frameworks.
by Alan Gates
The initial work in HCatalog has allowed users to share their data in Hadoop regardless of the tools they use and relieved them of needing to know where and how their data is stored. But there is much more to be done to deliver on the full promise of providing metadata and table management for Hadoop clusters. It should be easy to store and process semi-structured and unstructured data via HCatalog. We need interfaces and simple implementations of data life cycle management tools. We need to deepen the integration with NoSQL and MPP data stores. And we need to be able to store larger metadata such as partition level statistics and user generated metadata. This talk will cover these areas of growth and give an overview of how they might be approached.
by Supreeth Rao
In display advertising at Yahoo, we have an attribution problem, wherein, we need to attribute each ad shown with the corresponding ad-serve data on a regular basis. While the attribution needs to happen on about million of ads every few minutes, the attribution needs to lookup tens of terabytes of compressed data. Processing such volumes of data every few minutes is a considerable challenge. We use Hadoop/Pig for our advertising data processing ETLs at Yahoo. While Pig provides many variants of join optimizations such as replicated-join, skewed join and merge joins, there is no inbuilt efficient sparse join that can handle the scale specified above. In this context, we propose a skewed join query with an optimized query plan using a partitioned-join, with dynamic partitioning of data and effective partition pruning. We have conducted experiments to evaluate our query optimizations. Based on our results, we observe up to 3x improvement over the best performing inbuilt join. Further, the amount of compute/io resources needed for our approach are significantly lower than the best performing inbuilt pig join. In this presentation, we will detail our design choices, reasons, benefits, along with our evaluation w.r.t other join strategies available within PIG.
by Harish Butani
Hive Query Language (HQL) is excellent for productivity and enables reuse of SQL skills, but falls short in advanced analytic queries. Hive`s Map & Reduce scripts mechanism lacks the simplicity of SQL and specifying new analysis is cumbersome. We developed SQLWindowing for Hive(SQW) to overcome these issues. SQW introduces both Windowing and Table Functions to the Hive user. SQW appears as a HQL extension with table functions and windowing clauses interspersed with HQL. This means the user stays within a SQL-like interface, while simultaneously having these capabilities available. SQW has been published as an open source project. It is available as both a CLI and an embeddable jar with a simple query API. There are pre-built functions for windowing to do Ranking, Aggregation, Navigation and Linear Regression. There are Table functions to do Time Series Analysis, Allocations, and Data Densification. Functions can be chained for more complex analysis. Under the covers MR mechanics are used to partition and order data. The fundamental interface is the tableFunction, whose core job is to operate on data partitions. Function implemenations are isolated from MR mechanics, focus purely on computation logic. Groovy scripting can be used for core implementation and parameterizing behavior. Writing functions typically involves extending one of the existing Abstract functions.
by Barry Livingston and Dani Abel Rayan
Riot Games, the Santa Monica-based developer and publisher of League of Legends, aims to be the most player-focused company in the world. More than four million hardcore gamers play an average of 2.5 hours every day, requiring thousands of globally distributed game, platform, and high-volume transaction servers that generate hundreds of gigabytes of data daily. Riot relies on data for continuously improving player engagement, game design, and e-commerce. About six months ago, our data infrastructure reached a point at which it had difficulty managing our exponential growth. Rather than use commercial packages, Riot created only what was needed by leveraging open-source and off-the-shelf tools from the Hadoop ecosystem. Incorporating this new ecosystem into our existing enterprise data architecture while dealing with the ever-increasing deluge of data was challenging, but we succeeded by focusing heavily on reliability and continuous improvement. This session focuses on how Riot understood the “signals” and made a successful, rapid transition to a sophisticated MySQL-Hadoop ETL pipeline for offline analytics, leveraging Hive, Sqoop, Azkaban, Pentaho, and Tableau. This presentation details what we did, what we achieved, and some of the hard lessons we learned
Hadoop workloads can be broadly divided into two types: large aggregation queries that involve scans through massive amounts of data, and selective “needle in a haystack” queries that significantly restrict the number of records under consideration. Secondary indexes can greatly increase processing speed for queries of the second type. Several Hadoop indexing schemes have been proposed, none of which are entirely satisfactory. In this talk, we present Twitter`s generic, extensible in-situ indexing framework: unlike “trojan layouts,” no data copying is necessary, and unlike Hive, our integration at the Hadoop API level means that all layers in the stack above can benefit from indexes. Our framework is file format agnostic, compression agnostic, and modular, allowing alternate indexing schemes to be explored with minimal development effort. As a specific case study, we describe Pig scripts that analyze relatively rare user behaviors from logs, illustrating significant increases in performance and throughput compared to a brute force baseline that scans terabytes of logs. Source code will be open-sourced *prior* to the Summit.
For years, Hadoop and associated tools like HBase and Hive have been providing companies with valuable storage and analysis capabilities. Increasingly though, organizations are looking to integrate Hadoop into their existing data infrastructures to support new data processing and analysis that was impractical with traditional data management systems. They’re also looking to go beyond command line interfaces to facilitate analysis of data through richer toolsets. Although still a young market, vendors have taken note of these trends, and we’re seeing rapid development of tools to provide data access, integration, and analysis for Hadoop. In this talk we’ll look at common patterns being applied to leverage Hadoop with traditional data management systems, and then look at the emerging landscape of tools which provide access and analysis of Hadoop data with existing systems such as data warehouses, relational databases, and business intelligence tools.
by Greg Bruno
The California Gold Rush ended in 1855, but today it feels like we are at the cusp of a new Gold Rush of sorts. Only this time the prize is Big Data, and IT departments are flocking to it seeking their fortune. Some will succeed wildly — while others will fail miserably. In this session we will describe a reference architecture for Hadoop that will take your Big Data project from proof-of-concept to full-scale deployment while avoiding the missteps and mistakes that could get in the way of your project. Our reference architecture is based on industry standard Apache Hadoop, and built on a rock-solid deployment and management infrastructure derived from the Rocks cluster management software. Greg will share his expertise in big infrastructure deployment and management, showing you how to design for deployment from day one. Let us be your guide as you explore the Big Data frontier, and we will lead you to success. With the right methods, and the right tools for the job, your Hadoop project will be pure gold.
by Daniel Dai and Thejas Nair
The power of Hadoop lies in its ability to help users cost effectively analyze all kinds of data. We are now seeing the emergence of a new class of analytic applications that can only be enabled by a comprehensive big data platform. Such a platform extends the Hadoop framework with built-in analytics, robust developer tools, and the integration, reliability, and security capabilities that enterprises demand for complex, large scale analytics. In this session, we will share innovative analytics use cases from actual customer implementations using an enterprise-class big data analytics platform.
Big Data and virtualization are two of the most exciting trends in the industry today. In this session you will learn about the components of Big Data systems, and how real-time, interactive and distributed processing systems like Hadoop integrate with existing applications and databases. The combination of Big Data systems with virtualization gives Hadoop and other Big Data technologies the key benefits of cloud computing: elasticity, multi-tenancy and high availability. A new open source project that VMware will announce at the Hadoop Summit will make it easy to deploy, configure and manage Hadoop on a virtualized infrastructure. We will discuss reference architectures for key Hadoop distributions anddiscuss future directions of this new open source project.
When working with structured, semi-structured, and unstructured data, there is often a tendency to try and force one tool – either Hadoop or a traditional DBMS – to do all the work. At Vertica, we’ve found that there are reasons to use Hadoop for some analytics projects, and Vertica for others, and the magic comes in knowing when to use which tool and how these two tools can work together. Join us as we walk through some of the use cases for using Hadoop with a purpose-built analytics platform for an effective, combined analytics solution.
This presentation will share a new approach to computing a histogram from a data stream in a distributed environment for a continuous attribute. Based on the PiD algorithm and the map/reduce framework, we will demonstrate an approach which combines histograms that is associative and thus can leverage the combiner function refinement of the map/reduce framework. This approach is approximate, and experiments indicate the error is always less than 1%. We tested this for different distributions and can demonstrate through this approach, that performance does not suffer from distributing computations. We will also demonstrate that this approach does not need any parameters and can be applied without any user interaction requirements. A use case is data import where all data is passed (once) anyway and histograms can be generated automatically. The number of bins can be a preset to a constant.
by Konstantin Shvachko and Plamen Jeliazkov
HDFS is based on the decoupled namespace from data architecture. Its namespace operations are performed on a designated server NameNode and data is subsequently streamed from/to data servers DataNodes. While the data layer of HDFS is highly distributed, the namespace is maintained by a single NameNode, making it a SPOF and a bottleneck for its scalability and availability. HBase is a scalable metadata store, which can be used for storing objects composing the files directly in it, but this would lack the ability to separate namespace operations from data streaming. Giraffa is an experimental file system, which uses HBase to maintain the file system namespace in a distributed way and serves data directly from DataNodes. Giraffa is built from the existing in HDFS and HBase components. Giraffa is intended to maintain very large namespaces. HBase automatically partitions large tables into horizontal slices Regions. The partitioning is dynamic, so that if a region grows too big or becomes too small the table is automatically repartitioned. The partitioning is based on row ordering. In order to optimize the access to the file system objects Giraffa preserves the locality of objects adjacent in the namespace tree. The presentation will explain the Giraffa architecture, the principles behind row key definitions for namespace partitioning, and will address the atomic rename problem.
Apache Hive provides a convenient SQL-based query language to data stored in HDFS. HDFS provides highly scaleable bandwidth to the data, but does not support arbitrary writes. One of Hortonworks` customers needs to store a high volume of customer data (> 1 TB/day) and that data contains a high percentage (15%) of record updates distributed across years. In many high-update use-cases, HBase would suffice, but the current lack of push down filters from Hive into HBase and HBase`s single level keys make it too expensive. Our solution is to use a custom record reader that stores the edit records as separate HDFS files and synthesizes the current set of records dynamically as the table is read. This provides an economical solution to their need that works within the framework provided by Hive. We believe this use case applies to many Hive users and plan to develop and open source a reusable solution.
Zions Banks’ Hadoop based security data warehouse is a massive minable database used to aggregate event data across their entire enterprise; for long term large-scale security, fraud and forensic related analytics. The utility of this system is realized once the data is normalized into a common format and mined by experts with intimate understanding of the data itself. By using the SDW powered by Hadoop the seasoned IT professional can now get creative on new ways to look at their data, there are no more limits to what can be explored in their enterprise, no data source is off limits due to the expense. Historically, doing this type of storage and analysis would be impossible due to cost and resource constraints, those handcuffs are gone now. Tackling big data can be intimidating but the rewards are worth the work. The data deluge is never going to slow down and getting your arms around your enterprise data and how that data can be inspected and analyzed delivers specific answers to many of the common threats we see today. APT’s, Malware, exploited tokens, all leave a trace, now you just need to start asking the questions.
Upgrading from Hadoop 0.20.x to 0.23.x is challenging in several ways. This presentation will include details on how Yahoo! handles these challenges. It will outline the process we use to upgrade existing 0.20.x clusters to 0.23. It will show specific changes we`ve made to startup scripts and monitoring infrastructure. It will show how we manage new configurations for viewfs and hierarchical queues.
This application serves as a tutorial for the use of Big Data in the cloud. We start with the commoncrawl.org crawl of approximately 5 Billion web pages. We use a Map Reduce program to scour the commoncrawl corpus for web pages that contain mentions of a brand or keyword of interest, say, `Citibank`, and additionally, have a `Follow me on twitter` link. We harvest this twitter handle, and store it in HBase. Once we have harvested about 5000 twitter handles, we write and run a program to subscribe to the twitter streaming API for public status updates of these folks. As the twitter status updates pour in, we use a natural language processing library to evaluate the sentiment of these tweets, and store the sentiment score back in HBase. Finally, we use a program written in R, and the rhbase connector to do a real time statistical evaluation of the sentiment expressed by the twitterverse towards this brand or keyword. This presentation includes full details on installing and operating all necessary software in the cloud.
by Biraja Ghoshal and Lee Weinstein
Financial services firms are struggling with managing and analyzing massive volumes of complex data from many sources to comply with regulations, manage risk, identify fraud and improve efficiency. This session describes how Hadoop is being used to build central data integration hub to store and process all the client, account, transaction, asset data and balance related data at the lowest level of granularity to enable data warehousing and advanced analytics platform and build the rules of the road for adoption of Hadoop eco-systems across different business areas such as treasury, risk, securities services and investment banking.
by Josh Wills
Branch-and-bound is a widely used technique for efficiently searching for solutions to combinatorial optimization problems. In this session, we will introduce BranchReduce, an open-source Java library for performing distributed branch-and-bound on a Hadoop cluster under YARN. Applications only need to write code that is specific to their optimization problem (namely the branching rule, the lower bound computation, and the upper bound computation), and BranchReduce handles deploying the application to the cluster, managing the execution, and periodically rebalancing the search space across the machines. We will give an overview of how BranchReduce works and then walk through an example that solves a scheduling problem with a near-linear speedup over a single machine implementation.
by Ken Krugler
Adbeat helps advertisers target publishers/networks and improve ad results by analyzing millions of web pages every day. Adbeat has been able to cut monthly costs by more than 50%, improve response time by 4x, and quickly add new features by switching from a traditional DB-centric approach to one based on Hadoop & Solr. This analysis is handled by a complex Hadoop-based workflow, where the end result is a set of unique, highly optimized Solr indexes. The data processing platform provided by Hadoop also enables scalable machine learning using Mahout. This talk will cover some of the unique challenges in switching the web site from relying on slow, expensive real-time analytics using database queries to fast, affordable batch analytics and search using Hadoop and Solr.
by Richard Cole
Amazon Elastic MapReduce is one of the largest operators of Hadoop in the world, with over one million Hadoop clusters run on the Amazon Web Services infrastructure over the last year. Since its launch over three years ago, the Amazon Elastic MapReduce team has helped users of all sizes manage the wide variety of Hadoop failure conditions, hardware and network issues, and data errors that make operating Hadoop clusters so challenging. These failures and/or performance degradations have kindled the team to develop a number of tools and best practices to help its customers more efficiently operate and troubleshoot their Hadoop clusters. This talk will detail the team’s findings, including a number of Hadoop best practices and general troubleshooting tips.
Passenger dissatisfaction is a growing concern for major airlines. According to an American Customer Satisfaction Index report in 2011, from among 47 industries in a study of customer satisfaction, U.S. airlines fell to last place. In a low-margin sector like airlines, customer satisfaction matters a lot to yield profits. This can have a direct impact on investor confidence and stock prices for these publicly traded airlines. Independent studies usually use surveys for customer satisfaction data. These surveys often are the core piece of sample data used to draw conclusions for that entire customer base. Surveys are predefined sets of questions, often eliciting deterministic responses. By using real data from passenger feedback sites, we can sample a larger group of passengers, versus a randomly selected sample or in some cases, a biased survey. The data we collect is then correlated with real airline statistics to draw conclusions about what leaves passengers dissatisfied with an airline. This large data set is analyzed using Hadoop. These findings provide valuable data, which can be used to improve the overall customer experience, resulting in better profit margins and investor confidence for the airline industry.
by Nimish Desai
Hadoop is a popular framework for web 2.0 and enterprise businesses that are challenged to store, process and analyze large amounts of data as part of their business requirements. Hadoop’s framework brings a new set of challenges related to the compute infrastructure and underlined network architectures. The goal of the session is to provide the reference network architecture for deploying Hadoop application. This session will cover the Hadoop application behavior in the context of building a network infrastructure and its design consideration. The session also discusses network architecture tradeoffs, and the advantages of close integration between compute and networking. The session goes into detail network device architecture regarding Nexus family series (7000, 5500/FEX, and 3000) specifically 1G vs 10G servers connectivity trade-off, dual-attached server NIC impact on failure of network components , various topologies, buffer usage and network bandwidth characterization(burst and QoS) and consideration of network design for multi-user/multi-workload cluster.
by Yuvaraj "Yuva" Athur Raghuvir
As enterprises race to embrace Hadoop in its next-generation data foundation layer, IT organizations are reexamining their current service-level commitments and looking for ways to achieve similar levels of service with Hadoop-based landscapes. As enterprises across industries move forward into the Hadoop led scale-first system landscapes, SAP is also examining some fundamental must-haves that can drive the adoption of Hadoop in the enterprise. This presentation will discuss the dominant drivers of big data that need to be addressed in a comprehensive way. Furthermore, this session will illustrate how SAP is bringing the value of Hadoop and big data management through a broad portfolio from SAP /Sybase that encompasses SAP HANA, Sybase IQ, SAP BusinessObjects EIM, SAP BusinessObjects BI and other relevant products. It is important to recognize that there is more work to be done to accelerate the adoption of Hadoop and big data technologies into the traditional enterprises, and this presentation concludes with some directions that are worth pondering further.
13th–14th June 2012