Building a Large-Scale Data Collection System Using Flume NG

A session at Strata New York 2012

  • Will McQueen
  • Prasad Mujumdar

Tuesday 23rd October, 2012

1:30pm to 5:00pm (EST)

Hadoop HDFS is typically adopted in situations where traditional storage and database systems are either reaching their limits or have already surpassed them. This usually implies that there are one or more large streams of events that need to be collected, such as log data streams. Flume NG was designed from the ground up to tackle this problem in a straightforward, scalable, and reliable way, and empirical results support the success of its approach.

At a high level, Flume NG has a simple, well-designed architecture consisting of a set of agents, with each agent running any number of sources, channels (event buffers), and sinks. Flume agents can easily be chained across the network to provide a configurable pipeline through which discrete events flow reliably from source (e.g., an application server) to destination (e.g., a Hadoop HDFS cluster). Flume can be configured to support arbitrary data flows, including fan-in (data aggregation) and fan-out (data replication) designs. Such designs follow naturally from the generality of the agent-based architecture.
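The source, channel, and sink roles described above can be illustrated with a small Python toy. This is only a sketch of the data-flow shapes (chaining, fan-in, fan-out by replication), not Flume's actual API: the names `Channel`, `source_put`, and `sink_drain`, the host names, and the sample events are all hypothetical.

```python
from collections import deque


class Channel:
    """Toy stand-in for a Flume channel: an in-memory FIFO event buffer."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        # Returns None when the buffer is empty.
        return self._events.popleft() if self._events else None


def source_put(event, channels):
    """Fan-out by replication: copy one event into every attached channel."""
    for channel in channels:
        channel.put(event)


def sink_drain(channel, deliver):
    """Drain a channel, handing each event to `deliver`; return the count."""
    count = 0
    while (event := channel.take()) is not None:
        deliver(event)
        count += 1
    return count


# Fan-in: two hypothetical application-server agents feed one collector.
collector = Channel()
for host in ("web1", "web2"):
    local = Channel()
    source_put(f"{host}: GET /index.html", [local])
    # Chaining: this agent's sink forwards to the next hop's channel.
    sink_drain(local, collector.put)

# Fan-out: the collector replicates every event to two destinations.
hdfs_channel, audit_channel = Channel(), Channel()
sink_drain(collector, lambda e: source_put(e, [hdfs_channel, audit_channel]))

hdfs_events, audit_events = [], []
sink_drain(hdfs_channel, hdfs_events.append)
sink_drain(audit_channel, audit_events.append)
```

In this toy, both destination lists end up holding the same two events, which is the replication behavior the fan-out design refers to. A real deployment would express the same topology in Flume's agent configuration rather than in code.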

In this tutorial, a group of people closely involved with Flume walks participants through setting up a typical data collection infrastructure using Flume. We first describe the basic architecture of Flume, including its design, the transactional semantics it supports for reliability, and the sources, channels, and sinks included with the Flume core. We then move on to a brief description of common data flow architectures and choose a typical data collection scenario for which we use Flume to do the heavy lifting. Next we come to the main body of this tutorial session: a walkthrough of installing, configuring, and tuning a scalable, reliable, and fault-tolerant Flume-based data collection system for storing events in a Hadoop system in real time.
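To make the idea of transactional semantics concrete, the following Python sketch shows one way a channel could hand events to a sink so that a failed delivery loses nothing. It is a simplified model under assumptions, not Flume's implementation: `TxChannel`, `deliver_batch`, and the `flaky` destination are hypothetical names invented for this illustration.

```python
from collections import deque


class TxChannel:
    """Toy channel whose takes only become permanent on commit."""

    def __init__(self, events=()):
        self._queue = deque(events)
        self._taken = []  # events taken in the current, uncommitted transaction

    def take(self):
        if not self._queue:
            return None
        event = self._queue.popleft()
        self._taken.append(event)
        return event

    def commit(self):
        self._taken.clear()

    def rollback(self):
        # Put uncommitted events back at the head, preserving their order.
        self._queue.extendleft(reversed(self._taken))
        self._taken.clear()

    def __len__(self):
        return len(self._queue)


def deliver_batch(channel, deliver, batch_size):
    """Take up to batch_size events; commit only if every delivery succeeds."""
    try:
        for _ in range(batch_size):
            event = channel.take()
            if event is None:
                break
            deliver(event)
        channel.commit()
        return True
    except Exception:
        channel.rollback()
        return False


channel = TxChannel(["e1", "e2", "e3"])
first_try = []


def flaky(event):
    # Hypothetical destination that fails partway through a batch.
    if event == "e2":
        raise IOError("destination unavailable")
    first_try.append(event)


failed_ok = deliver_batch(channel, flaky, batch_size=3)
pending_after_failure = len(channel)  # all three events are back in the channel

second_try = []
retry_ok = deliver_batch(channel, second_try.append, batch_size=3)
```

In this toy, the failed batch is rolled back in full, so the retry delivers all three events. Note that `e1` reaches the destination twice: a design like this trades possible duplicates for the guarantee that no event is dropped.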

Throughout this presentation we also cover: (1) how to configure Flume to store data on a secure HDFS cluster, (2) configuration options used to trade off between performance and fault tolerance, (3) Avro support, (4) Flume extension points, plugins, and hooks, (5) Flume compatibility with various versions of Hadoop, (6) performance benchmarks, and (7) general best practices for using Flume NG effectively.

About the speakers

  • Will McQueen, Software Engineer, Cloudera Inc.
  • Hari Shreedharan
  • Mike Percy
  • Arvind Prabhakar
  • Prasad Mujumdar, Software Engineer, Cloudera Inc.
