Currently, users of Apache Falcon are forced to define their applications as Apache Oozie workflows. While Falcon shields users from the scheduler and its inner workings, they still end up learning Oozie because applications must be defined as Oozie workflows. The objective of this work is to provide a pipeline designer user interface through which users can author their processes and provision them on Falcon. This should make building applications on Falcon over Hadoop fairly trivial. Falcon can operate on HCatalog tables natively, which means there is a one-to-one correspondence between a Falcon feed and an HCatalog table. Between the feed definition in Falcon and the underlying table definition in HCatalog, there is adequate metadata about the stored data. These datasets can then be operated on by a collection of transformations to extract more refined datasets/feeds. This logic (currently expressed via Oozie workflows, Pig scripts, or MapReduce jobs) is typically represented through a Falcon process. In this talk we walk through the details of the pipeline designer and the current state of this feature.
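The feed-to-table correspondence described above is expressed in Falcon's XML entity format. A minimal sketch of an HCatalog-backed feed follows; the feed, cluster, database, and table names are illustrative, not taken from any real deployment:

```xml
<feed name="rawEventsFeed" description="Daily raw events backed by an HCatalog table"
      xmlns="uri:falcon:feed:0.1">
  <frequency>days(1)</frequency>
  <timezone>UTC</timezone>
  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2014-06-01T00:00Z" end="2016-06-01T00:00Z"/>
      <!-- Lifecycle: partitions older than 30 days are evicted -->
      <retention limit="days(30)" action="delete"/>
    </cluster>
  </clusters>
  <!-- One-to-one mapping between this feed and an HCatalog table;
       the partition key is resolved per feed instance -->
  <table uri="catalog:default:raw_events#ds=${YEAR}-${MONTH}-${DAY}"/>
  <ACL owner="etl" group="users" permission="0755"/>
  <schema location="hcat" provider="hcat"/>
</feed>
```

A Falcon process entity would then reference this feed as an input or output, carrying the transformation logic (Oozie workflow, Pig script, or MapReduce job) that refines it into a downstream feed.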
Apache Falcon is a platform that simplifies the management of data jobs on Hadoop. We delve into the motivation behind Falcon, its use cases, and how it aims to simplify standard functions such as data motion (import, export), lifecycle management (replication, eviction, DR/BCP), and process orchestration (data pipelines, late data handling, etc.). The presentation covers the detailed design and architecture, along with case studies of Falcon usage in production. We also compare this against the solutions a siloed approach would produce. User intent is systematically collected and used for seamless management, alleviating much of the pain for those operating or developing data processing applications on Hadoop.
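To make the lifecycle functions concrete, here is a sketch of how eviction and replication are declared in a single feed entity: a source cluster where data lands and a target cluster to which Falcon replicates it for DR/BCP, each with its own retention policy. All names and paths here are hypothetical:

```xml
<feed name="clicksFeed" description="Hourly clicks, replicated for DR"
      xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <timezone>UTC</timezone>
  <clusters>
    <!-- Source cluster: data is produced here; evicted after 90 days -->
    <cluster name="primaryCluster" type="source">
      <validity start="2014-06-01T00:00Z" end="2016-06-01T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
    <!-- Target cluster: Falcon schedules replication here (DR/BCP),
         with a longer retention window -->
    <cluster name="backupCluster" type="target">
      <validity start="2014-06-01T00:00Z" end="2016-06-01T00:00Z"/>
      <retention limit="months(12)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/clicks/${YEAR}/${MONTH}/${DAY}/${HOUR}"/>
  </locations>
  <ACL owner="etl" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
```

Declaring intent this way, rather than hand-writing separate replication and cleanup jobs per pipeline, is the "seamless management" the abstract refers to.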
Apache Falcon Birds of a Feather Session
3rd–5th June 2014