by Simone Leo
Hadoop is the leading open-source implementation of MapReduce,
Google's paradigm for large-scale distributed computing. Hadoop's native
API is in Java, and its built-in options for Python programming --
Streaming and Jython -- have several drawbacks: the former gives
access to only a small subset of Hadoop's features, while the latter
carries all of the limitations of Jython with respect to
CPython.
Pydoop (http://pydoop.sourceforge.net) is an API for Hadoop that makes
most of its features available to Python programmers while allowing
CPython development. Its core consists of Boost.Python wrappers for
Hadoop's C/C++ interface.
The talk consists of a MapReduce/Hadoop tutorial and a presentation of
the Pydoop API, with the main goal of bridging the gap between the
Hadoop and Python communities. A basic knowledge of distributed
programming is helpful but not strictly required.
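For readers new to the paradigm, the MapReduce model covered in the tutorial can be sketched in plain Python as a word count. This is only an in-memory illustration of the map, group-by-key, and reduce steps; it does not use Hadoop or the Pydoop API, and the names map_words, reduce_counts and run_mapreduce are hypothetical, chosen for this example.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_counts(word, counts):
    """Reduce step: sum all the values emitted for a single key."""
    return word, sum(counts)


def run_mapreduce(lines):
    """Run map, group values by key (the 'shuffle'), then reduce each group.

    A real framework distributes these steps across many machines;
    here everything happens in one process, purely for illustration.
    """
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_words(line):
            groups[key].append(value)
    return dict(reduce_counts(key, values) for key, values in sorted(groups.items()))


word_counts = run_mapreduce(["Hadoop runs MapReduce", "Pydoop wraps Hadoop"])
```

In a framework such as Hadoop, the programmer typically supplies only the map and reduce functions, while the grouping and distribution performed here by run_mapreduce are handled by the framework.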