Twitter's new scalable, fault-tolerant, and simple(ish) stream programming system... with Python!
Storm is a high-volume, continuous, reliable stream processing system developed at BackType and recently open-sourced by Twitter. Though most of the system (and it's documentation) is written in Java-based languages, it is possible to use in a Python environment with Python-based analysis code. At DotCloud (our application-platform-as-a-service) we're doing just that, and we'll be showing how you can too.
We collect a lot of data: we have tens of thousands of customers, many of whom have dozens of services running on our platform, each of which in turn produces dozens of metrics every second. All in all, we're dealing with millions of datapoints per minute. Storm will be the third iteration of our metrics system, an attempt at standardizing a number of previously-distinct pieces of our infrastructural software, to enable automated, real-time reactions to changes in the platform's state.
We'll start by touching on what problems Storm is (and isn't) trying to solve and why it's model is so powerful, informed by our previous attempts to solve the stream processing problem. We'll then move on to a deep dive into how to get Storm up and running with the most Python and least Java-enduced-pain possible and finish up with tips to solve some of the challenges we've encountered while adopting Storm into our Python-based development process.
What is stream processing?
High volume, Continuous, Reliable data analysis
How do people solve this today?
Storm's overall model
Why is this solution better?
The Hard Part, Made Simple:
Build a topology
Code your processors
The Simple Part, Made Hard (made simple):
Clojure (less ugh)
7th–15th March 2012