Your current filters are…
by Erik Meijer
For the past decade, I have been on a quest to democratize developing data-intensive distributed applications. My secret weapon to slay the complexity dragon has been category theory and monads, but in particular the concept of duality. As it turns out, the data domain is an extremely rich source of all kinds of interesting dualities. These dualities are not just theoretical curiosities, but actually solve many practical problems and help to uncover deep similarities between concepts that at first look totally unrelated.In this talk I will illustrate several of the dualities I have encountered during my journey, and show how this resulted in a novel “A co-Relational Model of Data for Large Shared Data Banks”.
by Nathan Marz
Storm makes it easy to write and scale complex realtime computations on a cluster of computers, doing for realtime processing what Hadoop did for batch processing. Storm guarantees that every message will be processed. And it’s fast — you can process millions of messages per second with a small cluster. Best of all, you can write Storm topologies using any programming language.
Storm has a wide range of use cases. The basic use case is “stream processing”: processing a stream of new data and updating databases in realtime. Unlike the standard approach of doing stream processing with queues and workers, Storm is fault-tolerant and scalable.
Another use case is “continuous computation”: streaming the results of a query to clients to visualize in realtime. An example is streaming trending topics on Twitter into browsers.
A third use case is “distributed RPC”: computing an intense query on the fly in parallel. With distributed RPC, a Storm topology is a distributed function that you can invoke like a normal function.
In this talk, I’ll release Storm as open-source. I’ll show how Storm’s simple programming model makes realtime computation easy, robust, and even fun.
The R programming language has become a standard environment for statistical computing, but out of the box R is restricted to analysis on data sets that fit in memory. Hadoop has become a popular platform for storing and analyzing data sets that are too large to fit on a single machine. Not surprisingly, there’s significant interest in bringing these two platforms together to perform sophisticated analysis on data that’s too large to fit in memory on a single machine. Although there are several systems being developed by the R community to support this such as Ricardo and RHIPE, as well as newer interfaces such as Segue and Hadoop InteractiVE, there’s still considerable confusion as to how to effectively use these two systems together. This talk will provide a survey of available R/Hadoop interfaces and use an example use case to provide a comparison between systems. We’ll also discuss problems that are a good fit for distributed analysis with R, and those that aren’t.
18th–20th September 2011