Sessions at Strange Loop 2011 about Big Data


Monday 19th September 2011

  • Category Theory, Monads, and Duality in (Big) Data

    by Erik Meijer

    For the past decade, I have been on a quest to democratize the development of data-intensive distributed applications. My secret weapon for slaying the complexity dragon has been category theory and monads, and in particular the concept of duality. As it turns out, the data domain is an extremely rich source of all kinds of interesting dualities. These dualities are not just theoretical curiosities; they solve many practical problems and help uncover deep similarities between concepts that at first look totally unrelated.

    In this talk I will illustrate several of the dualities I have encountered on my journey, and show how they resulted in the paper “A co-Relational Model of Data for Large Shared Data Banks”.
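    The central duality the abstract alludes to is between pull-based enumeration (the consumer asks for each item, as in iterating over a key-value store) and push-based observation (the producer calls the consumer back as items arrive, as in a live data stream). A minimal sketch in Python, with hypothetical function names chosen here for illustration:

    ```python
    # Pull: the consumer drives, repeatedly asking the source for the next item.
    def pull_squares(items):
        """Interactive style: the caller iterates and collects results."""
        return [x * x for x in items]

    # Push: the producer drives, invoking callbacks as each item is ready.
    # This is the dual interface: the argument/result arrows are reversed.
    def push_squares(items, on_next, on_completed):
        """Reactive style: the source notifies the consumer."""
        for x in items:
            on_next(x * x)
        on_completed()

    results = []
    push_squares([1, 2, 3], results.append, lambda: results.append("done"))
    # pull_squares([1, 2, 3]) yields the same squares the callbacks received.
    ```

    The same computation is expressed both ways; which side "drives" is exactly what flips when the interfaces are dualized.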

    At 8:30am to 9:20am, Monday 19th September

    In Grand Ballroom, Hilton St. Louis at the Ballpark


  • Storm: Twitter's scalable realtime computation system

    by Nathan Marz

    Storm makes it easy to write and scale complex realtime computations on a cluster of computers, doing for realtime processing what Hadoop did for batch processing. Storm guarantees that every message will be processed. And it’s fast — you can process millions of messages per second with a small cluster. Best of all, you can write Storm topologies using any programming language.

    Storm has a wide range of use cases. The basic use case is “stream processing”: processing a stream of new data and updating databases in realtime. Unlike the standard approach of doing stream processing with queues and workers, Storm is fault-tolerant and scalable.
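    The shape of such a stream-processing job (a source emitting tuples, a tokenizing step, and a stateful counting step) can be illustrated with a toy single-process word count. This is a conceptual sketch only, not Storm's actual API; in Storm the three stages would be a spout and two bolts distributed across a cluster:

    ```python
    from collections import Counter

    def split_bolt(sentence):
        """Tokenize one incoming sentence into words (the 'split' stage)."""
        return sentence.lower().split()

    def run_topology(sentences):
        """Wire source -> split -> count in-process, mimicking a topology."""
        counts = Counter()
        for sentence in sentences:            # the "spout": stream source
            for word in split_bolt(sentence): # the tokenizing bolt
                counts[word] += 1             # the stateful counting bolt
        return counts

    counts = run_topology(["the quick fox", "the lazy dog"])
    ```

    Storm's contribution is making this pipeline fault-tolerant and horizontally scalable, which the hand-rolled queues-and-workers version is not.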

    Another use case is “continuous computation”: streaming the results of a query to clients to visualize in realtime. An example is streaming trending topics on Twitter into browsers.

    A third use case is “distributed RPC”: computing an intense query on the fly in parallel. With distributed RPC, a Storm topology is a distributed function that you can invoke like a normal function.
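    The distributed-RPC idea (fan a single query out over many workers in parallel, then combine the partial answers behind an ordinary function call) can be approximated on one machine with a thread pool. Again a hedged analogy, not Storm's API; in Storm the shards would be bolts on a cluster:

    ```python
    from concurrent.futures import ThreadPoolExecutor

    def distributed_rpc(shard_fn, shards, combine):
        """Run shard_fn over every shard in parallel, then combine results.
        The caller sees only a plain function invocation."""
        with ThreadPoolExecutor() as pool:
            partials = list(pool.map(shard_fn, shards))
        return combine(partials)

    # Example query: total audience size across sharded follower lists.
    shards = [["a", "b"], ["c"], ["d", "e", "f"]]
    total = distributed_rpc(len, shards, sum)  # invoked like a normal function
    ```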

    In this talk, I’ll release Storm as open-source. I’ll show how Storm’s simple programming model makes realtime computation easy, robust, and even fun.

    At 10:30am to 11:20am, Monday 19th September

    In Grand Ballroom, Hilton St. Louis at the Ballpark

Tuesday 20th September 2011

  • Distributed Data Analysis with Hadoop and R

    by Jonathan Seidman and Ramesh Venkataramaiah

    The R programming language has become a standard environment for statistical computing, but out of the box R is restricted to analysis of data sets that fit in memory. Hadoop has become a popular platform for storing and analyzing data sets that are too large to fit on a single machine. Not surprisingly, there's significant interest in bringing these two platforms together to perform sophisticated analysis on data that's too large to fit in memory on a single machine.

    Although several systems are being developed by the R community to support this, such as Ricardo and RHIPE, as well as newer interfaces such as Segue and Hadoop InteractiVE, there's still considerable confusion about how to use the two systems together effectively. This talk will provide a survey of available R/Hadoop interfaces and use an example use case to compare the systems. We'll also discuss which problems are a good fit for distributed analysis with R, and which aren't.
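    All of the R/Hadoop interfaces mentioned ultimately wrap the same map/reduce pattern: a map phase emits key-value pairs from each record, and a reduce phase groups by key and aggregates, so no single machine ever holds the full data set. A minimal sketch of that pattern (shown in Python rather than R, and not tied to any particular interface):

    ```python
    from collections import defaultdict

    def map_phase(records):
        """Emit (key, value) pairs from each input record."""
        for line in records:
            for word in line.split():
                yield (word, 1)

    def reduce_phase(pairs):
        """Group values by key and aggregate -- the step an in-memory
        session cannot perform once the data exceeds RAM."""
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return {key: sum(values) for key, values in grouped.items()}

    result = reduce_phase(map_phase(["big data", "big memory"]))
    ```

    On a real cluster the framework shuffles the pairs between the two phases; the interfaces surveyed in the talk differ mainly in how they let you write the two functions in R.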

    At 9:30am to 10:20am, Tuesday 20th September

    In Lindbergh, Hilton St. Louis at the Ballpark
