Sessions at Strata 2012 about R

Tuesday 28th February 2012

  • Introduction to R for Data Mining

    by Joseph B Rickert

    This tutorial will enable anyone with some programming experience to begin analyzing data with the R programming language.

    Syllabus

    • Where did R come from?
    • What makes R different from other statistical software?
    • Data structures in R
    • Reading and writing data sets
    • Manipulating Data
    • Basic statistics in R
    • Exploratory Data Analysis
    • Multiple Regression
    • Logistic Regression
    • Data mining in R
    • Cluster analysis
    • Classification algorithms
    • Working with Big Data
    • Challenges
    • Extensions to R for big data
    • Where to go from here?
    • The R community
    • Resources for learning R
    • Getting help
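
    To give a flavour of a few of the syllabus topics above (reading data, basic statistics, regression, clustering), here is a minimal illustrative R sketch; the file name and column names are hypothetical placeholders, not material from the tutorial:

        # Illustrative one-liners for a few syllabus topics
        # (file and column names below are hypothetical placeholders)
        df <- read.csv("sales.csv")                 # reading a data set
        summary(df)                                 # basic statistics / exploratory look
        fit.lm  <- lm(revenue ~ price + region, data = df)      # multiple regression
        fit.glm <- glm(churned ~ tenure + usage, data = df,
                       family = binomial)                        # logistic regression
        clusters <- kmeans(scale(df[, c("price", "usage")]),
                           centers = 3)                          # cluster analysis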

    At 9:00am to 12:30pm, Tuesday 28th February

    In Ballroom G, Santa Clara Convention Center

  • The Two Most Important Algorithms in Predictive Modeling Today

    by Mike Bowles and Jeremy Howard

    When doing predictive modelling, there are two situations in which you might find yourself:

    1. You need to fit a well-defined parameterised model to your data, so you require a learning algorithm which can find those parameters on a large data set without over-fitting.
    2. You just need a “black box” which can predict your dependent variable as accurately as possible, so you need a learning algorithm which can automatically identify the structure, interactions, and relationships in the data.
    For case (1), lasso and elastic-net regularized generalized linear models are a set of modern algorithms which meet all these needs. They are fast, work on huge data sets, and avoid over-fitting automatically. They are available in the “glmnet” package in R.
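
    A minimal sketch of what case (1) can look like with the glmnet package; the simulated data and the alpha/lambda choices are illustrative assumptions, not material from the session:

        library(glmnet)

        # Illustrative data: 1,000 observations, 50 predictors, sparse true signal
        set.seed(42)
        x <- matrix(rnorm(1000 * 50), ncol = 50)
        y <- x[, 1] - 2 * x[, 2] + rnorm(1000)

        # alpha = 1 gives the lasso; values between 0 and 1 give the elastic net
        fit <- cv.glmnet(x, y, alpha = 0.5)

        # Coefficients at the lambda chosen by cross-validation; regularization
        # shrinks most of them to exactly zero, guarding against over-fitting
        coef(fit, s = "lambda.min")

        # Predictions for new data (first five rows, for illustration)
        predict(fit, newx = x[1:5, ], s = "lambda.min")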

    For case (2), ensembles of decision trees (often known as “Random Forests”) have been the most successful general-purpose algorithm in modern times. For instance, most Kaggle competitions have at least one top entry that heavily uses this approach. This algorithm is very simple to understand, and is fast and easy to apply. It is available in the “randomForest” package in R.
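
    For case (2), a comparable sketch with the randomForest package, using R's built-in iris data purely as a stand-in example (not the session's own data):

        library(randomForest)

        # Classify iris species from four flower measurements
        set.seed(42)
        fit <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)

        # The out-of-bag error estimate gives a built-in check against over-fitting
        print(fit)

        # Which variables matter most?
        importance(fit)

        # Predict on new observations (here, the first five rows for illustration)
        predict(fit, newdata = iris[1:5, ])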

    Mike and Jeremy will explain in simple terms, using no complex math, how these algorithms work, and will show through numerous examples how to apply them in R. They will also provide advice on how to choose between these algorithms, how to prepare the data, and how to use the trained models in practice.

    At 1:30pm to 5:00pm, Tuesday 28th February

    In Ballroom CD, Santa Clara Convention Center

Wednesday 29th February 2012

  • RHadoop, R meets Hadoop

    by Antonio Piccolboni

    RHadoop is an open source project spearheaded by Revolution Analytics to grant data scientists access to Hadoop’s scalability from their favorite language, R. RHadoop comprises three packages:

    • rhdfs provides file-level manipulation for HDFS, the Hadoop file system
    • rhbase provides access to HBase, the Hadoop database
    • rmr allows writing MapReduce programs in R

    rmr allows R developers to program in the MapReduce framework, and offers all developers an alternative way to implement MapReduce programs that strikes a delicate compromise between power and usability. It lets you write general MapReduce programs while offering the full power and ecosystem of an existing, established programming language. It doesn’t force you to replace the R interpreter with a special run-time; it is just a library. You can write logistic regression in half a page and even understand it. It feels and behaves almost like the usual R iteration and aggregation primitives. It is a handful of functions with a modest number of arguments and sensible defaults that combine in many useful ways. But there is no way to prove that an API works: one can only show examples of what it allows you to do, and we will do that, covering a few from machine learning and statistics. Finally, we will discuss how to get involved.
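
    As an illustration of the "handful of functions" style described above, here is a minimal sketch of an rmr job. It assumes the mapreduce(), keyval(), to.dfs() and from.dfs() functions exposed by the rmr package; argument names and return shapes may differ between rmr releases, and this is not the session's own code:

        library(rmr)   # part of RHadoop; later releases ship as rmr2

        # Illustrative job: sum the squares of 1..1000, grouped by residue mod 10
        input <- to.dfs(1:1000)          # push a small in-memory vector to HDFS

        out <- mapreduce(
          input  = input,
          map    = function(k, v) keyval(v %% 10, v^2),        # emit (residue, square)
          reduce = function(k, vv) keyval(k, sum(unlist(vv)))  # sum squares per residue
        )

        # Pull the result back into the R session
        result <- from.dfs(out)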

    At 10:40am to 11:20am, Wednesday 29th February

    In GA J, Santa Clara Convention Center

  • Building a Data Narrative: Discovering Haight Street

    by Jesper Andersen

    Data isn’t just for supporting decisions and creating actionable interfaces. Data can create nuance, giving new understandings that lead to further questioning rather than just actionable decisions. In particular, curiosity and creative thinking can be driven by combining different data sets and techniques to develop a narrative that tells the story of a place: the emotions, history, and change embedded in the experience of the place.

    In this session, we’ll see how far we can go in exploring one street in San Francisco, Haight Street, and see how well we can understand its geography, ebbs and flows, and behavior by combining as many data sources as possible. We’ll integrate basic public data from the city, street and mapping data from OpenStreetMap, real estate and rental listings data, data from social services like Foursquare, Yelp and Instagram, and analyze photographs of streets from mapping services to create a holistic view of one street and see what we can understand from it. We’ll show how you can summarize this data numerically, textually, and visually, using a number of simple techniques.

    We’ll cover how traditional data analysis tools like R and NumPy can be combined with tools more often associated with robotics, like OpenCV (computer vision), to create a more complete data set. We’ll also cover how traditional data visualization techniques can be combined with mapping and augmented reality to present a more complete picture of any place, including Haight Street.

    At 1:30pm to 2:10pm, Wednesday 29th February

    In Ballroom AB, Santa Clara Convention Center

    Coverage: slide deck