Sessions at Strata 2012 about Machine Learning on Tuesday 28th February

  • The Model and the Train Wreck: A Training Data How-to

    by Monica Rogati

    Getting training data for a recommender system is easy: if users clicked it, it’s a positive – if they didn’t, it’s a negative. … Or is it? In this talk, we use examples from production recommender systems to bring training data to the forefront: from overcoming presentation bias to the art of crowdsourcing subjective judgments to creative data exhaust exploitation and feature creation.
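
    The naive labeling scheme the abstract questions is easy to write down, which is part of the trap. As a hedged sketch in R (with hypothetical click-log columns, not Rogati's actual pipeline), only counting a non-click as a negative when the item was actually shown is one first step toward handling presentation bias:

        # Hypothetical click log: shown = was the item presented to the user?
        log <- data.frame(
          user    = c(1, 1, 2, 2, 3),
          item    = c("a", "b", "a", "c", "b"),
          shown   = c(TRUE, TRUE, TRUE, FALSE, TRUE),
          clicked = c(TRUE, FALSE, FALSE, FALSE, TRUE)
        )
        # Naive labels: clicked = positive, everything else = negative
        log$label_naive <- ifelse(log$clicked, 1, 0)
        # Presentation-aware labels: a non-click is a negative only if the
        # user actually saw the item; otherwise the label is unknown (NA)
        log$label <- ifelse(log$clicked, 1, ifelse(log$shown, 0, NA))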

    At 11:00am to 11:30am, Tuesday 28th February

    In Ballroom AB, Santa Clara Convention Center

Coverage: slide deck

  • The Importance of Importance: An Introduction to Feature Selection

    by Ben Gimpert

    Twenty-first century big data is being used to train predictive models of emotional sentiment, customer churn, patient health, and other behavioral complexities. Variable importance and feature selection reduce the dimensionality of our models, so an infeasible, complex problem may become somewhat more predictable.
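
    To make the idea concrete (this is an illustration, not the speaker's method), a simple univariate filter in R ranks features by absolute correlation with the target and keeps only the strongest few; the data and the cutoff of five are arbitrary assumptions:

        set.seed(42)
        x <- as.data.frame(matrix(rnorm(500 * 30), nrow = 500))  # 30 candidate features
        y <- 2 * x$V1 - 3 * x$V2 + rnorm(500)                    # only two are informative
        # Score each feature by |correlation| with the target
        scores <- sapply(x, function(col) abs(cor(col, y)))
        # Keep the top 5 features, discarding the remaining dimensions
        keep <- names(sort(scores, decreasing = TRUE))[1:5]
        x_reduced <- x[, keep]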

    At 12:00pm to 12:30pm, Tuesday 28th February

    In Ballroom AB, Santa Clara Convention Center

  • The Two Most Important Algorithms in Predictive Modeling Today

    by Mike Bowles and Jeremy Howard

    When doing predictive modelling, you might find yourself in one of two situations:

    1. You need to fit a well-defined parameterised model to your data, so you require a learning algorithm which can find those parameters on a large data set without over-fitting.
    2. You just need a “black box” which can predict your dependent variable as accurately as possible, so you need a learning algorithm which can automatically identify the structure, interactions, and relationships in the data.
    For case (1), lasso and elastic-net regularized generalized linear models are a set of modern algorithms which meet all these needs. They are fast, work on huge data sets, and avoid over-fitting automatically. They are available in the “glmnet” package in R.
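
    A minimal sketch of the glmnet workflow, with synthetic data standing in for a real problem (the data sizes and the alpha value are illustrative assumptions, not recommendations):

        library(glmnet)
        set.seed(1)
        # Synthetic stand-in data: 1000 observations, 50 features, 5 informative
        x <- matrix(rnorm(1000 * 50), nrow = 1000)
        y <- as.vector(x[, 1:5] %*% rnorm(5) + rnorm(1000))
        # Cross-validated elastic net: alpha = 1 is the lasso, alpha = 0 is ridge
        fit <- cv.glmnet(x, y, alpha = 0.5)
        coef(fit, s = "lambda.min")  # most coefficients are shrunk to exactly zero

    Cross-validation chooses the regularization strength lambda automatically, which is what keeps the fit from over-fitting on large feature sets.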

    For case (2), ensembles of decision trees (often known as “Random Forests”) have been the most successful general-purpose approach in modern times. For instance, most Kaggle competitions have at least one top entry that heavily uses this approach. The algorithm is simple to understand, fast, and easy to apply. It is available in the “randomForest” package in R.
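
    A correspondingly small sketch with the randomForest package, using R's built-in iris data as a stand-in for a real prediction task:

        library(randomForest)
        set.seed(1)
        # "Black box" usage: no feature engineering, just data in, predictions out
        rf <- randomForest(Species ~ ., data = iris, ntree = 500)
        print(rf)                        # includes the out-of-bag error estimate
        varImpPlot(rf)                   # which variables the ensemble relied on
        predict(rf, newdata = iris[1:3, ])

    The out-of-bag error comes free with the ensemble, so a rough accuracy estimate does not require a separate validation split.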

    Mike and Jeremy will explain in simple terms, using no complex math, how these algorithms work, and will show through numerous examples how to apply them in R. They will also give advice on choosing between the two algorithms, preparing the data, and using the trained models in practice.

    At 1:30pm to 5:00pm, Tuesday 28th February

    In Ballroom CD, Santa Clara Convention Center