A Tutorial Introduction to Best Practices for Building and Deploying Predictive Models over Big Data

A session at Strata New York 2012

  • Collin Bennett
  • Robert L. Grossman

Tuesday 23rd October, 2012

1:30pm to 5:00pm (EST)

In this tutorial, we show how open source tools can be used for the entire life cycle of a predictive model built over big data. Specifically, for anyone who has built a model, we show how to: 1) perform an exploratory data analysis (EDA) of data managed by Hadoop using R and other open source tools; 2) leverage the EDA to build analytic and statistical models over data managed by Hadoop; 3) deploy these models into operational systems; and 4) measure the performance of the models and continuously improve them.
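The EDA step above follows the usual map-reduce pattern: a map phase emits key-value pairs from raw records, and a reduce phase aggregates summary statistics per key. As a language-neutral sketch of that pattern (the tutorial itself uses R with Hadoop; the field names and comma-separated format here are illustrative assumptions, not from the tutorial materials):

```python
# Sketch of a Hadoop-Streaming-style EDA job: per-key count and mean.
# The "group,value" record format is an assumption for illustration.
# In a real Hadoop Streaming job, map_records would read stdin and
# reduce_pairs would receive its input sorted by key.

from collections import defaultdict

def map_records(lines):
    """Map phase: emit (key, value) pairs from 'group,value' lines."""
    for line in lines:
        group, value = line.strip().split(",")
        yield group, float(value)

def reduce_pairs(pairs):
    """Reduce phase: aggregate count and mean per key."""
    totals = defaultdict(lambda: [0, 0.0])  # key -> [count, sum]
    for key, value in pairs:
        totals[key][0] += 1
        totals[key][1] += value
    return {k: (n, s / n) for k, (n, s) in totals.items()}

if __name__ == "__main__":
    data = ["a,1.0", "a,3.0", "b,2.0"]
    print(reduce_pairs(map_records(data)))
```

The same two-phase structure underlies the RHIPE and R+Hadoop approaches covered in the session, with the map and reduce functions written in R instead.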

We cover the following topics:

  • Three simple techniques for exploratory data analysis (EDA) over Hadoop
  • Four ways to make Hadoop and R interoperate, including RHIPE and R+Hadoop
  • Building analytic models over Hadoop using R and other open source tools
  • Why you should use multiple models (segmented models and ensembles of models) when building models over Hadoop
  • Languages for describing predictive models, including the Predictive Model Markup Language
  • Model producers and model consumers (scoring engines)
  • Integrating scoring engines into operational systems
  • Evaluating the effectiveness of a model
  • The continuous improvement of a model
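To give a flavor of the model-description languages mentioned above, a minimal PMML document for a one-variable linear regression might look like the following. This is a hand-written sketch, not taken from the tutorial materials; the field names and coefficients are illustrative:

```xml
<PMML xmlns="http://www.dmg.org/PMML-4_1" version="4.1">
  <Header description="Toy linear regression: y = 1.0 + 2.0 * x"/>
  <DataDictionary numberOfFields="2">
    <DataField name="x" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel modelName="toy" functionName="regression">
    <MiningSchema>
      <MiningField name="x"/>
      <MiningField name="y" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="1.0">
      <NumericPredictor name="x" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
```

A model producer (such as R's pmml package) exports a document like this, and a scoring engine consumes it to score new records inside an operational system, which is the producer/consumer split listed in the topics above.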

About the speakers

Collin Bennett

Principal, Open Data Group

Robert L. Grossman
