Building Large Scale Machine Learning Applications with Pipelines

A session at Spark Summit 2015

  • Shivaram Venkataraman

Tuesday 16th June, 2015

3:00pm to 3:30pm

Real-world machine learning applications typically consist of many components in a data processing pipeline. For example, in text classification, preprocessing steps such as n-gram extraction and TF-IDF feature weighting are often necessary before training a classification model such as an SVM. We describe a framework for constructing these ML Pipelines and show how it helps us build end-to-end workflows from a toolbox of off-the-shelf components that we have developed for text and image classification, together with a high-performance linear algebra library that we use for training models. We show that with this framework we can get state-of-the-art results on many machine learning tasks. Our scalable implementation on Spark outperforms supercomputing installations and can match deep learning error rates on speech recognition in less than 1 hour on EC2 for $20. Finally, we discuss research in the AMPLab to support common iterative machine learning workflows through careful resource estimation and checkpoint planning.
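To make the text classification example concrete, here is a minimal single-machine sketch of chaining n-gram extraction, TF-IDF weighting and a linear classifier into one pipeline. It is not the framework's API: the names ngrams, fit_tfidf, fit_perceptron and pipeline are hypothetical, the training sentences are made up for illustration, a perceptron stands in for the SVM, and nothing here runs on Spark.

```python
import math
from collections import Counter
from functools import reduce


def ngrams(n):
    """Return a stage mapping a token list to all of its 1..n-grams."""
    def stage(tokens):
        return [" ".join(tokens[i:i + k])
                for k in range(1, n + 1)
                for i in range(len(tokens) - k + 1)]
    return stage


def fit_tfidf(docs):
    """Estimate smoothed IDF weights from docs; return a TF-IDF stage."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}

    def stage(terms):
        tf = Counter(terms)
        # Terms never seen during fitting are dropped.
        return {t: tf[t] * idf[t] for t in tf if t in idf}
    return stage


def fit_perceptron(features, labels, epochs=10):
    """Train a linear classifier (a perceptron standing in for an SVM)."""
    w = Counter()
    for _ in range(epochs):
        for x, y in zip(features, labels):  # y is +1 or -1
            score = sum(w[t] * v for t, v in x.items())
            if y * score <= 0:  # misclassified: move weights toward y
                for t, v in x.items():
                    w[t] += y * v

    def stage(x):
        return 1 if sum(w[t] * v for t, v in x.items()) > 0 else -1
    return stage


def pipeline(*stages):
    """Chain stages left to right into a single callable."""
    return lambda x: reduce(lambda acc, f: f(acc), stages, x)


# Illustrative toy data (an assumption, not from the talk).
train = [("great fast spark job", 1), ("slow broken job", -1),
         ("fast great results", 1), ("broken slow cluster", -1)]
labels = [y for _, y in train]

tokenize = str.split
to_terms = pipeline(tokenize, ngrams(2))
term_docs = [to_terms(text) for text, _ in train]

tfidf = fit_tfidf(term_docs)
classifier = fit_perceptron([tfidf(d) for d in term_docs], labels)

# The fitted stages compose into one end-to-end workflow.
classify = pipeline(tokenize, ngrams(2), tfidf, classifier)
print(classify("great fast job"))
```

Each stage is an ordinary function, so a fitted stage and an unfitted one compose the same way; that is the property that lets off-the-shelf components be swapped in and out of a workflow.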

About the speakers

Evan Sparks

Graduate Student Researcher at UC Berkeley

Shivaram Venkataraman

Time 3:00pm to 3:30pm PST

Date Tue 16th June 2015
