Data Science with Apache Spark (Advanced)

A session at Spark Summit 2015

Wednesday 17th June, 2015

9:00am to 6:00pm

Data Science applications built on Apache Spark combine Spark's scalability with distributed machine learning algorithms. This material expands on the “Intro to Apache Spark” workshop. Lessons focus on industry use cases for machine learning at scale, coding examples based on public data sets, and the use of cloud-based notebooks within a team context. The session includes limited free accounts on Databricks Cloud.

Topics covered include:

  • Data transformation techniques based on both Spark SQL and functional programming in Scala and Python
  • Predictive analytics based on MLlib: clustering with KMeans, building classifiers with a variety of algorithms, and text analytics, all with emphasis on an iterative cycle of feature engineering, modeling, and evaluation
  • Visualization techniques (matplotlib, ggplot2, D3, etc.) to surface insights
  • Insight from the designers of MLlib into how primitives such as Matrix Factorization are implemented in a distributed parallel framework
  • Several hands-on exercises using datasets
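
As a rough illustration of what "implemented in a distributed parallel framework" can mean for an algorithm such as KMeans, the sketch below runs the standard KMeans update in a map/reduce style on a single machine. It is not MLlib's implementation or API. The function names (closest, kmeans_step, kmeans), the toy points, and the use of plain Python lists as stand-ins for data partitions are all assumptions made for illustration.

```python
from functools import reduce


def closest(point, centers):
    """Return the index of the center nearest to point (squared Euclidean)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def kmeans_step(partitions, centers):
    """One KMeans iteration, written as per-partition work plus a merge."""

    # "Map" side: each partition independently builds {cluster: (sum, count)}.
    def summarize(partition):
        acc = {}
        for point in partition:
            i = closest(point, centers)
            s, n = acc.get(i, ([0.0] * len(point), 0))
            acc[i] = ([a + b for a, b in zip(s, point)], n + 1)
        return acc

    # "Reduce" side: merge two partial summaries by adding sums and counts.
    def merge(a, b):
        out = dict(a)
        for i, (s, n) in b.items():
            if i in out:
                s0, n0 = out[i]
                out[i] = ([x + y for x, y in zip(s0, s)], n0 + n)
            else:
                out[i] = (s, n)
        return out

    totals = reduce(merge, map(summarize, partitions), {})

    # New center = mean of assigned points; an empty cluster keeps its center.
    return [
        [x / totals[i][1] for x in totals[i][0]] if i in totals else list(c)
        for i, c in enumerate(centers)
    ]


def kmeans(partitions, centers, iterations=10):
    """Repeat the update step a fixed number of times."""
    for _ in range(iterations):
        centers = kmeans_step(partitions, centers)
    return centers


# Toy data split into two hypothetical partitions.
partitions = [
    [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0)],
    [(2.0, 1.0), (8.0, 9.0), (9.0, 8.0)],
]
final_centers = kmeans(partitions, [(0.0, 0.0), (10.0, 10.0)])
```

Only the small per-cluster sums and counts cross partition boundaries, never the raw points. That property is what makes this style of update a natural fit for a cluster framework.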


Prerequisites:

  • Intro to Apache Spark workshop or equivalent (e.g., Spark Developer Certificate)
  • Experience coding in Scala, Python, SQL
  • Some familiarity with Data Science topics (e.g., business use cases)

Time 9:00am to 6:00pm PST

Date Wed 17th June 2015
