Data Storage Tips for Optimal Spark Performance

A session at Spark Summit 2015

Tuesday 16th June, 2015

5:00pm to 5:30pm

Spark can analyze data stored in files in many different formats: plain text, JSON, XML, Parquet, and more. But just because you can get a Spark job to run on a given input format doesn't mean you'll get the same performance with all of them. In fact, the performance difference can be substantial. This talk will cover some common data input formats and the nuances of working with each one. The goal is to help Spark programmers make better-informed decisions about how to store their data. Here are some examples of the topics that will be covered:

  • Issues you'll encounter when processing excessively large XML input files
  • Why choose Parquet files for Spark SQL?
  • How coalescing many small files may give you better performance

About the speaker

Vida Ha

Time 5:00pm to 5:30pm PST

Date Tue 16th June 2015
