Any data scientist who works with real data will tell you that the hardest part of any data science task is data preparation. From cleaning dirty data to understanding where values are missing and how the data is shaped, the care and feeding of your data is a prime task for the working data scientist.
I will describe my experiences in the field and present an open source utility, written with Apache Spark, that automates some of the necessary (but not sufficient) checks I run every time I'm presented with new data. In particular, we'll talk about discovering missing values, spotting skewed distributions, and flagging likely errors within your data.
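To make the kinds of checks described above concrete, here is a minimal sketch of column profiling: counting missing values and computing sample skewness. This is plain Python for illustration only; the talk's utility is built on Apache Spark, and the function name and return shape here are hypothetical, not the utility's actual API.

```python
def profile_column(values):
    """Profile one column: missing-value rate and sample skewness.

    A minimal, Spark-free sketch of the checks described in the talk;
    the name and structure are illustrative, not the utility's API.
    """
    n_total = len(values)
    present = [v for v in values if v is not None]
    missing_rate = 1 - len(present) / n_total if n_total else 0.0

    # Sample skewness g1 = m3 / m2^(3/2), where m_k is the k-th central
    # moment of the observed values; a large |g1| flags a skewed
    # distribution worth inspecting before modeling.
    n = len(present)
    mean = sum(present) / n
    m2 = sum((v - mean) ** 2 for v in present) / n
    m3 = sum((v - mean) ** 3 for v in present) / n
    skewness = m3 / m2 ** 1.5 if m2 > 0 else 0.0

    return {"missing_rate": missing_rate, "skewness": skewness}
```

In a Spark version, the same moments can be accumulated per column in a single pass over a DataFrame, so the cost of profiling stays at one scan of the data regardless of how many checks are run.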
Principal Architect, Hortonworks
I am a committer and PMC member on the Apache Metron project in the engineering team at Hortonworks. In the past, I've worked as an architect and senior engineer at a healthcare informatics startup spun out of the Cleveland Clinic, as a developer at Oracle, and as a research geophysicist in the oil and gas industry.
I specialize in writing software and solving problems where there are scalability concerns due to large amounts of traffic or large amounts of data. I have a particular passion for data science problems and anything mathematical.