by Q Ethan McCallum
Welcome to data science’s dirty little secret: data is messy, and it’s your problem.
It’s bad enough that data comes from myriad sources and in a dizzying variety of formats. Malformed files, missing values, inconsistent and arcane formats, and a host of other issues all conspire to keep you away from your intended purpose: getting meaningful insight out of your data. Before you can touch any algorithms, before you feed any regressions, you’re going to have to roll up your sleeves and whip that data into shape.
Q Ethan McCallum, technology consultant and author of Parallel R (O’Reilly), will explore common pitfalls of this data munging and share solutions from his personal playbook. Most of all, he’ll show you how to do this quickly and effectively, so you can get back to the real work of analyzing your data.
Where are all the coffee shops in my neighborhood?
Seemingly easy questions can become complex when you consider ambiguity. This one sounds simple until you consider that folks may define “coffee shop” differently and draw the boundaries of your “neighborhood” differently. One person’s Central Austin may be someone else’s South Dallas.
Instead of working too hard to define the parameters in an attempt to remove the ambiguity completely, how about we look at what people do, interact with, and talk about? We can watch what people do and decide from there what a coffee shop is and where the boundaries of your neighborhood are. It might not be the “truth”, but it can be darn close.
When we learn to embrace ambiguity, not only can we still find the answers to our questions, but we can also find answers to questions we hadn’t even thought to ask.
28th February to 1st March 2012