by Q Ethan McCallum
Welcome to data science’s dirty little secret: data is messy, and it’s your problem.
It’s bad enough that data comes from myriad sources and in a dizzying variety of formats. Malformed files, missing values, inconsistent and arcane formats, and a host of other issues all conspire to keep you away from your intended purpose: getting meaningful insight out of your data. Before you can touch any algorithms, before you fit any regressions, you’re going to have to roll up your sleeves and whip that data into shape.
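To make that concrete, here is a minimal Python sketch of the kind of cleanup involved, using pandas. The file name, column names, and null markers are hypothetical stand-ins for whatever your own messy source looks like, not anything from the talk itself.

    import pandas as pd

    # Hypothetical messy input; the file and column names are stand-ins.
    df = pd.read_csv(
        "survey.csv",
        on_bad_lines="skip",                     # skip malformed rows instead of crashing
        na_values=["", "N/A", "n/a", "?", "-"],  # map arcane null markers to NaN
    )

    # Normalize inconsistent date formats; unparseable entries become NaT.
    df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

    # Strip stray whitespace and unify case in a free-text column.
    df["city"] = df["city"].str.strip().str.title()

    # Fill missing numeric values with the column median.
    df["age"] = df["age"].fillna(df["age"].median())

None of it is glamorous, but each line clears one of the obstacles above before the analysis starts.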
Q Ethan McCallum, technology consultant and author of Parallel R (O’Reilly), will explore common pitfalls of this data munging and share solutions from his personal playbook. Most of all, he’ll show you how to do this quickly and effectively, so you can get back to the real work of analyzing your data.
Where are all the coffee shops in my neighborhood?
Seemingly easy questions can become complex once you allow for ambiguity. This one sounds simple until you realize that people may define “coffee shop” differently and draw the boundaries of “neighborhood” differently. One person’s Central Austin may be someone else’s South Dallas.
Instead of working too hard to pin down the parameters in an attempt to remove the ambiguity completely, we can look at what people do, interact with, and talk about. We can watch what people do and decide from there what a coffee shop is and where the boundaries of your neighborhood lie. It might not be the “truth”, but it can be darn close.
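One way to “watch what people do” is to cluster behavioral traces instead of consulting an official registry. The sketch below is a rough illustration of that idea rather than the speaker’s actual method: it clusters hypothetical geotagged check-ins at venues users themselves tagged as coffee shops, using scikit-learn’s DBSCAN, with made-up coordinates and a made-up distance threshold.

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Hypothetical check-ins (latitude, longitude) at venues that users
    # themselves tagged "coffee shop" -- here, behavior defines the category.
    checkins = np.array([
        [30.2951, -97.7414],
        [30.2958, -97.7420],
        [30.2963, -97.7408],
        [30.2672, -97.7431],  # a lone check-in far from the rest
    ])

    # Group check-ins that fall within roughly 300 m of one another.
    # eps is in degrees (0.003 deg of latitude is about 330 m), a crude
    # approximation that ignores longitude scaling.
    labels = DBSCAN(eps=0.003, min_samples=2).fit_predict(checkins)

    # Non-negative labels mark data-driven pockets of coffee-shop activity;
    # -1 marks outliers that belong to no such pocket.
    for point, label in zip(checkins, labels):
        print(point, "->", "cluster" if label >= 0 else "outlier", label)

The clusters that emerge are the coffee-shop areas people actually use, whatever name the map gives the neighborhood.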
When we learn to embrace ambiguity, not only can we still find the answers to our questions, but we can also find answers to questions we hadn’t even thought to ask.
In a research environment, under the current system, most data and figures collected or generated during your work are lost: intentionally tossed aside or classified as “junk”, or at worst trapped in silos or locked behind embargo periods. This stifles scientific research at its core, making it much more difficult to validate or reproduce experiments, or even to stumble upon new breakthroughs that may be buried in your null results.
Changing this reality takes not only the right tools and technology to store, sift, and publish data, but also a shift in the way we think of and value data as a scientific contribution in the research process. In the digital age, we’re not bound by the physical limitations of analog media such as the traditional scientific journal or research paper, nor should our data be locked into understandings based on those media.
This session will look at the socio-cultural context of data science in the research environment, specifically the importance of publishing negative results through tools like FigShare, an open data project that fosters data publication: not just the supplementary information tied to a paper, but all of the back-end information needed to reproduce and validate the work, negative results included. We’ll hear about the broader cultural shift needed in how we incentivise better practices in the lab, and how companies like Digital Science are working to use technology to push those levers and address the social issue. The session will also include a look at the real-world implications for clinical research and medicine from Ben Goldacre, an epidemiologist who has been examining not only the ethical consequences but also issues of efficacy and validation.
28th February to 1st March 2012