by Solon Barocas, Betsy Masiello and Jane Yakowitz
Analytics can push the frontier of knowledge well beyond the useful facts that already reside in big data, revealing latent correlations that empower organizations to make statistically motivated guesses—inferences—about the character, attributes, and future actions of their stakeholders and the groups to which they belong.
This is cause for both celebration and caution. Analytic insights can add to the stock of scientific and social scientific knowledge, significantly improve decision-making in both the public and private sectors, and greatly enhance individual self-knowledge and understanding. They can even lead to entirely new classes of goods and services, providing value to institutions and individuals alike. But they also invite new applications of data that involve serious hazards.
This panel considers these hazards, asking what analytics implicate for individuals and for the groups to which they belong.
The panel will also debate the appropriate response to these issues, reviewing the place of norms, policies, legal frameworks, regulation, and technology.
by Ken Farmer
While most of the focus in data science is on the rapid analysis of vast volumes of data, the hardest part of most solutions is data acquisition, movement, transformation, and loading – the “data logistics”.
Since the early days of data mining and data warehousing (in the late 80s and early 90s), it has been understood that 90% of the effort on these projects is spent on data acquisition, cleansing, transformation, and consolidation. The challenges span every stage of that pipeline.
The data warehousing domain refers to data logistics as “ETL”, for Extract, Transform, and Load. Some best practices and methods have emerged to address these challenges, but little effort has gone into reusable patterns; most effort has gone into commercial products instead. In spite of this lack of formal patterns, a sense of what works and what doesn’t has developed – and it can be read “between the lines” if you know what to look for.
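As a concrete illustration of the pattern (a minimal sketch, not material from the talk), the following Python fragment walks one record set through the three ETL stages. The sample feed, table schema, and function names are all invented for this example:

```python
# A minimal sketch of the Extract-Transform-Load pattern described above.
# The feed, schema, and function names are hypothetical; real pipelines
# add validation, error handling, incremental loads, and auditing.
import csv
import io
import sqlite3

# A stand-in for a source system: an in-memory CSV feed with messy values.
SAMPLE_FEED = """customer_id,name,signup_date
101, Alice Smith ,2011-09-22
102,Bob Jones,2011-09-23
"""

def extract(feed):
    """Extract: read raw rows from the source feed."""
    return list(csv.DictReader(io.StringIO(feed)))

def transform(rows):
    """Transform: cleanse and consolidate raw rows into the target schema."""
    for row in rows:
        yield (
            int(row["customer_id"]),   # enforce the expected type
            row["name"].strip(),       # cleansing: trim stray whitespace
            row["signup_date"],
        )

def load(records, conn):
    """Load: write the consolidated records into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(id INTEGER PRIMARY KEY, name TEXT, signup_date TEXT)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(SAMPLE_FEED)), conn)
print(conn.execute("SELECT * FROM customers").fetchall())
# [(101, 'Alice Smith', '2011-09-22'), (102, 'Bob Jones', '2011-09-23')]
```

A real pipeline would replace the in-memory feed with extracts from live source systems; the validation, auditing, and incremental-load logic around these three stages is where most of that 90% of the effort goes.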
This presentation will describe what these challenges look like when trying to deliver insights that, out of necessity, span many sets of data. It will explain them in both business and technical terms, and then proceed to address some of the common solutions – and their strengths and weaknesses.
At DoubleClick, where I was CTO, we scaled to serve 400,000 ads/second. We developed and used many custom data stores long before “nosql” was a buzzword. Over the years, I’ve seen the companies I’ve worked with struggle with both scalability and agility. When we wrote the first lines of MongoDB code in 2007, we drew on this experience of building large-scale, highly available, robust systems. We wanted MongoDB to be a new kind of database, one that tackled the challenges we had been trying to solve at DoubleClick.
This session will focus on internet infrastructure scaling and also cover the history and philosophy of MongoDB.
22nd–23rd September 2011