Strata New York 2011 schedule

Friday 23rd September 2011

  • Hazarding a Guess: ethical, legal, and policy issues in analytics and big data applications

    by Solon Barocas, Betsy Masiello and Jane Yakowitz

    Analytics can push the frontier of knowledge well beyond the useful facts that already reside in big data, revealing latent correlations that empower organizations to make statistically motivated guesses—inferences—about the character, attributes, and future actions of their stakeholders and the groups to which they belong.

    This is cause for both celebration and caution. Analytic insights can add to the stock of scientific and social scientific knowledge, significantly improve decision-making in both the public and private sector, and greatly enhance individual self-knowledge and understanding. They can even lead to entirely new classes of goods and services, providing value to institutions and individuals alike. But they also invite new applications of data that involve serious hazards.

    This panel considers these hazards, asking how analytics implicate:

    • Privacy — What are the privacy concerns involved in the kinds of inferences and applications that analytics enable? Are these concerns sufficiently well understood and accounted for?
    • Autonomy — What are the ethical stakes of applications that draw on analytic findings to selectively (and perhaps inadvertently) influence or limit individuals’ choices or decision-making?
    • Fairness — If organizations rely on certain discoveries to set criteria for unequal treatment or access, do analytics implicate questions of fairness and due process? More specifically, what if organizations draw on analytics to individualize risks or engage in adverse selection or cream skimming?
    • Fragmentation — Do attempts to personalize and customize goods and services (including media content) to individuals on the basis of inferred preferences shield individuals from certain views and issues and thus undermine social belonging and the functioning of the public sphere?

    The panel will also debate the appropriate response to these issues, reviewing the place of norms, policies, legal frameworks, regulation, and technology.

    At 5:00pm to 5:40pm, Friday 23rd September

    In Murray Hill Suite A, Hilton New York

  • Taming Data Logistics - the Hardest Part of Data Science

    by Ken Farmer

    While most of the focus within data science is on the rapid analysis of vast volumes of data, the hardest part of most solutions is the data acquisition, movement, transformation, and loading – the “data logistics”.

    Since the early days of data mining and data warehousing (in the late 80s and early 90s) it has been understood that 90% of the effort of these projects will be spent on data acquisition, cleansing, transformation and consolidation. The challenges include:

    • undocumented source systems
    • source systems that change business rules without notice
    • source systems that cannot handle frequent extracts of data without encountering concurrency problems
    • source system constraints on languages, network connections, and products
    • the management of thousands of daily processes
    • the management of data logistics code that manages dozens of feeds
    • the rapid loading of data into the consolidated server – without impacting concurrency or creating temporary data inconsistencies

    The data warehousing domain refers to data logistics as “ETL” for Extract, Transform, and Load. Some best practices and methods have developed to address these challenges, but little effort has been put into reusable patterns – more has gone into products, most of them commercial. Despite the lack of formal patterns, a sense of what works and what doesn’t has emerged – and can be read “between the lines” by someone who knows what to look for.

    This presentation will describe what the challenges look like when trying to deliver data insights that, out of necessity, span many sets of data. It will explain these in both business and technical terms, and will then proceed to address some of the common solutions, along with their strengths and weaknesses.

    At 5:00pm to 5:40pm, Friday 23rd September

    In Sutton South, Hilton New York

  • Why MongoDB Was Created: What I Wish I Knew at DoubleClick

    by Dwight Merriman

    When I was CTO of DoubleClick, we scaled to serve 400,000 ads/second. We developed and used many custom data stores long before “nosql” was a buzzword. Over the years, I’ve seen companies I’ve worked with struggle with both scalability and agility. Writing the first lines of MongoDB code in 2007, we drew upon these experiences building large-scale, high-availability, robust systems. We wanted MongoDB to be a new kind of database that tackled the challenges we were trying to solve at DoubleClick.

    This session will focus on internet infrastructure scaling and also cover the history and philosophy of MongoDB.

    At 5:00pm to 5:40pm, Friday 23rd September

    In Sutton North, Hilton New York