Sessions at Strata New York 2011 on Friday 23rd September

  • Doing Good With Data: Data Without Borders

    by Jake Porway and Drew Conway

    Data scientists and technology companies are rapidly recognizing the immense power of data for drawing insights about their impact and operations, yet NGOs and non-profits are increasingly being left behind, with mounting data and few resources to make use of it. Data Without Borders seeks to bridge this data divide by matching underserved NGOs with pro bono data scientists so that together they can collect, manage, and analyze data in the service of humanity, creating a more open environment for socially conscious data and bringing greater change to the world.

    At 8:50am to 9:05am, Friday 23rd September

    In Sutton Parlors, New York Hilton

    Coverage video

  • First, firster, firstest

    by Mark Madsen

    History seems irrelevant in the software world, particularly when dealing with lots of information. It isn’t. Information explosions are not new. They’ve happened repeatedly throughout human history. A little looking will turn up prior incarnations of information management patterns and concepts that can be repurposed using today’s technologies.

    The person credited with first conceiving of something is usually not actually the first. They’re the first to re-conceive it at a point where the current technology has caught up to someone else’s idea. We’re at a point today where many old ideas are being reinvented. Come hear why looking to the past, beyond your core field of interest, is worthwhile.

    At 9:05am to 9:20am, Friday 23rd September

    In Sutton Parlors, New York Hilton

    Coverage video

  • Announcing the Winner of the First Heritage Health Progress Prize

    by Richard Merkin

    Dr. Richard Merkin, President and CEO of Heritage Provider Network, is pleased to announce the winner of the first Heritage Health Progress Prize. Responding to our country’s $2 trillion health care crisis, Dr. Merkin created, developed, and sponsored the $3 million Heritage Health Prize for predictive modeling to save more than $30 billion in avoidable hospitalizations. It is the largest predictive modeling prize in the world, larger than the Nobel Prize for Medicine and the Gates Prize for Health. Dr. Merkin is genuinely excited to bring new minds to the healthcare table with the prize, and believes it holds great potential not only to produce a winning algorithm, but also to grab the attention of data miners globally and raise awareness of competitive innovation, changing the world through healthcare delivery. Dr. Merkin will present the top two teams with $50,000 in the first progress prize, split as $30,000 and $20,000.

    At 9:20am to 9:25am, Friday 23rd September

    In Sutton Parlors, New York Hilton

    Coverage video

  • Health Empowerment through Self-Tracking

    by Anne Wright

    The BodyTrack project has interviewed a number of people who have improved their health by discovering certain foods or environmental exposures to avoid, or by learning to make other types of behavioral change. Many describe greatly improved quality of life, in some cases overcoming chronic problems in areas such as sleep, pain, gastrointestinal function, and energy levels. In some cases a doctor or specialist’s diagnosis led to treatment which mitigated symptoms (e.g. asthma or migraine headache), but discovery of the triggers required self-tracking and self-experimentation.

    Importantly, the act of starting to search for one’s sensitivities or triggers appears to be empowering: people who embarked on this path changed their relationship to their health situation even before making the discoveries that helped lead to symptom improvement.

    The BodyTrack Project is building tools, both technological and cultural, to empower more people to embrace an “investigator” role in their own lives. The core of the BodyTrack system is an open source web service which allows users to aggregate, visualize, and analyze data from a myriad of sources—physiological metrics from wearable sensors, image and self-observation capture from smart phones, local environmental measures such as bedroom light and noise levels and in-house air quality monitoring, and regional environmental measures such as pollen/mold counts and air particulates. We believe empowering a broader set of people with these tools will help individuals and medical practitioners alike to better address health conditions with complex environmental or behavioral components.

    At 9:25am to 9:40am, Friday 23rd September

    In Sutton Parlors, New York Hilton

    Coverage video

  • Big Data, Big Opportunity

    by Ken Bado

    Big Data is more than just volume and velocity. MarkLogic CEO Ken Bado will address why complexity is the key gotcha for organizations trying to outflank their competition by managing Big Data in real time. Learn how winners today are using MarkLogic to manage the complexity of their unstructured information to drive revenue and results.

    At 9:40am to 9:45am, Friday 23rd September

    In Sutton Parlors, New York Hilton

    Coverage video

  • Short URLs, Big Data: Learning About the World in Realtime

    by Hilary Mason

    The flow of data across the social web tells us what people, around the world, are paying attention to at any given moment. Understanding this flow is both a mathematical and a human problem, as we develop and adapt techniques to find stories in the data.

    Come hear about the expected and the surprises in the bitly data, as well as generalized techniques that apply to any ‘realtime’ data system.

    At 9:45am to 10:00am, Friday 23rd September

    In Sutton Parlors, New York Hilton

  • Calling for a New Paradigm: Machines Plus Humans

    by Arnab Gupta

    In 1964, The Twilight Zone aired an episode titled “The Brain Center at Whipple’s,” in which factory owner Wallace Whipple completely eliminates his human workforce in favor of automated machinery. Mr. Whipple’s employees, clearly far ahead of their time, argue to him that human insights far outweigh the advantages provided by mechanical labor. Ironically, at the end of the episode, Mr. Whipple, too, is replaced by a machine.

    It’s a well-known dichotomy: man versus machine—and, depending on who’s doing the talking, good (human) versus evil (machine). Today, as technology continues to evolve and machines are capable of ever more advanced processes and functions, the dichotomy is becoming even more pronounced. Look no further than IBM’s Watson, an advanced artificial intelligence machine that squared off against Jeopardy’s best human contestants in 2011—and won.

    But, as Opera Solutions’ CEO Arnab Gupta proposes to explore in remarks at Strata, the man-vs.-machine dichotomy is a false one. A far better contest would have been a three-way one, pitting man versus machine versus man-plus-machine. It is almost a certainty that the latter combination would have won.

    Consider: nowhere has the machine-vs.-human conflict been played out more fully than in the realm of chess, starting in 1997 with IBM’s Deep Blue vs. Garry Kasparov. Today, chess-playing computers routinely beat the strongest human players. One might conclude that the machines have won. But there’s a twist: as Kasparov has recently stated, a machine plus just an average player can beat all comers, humans or computers. Humans’ ability to think abstractly and creatively, to bring in new ideas, to apply history, to understand irony, opportunity, possibilities—all this, when paired with machines’ abilities to process huge amounts of data flows and bring to light hidden patterns and connections that elude human understanding, makes the machine/mind connection unbeatable.

    In short, it is not humans vs. machines, but rather humans plus machines, which must become the new paradigm for scientists, business people, and others—particularly in the Big Data era. Combining human insight with machine intelligence overcomes the weaknesses of each while delivering never-before-seen strengths.

    How can this be accomplished, particularly when machines and people speak different languages and, in truth, “think” differently? How can we create and foster a productive pairing of two very different types of “minds?” Arnab will address the need to create a new language—one mostly visual in nature—to allow humans and machines to work together and realize the full potential of their collaboration. Finding a common language is a pursuit that goes far beyond prosaic “UI” development, and instead forces us to examine how humans can (and might learn to) best understand what machines are saying.

    At 10:00am to 10:15am, Friday 23rd September

    In Sutton Parlors, New York Hilton

    Coverage video

  • Big Data, Emergency Management and Business Continuity

    by Jeannie Stamberger

    Information technology has been meeting disaster head-on: new software, crowdsourced inputs, and mapping tools have shown incredible potential since the Haiti earthquake. But how big data fits into disaster response, on both the humanitarian relief and business continuity sides, has yet to mature. I will discuss needs (filtering, interfaces, real-time data processing) specific to the unique sociological and extreme-environment constraints of professional disaster response, as well as untapped potential for business continuity.

    At 10:40am to 11:20am, Friday 23rd September

    In Sutton North, New York Hilton

  • Chart Wars: The Political Power of Data Visualization

    by Alex Lundry

    Political campaigns and causes have added another powerful weapon to their messaging arsenal: graphs, charts, infographics and other forms of data visualization. Over just the last year, Barack Obama urged voters to distribute and share a bar graph of job losses, a line graph of labor costs by a New York Times columnist prompted an official graphical response from the government of Spain, and an organizational chart of a health care reform bill became the subject of a Congressional investigation in the United States. To be sure, a good graph has been used as an advocacy tool for years, but only recently, with the rise of the Internet, blogs, hardware and software advances, and freely available machine-readable data, have political data visualizations exploded into political discourse. Conveying objective authority, yet the product of dozens of subjective design decisions, political infographics imply hard truths despite their inherently editorial nature.

    This talk, given by a political data scientist who has built persuasive data visualizations for political organizations, will dissect some of the most extraordinary and powerful examples of political data visualization used over the last election cycle, focusing upon the methods that make them work so well.

    At 10:40am to 11:20am, Friday 23rd September

    In Murray Hill Suite A, New York Hilton

  • LexisNexis: Reinventing New Business with Big Data

    by Ron Avnur and Mark Rodgers

    Ron Avnur, SVP Engineering, MarkLogic, and Mark Rodgers, Sr. Director of Product Engineering, LexisNexis, will reveal how LexisNexis is rebuilding its business platform to handle Big Data in real time. LexisNexis is renowned for the technical solutions it has been building for 40+ years, and is well aware of the challenges of Big Data, having gathered a huge amount of content. Avnur will explain how Big Data and unstructured information are slowly overtaking organizations. Rodgers will discuss the challenges LexisNexis faced as a global organization building new products to remain on the cutting edge of Big Data. Together, Avnur and Rodgers will give a brief overview of the technical implementation that enabled LexisNexis to address those challenges. Finally, Rodgers will detail the business benefits LexisNexis is experiencing as a result of its new Big Data business platform.

    This session is sponsored by MarkLogic

    At 10:40am to 11:20am, Friday 23rd September

    In Murray Hill Suite B, New York Hilton

  • Optimising scarce resources using real-time decision making

    by Alasdair Allan

    In the last few years the ubiquitous availability of high bandwidth networks has changed the way both robotic and non-robotic telescopes operate, with single isolated telescopes being integrated into expanding smart telescope networks that can span continents and respond to transient events in seconds. At the same time the rise of data warehousing has made data mining more practical, and correlations between new and existing data can be drawn in real time. These changes have led to fundamental shifts in the way astronomers pursue their science. Astronomy, once a data-poor science, has become data-rich.

    For many applications it is practical to extend data warehousing to real-time assets such as telescopes. There are few real intrinsic differences between a database and a telescope other than the access time for your data and the time stamps on the data itself. Inside astronomy, architectures are emerging which present both static and real-time data resources using the same interface, inherited from a superset of the functionality possessed by both types of resource.

    In these architectures all the components of the system, including the software controlling the science programmes, are thought of as agents. A negotiation takes place between these agents in which each of the resources bids to carry out the work, with the science agent scheduling the work with the agent embedded at the resource that promises to return the best result.

    Effectively these architectures can be viewed as a general way to co-ordinate distributed (sensor) platforms, preserving inherent platform autonomy, using collective decision making to allocate resources. Such architectures are applicable to many (geographical) distributed sensors problems, or more generally to problems where you must optimise output from a distributed system in the face of scarce resources.
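
    The bidding negotiation described above can be sketched in a few lines. This is a hedged illustration under my own assumptions, not code from any real telescope network: the class names, the scoring formula, and the `schedule` function are all invented for the example.

```python
# Minimal sketch of agent-based resource allocation by bidding.
class ResourceAgent:
    """An agent embedded at a resource (telescope, archive, ...)."""
    def __init__(self, name, availability, quality):
        self.name = name
        self.availability = availability  # 0..1, fraction of free time
        self.quality = quality            # 0..1, expected result quality

    def bid(self, request):
        """Promise a score for how good a result this resource expects."""
        return self.availability * self.quality * request.get("priority", 1.0)

def schedule(request, agents):
    """Science agent: collect bids, place the work with the best bidder."""
    bids = {a.name: a.bid(request) for a in agents}
    best = max(bids, key=bids.get)
    return best, bids

agents = [
    ResourceAgent("telescope-a", availability=0.8, quality=0.9),
    ResourceAgent("telescope-b", availability=0.5, quality=0.95),
    ResourceAgent("archive-db", availability=1.0, quality=0.6),
]
winner, bids = schedule({"priority": 1.0}, agents)
print(winner)  # telescope-a (bid 0.72, the highest)
```

    The key property, as the abstract notes, is that each platform keeps its autonomy: a resource decides for itself what to promise, and the collective decision falls out of comparing the bids.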

    This talk explores the emergence of these architectures in the astronomical community from the viewpoint of one of the people intimately involved in the process. The talk will walk attendees through the pitfalls faced by developers hoping to implement such novel architectures and discuss how the deployment of these architectures in the field has prompted the interesting and increasing use of scientists as mechanical turks by their own software.

    At 10:40am to 11:20am, Friday 23rd September

    In Sutton South, New York Hilton

    Coverage video

  • Big Data Revolution: Benefit from MapReduce Without the Risk

    by Ted Dunning

    Map-reduce and Hadoop provide new scaling opportunities for analyzing data. As a result, organizations are beginning to analyze and derive business value from large amounts of data that, in many cases, were previously simply being discarded. The ability to analyze these previously impenetrable volumes of data has disrupted entire industries, as in the case of on-line advertising.

    Such green-field opportunities are rare, however, and few companies can afford to build an entirely new analytics pipeline. Integrating big data analytics systems like Apache Hadoop into existing analytics systems can be very difficult, because there are huge differences in the fundamental approaches taken to the basic problems of how data should be accessed and analyzed.

    These differences are exactly what makes these new technologies hugely effective, but they are also what makes integration between conventional and new approaches so difficult.

    This talk will provide detailed descriptions of how to use new technologies to

    • Get data into and out of the Hadoop cluster as quickly as possible
    • Allow real-time components to easily access cluster data
    • Use well-known and understood standard tools to access cluster data
    • Make Hadoop easier to use and operate
    • Capitalize on existing code in map-reduce settings
    • Integrate map-reduce systems into existing analytic systems

    These descriptions will be taken from real-life customer situations. Each will describe the problems faced and the solutions that solved these problems.

    This session is sponsored by MapR Technologies

    At 11:30am to 12:10pm, Friday 23rd September

    In Murray Hill Suite B, New York Hilton

  • Designing Data Visualizations: Telling Stories With Data

    by Noah Iliinsky

    This is a talk aimed at people who know their data, and want to learn how to visualize it most effectively. If you have data, a need for answers, and a blank page, this is a great place to start.

    We’ll start by briefly addressing the value of visualization, and discuss the differences between visualization for analysis and for presentation.

    From there we’ll figure out what story to tell with your visualization by examining the holy visualization trinity:

    • your goals
    • your customer’s needs
    • the shape of your data

    Once the story has been selected, we need to construct it. We’ll discuss key considerations to make good choices about:

    • selecting appropriate data
    • selecting appropriate axes
    • visually encoding the data

    We’ll end with a brief discussion of some current tools, and look at some classic and innovative visualization examples.

    At 11:30am to 12:10pm, Friday 23rd September

    In Murray Hill Suite A, New York Hilton

  • Extracting Microbial Threats From Big Data

    by Robert Munro

    Pandemics are the greatest current threat to humanity. Many unidentified pathogens are already hiding out in the open, reported in local online media as sudden clusters of ‘influenza-like’ or ‘pneumonia-like’ clinical cases many months or even years before careful lab tests confirm a new microbial scourge. For some current epidemics like HIV, SARS, and H1N1, the microbial enemies were anonymously in our midst for decades. With each new infection, viruses and bacteria mutate and evolve into ever more harmful strains, and so we are in a race to identify and isolate new pathogens as quickly as possible.

    Until now, no organization has succeeded in the task of tracking every global outbreak and epidemic. The necessary information is spread across too many locations, languages and formats: a field report in Spanish, a news article in Chinese, an email in Arabic, a text-message in Swahili. Even among open data, simple key-word or white-list based searches tend to fall short as they are unable to separate the signal (an outbreak of influenza) from the noise (a new flu remedy). In a project called EpidemicIQ, the Global Viral Forecasting Initiative has taken on the challenge of tracking all outbreaks. We are complementing existing field surveillance efforts in 23 countries with a new initiative that leverages large-scale processing of outbreak reports across a myriad of formats, utilizing machine learning, natural language processing and microtasking coupled with advanced epidemiological analysis.

    EpidemicIQ intelligently mines open web-based reports, social media, transportation networks and direct reports from healthcare providers globally. Machine-learning and natural language processing allows us to track epidemic-related information across several orders of magnitude more data than any prior health efforts, even across languages that we do not ourselves speak. By leveraging a scalable workforce of microtaskers we are able to quickly adapt our machine-learning models to new sources, languages and even diseases of unknown origin. During peak times, the use of a scalable microtasking workforce also takes much of the information processing burden off the professional epidemic intelligence officers and field scientists, allowing them to apply their full domain knowledge when needed most.

    At Strata, we propose to introduce EpidemicIQ’s architecture, strategies, successes and challenges in big-data to date.

    At 11:30am to 12:10pm, Friday 23rd September

    In Sutton North, New York Hilton

  • Navigating the Data Pipeline

    by Tim Moreton

    At the heart of every system that harnesses big data is a pipeline: collecting large volumes of raw data, extracting value from it through analytics or data transformations, then delivering that condensed set of results back out—potentially to millions of users.

    This talk examines the challenges of building manageable, robust pipelines—a great simplifying paradigm that will help participants looking to architect their own big data systems.

    I’ll look at what you want from each of these stages—using Google Analytics as a canonical big data example, as well as case studies of systems deployed at LinkedIn. I’ll look at how collecting, analyzing and serving data pose conflicting demands on the storage and compute components of the underlying hardware. I’ll talk about what available tools do to address these challenges.

    I’ll move on to consider two holy grails: real-time analytics, and dual data center support. The pipeline metaphor highlights a challenge in deriving real-time value from huge datasets: I’ll explore what happens when you compose multiple, segregated platforms into a single pipeline, and how you can dodge the issue with a ‘fast’ and ‘slow’ two-tier architecture. Then I’ll look at how you can factor dual data center support into the design, which is particularly important for highly available deployments on the cloud.
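
    The ‘fast’/‘slow’ two-tier idea can be sketched in a few lines. This is a toy illustration under my own assumptions, not code from the talk: a slow batch tier holds results computed up to a checkpoint, a fast tier buffers events that arrived since, and a read merges the two.

```python
# Toy two-tier (fast/slow) counter. All names are illustrative.
class TwoTierCounter:
    def __init__(self):
        self.batch_counts = {}   # slow tier: counts up to the last batch run
        self.recent_events = []  # fast tier: raw events since the checkpoint

    def ingest(self, key):
        # Real-time writes land in the fast tier only.
        self.recent_events.append(key)

    def run_batch(self):
        # The periodic batch job folds the fast tier into the slow tier.
        for key in self.recent_events:
            self.batch_counts[key] = self.batch_counts.get(key, 0) + 1
        self.recent_events = []

    def count(self, key):
        # A real-time read = batch result + events not yet folded in.
        return self.batch_counts.get(key, 0) + self.recent_events.count(key)

c = TwoTierCounter()
c.ingest("page_view")
c.ingest("page_view")
c.run_batch()             # checkpoint: 2 events now in the slow tier
c.ingest("page_view")     # 1 event still only in the fast tier
print(c.count("page_view"))  # 3: 2 from the batch tier + 1 from the fast tier
```

    The point of the split is that the expensive batch computation never blocks reads: queries stay fresh because the fast tier covers the gap since the last checkpoint.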

    In summary, this talk will present a useful metaphor for architecting big data systems, and use deployed examples to describe how to fit the available tools together across a range of settings.

    At 11:30am to 12:10pm, Friday 23rd September

    In Sutton South, New York Hilton

  • Friday Lunchtime BoF Sessions

    Birds of a Feather (BoF) sessions provide face to face exposure to those interested in the same projects and concepts. BoFs can be organized for individual projects or broader topics (best practices, open data, standards). BoF topics are entirely up to you.

    BoFs at Strata will happen during lunch on Thursday, September 22 and Friday, September 23, where lunch is served on the Mezzanine level of the hotel.

    Visit the BoF signup board near registration to claim a reserved table and schedule your BoF.

    At 12:10pm to 1:40pm, Friday 23rd September

    In Rhinelander Gallery, New York Hilton

  • Beyond BI – Transforming Your Business with Big Data Analytics

    by Steven Hillion

    Do you use all the information you should when you make your most important decisions? Is your organization prepared to go beyond BI to enable breakthrough insights and decisions that transform the way you do business?

    Increasingly, organizations realize that data-intensive predictive analytics is a necessary tool for a company to compete and succeed—even if the organization has already deployed a full-blown BI and DW stack. Armed with advanced analytics insights, business users can make well-informed decisions to support their organizations’ tactical and strategic goals—and create competitive advantage.

    Steven Hillion, VP of EMC Greenplum’s Data Analytics Lab, lends insight into emerging technologies for taking advantage of the big data opportunity, and into how big data challenges today’s BI architectures and approaches to data management.

    This session is sponsored by EMC Greenplum

    At 1:40pm to 2:20pm, Friday 23rd September

    In Murray Hill Suite B, New York Hilton

  • Big Data Use Cases in the Cloud

    by Peter Sirota

    By pairing the elasticity and pay-as-you-go nature of the cloud with the flexibility and scalability of Hadoop, Amazon Elastic MapReduce has brought Big Data analytics to an even wider array of companies looking to maximize the value of their data. Each day, thousands of Hadoop clusters are run on the Amazon Elastic MapReduce infrastructure by users of every size—from university students to Fortune 50 companies—exposing the Elastic MapReduce team to an unparalleled number of use cases. In this session, we will contrast how three of these users, Amazon.com, Yelp, and Etsy, leverage the marriage of Hadoop and the cloud to drive their businesses in the face of explosive growth, including generating customer insights, powering recommendations, and managing core operations.

    At 1:40pm to 2:20pm, Friday 23rd September

    In Murray Hill Suite A, New York Hilton

  • HunchWorks: Combining human expertise and big data

    by Dane Petersen, Chris van der Walt and Sara-Jayne Farmer

    Global Pulse is a United Nations innovation initiative that is developing a new approach to crisis impact monitoring. One of the key outputs of the project is HunchWorks, a place where experts can post hypotheses—or hunches—that may warrant further exploration and then crowdsource data and verification. HunchWorks will be a key global platform for rapidly detecting emerging crises and their impacts on vulnerable communities. Using it, experts will be able to quickly surface ground truth and detect anomalies in data about collective behavior for further analysis, investigation and action.

    The presentation will open with an introduction by Chris van der Walt (Project Lead, Global Pulse) to the problem that HunchWorks is being designed to address: How to detect the emerging impacts of global crises in real-time? A short discussion of the design thinking behind HunchWorks will follow plus an overview of the HunchWorks feature set.

    Dane Petersen (Experience Designer, Adaptive Path) will then discuss some of the complex user experience design challenges that emerged as the team started to wrestle with developing HunchWorks and the approaches used to address them.

    Sara Farmer (Chief Platform Architect, Global Pulse) will follow up with a discussion of the technology powering HunchWorks, which is based on autonomy, uncertain reasoning, and human-machine team theories, and is designed to allow users and automated tools to work collaboratively to reduce the uncertainty and missing-data issues inherent in hunch formation and management.

    The presentation will conclude with 10 minutes of Q&A from the audience.

    At 1:40pm to 2:20pm, Friday 23rd September

    In Sutton North, New York Hilton

    Coverage video

  • The Accidental Chief Privacy Officer

    by Jim Adler

    The first generation of chief privacy officers were typically attorneys, charged with the formulation and enforcement of privacy policies. Times have changed. Given the speed and complexity of technology, the privacy policy is necessary but hardly sufficient. Because we live much of our lives in public, both online and offline, the Internet is transforming the anonymity of our cities into the familiarity of small towns. Privacy is deeply ingrained within the technology that manages this personal data. The products and services driving this transformation must consider privacy from the earliest design sessions.

    Today’s engineer CPO, and I’m one, must be deeply involved with the technology and product design process to bake in privacy. This new breed of CPO is comfortable in an engineering scrum, in a product focus group, reviewing pending regulations, or analyzing A/B test results. They have the historical awareness, frontier spirit, regulatory caution, technical chops, and innovator’s curiosity to work through the toughest data issues. The promise of the engineer CPO is that products not only safeguard privacy, but compete on it.

    At 1:40pm to 2:20pm, Friday 23rd September

    In Sutton South, New York Hilton

  • Assembling Data to Fight Breast Cancer

    by Abdul R Shaikh, Anthony Goldbloom, Nuala O'Connor Kelly, Roger Magoulas and Trajan Bayly

    Panel Discussion on Assembling Data to Fight Breast Cancer

    This session sponsored by GE

    At 2:30pm to 3:10pm, Friday 23rd September

    In Murray Hill Suite B, New York Hilton

  • Creating a fact-based decision making culture in organizations

    by Amaresh Tripathy

    Analytical culture is the last-mile problem for organizations. More data and analytics frequently lead to decision ambiguity. Insights are often not actionable, and when they are actionable, they are not widely adopted at an operational level.

    There has been a lot of emphasis on the technology and data quality aspects of analytics; however, without an analytical culture, most organizations will not be able to take advantage of the benefits.

    After partnering with more than 100 client organizations as a consultant, from small point solution pilots to deploying large decision support systems, I have developed a series of principles which I think are critical to create and foster an analytical culture. I want to introduce the framework and highlight the organizational principles with some real life war stories.

    Some of the organizational principles that I will speak about include:

    • Top Down and not Bottom Up: Analytical culture starts in C-suite
    • Human as a hero vs. Human as a hazard: Design of decision architectures
    • Carrot or Stick: Overcoming Man vs. Machine perception
    • What goes around comes around: Importance of feedback loops
    • Saltines before the steak: Quick wins and Analytical evangelists
    • Journalists before data scientists: Role of communication
    • People, People, People: The talent gap

    I hope that the audience will embrace some of the principles and implement them as they build their analytical organizations and solutions.

    At 2:30pm to 3:10pm, Friday 23rd September

    In Sutton North, New York Hilton

  • Google Cloud for Data Crunchers

    by Chris Schalk and Ryan Boyd

    Google is a data business: over the past few years, many of the tools Google created to store, query, analyze, and visualize its data have been exposed to developers as services.

    This talk will give you an overview of Google services for Data Crunchers:

    • Google Storage for developers: get your data in Google Cloud
    • BigQuery, fast interactive queries on Terabytes of data
    • Prediction API: Machine Learning made easy
    • Google App Engine: platform as a service to build web apps or expose APIs
    • Visualization API: many cool visualization components
    • Fusion Tables: collaborate and visualize your data on a Map
    • Google Public Data Explorer, to expose and visualize public data
    • Services that have not been announced as of the writing of this proposal but may be available when the conference happens:-)

    At 2:30pm to 3:10pm, Friday 23rd September

    In Murray Hill Suite A, New York Hilton

  • Journey or Destination: Using Models to Explore Big Data

    by Ben Gimpert

    Most people in our community are accustomed to thinking of a “model” as the end result of a properly functioning big data architecture. Once you have an EC2 cluster reserved, after the database is distributed across some Hadoop nodes, and once a clever MapReduce machine learning algorithm has done its job, the system spits out a predictive model. The model hopefully allows an organization to conduct its business better.

    This waterfall approach to modeling is embedded in the hiring process and technical culture of most contemporary big data organizations. When the business users sit in one room and the data scientists sit in another, we preclude one of the most important benefits of having on-demand access to big data. Models themselves are powerful exploratory tools! However, data sparsity, non-linear interactions and the resultant model’s quirks must be interpreted through the lens of domain expertise. All big data models are wrong but some are useful, to paraphrase the statistician George Box.

    A data scientist working in isolation could train a predictive model with perfect in-sample accuracy, but only an understanding of how the business will use the model lets her balance the crucial bias / variance trade-off. Put more simply, applied business knowledge is what lets us trust that a model trained on historical data will perform decently in situations we have never seen.
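    The bias / variance trade-off described above can be illustrated with a small sketch (illustrative only, not from the talk): an overly flexible model drives in-sample error toward zero while typically doing worse on data it has never seen.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy quadratic "historical" data, split into train and held-out sets.
x = rng.uniform(-1, 1, 40)
y = x**2 + rng.normal(0.0, 0.1, 40)
x_train, y_train, x_test, y_test = x[:30], y[:30], x[30:], y[30:]

def train_and_test_mse(degree):
    """Fit a polynomial on the training set; report in- and out-of-sample MSE."""
    coefs = np.polyfit(x_train, y_train, degree)
    mse = lambda xs, ys: float(np.mean((np.polyval(coefs, xs) - ys) ** 2))
    return mse(x_train, y_train), mse(x_test, y_test)

# A degree-15 fit drives in-sample error down (low bias, high variance)
# but tends to generalize worse than a modest degree-2 fit.
in2, out2 = train_and_test_mse(2)
in15, out15 = train_and_test_mse(15)
```

    The degree-15 model always matches or beats the degree-2 model in-sample, which is exactly why in-sample accuracy alone cannot guide the trade-off.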

    Models can also reveal predictors in our data we never expected. The business can learn from the automatic ranking of predictor importance with statistical entropy and multicollinearity tools. In the extreme, a surprisingly important variable that turns up during the modeling of a big data set could be the trigger of an organizational pivot. What if a movie recommendation model reveals a strange variable for predicting gross at the box office?
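    A toy sketch of the entropy-based predictor ranking mentioned above (hypothetical data and feature names, not from the talk): a genuinely predictive variable yields a much larger information gain than a noise variable.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (bits) of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(feature, labels):
    """Drop in label entropy after conditioning on a discrete feature."""
    gain = entropy(labels)
    for value in np.unique(feature):
        mask = feature == value
        gain -= mask.mean() * entropy(labels[mask])
    return gain

rng = np.random.default_rng(1)
n = 2000
genre = rng.integers(0, 3, n)      # genuinely predictive in this toy data
weekday = rng.integers(0, 7, n)    # pure noise
box_office_hit = ((genre == 0) & (rng.random(n) < 0.9)) | (rng.random(n) < 0.1)

gains = {"genre": information_gain(genre, box_office_hit),
         "weekday": information_gain(weekday, box_office_hit)}
```

    Ranking predictors this way is how a surprisingly important variable can surface during modeling and prompt a second look from the business.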

    My presentation introduces exploratory model feedback in the context of big (training) data. I will use a real-life case study from Altos Research that forecasts a complex system: real estate prices. Rapid prototyping with Ruby and an EC2 cluster allowed us to optimize human time, but not necessarily computing cycles. I will cover how exploratory model feedback blurs the line between domain expert and data scientist, and also blurs the distinction between supervised and unsupervised learning. This is all a data ecology, in which a model of big data can surprise us and suggest its own future enhancement.

    At 2:30pm to 3:10pm, Friday 23rd September

    In Sutton South, New York Hilton

  • Big Data Architectures 2.0: Beyond the Elephant Ride

    by Vineet Tyagi

    Businesses today are moving beyond the buzz and beyond experimentation with the batch-processing options of Hadoop and MapReduce, stretching the limits of cutting-edge performance and scalability. This session will discuss the emerging trends of a new generation of NoHadoop (Not Only Hadoop) architectures for future-proof big data scalability, and prepare you for life beyond the elephant ride!

    This session is sponsored by Impetus Technologies, Inc.

    At 4:10pm to 4:50pm, Friday 23rd September

    In Murray Hill Suite B, New York Hilton

  • Data as the Building Block at Foursquare

    by Justin Moore

    Foursquare stores and processes everything from check-ins to screen views using a combination of home-grown and open source tools. This talk gives an overview of our stack, highlighting specific examples of how, and why, it grew into what it is today, and continues with the many ways this infrastructure is employed.

    One such example is our data-driven product development with the recently launched recommendations engine, named “Explore.” Explore recycles past check-in data into signals like venue similarity and time-sensitive popularity measures, resulting in intelligent recommendations that build upon past user behavior as well as social and bookmarking features.
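    A venue-similarity signal of the kind described above can be sketched as co-occurrence of visitors (a toy illustration with made-up data, not Foursquare's actual pipeline): venues visited by the same users score as similar.

```python
import math
from collections import defaultdict

# Hypothetical (user, venue) check-in pairs.
checkins = [("alice", "cafe"), ("alice", "museum"),
            ("bob", "cafe"), ("bob", "museum"),
            ("carol", "cafe"), ("carol", "gym")]

# Invert check-ins into venue -> set of visitors.
venue_users = defaultdict(set)
for user, venue in checkins:
    venue_users[venue].add(user)

def similarity(a, b):
    """Cosine similarity between the visitor sets of two venues."""
    shared = len(venue_users[a] & venue_users[b])
    return shared / math.sqrt(len(venue_users[a]) * len(venue_users[b]))

# The museum shares more of the cafe's visitors than the gym does,
# so it would rank higher as a recommendation for cafe-goers.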

    This talk takes a closer look at how Explore, and other features, emerged from our data analysis as well as the iterative process of monitoring and improvement that is critical for making such features a success.

    At 4:10pm to 4:50pm, Friday 23rd September

    In Sutton South, New York Hilton

  • Data Environmentalism

    by Trevor Hughes

    Data fuels 21st century business and society. Thanks to the rapid pace of innovation and widespread adoption of information technologies, data has become both a strategic asset and a potentially crippling liability. As consumers grow increasingly concerned about the stewardship of their data, policymakers, academics and advocates around the world are questioning boundaries and considering risks:

    • What is private and what is not?
    • How should organizations explain what they’re doing with data?
    • What should happen when data is stolen or misused?
    • And, in an era of globalization, how do we manage the diverse social and legal expectations?

    These questions are urgent in the current business climate where trust in our most basic institutions has been eroded. As organizations cope with growing tension between innovation, privacy and security, they are discovering that appropriate use and protection of data has broad impact on their reputations and bottom lines—a new, holistic ethos of data environmentalism is necessary.

    At 4:10pm to 4:50pm, Friday 23rd September

    In Murray Hill Suite A, New York Hilton

  • Gaining New Insights from Massive Amounts of Machine Data

    by Denise Hemke and Jake Flomenberg

    Many enterprises are being overwhelmed by the proliferation of machine data. Websites, communications, networking and complex IT infrastructures constantly generate massive streams of data in highly variable and unpredictable formats that are difficult to process and analyze with traditional methods in a timely manner. Yet this data holds a definitive record of all activity and behavior, including user transactions, customer behavior, system behavior, security threats and fraudulent activity. Quickly understanding and using this data can add value to a company’s services, customer satisfaction, revenue growth and profitability. This session examines the challenges and approaches for collecting, organizing and deriving real-time insights from terabytes to petabytes of data, with examples from Salesforce.com, the nation’s leading enterprise cloud computing company.
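    The core move in working with machine data is extracting structure from semi-structured text so it becomes queryable. A toy illustration (invented log format and field names, not from the session):

```python
import re
from collections import Counter

# Hypothetical machine-generated log lines.
LOG = """\
2011-09-23T16:10:01 user=alice action=login status=200
2011-09-23T16:10:05 user=bob action=login status=401
2011-09-23T16:10:09 user=bob action=login status=401
2011-09-23T16:10:12 user=bob action=login status=401
"""

PATTERN = re.compile(r"user=(?P<user>\w+) action=(?P<action>\w+) status=(?P<status>\d+)")

# Count failed requests per user; repeated failures for one user
# are the kind of security signal the abstract mentions.
failures = Counter(
    m.group("user")
    for line in LOG.splitlines()
    if (m := PATTERN.search(line)) and m.group("status") != "200"
)
```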

    At 4:10pm to 4:50pm, Friday 23rd September

    In Sutton North, New York Hilton

  • Hazarding a Guess: ethical, legal, and policy issues in analytics and big data applications

    by Betsy Masiello, Jane Yakowitz and Solon Barocas

    Analytics can push the frontier of knowledge well beyond the useful facts that already reside in big data, revealing latent correlations that empower organizations to make statistically motivated guesses—inferences—about the character, attributes, and future actions of their stakeholders and the groups to which they belong.

    This is cause for both celebration and caution. Analytic insights can add to the stock of scientific and social scientific knowledge, significantly improve decision-making in both the public and private sector, and greatly enhance individual self-knowledge and understanding. They can even lead to entirely new classes of goods and services, providing value to institutions and individuals alike. But they also invite new applications of data that involve serious hazards.

    This panel considers these hazards, asking how analytics implicate:

    • Privacy — What are the privacy concerns involved in the kinds of inferences and applications that analytics enable? Are these concerns sufficiently well understood and accounted for?
    • Autonomy — What are the ethical stakes of applications that draw on analytic findings to selectively (and perhaps inadvertently) influence or limit individuals’ choices or decision-making?
    • Fairness — If organizations rely on certain discoveries to set criteria for unequal treatment or access, do analytics implicate questions of fairness and due process? More specifically, what if organizations draw on analytics to individualize risks or engage in adverse selection or cream skimming?
    • Fragmentation — Do attempts to personalize and customize goods and services (including media content) to individuals on the basis of inferred preferences shield individuals from certain views and issues and thus undermine social belonging and the functioning of the public sphere?

    The panel will also debate the appropriate response to these issues, reviewing the place of norms, policies, legal frameworks, regulation, and technology.

    At 5:00pm to 5:40pm, Friday 23rd September

    In Murray Hill Suite A, New York Hilton

  • Taming Data Logistics - the Hardest Part of Data Science

    by Ken Farmer

    While most of the focus within data science is on the rapid analysis of vast volumes of data, the hardest part of most solutions is the data acquisition, movement, transformation, and loading – the “data logistics”.

    Since the early days of data mining and data warehousing in the late 80s and early 90s, it has been understood that 90% of the effort on these projects is spent on data acquisition, cleansing, transformation and consolidation. The challenges include:

    • undocumented source systems
    • source systems that change business rules without notice
    • source systems that cannot handle frequent extracts of data without encountering concurrency problems
    • source system constraints on languages, network connections, and products
    • the management of thousands of daily processes
    • the management of data logistics code that manages dozens of feeds
    • the rapid loading of data into the consolidated server – without impacting concurrency or creating temporary data inconsistencies

    The data warehousing domain refers to data logistics as “ETL”, for Extract, Transform, and Load. Some best practices and methods have been developed to address these challenges, but little effort has gone into reusable patterns; most effort has gone into commercial products instead. In spite of the lack of formal patterns, however, a sense of what works and what doesn’t has emerged – and it can be read “between the lines” if you know what to look for.
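    The ETL shape described above can be sketched in a few lines (a minimal illustration with a hypothetical feed and schema, not a production pattern): extract raw rows from a feed, apply cleansing and normalization rules, then load into a consolidated store.

```python
import csv
import io

# Hypothetical raw feed with typical dirty-data problems:
# stray whitespace, a missing value, inconsistent casing.
RAW_FEED = """id,amount,region
1, 19.99 ,NY
2,,CA
3,5.00,ny
"""

def extract(text):
    """Extract: parse the raw feed into dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: cleanse, type, and normalize each row."""
    clean = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:                      # cleansing rule: drop rows missing amount
            continue
        clean.append({"id": int(row["id"]),
                      "amount": float(amount),
                      "region": row["region"].strip().upper()})  # normalize casing
    return clean

def load(rows, store):
    """Load: append cleansed rows to the consolidated store."""
    store.extend(rows)

warehouse = []
load(transform(extract(RAW_FEED)), warehouse)
```

    Real pipelines add the concerns from the bullet list above – changing source schemas, extract scheduling, and concurrency-safe loading – but the extract/transform/load skeleton stays the same.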

    This presentation will describe what these challenges look like when trying to deliver data insights that, out of necessity, span many data sets. It will explain them in both business and technical terms, and then proceed to address some of the common solutions – and their strengths and weaknesses.

    At 5:00pm to 5:40pm, Friday 23rd September

    In Sutton South, New York Hilton