Data scientists and technology companies are rapidly recognizing the immense power of data for drawing insights about their impact and operations, yet NGOs and non-profits are increasingly being left behind with mounting data and few resources to make use of it. Data Without Borders seeks to bridge this data divide by matching underserved NGOs with pro bono data scientists so that they can collect, manage, and analyze their data together in the service of humanity, creating a more open environment for socially conscious data and bringing greater change to the world.
by Mark Madsen
History seems irrelevant in the software world, particularly when dealing with lots of information. It isn’t. Information explosions are not new. They’ve happened repeatedly throughout human history. A little looking will turn up prior incarnations of information management patterns and concepts that can be repurposed using today’s technologies.
The first person to conceive of something is usually not the first. They’re the first to re-conceive it at the point where current technology has caught up to someone else’s idea. We’re at a point today where many old ideas are being reinvented. Come hear why looking to the past, beyond your core field of interest, is worthwhile.
by Richard Merkin
Dr. Richard Merkin, President and CEO of Heritage Provider Network, is pleased to announce the winner of the first $3 million Heritage Health Progress Prize. Responding to our country’s $2 trillion health care crisis, Dr. Merkin created, developed, and sponsored the $3 million Heritage Health Prize for predictive modeling to save more than $30 billion in avoidable hospitalizations. It is the largest predictive modeling prize in the world, larger than the Nobel Prize for Medicine and the Gates Prize for Health. Dr. Merkin is genuinely excited to bring new minds to the healthcare table, and believes the prize holds great potential not only for producing a winning algorithm, but also for grabbing the attention of data miners globally and raising awareness of competitive innovation, changing the world through healthcare delivery. Dr. Merkin will present the top two teams with $50,000 in the first progress prize, split as $30,000 and $20,000.
by Anne Wright
The BodyTrack project has interviewed a number of people who have improved their health by discovering certain foods or environmental exposures to avoid, or by learning other types of behavioral changes. Many describe greatly improved quality of life, overcoming in some cases chronic problems in areas such as sleep, pain, gastrointestinal function, and energy levels. In some cases, a doctor or specialist’s diagnosis led to treatment that mitigated symptoms (e.g. asthma or migraine headache), but discovery of the underlying triggers still required self-tracking and self-experimentation.
Importantly, the act of starting to search for one’s sensitivities or triggers appears to be empowering: people who embarked on this path changed their relationship to their health situation even before making the discoveries that helped lead to symptom improvement.
The BodyTrack Project is building tools, both technological and cultural, to empower more people to embrace an “investigator” role in their own lives. The core of the BodyTrack system is an open source web service which allows users to aggregate, visualize, and analyze data from a myriad of sources—physiological metrics from wearable sensors, image and self-observation capture from smart phones, local environmental measures such as bedroom light and noise levels and in-house air quality monitoring, and regional environmental measures such as pollen/mold counts and air particulates. We believe empowering a broader set of people with these tools will help individuals and medical practitioners alike to better address health conditions with complex environmental or behavioral components.
by Ken Bado
Big Data is more than just volume and velocity. MarkLogic CEO Ken Bado will address why complexity is the key gotcha for organizations trying to outflank their competition by managing Big Data in real time. Learn how winners today are using MarkLogic to manage the complexity of their unstructured information to drive revenue and results.
by Hilary Mason
The flow of data across the social web tells us what people, around the world, are paying attention to at any given moment. Understanding this flow is both a mathematical and a human problem, as we develop and adapt techniques to find stories in the data.
Come hear about the expected and the surprises in the bitly data, as well as generalized techniques that apply to any ‘realtime’ data system.
by Arnab Gupta
In 1964, The Twilight Zone aired an episode titled “The Brain Center at Whipple’s,” in which factory owner Wallace Whipple completely eliminates his human workforce in favor of automated machinery. Mr. Whipple’s employees, clearly far ahead of their time, argue to him that human insights far outweigh the advantages provided by mechanical labor. Ironically, at the end of the episode, Mr. Whipple, too, is replaced by a machine.
It’s a well-known dichotomy: man versus machine—and, depending on who’s doing the talking, good (human) versus evil (machine). Today, as technology continues to evolve and machines are capable of ever more advanced processes and functions, the dichotomy is becoming even more pronounced. Look no further than IBM’s Watson, an advanced artificial intelligence machine that squared off against Jeopardy’s best human contestants in 2011—and won.
But, as Opera Solutions’ CEO Arnab Gupta proposes to explore in remarks at Strata, the man-vs.-machine dichotomy is a false one. A far better contest would have been a three-way one, pitting man versus machine versus man-plus-machine. It is almost a certainty that the latter combination would have won.
Consider: nowhere has the machine-vs.-human conflict been played out more fully than in the realm of chess, starting in 1997 with IBM’s Deep Blue vs. Garry Kasparov. Today, chess-playing computers routinely beat the strongest human players. One might conclude that the machines have won. But there’s a twist: as Kasparov has recently stated, a machine plus just an average player can beat all comers, human or computer. Humans’ ability to think abstractly and creatively, to bring in new ideas, to apply history, to understand irony, opportunity, possibilities—all this, when paired with machines’ ability to process huge flows of data and bring to light hidden patterns and connections that elude human understanding, makes the machine/mind connection unbeatable.
In short, it is not humans vs. machines, but rather humans plus machines, which must become the new paradigm for scientists, business people, and others—particularly in the Big Data era. Combining human insight with machine intelligence overcomes the weaknesses of each while delivering never-before-seen strengths.
How can this be accomplished, particularly when machines and people speak different languages and, in truth, “think” differently? How can we create and foster a productive pairing of two very different types of “minds?” Arnab will address the need to create a new language—one mostly visual in nature— to allow humans and machines to work together and realize the full potential of their collaboration. Finding a common language is a pursuit that goes far beyond prosaic “UI” development, and instead forces us to examine how humans can (and might learn to) best understand what machines are saying.
Information technology has been meeting disaster head on, with new software, crowdsourced inputs, and mapping tools gaining incredible momentum since the Haiti earthquake. How big data really fits into disaster response, on both the humanitarian relief and business continuity sides, has yet to mature. I will discuss needs (filtering, interfaces, real-time data processing) specific to the unique sociological and extreme environmental constraints of professional disaster response, and the untapped potential for business continuity.
by Alex Lundry
Political campaigns and causes have added another powerful weapon to their messaging arsenal: graphs, charts, infographics, and other forms of data visualization. Over just the last year, Barack Obama urged voters to distribute and share a bar graph of job losses, a line graph of labor costs by a New York Times columnist prompted an official graphical response from the government of Spain, and an organizational chart of a health care reform bill became the subject of a Congressional investigation in the United States. To be sure, good graphs have been used as advocacy tools for years, but only recently, with the rise of the Internet, blogs, hardware and software advances, and freely available machine-readable data, have political data visualizations exploded into political discourse. Conveying objective authority, yet the product of dozens of subjective design decisions, political infographics imply hard truths despite their inherently editorial nature.
This talk, given by a political data scientist who has built persuasive data visualizations for political organizations, will dissect some of the most extraordinary and powerful examples of political data visualization used over the last election cycle, focusing upon the methods that make them work so well.
by Ron Avnur and Mark Rodgers
Ron Avnur, SVP Engineering, MarkLogic, and Mark Rodgers, Sr. Director of Product Engineering, LexisNexis, will reveal how LexisNexis is rebuilding its business platform to handle Big Data in real time. LexisNexis is renowned for the technical solutions it has been building for 40+ years, and it is well aware of the challenges of Big Data, having gathered a huge amount of content. Avnur will explain how Big Data and unstructured information are slowly overtaking organizations. Rodgers will discuss the challenges LexisNexis faced as a global organization building new products to remain on the cutting edge of Big Data. Together, Avnur and Rodgers will give a brief overview of the technical implementation that enabled LexisNexis to address those challenges. Finally, Rodgers will detail the business benefits LexisNexis is experiencing as a result of its new Big Data business platform.
This session is sponsored by MarkLogic
In the last few years the ubiquitous availability of high bandwidth networks has changed the way both robotic and non-robotic telescopes operate, with single isolated telescopes being integrated into expanding smart telescope networks that can span continents and respond to transient events in seconds. At the same time the rise of data warehousing has made data mining more practical, and correlations between new and existing data can be drawn in real time. These changes have led to fundamental shifts in the way astronomers pursue their science. Astronomy, once a data-poor science, has become data-rich.
For many applications it is practical to extend data warehousing to real-time assets such as telescopes. There are few intrinsic differences between a database and a telescope other than the access time for your data and the time stamps on the data itself. Inside astronomy, architectures are emerging which present both static and real-time data resources through the same interface, offering a superset of the functionality possessed by both types of resource.
In these architectures all the components of the system, including the software controlling the science programmes, are thought of as agents. A negotiation takes place between these agents in which each of the resources bids to carry out the work, with the science agent scheduling the work with the agent embedded at the resource that promises to return the best result.
Effectively, these architectures can be viewed as a general way to co-ordinate distributed (sensor) platforms, preserving inherent platform autonomy and using collective decision making to allocate resources. Such architectures are applicable to many geographically distributed sensor problems, or more generally to problems where you must optimise output from a distributed system in the face of scarce resources.
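As a rough illustration of the bidding pattern described above, consider the following sketch, in which resource agents bid on a request and a science agent schedules the work with the best bidder. All class names, scoring rules, and numbers here are invented for illustration, not taken from any real telescope network.

```python
# Toy contract-net-style negotiation: each resource agent bids on a
# request; the science agent picks the bid promising the best result.
from dataclasses import dataclass

@dataclass
class Bid:
    resource: str
    score: float  # promised quality of result (higher is better)

class ResourceAgent:
    def __init__(self, name, availability, quality):
        self.name = name
        self.availability = availability  # fraction of free time, 0..1
        self.quality = quality            # e.g. expected sensitivity

    def bid(self, request):
        if not self.availability:
            return None  # a busy resource abstains from bidding
        return Bid(self.name, self.availability * self.quality)

class ScienceAgent:
    def __init__(self, resources):
        self.resources = resources

    def schedule(self, request):
        bids = [b for b in (r.bid(request) for r in self.resources) if b]
        if not bids:
            return None  # no resource can take the observation
        return max(bids, key=lambda b: b.score).resource

agents = [ResourceAgent("telescope-A", 0.9, 0.5),
          ResourceAgent("telescope-B", 0.4, 0.8),
          ResourceAgent("archive-DB", 1.0, 0.3)]
print(ScienceAgent(agents).schedule({"target": "some-transient"}))
```

Note that the archive database participates in the auction on the same footing as the telescopes, reflecting the point above that static and real-time resources can share one interface.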
This talk explores the emergence of these architectures in the astronomical community from the viewpoint of one of the people intimately involved in the process. The talk will walk attendees through the pitfalls faced by developers hoping to implement such novel architectures and discuss how the deployment of these architectures in the field has prompted the interesting and increasing use of scientists as mechanical turks by their own software.
by Ted Dunning
Map-reduce and Hadoop provide new scaling opportunities for analyzing data. As a result, organizations are beginning to analyze and derive business value from large amounts of data that, in many cases, were previously simply being discarded. In some cases, the ability to analyze these previously impenetrable volumes of data has disrupted entire industries, as with on-line advertising.
Such green field opportunities are rare, however, and few companies can afford to build an entirely new analytics pipeline. Integrating big data analytics systems like Apache Hadoop into existing analytics systems can be very difficult, because there are huge differences in the fundamental approaches taken to the basic problems of how data should be accessed and analyzed.
These differences are exactly what makes these new technologies hugely effective, but they are also what makes integration between conventional and new approaches so difficult.
This talk will provide detailed descriptions of how to use new technologies to
These descriptions will be taken from real-life customer situations. Each will describe the problems faced and the solutions that solved these problems.
This session is sponsored by MapR Technologies
This is a talk aimed at people who know their data, and want to learn how to visualize it most effectively. If you have data, a need for answers, and a blank page, this is a great place to start.
We’ll start by briefly addressing the value of visualization, and discuss the differences between visualization for analysis and visualization for presentation.
From there we’ll figure out what story to tell with your visualization by examining the holy visualization trinity:
Once the story has been selected, we need to construct it. We’ll discuss key considerations to make good choices about:
We’ll end with a brief discussion of some current tools, and look at some classic and innovative visualization examples.
by Robert Munro
Pandemics are the greatest current threat to humanity. Many unidentified pathogens are already hiding out in the open, reported in local online media as sudden clusters of ‘influenza-like’ or ‘pneumonia-like’ clinical cases many months or even years before careful lab tests confirm a new microbial scourge. For some current epidemics like HIV, SARS, and H1N1, the microbial enemies were anonymously in our midst for decades. With each new infection, viruses and bacteria mutate and evolve into ever more harmful strains, and so we are in a race to identify and isolate new pathogens as quickly as possible.
Until now, no organization has succeeded in the task of tracking every global outbreak and epidemic. The necessary information is spread across too many locations, languages and formats: a field report in Spanish, a news article in Chinese, an email in Arabic, a text-message in Swahili. Even among open data, simple key-word or white-list based searches tend to fall short as they are unable to separate the signal (an outbreak of influenza) from the noise (a new flu remedy). In a project called EpidemicIQ, the Global Viral Forecasting Initiative has taken on the challenge of tracking all outbreaks. We are complementing existing field surveillance efforts in 23 countries with a new initiative that leverages large-scale processing of outbreak reports across a myriad of formats, utilizing machine learning, natural language processing and microtasking coupled with advanced epidemiological analysis.
EpidemicIQ intelligently mines open web-based reports, social media, transportation networks and direct reports from healthcare providers globally. Machine-learning and natural language processing allows us to track epidemic-related information across several orders of magnitude more data than any prior health efforts, even across languages that we do not ourselves speak. By leveraging a scalable workforce of microtaskers we are able to quickly adapt our machine-learning models to new sources, languages and even diseases of unknown origin. During peak times, the use of a scalable microtasking workforce also takes much of the information processing burden off the professional epidemic intelligence officers and field scientists, allowing them to apply their full domain knowledge when needed most.
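The signal-vs-noise point above (an outbreak of influenza versus a new flu remedy) is exactly where a learned classifier beats a keyword white-list. The toy sketch below, with invented training snippets, shows a tiny Naive Bayes text classifier learning the context around the word “flu”; it is an illustration of the general technique, not EpidemicIQ’s actual models.

```python
# Toy Naive Bayes classifier: both classes contain the keyword 'flu',
# but the model learns the surrounding context words.
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

class NaiveBayes:
    def __init__(self):
        self.word_counts = {"outbreak": Counter(), "noise": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def score(self, text, label):
        counts = self.word_counts[label]
        vocab = set(self.word_counts["outbreak"]) | set(self.word_counts["noise"])
        total = sum(counts.values())
        logp = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for w in tokenize(text):
            # Laplace smoothing so unseen words don't zero out the score
            logp += math.log((counts[w] + 1) / (total + len(vocab)))
        return logp

    def classify(self, text):
        return max(("outbreak", "noise"), key=lambda l: self.score(text, l))

nb = NaiveBayes()
nb.train("sudden cluster of flu cases reported in village", "outbreak")
nb.train("hospital reports spike in pneumonia cases", "outbreak")
nb.train("new flu remedy on sale at pharmacy", "noise")
nb.train("pharmacy chain advertises flu shots sale", "noise")
print(nb.classify("cluster of flu cases in district"))
```

A white-list match on “flu” fires on all four training snippets alike; the classifier separates them by context.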
At Strata, we propose to introduce EpidemicIQ’s architecture, strategies, successes and challenges in big-data to date.
by Tim Moreton
At the heart of every system that harnesses big data is a pipeline that comprises collecting large volumes of raw data, extracting value from it through analytics or data transformations, then delivering that condensed set of results back out—potentially to millions of users.
This talk examines the challenges of building manageable, robust pipelines—a simplifying paradigm that will help participants looking to architect their own big data systems.
I’ll look at what you want from each of these stages—using Google Analytics as a canonical big data example, as well as case studies of systems deployed at LinkedIn. I’ll look at how collecting, analyzing and serving data pose conflicting demands on the storage and compute components of the underlying hardware. I’ll talk about what available tools do to address these challenges.
I’ll move on to consider two holy grails: real-time analytics, and dual data center support. The pipeline metaphor highlights a challenge in deriving real-time value from huge datasets: I’ll explore what happens when you compose multiple, segregated platforms into a single pipeline, and how you can dodge the issue with a ‘fast’ and ‘slow’ two-tier architecture. Then I’ll look at how you can figure dual data center support into the design, particularly important for highly available deployments on the cloud.
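The ‘fast’ and ‘slow’ two-tier idea mentioned above can be sketched minimally: a slow batch tier periodically recomputes results over the full log of record, while a fast tier counts only events that arrived since the last batch run, and queries merge both views. This is a generic illustration under invented names, not the speaker’s specific design.

```python
# Two-tier counter: batch view (slow, complete) + realtime view (fast,
# incremental). A query sums both to get a near-real-time answer.
from collections import Counter

class TwoTierCounter:
    def __init__(self):
        self.batch_view = Counter()     # rebuilt by the slow pipeline
        self.realtime_view = Counter()  # updated per incoming event

    def ingest(self, key):
        self.realtime_view[key] += 1

    def run_batch(self, full_log):
        # Slow tier recomputes from the complete log; fast tier then
        # resets to cover only events newer than this batch run.
        self.batch_view = Counter(full_log)
        self.realtime_view.clear()

    def query(self, key):
        return self.batch_view[key] + self.realtime_view[key]

c = TwoTierCounter()
c.run_batch(["page-a", "page-a", "page-b"])  # nightly batch job
c.ingest("page-a")                           # event arriving afterwards
print(c.query("page-a"))  # 3: two from batch, one realtime
```

The batch recomputation also corrects any drift or loss in the fast tier, which is why the two tiers together are more robust than either alone.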
In summary, this talk will present a useful metaphor for architecting big data systems and, using deployed examples, describe how to fit together the available tools across a range of settings.
Birds of a Feather (BoF) sessions provide face to face exposure to those interested in the same projects and concepts. BoFs can be organized for individual projects or broader topics (best practices, open data, standards). BoF topics are entirely up to you.
BoFs at Strata will happen during lunch on Thursday, September 22 and Friday, September 23, where lunch is served on the Mezzanine level of the hotel.
Visit the BoF signup board near registration to claim a reserved table and schedule your BoF.
by Steven Hillion
Do you use all the information you should when you make your most important decisions? Is your organization prepared to go beyond BI to enable breakthrough insights and decisions that transform the way you do business?
Increasingly, organizations realize that data-intensive predictive analytics is a necessary tool for a company to compete and succeed – even if the organization has already deployed a full-blown BI and DW stack. Armed with advanced analytics insights, business users can make well-informed decisions to support their organizations’ tactical and strategic goals – and create competitive advantage.
Steven Hillion, VP of EMC Greenplum’s Data Analytics Lab, lends insight into emerging technologies for taking advantage of the big data opportunity, and into how big data challenges today’s BI architectures and approaches to data management.
This session is sponsored by EMC Greenplum
by Peter Sirota
By pairing the elasticity and pay-as-you-go nature of the cloud with the flexibility and scalability of Hadoop, Amazon Elastic MapReduce has brought Big Data analytics to an even wider array of companies looking to maximize the value of their data. Each day, thousands of Hadoop clusters are run on the Amazon Elastic MapReduce infrastructure by users of every size—from University students to Fortune 50 companies—exposing the Elastic MapReduce team to an unparalleled number of use cases. In this session, we will contrast how three of these users, Amazon.com, Yelp, and Etsy, leverage the marriage of Hadoop and the cloud to drive their businesses in the face of explosive growth, including generating customer insights, powering recommendations, and managing core operations.
Global Pulse is a United Nations innovation initiative that is developing a new approach to crisis impact monitoring. One of the key outputs of the project is HunchWorks, a place where experts can post hypotheses—or hunches—that may warrant further exploration and then crowdsource data and verification. HunchWorks will be a key global platform for rapidly detecting emerging crises and their impacts on vulnerable communities. Using it, experts will be able to quickly surface ground truth and detect anomalies in data about collective behavior for further analysis, investigation and action.
The presentation will open with an introduction by Chris van der Walt (Project Lead, Global Pulse) to the problem that HunchWorks is being designed to address: How to detect the emerging impacts of global crises in real-time? A short discussion of the design thinking behind HunchWorks will follow plus an overview of the HunchWorks feature set.
Dane Petersen (Experience Designer, Adaptive Path) will then discuss some of the complex user experience design challenges that emerged as the team started to wrestle with developing HunchWorks and the approaches used to address them.
Sara Farmer (Chief Platform Architect, Global Pulse) will follow up with a discussion of the technology powering HunchWorks, which is based on autonomy, uncertain reasoning, and human-machine team theories, and is designed to allow users and automated tools to work collaboratively to reduce the uncertainty and missing-data issues inherent in hunch formation and management.
The presentation will conclude with 10 minutes of Q&A from the audience.
by Jim Adler
Today’s engineer CPO (and I’m one) must be deeply involved in the technology and product design process to bake in privacy. This new breed of CPO is comfortable in an engineering scrum, in a product focus group, reviewing pending regulations, or analyzing A/B test results. They have the historical awareness, frontier spirit, regulatory caution, technical chops, and innovator’s curiosity to work through the toughest data issues. The promise of the engineer CPO is products that not only safeguard privacy, but compete on it.
Panel Discussion on Assembling Data to Fight Breast Cancer
This session is sponsored by GE
Analytical culture is the last-mile problem of organizations. More data and analytics frequently lead to decision ambiguity. Insights are either not actionable or, when they are, not widely adopted at an operational level.
There has been a lot of emphasis on the technology and data quality aspects of analytics; however, without an analytical culture, most organizations will not be able to take advantage of the benefits.
After partnering as a consultant with more than 100 client organizations, from small point-solution pilots to deploying large decision support systems, I have developed a series of principles which I think are critical to creating and fostering an analytical culture. I want to introduce the framework and highlight the organizational principles with some real-life war stories.
Some of the organizational principles that I will speak about include:
I hope that the audience will embrace some of the principles and implement them as they build their analytical organizations and solutions.
Google is a data business: over the past few years, many of the tools Google created to store, query, analyze, and visualize its data have been exposed to developers as services.
This talk will give you an overview of Google services for Data Crunchers:
by Ben Gimpert
Most people in our community are accustomed to thinking of a “model” as the end result of a properly functioning big data architecture. Once you have an EC2 cluster reserved, after the database is distributed across some Hadoop nodes, and once a clever MapReduce machine learning algorithm has done its job, the system spits out a predictive model. The model hopefully allows an organization to conduct its business better.
This waterfall approach to modeling is embedded in the hiring process and technical culture of most contemporary big data organizations. When the business users sit in one room and the data scientists sit in another, we preclude one of the most important benefits of having on-demand access to big data. Models themselves are powerful exploratory tools! However, data sparsity, non-linear interactions and the resultant model’s quirks must be interpreted through the lens of domain expertise. All big data models are wrong but some are useful, to paraphrase the statistician George Box.
A data scientist working in isolation could train a predictive model with perfect in-sample accuracy, but only an understanding of how the business will use the model lets her balance the crucial bias/variance trade-off. Put more simply, applied business knowledge is how we can expect a model trained on historical data to do decently in situations we have never seen.
Models can also reveal predictors in our data we never expected. The business can learn from the automatic ranking of predictor importance with statistical entropy and multicollinearity tools. In the extreme, a surprisingly important variable that turns up during the modeling of a big data set could be the trigger of an organizational pivot. What if a movie recommendation model reveals a strange variable for predicting gross at the box office?
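One concrete way to produce the “automatic ranking of predictor importance with statistical entropy” mentioned above is information gain: how much a variable reduces the entropy of the outcome. The sketch below uses an invented movie-outcome toy dataset, echoing the box-office example; it illustrates the measure, not the presenter’s actual system.

```python
# Rank predictors by information gain (entropy reduction of the label).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, column):
    base = entropy(labels)
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[column], []).append(y)
    # Expected entropy remaining after splitting on this column
    remainder = sum(len(ys) / len(labels) * entropy(ys)
                    for ys in by_value.values())
    return base - remainder

rows = [{"genre": "action", "sequel": "yes"},
        {"genre": "drama",  "sequel": "no"},
        {"genre": "action", "sequel": "no"},
        {"genre": "drama",  "sequel": "yes"}]
labels = ["hit", "flop", "hit", "flop"]  # invented box-office outcomes

ranking = sorted(["genre", "sequel"],
                 key=lambda c: information_gain(rows, labels, c),
                 reverse=True)
print(ranking)  # 'genre' perfectly separates the labels in this toy data
```

A surprisingly high-gain variable in a real dataset is exactly the kind of finding that could trigger the organizational pivot described above.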
My presentation introduces exploratory model feedback in the context of big (training) data. I will use a real-life case study from Altos Research that forecasts a complex system: real estate prices. Rapid prototyping with Ruby and an EC2 cluster allowed us to optimize human time, but not necessarily computing cycles. I will cover how exploratory model feedback blurs the line between domain expert and data scientist, and also blurs the distinction between supervised and unsupervised learning. This is all a data ecology, in which a model of big data can surprise us and suggest its own future enhancement.
by Vineet Tyagi
Businesses today are moving beyond the buzz and experimentation with the batch processing options of Hadoop and MapReduce, stretching the limits of cutting-edge performance and scalability. This session will talk about the emerging trends of a new generation of NoHadoop (Not Only Hadoop) architectures for future-proof big data scalability, and prepare you for life beyond the elephant ride!
This session is sponsored by Impetus Technologies, Inc.
by Justin Moore
Foursquare stores and processes everything from check-ins to screen views using a combination of home grown and open source tools. This talk covers an overview of our stack, highlighting specific examples of how, and why, it grew to what it is today and continues with the many ways that this infrastructure is employed.
One such example is our data-driven product development with the recently launched recommendations engine, named “Explore.” Explore recycles past check-in data into signals like venue similarity and time-sensitive popularity measures, resulting in intelligent recommendations that build upon past user behavior as well as social and bookmarking features.
This talk takes a closer look at how Explore, and other features, emerged from our data analysis as well as the iterative process of monitoring and improvement that is critical for making such features a success.
Data fuels 21st century business and society. Thanks to the rapid pace of innovation and widespread adoption of information technologies, data has become both a strategic asset and a potentially crippling liability. As consumers grow increasingly concerned about the stewardship of their data, policymakers, academics and advocates around the world are questioning boundaries and considering risks:
These questions are urgent in the current business climate where trust in our most basic institutions has been eroded. As organizations cope with growing tension between innovation, privacy and security, they are discovering that appropriate use and protection of data has broad impact on their reputations and bottom lines—a new, holistic ethos of data environmentalism is necessary.
Many enterprises are being overwhelmed by the proliferation of machine data. Websites, communications, networking, and complex IT infrastructures constantly generate massive streams of data in highly variable and unpredictable formats that are difficult to process and analyze by traditional methods or in a timely manner. Yet this data holds a definitive record of all activity and behavior, including user transactions, customer behavior, system behavior, security threats, and fraudulent activity. Quickly understanding and using this data can add value to a company’s services, customer satisfaction, revenue growth, and profitability. This session examines the challenges and approaches for collecting, organizing, and deriving real-time insights from terabytes to petabytes of data, with examples from Salesforce.com, the nation’s leading enterprise cloud computing company.
by Betsy Masiello, Jane Yakowitz and Solon Barocas
Analytics can push the frontier of knowledge well beyond the useful facts that already reside in big data, revealing latent correlations that empower organizations to make statistically motivated guesses—inferences—about the character, attributes, and future actions of their stakeholders and the groups to which they belong.
This is cause for both celebration and caution. Analytic insights can add to the stock of scientific and social scientific knowledge, significantly improve decision-making in both the public and private sector, and greatly enhance individual self-knowledge and understanding. They can even lead to entirely new classes of goods and services, providing value to institutions and individuals alike. But they also invite new applications of data that involve serious hazards.
This panel considers these hazards, asking how analytics implicate:
The panel will also debate the appropriate response to these issues, reviewing the place of norms, policies, legal frameworks, regulation, and technology.
by Ken Farmer
While most of the focus within data science is on the rapid analysis of vast volumes of data, the hardest part of most solutions is the data acquisition, movement, transformation, and loading – the “data logistics”.
Since the early days of data mining and data warehousing (in the late 80s and early 90s) it has been understood that 90% of the effort of these projects will be spent on data acquisition, cleansing, transformation and consolidation. The challenges include:
The data warehousing domain refers to data logistics as “ETL”, for Extract, Transform, and Load. Some best practices and methods have been developed to address these challenges, but little effort has been put into reusable patterns; more effort has gone into commercial products. In spite of the lack of formal patterns, a sense of what works and what doesn’t has emerged, and it can be read “between the lines” if someone knows what to look for.
This presentation will describe what the challenges look like when trying to deliver data insights that of necessity span many sets of data. It will explain these in both business and technical terms, and then proceed to address some of the common solutions and their strengths and weaknesses.
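The extract/cleanse/transform/consolidate shape discussed above can be sketched in a few lines. The field names, delimiter, and cleansing rules below are invented for illustration; the point is that each stage is a small, separately testable step.

```python
# Minimal ETL sketch: extract raw records, cleanse/transform them,
# then load (consolidate) into a target store.
def extract(raw_lines):
    # Parse pipe-delimited records into dicts; skip malformed rows.
    for line in raw_lines:
        parts = line.strip().split("|")
        if len(parts) == 2:
            yield {"customer": parts[0], "amount": parts[1]}

def transform(records):
    # Cleanse: normalize case, coerce types, drop unusable rows.
    for rec in records:
        try:
            yield {"customer": rec["customer"].strip().lower(),
                   "amount": float(rec["amount"])}
        except ValueError:
            continue  # in practice, route to a reject file for review

def load(records, store):
    # Consolidate: aggregate per customer into the target store.
    for rec in records:
        store[rec["customer"]] = store.get(rec["customer"], 0.0) + rec["amount"]

warehouse = {}
raw = ["Acme |10.50", "acme|4.50", "bad row", "globex|oops", "Globex|7"]
load(transform(extract(raw)), warehouse)
print(warehouse)  # {'acme': 15.0, 'globex': 7.0}
```

Even in this toy, most of the code is cleansing and error handling rather than analysis, which is precisely the 90%-of-effort observation made above.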
22nd–23rd September 2011