Opening remarks by the Strata program chairs, Edd Dumbill and Alistair Croll.
From hackathons to API-enabled civic data, learn how New York City government is evolving thanks to deeper engagement with the technology community.
by Jon Jenkins
The Kepler Mission began its science observations just over two years ago, on May 12, 2009, initiating NASA’s first search for Earth-like planets. Initial results and light curves from Kepler are simply breathtaking, including confirmation of Kepler-10b, the first unquestionably rocky planet, and Kepler-11, a system of six transiting planets orbiting a single Sun-like star.
Kepler released light curves for the first 120 days of observations for over 150,000 target stars on February 2, 2011, and announced the identification of 1,235 planetary candidates, including 68 candidates smaller than 1.25 Earth radii and 54 candidates in or near the habitable zone of their parent star. An astounding 408 of these candidates were found in multi-planet systems orbiting 170 stars. Dr. Jenkins will discuss how much we’ve learned over these 24 months about the instrument, the planets and the stars.
by Elissa Fink
Creating visualizations and infographics with public data helps keep our politicians honest, and our society transparent. Strata and Tableau Public, a free tool for creating interactive online visualizations, challenged hundreds of bloggers to compete in their Interactive Public Data Visualization Contest. Come see the best of the best from the contest, and the official announcement of the winner.
This keynote sponsored by Tableau Software
by Jer Thorp
Almost every piece of data is tethered to something in the real world. When we work with numbers, we are often able (and willing) to ignore the real world objects and systems that these numbers represent.
In this presentation, Jer Thorp will discuss his work with names—designing an arrangement algorithm for the 9/11 Memorial in Manhattan. He’ll walk through collaborative processes, admit to a series of failures and ultimately show how humans and software can combine to solve extraordinary problems.
by Randy Lea
The opportunity exists for organizations in every industry to unlock the power of iterative, big data analysis for new applications such as digital marketing optimization and social network analysis that improve the bottom line. Big data analysis is not just the ability to analyze large volumes of data, but also the ability to analyze more varieties of data and perform more complex analysis than is possible with more traditional technologies. But it doesn’t have to be as complicated as it sounds. This session will show you how you can bring the science of data to the art of business and empower more business users and analysts to operationalize insights and drive results. You’ll see examples from retail, financial services, and media companies of how data science is applied when emerging analytic technologies are made more accessible to business users and easier for enterprise architects to manage.
This keynote sponsored by Aster Data
by John Rauser
Quantitative Engineer? Business Intelligence Analyst? Data Scientist? The data deluge has come upon us so quickly that we don’t even know what to call ourselves, much less how to make a career of working with data. This talk examines the critical traits that lead to success by looking back to what may be the first act of data science.
The Apache Cassandra database has added many new enterprise features this year based on the real-world needs of companies like Twitter, Netflix, Openwave, and others building massively scalable systems.
Apache Cassandra addresses a wide variety of real-time big data needs. Capable of tracking transactions in financial markets or the actions of millions of users in massively multiplayer games, Cassandra handles the demands of large volume applications and data streams. Whether it’s storing billions of emails or backing up terabytes of files, Cassandra can store large amounts of data and scale near-infinitely. In today’s information age, Cassandra excels at storing and serving massive amounts of data at low-latency – from geolocation data to server performance metrics, and more.
This talk will cover the motivation and use cases behind features such as secondary indexes, Hadoop integration, CQL query support, bulk loading, and more.
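For a concrete feel for what working with these features looks like from application code, here is a minimal, hedged sketch using the pycassa Python client of that era; the keyspace, column family, and column names are hypothetical examples rather than anything from the talk, and a secondary index is assumed to already exist on the state column.

```python
# Hedged sketch with the pycassa Python client (Cassandra 0.7/0.8-era Thrift
# API). Keyspace, column family, and column names are hypothetical, and a
# secondary index is assumed to already exist on the 'state' column.
import pycassa
from pycassa.index import create_index_clause, create_index_expression

pool = pycassa.ConnectionPool('DemoKeyspace', ['localhost:9160'])
users = pycassa.ColumnFamily(pool, 'Users')

users.insert('user42', {'name': 'Ada', 'state': 'NY'})   # write one row
print(users.get('user42'))                                # read it back

# Look up rows through the secondary index on 'state'.
clause = create_index_clause([create_index_expression('state', 'NY')])
for key, columns in users.get_indexed_slices(clause):
    print(key, columns)
```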
The shift to real-time, data-driven applications and what that means
* Why Cassandra is ideal for today’s enterprise data applications

Recap: Cassandra through 2010
* Best-in-class support for multiple datacenters
* High-performance storage engine based on Bigtable

New in Cassandra 1.0
This session is sponsored by DataStax
by Jon Jenkins
The Kepler spacecraft launched on March 7, 2009, initiating NASA’s first search for Earth-size planets orbiting Sun-like stars, with stunning results after being on the job for just over two years. Designing and building the Kepler science pipeline software that processes and analyzes the resulting data to make these discoveries presented a daunting set of challenges.
Although capable of reaching a precision near 20 ppm in 6.5 hours in order to detect 80-ppm drops in brightness corresponding to Earth-size transits, the instrument is sensitive to its environment. Identifying and removing instrumental signatures from the data, as well as characterizing the variability of the stars themselves, has proven to be extremely important in the quest for Earth-size planets. In addition, the computational intensity of processing the accumulating data compelled us to port the detection and validation pipeline components to the Pleiades supercomputer at NASA Ames Research Center. As we look forward to an extended mission of up to 10 years of flight operations, balancing the need for speed against the requirement for ultrahigh precision presents a challenge.
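To make those numbers concrete, a quick back-of-the-envelope calculation (an illustrative sketch, not part of the Kepler pipeline) shows why Earth-size transits sit right at the edge of the instrument’s precision:

```python
# Transit depth for an Earth-size planet crossing a Sun-like star, under
# idealized geometry (no limb darkening, no stellar variability).
R_SUN_KM = 696_000
R_EARTH_KM = 6_371

depth_ppm = (R_EARTH_KM / R_SUN_KM) ** 2 * 1e6
print(f"Earth-size transit depth: {depth_ppm:.0f} ppm")        # roughly 84 ppm

# At ~20 ppm precision per 6.5-hour transit, a single event is only a ~4-sigma
# dip, which is why multiple transits must be stacked before claiming a detection.
print(f"single-transit significance: {depth_ppm / 20:.1f} sigma")
```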
A lesson we have learned from our work on the data portal DataMarket.com and custom projects we’ve done for a wide variety of customers is: Regardless of how interesting the underlying data or how ground-breaking the analysis is, most people only realize the value and see the potential once the data has been properly visualized.
Put another way: Visualization is where normal people fall in love with data, and – when done right – where they can understand the data at a glance.
We are by no means alone in realizing this. Data visualization has become a hot field, and a lot of statisticians, designers and computer professionals are taking their first steps, learning by example from things they’ve seen elsewhere. Some of these examples are colorful, pretty and praised but still don’t communicate the data properly – the real stories may even be obscured or distorted with badly parsed data or gratuitous visual fluff. Other examples are breaking new ground and advancing the field. But which is which?
Visually communicating data is not a new field. People have been honing data visualization skills since the 19th century, learning a lot about what works – and what doesn’t. It is possible to do things both “right” and beautiful at the same time. In this presentation we hope to explain how by showing the audience some of the very best examples of such work from the leaders in this field – and others that have not done as well.
After providing this background we will walk the audience step-by-step through one particular data visualization project we have worked on (possibly our Earthquake and Eruptions video), explaining the methods, tools and process involved in putting that together and the decisions that led to those particular choices.
Big Noise always accompanies Big Data, especially when extracting entities from the tangle of duplicate, partial, fragmented and heterogeneous information we call the Internet. The ~17m physical businesses in the US, for example, are found on over 1 billion webpages and endpoints across 5 million domains and applications. Organizing such a disparate collection of pages into a canonical set of things requires a combination of distributed data processing and human-based domain knowledge. This presentation stresses the importance of entity resolution within a business context and provides real-world examples and pragmatic insight into the process of canonicalization.
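As a rough illustration of what canonicalization means in practice (a toy sketch, not the production pipeline described in the talk), the core move is to normalize noisy records into a blocking key so that duplicates scraped from different pages collapse onto one candidate entity:

```python
# Toy canonicalization sketch (illustrative only): collapse duplicate business
# records from different pages onto one candidate entity by normalizing
# name + ZIP into a blocking key.
import re
from collections import defaultdict

def blocking_key(record):
    """Crude normalization: lowercase, strip punctuation and corporate suffixes."""
    name = re.sub(r"[^a-z0-9 ]", "", record["name"].lower())
    name = re.sub(r"\b(inc|llc|co|corp)\b", "", name).strip()
    return (name, record["zip"][:5])

pages = [
    {"name": "Joe's Pizza, Inc.",  "zip": "10012-1234"},
    {"name": "Joes Pizza",         "zip": "10012"},
    {"name": "Blue Bottle Coffee", "zip": "94612"},
]

entities = defaultdict(list)
for rec in pages:
    entities[blocking_key(rec)].append(rec)

for key, records in entities.items():
    print(key, "<-", len(records), "source record(s)")
```

In a real pipeline the grouped candidates would still need pairwise matching and human-supplied domain rules; the blocking step simply keeps that comparison tractable at web scale.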
You’ve collected a ton of data and your team is busily crunching numbers and coming to conclusions… but are they the right ones? You can only know with the right context, and you can’t get context while working in a silo. We invite you to bring the rest of the world into your data warehouse. Don’t worry, it’ll add more value than it takes, and instead of working on the data, you can work on your vision.
In this talk, we’ll allay your fears of open data, demonstrate the difference between making decisions with and without context and show you other neat things that happen when you share.
Nowadays, major news events prompt millions of responses online. Every message passing through the internet is a single voice; aggregate analysis and visualization help us see the roar of the crowd.
The Guardian first explored this last year with an award-winning graphic that replays World Cup games, condensing 90 minutes of tweets into 90 seconds of interactive animation. By juxtaposing match events with surges in word popularity, viewers can relive the ripples of human reaction passing through Twitter.
Asked to apply similar techniques to the News International saga, we partnered with Datasift to capture and display public responses during key events in the story. This talk steps through the process of recording, processing and displaying a large volume of tweets which enabled a small team to build complex pieces of interactive content at newsroom speeds.
Above all, the presentation will aim to portray the delicate balance of design, data and storytelling at the heart of interactive news content.
MapReduce, Hadoop, and other “NoSQL” big data approaches open opportunities for data scientists in every industry to develop new data-driven applications for digital marketing optimization and social network analysis through the power of iterative, big data analysis. But what about the business user or analyst? How can they unlock insights through standard business intelligence (BI) tools or SQL access? The challenge with emerging big data technologies is finding staff with the specialized skill sets of the data scientist to implement and use these solutions. Business leaders and enterprise architects struggle to understand, implement, and integrate these big data technologies with their existing business processes and IT investments while still delivering value to the business.
This session will explore a new class of analytic platforms and technologies such as SQL-MapReduce®, which bring the science of data to the art of business. By fusing standard business intelligence and analytics with next-generation data processing techniques such as MapReduce, big data analysis is no longer just in the hands of the few data science or MapReduce specialists in an organization! You’ll learn how business users can easily access, explore, and iterate their analysis of big data to unlock deeper insights. See example applications with digital marketing optimization, fraud detection and prevention, social network and relationship analysis, and more.
This session is sponsored by Aster Data
by John Lucker
Herbert Simon once wrote that “the central concern of administrative theory is with the boundary between rational and nonrational aspects of human social behavior.” Simon’s comment is especially pertinent to the still-emerging field of business analytics. The human dimension of business analytics might facetiously be called the discipline’s “dark matter”: it looms large while tending to remain hidden from view.
In many and diverse domains, human experts must repeatedly make decisions that require weighing disparate pieces of information. Unfortunately, we are not very good at this. We rely on mental heuristics (rules of thumb) which, as psychological research shows, have surprising biases that limit our ability to make truly objective decisions. The implication is that society is replete with inefficient markets and business processes that can be improved with business analytics.
Analytics projects are often bedeviled – or simply stopped in their tracks – by challenges emanating from organizational culture, misunderstanding of statistical concepts, and discomfort with probabilistic reasoning. Compounding these challenges is the fact that data scientists often “speak a different language” from the business domain experts they are charged with helping. In our experience, these challenges can be among the most difficult ones faced in an analytics project, and are ignored at one’s peril. This talk will provide a number of case studies and vignettes; relate these examples to relevant ideas from the decision sciences; and offer practical tips for achieving organizational buy-in.
Birds of a Feather (BoF) sessions provide face-to-face exposure to those interested in the same projects and concepts. BoFs can be organized for individual projects or broader topics (best practices, open data, standards). BoF topics are entirely up to you.
BoFs at Strata will take place during lunch on Thursday, September 22 and Friday, September 23; lunch is served on the Mezzanine level of the hotel.
Visit the BoF signup board near registration to claim a reserved table and schedule your BoF.
Structured search improves the search experience through the identification of entities and their relationships in documents and queries. This panel will explore the current state of structured and semi-structured search, as well as exploring the open problems in an area that promises to revolutionize information seeking.
The four panelists below work on some of the world’s largest structured search problems, from offering users structured search on Google’s web corpus to building a computing system that defeated Jeopardy! champions in an extreme test of natural language understanding. They work on the data, tools, and research that are driving this field. They are all excellent researchers and presenters, promising an informative and engaging panel discussion, for which I will act as moderator.
by Steve Jackson
How do you efficiently and effectively search the world’s leading collection of legal content — 2.2 billion documents — then quickly zero in on exactly what you need, all in a matter of seconds? Thomson Reuters Professional built a Big Data information management architecture to do just this for their clients. WestlawNext gives legal professionals comprehensive, specialized content plus unique search technologies and tools that help them find, understand and apply the law and legal concepts in the service of their clients. Learn how Thomson Reuters manages and processes a variety of very large and diverse data sources to quickly publish timely, trusted, and relevant information to their clients.
This session sponsored by Informatica
Good graphs are extremely powerful tools for communicating quantitative information clearly and accurately. Unfortunately, many of the graphs we see today are poor graphs that confuse, mislead or deceive the reader. These poor graphs often occur because the graph designer is not familiar with the principles of effective graphs or because the software used has poor default settings. We point out some of these graphical mistakes, including using unnecessary dimensions, not making the data stand out, making mistakes with scales, showing changes in one dimension by area or volume, and not making your message clear. In most cases very simple changes make the resulting graphs easier for the reader to understand. In addition, we show some common mistakes with tables. We end with some useful but little-known graph forms that communicate the data more clearly than the everyday graphs in common use.
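One of the mistakes listed above, showing a one-dimensional change by area or volume, is easy to quantify. The following small sketch (illustrative only, not from the session) shows how mapping a value directly to a circle’s radius exaggerates the change the reader perceives:

```python
# Why encoding a value as a circle's radius misleads: the eye reads area,
# so doubling the radius makes the value look four times larger.
import math

value_a, value_b = 10, 20                 # the underlying data: b is 2x a
radius_a, radius_b = value_a, value_b     # naive encoding: value -> radius
area_ratio = (math.pi * radius_b ** 2) / (math.pi * radius_a ** 2)
print(f"perceived (area) ratio: {area_ratio:.0f}x instead of 2x")        # 4x

# Fix: scale the radius by the square root of the value so area tracks value.
radius_b_fixed = radius_a * math.sqrt(value_b / value_a)
print(f"corrected area ratio: {(radius_b_fixed / radius_a) ** 2:.1f}x")  # 2.0x
```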
by Joseph Adler
Marketing is the art of telling potential customers or users about products or services that they might find useful. Some technology people might look down on marketing as a dirty, but necessary, part of running a company. That’s unfortunate, because marketing is one of the most interesting and valuable things that you can do with data.
At LinkedIn, we look at marketing as a recommendation problem, not a sales problem. Our goal is to help our users get the most benefit from our service. We use a lot of data and technology to market our own services. To do this, we use a variety of big data systems: recommendation engines, data processing, and content delivery. We rely on a team of marketing professionals, designers, engineers, and data scientists. We approach marketing scientifically, and constantly test new hypotheses to learn how to market better.
In this talk, I’m going to describe LinkedIn’s approach to personalized marketing, using the story of the award-winning “Year in Review” email message. I’ll talk about how we come up with ideas, how we test new ideas, and how we quickly turn ideas into scalable production processes. And finally, I’ll talk about Tickle, the Hadoop-based system that we built to generate and prioritize marketing email messages.
Whether you believe the hype around Big Data or not, the amount of information accruing throughout large organizations is growing more pronounced every day. And it’s not simply a question of volume; of equal concern is the variety of data. There are emails, IMs, tweets, Facebook updates and the fastest-growing category of data: video. This variety makes it difficult to generate an apples-to-apples comparison of data from a single individual or entity. Combine this with the fact that experts think there is no such thing as ‘clean’ data, and you have a growing problem.
This is why it is better to focus on understanding digital character. As with individuals, electronic data has ‘character.’ That character helps to disambiguate the relationship between one piece of data and another. This is particularly important because communication is more fragmented than ever, making relevance more difficult to ascertain.
Digital character is similar to individual character in the real world; particularly in the sense that character emerges over time. Does one embarrassing photo or comment on Facebook define an individual’s lifetime character? Can’t everyone recollect an email they wish they had never sent? Just as in the real world, digital character requires a large enough body of work to make an accurate character judgment.
Elizabeth Charnock, CEO of Cataphora and author of E-Habits, will discuss the pitfalls of Bad Data, and how it manifests itself in the interaction between a male stripper and a Harvard professor.
‘Crowdsourcing big data’ might sound like a randomly generated selection of buzzwords, but it turns out to represent a powerful leap forward in the accuracy of predictive analytics. As companies and researchers are fast discovering, data prediction competitions provide a unique opportunity for advancing the state of the art in fields as diverse as astronomy, health care, insurance pricing, sports rating systems and tourism forecasting. This session will focus not simply on the mechanics of data prediction competitions, but on why they work so effectively. As it turns out, the ‘why’ boils down to a couple of simple propositions, one associated with Archimedes and the other with world record-breaking runner Roger Bannister. Those propositions are not unique to the world of data science, but, as this session will show, have a particularly compelling application to it.
by Lee Feinberg
Sophisticated data analytics is a great thing. But great analytics is only valuable if people use it. The worst outcome is a great analysis filled with answers sitting on the shelf, unused. In this session you will learn how to present analytics in highly compelling ways. You’ll learn how to use it as a cultural change-agent—and how you must shift to a “data marketing mindset” to make it all happen.
This session is sponsored by Tableau Software
by Irene Ros
Data visualization is an important communication medium in both personal and public conversation spheres. Its wide use in entertainment and business settings alike has encouraged the creation of tools and frameworks that allow anyone to create visualizations and share them with their audience. While these tools offer tried-and-true visualization metaphors, they also pose risks, such as missing important data points or creating meaningless visuals.
This talk will introduce the concept of “responsible data visualization” in the context of two distinct uses: exploration and narrative. Using personal and industry examples to show best and worst practices in each approach, this talk will offer practical suggestions for bringing data visualization into one’s data workflow.
This talk will address the question of how to enable a much more agile data provisioning model for business units and data scientists. We’re in a mode shift where data unlocks new growth, and almost every Fortune 1000 company is scrambling to architect a new platform to enable data to be stored, shared and analyzed for competitive advantage. Many companies are finding that this shift requires major rethinking of how systems should be architected (and scaled) to enable agile, self-service access to critical data.
In this session we’ll discuss strategies for building agile big-data clouds that make it much faster and easier for data scientists to discover, provision and analyze data. We’ll discuss where and how new technologies (both vendor and OSS) fit into this model.
We will also discuss changes in application architectures as big-data begins to play a role in online applications, incorporating many big-data techniques to deliver consumer-targeted content. This new “real-time” analytics category is growing fast and several new data systems are enabling this shift. We’ll review which players and technologies in the NoSQL community are helping drive this architecture.
Economists utilize a data analysis toolkit and intuition that can be very helpful to Data Scientists. In particular, econometric methods are quite useful in disentangling correlation and causation, a use case not well-handled by standard machine learning and statistical techniques. This session will cover examples of econometric methods in action, as well as other economics-related insights. Think of it as a crash-course in basic econometric intuition that one receives during a PhD in Economics (I received my PhD from Stanford in 2008).
Why econometrics? The difference between econometrics and statistics is that statistical modeling is more concerned with fit, and econometric modeling is more concerned with properly estimating the coefficients in a regression. Getting the “right” (consistent & unbiased) estimates means that the analyst can more effectively measure how a change in one variable can strongly predict (or cause) a change in the dependent variable. These techniques can help solve problems in social/web data that previously were only solvable using future data collection from randomized multivariate experiments.
To do this, the analyst first develops an intuition for whether or not there is a source of “endogeneity” in the regression. This is largely determined by the relationship between the predictors and the error term in the regression. Once the source of the endogeneity is understood, econometric techniques like fixed/random effects and instrumental variables can be quite useful. The type of data that is collected and available is key to the extent to which the power of these techniques can be used. [I might also go into some other techniques, but these are the most useful]
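As a minimal sketch of the instrumental-variables idea just described (simulated data and made-up coefficients, not a LinkedIn analysis), two-stage least squares recovers the true effect where a naive regression is confounded:

```python
# Two-stage least squares (2SLS) in NumPy on simulated data: an instrument z
# recovers the true coefficient when the regressor x is endogenous
# (correlated with an unobserved confounder u).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)                        # instrument: moves x, not y directly
u = rng.normal(size=n)                        # unobserved confounder
x = 0.8 * z + u + rng.normal(size=n)          # endogenous regressor
y = 2.0 * x + 3.0 * u + rng.normal(size=n)    # true effect of x on y is 2.0

def ols(X, target):
    """Ordinary least squares coefficients."""
    return np.linalg.lstsq(X, target, rcond=None)[0]

ones = np.ones(n)
naive = ols(np.column_stack([ones, x]), y)[1]        # biased: absorbs u, ~3.1

Z = np.column_stack([ones, z])
x_hat = Z @ ols(Z, x)                                # stage 1: project x onto z
iv = ols(np.column_stack([ones, x_hat]), y)[1]       # stage 2: ~2.0

print(f"naive OLS slope: {naive:.2f}   2SLS slope: {iv:.2f}")
```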
The methods will be presented in a way so that a non-technical person can understand the basic intuition, and also so that a practitioner can apply the methods in the future. Examples will be provided. For panel data econometrics, we will discuss the example of how to identify actions taken early on by a LinkedIn member that are predictive of their future engagement with the product, a problem that is difficult due to the confounding of correlation and causation. For instrumental variables techniques, we will discuss how to use random variation in the weather to say cool things about politics, economics, and web usage.
In addition to the discussion of applied econometric techniques, there may also be time for economics-related data insights. Currently we are developing unemployment rate prediction models using time-series econometrics as well as indexes to measure changes in the supply/demand for talent across regions and industries.
by Bill Schmarzo
Companies are wrestling with the challenges of managing and exploiting big data. Larger, more diverse data sources and the business need for low-latency access to that data combine to provide new data monetization opportunities. But “bolting” analytics onto your existing data warehouse and business intelligence environment does not work. How do business owners and IT work together to identify the right business problem and then design the right architecture, to exploit these new data monetization opportunities? How do you ensure the successful deployment of these new capabilities, given the historically high rate of failure for new technologies?
This session will present a tried and proven methodology that is based upon a simple premise—business opportunities must drive all information technology deployments. While a technology-led approach is useful for helping an organization gain insight into “what” a new technology does, it is critical that the business opportunities drive the “why,” “how,” and “where” to implement new technologies.
This methodology provides the following key benefits:
• Ensures that your big data analytics initiative is focused on the business opportunities that provide the optimal tradeoff between business benefit and implementation feasibility
• Builds the organizational consensus necessary for success by aligning corporate resources around common goals, assumptions, priorities, and metrics
Case study examples will demonstrate its use.
The introduction of Apache Hadoop is changing the business intelligence data stack. In this presentation, Dr. Amr Awadallah, chief technology officer at Cloudera, will discuss how the architecture is evolving and the advanced capabilities it lends to solving key business challenges. Awadallah will illustrate how enterprises can leverage Hadoop to derive complete value from both unstructured and structured data, gaining the ability to ask, and get answers to, previously unaddressable big questions. He will also explain how Hadoop and relational databases complement each other, enabling organizations to access the latent information in all their data under a variety of operational and economic constraints.
This session is sponsored by Cloudera
How do data infrastructure, insights and products change when your user base grows by orders of magnitude? When should you move your user-facing data product off your laptop? (hint: now!) Does your data offer insights about the world at large, or is it just mirroring your early adopters?
In this talk, I will share some of the data scaling lessons we’ve learned at LinkedIn, recount war stories (and close calls!) and document the evolution of the data scientist.
by Paul Brown
Scientists had been dealing with big data and big analytics for at least a decade before the business world realized it had the same problems and united around ‘Big Data’, the ‘Data Tsunami’ and ‘the Industrial Revolution of data’. Both the science world and the commercial world share requirements for a high-performance informatics platform to support collection, curation, collaboration, exploration, and analysis of massive datasets.
Neither conventional relational database management systems nor Hadoop-based systems readily meet all the workflow, data management and analytical requirements desired by either community. They have the wrong data model – tables or files – or no data model at all. They require excessive data movement and data reformatting for advanced analytics. And they are missing key features, such as provenance.
SciDB is an emerging open source analytical database that runs on a commodity hardware grid or in the cloud. SciDB natively supports:
• An array data model – a flexible, compact, extensible data model for rich, high-dimensional data
• Massive-scale math – non-embarrassingly parallel operations like linear algebra on matrices too large to fit in memory, as well as transparently scalable R, MATLAB, and SAS-style analytics without requiring code for data distribution or parallel computation (see the sketch after this list)
• Versioning and provenance – data is updated, but never overwritten. The raw data, the derived data, and the derivation are kept for reproducibility, what-if modeling, back-testing, and re-analysis
• Uncertainty support – data carry error bars, probability distributions or confidence metrics that can be propagated through calculations
• Smart storage – compact storage for both dense and sparse data that is efficient for location-based, time-series, and instrument data
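To illustrate the kind of non-embarrassingly parallel operation the second bullet refers to, here is a conceptual NumPy sketch (not SciDB code) of a matrix multiply computed tile by tile, the basic pattern behind out-of-core linear algebra on matrices that never fit in memory all at once:

```python
# Blocked matrix multiply: each tile could live on disk or a remote node,
# so no full matrix ever has to be resident in memory.
import numpy as np

def blocked_matmul(A, B, block=256):
    """Compute A @ B one (block x block) tile at a time."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    for i in range(0, n, block):
        for j in range(0, m, block):
            for p in range(0, k, block):
                # In a real out-of-core system each tile would be fetched from
                # disk or a remote node rather than sliced from an in-RAM array.
                C[i:i + block, j:j + block] += (
                    A[i:i + block, p:p + block] @ B[p:p + block, j:j + block]
                )
    return C

A = np.random.rand(1000, 700)
B = np.random.rand(700, 500)
assert np.allclose(blocked_matmul(A, B), A @ B)
```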
We will sketch the design of SciDB, talk about how it differs from other proposals, and explain why that matters. We will also present some early benchmarking data and a computational genomics use case that showcases SciDB’s massively scalable parallel analytics.
22nd–23rd September 2011