Whether you believe the hype around Big Data or not, the amount of information accruing throughout large organizations grows every day. And it’s not simply a question of volume; of equal concern is the variety of data. There are emails, IMs, tweets, Facebook updates and the fastest-growing category of data: video. This variety makes it difficult to generate an apples-to-apples comparison of data, even data from a single individual or entity. Combine this with the fact that experts think that there is no such thing as ‘clean’ data, and you have a growing problem.
This is why it is better to focus on understanding digital character. Like individuals, electronic data has ‘character.’ That character helps to disambiguate the relationship between one piece of data and another. This is particularly important because communication is more fragmented than ever, which makes relevance more difficult to ascertain.
Digital character is similar to individual character in the real world, particularly in the sense that character emerges over time. Does one embarrassing photo or comment on Facebook define an individual’s lifetime character? Can’t everyone recollect an email they wish they had never sent? Just as in the real world, an accurate judgment of digital character requires a large enough body of work.
Elizabeth Charnock, CEO of Cataphora and author of E-Habits, will discuss the pitfalls of Bad Data, and how it manifests itself in the interaction between a male stripper and a Harvard professor.
This talk will address the question of how to enable a much more agile data provisioning model for business units and data scientists. We’re in a mode shift where data unlocks new growth, and almost every Fortune 1000 company is scrambling to architect a new platform to enable data to be stored, shared and analyzed for competitive advantage. Many companies are finding that this shift requires major rethinking of how systems should be architected (and scaled) to enable agile, self-service access to critical data.
In this session we’ll discuss strategies for building agile big-data clouds that make it much faster and easier for data scientists to discover, provision and analyze data. We’ll discuss where and how new technologies (both vendor and OSS) fit into this model.
We will also discuss changes in application architectures as big-data begins to play a role in online applications, incorporating many big-data techniques to deliver consumer-targeted content. This new “real-time” analytics category is growing fast and several new data systems are enabling this shift. We’ll review which players and technologies in the NoSQL community are helping drive this architecture.
by Paul Brown
Scientists had dealt with big data and big analytics for at least a decade before the business world came to realize it had the same problems and united around ‘Big Data’, the ‘Data Tsunami’ and ‘the Industrial Revolution of data’. Both the science world and the commercial world share requirements for a high performance informatics platform to support collection, curation, collaboration, exploration, and analysis of massive datasets.
Neither conventional relational database management systems nor Hadoop-based systems readily meet all the workflow, data management and analytical requirements of either community. They have the wrong data model (tables or files) or no data model at all. They require excessive data movement and data reformatting for advanced analytics. And they are missing key features, such as provenance.
SciDB is an emerging open source analytical database that runs on a commodity hardware grid or in the cloud. SciDB natively supports:
• An array data model – a flexible, compact, extensible data model for rich, highly dimensional data
• Massively scalable math – non-embarrassingly parallel operations like linear algebra operations on matrices too large to fit in memory as well as transparently scalable R, MATLAB, and SAS style analytics without requiring code for data distribution or parallel computation
• Versioning and Provenance – Data is updated, but never overwritten. The raw data, the derived data, and the derivation are kept for reproducibility, what-if modeling, back-testing, and re-analysis
• Uncertainty support – data carry error bars, probability distributions or confidence metrics that can be propagated through calculations
• Smart storage – compact storage for both dense and sparse data that is efficient for location-based, time-series, and instrument data
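To make the uncertainty-support item above concrete, here is a minimal Python sketch of how an error bar can be carried through a calculation. It is not SciDB's API: the `Measured` type is hypothetical, and the sketch assumes independent errors and first-order (linear) propagation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Measured:
    """A value with a one-sigma error bar (hypothetical type, not a SciDB API)."""
    value: float
    sigma: float

    def __add__(self, other: "Measured") -> "Measured":
        # Assuming independent errors, absolute errors add in quadrature.
        return Measured(self.value + other.value,
                        math.hypot(self.sigma, other.sigma))

    def __mul__(self, other: "Measured") -> "Measured":
        # First-order propagation for a product of independent quantities:
        # sigma^2 = (b * sigma_a)^2 + (a * sigma_b)^2.
        value = self.value * other.value
        sigma = math.hypot(other.value * self.sigma, self.value * other.sigma)
        return Measured(value, sigma)


# Made-up readings for illustration.
a = Measured(10.0, 0.3)
b = Measured(20.0, 0.4)
total = a + b      # error bar is hypot(0.3, 0.4)
product = a * b    # error bar is hypot(20 * 0.3, 10 * 0.4)
```

Every derived value keeps its own error bar, so a downstream consumer can see how much of a result is signal without re-deriving it from the raw data.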
We will sketch the design of SciDB and talk about how it’s different from other proposals, and why that matters. We will also present some early benchmarking data and a computational genomics use case that showcases SciDB’s massively scalable parallel analytics.
by Tim Moreton
At the heart of every system that harnesses big data is a pipeline that comprises collecting large volumes of raw data, extracting value from it through analytics or data transformations, then delivering that condensed set of results back out, potentially to millions of users.
This talk examines the challenges of building manageable, robust pipelines—a great simplifying paradigm that will help participants looking to architect their own big data systems.
I’ll look at what you want from each of these stages—using Google Analytics as a canonical big data example, as well as case studies of systems deployed at LinkedIn. I’ll look at how collecting, analyzing and serving data pose conflicting demands on the storage and compute components of the underlying hardware. I’ll talk about what available tools do to address these challenges.
I’ll move on to consider two holy grails: real-time analytics, and dual data center support. The pipeline metaphor highlights a challenge in deriving real-time value from huge datasets: I’ll explore what happens when you compose multiple, segregated platforms into a single pipeline, and how you can dodge the issue with a ‘fast’ and ‘slow’ two-tier architecture. Then I’ll look at how you can factor dual data center support into the design, which is particularly important for highly available deployments on the cloud.
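One way to picture the ‘fast’ and ‘slow’ two-tier idea is the following Python sketch. It is an illustration under stated assumptions, not the architecture the talk presents: the `TwoTierCounts` class and its method names are hypothetical, and the workload (counting page views) is chosen only because it is easy to merge.

```python
from collections import Counter


class TwoTierCounts:
    """Hypothetical page-view counter with a slow batch tier and a fast tier."""

    def __init__(self):
        self.log = []                # raw events, the source of truth
        self.batch_view = Counter()  # slow tier: rebuilt from the full log
        self.recent = Counter()      # fast tier: events since the last batch

    def record(self, page):
        # Collection: append to the raw log and update the fast tier at once.
        self.log.append(page)
        self.recent[page] += 1

    def run_batch(self):
        # Slow tier: recompute from the full log, then reset the fast tier.
        self.batch_view = Counter(self.log)
        self.recent.clear()

    def query(self, page):
        # Serving: merge the batch result with whatever arrived since.
        return self.batch_view[page] + self.recent[page]


counts = TwoTierCounts()
for page in ["home", "home", "pricing"]:
    counts.record(page)
counts.run_batch()        # slow tier now covers the first three events
counts.record("home")     # arrives after the batch; only the fast tier sees it
home_views = counts.query("home")
pricing_views = counts.query("pricing")
```

The batch job can be slow and thorough because queries never wait for it; the fast tier only has to cover the gap since the last batch run.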
In summary, this talk will present a useful metaphor for architecting big data systems and, using deployed examples, describe how to combine the available tools to suit a range of settings.
by Ben Gimpert
Most people in our community are accustomed to thinking of a “model” as the end result of a properly functioning big data architecture. Once you have an EC2 cluster reserved, after the database is distributed across some Hadoop nodes, and once a clever MapReduce machine learning algorithm has done its job, the system spits out a predictive model. The model hopefully allows an organization to conduct its business better.
This waterfall approach to modeling is embedded in the hiring process and technical culture of most contemporary big data organizations. When the business users sit in one room and the data scientists sit in another, we preclude one of the most important benefits of having on-demand access to big data. Models themselves are powerful exploratory tools! However, data sparsity, non-linear interactions and the resultant model’s quirks must be interpreted through the lens of domain expertise. All big data models are wrong but some are useful, to paraphrase the statistician George Box.
A data scientist working in isolation could train a predictive model with perfect in-sample accuracy, but only an understanding of how the business will use the model lets her balance the crucial bias / variance trade-off. Put more simply, applied business knowledge is how we can assume a model trained on historical data will do decently with situations we have never seen.
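A small Python sketch can illustrate why perfect in-sample accuracy says little about unseen situations. The data are made up, and both models are deliberately simple stand-ins: a nearest-neighbour lookup that memorizes the training set, and a straight line that encodes the assumption (the kind a domain expert might supply) that the trend continues.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def memorizer(xs, ys):
    """Nearest-neighbour lookup: reproduces every training point exactly."""
    table = list(zip(xs, ys))
    return lambda x: min(table, key=lambda pair: abs(pair[0] - x))[1]


# Made-up training data, roughly y = 2x with noise, plus one unseen point.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
new_x, new_y = 5.0, 10.0

slope, intercept = fit_line(train_x, train_y)
line = lambda x: slope * x + intercept
lookup = memorizer(train_x, train_y)

# Total absolute error on the training data for each model.
in_sample_lookup = sum(abs(lookup(x) - y) for x, y in zip(train_x, train_y))
in_sample_line = sum(abs(line(x) - y) for x, y in zip(train_x, train_y))

# Absolute error on the point neither model has seen.
unseen_lookup = abs(lookup(new_x) - new_y)
unseen_line = abs(line(new_x) - new_y)
```

The lookup has zero error on the training data yet misses the unseen point by more than the line does; the line trades a little in-sample error for an assumption about how the system behaves outside the data.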
Models can also reveal predictors in our data we never expected. The business can learn from the automatic ranking of predictor importance with statistical entropy and multicollinearity tools. In the extreme, a surprisingly important variable that turns up during the modeling of a big data set could be the trigger of an organizational pivot. What if a movie recommendation model reveals a strange variable for predicting gross at the box office?
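As a rough illustration of ranking predictors with an entropy measure, the Python sketch below computes information gain for categorical predictors. The rows and column names are invented for the example, and information gain is only one of several entropy-based measures; the talk does not specify which tools it uses.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, predictor, target):
    """Target entropy minus its expected entropy after splitting on predictor."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[predictor], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Made-up rows: did a listing sell quickly?
rows = [
    {"has_pool": "yes", "season": "spring", "sold_fast": "yes"},
    {"has_pool": "yes", "season": "winter", "sold_fast": "yes"},
    {"has_pool": "no", "season": "spring", "sold_fast": "no"},
    {"has_pool": "no", "season": "winter", "sold_fast": "no"},
]

# Rank predictors from most to least informative about the target.
ranking = sorted(["has_pool", "season"],
                 key=lambda p: information_gain(rows, p, "sold_fast"),
                 reverse=True)
```

In this toy data `has_pool` separates the outcomes completely while `season` carries no information, so `has_pool` ranks first. On real data the ranking is a prompt for a conversation with domain experts, not a conclusion.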
My presentation introduces exploratory model feedback in the context of big (training) data. I will use a real-life case study from Altos Research that forecasts a complex system: real estate prices. Rapid prototyping with Ruby and an EC2 cluster allowed us to optimize human time, but not necessarily computing cycles. I will cover how exploratory model feedback blurs the line between domain expert and data scientist, and also blurs the distinction between supervised and unsupervised learning. This is all a data ecology, in which a model of big data can surprise us and suggest its own future enhancement.
22nd–23rd September 2011