by John Rauser
Quantitative Engineer? Business Intelligence Analyst? Data Scientist? The data deluge has come upon us so quickly that we don’t even know what to call ourselves, much less how to make a career of working with data. This talk examines the critical traits that lead to success by looking back to what may be the first act of data science.
Whether you believe the hype around Big Data or not, the amount of information accruing throughout large organizations grows more staggering every day. And it’s not simply a question of volume; of equal concern is the variety of data. There are emails, IMs, tweets, Facebook updates and the fastest-growing category of data: video. This variety makes it difficult to generate an apples-to-apples comparison of data from a single individual or entity. Combine this with the fact that, as experts will tell you, there is no such thing as ‘clean’ data, and you have a growing problem.
This is why it is better to focus on understanding digital character. As with individuals, electronic data has ‘character.’ That character helps to disambiguate the relationship between one piece of data and another. This is particularly important because communication is more fragmented than ever, which makes relevance more difficult to ascertain.
Digital character is similar to individual character in the real world, particularly in the sense that character emerges over time. Does one embarrassing photo or comment on Facebook define an individual’s lifetime character? Can’t everyone recollect an email they wish they had never sent? Just as in the real world, digital character requires a large enough body of work to make an accurate character judgment.
Elizabeth Charnock, CEO of Cataphora and author of E-Habits, will discuss the pitfalls of Bad Data, and how it manifests itself in the interaction between a male stripper and a Harvard professor.
‘Crowdsourcing big data’ might sound like a randomly generated string of buzzwords, but it turns out to represent a powerful leap forward in the accuracy of predictive analytics. As companies and researchers are fast discovering, data prediction competitions provide a unique opportunity for advancing the state of the art in fields as diverse as astronomy, health care, insurance pricing, sports ratings systems and tourism forecasting. This session will focus not simply on the mechanics of data prediction competitions, but on why they work so effectively. As it turns out, the ‘why’ boils down to a couple of simple propositions, one associated with Archimedes and the other with record-breaking miler Roger Bannister. Those propositions are not unique to the world of data science, but, as this session will show, they have a particularly compelling application to it.
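The talk’s two propositions are its own, but one widely cited reason crowdsourced prediction works is simple error averaging: when many entrants’ mistakes are at least partly independent, a blend of their submissions beats the typical individual model. The sketch below (all data simulated, nothing from the talk) illustrates the effect:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated ground truth and a pool of imperfect, independent "entrants".
truth = rng.normal(size=1000)
n_models = 25
predictions = truth + rng.normal(scale=1.0, size=(n_models, truth.size))

def rmse(pred):
    """Root-mean-square error against the simulated ground truth."""
    return np.sqrt(np.mean((pred - truth) ** 2))

typical_rmse = np.mean([rmse(p) for p in predictions])   # average single entrant
blend_rmse = rmse(predictions.mean(axis=0))              # the "crowd" blend

print(f"typical single-model RMSE: {typical_rmse:.3f}")
print(f"blended crowd RMSE:        {blend_rmse:.3f}")    # markedly lower
```

The gap between the two numbers is the crowd effect: independent errors partly cancel when predictions are combined, which is one reason leaderboard blends in prediction competitions are so hard to beat.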
This talk will address the question of how to enable a much more agile data provisioning model for business units and data scientists. We’re in a mode shift where data unlocks new growth, and almost every Fortune 1000 company is scrambling to architect a new platform to enable data to be stored, shared and analyzed for competitive advantage. Many companies are finding that this shift requires major rethinking of how systems should be architected (and scaled) to enable agile, self-service access to critical data.
In this session we’ll discuss strategies for building agile big-data clouds that make it much faster and easier for data scientists to discover, provision and analyze data. We’ll discuss where and how new technologies (both vendor and OSS) fit into this model.
We will also discuss changes in application architectures as big data begins to play a role in online applications, which increasingly incorporate big-data techniques to deliver consumer-targeted content. This new “real-time” analytics category is growing fast, and several new data systems are enabling the shift. We’ll review which players and technologies in the NoSQL community are helping drive this architecture.
How do data infrastructure, insights and products change when your user base grows by orders of magnitude? When should you move your user-facing data product off your laptop? (hint: now!) Does your data offer insights about the world at large, or is it just mirroring your early adopters?
In this talk, I will share some of the data scaling lessons we’ve learned at LinkedIn, recount war stories (and close calls!) and document the evolution of the data scientist.
by Paul Brown
Scientists were dealing with big data and big analytics for at least a decade before the business world came to realize it had the same problems and united around ‘Big Data’, the ‘Data Tsunami’ and ‘the Industrial Revolution of data’. Both the science world and the commercial world share requirements for a high-performance informatics platform to support collection, curation, collaboration, exploration, and analysis of massive datasets.
Neither conventional relational database management systems nor Hadoop-based systems readily meet all the workflow, data management and analytical requirements of either community. They have the wrong data model (tables or files) or no data model at all. They require excessive data movement and data reformatting for advanced analytics. And they are missing key features, such as provenance.
SciDB is an emerging open source analytical database that runs on a commodity hardware grid or in the cloud. SciDB natively supports:
• An array data model – a flexible, compact, extensible data model for rich, highly dimensional data
• Massively scalable math – non-embarrassingly parallel operations, such as linear algebra on matrices too large to fit in memory, as well as transparently scalable R-, MATLAB- and SAS-style analytics that require no extra code for data distribution or parallel computation
• Versioning and provenance – data is updated but never overwritten; the raw data, the derived data, and the derivation are kept for reproducibility, what-if modeling, back-testing, and re-analysis
• Uncertainty support – data carry error bars, probability distributions or confidence metrics that can be propagated through calculations (see the sketch after this list)
• Smart storage – compact storage for both dense and sparse data that is efficient for location-based, time-series, and instrument data
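To make a couple of these ideas concrete without implying anything about SciDB’s actual syntax or API, here is a minimal Python/NumPy sketch, on made-up data, of sparse array storage and of carrying a per-cell error that is propagated through an aggregate (independent errors assumed):

```python
import numpy as np
from scipy import sparse

# Hypothetical instrument readings on a 10x10 grid: coordinates, values,
# and a per-cell standard error. Only four of the 100 cells are occupied.
rows = np.array([0, 3, 7, 7])
cols = np.array([2, 5, 1, 9])
vals = np.array([1.2, 3.4, 0.7, 5.0])
errs = np.array([0.1, 0.2, 0.05, 0.3])

# Sparse storage: keep only the occupied cells (cf. "smart storage" above).
values = sparse.coo_matrix((vals, (rows, cols)), shape=(10, 10))
print(f"{values.nnz} of {values.shape[0] * values.shape[1]} cells stored")

# A toy uncertainty-aware aggregate: sum the values and propagate the error
# as the root-sum-of-squares of the per-cell errors.
total = vals.sum()
total_err = np.sqrt((errs ** 2).sum())
print(f"sum = {total:.2f} +/- {total_err:.2f}")
```

A real array database also handles the chunking, distribution, versioning and provenance bookkeeping that this sketch ignores.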
We will sketch the design of SciDB, talk about how it differs from other proposals, and explain why that matters. We will also share some early benchmarking data and present a computational genomics use case that showcases SciDB’s massively scalable parallel analytics.
by Jim Falgout
Telecom network switches, network servers and other equipment generate and store large amounts of data every day. The data is mainly used for billing and network operations. If utilized fully, however, it could have an enormous impact on network operations and overall profitability.
Many communications service providers (CSPs) do not have the tools to mine this data quickly and deeply enough to realize its value. The tools in use are typically home-grown and do not scale. Valuable information is being lost: information that could be used to predict network issues rather than respond to them after the fact. The alternative of a full analytic database can be cost-prohibitive.
By applying big data tools and predictive analytics upstream of the database, CSPs can move from reactive to proactive use of the data. Network quality problems can be identified in minutes rather than days. By analyzing all the data, analytics tools can pinpoint root cause and suggest corrective actions. Finding and fixing these issues more quickly leads to higher call quality, more profitable service and increased customer satisfaction.
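The session stays at the level of outcomes, but a minimal sketch of the kind of upstream check it implies might look like the following: scan call-detail records as they stream past, before anything is loaded into a database, and raise an alert when the recent dropped-call rate spikes. The record fields, window size and threshold are all hypothetical:

```python
from collections import deque

WINDOW = 500   # recent records to keep (hypothetical)
SIGMA = 3.0    # alert when the short-term drop rate spikes this far above the mean

recent = deque(maxlen=WINDOW)

def check_record(record):
    """Inspect one call-detail record upstream of the database.

    `record` is assumed to be a dict with a boolean 'dropped' flag and a
    'cell_id'; both field names are illustrative, not a real CSP schema.
    Returns an alert string, or None.
    """
    recent.append(1.0 if record["dropped"] else 0.0)
    if len(recent) < WINDOW:
        return None                                   # not enough history yet
    mean = sum(recent) / len(recent)
    std = (sum((x - mean) ** 2 for x in recent) / len(recent)) ** 0.5 or 1e-9
    short_term = sum(list(recent)[-50:]) / 50.0       # drop rate over last 50 calls
    if short_term > mean + SIGMA * std:
        return f"possible network issue near cell {record['cell_id']}"
    return None
```

Because the check runs on the stream itself, an alert can fire within minutes of the first bad records rather than after the next load into an analytic database.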
This session is sponsored by Pervasive
22nd–23rd September 2011