by John Rauser
Quantitative Engineer? Business Intelligence Analyst? Data Scientist? The data deluge has come upon us so quickly that we don’t even know what to call ourselves, much less how to make a career of working with data. This talk examines the critical traits that lead to success by looking back to what may be the first act of data science.
Whether you believe the hype around Big Data or not, the amount of information accruing throughout large organizations grows every day. And it’s not simply a question of volume; of equal concern is the variety of data. There are emails, IMs, tweets, Facebook updates and the fastest-growing category of data: video. This variety makes it difficult to generate an apples-to-apples comparison of data from a single individual or entity. Combine this with the fact that experts think there is no such thing as ‘clean’ data, and you have a growing problem.
This is why it is better to focus on understanding digital character. As with individuals, electronic data has ‘character.’ That character helps to disambiguate the relationship between one piece of data and another. This is particularly important because communication is more fragmented than ever, which makes relevance more difficult to ascertain.
Digital character is similar to individual character in the real world; particularly in the sense that character emerges over time. Does one embarrassing photo or comment on Facebook define an individual’s lifetime character? Can’t everyone recollect an email they wish they had never sent? Just as in the real world, digital character requires a large enough body of work to make an accurate character judgment.
Elizabeth Charnock, CEO of Cataphora and author of E-Habits, will discuss the pitfalls of Bad Data, and how it manifests itself in the interaction between a male stripper and a Harvard professor.
‘Crowdsourcing big data’ might sound like a randomly generated selection of buzz words, but it turns out to represent a powerful leap forward in the accuracy of predictive analytics. As companies and researchers are fast discovering, data prediction competitions provide a unique opportunity for advancing the state of the art in fields as diverse as astronomy, health care, insurance pricing, sports ratings systems and tourism forecasting. This session will focus not simply on the mechanics of data prediction competitions, but on why they work so effectively. As it turns out, the ‘why’ boils down to a couple of simple propositions, one associated with Archimedes and the other with record-breaking miler Roger Bannister. Those propositions are not unique to the world of data science, but, as this session will show, have a particularly compelling application to it.
This talk will address the question of how to enable a much more agile data provisioning model for business units and data scientists. We’re in a mode shift where data unlocks new growth, and almost every Fortune 1000 company is scrambling to architect a new platform to enable data to be stored, shared and analyzed for competitive advantage. Many companies are finding that this shift requires major rethinking of how systems should be architected (and scaled) to enable agile, self-service access to critical data.
In this session we’ll discuss strategies for building agile big-data clouds that make it much faster and easier for data scientists to discover, provision and analyze data. We’ll discuss where and how new technologies (both vendor and OSS) fit into this model.
We will also discuss changes in application architectures as big-data begins to play a role in online applications, incorporating many big-data techniques to deliver consumer-targeted content. This new “real-time” analytics category is growing fast and several new data systems are enabling this shift. We’ll review which players and technologies in the NoSQL community are helping drive this architecture.
How do data infrastructure, insights and products change when your user base grows by orders of magnitude? When should you move your user-facing data product off your laptop? (hint: now!) Does your data offer insights about the world at large, or is it just mirroring your early adopters?
In this talk, I will share some of the data scaling lessons we’ve learned at LinkedIn, recount war stories (and close calls!) and document the evolution of the data scientist.
by Paul Brown
Scientists have dealt with big data and big analytics for at least a decade before the business world came to realize they had the same problems, and united around ‘Big Data’, the ‘Data Tsunami’ and ‘the Industrial Revolution of data’. Both the science world and the commercial world share requirements for a high performance informatics platform to support collection, curation, collaboration, exploration, and analysis of massive datasets.
Neither conventional relational database management systems nor Hadoop-based systems readily meet all the workflow, data management and analytical requirements of either community. They have the wrong data model—tables or files—or no data model at all. They require excessive data movement and data reformatting for advanced analytics. And they are missing key features, such as provenance.
SciDB is an emerging open source analytical database that runs on a commodity hardware grid or in the cloud. SciDB natively supports:
• An array data model – a flexible, compact, extensible data model for rich, high-dimensional data
• Massively scalable math – non-embarrassingly parallel operations, such as linear algebra on matrices too large to fit in memory, as well as transparently scalable R, MatLab, and SAS style analytics without requiring code for data distribution or parallel computation
• Versioning and Provenance – Data is updated, but never overwritten. The raw data, the derived data, and the derivation are kept for reproducibility, what-if modeling, back-testing, and re-analysis
• Uncertainty support – data carry error bars, probability distribution or confidence metrics that can be propagated through calculations
• Smart storage – compact storage for both dense and sparse data that is efficient for location-based, time-series, and instrument data
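The uncertainty-support bullet lends itself to a small illustration. The following stand-alone Python sketch is not SciDB code and its class name is hypothetical; it only shows the standard independent-error propagation that such a feature implies, where error bars travel through arithmetic alongside the values:

```python
import math

class Uncertain:
    """A value with a standard error that propagates through arithmetic,
    under the usual independent-error approximation."""
    def __init__(self, value, err):
        self.value = value
        self.err = err

    def __add__(self, other):
        # Errors of independent terms add in quadrature.
        return Uncertain(self.value + other.value,
                         math.hypot(self.err, other.err))

    def __mul__(self, other):
        # Relative errors add in quadrature for products.
        v = self.value * other.value
        rel = math.hypot(self.err / self.value, other.err / other.value)
        return Uncertain(v, abs(v) * rel)

    def __repr__(self):
        return f"{self.value:.3g} ± {self.err:.2g}"

a = Uncertain(10.0, 0.5)
b = Uncertain(4.0, 0.2)
print(a + b)  # error grows in quadrature: sqrt(0.5^2 + 0.2^2)
print(a * b)  # relative errors combine
```

In a system like SciDB this bookkeeping would happen inside the engine rather than in user code, which is the point of the bullet above.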
We will sketch the design of SciDB, talk about how it differs from other proposals and why that matters, present some early benchmarking data, and walk through a computational genomics use case that showcases SciDB’s massively scalable parallel analytics.
by Jim Falgout
Telecom network switches, network servers and other equipment generate and store large amounts of data every day. The data is mainly used for billing and network operations. If utilized fully, this data can have an enormous impact on network operations and overall profitability.
Many communications service providers (CSPs) do not have the tools to mine this data quickly and deeply enough to realize its value. The tools in use are usually home-grown and not scalable. Valuable information is being lost—information that could be used to predict network issues rather than respond to them after the fact. The alternative of a full analytic database can be cost-prohibitive.
By applying big data tools and predictive analytics upstream of the database, CSPs can move from reactive to proactive use of the data. Network quality problems can be identified in minutes rather than days. By analyzing all the data, analytics tools can pinpoint root cause and suggest corrective actions. Finding and fixing these issues more quickly leads to higher call quality, more profitable service and increased customer satisfaction.
This session is sponsored by Pervasive
by Ken Bado
Big Data is more than just volume and velocity. MarkLogic CEO Ken Bado will address why complexity is the key gotcha for organizations trying to outflank their competition by managing Big Data in real time. Learn how winners today are using MarkLogic to manage the complexity of their unstructured information to drive revenue and results.
by Hilary Mason
The flow of data across the social web tells us what people, around the world, are paying attention to at any given moment. Understanding this flow is both a mathematical and a human problem, as we develop and adapt techniques to find stories in the data.
Come hear about the expected and the surprises in the bitly data, as well as generalized techniques that apply to any ‘realtime’ data system.
by Arnab Gupta
In 1964, The Twilight Zone aired an episode titled “The Brain Center at Whipple’s,” in which factory owner Wallace Whipple completely eliminates his human workforce in favor of automated machinery. Mr. Whipple’s employees, clearly far ahead of their time, argue to him that human insights far outweigh the advantages provided by mechanical labor. Ironically, at the end of the episode, Mr. Whipple, too, is replaced by a machine.
It’s a well-known dichotomy: man versus machine—and, depending on who’s doing the talking, good (human) versus evil (machine). Today, as technology continues to evolve and machines are capable of ever more advanced processes and functions, the dichotomy is becoming even more pronounced. Look no further than IBM’s Watson, an advanced artificial intelligence machine that squared off against Jeopardy’s best human contestants in 2011—and won.
But, as Opera Solutions’ CEO Arnab Gupta proposes to explore in remarks at Strata, the man-vs.-machine dichotomy is a false one. A far better contest would have been a three-way one, pitting man versus machine versus man-plus-machine. It is almost a certainty that the latter combination would have won.
Consider: nowhere has the machine-vs.-human conflict been played out more fully than in the realm of chess, starting in 1997 with IBM’s Deep Blue vs. Garry Kasparov. Today, chess-playing computers routinely beat the strongest human players. One might conclude that the machines have won. But there’s a twist: as Kasparov has recently stated, a machine plus just an average player can beat all comers, human or computer. Humans’ ability to think abstractly and creatively, to bring in new ideas, to apply history, to understand irony, opportunity, possibilities—all this, when paired with machines’ ability to process huge data flows and bring to light hidden patterns and connections that elude human understanding, makes the machine/mind connection unbeatable.
In short, it is not humans vs. machines, but rather humans plus machines, which must become the new paradigm for scientists, business people, and others—particularly in the Big Data era. Combining human insight with machine intelligence overcomes the weaknesses of each while delivering never-before-seen strengths.
How can this be accomplished, particularly when machines and people speak different languages and, in truth, “think” differently? How can we create and foster a productive pairing of two very different types of “minds?” Arnab will address the need to create a new language—one mostly visual in nature— to allow humans and machines to work together and realize the full potential of their collaboration. Finding a common language is a pursuit that goes far beyond prosaic “UI” development, and instead forces us to examine how humans can (and might learn to) best understand what machines are saying.
Information technology has been meeting disaster head-on: new software, crowdsourced inputs, and mapping tools have gained incredible potential since the Haiti earthquake. But how big data fits into disaster response, from both the humanitarian relief and business continuity sides, has yet to mature. I will discuss needs (filtering, interfaces, real-time data processing) specific to the unique sociological and extreme environment constraints of professional disaster response, as well as untapped potential for business continuity.
by Ron Avnur and Mark Rodgers
Ron Avnur, SVP Engineering, MarkLogic, and Mark Rodgers, Sr. Director of Product Engineering, LexisNexis will reveal how LexisNexis is rebuilding its business platform to handle Big Data in real-time. LexisNexis is renowned for the technical solutions it has been building for 40+ years. It is well aware of the challenges of Big Data as it has gathered a huge amount of content. Avnur will explain how Big Data and unstructured information is slowly overtaking organizations. Rodgers will discuss the challenges LexisNexis faced as a global organization that was building new products to remain on the cutting edge of Big Data. Together, Avnur and Rodgers will give a brief overview of the technical implementation that enabled LexisNexis to address those challenges. Finally, Rodgers will detail the business benefits LexisNexis is experiencing as a result of its new Big Data business platform.
This session is sponsored by MarkLogic
by Ted Dunning
Map-reduce and Hadoop provide new scaling opportunities for analyzing data. As a result, organizations are beginning to analyze and derive business value from large amounts of data that, in many cases, were previously simply discarded. In some cases, the ability to analyze these previously impenetrable volumes of data has disrupted entire industries, as with on-line advertising.
Such green-field opportunities are rare, however, and few companies can afford to build an entirely new analytics pipeline. Integrating big data analytics systems like Apache Hadoop into existing analytics systems can be very difficult, because there are huge differences in the fundamental approaches taken to the basic problems of how data should be accessed and analyzed.
These differences are exactly what makes these new technologies hugely effective, but they are also what makes integration between conventional and new approaches so difficult.
This talk will provide detailed descriptions of how to use these new technologies alongside conventional analytics systems. The descriptions will be taken from real-life customer situations; each will describe the problems faced and the solutions that solved them.
This session is sponsored by MapR Technologies
by Robert Munro
Pandemics are the greatest current threat to humanity. Many unidentified pathogens are already hiding out in the open, reported in local online media as sudden clusters of ‘influenza-like’ or ‘pneumonia-like’ clinical cases many months or even years before careful lab tests confirm a new microbial scourge. For some current epidemics like HIV, SARS, and H1N1, the microbial enemies were anonymously in our midst for decades. With each new infection, viruses and bacteria mutate and evolve into ever more harmful strains, and so we are in a race to identify and isolate new pathogens as quickly as possible.
Until now, no organization has succeeded in the task of tracking every global outbreak and epidemic. The necessary information is spread across too many locations, languages and formats: a field report in Spanish, a news article in Chinese, an email in Arabic, a text-message in Swahili. Even among open data, simple key-word or white-list based searches tend to fall short as they are unable to separate the signal (an outbreak of influenza) from the noise (a new flu remedy). In a project called EpidemicIQ, the Global Viral Forecasting Initiative has taken on the challenge of tracking all outbreaks. We are complementing existing field surveillance efforts in 23 countries with a new initiative that leverages large-scale processing of outbreak reports across a myriad of formats, utilizing machine learning, natural language processing and microtasking coupled with advanced epidemiological analysis.
EpidemicIQ intelligently mines open web-based reports, social media, transportation networks and direct reports from healthcare providers globally. Machine-learning and natural language processing allows us to track epidemic-related information across several orders of magnitude more data than any prior health efforts, even across languages that we do not ourselves speak. By leveraging a scalable workforce of microtaskers we are able to quickly adapt our machine-learning models to new sources, languages and even diseases of unknown origin. During peak times, the use of a scalable microtasking workforce also takes much of the information processing burden off the professional epidemic intelligence officers and field scientists, allowing them to apply their full domain knowledge when needed most.
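As a toy illustration of why a learned classifier beats the keyword searches criticized above, consider the following sketch. This is not EpidemicIQ’s actual pipeline; the training sentences are invented and the model is a plain multinomial naive Bayes with add-one smoothing. Both test sentences mention the flu, so a keyword white-list cannot tell them apart, but the trained model can:

```python
import math
from collections import Counter

# Hypothetical labelled reports: "outbreak" is signal, "noise" is not.
train = [
    ("cluster of influenza like illness reported in village", "outbreak"),
    ("hospital reports surge of pneumonia cases", "outbreak"),
    ("pharmacy launches new flu remedy", "noise"),
    ("company advertises flu season vitamin sale", "noise"),
]

def tokenize(text):
    return text.lower().split()

# Per-class word counts for multinomial naive Bayes.
counts = {"outbreak": Counter(), "noise": Counter()}
docs = Counter()
for text, label in train:
    counts[label].update(tokenize(text))
    docs[label] += 1

vocab = set(w for c in counts.values() for w in c)

def classify(text):
    scores = {}
    for label, c in counts.items():
        total = sum(c.values())
        score = math.log(docs[label] / len(train))  # class prior
        for w in tokenize(text):
            # Add-one (Laplace) smoothing handles unseen words.
            score += math.log((c[w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("sudden cluster of influenza cases in district"))  # outbreak
print(classify("new flu remedy on sale"))                         # noise
```

A real deployment would of course train on far more data, across languages, with microtasker-supplied labels, but the separation of signal from noise works on the same principle.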
At Strata, we propose to introduce EpidemicIQ’s architecture, strategies, successes and challenges in big-data to date.
by Tim Moreton
At the heart of every system that harnesses big data is a pipeline that comprises collecting large volumes of raw data, extracting value from it through analytics or data transformations, and then delivering that condensed set of results back out—potentially to millions of users.
This talk examines the challenges of building manageable, robust pipelines and offers a simplifying paradigm for participants looking to architect their own big data systems.
I’ll look at what you want from each of these stages—using Google Analytics as a canonical big data example, as well as case studies of systems deployed at LinkedIn. I’ll look at how collecting, analyzing and serving data pose conflicting demands on the storage and compute components of the underlying hardware. I’ll talk about what available tools do to address these challenges.
I’ll move on to consider two holy grails: real-time analytics, and dual data center support. The pipeline metaphor highlights a challenge in deriving real-time value from huge datasets: I’ll explore what happens when you compose multiple, segregated platforms into a single pipeline, and how you can dodge the issue with a ‘fast’ and ‘slow’ two-tier architecture. Then I’ll look at how you can figure dual data center support into the design, particularly important for highly available deployments on the cloud.
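The ‘fast’/‘slow’ two-tier idea above can be sketched in a few lines. This is a hypothetical stand-alone illustration, not code from any system in the talk: a slow batch tier periodically recomputes exact results over the full event log, while a fast tier absorbs events that arrived since the last batch run, and queries merge the two views:

```python
from collections import Counter

class TwoTierCounter:
    """Hypothetical sketch of a two-tier ('fast'/'slow') analytics pipeline."""
    def __init__(self):
        self.batch_view = Counter()     # slow tier: rebuilt periodically
        self.realtime_view = Counter()  # fast tier: incremental updates
        self.log = []                   # append-only master log of raw events

    def ingest(self, event):
        self.log.append(event)
        self.realtime_view[event] += 1  # fast path: visible immediately

    def run_batch(self):
        # Slow path: recompute from the full log, then discard the
        # now-redundant realtime increments.
        self.batch_view = Counter(self.log)
        self.realtime_view.clear()

    def query(self, key):
        # The serving layer merges both tiers.
        return self.batch_view[key] + self.realtime_view[key]

c = TwoTierCounter()
for e in ["click", "view", "click"]:
    c.ingest(e)
print(c.query("click"))  # 2, served entirely from the fast tier
c.run_batch()
c.ingest("click")
print(c.query("click"))  # 3, merged across both tiers
```

The design choice this illustrates is that the fast tier only ever has to hold the delta since the last batch run, which is what makes real-time answers over huge datasets tractable.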
In summary, this talk will present a useful metaphor for architecting big data systems and, using deployed examples, describe how to fit together the available tools for a range of settings.
by Steven Hillion
Do you use all the information you should when you make your most important decisions? Is your organization prepared to go beyond BI to enable breakthrough insights and decisions that transform the way you do business?
Increasingly, organizations realize that data-intensive predictive analytics is a necessary tool for a company to compete and succeed – even if the organization has already deployed a full-blown BI and DW stack. Armed with advanced analytics insights, business users can make well-informed decisions to support their organizations’ tactical and strategic goals – and create competitive advantage.
Steven Hillion, VP of EMC Greenplum’s Data Analytics Lab, offers insight into emerging technologies for taking advantage of the big data opportunity, and into how big data challenges today’s BI architectures and approaches to data management.
This session is sponsored by EMC Greenplum
by Peter Sirota
By pairing the elasticity and pay-as-you-go nature of the cloud with the flexibility and scalability of Hadoop, Amazon Elastic MapReduce has brought Big Data analytics to an even wider array of companies looking to maximize the value of their data. Each day, thousands of Hadoop clusters are run on the Amazon Elastic MapReduce infrastructure by users of every size—from University students to Fortune 50 companies—exposing the Elastic MapReduce team to an unparalleled number of use cases. In this session, we will contrast how three of these users, Amazon.com, Yelp, and Etsy, leverage the marriage of Hadoop and the cloud to drive their businesses in the face of explosive growth, including generating customer insights, powering recommendations, and managing core operations.
by Ben Gimpert
Most people in our community are accustomed to thinking of a “model” as the end result of a properly functioning big data architecture. Once you have an EC2 cluster reserved, after the database is distributed across some Hadoop nodes, and once a clever MapReduce machine learning algorithm has done its job, the system spits out a predictive model. The model hopefully allows an organization to conduct its business better.
This waterfall approach to modeling is embedded in the hiring process and technical culture of most contemporary big data organizations. When the business users sit in one room and the data scientists sit in another, we preclude one of the most important benefits of having on-demand access to big data. Models themselves are powerful exploratory tools! However, data sparsity, non-linear interactions and the resultant model’s quirks must be interpreted through the lens of domain expertise. All big data models are wrong but some are useful, to paraphrase the statistician George Box.
A data scientist working in isolation could train a predictive model with perfect in-sample accuracy, but only an understanding of how the business will use the model lets her balance the crucial bias / variance trade-off. Put more simply, applied business knowledge is how we can assume a model trained on historical data will do decently with situations we have never seen.
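The in-sample trap described above can be made concrete with a toy sketch (hypothetical data, standard library only): a model that simply memorizes its training rows scores perfectly in-sample yet generalizes badly, while a simple biased model that captures the real pattern does far better out of sample:

```python
import random

random.seed(0)

# Hypothetical data: the label is 1 exactly when the feature exceeds 0.5.
def make_rows(n):
    rows = []
    for _ in range(n):
        x = round(random.random(), 3)
        rows.append((x, 1 if x > 0.5 else 0))
    return rows

train = make_rows(100)
test = make_rows(100)

# A "perfect" in-sample model: memorize every training row.
lookup = dict(train)
def memorizer(x):
    return lookup.get(x, 0)  # unseen inputs fall back to a blind guess

# A simple, biased model: a single threshold.
def threshold_model(x):
    return 1 if x > 0.5 else 0

def accuracy(model, rows):
    return sum(model(x) == y for x, y in rows) / len(rows)

print(accuracy(memorizer, train))       # 1.0: perfect in-sample accuracy
print(accuracy(memorizer, test))        # poor: memorization doesn't generalize
print(accuracy(threshold_model, test))  # strong: captures the real pattern
```

This is the bias/variance trade-off in miniature: business knowledge about which patterns are real is what justifies preferring the simpler model.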
Models can also reveal predictors in our data we never expected. The business can learn from the automatic ranking of predictor importance with statistical entropy and multicollinearity tools. In the extreme, a surprisingly important variable that turns up during the modeling of a big data set could be the trigger of an organizational pivot. What if a movie recommendation model reveals a strange variable for predicting gross at the box office?
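Ranking predictor importance “with statistical entropy,” as mentioned above, usually means computing information gain. A minimal stdlib-only sketch follows; the movie rows are invented for illustration and are not from the Altos Research case study:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, feature, target):
    """Reduction in target entropy from splitting rows on a feature."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in set(r[feature] for r in rows):
        subset = [r[target] for r in rows if r[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical rows: does a movie gross well at the box office?
rows = [
    {"genre": "action", "sequel": "yes", "hit": "yes"},
    {"genre": "action", "sequel": "no",  "hit": "yes"},
    {"genre": "drama",  "sequel": "yes", "hit": "yes"},
    {"genre": "drama",  "sequel": "no",  "hit": "no"},
    {"genre": "comedy", "sequel": "yes", "hit": "yes"},
    {"genre": "comedy", "sequel": "no",  "hit": "no"},
]

# Rank predictors by information gain; a surprisingly strong variable
# surfacing here is exactly the kind of signal the text describes.
for feature in ("genre", "sequel"):
    print(feature, round(information_gain(rows, feature, "hit"), 3))
```

In this toy data the `sequel` flag carries more information about `hit` than `genre` does, which is the kind of unexpected ranking that could trigger the “organizational pivot” mentioned above.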
My presentation introduces exploratory model feedback in the context of big (training) data. I will use a real-life case study from Altos Research that forecasts a complex system: real estate prices. Rapid prototyping with Ruby and an EC2 cluster allowed us to optimize human time, but not necessarily computing cycles. I will cover how exploratory model feedback blurs the line between domain expert and data scientist, and also blurs the distinction between supervised and unsupervised learning. This is all a data ecology, in which a model of big data can surprise us and suggest its own future enhancement.
Many enterprises are being overwhelmed by the proliferation of machine data. Websites, communications, networking and complex IT infrastructures constantly generate massive streams of data in highly variable and unpredictable formats that are difficult to process and analyze by traditional methods or in a timely manner. Yet this data holds a definitive record of all activity and behavior, including user transactions, customer behavior, system behavior, security threats and fraudulent activity. Quickly understanding and using this data can add value to a company’s services, customer satisfaction, revenue growth and profitability. This session examines the challenges and approaches for collecting, organizing and deriving real-time insights from terabytes to petabytes of data, with examples from Salesforce.com, the nation’s leading enterprise cloud computing company.
by Betsy Masiello, Jane Yakowitz and Solon Barocas
Analytics can push the frontier of knowledge well beyond the useful facts that already reside in big data, revealing latent correlations that empower organizations to make statistically motivated guesses—inferences—about the character, attributes, and future actions of their stakeholders and the groups to which they belong.
This is cause for both celebration and caution. Analytic insights can add to the stock of scientific and social scientific knowledge, significantly improve decision-making in both the public and private sector, and greatly enhance individual self-knowledge and understanding. They can even lead to entirely new classes of goods and services, providing value to institutions and individuals alike. But they also invite new applications of data that involve serious hazards.
This panel considers these hazards, asking how analytics implicate:
The panel will also debate the appropriate response to these issues, reviewing the place of norms, policies, legal frameworks, regulation, and technology.
22nd–23rd September 2011