by John Rauser
Quantitative Engineer? Business Intelligence Analyst? Data Scientist? The data deluge has come upon us so quickly that we don’t even know what to call ourselves, much less how to make a career of working with data. This talk examines the critical traits that lead to success by looking back to what may be the first act of data science.
MapReduce, Hadoop, and other “NoSQL” big data approaches open opportunities for data scientists in every industry to develop new data-driven applications for digital marketing optimization and social network analysis through the power of iterative, big data analysis. But what about the business user or analyst? How can they unlock insights through standard business intelligence (BI) tools or SQL access? The challenge with emerging big data technologies is finding staff with the specialized skill sets of the data scientist to implement and use these solutions. Business leaders and enterprise architects struggle to understand, implement, and integrate these big data technologies with their existing business processes and IT investments, and to deliver value to the business.
This session will explore a new class of analytic platforms and technologies such as SQL-MapReduce® which bring the science of data to the art of business. By fusing standard business intelligence and analytics with next-generation data processing techniques such as MapReduce, big data analysis is no longer just in the hands of the few data science or MapReduce specialists in an organization! You’ll learn how business users can easily access, explore, and iterate their analysis of big data to unlock deeper insights. See example applications with digital marketing optimization, fraud detection and prevention, social network and relationship analysis, and more.
This session is sponsored by Aster Data
This talk will address the question of how to enable a much more agile data provisioning model for business units and data scientists. We’re in a mode shift where data unlocks new growth, and almost every Fortune 1000 company is scrambling to architect a new platform to enable data to be stored, shared and analyzed for competitive advantage. Many companies are finding that this shift requires major rethinking of how systems should be architected (and scaled) to enable agile, self-service access to critical data.
In this session we’ll discuss strategies for building agile big-data clouds that make it much faster and easier for data scientists to discover, provision and analyze data. We’ll discuss where and how new technologies (both vendor and OSS) fit into this model.
We will also discuss changes in application architectures as big-data begins to play a role in online applications, incorporating many big-data techniques to deliver consumer-targeted content. This new “real-time” analytics category is growing fast and several new data systems are enabling this shift. We’ll review which players and technologies in the NoSQL community are helping drive this architecture.
Economists utilize a data analysis toolkit and intuition that can be very helpful to Data Scientists. In particular, econometric methods are quite useful in disentangling correlation and causation, a use case not well-handled by standard machine learning and statistical techniques. This session will cover examples of econometric methods in action, as well as other economics-related insights. Think of it as a crash-course in basic econometric intuition that one receives during a PhD in Economics (I received my PhD from Stanford in 2008).
Why econometrics? The difference between econometrics and statistics is that statistical modeling is more concerned with fit, while econometric modeling is more concerned with properly estimating the coefficients in a regression. Getting the “right” (consistent and unbiased) estimates means that the analyst can more effectively measure how strongly a change in one variable predicts (or causes) a change in the dependent variable. These techniques can help solve problems in social/web data that could previously be addressed only by collecting future data from randomized multivariate experiments.
To do this, the analyst first develops an intuition for whether there is a source of “endogeneity” in the regression, which is largely determined by the relationship between the predictors and the error term. Once the source of the endogeneity is understood, econometric techniques like fixed/random effects and instrumental variables can be quite useful. The type of data that is collected and available determines the extent to which the power of these techniques can be used. (Other techniques may also be covered, but these are the most useful.)
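As a minimal sketch of the endogeneity problem and the instrumental-variables fix described above, here is a hand-rolled two-stage least squares on synthetic data. The variable names and data-generating process are illustrative assumptions, not material from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Endogeneity: an unobserved confounder u drives both x and y,
# so x is correlated with the regression's error term.
z = rng.normal(size=n)                        # instrument: moves x, unrelated to u
u = rng.normal(size=n)                        # unobserved confounder
x = 0.8 * z + 0.5 * u + rng.normal(size=n)
y = 2.0 * x + 1.0 * u + rng.normal(size=n)    # true coefficient on x is 2.0

def ols(X, y):
    """Least-squares fit with an intercept; returns [intercept, slope]."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Naive OLS is biased upward because x carries the confounder u.
b_ols = ols(x, y)[1]

# Two-stage least squares: regress x on z, then y on the fitted x.
x_hat = np.column_stack([np.ones(n), z]) @ ols(z, x)
b_iv = ols(x_hat, y)[1]

print(f"OLS estimate: {b_ols:.2f}   2SLS estimate: {b_iv:.2f}")
```

The OLS slope lands well above the true value of 2.0, while the instrumented estimate recovers it, because the instrument isolates only the variation in x that is uncorrelated with the error term.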
The methods will be presented in a way so that a non-technical person can understand the basic intuition, and also so that a practitioner can apply the methods in the future. Examples will be provided. For panel data econometrics, we will discuss the example of how to identify actions taken early on by a LinkedIn member that are predictive of their future engagement with the product, a problem that is difficult due to the confounding of correlation and causation. For instrumental variables techniques, we will discuss how to use random variation in the weather to say cool things about politics, economics, and web usage.
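The panel-data case above can be sketched with a toy fixed-effects ("within") estimator. The member/engagement variables here are synthetic stand-ins under an assumed confounding structure, not LinkedIn data:

```python
import numpy as np

rng = np.random.default_rng(1)
n_members, n_periods = 500, 8

# Synthetic panel: each member has a fixed, unobserved "enthusiasm" level
# that drives both an early action and later engagement.
member_effect = rng.normal(size=n_members)
action = 0.6 * member_effect[:, None] + rng.normal(size=(n_members, n_periods))
engagement = (1.5 * action + member_effect[:, None]
              + rng.normal(size=(n_members, n_periods)))   # true slope is 1.5

def slope(x, y):
    """Simple regression slope of y on x."""
    x, y = x.ravel(), y.ravel()
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Pooled OLS is biased by the per-member effect.
b_pooled = slope(action, engagement)

# Fixed effects ("within" transformation): demean within each member,
# sweeping out anything constant per member -- including the confounder.
b_fe = slope(action - action.mean(axis=1, keepdims=True),
             engagement - engagement.mean(axis=1, keepdims=True))

print(f"pooled: {b_pooled:.2f}   fixed effects: {b_fe:.2f}")
```

Demeaning within each member removes every time-invariant confounder at once, which is why fixed effects are attractive when the data have repeated observations per unit.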
In addition to the discussion of applied econometric techniques, there may also be time for economics-related data insights. Currently we are developing unemployment rate prediction models using time-series econometrics as well as indexes to measure changes in the supply/demand for talent across regions and industries.
How do data infrastructure, insights and products change when your user base grows by orders of magnitude? When should you move your user-facing data product off your laptop? (hint: now!) Does your data offer insights about the world at large, or is it just mirroring your early adopters?
In this talk, I will share some of the data scaling lessons we’ve learned at LinkedIn, recount war stories (and close calls!) and document the evolution of the data scientist.
Data scientists and technology companies are rapidly recognizing the immense power of data for drawing insights about their impact and operations, yet NGOs and non-profits are increasingly being left behind with mounting data and few resources to make use of it. Data Without Borders seeks to bridge this data divide by matching underserved NGOs with pro bono data scientists so that they can collect, manage, and analyze their data together in the service of humanity, creating a more open environment for socially conscious data and bringing greater change to the world.
by Ben Gimpert
Most people in our community are accustomed to thinking of a “model” as the end result of a properly functioning big data architecture. Once you have an EC2 cluster reserved, after the database is distributed across some Hadoop nodes, and once a clever MapReduce machine learning algorithm has done its job, the system spits out a predictive model. The model hopefully allows an organization to conduct its business better.
This waterfall approach to modeling is embedded in the hiring process and technical culture of most contemporary big data organizations. When the business users sit in one room and the data scientists sit in another, we preclude one of the most important benefits of having on-demand access to big data. Models themselves are powerful exploratory tools! However, data sparsity, non-linear interactions and the resultant model’s quirks must be interpreted through the lens of domain expertise. All big data models are wrong but some are useful, to paraphrase the statistician George Box.
A data scientist working in isolation could train a predictive model with perfect in-sample accuracy, but only an understanding of how the business will use the model lets her balance the crucial bias/variance trade-off. Put more simply, applied business knowledge is what lets us trust that a model trained on historical data will perform decently in situations we have never seen.
Models can also reveal predictors in our data we never expected. The business can learn from the automatic ranking of predictor importance with statistical entropy and multicollinearity tools. In the extreme, a surprisingly important variable that turns up during the modeling of a big data set could be the trigger of an organizational pivot. What if a movie recommendation model reveals a strange variable for predicting gross at the box office?
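As a minimal sketch of entropy-based predictor ranking, here is an information-gain calculation on synthetic data. The box-office variables and the helper functions are illustrative assumptions, not the Altos Research pipeline:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Toy box-office data: one obvious predictor, one "strange" but real
# predictor, and one pure-noise column.
budget  = rng.normal(size=n)
weather = rng.normal(size=n)          # the surprising variable
noise   = rng.normal(size=n)
gross   = budget + 0.8 * weather + rng.normal(size=n)

def entropy(labels):
    """Shannon entropy (bits) of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(x, y, bins=10):
    """Reduction in the entropy of (decile-binned) y from knowing binned x."""
    cuts = np.linspace(0, 1, bins + 1)[1:-1]
    xb = np.digitize(x, np.quantile(x, cuts))
    yb = np.digitize(y, np.quantile(y, cuts))
    h_y = entropy(yb)
    h_y_given_x = sum((xb == v).mean() * entropy(yb[xb == v])
                      for v in np.unique(xb))
    return h_y - h_y_given_x

ranking = sorted(
    {"budget": budget, "weather": weather, "noise": noise}.items(),
    key=lambda kv: information_gain(kv[1], gross),
    reverse=True,
)
print([name for name, _ in ranking])
```

Ranking predictors this way is model-agnostic: a variable like the weather stand-in that scores unexpectedly high is exactly the kind of surprise that can feed back into the business, as the paragraph above suggests.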
My presentation introduces exploratory model feedback in the context of big (training) data. I will use a real-life case study from Altos Research that forecasts a complex system: real estate prices. Rapid prototyping with Ruby and an EC2 cluster allowed us to optimize human time, but not necessarily computing cycles. I will cover how exploratory model feedback blurs the line between domain expert and data scientist, and also blurs the distinction between supervised and unsupervised learning. This is all a data ecology, in which a model of big data can surprise us and suggest its own future enhancement.
22nd–23rd September 2011