Strata 2012 schedule

Tuesday 28th February 2012

  • Big Data Without the Heavy Lifting

    by Chris Deptula and James Dixon

    The big data world is extremely chaotic because the technology is still in its infancy. Learn how to tame this chaos, integrate big data within your existing data environments (RDBMS, analytic databases, applications), manage the workflow, orchestrate jobs, improve productivity and make big data technologies accessible to a much wider spectrum of developers, analysts and data scientists. Learn how you can leverage Hadoop and NoSQL stores via an intuitive, graphical big data IDE, eliminating the need for deep developer skills such as Hadoop MapReduce, Pig scripting, or NoSQL queries.

    At 9:00am to 12:30pm, Tuesday 28th February

    In Ballroom F, Santa Clara Convention Center

  • Designing Data Visualizations Workshop

    by Noah Iliinsky

    We will discuss how to figure out what story to tell, select the right data, and pick appropriate layout and encodings. The goal is to learn how to create a visualization that conveys appropriate knowledge to a specific audience (which may include the designer).

    We’ll briefly discuss tools, including pencil and paper. No prior technology or graphic design experience is necessary. An awareness of some basic user-centered design concepts will be helpful.

    Understanding of your specific data or data types will help immensely. Please do bring data sets to play with.

    At 9:00am to 12:30pm, Tuesday 28th February

    In GA J, Santa Clara Convention Center

  • Hadoop Data Warehousing with Hive

    by Dean Wampler and Jason Rutherglen

    In this hands-on tutorial, you’ll learn how to install and use Hive for Hadoop-based data warehousing. You’ll also learn some tricks of the trade and how to handle known issues.

    Using the Hive Tutorial Tools

    We’ll email instructions to you before the tutorial so you can come prepared with the necessary tools installed and ready to go. This prior preparation will let us use the whole tutorial time to learn Hive’s query language and other important topics. At the beginning of the tutorial we’ll show you how to use these tools.

    Writing Hive Queries

    We’ll spend most of the tutorial using a series of hands-on exercises with actual Hive queries, so you can learn by doing. We’ll go over all the main features of Hive’s query language, HiveQL, and how Hive works with data in Hadoop.

    Advanced Techniques

    Hive is very flexible about the formats of data files, the “schema” of records and so forth. We’ll discuss options for customizing these and other aspects of your Hive and data cluster setup. We’ll briefly examine how you can write Java user defined functions (UDFs) and other plugins that extend Hive for data formats that aren’t supported natively.

    Hive in the Hadoop Ecosystem

    We’ll conclude with a discussion of Hive’s place in the Hadoop ecosystem, such as how it compares to other available tools. We’ll discuss installation and configuration issues that ensure the best performance and ease of use in a real production cluster. In particular, we’ll discuss how to create Hive’s separate “metadata” store in a traditional relational database, such as MySQL. We’ll offer tips on data formats and layouts that improve performance in various scenarios.

    At 9:00am to 12:30pm, Tuesday 28th February

  • Introduction to Apache Hadoop

    by Sarah Sproehnle

    This tutorial provides a solid foundation for those seeking to understand large scale data processing with MapReduce and Hadoop, plus its associated ecosystem. This session is intended for those who are new to Hadoop and are seeking to understand where Hadoop is appropriate and how it fits with existing systems.

    The agenda will include:

    • The rationale for Hadoop
    • Understanding the Hadoop Distributed File System (HDFS) and MapReduce
    • Common Hadoop use cases including recommendation engines, ETL, time-series analysis and more
    • How Hadoop integrates with other systems like Relational Databases and Data Warehouses
    • Overview of the other components in a typical Hadoop “stack” such as these Apache projects: Hive, Pig, HBase, Sqoop, Flume and Oozie
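
    As a rough illustration of the MapReduce model named in the agenda above, here is a minimal single-process sketch in Python. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample documents are hypothetical, chosen only to show how the map, shuffle and reduce steps fit together.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    # Hypothetical sample input; a real job would read its input from HDFS.
    print(word_count(["big data big clusters", "data pipelines"]))
```

    In a real cluster the map and reduce functions run on many machines and the framework performs the shuffle; this sketch only mirrors the data flow.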

    At 9:00am to 12:30pm, Tuesday 28th February

    In Ballroom CD, Santa Clara Convention Center

  • Introduction to R for Data Mining

    by Joseph B Rickert

    This tutorial will enable anyone with some programming experience to begin analyzing data with the R programming language. Topics include:

    • Where did R come from?
    • What makes R different from other statistical software?
    • Data structures in R
    • Reading and writing data sets
    • Manipulating Data
    • Basic statistics in R
    • Exploratory Data Analysis
    • Multiple Regression
    • Logistic Regression
    • Data mining in R
    • Cluster analysis
    • Classification algorithms
    • Working with Big Data
    • Challenges
    • Extensions to R for big data
    • Where to go from here?
    • The R community
    • Resources for learning R
    • Getting help

    At 9:00am to 12:30pm, Tuesday 28th February

    In Ballroom G, Santa Clara Convention Center

  • Jumpstart Welcome

    by Alistair Croll

    Opening remarks by Program Chair, Alistair Croll, Founder, Bitcurrent

    At 9:00am to 9:20am, Tuesday 28th February

    In GA K, Santa Clara Convention Center

  • Large scale web mining

    by Ken Krugler

    This tutorial will teach attendees about the key aspects of scalable web mining, via six modules:

    1. Introduction

    • Why web data is valuable
    • Key challenges to web crawling
    • Realistic definitions for success

    2. Focused Web Crawling

    • Reducing time & cost by focusing the crawl
    • Approaches to classifying and scoring pages
    • Solutions for scalable web crawling

    3. Structured Data Extraction

    • Data mining essentials
    • Structured text extraction
    • Automated vs. manual extraction

    4. Analyzing the Data

    • Making it searchable
    • Finding "interesting" text
    • Machine learning with Mahout

    5. Barriers to Success

    • Polite crawling versus deep crawling
    • Spam, splog, honeypots and nasty webmasters
    • Ajax, robots.txt and Facebook

    6. Examples and Summary

    • Hotel reviews
    • Music pages
    • SEO analysis
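
    The focused-crawling idea in module 2 above (score pages, then crawl the most promising ones first) can be sketched with a priority queue. This is only an illustration under stated assumptions: the keyword-overlap score, the Frontier class and the example URLs are hypothetical and are not taken from the tutorial.

```python
import heapq


def score_page(text, keywords):
    """Hypothetical relevance score: fraction of keywords present in the text."""
    if not keywords:
        return 0.0
    words = set(text.lower().split())
    return sum(1 for keyword in keywords if keyword in words) / len(keywords)


class Frontier:
    """Priority queue of URLs; the highest-scoring URL is returned first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker so equal scores pop in insertion order

    def add(self, url, score):
        """Queue a URL once; return False if it was already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1
        return True

    def pop(self):
        """Remove and return the (url, score) pair with the highest score."""
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

    A crawler built on this would score each fetched page, add its outgoing links with that score, and stop when the frontier is empty or a page budget is reached.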

    At 9:00am to 12:30pm, Tuesday 28th February

    In Ballroom E, Santa Clara Convention Center

  • SQL and NoSQL Are Two Sides Of The Same Coin

    by Michael Rys

    Contrary to popular belief, SQL and NoSQL are not at odds with each other; they are duals. In fact, NoSQL should really be called coSQL. Recognizing this duality can change the way we think about which technology to use when, and what we need to invest in next.

    At 9:00am to 9:45am, Tuesday 28th February

    In Ballroom AB, Santa Clara Convention Center

  • What Marketers Can Learn From Analysts

    by Avinash Kaushik

    Author and digital marketing evangelist Avinash Kaushik shares his perspective, drawing from experience with some of the world’s largest online marketers, and looks at how an analyst mentality is quickly permeating all aspects of business and marketing.

    At 9:20am to 10:00am, Tuesday 28th February

    In GA K, Santa Clara Convention Center

  • From Knowing ‘What’ To Understanding ‘Why’

    by Claudia Perlich

    With the collection of almost every piece of information about your customers comes the ability to start asking your data the right question: Why do they do what they do? And even more: what would they do if I could interact with them? We show, for the case of online display advertising, how causal analysis gives interesting new answers about the right (and wrong) ways of spending your money.

    At 9:45am to 10:30am, Tuesday 28th February

    In Ballroom AB, Santa Clara Convention Center

  • Big Data and Supply Chain Management: Evolution or Disruptive Force?

    by Pervinder Johar, Lora Cecere, Marilyn Craig and Terence Craig

    The effect of big data on all business models cannot be denied. This panel of SCM experts looks at how businesses are using, or should be using, big data in supply chain management. It focuses on the broader manufacturing issues that must be addressed, as well as practical tips for dealing with supply chains that now span the globe.

    At 10:00am to 10:30am, Tuesday 28th February

    In GA K, Santa Clara Convention Center

  • Ammunition for the CFO: How to be a Hard-Nosed Business Customer for Analytics

    by JC Herz

    This presentation lays out some clear, concrete gating conditions for when it makes sense to pull the trigger on big data initiatives, and how they should be procured, depending on the use case, the data assets, and the resources available.

    At 11:00am to 11:25am, Tuesday 28th February

    In GA K, Santa Clara Convention Center

  • The Model and the Train Wreck: A Training Data How-to

    by Monica Rogati

    Getting training data for a recommender system is easy: if users clicked it, it’s a positive – if they didn’t, it’s a negative. … Or is it? In this talk, we use examples from production recommender systems to bring training data to the forefront: from overcoming presentation bias to the art of crowdsourcing subjective judgments to creative data exhaust exploitation and feature creation.

    At 11:00am to 11:30am, Tuesday 28th February

    In Ballroom AB, Santa Clara Convention Center

  • 3 Essential Skills of a Data Driven CEO

    by Diego Saenz

    What are the fundamental skills that a CEO needs to become “Data Driven”? In this session we will discuss the 3 essential skills that will enable CEOs to effectively lead their organizations into the Data Revolution. These organizations will harness the power of data to innovate, grow profits and beat the competition.

    At 11:25am to 11:50am, Tuesday 28th February

    In GA K, Santa Clara Convention Center

  • Corpus Bootstrapping with NLTK

    by Jacob Perkins

    Learn various ways to bootstrap a custom corpus for training highly accurate natural language processing models. Real world examples will be presented with Python code samples using NLTK. Each example will show you how, starting from scratch, you can rapidly produce a highly accurate custom corpus for training the kinds of natural language processing models you need.

    At 11:30am to 12:00pm, Tuesday 28th February

    In Ballroom AB, Santa Clara Convention Center

  • Big Data: What are Enterprises Really Thinking About?

    by Vanessa Alvarez

    Despite the hype, enterprises are still wrapping their arms around the large amounts of data they’re sitting on, and how to leverage it. In this session, we’ll look at a snapshot of how enterprises are thinking about their big data strategies, and what it means to their top line.

    At 11:50am to 12:00pm, Tuesday 28th February

    In GA K, Santa Clara Convention Center

  • Business Intelligence: What Have We Been Missing?

    by Josh Gold and Felix Hamilton

    There are many rapidly evolving technologies that provide objective metrics and analytics for most outward facing business interactions. The evolution of similar inward facing tools has not kept pace. In this presentation we discuss which sources of internal organizational data are frequently neglected, approaches for automating data collection, and what valuable insights can result from analysis.

    At 12:00pm to 12:30pm, Tuesday 28th February

    In GA K, Santa Clara Convention Center

  • The Importance of Importance: An Introduction to Feature Selection

    by Ben Gimpert

    Twenty-first century big data is being used to train predictive models of emotional sentiment, customer churn, patient health, and other behavioral complexities. Variable importance and feature selection reduce the dimensionality of our models, so an unfeasible and complex problem may become somewhat more predictable.
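
    One simple way to picture feature selection is a filter that ranks each feature by the strength of its correlation with the target and keeps the top few. The sketch below assumes that approach; it is not necessarily the method covered in the session, and the function names and the idea of passing features as a dict of columns are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, target, k):
    """Keep the k feature names with the largest absolute correlation with the target."""
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], target)),
        reverse=True,
    )
    return ranked[:k]
```

    A correlation filter only captures linear, one-feature-at-a-time relationships, so it is a starting point rather than a full measure of variable importance.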

    At 12:00pm to 12:30pm, Tuesday 28th February

    In Ballroom AB, Santa Clara Convention Center

  • Big Data Entity Extraction With Less Work and Less Code

    by Richard Taylor

    Do you want to write less code and get more done? This tutorial will demonstrate a natural language parsing technology to extract entities from all kinds of text using massively parallel clusters. Attendees will gain hands-on experience with the newly-released, data-centric cluster programming technology from HPCC Systems to extract entities from semi-structured and free-form text data. Students will leave with all the data and code used in the class along with the latest HPCC Client Tools installation, HPCC documentation, and HPCC’s VMware installation. Prizes, give-aways and a raffle are included.

    This session is sponsored by HPCC

    At 1:30pm to 5:00pm, Tuesday 28th February

    In Ballroom F, Santa Clara Convention Center

  • Building Applications with Apache Cassandra

    by Nate McCall

    The database industry has been abuzz over the past year about NoSQL databases and their applications in the realm of solutions commonly placed under the ‘big data’ heading.

    This interest has filtered down to software development organizations who have had to scramble to make sense of terminology, concepts and patterns particularly in the areas of distributed computing which were previously limited to academics and a very small number of special case applications.

    Like all of these systems, Apache Cassandra, which has quickly emerged as a best-of-breed solution in the NoSQL/Big Data space, still involves a substantial learning curve to implement correctly.

    This tutorial will walk attendees through the fundamentals of Apache Cassandra, installing a small working cluster either locally or via a cloud provider, and practice configuring and managing this cluster with the tools provided in the open source distribution.

    Attendees will then use this cluster to design a simple Java web application as a way to gain practical, hands-on experience in designing applications that take advantage of the massive performance gains and operational efficiency available from a correctly architected Apache Cassandra cluster.

    Attendees should leave the tutorial with hands-on knowledge of building a real, working distributed database.

    At 1:30pm to 5:00pm, Tuesday 28th February

    In Ballroom G, Santa Clara Convention Center

  • Developing applications for Apache Hadoop

    by Sarah Sproehnle

    This tutorial will explain how to leverage a Hadoop cluster to do data analysis using Java MapReduce, Apache Hive and Apache Pig. It is recommended that participants have experience with some programming language. Topics include:

    • Why are Hadoop and MapReduce needed?
    • Writing a Java MapReduce program
    • Common algorithms applied to Hadoop such as indexing, classification, joining data sets and graph processing
    • Data analysis with Hive and Pig
    • Overview of writing applications that use Apache HBase

    At 1:30pm to 5:00pm, Tuesday 28th February

    In GA J, Santa Clara Convention Center

  • Hands-on Visualization with Tableau

    by Jock Mackinlay, Tableau Software and Ross Perez

    Important: please read the equipment requirements at the bottom of this page before attending the tutorial

    Data has always been a second class citizen on the web. As images, then audio, then video made their way onto the internet, data was always left out of the party, forced into dusty Excel files and ugly HTML tables. Tableau Public is one of the tools aiming to change that by allowing anyone to create interactive charts, maps and graphs and publish them to the web, with no programming required.

    In this tutorial you will learn why data is vital to the future of the web and how Tableau Public works, and you will gain hands-on experience with taking data from numbers to the web.

    Through three different use cases, you will learn the capabilities of the Tableau Public product. The tutorial will conclude with an extended hands-on session covering the visualization process from data to publishing. Topics covered will include:

    • constructing a data set for best performance
    • formatting visualizations to match your preferred branding
    • designing charts for clear communication and impact


    This is a hands-on tutorial. You will need to bring either a Windows laptop or a laptop with a Windows virtual machine installed. Before arriving, you should download and install Tableau Public from this URL: http://www.tableausoftware.com/p...

    At 1:30pm to 5:00pm, Tuesday 28th February

    In Ballroom H, Santa Clara Convention Center

  • Social Network Analysis Isn't Just For People

    by Matt Biddulph

    The tools of social network analysis – centrality measures, clustering, graph-traversal algorithms, community detection and so forth – are largely based on mathematical network theory. There is very little in these techniques that actually requires that the data represents social activity. This presentation will show how these techniques can be applied to data from areas such as geo, the Wikipedia link graph and linguistics.

    We’ll show how to take tabular or textual data and derive graph representations from it that can be used to apply these techniques. We’ll discuss practical applications of these techniques in delivering new features for web applications. We’ll also show how the powerful visualisation tool Gephi can be used to explore the data once it’s in graph form.
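
    To make the step from tabular data to a graph concrete, here is a minimal sketch, assuming the table is a list of (source, target) rows such as page-to-page links. It builds an undirected graph and computes degree centrality, one of the simplest centrality measures. The function names and the sample rows are hypothetical and are not taken from the talk.

```python
from collections import defaultdict


def build_graph(rows):
    """Build an undirected adjacency map from (source, target) rows."""
    graph = defaultdict(set)
    for source, target in rows:
        if source == target:
            continue  # ignore self-links
        graph[source].add(target)
        graph[target].add(source)
    return graph


def degree_centrality(graph):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(graph)
    if n <= 1:
        return {node: 0.0 for node in graph}
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}


if __name__ == "__main__":
    # Hypothetical link table: each row says one page links to another.
    links = [
        ("London", "Paris"),
        ("London", "Berlin"),
        ("Paris", "Berlin"),
        ("London", "Rome"),
    ]
    print(degree_centrality(build_graph(links)))
```

    Nothing in the code depends on the nodes being people, which is the point the abstract makes; the same edge list could be exported to a tool such as Gephi for visual exploration.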

    This talk will be partly based on content from an Ignite talk given at Strata NYC 2011: http://slideshare.net/mattb/plac...

    At 1:30pm to 2:15pm, Tuesday 28th February

    In Ballroom AB, Santa Clara Convention Center

  • The Business of Big Data

    by Mark Madsen

    Mark Madsen talks about how regular businesses will eventually embrace a data-driven mindset, with some trademark ‘Madsen’ history background to put it in context. People throw around ‘industrial revolution of data’ and ‘new oil’ a lot without really thinking about what things like the scientific method, or steam power, or petrochemicals did as a result.

    At 1:30pm to 2:10pm, Tuesday 28th February

    In GA K, Santa Clara Convention Center

  • The Craft of Data Journalism

    by Simon Rogers and Michael Brunton-Spall

    Learn first hand from award-winning Guardian journalists how they mix data, journalism and visualization to break and tell compelling stories: all at newsroom speeds.

    At 1:30pm to 5:00pm, Tuesday 28th February

    In Ballroom E, Santa Clara Convention Center

  • The Two Most Important Algorithms in Predictive Modeling Today

    by Mike Bowles and Jeremy Howard

    When doing predictive modelling, there are two situations in which you might find yourself:

    1. You need to fit a well-defined parameterised model to your data, so you require a learning algorithm which can find those parameters on a large data set without over-fitting
    2. You just need a “black box” which can predict your dependent variable as accurately as possible, so you need a learning algorithm which can automatically identify the structure, interactions, and relationships in the data

    For case (1), lasso and elastic-net regularized generalized linear models are a set of modern algorithms which meet all these needs. They are fast, work on huge data sets, and avoid over-fitting automatically. They are available in the “glmnet” package in R.
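
    To give a feel for how the lasso avoids over-fitting, the following is a minimal pure-Python sketch of lasso regression fitted by coordinate descent with soft-thresholding. It is a stand-in for illustration only: it is not the glmnet implementation, it covers the plain lasso rather than the elastic net, and it assumes the features and target have already been centred so that no intercept is needed. The function names and parameters are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_fit(rows, targets, alpha, n_iter=200):
    """Minimise (1/2n) * sum((y - Xw)^2) + alpha * sum(|w|) by coordinate descent.

    Assumes the columns of rows and the targets are already centred,
    so no intercept is fitted.
    """
    n = len(rows)
    n_features = len(rows[0])
    weights = [0.0] * n_features
    for _ in range(n_iter):
        for j in range(n_features):
            rho = 0.0   # feature j against the residual that excludes feature j
            norm = 0.0  # squared length of feature j
            for row, y in zip(rows, targets):
                prediction = sum(w * x for w, x in zip(weights, row))
                partial_residual = y - prediction + weights[j] * row[j]
                rho += row[j] * partial_residual
                norm += row[j] ** 2
            if norm == 0.0:
                weights[j] = 0.0
                continue
            weights[j] = soft_threshold(rho / n, alpha) / (norm / n)
    return weights
```

    Larger values of alpha shrink more coefficients to exactly zero, which is how the penalty trades a little bias for a simpler model.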

    For case (2), ensembles of decision trees (often known as “Random Forests”) have been the most successful general-purpose algorithm in modern times. For instance, most Kaggle competitions have at least one top entry that heavily uses this approach. This algorithm is very simple to understand, and is fast and easy to apply. It is available in the “randomForest” package in R.

    Mike and Jeremy will explain in simple terms, using no complex math, how these algorithms work, and will also explain using numerous examples how to apply them using R. They will also provide advice on how to select from these algorithms, and will show how to prepare the data, and how to use the trained models in practice.

    At 1:30pm to 5:00pm, Tuesday 28th February

    In Ballroom CD, Santa Clara Convention Center

  • Do it Right – Proven Techniques for Exploiting Big Data Analytics

    by Bill Schmarzo

    “Big data” provides the opportunity to combine new, rich data sources in novel ways to discover business insights. How do you use analytics to exploit this data so that it will yield real business value? Learn a proven technique that ensures you identify where and how big data analytics can be successfully deployed within your organization. Case study examples will demonstrate its use.

    At 2:10pm to 2:30pm, Tuesday 28th February

    In GA K, Santa Clara Convention Center

  • Array Theory vs. Set Theory in Managing Data

    by Robert Lefkowitz

    Relational databases were based on set theory, which insists that the order of items does not matter. For many (most?) data problems, however, order does matter. By using array theory, a relational-like database gains a considerable advantage over set-theory based engines.
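
    A small illustration of why order can matter, using Python's built-in list and set rather than any database engine: the change between consecutive readings can be computed from an ordered sequence, but not from the same values stored as a set, which keeps neither order nor duplicates. The function name and sample readings are hypothetical.

```python
def deltas(readings):
    """Change between consecutive readings; only meaningful for ordered data."""
    return [later - earlier for earlier, later in zip(readings, readings[1:])]


if __name__ == "__main__":
    readings = [3, 5, 5, 4]           # ordered, e.g. one value per day
    print(deltas(readings))           # day-over-day changes
    print(len(set(readings)))         # the set keeps only the distinct values
```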

    At 2:15pm to 3:00pm, Tuesday 28th February

    In Ballroom AB, Santa Clara Convention Center

  • Big Data, Serious Games, and the Future of Work

    by Michael Hugos

    In this session, business agility expert Michael Hugos will present examples from his work in applying immersive animation techniques and gaming dynamics, and discuss how they can address the challenges of consuming – and responding to – the data deluge, turning information overload into business advantage.

    At 2:30pm to 3:00pm, Tuesday 28th February

    In GA K, Santa Clara Convention Center

  • It’s Not Just About the Data… the Power of Driving Impact Through Intent and Interconnectedness

    by Marcia Tal

    In this session, Marcia Tal will demonstrate how significant business value is being realized through sophisticated understanding of intent and interconnectedness, at scale.

    At 3:30pm to 3:50pm, Tuesday 28th February

    In GA K, Santa Clara Convention Center
