by Chris Deptula and James Dixon
The big data world is extremely chaotic based on technology in its infancy. Learn how to tame this chaos, integrate it within your existing data environments (RDBMS, analytic databases, applications), manage the workflow, orchestrate jobs, improve productivity and make using big data technologies accessible to a much wider spectrum of developers, analysts and data scientists. Learn how you can actually leverage Hadoop and NoSQL stores via an intuitive, graphical big data IDE – eliminating the need for deep developer skills such as Hadoop MapReduce, Pig scripting, or NoSQL queries.
We will discuss how to figure out what story to tell, select the right data, and pick appropriate layout and encodings. The goal is to learn how to create a visualization that conveys appropriate knowledge to a specific audience (which may include the designer).
We’ll briefly discuss tools, including pencil and paper. No prior technology or graphic design experience is necessary. An awareness of some basic user-centered design concepts will be helpful.
Understanding of your specific data or data types will help immensely. Please do bring data sets to play with.
In this hands-on tutorial, you’ll learn how to install and use Hive for Hadoop-based data warehousing. You’ll also learn some tricks of the trade and how to handle known issues.
Using the Hive Tutorial Tools
We’ll email instructions to you before the tutorial so you can come prepared with the necessary tools installed and ready to go. This prior preparation will let us use the whole tutorial time to learn Hive’s query language and other important topics. At the beginning of the tutorial we’ll show you how to use these tools.
Writing Hive Queries
We’ll spend most of the tutorial using a series of hands-on exercises with actual Hive queries, so you can learn by doing. We’ll go over all the main features of Hive’s query language, HiveQL, and how Hive works with data in Hadoop.
Hive is very flexible about the formats of data files, the “schema” of records and so forth. We’ll discuss options for customizing these and other aspects of your Hive and data cluster setup. We’ll briefly examine how you can write Java user defined functions (UDFs) and other plugins that extend Hive for data formats that aren’t supported natively.
Hive in the Hadoop Ecosystem
We’ll conclude with a discussion of Hive’s place in the Hadoop ecosystem, such as how it compares to other available tools. We’ll discuss installation and configuration issues that ensure the best performance and ease of use in a real production cluster. In particular, we’ll discuss how to create Hive’s separate “metadata” store in a traditional relational database, such as MySQL. We’ll offer tips on data formats and layouts that improve performance in various scenarios.
This tutorial provides a solid foundation for those seeking to understand large scale data processing with MapReduce and Hadoop, plus its associated ecosystem. This session is intended for those who are new to Hadoop and are seeking to understand where Hadoop is appropriate and how it fits with existing systems.
The agenda will include:
This tutorial will enable anyone with some programming experience to begin analyzing data with the R programming language
Opening remarks by Program Chair, Alistair Croll, Founder, Bitcurrent
by Ken Krugler
This tutorial will teach attendees about the key aspects of scalable web mining, via six modules:
2. Focused Web Crawling
3. Structured Data Extraction
4. Analyzing the Data
5. Barriers to Success
6. Examples and Summary
by Michael Rys
Contrary to popular belief, SQL and NoSQL are not at odds with each other, they are duals—in fact NoSQL should really be called coSQL. Recognizing this duality can change the way we think about which technology to use when, and what we need to invest in next.
Author and digital marketing evangelist Avinash Kaushik shares his perspective, drawing from experience with some of the world’s largest online marketers, and looks at how an analyst mentality is quickly permeating all aspects of business and marketing.
by Claudia Perlich
With the collection of almost every piece of information about your customers comes the ability to start asking your data the right question: Why do they do what they do? And even more: what would they do if I could interact with them. We show for the case of online display advertising, how causal analysis gives interesting new answers about the right (and wrong) ways of spending your money.
The effect of big data on all business models cannot be denied. This panel of SCM experts looks at how business are using, or should be using, big data to drive supply chain management issues focusing on the broader manufacturing issues that must be addressed as well as practical tips that can be applied in dealing with supply chains that now span the globe.
by JC Herz
This presentation lays out some clear, concrete gating conditions for when it makes sense to pull the trigger on big data initiatives, and how they should be procured, depending on the use case, the data assets, and the resources available.
Getting training data for a recommender system is easy: if users clicked it, it’s a positive – if they didn’t, it’s a negative. … Or is it? In this talk, we use examples from production recommender systems to bring training data to the forefront: from overcoming presentation bias to the art of crowdsourcing subjective judgments to creative data exhaust exploitation and feature creation.
by Diego Saenz
What are the fundamental skills that a CEO needs to become “Data Driven”? In this session we will discuss the 3 essential skills that will enable CEOs to effectively lead their organizations into the Data Revolution. These organizations will harness the power of data to innovate, grow profits and beat the competition.
Learn various ways to bootstrap a custom corpus for training highly accurate natural language processing models. Real world examples will be presented with Python code samples using NLTK. Each example will show you how, starting from scratch, you can rapidly produce a highly accurate custom corpus for training the kinds of natural language processing models you need.
Despite the hype, enterprises are still wrapping their arms around the large amounts of data they’re sitting on, and how to leverage it. In this session, we’ll look at a snapshot of how enterprises are thinking about their big data strategies, and what it means to their top line.
by Josh Gold and Felix Hamilton
There are many rapidly evolving technologies that provide objective metrics and analytics for most outward facing business interactions. The evolution of similar inward facing tools has not kept pace. In this presentation we discuss which sources of internal organizational data are frequently neglected, approaches for automating data collection, and what valuable insights can result from analysis.
by Ben Gimpert
Twenty-first century big data is being used to train predictive models of emotional sentiment, customer churn, patient health, and other behavioral complexities. Variable importance and feature selection reduces the dimensionality of our models, so an unfeasible and complex problem may become somewhat more predictable.
by Richard Taylor
Do you want to write less code and get more done? This tutorial will demonstrate a natural language parsing technology to extract entities from all kinds of text using massively parallel clusters. Attendees will gain hands-on experience with the newly-released, data-centric cluster programming technology from HPCC Systems to extract entities from semi-structured and free-form text data. Students will leave with all the data and code used in the class along with the latest HPCC Client Tools installation, HPCC documentation, and HPCC’s VMware installation. Prizes, give-aways and a raffle is included.
This session is sponsored by HPCC
by Nate McCall
The database industry has been abuzz over the past year about NoSQL databases and their applications in the realm of solutions commonly placed under the ‘big data’ heading.
This interest has filtered down to software development organizations who have had to scramble to make sense of terminology, concepts and patterns particularly in the areas of distributed computing which were previously limited to academics and a very small number of special case applications.
Like all of these systems, Apache Cassandra, which has quickly emerged as a best-of-breed solution in this space in the NoSQL/Big Data space, still requires a substantial learning curve to implement correctly.
This tutorial will walk attendees through the fundamentals of Apache Cassandra, installing a small working cluster either locally or via a cloud provider, and practice configuring and managing this cluster with the tools provided in the open source distribution.
Attendees will then use this cluster to design a simple Java web application as a way to gain practical, hands on experience in designing applications to take advantage of the massive performance gains and operational efficiency that can be leveraged from a correctly architected Apache Cassandra cluster.
Attendees should leave the tutorial with hands-on knowledge of building a real, working distributed database.
This tutorial will explain how to leverage a Hadoop cluster to do data analysis using Java MapReduce, Apache Hive and Apache Pig. It is recommended that participants have experience with some programming language. Topics include:
Important: please read the equipment requirements at the bottom of this page before attending the tutorial
Data has always been a second class citizen on the web. As images, then audio, then video made their way onto the internet, data was always left out of the party, forced into dusty Excel files and ugly HTML tables. Tableau Public is one of the tools aiming to change that by allowing anyone to create interactive charts, maps and graphs and publishing to the web—no programming required.
In this tutorial you will learn why data is vital to the future of the web, how Tableau Public works, and gain hands-on experience with taking data from numbers to the web.
Through three different use cases, you will learn the capabilities of the Tableau Public product. The tutorial will conclude with an extended hands-on session covering the visualization process from data to publishing. Topics covered will include:
IMPORTANT: EQUIPMENT REQUIREMENTS
This is a hands-on tutorial. You will need to bring either a Windows laptop or a laptop with a Windows virtual machine installed. Before arriving, you should download and install Tableau Public from this URL: http://www.tableausoftware.com/p...
The tools of social network analysis – centrality measures, clustering, graph-traversal algorithms, community detection and so forth – are largely based on mathematical network theory. There is very little in these techniques that actually requires that the data represents social activity. This presentation will show how these techniques can be applied to data from areas such as geo, the Wikipedia link graph and linguistics.
We’ll show how to take tabular or textual data and derive graph representations from it that can be used to apply these techniques. We’ll discuss practical applications of these techniques in delivering new features for web applications. We’ll also show how the powerful visualisation tool Gephi can be used to explore the data once it’s in graph form.
This talk will be partly based on content from an Ignite talk given at Strata NYC 2011: http://slideshare.net/mattb/plac...
by Mark Madsen
Mark Madsen talks about how regular businesses will eventually embrace a data-driven mindset, with some trademark ‘Madsen’ history background to put it in context. People throw around ‘industrial revolution of data’ and ‘new oil’ a lot without really thinking about what things like the scientific method, or steam power, or petrochemicals did as a result.
Learn first hand from award-winning Guardian journalists how they mix data, journalism and visualization to break and tell compelling stories: all at newsroom speeds.
by Mike Bowles and Jeremy Howard
When doing predictive modelling, there are two situations in which you might find yourself:
You need to fit a well-defined parameterised model to your data, so you require a learning algorithm which can find those parameters on a large data set without over-fitting
You just need a “black box” which can predict your dependent variable as accurately as possible, so you need a learning algorithm which can automatically identify the structure, interactions, and relationships in the data
For case (1), lasso and elastic-net regularized generalized linear models are a set of modern algorithms which meet all these needs. They are fast, work on huge data sets, and avoid over-fitting automatically. They are available in the “glmnet” package in R.
For case (2), ensembles of decision trees (often known as “Random Forests”) have been the most successful general-purpose algorithm in modern times. For instance, most Kaggle competitions have at least one top entry that heavily uses this approach. This algorithm is very simple to understand, and is fast and easy to apply. It is available in the “randomForest” package in R.
Mike and Jeremy will explain in simple terms, using no complex math, how these algorithms work, and will also explain using numerous examples how to apply them using R. They will also provide advice on how to select from these algorithms, and will show how to prepare the data, and how to use the trained models in practice.
“Big data” provides the opportunity to combine new, rich data sources in novel ways to discover business insights. How do you use analytics to exploit this data so that it will yield real business value? Learn a proven technique that ensures you identify where and how big data analytics can be successfully deployed within your organization. Case study examples will demonstrate its use.
Relational databases were based on Set theory — which insists that the order of items does not matter. For many (most?) data problems, however, order does matter. By using Array theory, a relational-like database gains a considerable advantage over set-theory based engines.
In this session, business agility expert Michael Hugos will present examples from his work in applying immersive animation techniques and gaming dynamics, and discuss how they can address the challenges of consuming – and responding to – the data deluge, turning information overload into business advantage.
by Marcia Tal
In this session, Marcia Tal will demonstrate how significant business value is being realized through sophisticated understanding of intent and interconnectedness, at scale.
28th February to 1st March 2012