Python is the language of choice when it comes to integrating analytical components. We will present a series of concepts and walkthroughs that illustrate how easy scientific computing is in Python, from machine learning and time series to spatial relationships and network analysis.
Explore interactively with IPython (5 minutes)
Work with matrices in NumPy (5 minutes)
Apply optimized algorithms from SciPy (10 minutes)
Visualize numerical data with Matplotlib (10 minutes)
Dive into machine learning with scikit-learn (30 minutes)
Model time series with pandas (30 minutes)
Compute statistics with statsmodels (15 minutes)
Interpret spatial relationships with shapely and pysal (30 minutes)
Analyze networks with NetworkX (10 minutes)
Save huge datasets with h5py (5 minutes)
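As a small taste of the workflow this session covers (a sketch of our own, not an excerpt from the tutorial), here is how a few lines of NumPy fit a line to noisy data. The data and the true slope/intercept are invented for illustration:

```python
import numpy as np

# Hypothetical data: noisy samples of y = 2x + 1
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)

# Least-squares fit: solve A @ [slope, intercept] ~= y
A = np.vstack([x, np.ones_like(x)]).T
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

print(slope, intercept)  # recovered coefficients, close to 2 and 1
```

The same array-based style carries through SciPy, scikit-learn, and pandas, which is why the stack composes so well.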
by Sarah Sproehnle and Mark Fei
This tutorial provides an introduction to Apache Hadoop and what it’s being used for.
For CIOs, IT executives, and technology professionals, Strata’s Bridge to Big Data lays out the roadmap to get your organization up to speed on big data. In this all-day event, learn how to create a big data strategy, manage your first pilot project, demystify vendor solutions, and understand how big data differs from BI.
Bridge to Big Data will be chaired by industry experts Edd Dumbill (Program Chair, Strata), Mark Madsen (Third Nature) and John Akred (Accenture).
In most companies, the brief for big data will be given to those who already have responsibility for IT, business intelligence or marketing. But what happens after the CEO hands you the big data portfolio? While many big data problems will be familiar, there are also unique challenges and opportunities.
“Bridge to Big Data” combines hard-to-find real-world experience with an independent perspective and deep data know-how. Join us for an essential big data strategy workshop, tailored for CIOs and enterprise IT professionals.
Topics addressed will include:
The difference between big data and BI
Picking a test project and getting buy-in
Scaling up after your pilot project
Big data strategy: centralized or distributed
How to avoid creating competing big data efforts
Beating buzzwords: every vendor has “big data” solutions
How big data changes future IT planning and data architecture
This tutorial begins by reviewing human perception and our ability to decode graphical information, then moves on to general principles of graph construction.
Since scales (the rulers along which we graph the data) have a profound effect on our interpretation of graphs, the section on general principles contains a detailed discussion of scales including:
* To include or not to include zero?
* When do logarithmic scales improve clarity?
* What are breaks in scales and how should they be used?
* Are two scales better than one? How can we distinguish between informative and deceptive double axes?
* Can a scale “hide” data? How can this be avoided?
The tutorial concludes with additional topics appropriate for the expected audience. Possible topics include deceptive and misleading graphs, graphs that have affected history, how to decorate graphs when appropriate, resolving conflicting advice on data displays, and special requests from the customer.
Participants will learn to:
* Present data more effectively in all media.
* Display data so that their structure is more apparent.
* Understand principles of effective simple graphs to build upon when creating interactive or dynamic displays.
* Become more critical and analytical when viewing graphs.
For business strategists, marketers, product managers, and entrepreneurs, Strata's Data Driven Business Day looks at how to use data to make better business decisions faster.
Packed with revealing case studies, thought-provoking panels, and eye-opening presentations, this fast-paced day focuses on how to solve today's thorniest business problems with Big Data. It's the missing MBA for a data-driven, always-on business world.
Past DDBD sessions have looked at fundamental issues, including:
How would Big Data change supply chains, inventory, and Just In Time?
Could companies use data-driven methods to improve their managers' effectiveness?
What changes to compliance and governance would an increasingly data-driven business strategy bring?
How could data show marketers non-obvious things about their campaigns, allowing them to improve?
Do we need to revisit age-old strategic models like those from Michael Porter?
Smart companies can't run on anecdotes and gut instinct. But they also can't follow the data without understanding their business ecosystem or the realities of their current state. They have to strike a balance between the arrogance of opinion and the myopia of a rigid adherence to metrics alone. Rather than data-driven, they need to be data-informed.
This year, we'll focus on actually producing change. While Big Data can inform business, if it isn't tied to an organization's goals, objectives, and culture, it won't have an impact. We might think ourselves rational creatures, but as individuals or organizations, most of our decisions are made through moral reasoning, quick-reaction instinct, and guts. That's orthogonal to the data-driven model, and these two worlds—the rational, scientific mind and the instinctive, quick-to-judge heart—are about to collide in the boardroom.
Big Data is teetering at the apex of Gartner's hype curve. Soon, we'll realize it isn't a panacea; that while it upends many business assumptions, it must deal with traditional obstacles to business—scale, culture, and span of control.
At New York's DDBD, we'll consider:
How are specific disciplines changing because of the advent of data?
What are the biggest obstacles that companies have to overcome to go from data-flooded to data-driven?
What fundamental assumptions—supply and demand, models of competition, innovation, and so on—are no longer true in a world where everyone and everything is connected?
What new interfaces should be used to collect information from employees, competitors, and customers? How should organizations “keep score”?
What new interfaces will democratize data, making it easier to grasp and exposing it to a wider audience?
What technologies are most likely to disrupt business in the coming 24 months, just as MP3s disrupted music or electronic readers disrupted publishing?
It's a packed lineup of sought-after speakers, real-world case studies, and strategic discussions you can't afford to miss. If you're an executive, entrepreneur, manager or innovator, DDBD gives you the tools you need to reap the Big Data harvest—getting not just more information, but more of what really matters: business results.
Though getting the right data is perhaps the most essential part of any data journalism project, often the most difficult aspect is cleaning and auditing the data so that it is usable – or even intelligible. In fact, one may not even know whether a particular data set is the right one for the story until it has been cleaned. Data problems can take many forms – from misspellings to mixed data types and everything in between. What’s more, a wide variety of tools can be used to handle these cleaning tasks, and sometimes completing them efficiently requires applying several.
This 3-hour tutorial will give novice users an overview of a range of common tools used for data cleaning and analysis – including Microsoft Excel, Google Refine, Python and R – along with their relative strengths and weaknesses. In addition to demos of how more advanced tools like Python and R can be used for text parsing and statistical analysis, attendees will get hands-on training with Excel and Refine using concrete data examples. By the end, attendees will not only have learned useful new skills in Excel and Refine; they will have a roadmap for the kind of expertise to look for when facing a more complex task.
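To make the kinds of problems above concrete, here is a minimal Python sketch (our own illustration, not tutorial material) that cleans a tiny CSV with inconsistent capitalization, stray whitespace, and mixed numeric formats, using only the standard library:

```python
import csv
import io

# Invented sample data exhibiting common problems: inconsistent case,
# stray whitespace, and thousands separators mixed with plain integers.
raw = (
    "city,population\n"
    ' new york ,"8,175,133"\n'
    "Chicago,2695598\n"
    ' chicago ,"2,695,598"\n'
)

def clean_row(row):
    city = row["city"].strip().title()              # normalize whitespace and case
    pop = int(row["population"].replace(",", ""))   # strip thousands separators
    return {"city": city, "population": pop}

rows = [clean_row(r) for r in csv.DictReader(io.StringIO(raw))]

# Rows that become identical after cleaning turn out to be duplicates
unique = {(r["city"], r["population"]) for r in rows}
print(sorted(unique))
```

The two "chicago" rows collapse into one record once normalized, which is exactly the kind of issue that is invisible until the data has been cleaned.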
by Tom Wheeler
Software testing is hard enough, but it becomes especially challenging when you’re doing large-scale, distributed data processing. This tutorial will present a mix of lecture and instructor-led demonstrations to explain how you can verify that your code performs exactly as you intended.
This session will focus on four key topics:
Unit testing: Proving that a single piece of code works in isolation
Integration testing: Verifying that these units work correctly in conjunction with one another
Performance testing: Ensuring that the code runs at the expected speed and scale
Diagnostics: How to extract valuable information from Hadoop that can help you isolate problems in your code
We will also discuss several problems developers commonly introduce into their code, as well as ways to recognize and solve them.
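The unit-testing idea above is simply to prove the map and reduce logic correct in isolation, before any cluster is involved. A minimal sketch in Python (the tutorial itself targets Hadoop; the function names here are our own illustration) might look like:

```python
import unittest

def wordcount_map(line):
    """Map step of a word count: emit (word, 1) pairs for one input line."""
    return [(w.lower(), 1) for w in line.split()]

def wordcount_reduce(word, counts):
    """Reduce step: sum the counts emitted for a single word."""
    return (word, sum(counts))

class WordCountTest(unittest.TestCase):
    def test_map_emits_one_pair_per_word(self):
        self.assertEqual(wordcount_map("Big big data"),
                         [("big", 1), ("big", 1), ("data", 1)])

    def test_reduce_sums_counts(self):
        self.assertEqual(wordcount_reduce("big", [1, 1, 1]), ("big", 3))

suite = unittest.defaultTestLoader.loadTestsFromTestCase(WordCountTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the functions take and return plain values, they can be exercised without HDFS, a JobTracker, or any test cluster; integration and performance testing then verify the same logic wired into the real framework.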
Why should you care about HBase? Especially if you already know and use Riak/MongoDB/Cassandra/etc?
Well, HBase is inspired by Google’s battle-hardened “BigTable” architecture and is known to be one of the most scalable distributed databases around. For some perspective, the largest Riak and MongoDB clusters in production are measured in the dozens of nodes – while HBase clusters with hundreds of nodes aren’t unusual.
Even if you aren’t running a Top 100 site (yet), HBase could still be very useful to you since it’s strongly consistent and easy to reason about.
However, for all of HBase’s strengths, it is difficult to get started with. Instead of just running one daemon on each server in your cluster, you’ll have the fun of configuring HDFS (which requires you to set up a NameNode, a backup NameNode, and DataNodes), a ZooKeeper quorum (which should run on at least 3 dedicated servers), and HBase itself (which will require a master, a backup master, and regionservers). And that’s assuming you don’t want MapReduce or security features (each of which requires several more servers).
After this workshop, you will know what it takes to stand up and operate a complete HBase deployment.
by Collin Bennett and Robert L. Grossman
In this tutorial, we show how open source tools can be used for the entire life cycle of a predictive model built over big data. Specifically, for anyone who has built a model, we show how to: 1) perform an exploratory data analysis (EDA) of data managed by Hadoop using R and other open source tools; 2) leverage the EDA to build analytic and statistical models over data managed by Hadoop; 3) deploy these models into operational systems; and 4) measure the performance of the models and continuously improve them.
We cover the following topics:
Hadoop HDFS is typically adopted in situations where traditional storage and database systems are either reaching or have already surpassed their limits. This usually implies that there are one or more large streams of events that need to be collected, such as log data streams. Flume NG was designed from the ground up to tackle this problem in a straightforward, scalable, reliable way, and empirical results support the success of its approach.
At a high level, Flume NG has a simple, well-designed architecture consisting of a set of agents, with each agent running any number of sources, channels (event buffers), and sinks. Flume agents can easily be chained across the network to provide a configurable pipeline through which discrete events flow reliably from source (e.g., an application server) to destination (e.g., a Hadoop HDFS cluster). Flume can be configured to support arbitrary data flows, including fan-in (data aggregation) and fan-out (data replication) designs. Such designs are primarily an artifact of the generality of the agent-based architecture.
In this tutorial, a group of people closely involved with Flume walk participants through setting up a typical data collection infrastructure using Flume. We first describe the basic architecture of Flume including its design, the transactional semantics it supports for reliability, and the sources, channels, and sinks included with the Flume core. We then move on to a brief description of common data flow architectures, and choose a typical data collection scenario for which we use Flume to do the heavy lifting. Next we come to the main body of this tutorial session, which is a walkthrough of installing, configuring, and tuning a scalable, reliable, and fault-tolerant Flume-based data collection system for storing events into a Hadoop system in real time.
Throughout this presentation we also cover: (1) how to configure Flume to store data on a secure HDFS cluster, (2) configuration options used to trade off between performance and fault tolerance, (3) Avro support, (4) Flume extension points, plugins, and hooks, (5) Flume compatibility with various versions of Hadoop, (6) performance benchmarks, and (7) general best practices for using Flume NG effectively.
by Ed Kohlwey and Stephanie Beben
Implementing Map/Reduce applications using tools like Java can be hard; as a result, it is often useful to be able to use Map/Reduce from other languages. In this tutorial, we’ll provide an introduction to RHadoop, an open source Map/Reduce library for R. We will assume that attendees have a broad familiarity with R and Hadoop, however the exercises do not require attendees to be an expert in either platform.
First, we will discuss the basics of Map/Reduce, a framework for writing massively parallel big data analytics, and the nuances of the RHadoop implementation.
Next, we’ll discuss some common techniques in RHadoop including maintaining application state, processing data that has a Zipfian distribution, representing distributed matrices, performing basic operations over distributed matrices, finding outliers, and debugging.
Finally, we’ll walk through an interactive exercise to show attendees how to create a trending topic analysis using LDA and RHadoop. First, we’ll show attendees how to install both Hadoop and the rmr package, which provides Map/Reduce functionality. Then we’ll walk through an interactive coding example that demonstrates how to actually use RHadoop to create a sliding window analysis of trending topics.
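The sliding-window idea behind the trending-topic exercise can be sketched in a few lines. This illustration is in Python rather than R/RHadoop, and the function and data names are ours, not the rmr package's API:

```python
from collections import Counter, deque

def trending(stream, window=3, top=2):
    """Yield the top topics over a sliding window of recent batches.

    `stream` is an iterable of lists of topic mentions, one list per
    time step; only the most recent `window` batches count.
    """
    recent = deque(maxlen=window)   # old batches fall off automatically
    for batch in stream:
        recent.append(Counter(batch))
        totals = sum(recent, Counter())          # merge counts in the window
        yield [topic for topic, _ in totals.most_common(top)]

# Invented mention batches for three consecutive time steps
batches = [["hadoop", "r"], ["r", "r", "spark"], ["spark", "spark"]]
for tops in trending(batches):
    print(tops)
```

In the real exercise, the per-batch counting is the map step and the merging is the reduce step, distributed over the cluster instead of run in a single process.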
Attendees: All attendees should bring paper and pen for quick sketching. Attendees should bring their own data to work with; alternatively, they can download interesting data sets from sites such as infochimps.com, buzzdata.com, and data.gov. People with access to a Windows machine may want to install Tableau Public.
We will discuss how to figure out what story to tell, select the right data, and pick appropriate layout and encodings. The goal is to learn how to create a visualization that conveys appropriate knowledge to a specific audience (which may include the designer).
We’ll briefly discuss tools, including pencil and paper. No prior technology or graphic design experience is necessary. An awareness of some basic user-centered design concepts will be helpful.
Understanding of your specific data or data types will help immensely. Please do bring data sets to play with.
by Dean Wampler
In this hands-on tutorial, you’ll learn how to install and use Hive for Hadoop-based data warehousing. You’ll also learn some tricks of the trade and how to handle known issues.
Writing Hive Queries
We’ll spend most of the tutorial using a series of hands-on exercises with actual Hive queries, so you can learn by doing. We’ll go over all the main features of Hive’s query language, HiveQL, and how Hive works with data in Hadoop.
Hive is very flexible about the formats of data files, the “schema” of records and so forth. We’ll discuss options for customizing these and other aspects of your Hive and data cluster setup. We’ll briefly examine how you can write Java user defined functions (UDFs) and other plugins that extend Hive for data formats that aren’t supported natively.
Hive in the Hadoop Ecosystem
We’ll learn Hive’s place in the Hadoop ecosystem, such as how it compares to other available tools. We’ll discuss installation and configuration issues that ensure the best performance and ease of use in a real production cluster. In particular, we’ll discuss how to create Hive’s separate “metadata” store in a traditional relational database, such as MySQL. We’ll offer tips on data formats and layouts that improve performance in various scenarios.
In this hands-on tutorial, you will learn the importance of distributed search through our industry experience and a specific example. In particular, we’ll introduce an architecture that incorporates distributed search techniques, and share pain points experienced and lessons learned. Building on that foundation, we’ll depict the landscape of distributed search tools and their future directions. For the hands-on part of the tutorial, you will learn how to install and use Apache Solr for real-time Big Data analytics, search, and reporting. You’ll also learn some tricks of the trade and how to handle known issues.
While Hadoop is the most well-known technology in big data, it's not always the most approachable or appropriate solution for data storage and processing. In this session you'll learn about enterprise NoSQL architectures, with examples drawn from real-world deployments, as well as how to apply big data regardless of the size of your own enterprise.
by Mike Olson
by Ben Werther
Hadoop is scalable, inexpensive and can store near-infinite amounts of data. But driving it requires exotic skills and hours of batch processing to answer straightforward questions. Learn how everything is about to change.
by Michael Flowers
New York City is a complex, thriving organism. Hear how data science has played a surprising and effective role in helping the city government provide services to over 8 million people, from preventing public safety catastrophes to improving New Yorkers’ quality of life.
by Rich Hickey
While moving away from single powerful servers, distributed databases still tend to be monolithic solutions. But key-value storage, for example, is rapidly becoming a commodity service on which richer databases might be built. What are the implications?
In recent years, “Big Data” has matured from a vague description of massive corporate data to a household term that refers not just to volume but to the diversity of data and the velocity of change. Today, there’s a wealth of data trapped in corporate data repositories, new platforms like Hadoop, a new generation of data marketplaces, and the volumes generated hourly on the Web. With the opportunity for key insights that these diverse data sources present, the business user’s ability to get to the data when they need it and glean fast insights has become a massive priority. In a nutshell, easing access to and analysis of both private and public data is one of the biggest opportunities ahead. New approaches that enable self-driven exploration of private and public data are necessary and will help address the critical ‘last mile’ problem in big data. Big Data Direct discusses the opportunity ahead for business users to intuitively and easily harness the power of private and public data for deeper customer intelligence and to identify new business opportunities.
by Tim Estes
The onset of the Big Data phenomenon has created a unique opportunity to improve the human condition, but the challenge ahead of us is to move beyond Big Data infrastructure to real, applied, and prioritized comprehension that is morally and practically useful. This requires redirecting our collective energies toward new algorithms, more distributed systems, and purer software architectures that more optimally exploit the infrastructure to answer questions of great social and personal value. Technologies that close the “Understanding Gap” can make great strides to prevent evil, reduce suffering, and create more actualized human potential. This pursuit is more than an opportunity; it is a key responsibility for the technology community today and through at least the next decade.
by Jim Caputo and Michael Manoochehri
60 hours of videos are uploaded to YouTube every minute. The Google search index contained 100 Million Gigabytes of data in 2010. Other Google services have hundreds of millions of users. Each of these products generates massive amounts of data. Google has developed custom technologies to analyze this data and make intelligent product decisions.
Dremel is a scalable, interactive ad-hoc query system. By combining multi-level execution trees and columnar data layout, Dremel allows users to run queries in a SQL-like language over tables with billions of rows in seconds. Dremel uses an architecture distinct from MapReduce-based platforms to improve efficiency when running multiple simultaneous query jobs. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google querying web logs, ad analytics and financial data.
Google’s situation is no longer unique. As more and more companies collect massive amounts of data, they need to quickly analyze it without large investments in infrastructure or human capital. We want everyone to have the power of Dremel.
BigQuery puts the powerful interactive querying capabilities of Dremel into the hands of users everywhere. It is designed for accessibility and ease of use, featuring a REST API as well as a web-based interface. BigQuery enables users to ingest 1 TB of data and run hundreds of queries on it with a SQL-like language in less than an hour.
This session will discuss the development and capabilities of Dremel, in particular its performance characteristics and ability to enable interactive ad-hoc querying on a multi-tenant architecture. We’ll also dive into the design challenges involved in making the Dremel technology accessible and performant for third-party developers and business users working with massive data sets.
Big data can do more than support quantitative decision making or lift key metrics. When leveraged correctly, big data can create new, better, or more rewarding experiences for users. By inverting the traditional relationship of big data, statistics, and optimization, we can create experiences that encourage users to explore not only their data, but their world, and to create and engage in the world on a deeper and more meaningful basis.
In this session, we’ll examine how we can apply this idea to modern mapping. Traditionally, online mapping has been geared toward literal data presentation and getting users between two points in an optimal fashion, catering to the traditional strengths of data science. This need not be the case. If we take care, we can create online maps that reinforce the way users naturally engage with cities and augment those experiences.
We’ll construct a mapping service that understands the subjective character of San Francisco, and facilitates serendipity and desired experiences within the city. We’ll power this service using OpenStreetMap and data from social services like Foursquare, Flickr, and Instagram, and analyze photographs of streets from Google Street View to create a holistic view of the city and where different experiences can be encountered. We’ll show how you can summarize this data numerically, textually, and visually using simple techniques, and then make this data actionable to users in a non-traditional manner. We will cover subjective and interpretive visualization techniques, drawing from mapping and traditional data visualization as well as more abstract generative art, and show how they can be used most effectively to communicate non-scalar values.
We’ll cover how traditional data analysis tools like R and NumPy can be combined with tools more often associated with robotics, like OpenCV (computer vision), to create a more complete data set. We’ll also cover how traditional data visualization techniques can be combined with mapping to present a more complete picture of any place, and more interesting ways for users to interact with locations and the paths between them.
by Erik Shilts
Opower works with utility companies to provide engaging, relevant, and personalized content about home energy use to millions of households. We have found that simply providing data is not enough to change behavior; data are meaningful only to the extent that people can relate to them. Opower makes energy use data relatable and meaningful by using normative comparisons and personalized insights to help reduce energy use. This simple framework has enabled us to reduce consumers’ energy use by over one terawatt hour, or about 25% of the entire output of the US solar industry in 2011. Currently, Opower works with over 60 utilities domestically and internationally and houses energy data on over 30 million households. And our big data is becoming bigger—many households measure energy use monthly, but more and more homes have smart meters, which record energy usage at hourly or half-hourly intervals.
This talk will discuss how we interact with the Hadoop ecosystem to transform raw data into contextualized information that drives behavior change. This starts with our Hadoop setup and how it has changed how we store data and think about problems. We then interface Hadoop with tools like R and Python to extract features, visualize, and validate data. On top of this setup, we have extended our set of addressable problems by utilizing curated crowdsourcing to classify patterns that humans can easily recognize. Lastly, we use statistical models and machine learning algorithms including nearest neighbors, self-organizing maps, and regularized regressions to create models of characteristics that drive energy use.
The inputs and outputs of each step are stored in Hadoop for ease of use, transparency, and to facilitate collaboration. Throughout the talk I’ll give examples of data science and efficiency problems this infrastructure has solved and where we expect it to take us in the future.
by Siraj Khaliq
Big Data takes on the planet’s toughest challenge: analyzing the weather’s complex and multi-layered behavior to help the world’s farmers adapt to climate change. Increased volatility in weather – the source of over 90% of crop loss – has in recent years become a major concern for farmers and our food supply. By combining modern Big Data techniques, climatology and agronomics, The Climate Corporation protects the $3 trillion global agriculture industry with automated hyper-local weather insurance.
TCC’s algorithms analyze millions of weather measurements, billions of soil observations, and trillions of simulation datapoints to quantify weather risk and price insurance policies carefully tailored to specific situations. Their systems, using Hadoop and tools built in the Clojure functional programming language, use thousands of cloud servers to process historical and forecast data and generate 10,000 weather scenarios, going out several years, in a dense grid covering the US. The resulting trillions of scenario datapoints—amounting to hundreds of terabytes—are used to quantify risk to crop yield and build corresponding insurance policies. Weather-related data is acquired multiple times a day directly from major climate models and incorporated into a real-time pricing engine. The quantity of data processed has grown an average of 10x every year as the company adds more granular geographic data, requiring highly scalable processes for rapid data acquisition and ingestion.
In this talk, CTO and company co-founder Siraj Khaliq will discuss the problem space, evolution of the business and corresponding technology, and how the company’s team of mathematicians and computer scientists is using state-of-the-art big data techniques today to tackle this very real-world problem.
by Ben Werther and Kevin Beyer
Traditional ETL assumes you know the target schema and organization of the data. That used to be a realistic assumption, but in a big-data world, data is much bigger, lower density and new sources arrive and evolve much more quickly. Implicit in this is that you are storing data before you know how you are going to use it.
A naive answer to this is schema-on-read: just write data into Hadoop, and figure out what you have and how you want to assemble it when you need it. But this means that advanced developers and lots of domain knowledge are needed any time anyone wants to pull anything from Hadoop. This sets the bar too high, and leads to complex and inflexible custom-coded integrations and jobs.
A new approach that we propose is ‘agile iterative ETL’. Hadoop makes this possible, since the data lands in its raw form and can be processed a first time and then revisited when additional detail or refinement is needed.
In other words:
1. land raw data in Hadoop,
2. lazily add metadata, and
3. iteratively construct and refine marts/cubes based on the metadata from step 2.
The big difference is that, once steps #1 and #2 are completed, a relatively unsophisticated user could drive #3. This approach can be used as a recipe for Hadoop developers looking to build a much more agile pipeline, and is heavily utilized in Platfora’s architecture.
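The three steps can be sketched in miniature. This is our own illustration of the pattern, not Platfora's implementation; the in-memory `lake` and `catalog` stand in for HDFS and a real metadata store:

```python
import json

# Step 1: land raw data untouched (here, raw JSON lines in an
# in-memory "lake"; in practice this would be files in HDFS).
lake = [
    '{"ts": "2012-10-23", "user": "a", "amount": "9.99"}',
    '{"ts": "2012-10-24", "user": "b", "amount": "4.50"}',
]

# Step 2: lazily attach metadata describing how to interpret each
# field, added only once someone actually needs the data.
catalog = {"ts": str, "user": str, "amount": float}

# Step 3: build a small mart on demand, driven by that metadata.
def build_mart(raw, schema):
    rows = [json.loads(line) for line in raw]
    return [{k: schema[k](v) for k, v in row.items()} for row in rows]

mart = build_mart(lake, catalog)
print(sum(row["amount"] for row in mart))  # total across the mart
```

Because the raw lines are never modified, the catalog can be revised and the mart rebuilt at any time, which is what makes the pipeline iterative rather than one-shot.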
23rd–25th October 2012