Sessions at Strata 2012 with slides


Tuesday 28th February 2012

  • Hadoop Data Warehousing with Hive

    by Dean Wampler and Jason Rutherglen

    In this hands-on tutorial, you’ll learn how to install and use Hive for Hadoop-based data warehousing. You’ll also learn some tricks of the trade and how to handle known issues.

    Using the Hive Tutorial Tools

    We’ll email instructions to you before the tutorial so you can come prepared with the necessary tools installed and ready to go. This prior preparation will let us use the whole tutorial time to learn Hive’s query language and other important topics. At the beginning of the tutorial we’ll show you how to use these tools.

    Writing Hive Queries

    We’ll spend most of the tutorial using a series of hands-on exercises with actual Hive queries, so you can learn by doing. We’ll go over all the main features of Hive’s query language, HiveQL, and how Hive works with data in Hadoop.
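
    As a flavor of the HiveQL covered in the exercises, here is a minimal sketch (not from the tutorial itself) that runs a grouped aggregation through the Hive command-line client from Python; the table and column names are hypothetical.

    ```python
    # Hypothetical sketch: running a HiveQL query from Python via the Hive CLI.
    # The table and column names (weblogs, status, hits) are made up for illustration.
    import subprocess

    query = """
    SELECT status, COUNT(*) AS hits
    FROM weblogs
    WHERE year = 2012 AND month = 2
    GROUP BY status
    ORDER BY hits DESC;
    """

    # `hive -e` runs a query string and writes tab-separated results to stdout.
    result = subprocess.run(["hive", "-e", query], capture_output=True, text=True, check=True)
    print(result.stdout)
    ```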

    Advanced Techniques

    Hive is very flexible about the formats of data files, the “schema” of records and so forth. We’ll discuss options for customizing these and other aspects of your Hive and data cluster setup. We’ll briefly examine how you can write Java user defined functions (UDFs) and other plugins that extend Hive for data formats that aren’t supported natively.
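
    As one hedged illustration of extending Hive without writing Java, rows can also be piped through an external script using HiveQL’s TRANSFORM clause; the field layout below is hypothetical.

    ```python
    #!/usr/bin/env python
    # Illustrative streaming script for Hive's TRANSFORM clause (an alternative to
    # Java UDFs for custom record handling). Hive pipes rows to stdin as
    # tab-separated fields and reads tab-separated rows back from stdout.
    # The (ip, user_agent) layout is hypothetical.
    import sys

    for line in sys.stdin:
        ip, user_agent = line.rstrip("\n").split("\t")
        # Emit the ip plus a crude device family derived from the user agent.
        family = "mobile" if "Mobile" in user_agent else "desktop"
        print(ip + "\t" + family)

    # Invoked from HiveQL roughly as:
    #   ADD FILE classify_agent.py;
    #   SELECT TRANSFORM (ip, user_agent)
    #   USING 'python classify_agent.py' AS (ip, family)
    #   FROM raw_log;
    ```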

    Hive in the Hadoop Ecosystem

    We’ll conclude with a discussion of Hive’s place in the Hadoop ecosystem, such as how it compares to other available tools. We’ll discuss installation and configuration issues that ensure the best performance and ease of use in a real production cluster. In particular, we’ll discuss how to create Hive’s separate “metadata” store in a traditional relational database, such as MySQL. We’ll offer tips on data formats and layouts that improve performance in various scenarios.
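
    For reference, here is a sketch of the standard javax.jdo.option properties that point the metastore at MySQL, shown as a Python dict rather than the hive-site.xml file they normally live in; the host, database name and credentials are placeholders.

    ```python
    # Placeholder values; the property names are the standard metastore settings
    # that would normally go into hive-site.xml.
    metastore_properties = {
        "javax.jdo.option.ConnectionURL": "jdbc:mysql://metastore-host/hive_metastore?createDatabaseIfNotExist=true",
        "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
        "javax.jdo.option.ConnectionUserName": "hive",
        "javax.jdo.option.ConnectionPassword": "changeme",
    }

    # Each entry becomes a <property><name>...</name><value>...</value> element.
    for name, value in metastore_properties.items():
        print("<property><name>{0}</name><value>{1}</value></property>".format(name, value))
    ```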

    At 9:00am to 12:30pm, Tuesday 28th February

    Coverage: slide deck

  • Large scale web mining

    by Ken Krugler

    This tutorial will teach attendees about the key aspects of scalable web mining, via six modules:

    1. Introduction

    • Why web data is valuable
    • Key challenges to web crawling
    • Realistic definitions for success

    2. Focused Web Crawling

    • Reducing time & cost by focusing the crawl
    • Approaches to classifying and scoring pages
    • Solutions for scalable web crawling

    3. Structured Data Extraction

    • Data mining essentials
    • Structured text extraction
    • Automated vs. manual extraction

    4. Analyzing the Data

    • Making it searchable
    • Finding "interesting" text
    • Machine learning with Mahout

    5. Barriers to Success

    • Polite crawling versus deep crawling
    • Spam, splog, honeypots and nasty webmasters
    • Ajax, robots.txt and Facebook

    6. Examples and Summary

    • Hotel reviews
    • Music pages
    • SEO analysis

    At 9:00am to 12:30pm, Tuesday 28th February

    In Ballroom E, Santa Clara Convention Center

    Coverage: slide deck

  • The Model and the Train Wreck: A Training Data How-to

    by Monica Rogati

    Getting training data for a recommender system is easy: if users clicked it, it’s a positive – if they didn’t, it’s a negative. … Or is it? In this talk, we use examples from production recommender systems to bring training data to the forefront: from overcoming presentation bias to the art of crowdsourcing subjective judgments to creative data exhaust exploitation and feature creation.
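
    As a minimal sketch of the naive framing the talk starts from, with one simple guard against presentation bias (only items the user actually saw count as negatives); the log fields below are hypothetical, not the speaker’s production setup.

    ```python
    # Illustrative only: turning click/impression logs into labeled training pairs.
    # The field names (user, item, shown, clicked) are hypothetical.
    def label_examples(impressions):
        """impressions: iterable of dicts like
        {"user": u, "item": i, "shown": bool, "clicked": bool}"""
        examples = []
        for row in impressions:
            if not row["shown"]:
                # Never presented to the user: skip it rather than calling it a
                # negative, one simple guard against presentation bias.
                continue
            label = 1 if row["clicked"] else 0
            examples.append(((row["user"], row["item"]), label))
        return examples

    print(label_examples([
        {"user": "u1", "item": "e9", "shown": True, "clicked": True},
        {"user": "u1", "item": "e4", "shown": True, "clicked": False},
        {"user": "u1", "item": "e7", "shown": False, "clicked": False},
    ]))
    ```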

    At 11:00am to 11:30am, Tuesday 28th February

    In Ballroom AB, Santa Clara Convention Center

    Coverage: slide deck

  • Corpus Bootstrapping with NLTK

    by Jacob Perkins

    Learn various ways to bootstrap a custom corpus for training highly accurate natural language processing models. Real world examples will be presented with Python code samples using NLTK. Each example will show you how, starting from scratch, you can rapidly produce a highly accurate custom corpus for training the kinds of natural language processing models you need.
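
    One common bootstrapping pattern, shown as a sketch rather than the talk’s exact recipes: train on a small seed set, auto-label new text, and keep only confident labels for the next round. The seed data and confidence threshold below are made up.

    ```python
    # Minimal self-training loop with NLTK; seed examples and the 0.8 confidence
    # threshold are arbitrary placeholders.
    import nltk

    def features(text):
        return {word.lower(): True for word in text.split()}

    seed = [
        (features("great product loved it"), "pos"),
        (features("terrible broke after a day"), "neg"),
    ]
    remaining = ["really loved the battery life", "broke immediately, terrible build"]

    corpus = list(seed)
    for _ in range(3):                                  # a few bootstrapping rounds
        classifier = nltk.NaiveBayesClassifier.train(corpus)
        still_unlabeled = []
        for text in remaining:
            dist = classifier.prob_classify(features(text))
            label = dist.max()
            if dist.prob(label) > 0.8:                  # keep only confident guesses
                corpus.append((features(text), label))
            else:
                still_unlabeled.append(text)
        remaining = still_unlabeled
    ```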

    At 11:30am to 12:00pm, Tuesday 28th February

    In Ballroom AB, Santa Clara Convention Center

    Coverage: slide deck

  • Building Applications with Apache Cassandra

    by Nate McCall

    The database industry has been abuzz over the past year about NoSQL databases and their applications in the realm of solutions commonly placed under the ‘big data’ heading.

    This interest has filtered down to software development organizations, which have had to scramble to make sense of terminology, concepts and patterns, particularly in areas of distributed computing that were previously limited to academia and a small number of special-case applications.

    Like all of these systems, Apache Cassandra, which has quickly emerged as a best-of-breed solution in the NoSQL/Big Data space, still has a substantial learning curve to implement correctly.

    This tutorial will walk attendees through the fundamentals of Apache Cassandra, installing a small working cluster either locally or via a cloud provider, and practicing configuration and management of this cluster with the tools provided in the open source distribution.

    Attendees will then use this cluster to design a simple Java web application, gaining practical, hands-on experience in designing applications that take advantage of the massive performance gains and operational efficiency a correctly architected Apache Cassandra cluster can deliver.
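
    The tutorial itself targets Java; purely as an illustrative sketch, the same kind of data model can be exercised from Python with the DataStax driver against a local node. The keyspace, table and column names below are hypothetical.

    ```python
    # Illustrative sketch only: exercising a simple Cassandra data model from
    # Python with the DataStax driver. All names and values are placeholders.
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])          # one node of the tutorial cluster
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS demo.users (
            user_id text PRIMARY KEY,
            name    text,
            email   text
        )
    """)
    session.execute(
        "INSERT INTO demo.users (user_id, name, email) VALUES (%s, %s, %s)",
        ("u1", "Ada", "ada@example.com"),
    )
    for row in session.execute("SELECT user_id, name, email FROM demo.users"):
        print(row.user_id, row.name, row.email)
    ```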

    Attendees should leave the tutorial with hands-on knowledge of building a real, working distributed database.

    At 1:30pm to 5:00pm, Tuesday 28th February

    In Ballroom G, Santa Clara Convention Center

    Coverage: slide deck

  • The Business of Big Data

    by Mark Madsen

    Mark Madsen talks about how regular businesses will eventually embrace a data-driven mindset, with some trademark ‘Madsen’ historical background to put it in context. People throw around ‘the industrial revolution of data’ and ‘the new oil’ a lot without really thinking about what things like the scientific method, steam power, or petrochemicals actually changed.

    At 1:30pm to 2:10pm, Tuesday 28th February

    In GA K, Santa Clara Convention Center

    Coverage: slide deck

Wednesday 29th February 2012

  • Effective Data Visualisation

    by Hjalmar Gislason

    Data visualization is often where people realize the real value in underlying data. Good data visualization tools are therefore vital for many data projects to reach their full potential.

    Many companies have realized this and are looking for the best solutions to address their data visualization needs. There are plenty of tools to choose from, but even for relatively simple charting, many have found themselves with limited options. As the requirements pile up (cross-browser compatibility, server-side rendering, iOS support, interactivity, full control of branding, look and feel), the options narrow further, and you’ll find yourself compromising or, worse yet, building your own visualization library!

    In building our data publishing platform, DataMarket.com, we’ve certainly faced the aforementioned challenges. In this session we’ll share our findings and approach so that others can avoid our mistakes and benefit from our sometimes hard-won lessons.

    We’ll also share where we see the future of online data visualization heading: the technologies we’re betting on, and how things will become easier, visualizations more effective, code easier to maintain, and applications more user-friendly as these technologies mature and develop.

    At 11:30am to 12:10pm, Wednesday 29th February

    In Ballroom AB, Santa Clara Convention Center

    Coverage: slide deck

  • Building a Data Narrative: Discovering Haight Street

    by Jesper Andersen

    Data isn’t just for supporting decisions and creating actionable interfaces. Data can create nuance, giving rise to new understandings that lead to further questioning rather than just actionable decisions. In particular, curiosity and creative thinking can be driven by combining different data sets and techniques to develop a narrative that tells the story of a place: the emotions, history, and change embedded in the experience of that place.

    In this session, we’ll see how far we can go in exploring one street in San Francisco, Haight Street, and see how well we can understand its geography, ebbs and flows, and behavior by combining as many data sources as possible. We’ll integrate basic public data from the city, street and mapping data from OpenStreetMap, real estate and rental listings data, data from social services like Foursquare, Yelp and Instagram, and analyze photographs of streets from mapping services to create a holistic view of one street and see what we can understand from it. We’ll show how you can summarize this data numerically, textually, and visually, using a number of simple techniques.

    We’ll cover how traditional data analysis tools like R and NumPy can be combined with tools more often associated with robotics like OpenCV (computer-vision) to create a more complete data set. We’ll also cover how traditional data visualization techniques can be combined with mapping and augmented reality to present a more complete picture of any place, including Haight Street.
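
    As a hedged example of the OpenCV-plus-NumPy step mentioned above, here is a rough "greenery" score for a single street-level photo; the file name and HSV thresholds are arbitrary placeholders, not the session’s actual pipeline.

    ```python
    # Illustrative only: estimate the fraction of green-ish pixels (trees,
    # planters, painted storefronts) in one street photo.
    import cv2
    import numpy as np

    image = cv2.imread("haight_street_photo.jpg")   # returns None if the file is missing
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)

    # Crude mask for green-ish hues; the bounds are placeholders to tune.
    green = cv2.inRange(hsv, (35, 40, 40), (85, 255, 255))
    score = np.count_nonzero(green) / float(green.size)
    print("fraction of green pixels: {:.2%}".format(score))
    ```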

    At 1:30pm to 2:10pm, Wednesday 29th February

    In Ballroom AB, Santa Clara Convention Center

    Coverage: slide deck

  • SQLFire - An Ultra-fast, Memory-optimized Distributed SQL Database

    by Carter Shanklin and Jags Ramnarayan

    These days users won’t tolerate slow applications. More often than not, the database is the bottleneck in the application. To solve this, many people add a caching tier like memcache on top of their database. This has been extremely successful but also creates some difficult challenges for developers, such as mapping SQL data to key-value pairs, consistency problems and transactional integrity. When you reach a certain size you may also need to shard your database, leading to even more complexity.

    VMware vFabric SQLFire gives you the speed and scale you need in a substantially simpler way. SQLFire is a memory-optimized and horizontally-scalable distributed SQL database. Because SQLFire is memory oriented you get the speed and low latency that users demand, while using a real SQL interface. SQLFire is horizontally scalable, so if you need more capacity you just add more nodes and data is automatically rebalanced. Instead of sharding, SQLFire automatically partitions data across nodes in the distributed database. SQLFire even supports replication across datacenters, so users anywhere on the globe can enjoy the same fast experience.

    Stop by to learn more about how SQLFire delivers high performance without all the complexity.

    This session is sponsored by VMware

    At 1:30pm to 2:10pm, Wednesday 29th February

    In Ballroom H, Santa Clara Convention Center

    Coverage: slide deck

  • Linked Data: Turning the Web into a Context Graph

    by Leigh Dodds

    There are many different approaches to putting data on the web, ranging from bulk downloads through to rich APIs. These styles suit a range of different data processing and integration patterns. But the history of the web has shown that value and network effects follow from making things addressable.

    Facebook’s Open Graph, Schema.org, and a recent scramble towards a “Rosetta Stone” for geodata are all examples of a trend towards linking data across the web. Weaving data into the web simplifies data integration. Big Data offers ways to mine huge datasets for insight. Linked Data creates massively inter-connected datasets that can be mined or drawn upon to enrich queries and analysis.

    This talk will look at the concept of Linked Data and how a rapidly growing number of inter-connected databases, from a diverse range of sources, can be used to contextualise Big Data.
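
    A minimal sketch of the underlying pattern: dereference an addressable resource URI, get RDF back, and follow the links it contains. The DBpedia URL is just one convenient public example, not necessarily one used in the talk.

    ```python
    # Dereference a Linked Data resource and print a few of its triples.
    # Each triple links the resource to other addressable things on the web,
    # which is what lets queries be enriched with outside context.
    import rdflib

    g = rdflib.Graph()
    g.parse("http://dbpedia.org/data/Apache_Hadoop.rdf", format="xml")

    for subj, pred, obj in list(g)[:10]:
        print(subj, pred, obj)
    ```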

    At 4:50pm to 5:30pm, Wednesday 29th February

    In Ballroom E, Santa Clara Convention Center

    Coverage: slide deck

Thursday 1st March 2012

  • Mining the Eventbrite Social Graph for Recommending Events

    by Vipul Sharma

    Recommendation systems have become critical for delivering relevant and personalized content to your users. Such systems not only drive revenue and generate significant user engagement for web companies, but also are a great discovery tool for users. Facebook’s news feed, LinkedIn’s “People You May Know” and Eventbrite’s event recommendations are some great examples of recommendation systems.

    During this talk we will share the architecture and design of Eventbrite’s data platform and recommendation engine. We will describe how we mined a massive social graph of 18M users and 6B first-degree connections to provide relevant event recommendations. We will provide details of our data platform, which supports processing more than 2 TB of social graph data daily. We intend to describe how Hadoop is becoming the most important tool for data mining, and also discuss how machine learning is changing in the presence of Hadoop and big data.
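
    As a toy sketch (not Eventbrite’s actual pipeline) of the kind of Hadoop job involved, the script below could run as a Hadoop Streaming mapper and reducer to count, per user, the events attended by their first-degree connections; the tab-separated input format is hypothetical.

    ```python
    # Usage as a Hadoop Streaming pair: `python recs.py map` / `python recs.py reduce`.
    # Input lines are hypothetical tab-separated records: user \t friend \t event.
    import sys
    from itertools import groupby

    def mapper(lines):
        for line in lines:
            user, friend, event = line.rstrip("\n").split("\t")
            print("{0}\t{1}".format(user, event))          # key: user, value: candidate event

    def reducer(lines):
        keyed = (line.rstrip("\n").split("\t") for line in lines)
        for user, group in groupby(keyed, key=lambda kv: kv[0]):
            counts = {}
            for _, event in group:
                counts[event] = counts.get(event, 0) + 1
            best = sorted(counts, key=counts.get, reverse=True)[:5]
            print("{0}\t{1}".format(user, ",".join(best))) # top candidate events per user

    if __name__ == "__main__":
        (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)
    ```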

    We hope to provide enough detail that folks can learn from our experience while building their own data platforms and recommendation systems.

    At 10:40am to 11:20am, Thursday 1st March

    In Ballroom CD, Santa Clara Convention Center

    Coverage: slide deck

  • Mining Unstructured Data: Practical Applications

    by Anna Divoli and Alyona Medelyan

    The challenge of unstructured data is a top priority for organizations that are looking for ways to search, sort, analyze and extract knowledge from the masses of documents they store and create daily. Text mining uses knowledge-driven algorithms to make sense of documents much as a person would by reading them. Lately, text mining and analytics tools have become available via APIs, meaning that organizations can take immediate advantage of these tools. We discuss three examples of how such APIs were utilized to solve key business challenges.

    Most organizations dream of a paperless office, but still generate and receive millions of printed documents. Digitizing these documents and intelligently sharing them is a universal enterprise challenge. Major scanning providers offer solutions that analyze scanned and OCR’d documents and then store detected information in document management systems. This works well with pre-defined forms, but human interaction is required when scanning unstructured text. We describe a prototype built for the legal vertical that scans stacks of paper documents and, on the fly, categorizes them and generates meaningful metadata.

    In the areas of forensics, intelligence and security, manual monitoring of masses of unstructured data is not feasible. The ability to automatically identify people’s names, addresses, credit card and bank account numbers, and other entities is key. We will briefly describe a case study of how a major international financial institution is taking advantage of text mining APIs in order to comply with recent legislation.
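
    Purely as an illustration of the flavor of entity detection involved (the real APIs go far beyond this), a trivial pass for card-number-like strings might look like the following; the regex and sample text are made up.

    ```python
    # Illustrative only: flag runs of 13-16 digits (with optional spaces/dashes)
    # for review; a real system would add checks such as the Luhn test.
    import re

    CARD_LIKE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")

    def flag_card_numbers(text):
        return CARD_LIKE.findall(text)

    print(flag_card_numbers("Payment was made with 4111 1111 1111 1111 yesterday."))
    ```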

    In healthcare, although Electronic Health Records (EHRs) have become increasingly available over the past two decades, patient confidentiality and privacy concerns have been obstacles to utilizing the incredibly valuable information they contain for medical research. Several approaches to assigning unique encrypted identifiers to patient IDs have been reported, but each comes with drawbacks. For a number of medical studies, consistent uniform ID mapping is not necessary, and automated text sanitization can serve as a solution. We will demonstrate how sanitization has practical use in a medical study.

    At 1:30pm to 2:10pm, Thursday 1st March

    In Mission City B1, Santa Clara Convention Center

    Coverage: slide deck

  • Beyond Map/Reduce: Getting Creative with Parallel Processing

    by Ed Kohlwey

    While Map/Reduce is an excellent environment for some parallel computing tasks, there are many ways to use a cluster beyond Map/Reduce. Within the last year, YARN, the next-generation Map/Reduce framework, has been contributed to the Hadoop trunk, Mesos has been released as an open source project, and a variety of new parallel programming environments have emerged, such as Spark, Giraph, Golden Orb, Accumulo, and others.

    We will discuss the features of YARN and Mesos, and talk about obvious yet relatively unexplored uses of these cluster schedulers as simple work queues. Examples will be provided in the context of machine learning. Next, we will provide an overview of the Bulk-Synchronous-Parallel model of computation, and compare and contrast the implementations that have emerged over the last year. We will also discuss two other alternative environments: Spark, an in-memory version of Map/Reduce which features a Scala-based interpreter; and Accumulo, a BigTable-style database that implements a novel model for parallel computation and was recently released by the NSA.
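
    For readers new to the Bulk-Synchronous-Parallel model the session compares, here is a toy, single-process illustration of its superstep/message/barrier rhythm (Pregel/Giraph style); the graph is made up and no real framework is involved.

    ```python
    # Toy BSP illustration: each superstep, every vertex reads its inbox, maybe
    # updates its state, sends messages, and then all vertices hit a barrier.
    # Here the "computation" is hop counts from a source vertex.
    graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
    INF = float("inf")
    dist = {v: INF for v in graph}
    inbox = {v: [] for v in graph}
    inbox["a"].append(0)                      # source vertex starts with distance 0

    while any(inbox.values()):                # run supersteps until no messages remain
        outbox = {v: [] for v in graph}
        for vertex in graph:                  # "compute" phase for every vertex
            messages = inbox[vertex]
            if messages and min(messages) < dist[vertex]:
                dist[vertex] = min(messages)
                for neighbor in graph[vertex]:
                    outbox[neighbor].append(dist[vertex] + 1)
        inbox = outbox                        # barrier: messages delivered together

    print(dist)                               # hop counts from "a"
    ```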

    At 2:20pm to 3:00pm, Thursday 1st March

    In Ballroom CD, Santa Clara Convention Center

    Coverage: slide deck