OSCON Data 2011 schedule

Monday 25th July 2011

  • Welcome

    by Sarah Novotny and Bradford Stephens

    Opening remarks by the OSCON Data program chairs, Sarah Novotny and Bradford Stephens.

    At 9:00am to 9:05am, Monday 25th July

    In Oregon Ballroom 203/204, Oregon Convention Center

  • Finding the Perfect Match

    by Tom Quisel

    Dive into the distributed system that powers OkCupid’s match searches. Learn how we use C++, event-based programming, and SSDs to solve problems that crop up when building a high performance, high availability distributed system.

    At 9:05am to 9:20am, Monday 25th July

    In Oregon Ballroom 203/204, Oregon Convention Center

    Coverage video

  • Data Information & Context

    by Benjamin Black

    Keynote by Benjamin Black, Co-founder, fast_ip.

    At 9:20am to 9:40am, Monday 25th July

    In Oregon Ballroom 203/204, Oregon Convention Center

    Coverage video

  • What Would You Do With Your Own Google?

    by Steve Yegge

    It's 2021. You have a petabyte drive on your keychain, your startup company leases bulk cloud storage by the exabyte, and you have a million cores for data crunching. You can even have your own copy of the entire world's public semantic data. What do you do with it? If you're not sure yet, I've got plenty of ideas for you.

    At 9:40am to 10:00am, Monday 25th July

    In Oregon Ballroom 203/204, Oregon Convention Center

    Coverage video

  • Q & A

    An open microphone question and answer session with the morning's keynote speakers.

    At 10:00am to 10:10am, Monday 25th July

    In Oregon Ballroom 203/204, Oregon Convention Center

  • Introduction to Hadoop

    by Tom Hanlon

    Hadoop gives you the ability to process massive amounts of data at scale. This presentation will show you how Hadoop makes use of commodity hardware to let you build a system that scales, deals gracefully with the failure of individual nodes, and gives you the power of Map/Reduce to process petabytes.

    At 10:40am to 11:20am, Monday 25th July

    In C123, Oregon Convention Center

  • MySQL Replication Update

    by Lars Thalmann

    We describe the new replication features in MySQL 5.5 (GA) and MySQL 5.6 (Development release).

    At 10:40am to 11:20am, Monday 25th July

    In C121/122, Oregon Convention Center

    Coverage video

  • NoSQL @ Netflix

    by Sid Anand

    For the past 3 years, Netflix has been building a popular subscription-based service to stream movies and TV shows to game consoles, mobile devices, Blu-ray players, digital TVs, and more. With tens of millions of paying customers, Netflix has firmly established itself as a household brand in the US. Few people are aware that, while aggressively expanding our market and products, we have also moved our web and data infrastructure to Amazon Web Services. We currently use a large array of AWS’s offerings and deliver >90% of our web traffic from the cloud. While we have moved a significant portion of our web infrastructure to the cloud, the migration of our data has followed a slightly slower pace. Where we once relied solely on relational databases such as Oracle and MySQL, today we use a combination of technologies, including but not limited to SimpleDB, S3, Cassandra, and HBase. We also leverage open source caching technology like Memcached and Squid. This talk will detail the current evolution of Netflix’s cloud-based data infrastructure and specifically its use of open source technology.

    At 10:40am to 11:30am, Monday 25th July

    In B118-119, Oregon Convention Center

  • Playful Explorations of Public and Personal Data

    by Andrew Turner

    It’s easy to find and create data. But what are you going to do with it? Can I ask the world complex questions such as what’s the local crime rate, distance to metro, or rating of my local school? Can you combine these all together to rate houses you may want to buy? And how do you then connect back to your government and local businesses to engage in collaborative decision making?

    This talk will discuss how you should consider users and their personal interactions with data and information. We’ll also peel back the covers on how open source tools such as HBase, Cascading, Geos and Polymaps handle analyzing and streaming realtime data to maps and visualizations both on the web and to mobile devices.

    To illustrate what’s possible, we’ll dive through GeoCommons, a large online community of data sharing and community analytics that uses open source mapping visualization, Hadoop analysis, and mobile interfaces to provide this to the world. Users can even build and socialize their own analysis methods to share their expert knowledge with other users. We’ll also review how global organizations like the World Bank and United Nations are using these tools to connect with citizens in developing countries to empower them to make decisions on building investment and understanding how climate science may affect their areas.

    At 10:40am to 11:20am, Monday 25th July

    In C124, Oregon Convention Center

  • Architectural Anti-patterns for Data Handling

    by Gleicon Moraes

    Ever had to dig into a system that misused the most basic features of an RDBMS? Better yet, after the whole NoSQL storm, have you wondered why it didn't show up earlier, back when you had to twist your schema to fit into something it was not designed for? Check out this collection of anti-patterns and feel better knowing that you are not alone, and see how you can benefit from it even without big data around.

    At 11:30am to 12:10pm, Monday 25th July

    In C123, Oregon Convention Center

    Coverage slide deck

  • Developing and Deploying Hadoop Security

    by Owen O'Malley

    Adding security to an existing product is never easy, but our team at Yahoo added strong authentication to Apache Hadoop by integrating it with Kerberos. This project was delivered on time and is currently deployed on all of Yahoo's 40,000 Hadoop computers. Come learn how we added security to Hadoop and why it matters.

    At 11:30am to 12:10pm, Monday 25th July

    In C124, Oregon Convention Center

    Coverage video

  • Hadoop - Enterprise Data Warehouse Data Flow Analysis and Optimization

    by Aurelian Dumitru

    In this session Dell will discuss the analysis of the data types suitable for transfer between Hadoop and EDW, the EDW/Hadoop data lifecycle, data governance between Hadoop and DBMS, and ETL performance tuning and best practices (e.g. Hadoop/DBMS connector, node and network designs).

    At 11:30am to 12:10pm, Monday 25th July

    In C125/126, Oregon Convention Center

  • HandlerSocket: NoSQL via MySQL

    by Ryan Lowe and Haidong Ji

    With most modern web applications, there are requirements for both SQL access to complex data as well as simple Key-Value look-ups. This session will cover how to use the HandlerSocket Plug-In for MySQL to get exponentially faster look-ups for simple access patterns.

    At 11:30am to 12:10pm, Monday 25th July

    In C121/122, Oregon Convention Center

    Coverage video

  • The Right Tool For The Right Job: Choosing The Best Data Storage Option

    by Patrick Lightbody

    Between the NoSQL movement and new cloud offerings, it seems there are new storage options popping up every day. How do you select which one is the best for your project? The truth is that it's unlikely one option is best for all your needs. This session walks you through the various options considered by one startup and how it selected five separate storage engines - and has no regrets about doing so!

    At 11:30am to 12:10pm, Monday 25th July

    In B118-119, Oregon Convention Center

    Coverage video

  • Building Web Applications with MongoDB

    by Roger Bodamer

    In this workshop, one of the core MongoDB committers will present the fundamental principles of MongoDB, how to set up and interact with the database, and what to consider when building applications using a document-based data model.

    At 1:30pm to 2:10pm, Monday 25th July

    In B118-119, Oregon Convention Center

    Coverage slide deck

  • DataStax’ Brisk – A More Powerful, Real-time, And Easier To Deploy Hadoop, Powered By Apache Cassandra

    by Jonathan Ellis

    Brisk is an open-source Hadoop and Hive distro that utilizes Cassandra for its core services. Brisk provides integrated Hadoop MapReduce, Hive and job and task tracking, while providing an HDFS-compatible storage layer powered by Cassandra. By accelerating the time between data creation and analysis with DataStax’ Brisk, users experience greater reliability, simpler deployment and lower TCO.

    At 1:30pm to 2:10pm, Monday 25th July

    In C125/126, Oregon Convention Center

  • Ephemeral Hadoop Clusters in the Cloud

    by Greg Fodor

    The data & analytics teams at Etsy build up and tear down more than a thousand independent Hadoop clusters on EC2 each month. This talk discusses the benefits of this approach, where Elastic Map Reduce serves as a "meta-cluster" in which on-demand Hadoop clusters can be created, used, and shut down quickly and easily.

    At 1:30pm to 2:10pm, Monday 25th July

    In C121/122, Oregon Convention Center

    Coverage video

  • OpenTSDB: A Scalable, Distributed Time Series Database

    by Benoit Sigoure

    OpenTSDB is an open-source, distributed time series database designed to monitor large clusters of commodity machines at an unprecedented level of granularity. OpenTSDB enables operations teams to keep track in real-time of all the metrics exposed by operating systems, applications and network equipment, and makes the data easily accessible.

    At 1:30pm to 2:10pm, Monday 25th July

    In C124, Oregon Convention Center

    Coverage video

  • What Every Data Programmer Needs to Know About Disks

    by Ted Dziuba

    What happens when you write data to disk? We'll explore everything between your programming language and the spinning platters - both optimizations and dangerous pitfalls.

    At 1:30pm to 2:10pm, Monday 25th July

    In C123, Oregon Convention Center

  • Esperwhispering: get your real-time data game on

    by Theo Schlossnagle

    The art of dealing with real-time data is not new. In fact, much of the world's economy is propped up by making decisions on data in sub-millisecond time. The technology is there; we have the power. We'll take a whirlwind tour of the open-source Esper system and understand how to integrate it into your stack to enable rapid decision making on real-time data from anywhere in your architecture.

    At 2:20pm to 3:00pm, Monday 25th July

    In C123, Oregon Convention Center

  • MVCC Unmasked

    by Bruce Momjian

    Multiversion Concurrency Control (MVCC) allows Postgres to offer high concurrency even during significant database read/write activity. MVCC specifically offers behavior where "readers never block writers, and writers never block readers". This talk explains how MVCC is implemented in Postgres and highlights optimizations which minimize the downsides of MVCC. This talk is for advanced users.

    At 2:20pm to 3:00pm, Monday 25th July

    In C121/122, Oregon Convention Center

    Coverage video

  • Redis: CS101 Data Structures via the Network

    by Ezra Zygmuntowicz

    Redis is an entry in the new breed of NoSQL databases. But it takes a different approach that makes it much more interesting than most of the other key/value stores in the same category. Come learn what makes Redis so useful that it seems everyone is adding it to their toolbox.

    At 2:20pm to 3:00pm, Monday 25th July

    In B118-119, Oregon Convention Center

  • YARN - Next Generation Hadoop Map-Reduce

    by Arun C Murthy

    YARN is the next generation of Hadoop Map-Reduce designed to scale out much further while allowing for running applications other than pure Map-Reduce in a highly fault-tolerant manner.

    At 2:20pm to 3:00pm, Monday 25th July

    In C124, Oregon Convention Center

  • Distributed Data Analysis with Hadoop and R

    by Jonathan Seidman and Ramesh Venkatar

    An overview of the state of the art for bringing together the analytical power of the R language with the big data capabilities of Hadoop.

    At 3:30pm to 4:10pm, Monday 25th July

    In C123, Oregon Convention Center

    Coverage slide deck

  • MySQL for the Large Scale Social Games

    by Yoshinori Matsunobu

    We at DeNA (the largest social game provider in Japan) handle over 2 billion page views per day with MySQL. We heavily use SSDs and tune Linux. We run non-trivial solutions such as non-stop, automated MySQL master failover. We also use MySQL not only as a traditional RDBMS but also as an extremely high-performance NoSQL store. I'd like to introduce our MySQL solutions to make our social games scale better.

    At 3:30pm to 4:10pm, Monday 25th July

    In C121/122, Oregon Convention Center

    Coverage slide deck

  • Real-time Streaming Analysis for Hadoop and Flume

    by Aaron Kimball

    This talk introduces an open-source SQL-based system for continuous or ad-hoc analysis of streaming data built on top of Flume-based data collection for Hadoop. Attendees will understand how to use a new tool to extend their Hadoop data collection pipeline with real-time streaming analytics.

    At 3:30pm to 4:10pm, Monday 25th July

    In C124, Oregon Convention Center

    Coverage video

  • Whirr: Open Source Cloud Services

    by Tom White

    Apache Whirr is a way to run distributed systems - such as Hadoop, HBase, Cassandra, and ZooKeeper - in the cloud. Whirr provides a simple API for starting and stopping clusters for evaluation, test, or production purposes. This talk explains Whirr's architecture and shows how to use it.

    At 3:30pm to 4:10pm, Monday 25th July

    In B118-119, Oregon Convention Center

    Coverage video

  • Gearman: From the Worker's Perspective

    by Brian Aker

    Many people view topics like Map/Reduce and queue systems as advanced concepts that require in-depth knowledge and time-consuming software setup. Gearman is changing all that by making this barrier to entry as low as possible with an open source, distributed job queuing system.

    At 4:20pm to 5:00pm, Monday 25th July

    In B118-119, Oregon Convention Center

    Coverage video

  • InnoDB: Performance and Scalability Features

    by Calvin Sun and Inaam Rana

    There are many exciting InnoDB performance and scalability features in MySQL 5.5 and its upcoming release. But how do you best use them? What are the caveats? In this session, we will describe those performance and scalability features in depth. We will also present some benchmark results that explore the performance of those features.

    At 4:20pm to 5:00pm, Monday 25th July

    In C121/122, Oregon Convention Center

  • Querying Riak Just Got Easier - Introducing Secondary Indices

    by Rusty Klophaus

    The Basho engineering team has been working to make Riak more queryable with the addition of built-in indexing plus a SQL-style query language. In this talk, Rusty describes the usage, benefits, limitations, and evolution of this functionality, called Secondary Indices. He also covers the challenges and pitfalls of adding indexing to a distributed datastore.

    At 4:20pm to 5:00pm, Monday 25th July

    In C124, Oregon Convention Center