by Sarah Novotny and Bradford Stephens
Opening remarks by the OSCON Data program chairs, Sarah Novotny and Bradford Stephens.
by Tom Quisel
Dive into the distributed system that powers OkCupid’s match searches. Learn how we use C++, event-based programming, and SSDs to solve problems that crop up when building a high-performance, high-availability distributed system.
Keynote by Benjamin Black, Co-founder, fast_ip.
by Steve Yegge
It's 2021. You have a petabyte drive on your keychain, your startup company leases bulk cloud storage by the exabyte, and you have a million cores for data crunching. You can even have your own copy of the entire world's public semantic data. What do you do with it? If you're not sure yet, I've got plenty of ideas for you.
An open microphone question and answer session with the morning's keynote speakers.
by Tom Hanlon
Hadoop gives you the ability to process massive amounts of data at scale. This presentation will show you how Hadoop makes use of commodity hardware to let you build a system that scales, deals gracefully with the failure of individual nodes, and gives you the power of Map/Reduce to process petabytes.
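For readers new to the Map/Reduce model mentioned above, here is a minimal single-process sketch of its three stages (map, shuffle, reduce) applied to word counting. It is an illustration only and does not use Hadoop's API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical and chosen for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in one process over a list of strings."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

In a real cluster the map and reduce calls would run on different nodes and the shuffle would move data over the network; the sketch keeps only the data flow.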
We describe the new replication features in MySQL 5.5 (GA) and MySQL 5.6 (Development release).
by Sid Anand
For the past 3 years, Netflix has been building a popular subscription-based service to stream movies and TV shows to game consoles, mobile devices, BluRay players, digital TVs, etc… With tens of millions of paying customers, Netflix has firmly established itself as a household brand in the US. Few people are aware that, while aggressively expanding our market and products, we have also moved our web and data infrastructure to Amazon Web Services. We currently use a large array of AWS’s offerings and deliver >90% of our web traffic from the cloud. While we have moved a significant portion of our web infrastructure to the cloud, the migration of our data has followed a slightly slower pace. Where we once solely relied on relational databases such as Oracle and MySQL, today we use a combination of technologies, including but not limited to SimpleDB, S3, Cassandra, and HBase. We also leverage open source caching technology like Memcached and Squid. This talk will detail the current evolution of Netflix’s cloud-based data infrastructure and specifically its use of open source technology.
It’s easy to find and create data. But what are you going to do with it? Can I ask the world complex questions such as what’s the local crime rate, distance to metro, or rating of my local school? Can you combine these all together to rate houses you may want to buy? And how do you then connect back to your government and local businesses to engage in collaborative decision making?
This talk will discuss how you should consider users and their personal interactions with data and information. We’ll also peel back the covers on how open source tools such as HBase, Cascading, Geos and Polymaps handle analyzing and streaming realtime data to maps and visualizations both on the web and to mobile devices.
To illustrate what’s possible, we’ll dive through GeoCommons, a large online community of data sharing and community analytics that uses open source mapping visualization, Hadoop analysis, and mobile interfaces to provide this to the world. Users can even build and socialize their own analysis methods to share their expert knowledge with other users. We’ll also review how global organizations like the World Bank and United Nations are using these tools to connect with citizens in developing countries to empower them to make decisions on building investment and understanding how climate science may affect their areas.
Ever had to dig into a system that misused the most basic features of an RDBMS? Better yet, after the whole NoSQL storm, have you wondered why it didn't show up sooner, back when you had to twist your schema to fit something it was not designed for? Check out this collection of anti-patterns, feel better knowing you are not alone, and see how you can benefit from it even without big data around.
Adding security to an existing product is never easy, but our team at Yahoo added strong authentication to Apache Hadoop by integrating it with Kerberos. This project was delivered on time and is currently deployed on all of Yahoo's 40,000 Hadoop computers. Come learn how we added security to Hadoop and why it matters.
In this session, Dell will discuss the analysis of data types suitable for transfer between Hadoop and the EDW, the EDW/Hadoop data lifecycle, data governance between Hadoop and the DBMS, and ETL performance tuning and best practices (e.g., Hadoop/DBMS connectors, node and network designs).
by Ryan Lowe and Haidong Ji
Most modern web applications require both SQL access to complex data and simple key-value look-ups. This session will cover how to use the HandlerSocket Plug-In for MySQL to get exponentially faster look-ups for simple access patterns.
Between the NoSQL movement and new cloud offerings, it seems there are new storage options popping up every day. How do you select which one is the best for your project? The truth is that it's unlikely one option is best for all your needs. This session walks you through the various options considered by one startup and how it selected five separate storage engines, with no regrets about doing so!
In this workshop, one of the core MongoDB committers will present the fundamental principles of MongoDB, how to set up and interact with the database, and what to consider when building applications using a document-based data model.
Brisk is an open-source Hadoop and Hive distro that utilizes Cassandra for its core services. Brisk provides integrated Hadoop MapReduce, Hive and job and task tracking, while providing an HDFS-compatible storage layer powered by Cassandra. By accelerating the time between data creation and analysis with DataStax’ Brisk, users experience greater reliability, simpler deployment and lower TCO.
by Greg Fodor
The data & analytics teams at Etsy build up and tear down more than a thousand independent Hadoop clusters on EC2 each month. This talk discusses the benefits of this approach, where Elastic Map Reduce serves as a "meta-cluster" in which on-demand Hadoop clusters can be created, used, and shut down quickly and easily.
OpenTSDB is an open-source, distributed time series database designed to monitor large clusters of commodity machines at an unprecedented level of granularity. OpenTSDB enables operations teams to keep track in real-time of all the metrics exposed by operating systems, applications and network equipment, and makes the data easily accessible.
by Ted Dziuba
What happens when you write data to disk? We'll explore everything between your programming language and the spinning platters - both optimizations and dangerous pitfalls.
The art of dealing with real-time data is not new. In fact, much of the world's economy is propped up by making decisions on data in under a millisecond. The technology is there, we have the power. We'll take a whirlwind tour of the open-source Esper system and understand how to integrate it into your stack to enable rapid decision making on real-time data from anywhere in your architecture.
Multiversion Concurrency Control (MVCC) allows Postgres to offer high concurrency even during significant database read/write activity. MVCC specifically offers behavior where "readers never block writers, and writers never block readers". This talk explains how MVCC is implemented in Postgres and highlights optimizations which minimize the downsides of MVCC. This talk is for advanced users.
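To make the "readers never block writers" idea concrete, here is a toy sketch of multiversioning: every write appends a new version, and a reader works from a snapshot that ignores versions committed after it was taken. This is not how Postgres is implemented (its per-row transaction visibility rules are considerably more involved); the classes `ToyMVCCStore` and `ToySnapshot` are hypothetical names for this illustration.

```python
import itertools


class ToyMVCCStore:
    """Toy multiversion store: each write appends a version instead of overwriting."""

    def __init__(self):
        self._versions = {}                    # key -> list of (commit_id, value)
        self._commit_ids = itertools.count(1)  # monotonically increasing commit ids
        self._last_commit = 0

    def write(self, key, value):
        """Commit a new version of key; older versions stay readable."""
        commit_id = next(self._commit_ids)
        self._versions.setdefault(key, []).append((commit_id, value))
        self._last_commit = commit_id
        return commit_id

    def snapshot(self):
        """Return a reader pinned to the data committed so far."""
        return ToySnapshot(self, self._last_commit)


class ToySnapshot:
    """Read-only view that sees only versions committed at or before its creation."""

    def __init__(self, store, as_of):
        self._store = store
        self._as_of = as_of

    def read(self, key):
        """Return the newest visible version of key, or None if there is none."""
        versions = self._store._versions.get(key, [])
        visible = [value for commit_id, value in versions if commit_id <= self._as_of]
        return visible[-1] if visible else None
```

A snapshot taken before a write keeps returning the old value, so the reader never waits on the writer. The cost is that old versions accumulate, which is one of the downsides the talk's optimizations address.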
Redis is an entry in the new breed of NoSQL databases. But it takes a different approach that makes it much more interesting than most of the other key/value stores in the same category. Come learn what makes Redis so useful that it seems everyone is adding it to their toolbox.
YARN is the next generation of Hadoop Map-Reduce designed to scale out much further while allowing for running applications other than pure Map-Reduce in a highly fault-tolerant manner.
by Jonathan Seidman and Ramesh Venkatar
An overview of the state of the art for bringing together the analytical power of the R language with the big data capabilities of Hadoop.
We at DeNA (the largest social game provider in Japan) handle over 2 billion page views per day with MySQL. We make heavy use of SSDs and tune Linux. We run non-trivial solutions such as non-stop, automated MySQL master failover. We also use MySQL not only as a traditional RDBMS but also as an extremely high-performance NoSQL store. I'd like to introduce our MySQL solutions to make our social games scale better.
This talk introduces an open-source SQL-based system for continuous or ad-hoc analysis of streaming data built on top of Flume-based data collection for Hadoop. Attendees will understand how to use a new tool to extend their Hadoop data collection pipeline with real-time streaming analytics.
by Tom White
Apache Whirr is a way to run distributed systems - such as Hadoop, HBase, Cassandra, and ZooKeeper - in the cloud. Whirr provides a simple API for starting and stopping clusters for evaluation, test, or production purposes. This talk explains Whirr's architecture and shows how to use it.
by Brian Aker
Many people view topics like Map/Reduce and queue systems as advanced concepts that require in-depth knowledge and time consuming software setup. Gearman is changing all that by making this barrier to entry as low as possible with an open source, distributed job queuing system.
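As a rough picture of the job-queue pattern described above, the sketch below has a client submit jobs to a queue while a pool of workers pulls and executes them. It is an in-process stand-in built on Python's standard library, not Gearman's API: Gearman runs a separate job server with clients and workers on different machines. The function name `run_jobs` and its parameters are hypothetical.

```python
import queue
import threading


def run_jobs(function, payloads, worker_count=3):
    """Run function over payloads using a shared job queue and a pool of workers."""
    jobs = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = jobs.get()
            if item is None:          # sentinel: no more work for this worker
                break
            job_id, payload = item
            outcome = function(payload)
            with lock:                # results dict is shared between workers
                results[job_id] = outcome

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for job_id, payload in enumerate(payloads):
        jobs.put((job_id, payload))
    for _ in threads:
        jobs.put(None)                # one sentinel per worker
    for thread in threads:
        thread.join()
    # Return results in submission order, regardless of which worker ran each job.
    return [results[job_id] for job_id in range(len(payloads))]
```

The client only needs to know the job's name and payload, not which worker runs it; that decoupling is what keeps the barrier to entry low.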
by Inaam Rana and Calvin Sun
There are many exciting InnoDB performance and scalability features in MySQL 5.5 and its upcoming release. But how do you best use them? What are the caveats? At this session, we will describe those performance and scalability features in depth. We will also present some benchmark results that explore the performance of those features.
The Basho engineering team has been working to make Riak more queryable with the addition of built-in indexing plus a SQL-style query language. In this talk, Rusty describes the usage, benefits, limitations, and evolution of this functionality, called Secondary Indices. He also covers the challenges and pitfalls of adding indexing to a distributed datastore.
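The general idea of a secondary index on a key/value store can be sketched in a few lines: alongside the primary key-to-object map, the store maintains a second map from indexed field values back to primary keys and keeps it in sync on every write. The sketch below is a single-node toy and is not Riak's API or implementation; `ToyIndexedStore` and its methods are hypothetical names.

```python
from collections import defaultdict


class ToyIndexedStore:
    """Toy key/value store that maintains a secondary index on declared fields."""

    def __init__(self):
        self._objects = {}              # primary key -> (value, index_fields)
        self._index = defaultdict(set)  # (field, field_value) -> set of primary keys

    def put(self, key, value, index_fields=None):
        """Store value under key and index it by the given field/value pairs."""
        self.delete(key)  # drop stale index entries if the key already exists
        fields = dict(index_fields or {})
        self._objects[key] = (value, fields)
        for field, field_value in fields.items():
            self._index[(field, field_value)].add(key)

    def delete(self, key):
        """Remove key and its index entries; a missing key is ignored."""
        _, fields = self._objects.pop(key, (None, {}))
        for field, field_value in fields.items():
            self._index[(field, field_value)].discard(key)

    def get(self, key):
        """Look up a value by primary key."""
        return self._objects[key][0]

    def query(self, field, field_value):
        """Return the sorted primary keys whose indexed field equals field_value."""
        return sorted(self._index.get((field, field_value), ()))
```

On one machine, keeping the two maps consistent is trivial. In a distributed store the object and its index entries may live on different nodes, which is where the challenges the talk covers come from.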