Opening remarks by the OSCON Data program chairs, Sarah Novotny and Bradford Stephens.
Much has been made of scalability as a driver for choosing a database, but the choice of a database influences much more than the scaling architecture. Different database choices drive different data models which in turn influence the development process.
Keynote by Adrian Cockcroft, Cloud Architect, Netflix.
by Brian Aker
We love data, and today we generate data in astronomical amounts. When we hit save on a document, snap a photo, or fill out a form online, we want to know that this data will persist, and we want to know that we can share, access, or reference it in the future. For any meaningful use, we need to know how data relates to other data.
The first OSCON Data Innovation Award winner will be announced.
PostgreSQL continues to provide a major release every year full of improvements, better performance and features that measure up to the most popular commercial databases. Our 2011 release, 9.1, is no exception!
Imagine for a moment doing a JOIN on two HBase tables. Crazy talk, right? Well, now you can, thanks to Hive. True, it is only meant to be used in a batch context, but we have been doing it for a few months now at StumbleUpon, and our analysts and engineers love it. This presentation will cover how the Hive-HBase integration works and how we use it at our company.
by John Hugg
In this talk, we will introduce a simple formula for all Big Data applications: Big Data = Fast Data + Deep Data. Through a use-case format, we will discuss the specialized requirements for real-time (“fast”) and analytic (“deep”) data management.
by Jay Kreps
The last few years have brought a wealth of new data technologies organized around horizontal scalability. This talk will cover the essential infrastructure areas: real-time stream processing, offline data crunching, large-scale data deployments and live serving. The focus will be on how these ingredients come together to enable innovative data-driven products at LinkedIn.
In November, Facebook launched a new version of Messages that combines chat, SMS, email, and Messages into a real-time conversation. Facebook relies on Apache HBase, a NoSQL-style database, for storing this real-time message data. This talk will elaborate on our decision process, system configuration, scaling issues, and advantages gained by choosing Open Source.
by Jeff Hamann
Learn how to cobble together a PostgreSQL database, install a few handy R packages, a pinch of language extensions, and a handful of publicly available data to generate a forest monitoring platform to help landscape managers make better decisions using basic design-engineering paradigms to perform quick trade-off analyses.
The story of the development team and what lessons we learned in building Open Legislation - an open government platform. It will detail our transition from a MySQL back end to an application fully powered by Lucene, the data quality and efficiency issues that we’ve had to address, and how we’re now trying to rebuild internal trust after our iterative and initially shaky development process.
by Bill Fox
A big data case study with the NY Medicaid Inspector General's Office and HPCC Systems from LexisNexis.
by Andrew Aksyonoff and Rich Kelm
Whether you're a beginner Web guy or a veteran DBA, whether you get your hands dirty with code or just manage systems, you still must know algorithms. How come? Because that knowledge enables you to optimize your work, conduct correct benchmarks, and make educated decisions. We'll show you how knowing only a little about SQL internals can help so much with tuning things.
by Tom Wilkie
The standard Linux storage stack wasn't designed for write-heavy big data workloads, nor is it well-suited to modern hardware: large, slow SATA disks, SSDs or many cores. Castle, an open-source project, is a ground-up overhaul of RAID, file systems, and the POSIX interface.
Keeping a busy site going when you don't have a lot of servers or developer resources can be a struggle. Hear what we did at Daily Kos to make the most of what we had to bring MySQL in line, make it quick, and keep the users and the boss happy.
One of the challenges that comes with moving to MongoDB is figuring out how to best model your data. While most developers have internalized the rules of thumb for designing schemas for RDBMSs, these rules don't always apply to MongoDB.
Synthetic biology is a new field where basic biological components can be engineered to create something new. It often involves DNA synthesizers, ligation, promoters, and polymerase chain reaction -- which may or may not be safe for your in silico environment. However, as the size and complexity of the systems increase, tools become more and more important, thus CAD for biology has emerged.
Building large data applications can present a unique set of technical challenges because things that often work well in the conventional development environment can become incredibly arduous or expensive when applied on a much bigger scale. This talk will cover some of those challenges and potential solutions for each.
We'll present the architecture and implementation of a Node.js/DTrace-based distributed platform for analyzing the performance of cloud applications in real-time. We'll do a live demo on a real, internet-facing cloud and discuss some of the interesting performance pathologies we've found and explained using this tool.
by Brian Aker
Ever wondered what would happen if you could rethink a decade's worth of design changes? Drizzle is a redesign of the MySQL server targeted at web development and cloud infrastructure. Update yourself on the latest features and use cases for Drizzle7, and on what is in store for the near future.
by Erik Onnen
This talk will cover lessons learned in building Urban Airship's large-scale data warehouse in EC2 including PostgreSQL, Kafka, Cassandra, HBase and Hadoop.
This language-agnostic proposal focuses upon concepts and strategies critical to the design and implementation of asynchronous systems and data processing layers. Key components include a survey of implementation strategies for non-blocking edge tiers, patterns for building out a distributed worker / processing tier, along with several horror stories of cascading failures and their resolution.
by Christine White
Sharing data is critical in a world where crisis can occur at any moment. Often, valuable data is stored in disparate locations with no information on how to access it. This presentation discusses spatial data discovery and open source tools for implementing a data-sharing catalog. Esri’s Geoportal Server will be used to show sharing and discovery in action. Talk is open to all attendees.
If you've ever had to move from data center to data center or to the cloud, or from old hardware to new hardware, you know that it's even more painful than moving house. In this presentation, survivors will tell you how to stay sane (and how to get it right) with a case study from Mozilla: moving 30TB of crash reports with no downtime in data collection.
by Adam Silberstein
I will give an overview of PNUTS, a large-scale, geographically-replicated serving data store in widespread use at Yahoo! I will introduce key use cases, the main system components, key design decisions, and ongoing work.
by Rob Treat
Everyone thinks they know what sharding is and how to do it, but simple horizontal read scaling is small potatoes. In this talk we'll focus on the sharding pattern for large scale read/write architectures, based on real world implementations. Supporting millions of users on commodity hardware doesn't need magical software, just careful application of the right scalability pattern.
Time series sensors are being integrated ubiquitously in places like cell phones, environmental sensors, and the smart grid. As this type of data scales out, RDBMS systems strain to keep up with the high insertion rates and real-time query requirements. In this talk we introduce “Lumberyard”, which provides scalable indexing and low-latency fuzzy pattern searching for time series data.
Location-based services are hot, but geographic datasets are complex. This shouldn’t put you off writing awesome location-aware services. This talk will show how to create spatial models and query the Open Street Map dataset together with social data using the Neo4j graph database.
A talk about scaling foursquare using MongoDB and Scala.
25th–27th July 2011