by Ben Fried
How do you create something truly fast? How do you create a truly distributed system?
While deploying a global HTTP CDN may sound intimidating, the concept and design are quite straightforward; implementation is where the devil plays. At every level of the stack there is a new set of challenges and bottlenecks to discover and work around. I'll take you through an in-the-weeds tour of the challenges we discovered building a CDN.
We'll cover hardware configuration, solid-state devices, kernel patches, kernel configuration and tuning, software configuration, memory tuning, and routing. More importantly, we'll cover the process for finding and eliminating these issues, and the monitoring and mindset required to make modern hardware deliver incredible performance.
We will detail several architectures we evaluated or operated to meet our needs: accepting a continuous influx of events, storing tens of billions of points as time-series, and offering rich and low-latency query semantics on them.
Expect to hear in particular about:
by John Hugg
The term Big Data describes a new class of database applications that need to process massive data volumes in two disparate states—real time and historical. In either state, the requirements of Big Data applications vastly exceed the capabilities of traditional, one-size-fits-all database systems. Most Big Data applications require MPP scale-out architectures and have the following characteristics:
In this talk, we will introduce a simple formula for all Big Data applications: Big Data = Fast Data + Deep Data. Through a use-case format, we will discuss the specialized requirements for real-time ("fast") and analytic ("deep") data management. We'll also explore ways in which popular business intelligence solutions can be used to implement real-time and historical analytics.
Large systems fail constantly and in a wide variety of ways. Planning for those failures and responding under pressure are vital to running a reliable service.
In this presentation, we will discuss the way that the Heroku engineering teams plan for failures. We will cover our on-call methodology, incident response procedures, and metrics we use to track our performance. Finally, since everyone loves war stories, we'll present a couple of case studies on outages that we've experienced and how we responded to them.
Ruby on Rails is a great framework for quickly building applications, but what happens when you are wildly successful and need to scale way up? This talk is a case study in the evolution of our Rails application from a monolithic "does everything" systems running on a hosted server to a service-oriented system running in the cloud. As part of this tale, we will also cover the migration from a single MySQL data store to an environment with sharded MySQL and several NoSQL variants. We will detail choices made along the way, the data used to make those choices, and distill the decision path down to a series of recommendations.
The audience will come away with concrete methods to scale their web applications and hints and heuristics to manage the evolution of their data.
by Hubert Fonseca and André Galvani
This work presents a new approach to data center monitoring. We use Complex Event Processing (CEP) to provide both efficient data visualization and detection of significant threat and opportunity patterns. Our architecture defines a data bus that receives events from several data sources: configuration items such as servers, routers, and applications send events to the data bus, while the CEP engine processes and correlates all of this data.
At runtime, sysadmins can easily express patterns like "warn my team when the session count falls more than 20% on a load balancer in less than one minute and the cache fails to connect to its backend" using the Event Processing Language (EPL), as well as create real-time and historical charts. Once a pattern is detected, the system can take the corresponding actions to notify the interested parties.
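As an illustration of the logic behind that quoted pattern, here is a minimal sketch in Python rather than EPL; the function name, data shapes, and thresholds are my own, not from the talk or its system.

```python
def should_alert(session_counts, cache_backend_ok):
    """Sketch of the quoted CEP pattern: alert when the session count
    on a load balancer falls more than 20% within the sampling window
    AND the cache fails to connect to its backend.

    session_counts: samples taken over roughly one minute, oldest first.
    cache_backend_ok: False when the cache cannot reach its backend.
    """
    if not session_counts or cache_backend_ok:
        return False                      # both conditions must hold
    start, end = session_counts[0], session_counts[-1]
    if start == 0:
        return False                      # avoid division by zero
    drop = (start - end) / start          # fractional fall over the window
    return drop > 0.20

# A 30% drop while the cache backend is down triggers the alert;
# a 10% drop does not, and neither does any drop while the cache is healthy.
```

A real CEP engine evaluates this continuously over sliding time windows as events arrive, rather than on a pre-collected list, which is precisely what makes EPL more convenient than hand-rolled code for this job.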
We have applied this solution at iG, one of the biggest Brazilian ISPs and content providers, where we created 18 different patterns to detect and prevent common infrastructure and application problems.
As CTO of a successful high-growth startup, and in my new stealth company, I've seen my share of challenging scenarios keeping a very busy PostgreSQL-based business online and responsive during tremendous growth. EC2 + PostgreSQL + PostGIS + Scala + Ruby + no downtime. Others can probably learn from my battle scars!
Running multiple PostgreSQL, Ruby, Solr, and Scala instances on the EC2 cloud opens up many possibilities for scaling infrastructure, but also introduces a number of challenges. From unpredictable disk I/O and RAM constraints to vanishing instances and replication headaches, this talk shares real-world experiences and solutions with Postgres 8.* and 9.* and the other technologies we used to build them.
It covers the topic through a series of anecdotes from a tiny team of developers supporting one of the biggest sites in the world (CNN.com). These stories range from humorous to horrifying, and they hold valuable lessons for attendees running or contemplating a similar architecture with PostgreSQL, MongoDB, Solr, Ruby, and Scala.
by Poul-Henning Kamp
Today's newspaper finds few buyers tomorrow, and an HTTP response arriving seconds after the request finds few readers. Getting the message efficiently delivered is an entirely different job than getting it written in the first place. This talk is about the thoughts behind and inspirations for the Varnish HTTP accelerator, which has set new standards for performance, flexibility, and price in HTTP delivery.
by Maxwell Luebbe, Raymond Blum and Dr. Jia Guo
Systems like GMail and Picasa keep massive amounts of data in the cloud, all of which has to be constantly backed up to prepare for the inevitable. Typical backup and recovery techniques don't scale, so Google has devised new methods for securing unprecedented volumes of data against every type of failure.
There are many unique challenges, both obvious and subtle, in delivering storage systems at this scale; we'll discuss these and their solutions as well as some alternatives that didn't make the grade.
by Robert Treat
Postgres has a long and stable legacy, but with that comes some steadfast design principles that don't always jibe with operability in large systems. Fixing these issues in-kernel is an epic undertaking in both code complexity and social complexity, and sometimes it just proves easier to do it your damn self.
In this talk, we'll take a deep dive into the development and refinement of a low-level bloat-removal/defragmentation system designed entirely outside the database kernel, and explore an adventure of wrong turns, bad bugs, and dead-end approaches. The adventure doesn't have a magical ending, but rather an acceptably happy one: a tool that can be used in highly concurrent, high-performance, zero-downtime environments.
We present our experience designing, implementing, and deploying a Node.js-based distributed system for analyzing system and application performance across a datacenter. Our system's design, and particularly the choice of programming environment, were driven by our goals of supporting real-time analysis of problems spanning hundreds of production systems, which requires that the system deal with large volumes of data with very low latency. We will briefly discuss these considerations and why we chose Node.js for the implementation. We will then present our actual experience building and deploying the software, including topics of software development speed; availability of libraries and tools for development, testing, and verification; difficulties observing and debugging Node applications (especially post-mortem); packaging issues related to lack of C++ binary compatibility; and other development and deployment issues. Finally, we will close with a demonstration of the facility itself, and some discussions of the production pathologies that it has found—including the results of using the facility to analyze its own performance.
by Adam Jacob
This talk lets the audience guide the presentation, and serves as a kind of round-up of topics such as infrastructure automation, cloud computing, configuration management tools, NoSQL, the CAP theorem and scalable web applications, building open source communities, and funding a start-up. The presentation contains 16 different segments, each a 5-minute round-up of its topic. The audience calls out which topic they want to hear, and is encouraged to ask questions as they come up.
This session will provide an introduction to scaling with MongoDB by one of the developers working on the project. MongoDB's architecture features built-in support for horizontal scalability, and high availability through replica sets. Auto-sharding allows users to easily distribute data across many nodes. Replica sets enable automatic failover and recovery of database nodes within or across data centers.
Half of the pundits claim that The Cloud is the future. The other half warn us that Big Data is coming. Real engineers know these two don't always marry easily. How do we build systems that really scale, inside or outside the cloud? What conventional approaches don't work? What new approaches are available? What must we sacrifice from using them, and what do we gain in return? How do the economics really shake out? What does the CAP theorem really mean? Which things you've heard about the cloud are just hype, and which are really revolutionary?
At Circonus we process a lot of data. We learned early on that some data can be sampled and some data cannot. The way you treat data when you "need it all" to make good sense of things is radically different than the way you must treat sampled datapoints.
This presentation will walk through the architectural evolution of our system as it had to scale to billions of events per day and trillions of datapoints for regression.
I hope you'll learn some of what we learned constructing a real-time, ad-hoc, global event-analysis software-as-a-service platform.
Cloudant was born of a desire to make scalable data storage and analysis simple, both for users and administrators. This talk will focus on the technologies and processes that have proven instrumental in the pursuit of that vision. Adam will discuss local storage engines that are insensitive to system crashes and distributed databases that remain available during network partitions. He'll describe the Erlang VM's code upgrades and how they can be leveraged for rapid, low-impact deployments via continuous integration and configuration management, and he'll draw on years of operational experience to present pitfalls encountered and lessons learned.
Many systems that speak over TCP have simple call-and-response protocols. These lend themselves well to extracting reasonably accurate request and response start and end times. This is a fabulously rich source of data that can be mined for many purposes. All that is needed is the first 384 bytes of each packet, which contain the IP and TCP headers but no payload. This is simple and non-intrusive to capture.
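To make the header-only capture concrete, here is a minimal sketch of parsing the fields that request/response timing relies on from a raw IPv4/TCP packet. The field offsets are the standard ones from RFC 791 and RFC 793; the function name and return shape are illustrative assumptions, not the speaker's actual tooling.

```python
import struct

def parse_ip_tcp(packet):
    """Extract the fields needed for call-and-response timing from a
    raw IPv4/TCP packet prefix; no payload bytes are required."""
    ver_ihl = packet[0]
    ihl = (ver_ihl & 0x0F) * 4                 # IP header length in bytes
    proto = packet[9]                          # 6 means TCP
    src_ip = packet[12:16]
    dst_ip = packet[16:20]
    # TCP header begins immediately after the IP header (RFC 793 layout)
    sport, dport = struct.unpack('!HH', packet[ihl:ihl + 4])
    seq, ack = struct.unpack('!II', packet[ihl + 4:ihl + 12])
    flags = packet[ihl + 13]
    return {
        'proto': proto,
        'src': '.'.join(map(str, src_ip)),
        'dst': '.'.join(map(str, dst_ip)),
        'sport': sport, 'dport': dport,
        'seq': seq, 'ack': ack,
        'psh': bool(flags & 0x08),             # PSH often delimits messages
    }
```

Pairing a request packet with the response flowing the other way on the same 4-tuple, and subtracting their capture timestamps, yields the per-call latency the abstract describes, all without ever looking at payload bytes.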
What can we do with this data? Here are some examples:
This analysis technique works for any system, at any layer in the stack, as long as the TCP-based protocol follows call-and-response semantics. Examples of systems that conform to this are most databases, including relational and NoSQL databases; key-value caches such as memcached; and many types of HTTP interactions.
This talk will introduce tools and techniques for capturing and analyzing the network traffic to give deeper insight into system behavior. This knowledge can reveal patterns that can be predictive of emerging problems, such as impending saturation or lock contention in an architectural layer. This overall system performance analysis is a valuable technique for finding problems without needing to analyze each layer independently.
by Chris Burroughs
Large infrastructures generate torrents of data. As this flow of data increases, so does the pressure to avoid outages, minimize downtime, and maximize efficiency. The goals of insight and informed action go unmet if they depend on manual correlation of traditional charts as the sheer volume of data overwhelms the common approaches and the operators using them.
Good visualizations are a critical tool for transforming raw data into actionable information. The right visualizations and task-oriented interfaces enable operators to simply and efficiently explore their data, discover problems, and find solutions.
We will present the principles underlying the information dashboards Boundary developed to visualize trillions of data points and demonstrate how they can be used in your own systems. You will gain an understanding of how visualization can be applied to effectively solve common data problems encountered scaling infrastructures and businesses.
SimpleGeo provides hosted services for location-aware applications, one of which is a cloud spatial database. One can make puns and jokes about "a cloud within a cloud" until blue in the face, but the reality is that accomplishing such a thing is a non-trivial technical endeavor. Designing for failure is a hard requirement by definition, which turns out to be a blessing in disguise.
The competitive features of our database include built-in scaling, high availability, and data redundancy. Come hear how we harness various open source software and cloud services to back up the promises made to our customers, allowing them to store and query their data, big and small, without worrying about what's happening behind the scenes. We've built an infrastructure with the cloud mentality of a flexible, reliable, and highly available service baked in at every layer, and we have stumbled on something completely unexpected at almost every step of the way.
The talk will include a high-level overview of our infrastructure, as well as brief deep dives into pieces specific to our use case (including bits we've built out ourselves). We'll be sure to include juicy war stories about handling the frequent failures guaranteed by running on AWS, and kernel-level issues we've encountered while running on virtual instances.
by Greg Lindahl
Many NoSQL databases eschew RAID and use a 3-replica system for protecting against data loss. These systems are tricky to operate at scale, and only a limited number of organizations are attempting to operate high-availability production systems of this type. At blekko, a new web-scale search engine, we have operated our home-grown NoSQL database for 3.5 years with high uptime and very graceful degradation in the face of hardware failures - come learn about the features we found were key for achieving this.
by Wez Furlong
There's nothing quite like learning from failures. Even, or perhaps especially, systems with good architectures have failure modes that can cause acute pain. In this session, we'll cover some examples of architectural choices and failure scenarios that we at Message Systems have experienced in production environments with high volume and high concurrency messaging requirements.
by Gavin M. Roy
At myYearbook.com we use RabbitMQ as the core messaging platform for remote asynchronous task processing, cross-application communication, and general information routing and delivery. Pushing RabbitMQ's limits has given us good insight into how to scale RabbitMQ clusters while avoiding its common pitfalls.
In this talk we will cover RabbitMQ basics, including common message-routing patterns, clustering, monitoring, and management. We will review myYearbook's use of and integrations with RabbitMQ, the bottlenecks we encountered, and the strategies we used to keep scaling RabbitMQ to 11. In addition, we will cover alternative methods for message publishing, such as publishing from web applications to RabbitMQ over HTTP using the JSON-RPC-Channel plugin.
by Rod Cope
The concept of cloudbursting is very appealing. Who wouldn't want to maximize control and minimize costs by running a few servers in-house and borrowing public cloud resources when the odd demand spike occurs? This arrangement is particularly appealing if you already have a reason to run at least part of your system in-house, like Big Data that would cost a fortune to store in the public cloud. Unfortunately, the implementation is not so straightforward. You have to keep two asynchronous systems in lockstep across the public Internet while ensuring security, handling unpredictable latency, and optimizing for data locality. Come to this session to learn how this can actually work with Amazon EC2 and SQS. The implementation is based on Ruby on Rails and Resque/Redis, but the concepts are broadly applicable.
Managing large volumes of data can be challenging—sometimes just getting the data requires moving mountains! Recently, the Exceptional Performance team at Yahoo! was challenged to develop a massive business intelligence system for user performance data. And of course, it had to be fast! Our proposed solution broke all records for data scaling and attempted to optimize the entire data pipeline. Come and hear how we used groundbreaking technologies and innovative techniques to build one of the world's largest and most powerful data solutions for performance data.
by Ross Snyder
You don't become the world's largest handmade marketplace without learning how to grow quickly. Etsy was launched in 2005 and soon had to figure out how to meet increasing demand. A novel approach was taken that gave the site some breathing room, but which—for a variety of reasons we'll explore—ultimately resulted in an architectural dead end. In order to survive, a course correction was made, company culture was shifted, and—thanks to battle-tested scaling strategies—Etsy's growth is now exciting instead of ominous. Let's look back at this turbulent period in Etsy's architectural history and examine the cultural and technical challenges that got us to where we are today.
As system engineers, we have a wide range of tools and techniques to ensure our web-based business stays in business 24/7. To keep load manageable, we deploy features to small portions of our user base or coordinate with the business to spread out traffic-generating events. When loads increase, we turn on caching or simply turn off features to preserve capacity.
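The "deploy features to small portions of our user base" lever described above is commonly implemented as a deterministic percentage rollout. Here is a minimal sketch under that assumption; the function name and hashing scheme are illustrative, not taken from the talk.

```python
import zlib

def feature_enabled(feature, user_id, rollout_percent):
    """Deterministically enable a feature for a fixed slice of users.

    Hashing feature+user keeps each user's assignment stable across
    requests, so a partial rollout doesn't flicker on and off for
    the same person. Setting rollout_percent to 0 is the
    'turn off features to preserve capacity' lever.
    """
    if rollout_percent <= 0:
        return False
    if rollout_percent >= 100:
        return True
    # crc32 gives a cheap, stable hash; bucket users into 0..99
    bucket = zlib.crc32(f"{feature}:{user_id}".encode()) % 100
    return bucket < rollout_percent
```

Because the hash includes the feature name, different features slice the user base differently, so no single group of users bears the brunt of every experiment.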
But what happens if your business model is based on driving peak loads, where full site functionality is required to serve perishable customer interest?
This talk focuses on the speaker's experience in the design, development, operation, evolution and repair of at least one infrastructure for such a business.
28th–30th September 2011