Large systems fail constantly and in a wide variety of ways. Planning for those failures and responding under pressure are vital to running a reliable service.
In this presentation, we will discuss how the Heroku engineering teams plan for failure. We will cover our on-call methodology, our incident response procedures, and the metrics we use to track our performance. Finally, since everyone loves war stories, we'll present a couple of case studies of outages we've experienced and how we responded to them.
28th–30th September 2011