Despite the implications of its most breathless proponents, the trend towards building putatively reliable systems out of wholly unreliable components does not mean the end of software defects: even where bugs do not result in system outage, they can induce misbehavior, degradations of service, and cascading failures that themselves can lead to outage. So it remains critical to debug our software -- and more critical than ever that we are able to do so in production environments. This talk will discuss the essential technologies for debugging such systems: postmortem debugging (when failure is fatal) and dynamic instrumentation (when failure is transient). We will discuss the history, current state of the art, intersections, and open problems of these technologies -- and how they have been shaped in the kiln of unspeakable pain that is production systems failure.
by Evan Cooke
Many APIs are overly complex, poorly documented, and not designed with the customer in mind. In this talk we'll explore some of the lessons we've learned at Twilio about building a team and designing simple, powerful APIs that focus on the customer instead of back-end requirements.
by Sid Anand
In 2008, Netflix began to see traction in its new mode of video delivery -- video streaming to devices in the home and in your pocket. As part of this transition, we are witnessing a shift in our traffic patterns and in the expectations of our customers regarding availability. Specifically, as we become indistinguishable from TV, we cannot afford service downtimes, planned or otherwise. To complicate matters further, our systems operate in AWS, where we have less control over networking, persistence, virtualization, etc. In an effort to build a highly available system using a sometimes-unavailable cloud, Netflix must adopt new deployment paradigms and new ways of testing reliability. From red-black deployments to the Simian Army to reliable Cassandra clusters to new features in our platform service layer and more, we are investing in reliability across the board.
14th–18th November 2011