Automating Chaos Experiments In Production

A session at QCon San Francisco 2016

Imagine a world where you receive an alert about an outage that hasn’t happened yet. At Netflix, we are building a Chaos Automation Platform (ChAP) to realize this vision. ChAP runs experiments to test that microservices are resilient to failures in downstream dependencies. These experiments run in production. ChAP siphons off a fraction of real traffic, injects failures, and measures how these failures change system behavior.
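As a rough illustration of the "siphon off a fraction of real traffic, then measure" idea, the sketch below routes a small share of requests into an experiment group and compares its error rate with a same-sized control group. This is not ChAP's implementation: the control group, the hash-based bucketing, and the names `assign_group`, `error_rate`, and `behavior_delta` are all assumptions made for the example.

```python
import hashlib
from typing import Sequence


def assign_group(request_id: str, fraction: float) -> str:
    """Deterministically place a request in 'experiment', 'control', or 'normal'.

    Hypothetical scheme: hash the request id to a number in [0, 1) and give
    the first `fraction` of that range to the experiment group and the next
    `fraction` to the control group.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    if bucket < fraction:
        return "experiment"
    if bucket < 2 * fraction:
        return "control"
    return "normal"


def error_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of failed requests; each outcome is True on success."""
    if not outcomes:
        return 0.0
    return sum(1 for ok in outcomes if not ok) / len(outcomes)


def behavior_delta(control: Sequence[bool], experiment: Sequence[bool]) -> float:
    """How much worse (positive) the experiment group did than the control."""
    return error_rate(experiment) - error_rate(control)
```

Hashing the request id keeps group assignment stable for a given request, so failures would only ever be injected into traffic that landed in the experiment group.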

ChAP focuses on a specific type of failure: a failed RPC between microservices. Many failures at the level of an individual service can be modeled as an RPC failure or delay: a service that crashes, runs out of resources, or is highly loaded appears to a client as either returning an error or responding with increased latency.
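To make the "error or increased latency" model concrete, here is a minimal sketch of a client-side wrapper that applies an injected fault before making a downstream call. The `Fault` type, the `InjectedFailure` exception, and `call_with_fault` are hypothetical names for this example; the abstract does not describe how ChAP actually injects failures.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


class InjectedFailure(Exception):
    """Raised in place of a real downstream error."""


@dataclass
class Fault:
    # Hypothetical fault description: fail outright, add latency, or both.
    fail: bool = False
    delay_seconds: float = 0.0


def call_with_fault(
    rpc: Callable[[], str],
    fault: Optional[Fault],
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Invoke `rpc`, first applying the injected fault if one is given."""
    if fault is not None:
        if fault.delay_seconds > 0:
            # Simulates a slow or overloaded dependency.
            sleep(fault.delay_seconds)
        if fault.fail:
            # Simulates a crashed or erroring dependency.
            raise InjectedFailure("injected downstream failure")
    return rpc()
```

With `fault=None` the call passes through unchanged, so the same code path can serve both ordinary traffic and traffic selected for an experiment.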

This talk will cover the motivation behind ChAP, how we implemented it, and how Netflix service teams are using it to identify systemic weaknesses.

About the speaker

Ali Basiri

Wreaking Havoc @ Netflix

When

Date Wed 9th November 2016

Session Hash Tag

#qconsf

Short URL

lanyrd.com/sfdtyd

Official session page

qconsf.com
