Imagine a world where you receive an alert about an outage that hasn’t happened yet. At Netflix, we are building a Chaos Automation Platform (ChAP) to realize this vision. ChAP runs experiments to test that microservices are resilient to failures in downstream dependencies. These experiments run in production. ChAP siphons off a fraction of real traffic, injects failures, and measures how these failures change system behavior.
ChAP focuses on a specific type of failure: a failed RPC between microservices. Many types of failure at the level of an individual service can be modeled as an RPC failure or delay: a service that crashes, runs out of resources, or is heavily loaded will appear to a client as either an error response or increased latency.
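To make the idea of modeling service failures as RPC failures or delays concrete, here is a minimal sketch in Python. It is not ChAP's implementation. Every name in it (`FaultSpec`, `InjectedFailure`, `call_with_fault`, `in_experiment_group`) is hypothetical, and the modulo-based bucketing is only an assumed way of selecting a small fraction of traffic.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFailure(Exception):
    """Stands in for the error a client would see from a failed RPC (hypothetical)."""


@dataclass
class FaultSpec:
    """Describes the fault to inject into one downstream call (hypothetical)."""
    fail: bool = False          # raise an error instead of calling the dependency
    delay_seconds: float = 0.0  # extra latency added before the call


def in_experiment_group(request_id: int, percent: float) -> bool:
    """Deterministically place `percent` percent of request IDs in the experiment.

    Assumption: request IDs are integers spread evenly across buckets.
    """
    return (request_id % 10_000) < percent * 100


def call_with_fault(
    rpc: Callable[[], T],
    fault: Optional[FaultSpec],
    in_experiment: bool,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Invoke `rpc`, injecting the configured fault only for experiment traffic."""
    if in_experiment and fault is not None:
        if fault.delay_seconds > 0:
            sleep(fault.delay_seconds)  # models an overloaded or slow service
        if fault.fail:
            # models a crashed or unavailable service
            raise InjectedFailure("injected RPC failure")
    return rpc()
```

A resilient caller would catch the error (or time out on the delay) and serve a fallback. Comparing error rates or other metrics between the experiment group and an untouched control group then shows whether the injected fault changed system behavior.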
This talk will cover the motivation behind ChAP, how we implemented it, and how Netflix service teams are using it to identify systemic weaknesses.