With the recent surge in highly available microservices handling heavy incoming traffic, it is increasingly important to know how your service is performing right now and to be able to diagnose production issues quickly. It took us a while to learn how to produce meaningful graphs and alerts that help us truly understand our application's performance.
We initially found that most developers did not understand what they were measuring and that many of the graphs caused confusion. In this talk I show how we collect application performance metrics at Sky.
I focus on the use of histogram metrics to monitor response times, explain how reservoir sampling can help, and show the trade-offs among reservoir types. Finally, I illustrate, with real-world examples, some good and bad practices when monitoring response times.
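To make the reservoir sampling idea concrete, here is a minimal sketch of a uniform reservoir (the classic "Algorithm R") that keeps a fixed-size random sample of response times and reads percentiles from it. It is an illustration only, not the implementation used at Sky: the class name `UniformReservoir`, its methods, and the default size of 1000 are all assumptions made for this example.

```python
import random


class UniformReservoir:
    """Hypothetical fixed-size uniform sample of every value seen (Algorithm R)."""

    def __init__(self, size=1000, rng=None):
        self.size = size          # maximum number of samples kept (arbitrary default)
        self.count = 0            # total number of values observed so far
        self.values = []          # the current sample
        self.rng = rng or random.Random()

    def update(self, value):
        """Record one measurement, e.g. a response time in milliseconds."""
        self.count += 1
        if len(self.values) < self.size:
            # Reservoir not yet full: keep everything.
            self.values.append(value)
        else:
            # Keep the new value with probability size / count,
            # overwriting a uniformly chosen existing sample.
            j = self.rng.randrange(self.count)
            if j < self.size:
                self.values[j] = value

    def quantile(self, q):
        """Return an approximate q-quantile (0 <= q <= 1) of the sampled values."""
        if not self.values:
            raise ValueError("no samples recorded")
        ordered = sorted(self.values)
        idx = min(len(ordered) - 1, int(q * len(ordered)))
        return ordered[idx]


reservoir = UniformReservoir(size=100, rng=random.Random(42))
for response_time_ms in range(1, 10001):
    reservoir.update(response_time_ms)
p99 = reservoir.quantile(0.99)  # estimated from only 100 retained samples
```

One trade-off this sketch exposes: a uniform reservoir samples evenly over the whole lifetime of the process, so a recent latency spike is diluted by older measurements. Reservoirs that favour recent data, such as sliding-window or time-decaying variants, respond faster at the cost of a shorter memory.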