Thursday 27th September, 2012
10:45am to 11:30am
""#monitoringsucks."" I can appreciate the sentiment. I used to use Nagios too! However, I can't agree that monitoring sucks. Monitoring is awesome! We observe our systems to understand their behaviour. We do this in various ways like reading logs or taking measurements and, more recently, storing them in a timeseries database such as collectd or graphite. However, the standard practice for alerting is still to check the measurement at the time that it is taken and it is this ""check script"" model of monitoring that is long due for an overhaul. In this talk, I'll start over from first principles: what do we want monitoring to do for us? I'll deconstruct the ""check script"" and rebase its essentials on the humble timeseries. I'll demonstrate simple aggregation and apply some maths and stats to show how monitoring can scale to cluster size without increasing maintenance costs. With worked examples based on real-world situations, you'll learn techniques that you can use to make your monitoring systems more useful.
Sign in to add slides, notes or videos to this session