Tutorial T7: How to Not Get Paged: Managing On-call to Reduce Outages

A session at LISA15

Tuesday 10th November, 2015

1:30pm to 5:00pm (EST)

People think of “on call” as responding to a pager that beeps because of an outage. In this class, you will learn how to run an on-call system that improves uptime and reduces how often you are paged. We will start with a monitoring philosophy that prevents outages. Then we will discuss how to construct an on-call schedule (possibly in more detail than you've cared about before) that, as a result, will be more fair and less stressful. We'll discuss how to conduct “fire drills” and “game day exercises” that create antifragile systems. Lastly, we'll discuss how to conduct a postmortem exercise that promotes better communication and prevents future problems.

Who should attend:
Sysadmins, devs, operations, and their managers

Take back to work:

  • Knowledge that makes being on call more fair and less stressful
  • Strategies for using monitoring to improve uptime and reliability
  • Team-training techniques such as "fire drills" and "game day exercises"
  • How to conduct better postmortems/learning retrospectives

Topics include:

  • Why your monitoring strategy is broken and how to fix it
  • Building a more fair on-call schedule
  • Monitoring to detect outages vs. monitoring to improve reliability
  • Alert review strategies
  • Conducting “fire drills” and “game day exercises”
  • "Blameless postmortem documents"

About the speaker

Thomas A. Limoncelli

Co-author of http://the-cloud-book.com the Grey's Anatomy of DevOps/SRE practices. Author, LGBT, Sysadmin, SRE at http://StackOverflow.com. Overturn Heller! (Bio from Twitter)

Books by speaker

  • The Practice of System and Network Administration
  • Time Management for System Administrators
