With over 750 million active users worldwide, half of whom log in daily, Facebook is now the most popular website in the world. The infrastructure that powers it comprises many back-end services working together to provide a coherent user experience. This infrastructure includes a cache of over 2 trillion objects, accessed 100 million times per second across multiple datacenters and geographies. People interact with over 900 million objects, and more than 30 billion items of content are uploaded every month.
Reliability is important to Facebook, but failures will occur, and the Operations and Infrastructure Engineering teams need to respond to them quickly. Facebook is a fast-paced environment, and the principle of moving fast applies not only to its engineering practices but also to how things are fixed. Systems, processes, and culture all work together to make this happen.
This talk will highlight some of the systems and practices Facebook employs to manage systems and software at scale. I will use a few case studies to describe how these are built and offer guidelines for building systems and operations teams that can scale with infrastructure growth. The talk will touch on automation, communication, monitoring, incident management, infrastructure design, and code releases.
22nd–23rd September 2011