Site reliability engineering

Many consider site reliability engineering (SRE) a form of DevOps, while others consider it a separate discipline. I’m including this section because, whatever your opinion on the subject, you as a DevOps engineer will have to deal with the concepts of site reliability: how to maintain it and how to retain customer trust.

SRE as a concept is more rigid than the DevOps philosophy as a whole. It evolved from the work of the data center technicians of the past, who practically lived in data centers over the course of their careers, maintaining server racks and configurations so that whatever product their servers delivered would continue to be delivered. That was their job: not creating anything new, but finding solutions to maintain their old infrastructure.

SRE is similar, but the engineer has been taken out of the data center and now works remotely from a desk at an office or in their own home. They still live fairly close to their data center or the cloud region containing the resources that they manage, but they differ from their predecessors in a couple of ways:

  1. Their teams are likely spread across regions rather than located in a single place.
  2. Their emphasis is now on what we call predictive maintenance; that is, they do not wait for something to go wrong before they respond.
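To make the idea of predictive maintenance a little more concrete, here is a minimal sketch in Python. It assumes a hypothetical series of hourly disk-usage percentages and a hypothetical 90% alert threshold (neither comes from a real monitoring tool), fits a straight line through the samples, and estimates how many hours remain before the threshold is crossed. A real SRE team would more likely rely on the forecasting features of its monitoring stack; this is only meant to illustrate responding to a trend rather than to a failure.

```python
from typing import Optional, Sequence


def hours_until_threshold(
    samples: Sequence[float], threshold: float = 90.0
) -> Optional[float]:
    """Estimate the hours left before usage reaches the threshold.

    `samples` are hypothetical hourly disk-usage percentages, oldest first.
    Returns 0.0 if the latest sample is already at or above the threshold,
    and None if usage is flat or falling (no crossing is predicted).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    if samples[-1] >= threshold:
        return 0.0

    # Ordinary least-squares slope, with the sample index as the hour.
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance  # percentage points per hour

    if slope <= 0:
        return None
    return (threshold - samples[-1]) / slope


if __name__ == "__main__":
    # Usage grows by 2 points per hour, so 78% reaches 90% in 6 hours.
    print(hours_until_threshold([70, 72, 74, 76, 78]))  # 6.0
```

With an estimate like this, the team can schedule a cleanup or a capacity increase hours in advance instead of reacting once the disk is already full.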

Incident response teams

The rise of SRE also helped produce incident response teams, which can be quickly assembled from within the ranks of the DevOps team to monitor and deal with an incident. While doing so, they communicate with stakeholders to keep them informed about the situation and work to find the root cause of the incident. These teams also produce reports that help the DevOps team deal with and mitigate similar situations in the future. In a world where an outage of a few minutes can sometimes cause millions of dollars in losses and damage, incident response teams have become a prominent part of any DevOps engineer’s world.

Usually, an incident response team is made up of the following members:

  • Incident commander (IC): The incident commander leads the response to the incident and is responsible for the post-incident response plan.
  • Communications leader (CL): The communications leader is the public-facing member of the team, responsible for communicating the incident and the progress made in mitigating it to the stakeholders.
  • Operations leader (OL): Sometimes the same person as the incident commander, the OL leads the technical resolution of the incident by looking at logs, errors, and metrics and figuring out a way to bring the site or application back online.
  • Team members: The people working under the CL and OL, who are coordinated by their respective leaders for whatever tasks the response requires.

Figure 1.1 – A typical incident response team structure
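The roles listed above can also be sketched as a small data structure. The following Python example is only an illustration: the class name, field names, and people's names are hypothetical, and the reporting lines (CL and OL reporting to the IC, team members reporting to their own leader) are an assumption based on the role descriptions rather than a fixed standard.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IncidentResponseTeam:
    """Hypothetical model of the roles described in the text."""

    incident_commander: str
    communications_leader: str
    operations_leader: str
    communications_members: List[str] = field(default_factory=list)
    operations_members: List[str] = field(default_factory=list)

    def reports_to(self, person: str) -> Optional[str]:
        """Return who coordinates `person`, or None for the IC."""
        # The IC check comes first because the OL is sometimes the same person.
        if person == self.incident_commander:
            return None
        if person in (self.communications_leader, self.operations_leader):
            return self.incident_commander
        if person in self.communications_members:
            return self.communications_leader
        if person in self.operations_members:
            return self.operations_leader
        raise KeyError(f"{person} is not part of this incident response team")


if __name__ == "__main__":
    team = IncidentResponseTeam(
        incident_commander="Asha",
        communications_leader="Ben",
        operations_leader="Chen",
        communications_members=["Dana"],
        operations_members=["Eli", "Farah"],
    )
    print(team.reports_to("Eli"))   # Chen
    print(team.reports_to("Ben"))   # Asha
    print(team.reports_to("Asha"))  # None
```

Keeping the structure this explicit makes it easy to answer, in the middle of an incident, who should be coordinating whom.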

As you can see in Figure 1.1, the structure of the incident response team is fairly simple and is usually quite effective in mitigating an incident when one arises. But what happens after the incident? Another incident? That’s a possibility, and that possibility is exactly why we need to gain insight from the current incident. We do this with post-mortems.