How to respond to an incident (in life and DevOps)
Incidents happen, and the people responsible for dealing with them need to handle them. Firefighters battle fires, doctors treat the sick, and DevOps engineers contend with the incidents that can occur on the sites they deploy and manage.
Now, how would you deal with an incident in your own life or work? One approach, which I read about in a book called Mental Strength by Iain Stuart Abernathy and subsequently found everywhere among the DevOps courses and experts that I met, is SMART: Specific, Measurable, Achievable, Realistic, and Time-bound. If a solution to a problem follows all of these principles, it has a good chance of working. You can apply this to your own life as well as to your DevOps journey. It’s all problem-solving, after all.
To define the SMART principle in brief, let’s go over each of the components one by one:
- Specific: Know exactly what is happening
- Measurable: Measure its impact
- Achievable: Think of what your goal is for mitigation
- Realistic: Be realistic with your expectations and what you can do
- Time-bound: Time is of the essence, so don’t waste it
Here are some common incidents DevOps engineers may have to deal with:
- The production website or application goes down
- There is a massive spike in traffic suggesting a distributed denial-of-service attack
- There is a massive spike in traffic suggesting an influx of new users that will require scaling up resources
- There is an error in building the latest code in the code pipeline
- Someone deleted the production database (seriously, this can happen)
Dealing with an incident starts with classifying it by the type of response that can be provided and by whether this type of incident has been anticipated and prepared for. If the response is manual, then time is usually less of a factor. This tends to be the case when an incident doesn’t affect the workload but must still be addressed, such as a potential anomaly or a data breach. The stakeholders need to be told so that they can make an informed decision on the matter. Automatic responses are for common errors or incidents that you know occur from time to time and for which you already have an appropriate response. For example, you might add more computing power or more servers in response to increased traffic, or restart an instance if a certain metric goes awry (this happens quite a bit with Kubernetes).
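To make the split between automatic and manual responses concrete, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: the `Incident` class, the incident kinds, and the `scale_out` and `restart_instance` handlers are assumptions, and the handlers only return a description of what a real system would do rather than calling any cloud or Kubernetes API.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Incident:
    kind: str    # hypothetical label, e.g. "traffic_spike"
    detail: str  # what the incident concerns, e.g. "web tier"


def scale_out(incident: Incident) -> str:
    # Placeholder: a real handler would ask the platform for more capacity.
    return f"auto: added capacity for {incident.detail}"


def restart_instance(incident: Incident) -> str:
    # Placeholder: a real handler would restart the affected instance.
    return f"auto: restarted {incident.detail}"


# Incidents we have anticipated and prepared an automatic response for.
AUTOMATIC_RESPONSES: Dict[str, Callable[[Incident], str]] = {
    "traffic_spike": scale_out,
    "unhealthy_instance": restart_instance,
}


def respond(incident: Incident, notify: Callable[[str], None]) -> str:
    """Run the prepared response if one exists; otherwise escalate to people."""
    handler = AUTOMATIC_RESPONSES.get(incident.kind)
    if handler is not None:
        return handler(incident)
    # No prepared response: tell the stakeholders so they can decide.
    notify(f"Manual review needed for {incident.kind}: {incident.detail}")
    return "manual: stakeholders notified"


if __name__ == "__main__":
    print(respond(Incident("traffic_spike", "web tier"), print))
    print(respond(Incident("possible_data_breach", "unusual logins"), print))
```

The lookup table is the design choice worth noting: adding a newly anticipated incident type means registering one more handler, while anything unrecognized falls through to the manual path by default.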
We deal with these incidents in order to provide the maximum availability possible for any application or site that we manage. This practice of aiming for maximum availability will be covered in the next section on site reliability engineering.