Post-mortems

An incident happens. It affects business value and the users of the application, and then it goes away or is resolved. But what's to say it won't happen again? What could be done to mitigate it before it even has the chance to recur? Post-mortems answer both questions. Any good DevOps team will perform a post-mortem after an incident has occurred, led by the incident response team that handled the situation.

Post-mortems sound macabre, but they are an essential part of the healing process and improvement of a workload and a DevOps team. They let the DevOps team understand the incident that occurred and how it happened, and they dissect the response made by the response team. Exercises such as these create a solid foundation for faster response times in the future as well as for learning experiences and team growth.

One of the aspects of post-mortems that is constantly emphasized is that they must be blameless; that is, responsibility for the cause of the incident must never be placed on an individual. If an incident has occurred, it is the process that must be modified, not the person. This approach creates an environment of openness and ensures that the results of the post-mortem are factual, objective, and unbiased.

So, you may ask yourself, why go through all of this? The reason is often contractual: in the modern technological landscape, practices such as these are expected, and frequently obligatory, in order to deliver value and availability to the end user. So let's understand exactly what that availability means.
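To make the idea concrete, a blameless post-mortem is usually captured as a structured record: what happened, who was affected, the timeline of the response, the process-level root cause, and the action items that change the process. The following sketch shows one possible shape for such a record; the class, field names, and sample incident are all illustrative assumptions, not a standard format.

```python
from dataclasses import dataclass, field

@dataclass
class PostMortem:
    """A minimal, blameless post-mortem record (all field names are illustrative)."""
    title: str
    impact: str                  # what users and business value were affected
    timeline: list[str]          # detection, escalation, mitigation, resolution
    root_cause: str              # always describes a process failure, never a person
    action_items: list[str] = field(default_factory=list)  # process changes

# Hypothetical example incident, for illustration only
report = PostMortem(
    title="Checkout API outage",
    impact="Checkout unavailable to customers for roughly 18 minutes",
    timeline=[
        "14:02 alert fired",
        "14:05 on-call engineer paged",
        "14:20 rollback completed, service restored",
    ],
    root_cause="Deploy pipeline allowed an unvalidated configuration change",
    action_items=["Add a configuration validation step to the deploy pipeline"],
)
```

Note that the root cause names the pipeline, not the engineer who pushed the change; the action item then fixes that pipeline, which is exactly the blameless approach described above.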

Understanding high availability

I’m not going to state Murphy’s Law a third time, but understand that it applies here as well. Things will go wrong and they will fall apart. Never forget that. One of the reasons DevOps as a concept and culture became so popular was that its techniques delivered a highly available product with very little downtime, maintenance time, and vulnerability to app-breaking errors.

One of the reasons DevOps succeeds in its mission of high availability is its ability to understand failure, react to failure, and recover from failure. Here's a famous quote from Werner Vogels, the CTO of Amazon:

Everything fails, all the time.

This is, in fact, the foundation of the best practice guides, tutorials, and documentation that AWS produces for DevOps operations, and it's true. Sometimes, things fail because of a mistake that has been made. Sometimes, they fail because of circumstances that are completely out of our control, and sometimes, things fail for no discernible reason. But the point is that things fail, and when they do, DevOps engineers need to deal with those failures, and to do so as fast as possible, with as little disturbance to the customer as possible.
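One common, concrete way to react to failure automatically is to retry transient errors with exponential backoff, so that a brief hiccup never reaches the customer at all. The sketch below illustrates the pattern; the function name, parameters, and the flaky example operation are assumptions for illustration, not part of any particular library.

```python
import random
import time

def call_with_retries(operation, max_attempts=4, base_delay=0.1):
    """Retry a failing operation with exponential backoff and jitter.

    Illustrative sketch of a common resilience pattern: absorb transient
    failures quietly, and only surface the error once retries are exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up: let monitoring/alerting see the failure
            # Sleep 0.1 s, 0.2 s, 0.4 s, ... plus jitter to avoid retry storms
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.05))

# Hypothetical operation that fails twice, then succeeds
attempts = {"count": 0}

def flaky_operation():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = call_with_retries(flaky_operation)
```

Here the first two calls fail, the retries absorb them, and the caller simply receives `"ok"`; from the customer's point of view, nothing went wrong.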

A little advice for anyone who has never worked on a substantial project before, or at least never been on the receiving end of someone giving orders: ask for specifics. It's one of the tenets of DevOps, Agile, and any other functional strategy, and it is vital to any working relationship between the stakeholders and participants of a project. If you tell people exactly what you want, and give them metrics that define it, it becomes much easier for them to produce it. So, in DevOps, there are metrics and measurements that help define the requirements for the availability of services, as well as agreements to maintain those services.

There are a number of acronyms, metrics, and indicators associated with high availability. They will be explored in this section, and together they help define exactly what high availability means for a workload.
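Before diving into those metrics, it helps to see how an availability target translates into a concrete downtime budget. Availability is commonly quoted in "nines" (99.9%, 99.99%, and so on), and the arithmetic is simple: the fraction of the year you are allowed to be down is one minus the target. A quick sketch, assuming a 365-day year:

```python
def annual_downtime_minutes(availability: float) -> float:
    """Maximum downtime per 365-day year permitted by an availability target."""
    return (1.0 - availability) * 365 * 24 * 60

# Three nines (99.9%) allow about 525.6 minutes, roughly 8.8 hours, per year;
# each extra nine shrinks the budget by a factor of ten.
for label, target in [("two nines", 0.99), ("three nines", 0.999), ("four nines", 0.9999)]:
    print(f"{label} ({target:.2%}): {annual_downtime_minutes(target):.1f} min/year")
```

This is why the difference between 99.9% and 99.99% matters so much in practice: the second target leaves less than an hour of downtime for the entire year.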