Do you have anxiety when you are on-call?

Do you panic and become disorganised when an outage occurs?

Hey Pixelers,


On-call mostly sucks.

As an infrastructure engineer, I am often a main point of contact during system outages. Sometimes the fault lies in my own domain, such as network, DNS, or load issues. However, a significant portion of the time it is an issue outside my domain, where I have either been brought in to help resolve it, or I was simply the first point of contact on-call. With the adoption of the "You build it, you own it" mindset, more developers are being made responsible for their applications in production. As a result, more people with little experience managing outages are being thrown onto the front line. This can be very stressful, so I am going to share a few recommendations on how you can handle the situation better.


Why does on-call suck?

Before we continue, it is good to understand what actually causes the stress and anxiety in the first place. Usually this is a combination of:

  1. Not knowing when an issue will occur: what happens if an incident strikes while I am at the supermarket?
  2. Incidents happening far too often: I am struggling to live my normal life and keep a healthy work-life balance.
  3. The fear of failing: what if I take too long to resolve the issue? Am I competent enough?

We can use this list to design a strategy for better structuring our on-call support. For this reason, it is also important to candidly discuss your concerns with your organisation.


Set Boundaries

I have worked with people who stress out and essentially put their private lives on hold during their on-call rotation. This is bad.

It is extremely important that engineers can continue to live their private lives and not dread their spot in the rota. A question worth asking is: what is a reasonable expectation of uptime, or alternatively, what level of downtime is considered acceptable? It is easy for an organisation to say "no downtime is acceptable", but this is a foolish statement, usually rooted in poorly designed business-to-business contractual agreements or an unrealistic estimate of the profit lost during a system outage. The costs associated with ensuring uptime grow exponentially the higher we set the goal, so much so that they can far exceed any profit the organisation might have lost, even when the organisation counts its losses in the hundreds of thousands of euros.
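To make that trade-off concrete, here is a rough back-of-the-envelope sketch in Python (my own illustration, not tied to any real contract) of how much downtime per year each extra "nine" of uptime actually allows:

    # Yearly downtime budget implied by common uptime targets (illustrative only).
    MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

    for uptime_percent in (99.0, 99.9, 99.99, 99.999):
        allowed_minutes = MINUTES_PER_YEAR * (1 - uptime_percent / 100)
        print(f"{uptime_percent}% uptime -> ~{allowed_minutes:,.0f} minutes of downtime per year")

Going from 99.9% to 99.99% shrinks the yearly budget from roughly nine hours to under an hour, and that last step is usually where the cost curve starts to bite.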

This is because a common way of tackling downtime is to throw more resources at it: hiring more people, more engineers, or dedicated run departments; paying for expensive support contracts; or paying for more processing power and distributed computing spread across multiple service providers.

To this end, it is important as an engineer that you properly set boundaries with your managers and temper uptime expectations. It is usually worth more to the organisation that you are happy in your job and do not burn out: hiring and training a replacement is expensive, and costs the organisation even more if it gets in the way of achieving product goals. It is vital for the organisation to promote a culture where it is OK to fail and make mistakes, and where the mental health of engineers is taken seriously. People on-call should therefore still be encouraged to attend social events (perhaps avoiding intoxication), and, for the worst case, rigid escalation processes should be defined. If you are driving, the incident can be escalated to the next colleague in line. Maybe they are swimming; then it escalates to someone else, or to the entire team. This is fine.
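To make "rigid escalation processes" a little more tangible, here is a minimal sketch of an escalation chain in Python. It assumes a simple in-house pager script rather than any particular tool, and the names and the notify() helper are hypothetical:

    # Minimal escalation-chain sketch; the names and notify() are hypothetical.
    ESCALATION_CHAIN = [
        "primary-on-call",
        "secondary-on-call",
        "whole-team-channel",  # last resort: page the entire team
    ]


    def notify(target: str, incident: str) -> bool:
        """Placeholder: send a page and report whether it was acknowledged in time."""
        print(f"Paging {target} about: {incident}")
        return False  # pretend nobody acknowledged, so the loop keeps escalating


    def escalate(incident: str) -> None:
        for target in ESCALATION_CHAIN:
            if notify(target, incident):
                return  # someone has taken ownership; stop escalating
        print("Nobody acknowledged; fall back to calling managers directly.")


    escalate("checkout error rate above threshold")

The exact tooling matters far less than everyone knowing, in advance, who is next in line when the first person cannot respond.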


Slow down

Now that we have properly set our boundaries and expectations, we can talk about how to respond in a crisis situation. The Navy SEALs have a phrase: "Slow is smooth, smooth is fast." The theory is that by taking things slowly, you are less likely to make mistakes, and therefore do a better job than someone who rushes. This lesson is a core principle in the fable of The Tortoise and the Hare, which I am sure you are all aware of. The tale of someone rushing, accidentally deleting the wrong resource, and turning a small incident into a catastrophe is, unfortunately, not uncommon. Take a deep breath, look out the window, brew yourself a cup of tea or coffee, take breaks to stretch and clear your mind, and stay hydrated; you will need it. You might have people bothering you, maybe management hovering awkwardly in the corner; tune them out and keep a cool head.


Separate Discovery and Action

Rushing into action to "quickly resolve the problem" is how you end up rushing to solve two problems. You should split your investigation into three phases:

  1. Discovery: what is the problem, what caused it, what is the root cause?
  2. Strategy: how should we fix it, and who should be involved in fixing it?
  3. Implementation: actually taking the time to perform the repair.

During the discovery phase, some people love to create checklists; these help ensure that when it comes time to implement the repair, all the relevant steps are performed in order without missing anything. We should only move on to the next stage when we are confident we have exhausted all the knowledge gathering required in the current one. You may have multiple potential root causes; go through them one at a time and systematically rule each one out until only one option remains. Time to move on to the next step? Remember to take a deep breath and double-check all the information compiled in the previous phases first.
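If you like the checklist idea, a lightweight way to keep yourself honest is to script it. The following is only a sketch with made-up steps, assuming you track the list in a plain script rather than a dedicated incident tool:

    # Checklist sketch: work through the repair steps in order and stop at the
    # first unfinished one. The step names are made-up examples.
    repair_checklist = [
        "confirm the suspected root cause against the latest metrics",
        "announce the planned fix in the incident channel",
        "apply the fix to a single instance first",
        "verify that error rates recover",
        "roll the fix out to the remaining instances",
    ]

    for number, step in enumerate(repair_checklist, start=1):
        answer = input(f"Step {number}: {step} -- done? [y/N] ").strip().lower()
        if answer != "y":
            print("Stop here and finish this step before moving on.")
            break
    else:
        print("All repair steps completed.")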


Take Notes

A good way to structure an incident response, and to keep yourself from going too fast, is to approach the situation like a crime-scene detective. Open up a notepad and write down the general steps you took, the investigations you conducted, who you talked to, and at what time. If you have a colleague assisting you during the outage, you can assign them as the dedicated scribe. It can be helpful to simplify the situation by drawing out the issue as you understand it, and updating the drawing as people share knowledge or new realisations are made. Whiteboards are great for this; I have one behind me, and one I can place on my desk. Stress reduces the cognitive resources you have available and greatly compromises your capacity for original thinking, so this added layer of structure is essential and its value should not be understated.
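If a whiteboard is not at hand, even a few lines of Python can stand in as the scribe's timeline. This is only a sketch; the log() helper and the example entries are made up:

    # Tiny incident-timeline sketch for the scribe; entries are illustrative only.
    from datetime import datetime, timezone

    timeline: list[str] = []


    def log(entry: str) -> None:
        """Record an observation, action, or conversation with a UTC timestamp."""
        stamp = datetime.now(timezone.utc).strftime("%H:%M")
        timeline.append(f"{stamp}  {entry}")


    log("Alert fired: 5xx rate above 2% on the checkout service")
    log("Checked recent deploys; nothing shipped in the last six hours")
    log("Asked the database team about connection pool saturation")

    print("\n".join(timeline))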


We Are In This Together

You should not feel alone during an incident; reaching out to colleagues is a critical step. Create a group chat, immediately pull in the relevant parties, and escalate to other engineers as needed. Ask questions about the things you do not know; this is not the time to be arrogant.


Blameless postmortems

I won't talk about how to structure postmortems in this video; however, the most important thing for you to understand is that they are vital. By sharing the notes and checklists we developed during the investigation, we can create internal documentation that helps resolve these issues faster in the future.


Sharing this calm and measured approach with your team can help other people in the organisation stay calmer, and trust that when an issue does occur, it will be resolved professionally. In a future video I will talk about how to best organise a team to reduce the frequency of outages and reduce your mean time to recovery. I hope this video has helped you learn some techniques to stay calm during a system outage. Oh, and if you are wondering whether or not you are competent enough? Do not worry; chances are you are, and this will be a great opportunity to learn more about how your system works!