Disaster Recovery Strategies

Hey Pixelers, Michael Haddon here. Last week, a client asked me, "What should a startup do for disaster recovery?" They had a lot of questions about theory—stuff you can usually find on the AWS blog or similar sites, which tend to focus on the complex strategies enterprise companies use for disaster recovery. So, I thought it’d be nice to create a video that covers everything from A to Z: strategies for disaster recovery, depending on the size of your organization. We’ll explore what you should tackle first as a startup and work up to how enterprises manage these processes. Having managed disaster recovery strategies for both startups and enterprises, I’ve implemented all of these methods across various organizations.

Now, usually, when you research disaster recovery online, most of the content focuses on enterprise approaches. The problem is that enterprise-level strategies take the foundational work for granted. This can feel overwhelming for smaller companies, making them think, "I already need to implement pilot lights, right? That’s the first step?" In reality, there’s a lot to address before reaching that stage.

Think of disaster recovery as a maturity chain, starting from the basics for startups and progressing to enterprise-level sophistication. In this video, I’ll guide you through the entire chain, helping you identify where you currently stand and what your next steps should be to improve your current situation.

The first stage is "praying." This means you don’t really have a disaster recovery strategy. So, what is disaster recovery? Disaster recovery is the process your organization follows to recover from an incident. As Google’s SRE book famously says, "Hope is not a strategy." It’s similar to Murphy’s Law: "Anything that can go wrong, will go wrong." Assuming your system will stay safe and stable forever is foolish. Incidents will happen, and if you’re just hoping for the best, you’re doomed from the start. This is how companies end up offline for months, needing to rebuild from scratch, often leading to bankruptcy. Hope is not a strategy.

Getting past that stage means having key measures in place before you take a new application or system into production. This includes runbooks, backup and restore processes, and other best practices we’ll discuss further. It’s about ensuring your team knows how to act during incidents and that everything is ready before you hit production.

The first actionable step is backup and restore. Backups are more nuanced than they may seem. A backup is only as good as your ability to test it. Regular testing is crucial, with the frequency depending on compliance requirements and your investment capacity. Many companies test quarterly or even monthly, while enterprise setups may automate this process continuously. It’s also vital to document the processes so that anyone in the organization—not just the experts—can follow them. This reduces the risk of human error and ensures smooth execution, even if key personnel are unavailable.
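To make "test your backups" concrete, here is a minimal sketch of an automated restore test. It assumes the data is a plain directory of files archived with tar, which is my simplification and not something from a specific client setup. The function names (`make_backup`, `restore_test`) are made up for illustration.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def file_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def make_backup(source: Path, archive: Path) -> None:
    """Write the contents of source into a gzip-compressed tar archive."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=".")


def restore_test(source: Path, archive: Path) -> bool:
    """Restore the archive into a scratch directory and compare it to source.

    Assumption: source has not changed since the backup was taken. On a live
    system you would compare against digests recorded at backup time instead.
    """
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(scratch)
        return file_digests(Path(scratch)) == file_digests(source)
```

Running something like this on a schedule is one way to move from testing quarterly by hand toward the continuous testing that enterprise setups aim for.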

There are also best practices for backups worth noting. For example, use idiomatic backups, which leverage tools officially recommended or widely used for your specific system. This avoids pitfalls like corrupted data from poorly executed custom backup solutions. Atomic backups are another must-have, ensuring your backup reflects a single point in time, preventing inconsistencies between files.
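As an illustration of an idiomatic and atomic backup, the sketch below uses SQLite's own online backup API through Python's standard `sqlite3` module instead of copying the database file by hand. SQLite is only my stand-in example here, and `backup_sqlite` is a hypothetical helper name. The same idea applies to whatever official backup tool your system ships with.

```python
import sqlite3


def backup_sqlite(live_path: str, backup_path: str) -> None:
    """Copy a SQLite database using SQLite's built-in backup mechanism.

    With the default arguments, Connection.backup() copies the whole database
    in a single step, so the copy reflects one consistent point in time even
    if other connections are using the live database.
    """
    src = sqlite3.connect(live_path)
    dst = sqlite3.connect(backup_path)
    try:
        src.backup(dst)
    finally:
        dst.close()
        src.close()
```

Copying the raw file while writes are in flight is the kind of custom approach that can leave you with a corrupted or inconsistent backup.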

Then comes the "3-2-1 backup" strategy: three copies of your data (including the live version), two different media types, and one offsite copy. Some organizations take this a step further with air-gapped or immutable backups, which are especially effective against ransomware attacks.
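The 3-2-1 rule is simple enough to check mechanically. Here is a small sketch that does so against an inventory of copies. The `DataCopy` type and its fields are hypothetical names I made up for the example, and the media labels are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    name: str
    media: str     # for example "ssd", "object-storage", "tape"
    offsite: bool  # True if stored away from the primary site


def satisfies_3_2_1(copies: list[DataCopy]) -> bool:
    """Check: at least 3 copies, at least 2 media types, at least 1 offsite."""
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


# Example inventory: the live data counts as one of the three copies.
inventory = [
    DataCopy("live database", media="ssd", offsite=False),
    DataCopy("nightly snapshot", media="ssd", offsite=False),
    DataCopy("weekly archive", media="object-storage", offsite=True),
]
```

With this inventory, `satisfies_3_2_1(inventory)` passes. Drop the weekly archive and it fails on all three conditions at once.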

The next stage is infrastructure as code (IaC). This might seem unrelated to disaster recovery, but it’s a foundational step. Without IaC, you’re stuck backing up not just your data but also server configurations. With IaC, you can automate rebuilding your infrastructure, significantly reducing downtime and human error. Tools like Kubernetes and Terraform are my go-to solutions for this.

For startups, the next recommendation might be controversial: partner with major cloud providers. While this may seem expensive compared to renting virtual machines or buying your own hardware, these providers offer enterprise-grade disaster recovery capabilities at a fraction of the cost it would take to implement on your own. Their expertise and resources can save you significant time and effort, allowing you to focus on your core business.

From here, we move into enterprise-level strategies. Enterprises define disaster recovery metrics using RPO (Recovery Point Objective) and RTO (Recovery Time Objective). RPO defines how much data loss is acceptable, usually expressed as a window of time before the incident, while RTO defines how quickly you must be back up and running after it. These metrics vary depending on business needs. For example, an e-commerce site like Amazon may have stricter requirements than an internal government system.
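To show how these two metrics connect to day-to-day decisions, here is a rough sketch that checks a backup schedule and a measured restore time against targets. It assumes simple periodic backups, where the worst case is losing one full interval of data. The function names are illustrative and not from any standard tool.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """With periodic backups, an incident just before the next backup loses
    roughly one full interval of data, so the interval must fit within the RPO."""
    return backup_interval <= rpo


def meets_rto(measured_restore_time: timedelta, rto: timedelta) -> bool:
    """The restore time you actually measured in a test must fit within the RTO."""
    return measured_restore_time <= rto
```

For example, nightly backups (a 24-hour interval) cannot meet a 1-hour RPO no matter how good the restore process is, which is why tighter objectives push you toward continuous replication.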

Enterprise strategies include pilot light setups, warm standby, and active-active architectures, each offering progressively lower downtime and higher resilience but also increasing costs. Pilot light systems keep critical components running in a passive state, ready to scale up when needed. Warm standby systems take this further by running scaled-down applications that can quickly handle production loads. Active-active setups maintain fully operational systems in multiple regions simultaneously, offering near-zero downtime but requiring sophisticated data synchronization and a significant budget.

For enterprises, disaster recovery planning often begins with identifying critical systems, estimating downtime costs, and defining acceptable RPO and RTO values. These steps help justify investments and ensure a balanced approach to risk and resilience.
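The downtime estimate can start as back-of-the-envelope arithmetic. The sketch below is one simple way to frame it. The model (incidents per year times hours down times cost per hour) is my own simplification, and every name in it is hypothetical.

```python
def expected_annual_downtime_cost(
    incidents_per_year: float,
    hours_down_per_incident: float,
    cost_per_hour: float,
) -> float:
    """Rough expected yearly loss from downtime under a given strategy."""
    return incidents_per_year * hours_down_per_incident * cost_per_hour


def total_annual_cost(
    strategy_cost_per_year: float,
    incidents_per_year: float,
    hours_down_per_incident: float,
    cost_per_hour: float,
) -> float:
    """What the strategy costs to run plus the downtime it still leaves you with."""
    return strategy_cost_per_year + expected_annual_downtime_cost(
        incidents_per_year, hours_down_per_incident, cost_per_hour
    )
```

Plugging in your own figures for each strategy gives you comparable totals, which is usually enough to show whether the step up to the next maturity level pays for itself.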

To wrap up, disaster recovery is about preparation, not reaction. Address these challenges before they become problems. If your organization needs help improving its disaster recovery maturity, feel free to reach out—I’d be happy to assist. And as always, let me know in the comments if there are other topics you’d like me to cover. Cheers!
