BCDR

You Have a DR Plan. Does It Actually Work?

Published Apr 11, 2026

Mahesh Chandran

CEO Dataring

Here is a question that separates theoretical preparedness from real preparedness: when was your organization's disaster recovery plan last tested with an actual failover, and what were the results?

In the engagements I've worked on, the honest answer is usually "I don't know" or "I'm not sure we've ever done that." The gap between having a plan and having a plan that works is one of the most common failure modes I see. The plan exists, the document satisfies the auditor, the auditor signs off, and nobody actually runs the plan to see if it survives contact with reality.

This post is about how to tell the difference. For the broader BCDR architecture context, see our pillar post on cloud DR in the GCC. For the regulator-facing maturity model that complements this, see our SAMA CSF post.

The thesis

Most DR plans fail because they were never tested, not because they were poorly designed. An untested plan is a hope document. Worse, an untested plan creates false confidence — the leader who believes there is a plan behaves as if the problem is solved, while the leader who knows there is no plan behaves cautiously and pushes for improvement. The behavioral difference is enormous, and it's entirely driven by whether the plan has been stress-tested.

The confidence gap

Ask most executives whether their organization has a DR plan and you'll get a confident yes. Ask whether the plan has been tested recently and you'll usually get a hesitant "I believe so." Ask what the test revealed and you'll often get silence. That sequence is the confidence gap.

Industry research on testing rates and restoration success varies by methodology and reporting; institutions interested in the specific figures should consult current sources from established research firms. The pattern across studies is consistent: a meaningful share of organizations don't test backups at all, and a meaningful share of restoration attempts in real incidents fail when first attempted. The exact percentages move; the direction doesn't.

Three flavors of false confidence

In the assessments I've done, false confidence about DR shows up in three patterns. Each has a tell, and each can be diagnosed without technical expertise.

"We have a plan."

There is a documented DR plan. PDF in SharePoint, page in Confluence, section in the employee handbook. It satisfies the annual audit. The tell: when you ask to see it, nobody knows where it lives; or when you find it, the contact list names people who left two years ago; or it references systems that were decommissioned eighteen months ago. A plan that hasn't been reviewed in eighteen months is essentially fiction. The diagnostic question is not "do we have a plan?" but "when was the plan last reviewed against current operations, by whom, and what changed?"

"IT handles it."

The business leader has delegated continuity entirely to IT. IT is competent and is genuinely trying to protect the organization, but IT protects infrastructure, not business processes. They know how to restore a database and fail over a VM. They don't know which of your team's workflows are Tier 1, which can pause, or which customers must be contacted first. This isn't IT's fault — they cannot know these things without being told. The result is a plan optimized for technical metrics (time to restore systems) rather than business metrics (time to resume critical processes). The gap between "the database is back up" and "my team can do its job" is often hours.

"We've backed everything up."

Backups exist. They run nightly. They consume storage and budget. They have never been tested for restoration. The organization assumes that backups equal recovery, which is a category error: backups are necessary but not sufficient. Many backups silently fail to be restorable, and the failure only becomes visible when you actually try to restore them. A backup that cannot be restored is a wish, not a backup. See our backup guide for the verification step that closes this gap.

Why plans decay

Even a well-written DR plan degrades over time through a process worth naming explicitly: plan decay. The plan becomes less accurate without anyone updating it, because the business keeps changing underneath the plan.

The forces that cause decay are mundane. Staff turnover: the named incident coordinator left two months ago. New SaaS tools: the team adopted three new tools last quarter, none of which are in the plan (see our dependency mapping post). System migrations: IT moved the core application to a different region but didn't update the runbook. Restructuring: the department that owned a critical process was dissolved, and the process is now shared across two teams, neither of which thinks it's their responsibility. New customer commitments: a contract signed last quarter requires a 4-hour RTO for services currently protected at 8 hours.

Each change is small. Each should trigger an update. Almost none do, because nobody's job is "keep the DR plan accurate" and the plan is only opened during the annual compliance cycle.

A useful diagnostic for plan decay is the newspaper test: if the DR plan were published on the front page tomorrow, would it accurately describe how the organization would actually respond? For most organizations, the honest answer is no.
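The decay forces listed above are mechanical enough to check with a script. What follows is a minimal sketch, assuming the plan's named contacts, covered systems, and protected RTOs have already been extracted into a simple record. `DRPlan`, `find_decay`, and every sample name and value are hypothetical illustrations, not part of any real tool.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DRPlan:
    """Hypothetical summary of what the DR plan document claims."""
    contacts: set[str]                 # people the plan names
    systems: set[str]                  # systems and SaaS tools the plan covers
    protected_rto_hours: dict[str, float] = field(default_factory=dict)


def find_decay(plan, current_staff, current_systems, contracted_rto_hours):
    """Return findings where the plan no longer matches current operations."""
    findings = []
    # Staff turnover: named people who have left.
    for person in sorted(plan.contacts - current_staff):
        findings.append(f"contact no longer on staff: {person}")
    # New tools and migrations: things in use that the plan never mentions.
    for system in sorted(current_systems - plan.systems):
        findings.append(f"in use but not in plan: {system}")
    # Decommissioned systems the plan still references.
    for system in sorted(plan.systems - current_systems):
        findings.append(f"in plan but decommissioned: {system}")
    # New customer commitments: contract RTO stricter than current protection.
    for system in sorted(contracted_rto_hours):
        required = contracted_rto_hours[system]
        protected = plan.protected_rto_hours.get(system)
        if protected is None:
            findings.append(f"no RTO documented for {system} (contract requires {required}h)")
        elif protected > required:
            findings.append(
                f"RTO gap on {system}: contract requires {required}h, plan provides {protected}h"
            )
    return findings


# Illustrative data only.
plan = DRPlan(
    contacts={"A. Coordinator", "B. Engineer"},
    systems={"core-app", "legacy-crm"},
    protected_rto_hours={"core-app": 8},
)
for finding in find_decay(
    plan,
    current_staff={"B. Engineer"},
    current_systems={"core-app", "new-saas-tool"},
    contracted_rto_hours={"core-app": 4},
):
    print(finding)
```

The hard part in practice is not the comparison but keeping the inputs current. Staff rosters and system inventories usually live in other tools, so a check like this is only as good as the exports it reads.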

Three test types, three failure modes

A common mistake is to treat "DR testing" as one thing. It isn't. There are three test types, and each catches a different class of failure. Running one of them does not substitute for the others.

Tabletop exercise (decision testing). A structured discussion of a hypothetical scenario, with business and IT leadership in the same room, no systems actually affected. What it catches: decision authority gaps, communication failures, coordination problems, missing escalation paths, conflicting team assumptions. What it does not catch: whether the technical recovery actually works.

Restore test (data integrity testing). Actually restoring a backup to ephemeral compute and verifying that the restored system is functional. What it catches: corrupted backups, missing tables, schema drift, encryption-key issues, broken restoration scripts. What it does not catch: whether the failover path can serve real users, or whether the business knows what to do once the data is back.

Failover test (user-facing path testing). Switching production traffic to the DR environment for a defined period and serving real users from there. What it catches: DNS and routing issues, load characteristics in DR, application-level state problems, third-party integration health, and the human factors of running production from the unfamiliar environment. What it does not catch: scenarios where multiple systems fail simultaneously or where the cyber and physical layers are both degraded.

An organization that runs only tabletops has rehearsed decisions but not validated technical recovery. An organization that runs only restore tests has validated data but not the user-facing path. An organization that runs only failover tests has validated production switching but may not have rehearsed the human-decision layer. All three are necessary; the absence of any one is a known gap.
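To make the restore test concrete, here is a minimal sketch of the pattern: restore into a throwaway location, then verify that the result is usable. It assumes the backup is a plain SQLite file copy, which is a stand-in chosen so the example runs with only the Python standard library. A production restore would use the backup tool's own restore procedure. The function name `restore_and_verify` and the table names are hypothetical.

```python
import os
import shutil
import sqlite3
import tempfile


def restore_and_verify(backup_path, expected_tables):
    """Restore a SQLite backup into a scratch directory and check it is usable.

    Returns a list of problems; an empty list means the restore passed.
    """
    problems = []
    with tempfile.TemporaryDirectory() as scratch:   # stand-in for ephemeral compute
        restored = os.path.join(scratch, "restored.db")
        shutil.copyfile(backup_path, restored)        # the "restore" step
        conn = sqlite3.connect(restored)
        try:
            # Catches corrupted backups.
            status = conn.execute("PRAGMA integrity_check").fetchone()[0]
            if status != "ok":
                problems.append(f"integrity check failed: {status}")
            # Catches missing tables.
            rows = conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
            tables = {row[0] for row in rows}
            for name in sorted(set(expected_tables) - tables):
                problems.append(f"missing table: {name}")
            # Catches tables that restored but came back empty.
            for name in sorted(set(expected_tables) & tables):
                count = conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
                if count == 0:
                    problems.append(f"table is empty: {name}")
        except sqlite3.DatabaseError as exc:
            problems.append(f"backup is not a readable database: {exc}")
        finally:
            conn.close()
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        # Build a small illustrative backup to test against.
        backup = os.path.join(workdir, "nightly.bak")
        conn = sqlite3.connect(backup)
        conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
        conn.execute("INSERT INTO customers (name) VALUES ('example')")
        conn.commit()
        conn.close()
        print(restore_and_verify(backup, ["customers", "orders"]))
```

A passing result here validates data integrity only. As described above, it says nothing about whether the failover path can serve users or whether the business knows what to do once the data is back.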

The progressive testing ladder

Beyond the three test types, there is a maturity progression that organizations move through over time. This is the same ladder I describe in our SAMA CSF post, applied here as a testing curriculum.

Level 1 — Plan review. A team reads the current plan together, identifies inaccuracies, and updates contact information and procedures. The minimum viable form of testing. Should happen quarterly. Catches plan decay before it becomes dangerous.

Level 2 — Tabletop exercise. A structured discussion-based walkthrough of a scenario, two to four hours, with business and IT in the same room. Identifies decision-making and communication gaps. Should happen at least annually.

Level 3 — Functional drill. A technical test of a specific recovery capability — a real backup restore, a real database failover, a real switch to a manual process. Validates individual recovery procedures. Should happen at least annually for each critical system. This is the bar that the SAMA CSF maturity work calls Level 3, and that I'd consider the minimum standard for Tier 0 systems after the March 2026 Gulf cloud incident.

Level 4 — Coincident-scenario simulation. A realistic, multi-hour or multi-day exercise combining technical failure with cyber pressure, degraded monitoring, and time pressure. Tests the full chain from detection through decision through recovery. Few organizations reach this level. The ones that do are meaningfully better prepared than their peers.

As a business leader, the goal is to participate in Level 2 annually and to push for Level 3 on your function's most critical processes. The difference between an untested plan and a Level 3-tested plan is the difference between hoping and knowing.
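The cadences in the ladder can be tracked with very little machinery. The sketch below assumes a simple record of when each test type last ran for each scope. The function name, the scope labels, and the dates are hypothetical, and "quarterly" is approximated as 92 days. Level 4 is left out because the ladder does not state a cadence for it.

```python
from datetime import date, timedelta

# Maximum days between tests, taken from the ladder above.
MAX_INTERVAL_DAYS = {
    "plan_review": 92,        # Level 1: quarterly (approximation)
    "tabletop": 365,          # Level 2: at least annually
    "functional_drill": 365,  # Level 3: at least annually per critical system
}


def overdue_tests(last_run, today):
    """Return (test_type, scope, detail) for every test past its cadence.

    last_run maps (test_type, scope) to the date of the last test,
    or to None if the test has never been run.
    """
    overdue = []
    for key in sorted(last_run):
        test_type, scope = key
        when = last_run[key]
        limit = timedelta(days=MAX_INTERVAL_DAYS[test_type])
        if when is None:
            overdue.append((test_type, scope, "never run"))
        elif today - when > limit:
            overdue.append((test_type, scope, f"{(today - when).days} days ago"))
    return overdue


# Illustrative data only.
history = {
    ("plan_review", "org"): date(2026, 2, 1),
    ("tabletop", "org"): date(2024, 11, 15),
    ("functional_drill", "core-db"): None,
    ("functional_drill", "payments"): date(2025, 9, 1),
}
for item in overdue_tests(history, today=date(2026, 4, 11)):
    print(item)
```

With this sample data, the never-run drill on `core-db` and the stale tabletop are flagged, while the plan review and the `payments` drill are within cadence.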

Tabletop exercises and what business leaders actually do

Three misconceptions keep business leaders out of tabletops.

"Tabletops are a technical exercise." They aren't. Tabletops test decision-making, communication, and coordination — not system recovery. The participants who matter most are the people who make operational decisions under pressure, which usually means business leaders. IT engineers are useful as technical resources but are not the main audience.

"I'll be exposed as technically ignorant." The questions are about business judgment: "Which customers do we contact first?" "What do we tell the board?" "Which deadline can we slip?" "Who has authority to make this decision?" These are exactly the questions business leaders are paid to answer. Practicing them in a low-stakes environment is the point.

"Tabletops are optional if we have a real DR plan." They aren't. The tabletop is the only place where you find out whether the plan survives contact with reality. It's where you discover that the escalation tree has gaps, that two teams think the same decision is the other team's job, that the customer-notification process depends on a system that would be down during the exact scenario where you need to notify customers.

If your organization runs tabletops, your job as a business leader has three parts.

Before. Review your MVB Canvas if you have one — the exercise tests whether your priorities survive simulated pressure. Review the current RTOs and RPOs for the systems your team depends on, using the downtime economics framework. Come prepared with a clear understanding of your function's critical processes.

During. At each scenario stage, answer three questions out loud: (1) what does my team do right now, given what we know? (2) who do I need to communicate with, and what do I tell them? (3) what decisions am I authorized to make, and what do I need to escalate? Answer in real time, as if the scenario were real. The goal is not perfect answers; it's surfacing the places where you don't have good answers.

After. The most valuable part happens after the scenario ends. Document every gap, confusion, and missing piece. Schedule one follow-up to verify that each gap has been addressed. A tabletop that surfaces ten gaps and addresses none is worse than no tabletop — it creates documentation of preparedness without the underlying work. A tabletop that surfaces ten and addresses five in the following month is genuinely valuable.

Seven diagnostic questions

You can diagnose your organization's real preparedness in about twenty minutes with seven questions. No technical knowledge required.

1. When was the DR plan last updated? More than twelve months: the plan is almost certainly inaccurate. More than eighteen months: fiction.

2. Does the plan name specific people, and are those people still in those roles? Staff turnover breaks DR plans faster than any other force.

3. Has the plan been tested with an actual restoration drill, and what were the results? "Yes, last year, here are the issues we found and addressed" is great. "I think so?" is concerning.

4. Does the plan cover the SaaS tools my team uses daily? Most organizational DR plans cover on-premises and cloud infrastructure but not SaaS. If your team runs on 30 SaaS tools and the plan mentions five, there are 25 gaps.

5. Have I (as a business leader) ever participated in a DR tabletop or drill? If not, the plan has been tested only against technical infrastructure, not against business priorities.

6. If our primary systems went down right now, does everyone on my team know what to do in the first hour? Ask your direct reports. Most will say no. That's useful information.

7. Do our customer contracts require a tested DR plan, and can we show evidence of testing? Enterprise procurement increasingly includes this. Inability to produce evidence is contractual exposure regardless of technical preparedness.
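Questions 1 and 4 reduce to simple arithmetic, which the following sketch makes explicit. It assumes you can get the plan's last-updated date and two tool lists: what the team uses and what the plan mentions. The function names and tool names are hypothetical, and the month thresholds are approximated in days.

```python
from datetime import date


def plan_age_verdict(last_updated, today):
    """Apply the question 1 thresholds, with months approximated in days."""
    age_days = (today - last_updated).days
    if age_days > 548:      # roughly eighteen months
        return "fiction"
    if age_days > 365:      # roughly twelve months
        return "almost certainly inaccurate"
    return "updated within twelve months"


def saas_gaps(tools_in_use, tools_in_plan):
    """Question 4: tools the team relies on that the plan never mentions."""
    return sorted(set(tools_in_use) - set(tools_in_plan))


# Illustrative data only: 30 tools in use, five of them in the plan.
in_use = [f"tool-{n:02d}" for n in range(1, 31)]
in_plan = in_use[:5]
print(len(saas_gaps(in_use, in_plan)))                        # 25
print(plan_age_verdict(date(2024, 9, 1), date(2026, 4, 11)))  # fiction
```

The other five questions depend on conversations rather than data, so a script cannot answer them for you.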

The conversation with IT

After running the diagnostic, schedule a collaborative conversation with your IT leader or CISO.

Frame it as understanding, not audit: "I want to know how ready we actually are for the systems my team depends on. Can we walk through this together?"

Ask to see the current DR plan for your top three systems. Note the last-updated date, listed owners, stated RTO/RPO, and documented test history.

Ask the specific question: "When was the last time we actually tested recovery for these systems, and what did we learn?" If the answer is "we don't do that kind of testing" or "it's been a while," you've found the gap.

Ask whether you can participate in the next tabletop. If none is scheduled, ask whether one can happen in the next 90 days. Offer to help design the scenario based on your function's critical processes.

End with a 30-day follow-up commitment. Calendar invite, specific agenda. Leaders who follow up drive change. Leaders who don't follow up get the plan on paper and the failure in practice.

Tradeoffs and honest limitations

Tests have a cost. Level 3 functional drills consume real engineering time and real change windows, and they create real (if controlled) production risk. The cost is meaningful, but it's lower than the cost of discovering the gap during a real incident.

Tests can produce false reassurance too. A test that's been carefully scripted and rehearsed in advance demonstrates that the rehearsed scenario works. Real incidents rarely match the rehearsed scenario. Variation in scenarios across exercises is what surfaces the gaps that scripted tests miss.

Frequency matters less than honesty. One Level 3 drill per year, conducted with realistic conditions and honest reporting, beats four scripted Level 2 exercises that are designed to pass.

A practical takeaway

If your organization has a DR plan but you cannot answer the seven diagnostic questions confidently, the highest-leverage 30-day project is asking those questions out loud, in writing, to your IT leader. The conversation that follows is usually more productive than any number of strategy sessions about resilience in the abstract.

For the prioritization work that should sit upstream of testing, see our MVB post. For the SaaS dependency layer that most plans miss, see our dependency mapping post. For the regional regulatory context, see our SAMA CSF guide. If you'd like outside facilitation for a tabletop or Level 3 drill, Dataring's resilience practice runs these regularly. Get in touch.