BCDR

The First 72 Hours: Operational Lessons from the March 2026 Gulf Cloud Incident

Mahesh Chandran

CEO, Dataring

In the months since the March 2026 Gulf cloud incident, most of the public discussion has been about architecture: which patterns held up, which didn't, and what GCC organizations should change. That conversation is well covered, including in our pillar guide on cloud DR in the GCC.

This post is about something else. The architectural lessons matter, but the operational lessons — about the decisions teams had to make in real time, often with incomplete information and degraded tooling — are where I think the more durable learning sits. They are also less talked about, because they are less comfortable to write about. Most of the operational mistakes I've heard described in retrospectives weren't architecture problems. They were decision-making problems under conditions the runbook hadn't anticipated.

The thesis

Disaster recovery runbooks are written for the scenarios people thought to write down. The decisions that matter most in a real incident are the ones that fall outside the runbook — where the team has to act before they have full information, with tools that are partially broken, against a clock they don't control. These decisions cluster into three distinct windows in the first 72 hours, and good incident response treats each window differently.

What follows is a generic, decision-focused timeline of those three windows. Specific incident details (which facilities, which timestamps, which providers) are intentionally generalized; for those, see public reporting from established cybersecurity research outlets and provider post-incident communications.

The Three Decision Windows

When I debrief teams on incidents like this one, I break the first 72 hours into three distinct windows. Each window has different information available, different decisions to make, and different characteristic failure modes. Treating them as one continuous incident is one of the most common mistakes I see.

Window 1: Triage (roughly 0–2 hours)

What's available: Partial monitoring data. Conflicting signals. Reports from staff and external feeds. The provider's status page lagging behind reality.

The decisions: Is this real? How wide is the impact? Who needs to be told, in what order? Do we declare an incident, and at what severity?

Characteristic mistakes:

Anchoring on the provider's status page. Status pages are routinely behind the actual situation by 30–90 minutes during a fast-moving event, because they are themselves dependent on the systems being affected, and because providers are appropriately cautious about what they publish. Teams that treat the status page as ground truth lose time waiting for confirmation that may not come.

Under-declaring severity to avoid disruption. Declaring SEV-1 wakes people up and triggers expensive processes. Teams that under-declare to avoid the disruption end up with the same disruption later, plus a delayed start.

What good teams did differently: Treated their own monitoring as primary signal, the provider's status page as a corroborating data point, and external news/peer reports as a tertiary signal. Declared incident severity based on what was observable, not on what the provider had confirmed.
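
One way to make that signal ordering concrete is to write it down as a rule before the incident. The sketch below is illustrative only: the `Signals` fields, the `declare_severity` function, and every threshold are assumptions of mine, not something taken from a specific team's runbook. The point it shows is that the team's own monitoring drives the severity, and provider or external confirmation can raise it but never gates it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    # Primary signal: fraction of our own health checks failing (0.0 to 1.0).
    own_checks_failing: float
    # Corroborating signal: has the provider's status page acknowledged an issue?
    provider_acknowledged: bool
    # Tertiary signal: count of independent peer or news reports.
    external_reports: int


def declare_severity(s: Signals) -> str:
    """Pick a severity from what is observable, not from provider confirmation.

    All thresholds are hypothetical and would need tuning per organization.
    """
    corroborated = s.provider_acknowledged or s.external_reports >= 2

    # Our own monitoring alone is enough to declare the highest severity.
    if s.own_checks_failing >= 0.5:
        return "SEV-1"
    if s.own_checks_failing >= 0.2:
        # Corroboration can raise severity; its absence never lowers it.
        return "SEV-1" if corroborated else "SEV-2"
    # Our systems look healthy, but others report trouble: watch closely.
    if corroborated:
        return "SEV-3"
    return "NONE"


# Example: half our checks are failing and the status page still shows green.
print(declare_severity(Signals(0.5, provider_acknowledged=False, external_reports=0)))
```

The design choice worth noting is that a silent status page appears nowhere as a reason to wait.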

Window 2: Commit (roughly 2–24 hours)

What's available: Confirmation that the impact is real and not transient. Some idea of scope. Increasing pressure from leadership and customers. Still incomplete information about duration.

The decisions: Are we failing over to DR or waiting for primary recovery? Who has the authority to make that call? What's the rollback path if we fail over and the primary comes back?

Characteristic mistakes:

Waiting for full information before committing. The information you want ("how long until the primary is back") is the information the provider can rarely give in the early hours of a major incident. Teams that wait for it are committing to wait, even if they don't realize it. Waiting is itself a decision, and it carries a cost comparable to failing over, sometimes a higher one.

Authority that lives with people who are unreachable. If the failover decision requires sign-off from a CTO who is on a flight, in a conflict zone, or unreachable for any reason, the failover does not happen on time. Several teams I've spoken with discovered during March 2026 that their decision authority chain assumed business-hours availability that did not hold.

No rollback plan. Failing over to DR is reversible only if you've planned the reverse path. Teams that committed to failover without a documented rollback ended up running from DR longer than necessary, because they were afraid of going back.

What good teams did differently: Pre-defined the trigger conditions for failover before the incident, with named decision-makers who had pre-delegated authority. Rehearsed both the failover and the rollback. Made the call inside Window 2 even with imperfect information, because they had practiced making decisions on partial data.
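
Pre-defined triggers and pre-delegated authority can be written down as data rather than left as tribal knowledge. The following is a minimal sketch under my own assumptions: the `IncidentState` fields, the four-hour default, and the names in the example chain are all hypothetical. It deliberately leaves the provider's recovery estimate out of the trigger, since that is the input teams tend to wait for and rarely get.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DecisionMaker:
    name: str
    reachable: bool


@dataclass(frozen=True)
class IncidentState:
    production_path_down: bool
    hours_since_onset: float


def trigger_met(state: IncidentState, max_wait_hours: float = 4.0) -> bool:
    """Hypothetical trigger: production is down and the agreed wait has elapsed."""
    return state.production_path_down and state.hours_since_onset >= max_wait_hours


def failover_caller(
    state: IncidentState,
    chain: Sequence[DecisionMaker],
    max_wait_hours: float = 4.0,
) -> Optional[str]:
    """Return who makes the failover call, or None if the trigger is not met.

    The chain is ordered by delegation: the first reachable person decides.
    """
    if not trigger_met(state, max_wait_hours):
        return None
    for person in chain:
        if person.reachable:
            return person.name
    # A chain with nobody reachable is a planning failure; surface it loudly.
    raise LookupError("trigger met but nobody in the delegation chain is reachable")


# Example chain (names are placeholders): authority starts with the on-call
# incident commander, not with the executive who may be on a flight.
chain = [
    DecisionMaker("on-call incident commander", reachable=True),
    DecisionMaker("deputy incident commander", reachable=True),
    DecisionMaker("CTO", reachable=False),
]
print(failover_caller(IncidentState(True, 5.0), chain))
```

Running this against your real roster at an inconvenient hour is a cheap way to find out whether the chain holds.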

Window 3: Operate (roughly 24–72 hours)

What's available: Stable operation in DR (assuming Window 2 went well). Better information about the primary. Time pressure has shifted from "decide now" to "sustain this for as long as needed."

The decisions: What stays running, what stays paused? What gets recovered in what order? When do we plan to go back, and under what conditions? How do we sustain on-call for a multi-day operation?

Characteristic mistakes:

Treating Window 3 like an extended Window 2. The pace of decisions slows. The team's adrenaline drops. Procedures that worked under Window 2 pressure (improvised communication channels, ad-hoc decision-making) don't sustain for multiple days. Teams that don't transition to a more durable operating mode burn out their incident commanders within 36 hours.

Underestimating what's missing in DR. Pattern A and Pattern B (see our pattern guide) typically protect the production path. They often don't protect ancillary systems: internal admin tools, scheduled jobs, reporting pipelines, third-party integrations, monitoring dashboards. In Window 3, the gaps surface. Teams discover that a critical batch job, a reconciliation report, or a partner-facing API is broken because it was hosted somewhere that wasn't covered.

Going back too early. The pressure to return to primary increases over time. Teams that go back before the primary's stability is verified end up failing over again — which is roughly twice as expensive as staying in DR until the primary is confirmed stable.

What good teams did differently: Set up a sustainable operating cadence — shift rotations, daily decision points, written status reports — within the first 24 hours of being in DR. Maintained a running list of "things that are missing or broken in DR" and worked it down over the operation. Set explicit criteria for when to return to primary, and held the line on those criteria.
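
Explicit return criteria are easier to hold the line on when they are checked mechanically. This sketch is illustrative, not a recommended policy: the `PrimaryStatus` fields and the thresholds (24 hours of stability, 60 seconds of replication lag) are assumptions I picked for the example. It returns the list of unmet criteria, so the daily decision point has something concrete to review.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PrimaryStatus:
    hours_stable: float               # time since the last observed fault on primary
    provider_declared_resolved: bool  # provider has formally closed the incident
    replication_lag_seconds: float    # how far primary is behind the DR data
    rollback_rehearsed: bool          # the return path has been exercised


def unmet_return_criteria(
    p: PrimaryStatus,
    min_stable_hours: float = 24.0,
    max_lag_seconds: float = 60.0,
) -> List[str]:
    """List every criterion still blocking a return to primary.

    An empty list means the pre-agreed conditions are satisfied.
    Thresholds are hypothetical defaults.
    """
    unmet: List[str] = []
    if p.hours_stable < min_stable_hours:
        unmet.append(f"primary stable for {p.hours_stable}h, need {min_stable_hours}h")
    if not p.provider_declared_resolved:
        unmet.append("provider has not declared the incident resolved")
    if p.replication_lag_seconds > max_lag_seconds:
        unmet.append("replication lag above threshold")
    if not p.rollback_rehearsed:
        unmet.append("return path has not been rehearsed")
    return unmet


# Example: leadership wants to go back after 6 stable hours.
status = PrimaryStatus(6.0, True, 12.0, True)
for reason in unmet_return_criteria(status):
    print("Not yet:", reason)
```

Because the function reports reasons rather than a bare yes or no, the written status report can quote them directly when pressure to return builds.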

What I take from these three windows

The architectural lessons of March 2026 are real and have been written about extensively. I won't repeat them here. The operational lessons that I think are durable are these three:

Decision authority needs to be pre-delegated and pre-rehearsed. The single highest-leverage thing most teams could improve is moving the failover decision from "the CTO calls it" to "the on-call incident commander calls it within these defined trigger conditions." The first version of that delegation will feel uncomfortable. It is still better than the alternative.

The runbook covers Windows 1 and 2 reasonably well. It almost never covers Window 3. If you have a multi-day operating plan for running from DR — shift schedules, decision rhythm, communication cadence, the criteria for returning — you are in a small minority. Most teams treat Window 3 as "figure it out when we get there." That works for short incidents. It does not work for incidents that last a week.

Tabletop tests rarely cover the human factors. The technical failover can be exercised in a Level 3 test (see our SAMA CSF maturity ladder). The human-factor questions — who calls it, with what information, against what trigger, with what communication — require a different kind of exercise. Most of the gaps I've heard described from March 2026 retrospectives were human-factor gaps that a properly designed Level 4 exercise would have surfaced in advance.

A practical takeaway

If you are sitting down to plan your next BCDR exercise, the most useful thing to add is not another technical scenario. It is a Window 2 decision drill: present the team with a partial-information scenario at an inconvenient hour, with the named decision-maker unreachable, and observe what happens. The results will tell you more about your actual incident response capability than any technical failover test.

For the architectural context behind these patterns, see our cloud DR in the GCC pillar. For the pattern selection logic, see our pattern decision guide. If you'd like help designing a Level 4 exercise that includes Window 2 decision drills, Dataring's resilience practice does this work. Get in touch.