Recovered Doesn't Mean Correct: Why Data Quality Is a BCDR Problem

BCDR

Published Mar 9, 2026

Mahesh Chandran

CEO, Dataring

Most DR programs measure two things: how quickly the system came back (RTO) and how much data was lost in the cutover (RPO). Both are necessary. Neither tells you whether the data that came back is actually correct. In the post-failover reviews I've seen, that question — is the data trustworthy — is the one that's most often unanswered.

This post is about the validation layer that sits between "the system is up" and "the system is operating correctly." It's a layer that BCDR programs routinely under-invest in, and it's the layer that produces the worst surprises in real failover events. For the broader BCDR architecture context, see our cloud DR in the GCC pillar.

The thesis

A failover that brings the system back online with corrupted, inconsistent, or incomplete data is worse than a failover that takes longer but produces correct results. The first puts wrong data into customer-facing systems and downstream processes; the second is just a longer outage. Most DR programs optimize for RTO and RPO and have no measurement of correctness. That gap is where the costliest post-failover incidents live.

How data gets corrupted during failover

In the failover post-mortems I've reviewed, four mechanisms account for most of the data-quality problems.

Replication lag windows

Asynchronous replication has a lag, and the lag is variable. Transactions in flight at the moment of failover may have committed on the primary but not yet replicated to DR. Whether those transactions are recoverable depends on whether they were idempotent, whether the upstream system retains them, and whether the downstream system can reconcile.

In the cleanest implementations, the lag window is bounded and known, the upstream system retains a replay log, and reconciliation is a documented procedure. In the messier ones, the lag window is unmeasured, the upstream retention is unclear, and reconciliation happens by hand from spreadsheets. The difference between the two emerges only after a failover.
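
As a rough illustration of what a documented reconciliation procedure might compute, the sketch below diffs an upstream replay log against the transaction IDs already applied in DR. It assumes every transaction carries a unique ID usable as an idempotency key and that the lag bound is known; `ReplayEntry`, `find_unreplicated`, and `max_lag_s` are hypothetical names for this example, not part of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayEntry:
    txn_id: str          # idempotency key assigned upstream (assumed to exist)
    committed_at: float  # commit time on the primary, in epoch seconds


def find_unreplicated(replay_log, applied_in_dr, failover_at, max_lag_s):
    """Split replay-log entries that are missing from DR into two groups.

    in_window:      committed within max_lag_s of the failover. This is the
                    expected loss; replay is safe if the work is idempotent.
    outside_window: missing but older than the lag bound. The lag was larger
                    than believed, or replication was already broken.
    """
    in_window, outside_window = [], []
    for entry in replay_log:
        if entry.txn_id in applied_in_dr:
            continue  # already present in DR, nothing to reconcile
        if failover_at - entry.committed_at <= max_lag_s:
            in_window.append(entry)
        else:
            outside_window.append(entry)
    return in_window, outside_window


log = [
    ReplayEntry("t1", 100.0),
    ReplayEntry("t2", 195.0),
    ReplayEntry("t3", 199.0),
    ReplayEntry("t4", 150.0),
]
in_window, outside_window = find_unreplicated(
    log, applied_in_dr={"t1", "t2"}, failover_at=200.0, max_lag_s=10.0
)
```

Anything that lands in `outside_window` suggests the lag window was not actually bounded the way the plan assumed, which is worth investigating before replaying anything.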

Schema drift between environments

Production and DR environments are supposed to be identical. They rarely are. Schema migrations applied to production but not yet applied to DR (or vice versa), missing indexes, slightly different default values, divergent stored-procedure definitions — each of these can cause the failover to either fail outright or, more dangerously, succeed with subtle data inconsistencies that aren't visible until the system is operating.

The most common version of this I've seen is a column added to a production table to support a new feature, with the migration not yet propagated to DR. The failover succeeds, the application starts up, and the new feature silently writes nulls or defaults to the wrong column. The application is "up." Customer data is being corrupted in real time.
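
A drift check along these lines can run on a schedule, long before any failover. The sketch assumes each environment's schema has already been exported as a `{table: {column: type}}` dictionary (for example from the database's catalog views); `diff_schemas` and the sample table are hypothetical.

```python
def diff_schemas(primary, dr):
    """Compare {table: {column: type}} snapshots and list every difference."""
    findings = []
    for table in sorted(set(primary) | set(dr)):
        p_cols = primary.get(table)
        d_cols = dr.get(table)
        if p_cols is None or d_cols is None:
            side = "DR" if d_cols is None else "primary"
            findings.append(f"{table}: table missing in {side}")
            continue
        for col in sorted(set(p_cols) | set(d_cols)):
            p_type, d_type = p_cols.get(col), d_cols.get(col)
            if p_type is None:
                findings.append(f"{table}.{col}: column missing in primary")
            elif d_type is None:
                findings.append(f"{table}.{col}: column missing in DR")
            elif p_type != d_type:
                findings.append(
                    f"{table}.{col}: type {p_type} in primary, {d_type} in DR"
                )
    return findings


# The scenario from the paragraph above: a new column exists only in production.
primary_schema = {"accounts": {"id": "bigint", "balance": "numeric", "kyc_tier": "text"}}
dr_schema = {"accounts": {"id": "bigint", "balance": "numeric"}}
drift = diff_schemas(primary_schema, dr_schema)
```

This covers only tables, columns, and types. Indexes, defaults, and stored-procedure definitions would need their own snapshots compared in the same way.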

Encryption key and KMS state mismatches

Encryption keys, key versions, and KMS configurations have to be synchronized across primary and DR. Key rotations applied in primary but not propagated to DR, or DR-region KMS quirks that don't exactly match primary, can cause a failover to come up with data that decrypts incorrectly or partially. The application sees garbage where it expects coherent data.

Timestamp and sequence gaps

Auto-incrementing IDs, timestamp-based ordering, and sequence generators need to be consistent across regions. Failovers that don't carefully manage sequence state can produce duplicate IDs (catastrophic for relational integrity), out-of-order timestamps (catastrophic for audit trails), or sequence gaps (catastrophic for reconciliation). The most expensive post-failover surprises I've heard described in retrospectives sit in this category.

The five layers of post-failover validation

A validation layer worth running has five components, each catching a different class of problem. The order matters; later layers depend on earlier ones being clean.

Layer 1: Baseline comparison

Compare row counts, aggregate sums, and key statistics between the last known-good primary state and the recovered DR state. Differences within the expected RPO window are normal; differences outside it are signal. This is the cheapest layer and catches the largest class of problems (gross replication failure, partial restore, missed tables).

In practice: a one-page report that lists each Tier 0 dataset, the row count and aggregate signature in primary at last replication, and the row count and aggregate signature in DR after failover. Differences flagged automatically. Most institutions don't have this; building it is a one-week project that pays back the first time it's used.
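
A minimal sketch of such a report, using SQLite as a stand-in for the real database so the example runs on its own. The primary signatures are assumed to have been captured at the last successful replication; `max_missing_rows` is a hypothetical tolerance standing in for "within the expected RPO window", and all table and function names are illustrative.

```python
import sqlite3


def table_signature(conn, table, sum_column):
    """Return (row_count, aggregate_sum) for one table."""
    # Table and column names come from a trusted config list, never user input.
    row = conn.execute(
        f"SELECT COUNT(*), COALESCE(SUM({sum_column}), 0) FROM {table}"
    ).fetchone()
    return row[0], row[1]


def baseline_report(primary_sigs, dr_conn, datasets, max_missing_rows):
    """Compare stored primary signatures with the recovered DR state."""
    report = []
    for table, sum_column in datasets:
        p_count, p_sum = primary_sigs[table]
        d_count, d_sum = table_signature(dr_conn, table, sum_column)
        missing = p_count - d_count
        # Flag DR being ahead of primary, or further behind than tolerated.
        flagged = missing < 0 or missing > max_missing_rows
        report.append((table, p_count, d_count, p_sum, d_sum, flagged))
    return report


# Demo: DR holds 3 of the 5 payments the primary had at last replication.
dr = sqlite3.connect(":memory:")
dr.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, amount INTEGER)")
dr.executemany("INSERT INTO payments (amount) VALUES (?)", [(100,), (100,), (100,)])
primary_sigs = {"payments": (5, 500)}
report = baseline_report(
    primary_sigs, dr, [("payments", "amount")], max_missing_rows=1
)
```

Each tuple in `report` is one line of the one-page artifact: dataset, both row counts, both aggregate sums, and whether the difference needs a human to look at it.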

Layer 2: Referential integrity

For relational data, validate foreign-key relationships, parent-child consistency, and constraint satisfaction. Tables that successfully replicated independently can produce broken references between them if the replication isn't transactionally consistent. Cross-table joins that should yield N rows and yield fewer are the diagnostic signal.

In practice: a set of canonical join queries with expected cardinalities, run automatically post-failover. Mismatches paged immediately.
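
One possible shape for those join checks, again with SQLite standing in for the real database. The check name, the query, and the expected cardinality are hypothetical; in a real suite the expected value would come from the last known-good primary state.

```python
import sqlite3


def run_join_checks(conn, checks):
    """Run each (name, count_query, expected_rows) check; return mismatches."""
    mismatches = []
    for name, sql, expected in checks:
        actual = conn.execute(sql).fetchone()[0]
        if actual != expected:
            mismatches.append((name, expected, actual))
    return mismatches


# Demo: orders replicated fully, but the customers table is one row short.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers (id) VALUES (1), (2);
    INSERT INTO orders (id, customer_id) VALUES (10, 1), (11, 2), (12, 3);
    """
)
checks = [
    (
        "orders_to_customers",
        "SELECT COUNT(*) FROM orders o JOIN customers c ON o.customer_id = c.id",
        3,  # every order should join to exactly one customer
    ),
]
mismatches = run_join_checks(conn, checks)
```

Both tables pass a per-table row-count check here; only the join reveals that order 12 points at a customer that never arrived.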

Layer 3: Sequence continuity

Validate that auto-increment sequences, timestamps, and ordering invariants are intact. The DR system's sequence state should be ahead of any committed primary state, never behind. Timestamps should move monotonically forward, never backward. Each check catches a class of problem that Layers 1 and 2 do not.
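
Both invariants reduce to simple comparisons. The sketch below assumes the sequence's next value and the highest committed ID can be read from the recovered database, and that rows can be scanned in commit order; the function names and sample values are hypothetical.

```python
def check_sequence_state(dr_next_value, max_committed_id):
    """The next value DR will hand out must be beyond every committed ID."""
    return dr_next_value > max_committed_id


def find_backward_timestamps(rows):
    """Return (id, ts, previous_ts) for each row whose timestamp goes backward.

    rows: iterable of (id, timestamp) pairs in commit order.
    """
    violations = []
    previous = None
    for row_id, ts in rows:
        if previous is not None and ts < previous:
            violations.append((row_id, ts, previous))
        previous = ts
    return violations


# Demo: DR's sequence is behind the data, and one audit row steps back in time.
sequence_ok = check_sequence_state(dr_next_value=1001, max_committed_id=1004)
violations = find_backward_timestamps(
    [(1, 100), (2, 105), (3, 103), (4, 110)]
)
```

A sequence that is behind the data will produce duplicate IDs on the first new writes, so this check belongs before any traffic is accepted.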

Layer 4: Encryption and crypto-state validation

Confirm that the DR environment can correctly decrypt a representative sample of encrypted data using the keys available post-failover. This is the layer that catches KMS-state mismatches before they corrupt new writes. Run a sample read of recently-encrypted records; confirm successful decryption; confirm round-trip encryption-decryption with current keys.
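
The check can be written against two injected callables so it does not depend on a particular KMS client. In this sketch `decrypt` is assumed to raise when the key is missing or wrong, as an authenticated cipher would. `ToyKeyring` is a deliberately fake stand-in that only models key-version lookup and performs no real encryption.

```python
def validate_crypto_state(sample, decrypt, encrypt, canary=b"dr-validation-canary"):
    """Return (failed_record_ids, round_trip_ok) for the recovered environment."""
    failed = []
    for record_id, ciphertext in sample:
        try:
            decrypt(ciphertext)
        except Exception:  # assumed: decrypt raises on a missing or wrong key
            failed.append(record_id)
    try:
        round_trip_ok = decrypt(encrypt(canary)) == canary
    except Exception:
        round_trip_ok = False
    return failed, round_trip_ok


class ToyKeyring:
    """Toy stand-in for a KMS client. NOT encryption; models key versions only."""

    def __init__(self, versions, current):
        self.versions, self.current = set(versions), current

    def encrypt(self, plaintext):
        # Prefix the key version, then reverse the bytes as a fake "cipher".
        return bytes([self.current]) + plaintext[::-1]

    def decrypt(self, ciphertext):
        version = ciphertext[0]
        if version not in self.versions:
            raise KeyError(f"key version {version} not available")
        return ciphertext[1:][::-1]


primary = ToyKeyring(versions={1, 2}, current=2)
dr = ToyKeyring(versions={1}, current=1)  # rotation to v2 never reached DR
sample = [
    ("rec-1", primary.encrypt(b"alice")),  # written after the rotation
    ("rec-2", bytes([1]) + b"bob"[::-1]),  # written before the rotation
]
failed, round_trip_ok = validate_crypto_state(sample, dr.decrypt, dr.encrypt)
```

Note that the round trip passes here while a sampled read fails, which is why the sample should be weighted toward recently encrypted records.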

Layer 5: Application-level smoke tests

Run a documented set of read and write operations through the application's normal code paths. Read a customer record. Create and immediately delete a synthetic transaction. Run a representative report. The five-or-so smoke tests should exercise the application's most important data paths end-to-end. Failures here catch the problems that the lower layers miss — application-level state issues, cache coherence, third-party integration health.
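
A small runner keeps the smoke tests named, ordered, and reportable. The checks in this sketch query an in-memory SQLite database only so that the example runs on its own; real checks would go through the application's own API or service layer. All names are hypothetical.

```python
import sqlite3


def run_smoke_tests(tests):
    """Run each (name, check) pair; map name to None on pass, error text on fail."""
    results = {}
    for name, check in tests:
        try:
            check()
            results[name] = None
        except Exception as exc:
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, note TEXT)")
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Test Customer')")


def read_customer():
    row = conn.execute("SELECT name FROM customers WHERE id = 1").fetchone()
    assert row is not None, "customer 1 not found"


def synthetic_transaction():
    # Create a clearly marked synthetic row, then remove it again.
    cur = conn.execute("INSERT INTO transactions (note) VALUES ('SMOKE-TEST')")
    conn.execute("DELETE FROM transactions WHERE id = ?", (cur.lastrowid,))
    remaining = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
    assert remaining == 0, "synthetic transaction was not cleaned up"


def representative_report():
    # The report table is deliberately absent in this demo, so this check fails.
    conn.execute("SELECT COUNT(*) FROM daily_report")


results = run_smoke_tests([
    ("read_customer", read_customer),
    ("synthetic_transaction", synthetic_transaction),
    ("representative_report", representative_report),
])
```

Catching every exception per check means one failure does not hide the results of the others.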

When validation runs

In a Pattern A or Pattern B failover, the temptation is to bring the system online, accept user traffic, and run validation in the background. That ordering is wrong for Tier 0 systems. The right ordering is:

Phase 1 (validation only, no traffic): Layers 1–4 run against the recovered system before any user-facing traffic. If they pass, proceed. If they fail, investigate before opening to users.

Phase 2 (limited internal traffic): Layer 5 runs through the application. A small set of internal users (typically the on-call team) exercises the system end-to-end. Live monitoring watches for anomalies.

Phase 3 (gradual external traffic): Traffic ramps from 1% to 10% to 50% to 100% over a defined window, with continuous monitoring of error rates and data-correctness signals.
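
The Phase 3 ramp can be sketched as a gated loop. Here `set_traffic_percent` and `error_rate` are hypothetical hooks into the load balancer and the monitoring system, and falling back to the last healthy step on a breach is one possible policy, not the only reasonable one. A real ramp would also hold each step for a soak period before reading the error rate.

```python
def ramp_traffic(steps, set_traffic_percent, error_rate, max_error_rate):
    """Step traffic up through `steps`; on a breach, fall back and stop.

    Returns the traffic percentage left in place.
    """
    healthy = 0
    for percent in steps:
        set_traffic_percent(percent)
        if error_rate() > max_error_rate:
            set_traffic_percent(healthy)  # stop the line at the last good step
            return healthy
        healthy = percent
    return healthy


# Demo: the error rate spikes when traffic reaches 50%.
history = []
observed = iter([0.001, 0.002, 0.08, 0.001])
final_percent = ramp_traffic(
    steps=(1, 10, 50, 100),
    set_traffic_percent=history.append,
    error_rate=lambda: next(observed),
    max_error_rate=0.01,
)
```

The same gate can take a data-correctness signal in place of, or alongside, the error rate.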

This ordering trades some RTO for confidence. For most Tier 0 systems, the trade is correct. The few extra minutes of validation are dwarfed by the cost of opening corrupted data to customers.

Tradeoffs and honest limitations

Validation has its own cost. Maintaining a validation suite that exercises Tier 0 data paths is real engineering work, and the suite drifts as the application evolves. Most institutions under-invest in validation maintenance, so the suite that runs in test is no longer the one that should run in a production failover.

Some validation cannot be automated. Some classes of data correctness require human judgment — does this report look right, does this customer record make sense in context. Pure-automation validation reaches a ceiling. The institutions that operate above that ceiling have human-in-the-loop validation for specific high-stakes datasets.

Phase ordering trades RTO for confidence. For some workloads, the RTO budget doesn't have room for full validation before user traffic. In those cases, validation runs in parallel with limited traffic, with clear stop-the-line authority if validation fails. The decision is workload-specific.

Validation is a control, not a guarantee. Even five-layer validation can miss subtle correctness problems. The mitigation is reconciliation that runs continuously after failover (not just at cutover) and that catches problems hours or days later.

A practical takeaway

If your DR plan documents RTO and RPO targets but does not document a validation procedure that runs before user traffic resumes, the highest-leverage 30-day project is writing the Layer 1 baseline-comparison report for one Tier 0 dataset. It's a one-page artifact. It catches the largest class of failover problems. Once one Tier 0 system has it, the pattern scales to the rest at low marginal cost. Layers 2 through 5 follow.

For backup-specific verification (which is upstream of failover validation), see our backup guide. For the SAMA CSF reading on data-team obligations, see our checklist. For pattern selection, see our pattern decision guide. If you'd like help building a validation layer for your DR program, Dataring's resilience practice does this engagement regularly. Get in touch.