
RTO and RPO Are Business Decisions, Not IT Metrics

Mahesh Chandran
CEO Dataring
If you run a business function — operations, finance, customer success, a product line — you've probably been in a meeting where someone from IT mentioned your team's "RTO" or "RPO" and you nodded along. In the BCDR conversations I've been part of, this is a recurring pattern: the technical team treats the numbers as their decision to make, the business team treats them as a technical detail. Both sides are wrong.
Recovery Time Objective and Recovery Point Objective are business commitments with direct financial, operational, and reputational consequences. The only people who can set them correctly are the people who own the business processes they describe. This post is a working framework for thinking about downtime as a business economist would, so the numbers stop being someone else's problem.
For the broader BCDR architecture context, see our cloud DR in the GCC pillar. For pattern selection that follows from these decisions, see our pattern decision guide.
The thesis
Most organizations have RTO and RPO numbers that were set by IT, never re-examined, and don't match the business consequence of an actual outage. The numbers tend to be either uniformly aggressive (everything must be back in fifteen minutes) or uniformly slack (everything can wait a day). Both produce bad architecture decisions, because both ignore the wide variation in what different workloads are actually worth to the business at the moment of failure. Better numbers come from doing the business work, not from arguing about the technical work.
The headline statistics are not your number
If you've read any cybersecurity report, you've seen the per-minute downtime figures. Different sources cite different numbers, and they're usually averages drawn from large samples that include very different industries and company sizes. They're not wrong as averages. They're not your number.
The real cost of downtime for your business varies by orders of magnitude depending on your industry, the function affected, the time of day, and what cycle of work your team is in when the outage hits. A figure from a global stock exchange is irrelevant to a 200-person logistics business in Dubai. The first principle of downtime economics is that every business unit has its own cost profile, and that profile changes by hour and by season.
The point of this post isn't to give you a number. It's to give you the framework to calculate one.
RTO and RPO in plain language
Before the framework, get the two terms right.
Recovery Time Objective (RTO): how long this workload can be unavailable before the consequence becomes unacceptable. Measured in time. A restaurant kitchen during lunch service has an RTO of fifteen minutes; producing a quarterly board pack has an RTO of two days.
Recovery Point Objective (RPO): how much in-flight work you can afford to lose. Also measured in time, but describing data loss rather than downtime. A payroll run has an RPO close to zero: you cannot lose payroll data. A marketing analytics dashboard has an RPO of hours: yesterday's data is fine.
The two are independent variables that often point in different directions. A payroll system might tolerate a 24-hour RTO (people can be paid a day late if needed) while requiring a near-zero RPO (you cannot lose the payment data). A trading dashboard might require a 1-minute RTO and tolerate a 1-hour RPO. Treating them as one number is a category error.
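To make the independence concrete, here is a minimal sketch that records the two targets as separate fields and sorts the same workloads by each one. The class name, the field names, and the choice to model "near-zero" RPO as exactly zero are illustrative assumptions; the workloads and durations are the two examples from the paragraph above.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    """Hypothetical record: one workload, two independent targets."""
    workload: str
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable window of lost data


targets = [
    # Payroll: can be a day late, cannot lose data (near-zero modeled as zero).
    RecoveryTarget("payroll", rto=timedelta(hours=24), rpo=timedelta(0)),
    # Trading dashboard: must be back fast, can tolerate an hour of stale data.
    RecoveryTarget("trading dashboard", rto=timedelta(minutes=1), rpo=timedelta(hours=1)),
]

# Ranking by each target gives opposite priority orders.
by_rto = [t.workload for t in sorted(targets, key=lambda t: t.rto)]
by_rpo = [t.workload for t in sorted(targets, key=lambda t: t.rpo)]

print("Tightest RTO first:", by_rto)
print("Tightest RPO first:", by_rpo)
```

The two lists come out in opposite order, which is the practical reason a single "criticality" score cannot stand in for both numbers.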
The Downtime Impact Equation
When a critical workload goes down, the cost is a composite of five components. Estimating each one separately is what turns a vague concern into a defensible number.
Total Downtime Cost per Hour = Lost Revenue + Lost Productivity + Recovery Costs + Reputational Damage + Regulatory and Contractual Penalties
Component 1: Lost revenue
The most obvious and the most often miscalculated. The right question is not "what did we earn in that hour?" It is "what did we earn in that hour that we cannot earn later?"
Some revenue is deferred (customers come back). Some is permanently lost (cart abandonment during a flash sale, a B2B deal that moves to a competitor because your sales team couldn't access the pipeline for two days). Honest calculations separate the two and only count the unrecoverable portion. Across the engagements I've seen, the unrecoverable share is rarely above 60% and is often well below.
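As a rough sketch of that separation, the function below counts only the share of revenue that will not come back later. The function name and the sample figures are placeholders, not benchmarks; the unrecoverable share is an estimate you supply.

```python
def unrecoverable_revenue(hourly_revenue: float,
                          outage_hours: float,
                          unrecoverable_share: float) -> float:
    """Revenue lost for good: revenue in the window times the share
    that customers will not bring back later (0.0 to 1.0)."""
    if not 0.0 <= unrecoverable_share <= 1.0:
        raise ValueError("unrecoverable_share must be between 0 and 1")
    return hourly_revenue * outage_hours * unrecoverable_share


# Placeholder inputs: $10,000 of revenue per hour, a 4-hour outage,
# and an estimate that a quarter of that revenue never returns.
loss = unrecoverable_revenue(10_000, 4, 0.25)
print(f"Unrecoverable revenue: ${loss:,.0f}")
```

Running the same calculation with a share of 1.0 shows how much the naive "what did we earn in that hour" answer overstates the loss.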
Component 2: Lost productivity
The cost of paying people who can't do their work. A 20-person team idle for four hours costs 80 person-hours of fully loaded salary (benefits and overhead included), plus the catch-up cost when systems return, because productive work rarely resumes the moment the system is back. This is the component leaders most consistently underestimate, and frequently it's the largest line item.
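The arithmetic can be sketched as follows. The hourly rate and the one hour of catch-up per person are assumed values for illustration only; substitute your own fully loaded rate and your own estimate of how long the team takes to get back to speed.

```python
def lost_productivity(headcount: int,
                      idle_hours: float,
                      loaded_hourly_rate: float,
                      catch_up_hours: float = 0.0) -> float:
    """Cost of idle staff plus the per-person catch-up time after systems return."""
    return headcount * (idle_hours + catch_up_hours) * loaded_hourly_rate


# The 20-person, four-hour example, with an assumed $75 fully loaded
# hourly rate and an assumed one hour of catch-up per person.
idle_only = lost_productivity(20, 4, 75.0)
with_catch_up = lost_productivity(20, 4, 75.0, catch_up_hours=1)
print(f"Idle time only: ${idle_only:,.0f}")
print(f"With catch-up:  ${with_catch_up:,.0f}")
```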
Component 3: Recovery costs
What you spend to fix the problem and clean up afterward: overtime to clear backlogs, emergency communications, data reconstruction, expedited vendor services, and the opportunity cost of senior staff spending a week on incident response instead of their actual job. For mid-sized outages, recovery costs frequently exceed direct revenue loss, especially once you count customer concessions in the weeks after.
Component 4: Reputational damage
The hardest to quantify, which is why most calculations ignore it. Ignoring it understates true cost. The signals are: slightly elevated churn in the quarter after the incident, longer sales cycles for prospects who saw the outage in their research, increased discount pressure from customers who now view you as the vendor that went down, and occasional loss of major deals where procurement asks for outage history. You may not be able to estimate this precisely, but you can almost always bound it. "We probably lost one to three enterprise deals" is more useful than an implicit estimate of zero.
Component 5: Regulatory and contractual penalties
For regulated businesses, this is the component most likely to dominate. Customer contracts that commit to 99.9% uptime have penalty clauses; a single outage can trigger them across the entire book. Sectoral regulators (SAMA in Saudi Arabia, CBUAE in the UAE, QCB in Qatar) impose consequences for BCDR failures that can dwarf the operating cost of the business. See our SAMA CSF guide and our GCC requirements comparison for the regional regulatory landscape.
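The five components above can be sketched as a simple record that sums to the total in the Downtime Impact Equation. The class and field names are illustrative, and every figure in the example is a placeholder rather than a benchmark.

```python
from dataclasses import dataclass, fields


@dataclass
class DowntimeImpact:
    """Per-hour estimates for the five components of the impact equation."""
    lost_revenue: float = 0.0
    lost_productivity: float = 0.0
    recovery_costs: float = 0.0
    reputational_damage: float = 0.0
    penalties: float = 0.0  # regulatory and contractual

    def total(self) -> float:
        """Total downtime cost per hour: the sum of all five components."""
        return sum(getattr(self, f.name) for f in fields(self))


# Placeholder per-hour estimates for one workload.
impact = DowntimeImpact(
    lost_revenue=10_000.0,
    lost_productivity=7_500.0,
    recovery_costs=3_000.0,
    reputational_damage=5_000.0,
    penalties=0.0,
)
print(f"Total downtime cost per hour: ${impact.total():,.0f}")
```

Keeping the components as separate fields, rather than one lump sum, makes it obvious which estimate is driving the total and which ones were left at zero.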
The timing multiplier
The same workload, down for the same duration, can cost dramatically different amounts depending on when it happens. A four-hour ERP outage at 3 AM on a Sunday in mid-July might cost a few thousand dollars, mostly recovery effort. The same four-hour outage at 10 AM on the last business day of the quarter, during quarter-end close, can cost orders of magnitude more, because finance can't close the books on time, which delays invoicing, which breaches billing-timeline SLAs, which triggers contractual penalties.
Every business function has a timing multiplier. Finance has month-end and quarter-end. Operations has morning dispatch windows. Customer success has renewal cohorts. Sales has quarter close. Product has release windows. The right output of a downtime calculation is not one number; it's a typical-day number and a peak-period number. The peak-period number is usually three to ten times the typical-day number, and it's the one that should drive RTO requirements.
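A small sketch of the two-number output: one typical-day estimate, scaled by a multiplier for each period that matters to the function. The period names and multiplier values are assumptions chosen to sit inside the three-to-ten-times range mentioned above; the typical-day figure is a placeholder.

```python
# Assumed multipliers per period (illustrative, not measured).
multipliers = {
    "typical day": 1.0,
    "month-end close": 3.0,
    "quarter-end close": 10.0,
}

typical_cost_per_hour = 25_000.0  # placeholder typical-day estimate

# Cost per hour of downtime in each period.
profile = {period: typical_cost_per_hour * m for period, m in multipliers.items()}

# The most expensive period is the one that should drive the RTO requirement.
binding_period = max(profile, key=profile.get)

for period, cost in profile.items():
    print(f"{period:>18}: ${cost:,.0f} per hour")
print("Binding period:", binding_period)
```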
The cost curve
Once you understand the impact equation, the next question is how much to spend to prevent downtime. The inconvenient truth: the cost of recovery protection is not linear.
As a rough order of magnitude, getting from a 24-hour RTO to a 4-hour RTO often adds a meaningful but bounded cost. Getting from 4 hours to 15 minutes can add an order of magnitude on top. Getting from 15 minutes to zero can add another large factor. The exact numbers vary, but the curve is real and worth understanding.
This is why blanket statements like "our DR should protect everything in 15 minutes" are economically unserious. Some workloads are worth that. Most aren't. A well-designed BCDR program protects different workloads at different levels, which is the tiered model that anchors most of the work in this space. See our pattern decision guide for how tiering maps to architecture patterns.
The key point: business leaders are the only people who can place workloads into tiers correctly. IT can build any tier. IT cannot know which tier a workload belongs in without the business saying so. When business leaders don't engage, IT defaults to one of two failure modes: protecting everything at Tier 1 (financially unsustainable) or protecting everything at Tier 3 (leaving critical workloads exposed). I see both, often in the same organization.
A 90-minute exercise
You can do most of this work without IT involvement. Block ninety minutes with your direct reports.
Step 1. Identify your three most critical workflows. Workflows, not systems. For finance, this might be "process vendor payments," "close the monthly books," "generate regulatory filings." For customer success, "manage active escalations," "conduct renewal conversations," "onboard new customers." Pick three.
Step 2. For each workflow, list the systems it depends on. Most depend on more than one. "Closing the monthly books" might depend on the ERP, the expense management tool, the banking portal, and an Excel model on a shared drive.
Step 3. For each system, estimate the impact equation at three durations: 1 hour, 4 hours, 24 hours. All five components. Rough numbers are fine. A rough number is more useful than no number.
Step 4. Apply the timing multiplier. Identify your peak period and recalculate for that window. The peak number is usually the binding one.
Step 5. The phone-tree test. For each critical system: "If this went down right now, who on my team would I call first, and what would I tell them to do?" If you don't have an answer, you haven't thought about the workflow enough.
At the end, you'll have a defensible cost profile for your function's most critical workflows. The most common reaction I see from leaders who do this exercise is mild surprise at the size of the peak-period numbers, followed by a productive conversation with IT.
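Steps 3 and 4 can be sketched as a small table builder. This is a simplification under a stated assumption: it scales one per-hour estimate linearly across the three durations, whereas in practice you would estimate each duration separately, since costs often grow faster than linearly. The function name, component labels, figures, and the multiplier of 5 are all placeholders.

```python
DURATIONS_HOURS = (1, 4, 24)


def cost_profile(hourly_components: dict, peak_multiplier: float) -> dict:
    """Typical and peak cost at each duration.

    Assumes cost scales linearly with outage length (a simplification).
    """
    hourly_total = sum(hourly_components.values())
    profile = {}
    for hours in DURATIONS_HOURS:
        typical = hourly_total * hours
        profile[hours] = {"typical": typical, "peak": typical * peak_multiplier}
    return profile


# Placeholder per-hour estimates for one system behind "close the monthly books".
erp_hourly = {
    "lost_revenue": 2_000.0,
    "lost_productivity": 1_500.0,
    "recovery_costs": 500.0,
    "reputational_damage": 0.0,
    "penalties": 0.0,
}

profile = cost_profile(erp_hourly, peak_multiplier=5.0)
for hours, costs in profile.items():
    print(f"{hours:>2}h  typical ${costs['typical']:>9,.0f}  peak ${costs['peak']:>9,.0f}")
```

Repeating this for each system behind each of the three workflows produces the one-page cost profile the exercise is aiming for.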
The conversation with IT
When you take the output to your IT leader or CISO, the conversation is collaborative. You bring "what matters." They bring "what's possible." The questions worth asking:
What are our current RTOs and RPOs for the systems my team uses? Many leaders are surprised. Plenty of business-critical SaaS tools have default RTOs of 24+ hours that nobody has ever surfaced.
When was the recovery process last tested? A system that has never been through a real recovery exercise has an RTO of "unknown," not whatever the vendor advertises. See our post on DR testing for why this matters.
What would it cost to move this workload to a tighter RTO? Order-of-magnitude answer, not a quote. Does moving from 8-hour RTO to 1-hour RTO cost an extra $20K a year, or $200K a year? The answer determines whether the business case is obvious or requires careful work.
What was our actual recovery time during the last real incident? The single most revealing question. The gap between target and actual is where hidden risk lives.
Do our customer contracts commit us to uptime levels we can't reliably meet? The question that most often exposes the largest gap. A leader who has signed 99.9% uptime commitments while running on systems with unpublished RTOs is sitting on a material liability that the rest of the business doesn't see.
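For the last question, a quick way to see the gap is to convert the uptime percentage into a downtime budget and compare it with the RTO. This sketch assumes uptime is measured over a 30-day month (720 hours); real contracts define measurement windows and exclusions differently, so treat the function names and the window as assumptions.

```python
def allowed_downtime_minutes(uptime_pct: float, period_hours: float) -> float:
    """Minutes of downtime permitted in the period by an uptime commitment."""
    return period_hours * 60 * (1 - uptime_pct / 100)


def rto_exceeds_budget(rto_minutes: float, uptime_pct: float, period_hours: float) -> bool:
    """True if a single outage lasting the full RTO would breach the commitment."""
    return rto_minutes > allowed_downtime_minutes(uptime_pct, period_hours)


HOURS_IN_30_DAY_MONTH = 720  # assumed measurement window

budget = allowed_downtime_minutes(99.9, HOURS_IN_30_DAY_MONTH)
print(f"99.9% uptime allows about {budget:.1f} minutes of downtime per month")

# A 4-hour RTO (240 minutes) against a 99.9% monthly commitment.
print("4-hour RTO breaches the commitment:",
      rto_exceeds_budget(240, 99.9, HOURS_IN_30_DAY_MONTH))
```

Under these assumptions a 99.9% commitment leaves roughly 43 minutes a month, so any system with a multi-hour RTO behind that contract is a breach waiting for its first outage.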
Common mistakes I see
Assuming IT has it covered. IT is generally protecting the infrastructure: servers, databases, network. They are not usually protecting your specific business process, because they don't know exactly how your team uses each system. The gap between "the database is back up" and "my team can do its job" is often hours.
Treating all systems as equally critical. If everything is Tier 1, nothing is. Leaders who refuse to prioritize end up with a budget that can't sustain the protection they claim to need, and the compromises get made under pressure during an actual crisis.
Setting targets once and forgetting them. Business processes change. Two years ago you didn't use that AI chatbot. A year ago your finance team wasn't consolidating data from three new acquisitions. Last quarter you signed a contract with stricter SLA terms than anything before it. Each of those changes should have triggered a review of your RTOs and RPOs. Most never do. Targets set 18 months ago describe a business that no longer exists.
Making the business case upward
Once you have downtime estimates and a sense of the protection gap, you have the raw material for a serious business case. A few principles help it land:
Translate everything into revenue and obligation. "A four-hour outage during renewal season puts $1.2M of at-risk ARR on ice" lands better than "our RTO doesn't match our availability needs."
Pair cost with investment. "Moving this workload from 8-hour RTO to 2-hour RTO would cost approximately $60K annually and protect against an estimated $800K in peak-period exposure" is a conversation. "We might have a downtime problem" is not.
Make it quarterly. Downtime economics belongs in your QBR, not just an annual IT audit. Business context changes constantly; the RTO/RPO portfolio should be reviewed on the same cadence as revenue and headcount.
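The "pair cost with investment" principle reduces to one ratio. The sketch below uses the illustrative figures from that example ($60K annual cost against $800K of estimated peak-period exposure); the function name is hypothetical, and the ratio deliberately says nothing about how likely the outage is, which is a separate judgment.

```python
def exposure_to_cost_ratio(protected_exposure: float, annual_cost: float) -> float:
    """Estimated exposure protected per dollar of annual investment."""
    if annual_cost <= 0:
        raise ValueError("annual_cost must be positive")
    return protected_exposure / annual_cost


ratio = exposure_to_cost_ratio(800_000, 60_000)
print(f"Each $1 of annual spend protects about ${ratio:.1f} of peak-period exposure")
```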
Tradeoffs and honest limitations
Rough numbers are sufficient; precision is not the goal. The point of the impact equation is to put rough magnitudes on the page so prioritization can happen. Spending weeks getting a more precise number is usually less valuable than acting on the rough number now.
The framework assumes you have insight into your customers and operations. If you can't roughly estimate your peak periods, your contractual commitments, or the cost of your team's idle time, the equation produces noise. The fix is to do the operational diligence first.
Tier classification is iterative. The first time you run this, you'll classify some workloads wrong. That's fine. Re-classify after the first real incident or the first test — those events surface the workloads whose business value was misunderstood.
A practical takeaway
If you're a business unit leader, the highest-leverage 90-minute commitment you can make this quarter is the exercise above for your three most critical workflows. The output is a one-page artifact. Take it to your CISO or IT leader and ask the five questions. The conversation that follows is usually more productive than any number of strategy meetings about resilience in the abstract.
For the framework that follows from this work — which architecture pattern fits which tier — see our pattern decision guide. For the broader regional context, see our cloud DR in the GCC pillar. If you'd like a working session focused on your function's critical workflows, Dataring's resilience practice does this engagement regularly. Get in touch.




