BCDR

The Dependencies You Haven't Mapped: SaaS, AI, and Business Continuity

Mahesh Chandran

CEO, Dataring

Ask a business unit leader "how many SaaS tools does your team use?" and the answer is usually in the low teens. Ask their IT team the same question with visibility into actual usage and the number is typically several times higher. Most mid-sized organizations run on a few hundred SaaS applications, a substantial fraction of which were adopted by individual teams without going through a formal review.

Every one of those tools is a dependency. Every dependency is a potential failure point. In the BCDR engagements I've worked on, the SaaS and AI dependency chain is the part most consistently under-mapped — and it's where the more frequent disruptions actually live, even though the rare large incidents get all the planning attention.

This post is about the framework I use to make those dependencies visible. I call it the Dependency Chain Map. For the broader BCDR architecture context, see our pillar post on cloud DR in the GCC. For the related prioritization work, see our MVB post.

The thesis

Most business continuity programs cover the company's own systems — the data centers, the cloud accounts, the applications it builds and operates. They do not usually cover the stack of SaaS and AI services that the business actually runs on day to day. The result is a continuity plan that handles the rare, dramatic failures (a regional cloud outage, ransomware, physical infrastructure loss) and is silent on the routine failures (a CRM down for two hours, an LLM provider degraded for a day, a credential issue that locks the team out of a tool for a morning). The routine failures are more frequent and, in aggregate, more disruptive. Mapping the dependency chain is the work that surfaces them before they happen.

Your business runs on someone else's computers

The shift to SaaS was substantially complete by 2020. Most business operations now run on tools hosted, maintained, and operated by third parties: Salesforce or HubSpot for CRM, Slack or Teams for communication, Notion or Confluence for documentation, DocuSign for contracts, NetSuite or QuickBooks for finance, Workday or BambooHR for HR, Asana or Monday for project management, Zoom for meetings, and dozens of specialized tools beneath those.

The value proposition is obvious: you don't have to run the infrastructure or employ the engineers needed to keep it running. The trade-off is rarely discussed: you have given up direct control over the systems your operations depend on. Your ability to function depends on the reliability of vendors you don't manage, running on infrastructure you don't see, in regions you may not know about.

For most leaders, this is invisible until something breaks — and things break more often than people think. Major SaaS providers have serious outages multiple times per year. Cloud providers have regional incidents regularly. As AI services become embedded in everyday tools, the reliability ceiling is dropping, because AI services from even the best providers deliver meaningfully lower uptime than mature SaaS.

The Dependency Chain Map

The map is a four-level structure that takes a business process and walks down to the underlying infrastructure that makes it work.

Level 1 — Business process. The work the team actually does. "Generate monthly customer invoices." "Respond to Tier 1 support tickets." "Close an enterprise sale." "Execute the quarterly close."

Level 2 — SaaS tools. The tools each process touches. Monthly invoicing might depend on the ERP for billing data, the CRM for customer records, a tax service for rate calculations, the banking portal for payment confirmation, and an email tool to send the invoices.

Level 3 — Cloud provider. Each SaaS tool runs on underlying cloud infrastructure. Salesforce runs primarily on AWS. HubSpot runs across AWS and GCP. Microsoft 365 runs on Azure. Most vendors publish this in their trust portals. If multiple tools you depend on share a cloud provider, you have concentration risk — one cloud incident can break many tools at once.

Level 4 — Geographic region. Within each provider, data and compute live in specific regions. If your tools all serve their EMEA customers from the same AWS Ireland or Frankfurt region, a single regional incident can take down nominally separate vendors simultaneously.

Walking this chain for one critical process usually produces an uncomfortable realization: the team's work depends on more things than people thought, and more of those things share underlying infrastructure than people realized.
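The four levels can be written down as a small data structure. The sketch below is a minimal illustration, not a prescribed tool: the `Tool` class, the `PROCESS_MAP` contents, and the provider and region labels (`cloud-a`, `eu-west`, and so on) are all hypothetical placeholders, not real hosting facts. It inverts the map to show which processes share a provider and region.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """Levels 2-4 for one SaaS tool: the tool, its cloud provider, its region."""
    name: str      # Level 2
    provider: str  # Level 3
    region: str    # Level 4


# Level 1 -> Level 2: each business process and the tools it touches.
# Every provider and region below is a made-up placeholder.
PROCESS_MAP = {
    "Generate monthly customer invoices": [
        Tool("ERP", "cloud-a", "eu-west"),
        Tool("CRM", "cloud-a", "eu-west"),
        Tool("Tax service", "cloud-b", "eu-central"),
        Tool("Banking portal", "cloud-c", "me-central"),
        Tool("Email tool", "cloud-a", "eu-west"),
    ],
    "Respond to Tier 1 support tickets": [
        Tool("Helpdesk", "cloud-a", "eu-west"),
        Tool("CRM", "cloud-a", "eu-west"),
    ],
}


def processes_by_region(process_map):
    """Invert the map: (provider, region) -> set of processes that depend on it."""
    exposure = defaultdict(set)
    for process, tools in process_map.items():
        for tool in tools:
            exposure[(tool.provider, tool.region)].add(process)
    return dict(exposure)


for (provider, region), processes in sorted(processes_by_region(PROCESS_MAP).items()):
    print(f"{provider}/{region}: {len(processes)} process(es) exposed")
```

A spreadsheet with the same four columns works just as well. The point of the inversion is that shared infrastructure only becomes visible when you read the map from the bottom up.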

The lesson of the July 2024 CrowdStrike incident

On July 19, 2024, a faulty configuration update from CrowdStrike crashed Windows devices worldwide in a single morning. Microsoft estimated that about 8.5 million devices were affected, and public reporting put the global cost in the multi-billion-dollar range, with airlines, hospitals, payment terminals, and broadcast operations all hit. The cause was a routine update file, not an attack.

The lesson for business leaders is not "CrowdStrike is bad." The lesson is that a single trusted vendor pushing a routine update can take down critical operations for thousands of organizations simultaneously, and the failure cascades through dependency chains that most affected organizations had never mapped. Few of the airlines, hospitals, and banks that were hit are likely to have had CrowdStrike on their disaster risk registers, because it was a background vendor doing routine work. The dependency was there, invisibly, and the consequences of its failure were enormous.

Every organization has vendors like this — tools that feel routine and invisible until they fail in unexpected ways.

SaaS outage scenarios worth thinking through

Rather than try to protect against every possible failure, walk through a small number of realistic scenarios and ask what your team would actually do. The goal isn't to solve each scenario; it's to discover the ones you have no answer for.

Slack or Teams down for 24 hours. Where does your team communicate? How do managers coordinate? How do customer escalations get routed? For most modern organizations the honest answer is "nowhere, we can't." There's usually a token email fallback that nobody actually uses, and most teams haven't communicated primarily by email in years.

Salesforce down for 8 hours during quarter-end week. The sales team can't see pipeline, update opportunities, access contact details, pull quotes, or run the reports leadership is asking for. What do they do? How do you communicate to leadership and customers? Which deals are at risk? How do you reconstruct what happened during the outage once systems return?

Google Workspace or Microsoft 365 down for 6 hours. Email, calendar, shared documents, and video conferencing freeze together. This is rarer than the other scenarios, but it has happened. How does the team operate?

Payment or billing provider down for 2 hours during business hours. Transactions fail, customers see errors, revenue stops. What do you tell customers? How do you estimate the impact? What about transactions in flight when the outage started?

In the workshops I've run, walking through scenarios like these for thirty minutes each almost always reveals gaps the team had no idea existed. Those gaps are the point of the exercise.

The AI dependency layer

A newer and less-understood category of dependency is AI services. Over the past two years AI capabilities have been embedded into nearly every SaaS tool: Salesforce Einstein, HubSpot ChatSpot, Microsoft Copilot, Google Gemini, Notion AI, Slack AI summaries. Many organizations also use dedicated AI platforms directly — OpenAI, Anthropic, Google's Gemini API, Azure OpenAI — for support automation, content generation, internal chatbots, document analysis, and decision support.

Two characteristics make AI dependencies especially worth mapping.

AI services are typically less reliable than mature SaaS. The major LLM providers' measured uptime varies, but it tends to run lower than the cloud-VM baseline most teams are used to. Small differences compound: over a 30-day month, 99.3% uptime allows roughly five hours of downtime, about seven times the 43 minutes a 99.9% service allows. Teams building critical workflows on AI services are often building on infrastructure that is close to an order of magnitude less reliable than what they're accustomed to. Check a specific provider's uptime on its status page rather than estimating it.

AI dependencies are often invisible. If a customer support team uses a tool that generates draft responses with AI, many users don't think of themselves as "using AI" — they think of themselves as using their normal support tool, which happens to suggest responses. When the underlying AI service degrades, the support tool's experience degrades in confusing ways: suggestions stop appearing, or appear but are bad, or the tool throws intermittent errors. The team doesn't know they've lost an AI service because they didn't know they were using one.

The practical question: for each workflow that benefits from AI, what is the manual fallback? How long does the manual version take compared to the AI-assisted version? Has anyone tested the fallback recently? In most engagements I've seen, the honest answers are "unclear," "much longer," and "no."
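The uptime comparison earlier in this section is simple arithmetic, and it is worth running with your own vendors' numbers. The sketch below assumes a 30-day month. The function name and the constant are illustrative, and the two percentages are the ones used in the text, not measurements of any particular provider.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def monthly_downtime_hours(uptime_percent: float) -> float:
    """Expected downtime per 30-day month implied by an uptime percentage."""
    return (1 - uptime_percent / 100) * HOURS_PER_MONTH


for uptime in (99.9, 99.3):
    hours = monthly_downtime_hours(uptime)
    print(f"{uptime}% uptime -> {hours:.2f} h/month ({hours * 60:.0f} min)")

ratio = monthly_downtime_hours(99.3) / monthly_downtime_hours(99.9)
print(f"99.3% allows about {ratio:.0f}x the downtime of 99.9%")
```

Under those assumptions, 99.9% works out to about 0.72 hours a month and 99.3% to about 5.04 hours, a ratio of roughly seven. Feed in the measured figure from a provider's status page, not the SLA target.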

A 90-minute exercise to build the map

Step 1. Start with your five most critical processes. Use your MVB Canvas if you have one, or quickly identify five processes the team can't afford to lose for more than a day.

Step 2. For each process, list every SaaS tool it touches. Not just the main tool — all of them. A sales process typically touches CRM, email, calendar, call recording, contract management, e-signature, payment processing, and an analytics dashboard. Eight tools, minimum, for one process.

Step 3. For each tool, look up the underlying cloud provider. Most major vendors publish this in their trust portal. If you can't find it in 60 seconds, send a one-line email to your account manager: "Which cloud provider and region hosts our instance?" They'll know.

Step 4. Look for concentration. Count how many of your critical tools share a cloud provider. If more than half your stack runs on a single provider, you have meaningful concentration risk. If more than half runs in a single region of a single provider, the risk is high. This isn't necessarily bad — it's information — but it should inform your risk register.

Step 5. For each critical tool, fill out a Vendor Continuity Card. One page: vendor name, what it does, which processes depend on it, cloud provider and region, vendor's published RTO/RPO from their SLA, your manual workaround if the vendor is down, who on your team knows the workaround, and when it was last tested. Most fields will be blank the first time. The blanks are the work.

Step 6. Identify the three biggest gaps. Don't try to fix everything. Pick the three dependencies where an outage would be most damaging and where you currently have no plan, and remediate there first.
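Steps 4 to 6 can be kept in a spreadsheet, but a short script makes the logic explicit. The sketch below is one possible shape, not a standard: the `VendorContinuityCard` fields mirror the card described in Step 5, the sample vendors and `cloud-a`/`cloud-b` labels are placeholders, and ranking gaps by the number of dependent processes is my simplifying assumption for "most damaging" (revenue impact or customer exposure may be better measures for your team).

```python
from collections import Counter
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class VendorContinuityCard:
    """One card per critical tool; None marks a field nobody has filled in yet."""
    vendor: str
    purpose: Optional[str] = None
    dependent_processes: Optional[list] = None
    cloud_provider: Optional[str] = None
    region: Optional[str] = None
    published_rto_rpo: Optional[str] = None
    manual_workaround: Optional[str] = None
    workaround_owner: Optional[str] = None
    last_tested: Optional[str] = None  # ISO date string

    def blanks(self):
        """Names of the fields that are still empty."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


def provider_concentration(cards):
    """Step 4: share of cards on each cloud provider (cards with no provider are skipped)."""
    known = [c.cloud_provider for c in cards if c.cloud_provider]
    if not known:
        return {}
    return {provider: n / len(known) for provider, n in Counter(known).items()}


def biggest_gaps(cards, top_n=3):
    """Step 6: cards with no workaround, ranked by number of dependent processes."""
    no_plan = [c for c in cards if c.manual_workaround is None]
    no_plan.sort(key=lambda c: len(c.dependent_processes or []), reverse=True)
    return no_plan[:top_n]


# Placeholder data, not real vendor facts.
cards = [
    VendorContinuityCard("CRM", cloud_provider="cloud-a", region="eu-west",
                         dependent_processes=["sales", "invoicing", "support"]),
    VendorContinuityCard("E-signature", cloud_provider="cloud-a",
                         dependent_processes=["sales"],
                         manual_workaround="Signed PDF by email"),
    VendorContinuityCard("Payments", cloud_provider="cloud-b",
                         dependent_processes=["sales", "invoicing"]),
]

for provider, share in provider_concentration(cards).items():
    flag = "  <- more than half the stack" if share > 0.5 else ""
    print(f"{provider}: {share:.0%}{flag}")
print("Blanks on CRM card:", cards[0].blanks())
print("Start with:", [c.vendor for c in biggest_gaps(cards)])
```

The `blanks()` list is the useful output on a first pass: it turns "most fields will be blank" into a concrete to-do list per vendor.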

Questions to ask before adopting any new SaaS or AI tool

The best time to address SaaS continuity risk is before signing. Once you're locked into a tool, migration is painful and often impossible within the timeframe a crisis allows. Before adoption, get clear answers from the vendor:

Can we export our data in standard, non-proprietary formats? Vendor-only export formats mean you have no real exit plan.

What is the vendor's measured uptime for the past 12 months? Not their SLA target — the actual measured number. Mature vendors publish this on a status or trust page.

Where is our data hosted? Can we choose the region? Especially important for GCC organizations with residency obligations. See our residency guide.

What are the contractual data export procedures and timelines at termination? Specifically: how many days do you have to export after cancellation, and is there a charge?

What happens if the vendor is acquired or shuts down? Most contracts have change-of-control provisions worth reading.

For AI tools: what happens when the model is unavailable? Is there a fallback? What does the degraded experience look like? Has the vendor experienced an extended AI outage, and how did customers cope?

Does the vendor's DR plan cover our data, or only their infrastructure? The question vendors least like answering. The honest answer is often "only their infrastructure" — meaning if the vendor loses your data due to a disaster, their responsibility ends at rebuilding the servers, not at restoring your information. See our post on the gap between vendor SLAs and actual loss coverage.

Making it a habit

A one-time Dependency Chain Map is useful. An ongoing practice is what makes the difference.

Add dependency review to quarterly business planning. When the team adopts a new tool, update the map before adopting. When IT replaces a tool, update the map. When a vendor outage affects the team, document what happened, what the team did, and what would be done differently — then update the Vendor Continuity Card.

This hygiene isn't glamorous. It's the difference between a team that handles outages calmly and a team that scrambles every time a trusted vendor fails.

Tradeoffs and honest limitations

The map is only as accurate as the discovery work behind it. Shadow IT and team-purchased tools are usually missing from any first-draft map. Quarterly reviews and periodic IT-led tool discovery (browser extension audits, expense-report scanning) help close the gap.

Manual workarounds degrade over time. The fallback you documented two years ago for your CRM may rely on a process that has since been automated away. Test workarounds at least annually. Untested workarounds are theory, not capability.

Concentration risk is information, not necessarily a problem to fix. Diversifying across cloud providers is expensive and may not be the right call for most workloads. The point of identifying concentration is to make it a conscious, accepted risk rather than an invisible one.
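The "test workarounds at least annually" rule is easy to check mechanically once test dates are recorded. The sketch below is a minimal example under stated assumptions: the 365-day threshold is my reading of "annually", and the tool names and dates in `last_tests` are invented for illustration.

```python
from datetime import date, timedelta
from typing import Optional

MAX_AGE = timedelta(days=365)  # assumption: "at least annually" means within 365 days


def workaround_status(last_tested: Optional[date], today: date) -> str:
    """Classify a manual workaround by how recently it was exercised."""
    if last_tested is None:
        return "untested"
    if today - last_tested > MAX_AGE:
        return "stale"
    return "current"


# Hypothetical records: tool -> date the manual fallback was last tested.
last_tests = {
    "CRM": date(2023, 3, 1),
    "Support AI drafts": None,
    "E-signature": date(2025, 1, 15),
}

today = date(2025, 6, 1)  # fixed date so the example is reproducible
for tool, tested in last_tests.items():
    print(f"{tool}: {workaround_status(tested, today)}")
```

Anything that comes back "untested" or "stale" is a candidate for the next quarterly review.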

A practical takeaway

If your function has not built a Dependency Chain Map, the homework is concrete: pick your most critical process, walk it through the four levels, and fill out Vendor Continuity Cards for the top three vendors. Two hours total. The output will be more useful than a 50-page continuity plan that doesn't exist yet.

For the prioritization work upstream of dependency mapping, see our MVB post. For the contractual layer downstream, see our vendor contracts post. For the broader regional context, see our pillar post on cloud DR in the GCC. If you'd like outside facilitation for the mapping session, Dataring's resilience practice runs these regularly. Get in touch.