Incident playbook: responding to simultaneous Cloudflare, AWS and platform outages

deployed
2026-01-27
10 min read

Practical runbook and automation recipes for detecting, failing over, and running postmortems on simultaneous Cloudflare, AWS and platform outages in 2026.

When Cloudflare, AWS and your platform simultaneously wobble — a practical incident playbook

Immediate losses in traffic, stalled deployments, and panicked teams — this is the reality SREs and platform engineers dread. In 2026, with tool sprawl and heavy reliance on managed edge and cloud platforms, simultaneous outage signals from Cloudflare, AWS and your application are no longer hypothetical. This runbook gives you step-by-step detection checks, automated failover recipes, and a postmortem template tuned for multi-provider incidents.

Executive summary (read first)

When multiple infrastructure providers spike in outage reports, follow a prioritized, automated workflow: 1) validate impact with independent synthetics and network-level checks, 2) trigger pre-authorized failovers for critical traffic paths (CDN, DNS, origin), 3) enable scoped degradation and feature flags, 4) communicate with stakeholders, and 5) collect evidence for a blameless postmortem. The sections below give runbook steps for 0–120 minutes, automation recipes (Cloudflare API, Route 53, CloudFront, S3, GitOps), and a postmortem template you can plug into your incident tooling.

Why this matters in 2026

Late 2024–2025 saw accelerated adoption of multi-edge architectures, and by 2026 most teams use at least one CDN + a major cloud provider. That reduces latency but increases correlated risk: if an edge provider like Cloudflare hits an issue while AWS API or control planes are degraded, you can lose both traffic routing and origin availability. Industry trends in 2025 also normalized multi-DNS strategies, multi-CDN pre-provisioning, and automated, API-driven failovers — best practices that should be part of your default runbook today.

Incident assumptions and scope

Assume simultaneous signals when you see: third-party status page alerts (Cloudflare, AWS), external outage aggregators (DownDetector, ThousandEyes), and your internal synthetic checks failing from multiple regions. This playbook covers web traffic and API-facing services; adapt the same patterns for data pipelines and background jobs.

Immediate 0–15 minute checklist: Triage and avoid noise

  1. Confirm impact — don’t auto-escalate on a single source.
    • Run two independent synthetics: one via your external synthetic provider (Checkly/Datadog/Uptrends) and one via a raw curl from a remote runner (GitHub Actions or an external EC2/VPS).
    • Run network checks: traceroute, mtr and DNS lookups from multiple public locations (RIPE Atlas, Looking Glasses). Correlate these with edge observability and routing feeds to spot global routing anomalies quickly. A combined probe sketch follows this checklist.
  2. Classify scope
    • Is it localized to a region or global?
    • Are static assets failing while API endpoints work (indicates CDN/edge outage) or is everything down (possible DNS or control plane issue)?
  3. Set incident channel and notify on-call (Slack + PagerDuty). Post a brief status with initial findings. Consider your mass-communication fallbacks and verify email/notification provider health (see guides on handling mass email provider changes).
  4. Lock deployments (disable CI jobs that deploy infra changes) to avoid adding noise.
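
A minimal combined probe for step 1, assuming bash with curl, dig and mtr installed on the runner; the hostname and resolver list are placeholders to adapt:

#!/usr/bin/env bash
# Quick triage probe: HTTP status, DNS answers from several public resolvers, and a short path report.
# example.com and the resolver list are placeholders -- adapt to your critical endpoints.
set -u
HOST="example.com"
URL="https://${HOST}/"

echo "== HTTP =="
curl -s -o /dev/null -w "status=%{http_code} total_time=%{time_total}s\n" --max-time 10 "$URL" || echo "curl failed"

echo "== DNS =="
for resolver in 1.1.1.1 8.8.8.8 9.9.9.9; do
  echo -n "${resolver}: "
  dig +short +time=3 +tries=1 "@${resolver}" "$HOST" A | tr '\n' ' ' || true
  echo
done

echo "== Path =="
mtr --report --report-cycles 5 "$HOST" || traceroute "$HOST"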

Decision matrix — quick actions by impact

Use this decision matrix to pick actions fast:

  • Edge/CDN only (Cloudflare degraded): Activate the alternate CDN and switch DNS/cached assets to the fallback origin or pre-warmed CDN. Pre-warming and resilient edge backends for live sellers and high-traffic endpoints pay off during failover.
  • DNS authoritative provider down (Cloudflare DNS): Switch to pre-provisioned secondary DNS if available; otherwise, use registrar API to rapidly swap NS if supported.
  • AWS API/region degraded: Promote cross-region replicas, failover Route 53 weighted records, or route to multi-cloud origin(s).
  • Both edge and cloud impaired: Enable static-only site mode (pre-built SPA/snapshot on multi-CDN), or post a maintenance page hosted by a small third-party provider or on object storage with pre-signed URLs.

Automated detection recipes

Automation is the force-multiplier during multi-provider incidents. Below are reliable detection building blocks you should have in place.

Synthetic checks across providers

Configure synthetic checks from at least three independent networks: one from your cloud provider, one from a third-party synthetic provider, and one from a consumer network (home ISP vantage point). Use short, frequent probes (<1m interval) for critical endpoints.

# Simple synthetic using curl in a GitHub Actions runner
name: external-synthetic
on:
  workflow_dispatch:
  schedule:
    - cron: '*/5 * * * *'  # every 5 minutes is the shortest GitHub Actions allows; use a synthetic provider for sub-minute probes
jobs:
  probe:
    runs-on: ubuntu-latest
    steps:
      - name: curl homepage
        id: homepage
        run: |
          status=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 https://example.com/)
          echo "http_status=$status" >> "$GITHUB_OUTPUT"
          # fail the job on any non-2xx response so workflow notifications fire
          if [ "${status:0:1}" != "2" ]; then exit 1; fi

Prometheus alerting rules

Raise alerts when synthetics fail in multiple regions or when RUM shows a spike in user-facing error rates.

groups:
- name: synthetic.rules
  rules:
  - alert: MultiRegionSyntheticFailure
    expr: count by (job) (max by (job, region) (probe_success{job="synthetic"}) == 0) >= 2
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Synthetic checks failing in multiple regions"

Network and BGP signals

Correlate DNS failures (NXDOMAIN or SERVFAIL) and sudden increases in TCP RST/timeout rates. Subscribe to BGP monitoring feeds (RouteViews, BGPStream) or use managed services that report global routing anomalies. These routing signals are part of modern cloud-native observability stacks and edge monitoring programs.
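
A quick resolver sweep, assuming dig is available, that inspects the DNS response status (NOERROR, SERVFAIL, NXDOMAIN) rather than just the answer; the hostname and resolver list are placeholders:

# Check DNS response status across several public resolvers.
# SERVFAIL from multiple independent resolvers usually points at the authoritative
# provider, not at any single recursive resolver. Hostname and resolvers are placeholders.
for resolver in 1.1.1.1 8.8.8.8 9.9.9.9 208.67.222.222; do
  status=$(dig +time=3 +tries=1 "@${resolver}" www.example.com A | awk '/status:/ {gsub(",", "", $6); print $6}')
  echo "${resolver}: ${status:-no-response}"
done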

Practical failover recipes

The following recipes assume you've pre-provisioned resources in alternate providers and tested failover. If you haven't, do those preparations now — automated failover without rehearsals is high risk.

1) CDN failover: Cloudflare -> CloudFront (pre-provisioned)

Preconditions: Your origin is reachable directly (public or via VPN), you have a CloudFront distribution ready with the same hostname via Route 53 or a second DNS provider, and TLS certs are available on the fallback CDN.

  1. Automated detection triggers playbook.
  2. Run an automated script that lowers the DNS record TTL (for example to 30 seconds) and switches the A/ALIAS record to the CloudFront distribution.
# example: use the AWS CLI to switch the Route 53 record to the CloudFront distribution
# (Z2FDTNDATAQYW2 is the fixed hosted zone ID used for all CloudFront alias targets)
aws route53 change-resource-record-sets --hosted-zone-id Z123EXAMPLE --change-batch '
{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "www.example.com",
      "Type": "A",
      "AliasTarget": {
        "HostedZoneId": "Z2FDTNDATAQYW2",
        "DNSName": "d111111abcdef8.cloudfront.net",
        "EvaluateTargetHealth": false
      }
    }
  }]
}'
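
Alias records do not carry their own TTL, so the 30-second TTL in step 2 only applies to plain A or CNAME records. Where the hostname is a standard record, a hedged sketch of pre-lowering the TTL ahead of the switch (the CNAME target is a placeholder) could look like:

# Lower the TTL on a standard (non-alias) record ahead of a planned switch
# fallback.example-cdn.net is a placeholder target
aws route53 change-resource-record-sets --hosted-zone-id Z123EXAMPLE --change-batch '
{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "www.example.com",
      "Type": "CNAME",
      "TTL": 30,
      "ResourceRecords": [{"Value": "fallback.example-cdn.net"}]
    }
  }]
}'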

Note: If Cloudflare is your authoritative DNS provider, you must either have a secondary authoritative DNS ready (strongly recommended) or rely on the registrar API to update NS — a slow option. Best practice: store critical records with dual-authoritative DNS (using services that support DNS failover or AXFR secondary setups).

2) DNS outage when Cloudflare is authoritative

DNS provider outages are uniquely disruptive. You need a tested plan to recover without manual registrar interventions wherever possible.

  • Pre-provision secondary DNS: Use a provider that supports zone transfers (AXFR) to keep a warm secondary. Many registrars and secondary DNS services support automated promotion.
  • Registrar automation: If your registrar has an API that allows swapping NS records, automate an approved runbook that executes only under a multi-signal confirmation — and keep in mind domain hijacking and reselling/registrar attack patterns when designing your guardrails.
  • Subdomain delegation: Delegate critical subdomains (api.example.com, static.example.com) to alternate DNS that you control to avoid full-zone NS changes (see the sketch after this list).
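
A minimal delegation sketch, assuming the parent zone (or a warm secondary copy of it) is hosted in Route 53 and that ns1/ns2.alt-dns.example stand in for your alternate provider's nameservers; with Cloudflare as the parent zone, the equivalent NS records would be added via its API or dashboard ahead of time:

# Delegate api.example.com to an alternate DNS provider by publishing NS records.
# ns1/ns2.alt-dns.example are placeholders for your secondary provider's nameservers.
aws route53 change-resource-record-sets --hosted-zone-id Z123EXAMPLE --change-batch '
{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "api.example.com",
      "Type": "NS",
      "TTL": 300,
      "ResourceRecords": [
        {"Value": "ns1.alt-dns.example."},
        {"Value": "ns2.alt-dns.example."}
      ]
    }
  }]
}'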

3) Origin failover (multi-cloud origin)

When AWS origin regions go down, your CDN can route to a secondary origin in another cloud or region. Pre-create origin groups and health checks in your CDN or load balancer.

# Cloudflare API example: toggle origin pool health check (pseudo)
curl -X PATCH "https://api.cloudflare.com/client/v4/accounts/{account}/load_balancers/pools/{pool_id}" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"origins":[{"name":"origin-aws","enabled":false},{"name":"origin-gcp","enabled":true}]}'
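
After toggling, verify rather than assume: a GET on the same pool endpoint plus a quick request through the load balancer hostname confirms which origin is actually serving traffic.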

4) Static snapshot fallback (fastest guaranteed recovery)

If both edge and cloud control planes are impaired, serve a minimal static snapshot: pre-built static HTML hosted on multiple, independent object stores (S3 + Backblaze + Cloudflare R2) and replicated to smaller CDNs. Use a short TTL DNS entry to point to the nearest functioning provider.
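
A rough replication sketch, assuming the AWS CLI with placeholder bucket names, endpoint URL and credentials profile; most S3-compatible stores (Backblaze B2, Cloudflare R2) accept the same commands with an explicit endpoint:

# Replicate the static snapshot to two independent object stores.
# Bucket names, endpoint URL and profile are placeholders.
aws s3 sync ./snapshot/ s3://example-fallback-primary/ --delete

# S3-compatible providers take the same CLI with their own endpoint and credentials
aws s3 sync ./snapshot/ s3://example-fallback-secondary/ --delete \
  --endpoint-url https://s3.example-alt-provider.com \
  --profile alt-object-store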

Communication & stakeholder playbook

People remember how you communicated more than the technical complexity. Follow this flow:

  1. Within 10 minutes: post an incident banner in the status channel, include scope, what you're doing, and ETA for next update.
  2. Every 15 minutes: update external status page and customer-facing channels; avoid guessing — share confirmed facts only. If your primary notification channels fail, refer to playbooks that cover status page and edge routing resilience.
  3. Escalate business-impacting decisions to product/ops leadership with a short options matrix and recommended action.

Evidence collection for postmortem (automate this)

Before you change infrastructure, capture a consistent snapshot (a minimal capture script follows the list below):

  • Export synthetic run logs and timestamps.
  • Collect traceroute, mtr, and DNS query logs from each probe node.
  • Dump active cloud provider API responses and recent audit logs (CloudTrail, Cloudflare Audit Log).
  • Save current DNS zone file and load balancer config (download via APIs).
  • Archive Kubernetes events, controller logs and manifest versions (kubectl get events, kubectl get all -o yaml).
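
A minimal capture script, assuming bash with dig, mtr, the AWS CLI and kubectl available on the responder's machine; the hosted zone ID and hostname are placeholders:

#!/usr/bin/env bash
# Snapshot evidence into a timestamped directory before making changes.
# Z123EXAMPLE and example.com are placeholders; extend with your own providers' APIs.
set -u
DIR="evidence/$(date -u +%Y%m%dT%H%M%SZ)"
mkdir -p "$DIR"

dig www.example.com A                        > "$DIR/dns-lookup.txt" 2>&1
mtr --report --report-cycles 10 example.com  > "$DIR/mtr.txt" 2>&1

aws route53 list-resource-record-sets --hosted-zone-id Z123EXAMPLE > "$DIR/route53-zone.json" 2>&1

kubectl get events --all-namespaces --sort-by=.lastTimestamp > "$DIR/k8s-events.txt" 2>&1
kubectl get all --all-namespaces -o yaml                     > "$DIR/k8s-resources.yaml" 2>&1

tar czf "$DIR.tar.gz" "$DIR" && echo "evidence archived at $DIR.tar.gz"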

Postmortem template & automation

Use a template that enforces timelines, impact quantification, root cause, and corrective actions. Automate population wherever possible.

Title: [Incident ID] - Multi-provider outage: Cloudflare + AWS + Platform
Date: YYYY-MM-DD
Summary: Short summary of what happened
Impact: number of users, minutes of outage, revenue/SLI impact
Timeline: automated pull of alerts, DNS changes, API results (use tooling to populate)
Root Cause: statement (blameless)
Contributing Factors: list
Remediation: short-term fixes and long-term mitigations
Postmortem Owners: names and teams
Follow-ups: Jira tickets with owners and deadlines

Integrate your incident management tool (PagerDuty, Opsgenie) to automatically create the initial postmortem skeleton with time-series links (Grafana dashboards, traces) and attach collected evidence files. Consider pairing those dashboards with cloud-native observability patterns used in high-stakes environments to ensure reliable, auditable telemetry.

Advanced strategies and preventive hardening (2026-forward)

Here are strategies to reduce coupling and blast radius in the next 12–18 months.

  • Design for graceful degradation: architect features to be disabled in an outage (non-essential analytics, long-tail features) using feature flags.
  • Multi-CDN and multi-DNS by default: provision parallel CDNs and DNS providers for critical records and automate failover with pre-tested playbooks.
  • GitOps runbooks: store incident playbooks in version control and implement pull-request-reviewed changes for failover scripts; execute via authenticated automation tokens.
  • OpenTelemetry-first tracing: ensure traces include vendor-neutral IDs and sample spans at the edge so you can stitch requests across providers during postmortem.
  • Pre-authorized control-plane fallbacks: allow a small set of pre-approved automation jobs (with limited scope) that can be executed by incident runbooks without full SSO handshakes — consider the security and auth patterns described in MicroAuthJS adoption guides when designing those fallbacks.

Decision guidance: When to cut traffic vs degrade

Cutting traffic to a failing provider is disruptive but reduces user harm in some cases (for example, if a provider is returning corrupted responses). Use these heuristics:

  • If error rate > 5% and latency > 2x SLO for > 3 regions: failover to alternate CDN or origin.
  • If only non-essential assets fail: continue serving core API with degraded UX. Pre-warmed edge backends and small fallback CDNs can keep core flows alive while static assets are failing.
  • If DNS is failing globally: move to delegated subdomains or secondary DNS immediately.

Example GitOps failover workflow

Use a GitOps repository for emergency manifests. A pre-approved PR, merged by an incident manager, triggers automation that updates DNS/CDN routing. This keeps a clear audit trail.

# pseudo-workflow
1. Open incident-runbook repo
2. Create emergency PR that updates route53/terraform variables to point at fallback
3. A dedicated automation pipeline applies the change when PR merges
4. The pipeline posts status and reverts on manual cancel
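
The apply step of that pipeline might look roughly like the sketch below, assuming Terraform with a hypothetical active_origin variable and a Slack webhook URL injected by the pipeline:

#!/usr/bin/env bash
# Pipeline step run on merge of the emergency PR.
# ACTIVE_ORIGIN and SLACK_WEBHOOK_URL are supplied by the pipeline; names are illustrative.
set -euo pipefail

terraform init -input=false
terraform apply -input=false -auto-approve -var "active_origin=${ACTIVE_ORIGIN}"

# Post the result to the incident channel for the audit trail
curl -s -X POST -H 'Content-Type: application/json' \
  --data "{\"text\":\"Failover applied: active_origin=${ACTIVE_ORIGIN}\"}" \
  "$SLACK_WEBHOOK_URL"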

Real-world heuristics and lessons learned

From response exercises run in late 2025 and early 2026, teams that pre-warmed multicloud origins and practiced DNS delegation had the fastest recovery times. Key lessons:

  • Testing failovers under load is essential; simple smoke tests are not enough.
  • Short DNS TTLs speed recovery but increase DNS query load — balance carefully.
  • Keep incident automation minimal and well-documented to avoid accidental escalation risks. Observability and routing feeds like the ones covered in edge observability programs can save hours of traceroute-based guesswork.

Actionable checklist to implement this week

  1. Provision a warm secondary DNS or delegate critical subdomains.
  2. Create a CloudFront (or other CDN) distribution and TLS cert for your hostnames; keep it ready.
  3. Implement multi-region synthetic checks with network-level probes.
  4. Store an executable postmortem template and automate evidence collection on incident start.
  5. Run a tabletop exercise that simulates Cloudflare + AWS outage and iterate your runbook — include routing and header-level failover tests used in modern cloud-native observability scenarios.

Final takeaways

Preparation beats panic. In 2026, outages that impact multiple major providers are part of operating at scale. The difference between an embarrassing multi-hour outage and a contained event is pre-provisioned fallback, automated detection, and practiced runbooks. Prioritize dual-authoritative DNS, multi-CDN readiness, synthetic diversity, and GitOps-managed emergency changes. Automate evidence collection for reliable postmortems and assign concrete follow-ups to close the loop.

"Failover without rehearsal is a new failure mode." — operational lesson from multiple 2025 tabletop exercises

Call to action

Start a 30‑day resilience program: implement the 5-item checklist above, schedule a failover drill, and adopt at least one automation recipe from this article. If you want a tailored runbook review or Terraform/GitOps templates for your stack, contact our platform engineering team for a workshop and risk assessment.


Related Topics

#incident-response #sre #observability

deployed

Contributor

Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.
