knowledge-base/DR.en.md
Stanislav Hubacek b53714113c new files
2026-06-16 15:47:45 +02:00


🔄 Disaster Recovery and Business Continuity

Terminology

| Abbreviation | Meaning | Description |
| --- | --- | --- |
| RTO | Recovery Time Objective | Maximum time from outage to service recovery |
| RPO | Recovery Point Objective | Maximum acceptable data loss (time since last backup) |
| MTD | Maximum Tolerable Downtime | Total outage duration an organization can survive |
| WRT | Work Recovery Time | Time needed for full operations recovery after IT restoration |
| MTBF | Mean Time Between Failures | Average operating time between two consecutive failures |
| MTTR | Mean Time To Repair | Average time needed to repair a failed component and restore service |
| SLA | Service Level Agreement | Contractual availability commitment |
| SLO | Service Level Objective | Internal availability target |
| SLI | Service Level Indicator | Measured availability value |

Relationship between RTO, RPO, MTD, WRT

Last backup ──── RPO ────► Outage ──── RTO ────► Service running ──── WRT ────► Full operations
                  │                     │                              │
                  ▼                     ▼                              ▼
              Lost data       Time without service           Time to full capacity

       RPO is measured backwards from the outage to the last usable backup.
       MTD = RTO + WRT (max. time the business tolerates)
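
The relationship above can be checked mechanically when targets are agreed with the business. The following is a minimal sketch, assuming durations are expressed as `timedelta` values; the function name `recovery_fits_mtd` and the sample figures are hypothetical and only illustrate the arithmetic.

```python
from datetime import timedelta


def recovery_fits_mtd(rto: timedelta, wrt: timedelta, mtd: timedelta) -> bool:
    """Return True if RTO + WRT stays within the tolerated downtime (MTD)."""
    return rto + wrt <= mtd


# Hypothetical targets: 4 h to restore the service, 2 h to reach full operations.
rto = timedelta(hours=4)
wrt = timedelta(hours=2)
mtd = timedelta(hours=6)

print(rto + wrt)                         # 6:00:00
print(recovery_fits_mtd(rto, wrt, mtd))  # True
```

If the check fails, either the technical targets (RTO, WRT) or the business expectation (MTD) has to change.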

Uptime calculation

Nines table

| Availability | As fraction | Downtime / year | Downtime / month (30 days) | Downtime / week |
| --- | --- | --- | --- | --- |
| 90 % (one nine) | 0.9 | 36.5 days | 72 h | 16.8 h |
| 99 % (two nines) | 0.99 | 3.65 days | 7.2 h | 1.68 h |
| 99.5 % | 0.995 | 1.83 days | 3.6 h | 50.4 min |
| 99.9 % (three nines) | 0.999 | 8.76 h | 43.2 min | 10.1 min |
| 99.95 % | 0.9995 | 4.38 h | 21.6 min | 5.04 min |
| 99.99 % (four nines) | 0.9999 | 52.6 min | 4.32 min | 1.01 min |
| 99.995 % | 0.99995 | 26.3 min | 2.16 min | 30.2 s |
| 99.999 % (five nines) | 0.99999 | 5.26 min | 25.9 s | 6.05 s |
| 99.9999 % (six nines) | 0.999999 | 31.6 s | 2.59 s | 0.605 s |
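
Any row of the table can be reproduced with a few lines of Python. This is a sketch that assumes a 365-day year, a 30-day month, and a 7-day week, which are the period lengths the table values correspond to; the names `PERIOD_HOURS` and `allowed_downtime_hours` are illustrative.

```python
# Period lengths in hours (assumed: 365-day year, 30-day month, 7-day week).
PERIOD_HOURS = {"year": 365 * 24, "month": 30 * 24, "week": 7 * 24}


def allowed_downtime_hours(availability_pct: float) -> dict:
    """Return the allowed downtime in hours per period for an availability in percent."""
    unavailable_fraction = 1 - availability_pct / 100
    return {name: hours * unavailable_fraction for name, hours in PERIOD_HOURS.items()}


row = allowed_downtime_hours(99.9)
print(f"year:  {row['year']:.2f} h")          # year:  8.76 h
print(f"month: {row['month'] * 60:.1f} min")  # month: 43.2 min
print(f"week:  {row['week'] * 60:.1f} min")   # week:  10.1 min
```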

Calculation

Availability = (Total time - Downtime) / Total time × 100 %

Example:
  Year = 365 × 24 × 60 = 525,600 minutes
  Target: 99.9 % → allowed downtime = 525,600 × (1 - 0.999) = 525.6 minutes = 8.76 h

Combined availability (chain of dependencies):
  A_web = 99.9 % (3 nines)
  A_api  = 99.99 % (4 nines)
  A_db   = 99.999 % (5 nines)

  A_total = 0.999 × 0.9999 × 0.99999 = 0.99889 ≈ 99.89 % (less than 3 nines!)

Parallel availability (redundancy):
  A_total = 1 - (1 - A_1) × (1 - A_2) × ... × (1 - A_n)

  Example: 2 servers with 99% availability
  A_total = 1 - (1-0.99) × (1-0.99) = 1 - 0.01 × 0.01 = 0.9999 (99.99 %)
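
Real systems usually mix both formulas: redundant groups are combined in parallel first, and the groups are then chained in series. The sketch below assumes a hypothetical topology (two redundant 99 % web servers in front of the API and DB from the series example) and that failures are independent, which is an idealisation.

```python
from math import prod


def series(*availabilities: float) -> float:
    """All components are needed: multiply the availabilities."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """One working component is enough: 1 minus the product of the failure probabilities."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical topology: redundant web pair -> API -> DB.
web_pair = parallel(0.99, 0.99)              # 0.9999
total = series(web_pair, 0.9999, 0.99999)

print(f"{total:.5f}")                        # 0.99979
```

Even with a redundant web tier, the chain as a whole stays below four nines, because every series dependency still multiplies in.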

Calculator

def uptime_percent_to_downtime(pct, period_days=365):
    """Return allowed downtime in minutes for an uptime given in percent (e.g. 99.9)."""
    total_minutes = period_days * 24 * 60
    allowed_downtime = total_minutes * (1 - pct / 100)
    return allowed_downtime  # minutes

def downtime_to_uptime_percent(downtime_minutes, period_days=365):
    """Return uptime in percent for a downtime given in minutes."""
    total_minutes = period_days * 24 * 60
    return (1 - downtime_minutes / total_minutes) * 100

def combined_availability(availabilities):
    """Combined availability of series-connected components.

    Expects fractions (0.999), not percentages (99.9).
    """
    result = 1.0
    for a in availabilities:
        result *= a  # every dependency multiplies in
    return result

def redundant_availability(availabilities):
    """Availability of redundant (parallel) components.

    Expects fractions (0.99), not percentages (99).
    """
    result = 1.0
    for a in availabilities:
        result *= (1 - a)  # probability that all components are down
    return 1 - result

Calculation fallacies

  • Combined availability is a product, not a sum: every additional serial dependency reduces total availability
  • Redundancy is not free: a standby component only helps if failure detection and failover work (MTTR does not improve automatically)
  • SLA is not a guarantee: providers often calculate SLA as a monthly average, not per incident
  • Measurement is key: without an SLI, an SLO cannot be verified; "unmeasured availability does not exist"
  • Planned maintenance: sometimes counted as uptime, sometimes not (depends on the SLA definition)

DR scenarios

Classification

| Category | Scenario | Typical RTO | Typical RPO | Frequency |
| --- | --- | --- | --- | --- |
| Site | Entire DC / region outage | Hours | Minutes | Low |
| Infrastructure | HW failure (storage, switch, server) | Minutes to hours | Seconds | Medium |
| Software | OS, application, DB failure | Minutes | Seconds | High |
| Data | Data corruption, deletion, cryptolocker | Hours | Backup point | Low to medium |
| Human | Wrong deployment, config change | Minutes to hours | Seconds | Medium |
| Security | Attack, breach, ransomware | Days | Before attack | Low |
| Network | Connectivity outage, DDoS | Minutes to hours | N/A | Medium |
| Cloud provider | Regional outage (AWS, Azure, GCP) | Hours | Minutes | Very low |

Scenario details

Site / Region failure

| Aspect | Description |
| --- | --- |
| Cause | Blackout, fire, flood, earthquake, cloud provider outage |
| Prevention | Multi-AZ architecture, multi-region deployment, active-active |
| Mitigation | Automatic DNS failover (Route53, Azure Traffic Manager), replica in DR region |
| Testing | Game day: shut down primary region, verify automatic failover |

Data corruption / human error

| Aspect | Description |
| --- | --- |
| Cause | Wrong SQL command (DELETE without WHERE), accidentally deleted bucket, bad migration |
| Prevention | RBAC, MFA for destructive operations, change management, SQL peer review |
| Mitigation | Point-in-time recovery (PITR), transaction log replay, immutable backups |
| Testing | Restore backup to isolated environment, verify data integrity |
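
One way to make the "verify data integrity" step concrete is to compare restored files against checksums recorded at backup time. The sketch below assumes such a manifest exists as a mapping from relative path to SHA-256 hex digest; the manifest format and the function names are hypothetical, not part of any particular backup tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large restores do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(restore_dir: Path, manifest: dict) -> list:
    """Return the relative paths that are missing or whose hash differs."""
    problems = []
    for rel_path, expected_hash in manifest.items():
        candidate = restore_dir / rel_path
        if not candidate.is_file() or sha256_of(candidate) != expected_hash:
            problems.append(rel_path)
    return problems
```

An empty result means every file listed in the manifest was restored bit-for-bit. It says nothing about application-level consistency, which still needs its own checks (for example, database integrity checks).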

Ransomware / cyber attack

| Aspect | Description |
| --- | --- |
| Cause | Attack on production systems, data encryption, exfiltration |
| Prevention | Immutable backups (object lock), air-gapped backups, network segmentation |
| Mitigation | Restore from clean backup, rebuild infrastructure from IaC |
| Testing | Regular restore in isolated network, verify backup is not infected |

Prevention — strategies

Backup strategies

| Approach | Description | Use case |
| --- | --- | --- |
| 3-2-1 rule | 3 copies, 2 different media, 1 off-site | Universal |
| 3-2-1-0 | 3-2-1 plus 0 errors after restore (verified by testing) | Enterprise, compliance |
| GFS (Grandfather-Father-Son) | Daily, weekly, monthly rotation | Long-term archive |
| Incremental forever | One initial full backup, then only changes | Large data volumes |
| Reverse incremental | Full + incrementals; the full backup is always the most recent restore point | Fast recovery |
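
The 3-2-1 rule is simple enough to check automatically against a backup inventory. The following is a minimal sketch; the `BackupCopy` structure and its fields are hypothetical, and it assumes the production copy counts as one of the three copies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    location: str   # free-text label, e.g. "production" or "cloud bucket"
    media: str      # media type, e.g. "disk", "tape", "object storage"
    offsite: bool   # True if stored outside the primary site


def satisfies_3_2_1(copies: list) -> bool:
    """At least 3 copies, on at least 2 media types, with at least 1 off-site."""
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


inventory = [
    BackupCopy("production", "disk", offsite=False),
    BackupCopy("local NAS", "disk", offsite=False),
    BackupCopy("cloud bucket", "object storage", offsite=True),
]
print(satisfies_3_2_1(inventory))  # True
```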

Backup methods

| Method | RPO | RTO | Storage | Suitable for |
| --- | --- | --- | --- | --- |
| Full backup | Last full | Full restore time | Large | Small data, weekly |
| Incremental | Last incremental | Full + all incrementals | Small | Large data, daily |
| Differential | Last diff | Full + last diff | Medium | Compromise |
| Snapshot | Snapshot point in time | Seconds | Copy-on-write | VM, storage array |
| Continuous (CDC) | < 1 s | Seconds | Log stream | DB (binlog, WAL) |
| PITR | Any point in time | Depends on volume | Full + WAL | RDS, PostgreSQL, SQL Server |
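
The RTO column for incremental and differential backups follows from which pieces have to be restored. The sketch below computes that restore chain under stated assumptions: a differential contains all changes since the last full backup, an incremental contains changes since the previous backup of any kind, and `taken_at` is a simple day number. The `Backup` type and `restore_chain` function are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    taken_at: int  # day number (hypothetical time unit)
    kind: str      # "full", "incremental" or "differential"


def restore_chain(backups: list, target: int) -> list:
    """Return the backups to restore, in order, to reach the state at `target`."""
    usable = sorted((b for b in backups if b.taken_at <= target),
                    key=lambda b: b.taken_at)
    full_positions = [i for i, b in enumerate(usable) if b.kind == "full"]
    if not full_positions:
        raise ValueError("no full backup at or before the target")

    start = full_positions[-1]
    chain = [usable[start]]
    later = usable[start + 1:]

    # The newest differential replaces everything between it and the full backup.
    diff_positions = [i for i, b in enumerate(later) if b.kind == "differential"]
    if diff_positions:
        later = later[diff_positions[-1]:]

    chain.extend(later)
    return chain


history = [Backup(1, "full"), Backup(2, "incremental"),
           Backup(3, "differential"), Backup(4, "incremental")]
print([(b.taken_at, b.kind) for b in restore_chain(history, target=4)])
# [(1, 'full'), (3, 'differential'), (4, 'incremental')]
```

The longer the chain, the longer the restore and the more pieces that can be damaged, which is why purely incremental schemes trade storage savings for a higher RTO.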

Backup immutability

Key protection against ransomware:

| Technique | Description |
| --- | --- |
| Object Lock (WORM) | Backup cannot be deleted or overwritten for a defined retention period (S3 Object Lock, Azure Blob Immutable) |
| Air gap | Backup is physically separated from the production network (offline disk, tape, cloud without VPN) |
| Isolated backup network | Backup traffic goes through a dedicated network without access from production VLAN |
| Out-of-band access | Backup management console is not accessible from the production network |
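
As an illustration of the Object Lock row, the sketch below builds the arguments for an S3 upload with a retention date. It assumes the AWS SDK for Python (boto3), whose `put_object` call accepts `ObjectLockMode` and `ObjectLockRetainUntilDate`, and a bucket that was created with Object Lock enabled. The helper names, bucket name, and retention period are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def object_lock_put_args(bucket: str, key: str, body: bytes,
                         retention_days: int, now: datetime = None) -> dict:
    """Build put_object arguments that keep the object immutable until the retention date."""
    now = now or datetime.now(timezone.utc)
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        # COMPLIANCE mode: retention cannot be shortened or removed, even by admins.
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": now + timedelta(days=retention_days),
    }


def upload_immutable(s3_client, **kwargs):
    """Upload using a client such as boto3.client("s3") (assumed, not imported here)."""
    return s3_client.put_object(**object_lock_put_args(**kwargs))


args = object_lock_put_args("example-backups", "db/2026-06-11.dump", b"...", retention_days=30)
print(args["ObjectLockMode"], args["ObjectLockRetainUntilDate"].isoformat())
```

`GOVERNANCE` is the less strict mode, where specially privileged users can still lift the lock. Which mode fits depends on compliance requirements.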

DR architectures

Multi-AZ (Single region)

Region ┌────────────────────────────────────┐
       │  AZ-1              AZ-2            │
       │  ┌──────────┐     ┌──────────┐     │
       │  │  App     │     │  App     │     │
       │  └─────┬────┘     └─────┬────┘     │
       │        │                │          │
       │  ┌─────▼────────────────▼─────┐    │
       │  │  Load Balancer (cross-AZ)  │    │
       │  └─────────────┬──────────────┘    │
       │                │                   │
       │  ┌─────────────▼──────────────┐    │
       │  │  DB Primary (AZ-1)         │    │
       │  │  DB Standby (AZ-2)         │    │
       │  │  Synchronous replication   │    │
       │  └────────────────────────────┘    │
       └────────────────────────────────────┘
  • RTO: minutes (automatic failover)
  • RPO: 0 (sync replication)
  • Protection: against AZ failure, not region failure

Multi-Region

Region A (Primary)                    Region B (DR)
┌─────────────────────┐              ┌─────────────────────┐
│  ┌───────────────┐  │              │  ┌───────────────┐  │
│  │  App + DB     │  │              │  │  App + DB     │  │
│  │  Active       │──┼──Async───────┼─►│  Standby      │  │
│  └───────────────┘  │  replication │  └───────────────┘  │
│         │           │              │         │           │
│  ┌──────▼────────┐  │              │  ┌──────▼────────┐  │
│  │  DNS / GSLB   │  │              │  │  DNS / GSLB   │  │
│  └──────┬────────┘  │              │  └──────┬────────┘  │
└─────────┼───────────┘              └─────────┼───────────┘
          │                                    │
          └──────────── Traffic Manager ───────┘

| Variant | RTO | RPO | Cost | Failover |
| --- | --- | --- | --- | --- |
| Active-Passive | Minutes to hours | Seconds | Medium | Manual / auto |
| Active-Active | Seconds | < 1 s | High | Automatic (DNS) |
| Pilot Light | Tens of minutes | Minutes | Low | Manual scaling |
| Warm Standby | Minutes | Seconds | High | Auto (reduced copy) |
| Backup & Restore | Hours | 24 h | Low | Manual |

On-prem → Cloud DR (Hybrid)

On-prem DC                              Cloud (DR)
┌─────────────────────┐              ┌─────────────────────┐
│  ┌───────────────┐  │              │  ┌───────────────┐  │
│  │  Application  │  │              │  │  VM / App     │  │
│  │  + DB         │  │              │  │  + DB replica │  │
│  └───────┬───────┘  │              │  └───────┬───────┘  │
│          │          │              │          │          │
│  ┌───────▼───────┐  │  site-to-site│  ┌───────▼───────┐  │
│  │  Backup proxy │──┼────VPN───────┼─►│  Backup store │  │
│  └───────────────┘  │              │  └───────────────┘  │
│                     │              │                     │
│  ┌───────────────┐  │              │  ┌───────────────┐  │
│  │  Tape / NAS   │  │              │  │  Veeam / Zerto│  │
│  └───────────────┘  │              │  └───────────────┘  │
└─────────────────────┘              └─────────────────────┘
  • RTO: tens of minutes (depends on VM startup)
  • RPO: minutes to hours (depends on replication tool)
  • Tools: Veeam, Zerto, Azure Site Recovery, AWS MGN, Commvault
  • Use case: enterprise with on-prem DC that needs DR without a second DC

DR testing

Test types

| Type | Description | Frequency | Risk |
| --- | --- | --- | --- |
| Tabletop exercise | Manual scenario walkthrough, no impact on production | Monthly | None |
| Walkthrough | Runbook verification, ensure everyone knows what to do | Quarterly | None |
| Component test | Test of a single component (e.g., restore one DB) | Monthly | Low |
| Integrated test | Test of the entire stack in isolated environment | Quarterly | Low |
| Full failover test | Production failover to DR site | Annually | High |
| Chaos experiment | Targeted fault injection into production | Continuous | Medium |

Runbook structure

Each DR scenario should have a runbook:

scenario: "Region A failure"
triggers:
  - "CloudWatch alarm: Region A health check 5× timeout"
  - "PagerDuty incident P0"
decision_tree: |
  1. Verify: is Region A really unavailable? (check from 3 different locations)
  2. Decide: is RTO at risk? If < 30 % RTO remaining → failover
  3. Failover: run playbook `dr-failover-region-b`
  4. Verification: smoke tests in Region B
  5. Communication: status page + stakeholders
rollback: |
  1. After Region A recovery → replicate changes from B back to A
  2. Repoint DNS to A
  3. Verify data consistency
  4. Shut down Region B (or keep as hot standby)
contacts:
  primary: "on-call@example.com"
  escalation: "infra-lead@example.com"
  management: "vp-engineering@example.com"
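
Steps 1 and 2 of the decision tree above can be sketched in code, which helps when the decision is automated or rehearsed in a game day. The sketch rests on two interpretive assumptions: the region counts as unavailable only if probes from at least three locations all fail, and "RTO at risk" means less than 30 % of the RTO window is left. The function names are hypothetical.

```python
from datetime import timedelta


def region_is_down(probe_results: list, required_locations: int = 3) -> bool:
    """True only if enough locations were probed and every probe failed."""
    return len(probe_results) >= required_locations and not any(probe_results)


def should_failover(probe_results: list, elapsed: timedelta, rto: timedelta,
                    min_remaining_fraction: float = 0.30) -> bool:
    """Fail over when the outage is confirmed and too little of the RTO window remains."""
    if not region_is_down(probe_results):
        return False
    remaining_fraction = (rto - elapsed) / rto  # timedelta / timedelta gives a float
    return remaining_fraction < min_remaining_fraction


# Probes from three locations all failed; 45 of 60 minutes of RTO are already used.
print(should_failover([False, False, False],
                      elapsed=timedelta(minutes=45),
                      rto=timedelta(minutes=60)))  # True
```

Teams that prefer to fail over early can raise `min_remaining_fraction`; the threshold is a policy choice and not a technical constant.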

Best practices

  • Test recovery, not backup — a backup without tested recovery is not a backup
  • Automate DR — Terraform / Ansible for DR environment spin-up, DNS failover
  • Document runbooks — every scenario, contact, decision tree
  • Expect failure — design for failure, don't expect everything to work
  • Don't underestimate WRT — service recovery does not mean full operations (data warming, cache, connections)
  • Align RTO/RPO with business — technical capabilities must match business requirements
  • Monitor SLI — without data, SLO cannot be verified
  • DR is not just IT — communication, PR, legal, compliance

Sources

Links, books, and standards: sources/infrastructure/sources.md

Last revised: 2026-06-11