Stanislav Hubacek b53714113c new files

2026-06-16 15:47:45 +02:00

🔄 Disaster Recovery and Business Continuity

Terminology

Abbreviation	Meaning	Description
RTO	Recovery Time Objective	Maximum time from outage to service recovery
RPO	Recovery Point Objective	Maximum acceptable data loss (time since last backup)
MTD	Maximum Tolerable Downtime	Total outage duration an organization can survive
WRT	Work Recovery Time	Time needed for full operations recovery after IT restoration
MTBF	Mean Time Between Failures	Average operating time between two consecutive failures
MTTR	Mean Time To Repair	Average time needed to restore a failed component to service
SLA	Service Level Agreement	Contractual availability commitment
SLO	Service Level Objective	Internal availability target
SLI	Service Level Indicator	Measured availability value

Relationship between RTO, RPO, MTD, WRT

Last backup ──── RPO ────► Outage ──── RTO ────► Service running ──── WRT ────► Full operations
                  │                     │                              │
                  ▼                     ▼                              ▼
              Lost data       Time without service           Time to full capacity

       MTD = RTO + WRT (max. time the business tolerates)
       RPO looks backwards from the outage; RTO, WRT, and MTD look forwards.
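
The relationships above can be checked mechanically. The following is a minimal sketch, assuming all objectives are expressed in minutes; the class `RecoveryObjectives`, its fields, and the sample numbers are illustrative and not taken from any standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryObjectives:
    """Recovery targets for one service, all in minutes (illustrative)."""
    rto: float  # outage start -> service running
    wrt: float  # service running -> full operations
    rpo: float  # maximum acceptable window of lost data
    mtd: float  # maximum downtime the business tolerates

    def fits_mtd(self) -> bool:
        """RTO + WRT must fit within MTD."""
        return self.rto + self.wrt <= self.mtd

    def backup_interval_ok(self, interval: float) -> bool:
        """Worst-case data loss equals the interval between recoverable copies."""
        return interval <= self.rpo


objectives = RecoveryObjectives(rto=60, wrt=120, rpo=15, mtd=240)
print(objectives.fits_mtd())              # True: 60 + 120 <= 240
print(objectives.backup_interval_ok(60))  # False: hourly backups can lose up to 60 min
```

With these sample values the recovery plan fits the MTD, but an hourly backup schedule would violate the 15-minute RPO.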

Uptime calculation

Nines table

Level	Uptime	Downtime / year	Downtime / month	Downtime / week
90 % (one nine)	0.9	36.5 days	72 h	16.8 h
99 % (two nines)	0.99	3.65 days	7.2 h	1.68 h
99.5 %	0.995	1.83 days	3.6 h	50.4 min
99.9 % (three nines)	0.999	8.76 h	43.2 min	10.1 min
99.95 %	0.9995	4.38 h	21.6 min	5.04 min
99.99 % (four nines)	0.9999	52.6 min	4.32 min	1.01 min
99.995 %	0.99995	26.3 min	2.16 min	30.2 s
99.999 % (five nines)	0.99999	5.26 min	25.9 s	6.05 s
99.9999 % (six nines)	0.999999	31.6 s	2.59 s	0.605 s
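
The rows of the table can be reproduced with a few lines of Python. This sketch assumes a 365-day year, a 30-day month, and a 7-day week, which is what the table values imply; the names `PERIOD_HOURS` and `allowed_downtime_hours` are illustrative.

```python
# Period lengths in hours; the 30-day month is an assumption that matches the table.
PERIOD_HOURS = {"year": 365 * 24, "month": 30 * 24, "week": 7 * 24}


def allowed_downtime_hours(availability: float) -> dict:
    """Allowed downtime per period for an availability given as a fraction."""
    return {name: hours * (1 - availability) for name, hours in PERIOD_HOURS.items()}


for level in (0.99, 0.999, 0.9999):
    row = allowed_downtime_hours(level)
    print(level, {name: round(hours, 3) for name, hours in row.items()})
```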

Calculation

Availability = (Total time - Downtime) / Total time × 100 %

Example:
  Year = 365 × 24 × 60 = 525,600 minutes
  Target: 99.9 % → allowed downtime = 525,600 × (1 - 0.999) = 525.6 minutes = 8.76 h

Combined availability (chain of dependencies):
  A_web = 99.9 % (3 nines)
  A_api  = 99.99 % (4 nines)
  A_db   = 99.999 % (5 nines)

  A_total = 0.999 × 0.9999 × 0.99999 = 0.99889 ≈ 99.89 % (less than 3 nines!)

Parallel availability (redundancy):
  A_total = 1 - (1 - A_1) × (1 - A_2) × ... × (1 - A_n)

  Example: 2 servers with 99% availability
  A_total = 1 - (1-0.99) × (1-0.99) = 1 - 0.01 × 0.01 = 0.9999 (99.99 %)

Calculator

def uptime_percent_to_downtime(pct, period_days=365):
    """Convert uptime percentage to downtime in given period."""
    total_minutes = period_days * 24 * 60
    allowed_downtime = total_minutes * (1 - pct / 100)
    return allowed_downtime  # minutes

def downtime_to_uptime_percent(downtime_minutes, period_days=365):
    """Convert downtime in minutes to uptime percentage."""
    total_minutes = period_days * 24 * 60
    return (1 - downtime_minutes / total_minutes) * 100

def combined_availability(availabilities):
    """Availability of series-connected components (fractions, e.g. 0.999)."""
    result = 1.0
    for a in availabilities:
        result *= a  # every component must be up
    return result

def redundant_availability(availabilities):
    """Availability of parallel components (fractions, e.g. 0.99)."""
    result = 1.0
    for a in availabilities:
        result *= (1 - a)  # probability that all components are down
    return 1 - result
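
Real stacks usually mix both patterns. The sketch below composes a redundant tier with serial dependencies; it assumes independent failures and perfect failover, which real systems rarely achieve, and the stack itself is hypothetical.

```python
from math import prod


def series(availabilities):
    """All components are required: multiply the availabilities."""
    return prod(availabilities)


def parallel(availabilities):
    """Any one component suffices: 1 minus the product of unavailabilities."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical stack: two redundant web servers in front of one API and one DB.
web_tier = parallel([0.999, 0.999])
total = series([web_tier, 0.9999, 0.99999])
print(f"web tier: {web_tier:.6f}, total: {total:.6f}")
```

Making the web tier redundant moves the bottleneck to the API, which is why the total stays just below four nines.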

Calculation fallacies

Combined availability is not a sum — adding another dependency always reduces total availability
Redundancy is not free — adding a standby component requires failure detection + failover (MTTR does not improve automatically)
SLA is not a guarantee — providers often calculate SLA as a monthly average, not per-incident
Measurement is key — without SLI, SLO cannot be verified; "unmeasured availability does not exist"
Planned maintenance — sometimes counted as uptime, sometimes not (depends on SLA definition)
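
To make the measurement point concrete, here is a minimal sketch that derives an SLI from request counts and compares it with an SLO. The "error budget" framing is a common SRE convention rather than something defined in this document, and the function names and sample numbers are illustrative.

```python
def sli_percent(good_events: int, total_events: int) -> float:
    """Measured availability (SLI) as a percentage of successful events."""
    if total_events == 0:
        raise ValueError("no events measured, SLI is undefined")
    return 100.0 * good_events / total_events


def error_budget_remaining(slo_percent: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative if the SLO is breached)."""
    if slo_percent >= 100:
        raise ValueError("an SLO of 100 % leaves no error budget")
    allowed_failures = total_events * (1 - slo_percent / 100)
    actual_failures = total_events - good_events
    return 1 - actual_failures / allowed_failures


# Illustrative numbers: 1,000,000 requests in the window, 400 failed, SLO 99.9 %.
print(sli_percent(999_600, 1_000_000))
print(error_budget_remaining(99.9, 999_600, 1_000_000))
```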

DR scenarios

Classification

Category	Scenario	Typical RTO	Typical RPO	Frequency
Site	Entire DC / region outage	hours	minutes	Low
Infrastructure	HW failure (storage, switch, server)	minutes–hours	seconds	Medium
Software	OS, application, DB failure	minutes	seconds	High
Data	Data corruption, deletion, cryptolocker	hours	backup point	Low–medium
Human	Wrong deployment, config change	minutes–hours	seconds	Medium
Security	Attack, breach, ransomware	days	before attack	Low
Network	Connectivity outage, DDoS	minutes–hours	N/A	Medium
Cloud provider	Regional outage (AWS, Azure, GCP)	hours	minutes	Very low

Scenario details

Site / Region failure

Aspect	Description
Cause	Blackout, fire, flood, earthquake, cloud provider outage
Prevention	Multi-AZ architecture, multi-region deployment, active-active
Mitigation	Automatic DNS failover (Amazon Route 53, Azure Traffic Manager), replica in DR region
Testing	Game day: shut down primary region, verify automatic failover

Data corruption / human error

Aspect	Description
Cause	Wrong SQL command (DELETE without WHERE), accidentally deleted bucket, bad migration
Prevention	RBAC, MFA for destructive operations, change management, SQL peer review
Mitigation	Point-in-time recovery (PITR), transaction log replay, immutable backups
Testing	Restore backup to isolated environment, verify data integrity
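
One way to verify data integrity after a test restore is to compare checksums against a manifest recorded at backup time. The sketch below assumes such a manifest exists as a mapping of relative path to SHA-256 hex digest; the manifest format and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large files are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(restore_dir: Path, manifest: dict) -> list:
    """Return relative paths that are missing or whose checksum differs from the manifest."""
    failures = []
    for relative_path, expected in manifest.items():
        candidate = restore_dir / relative_path
        if not candidate.is_file() or sha256_of(candidate) != expected:
            failures.append(relative_path)
    return failures
```

An empty list from `verify_restore` means every file in the manifest was restored byte for byte. It does not prove the application can actually start on the restored data, so a functional smoke test is still needed.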

Ransomware / cyber attack

Aspect	Description
Cause	Attack on production systems, data encryption, exfiltration
Prevention	Immutable backups (object lock), air-gapped backups, network segmentation
Mitigation	Restore from clean backup, rebuild infrastructure from IaC
Testing	Regular restore in isolated network, verify backup is not infected

Prevention — strategies

Backup strategies

Approach	Description	Use case
3-2-1 rule	3 copies, 2 different media, 1 off-site	Universal
3-2-1-0	+ 0 errors after restore (testing)	Enterprise, compliance
GFS (Grandfather-Father-Son)	Daily, weekly, monthly rotation	Long-term archive
Incremental forever	Full backup 1×, then only changes	Large data volumes
Reverse incremental	Newest restore point is always a full backup; older points are kept as reverse increments	Fast recovery
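
As an illustration of GFS rotation, the sketch below decides which daily backups to retain. It assumes one backup per day, takes the Sunday backup as the weekly copy and the first backup of each month as the monthly copy, and uses retention counts of 7, 4, and 12; all of these are illustrative choices, not values from this document.

```python
from datetime import date, timedelta


def gfs_keep(backup_dates, daily=7, weekly=4, monthly=12):
    """Return the set of backup dates to retain under a simple GFS policy."""
    ordered = sorted(backup_dates, reverse=True)  # newest first
    sons = ordered[:daily]
    fathers = [d for d in ordered if d.weekday() == 6][:weekly]  # 6 = Sunday
    first_of_month = {}
    for d in ordered:  # later iterations are older, so the earliest date wins
        first_of_month[(d.year, d.month)] = d
    grandfathers = sorted(first_of_month.values(), reverse=True)[:monthly]
    return set(sons) | set(fathers) | set(grandfathers)


today = date(2026, 6, 16)
history = [today - timedelta(days=n) for n in range(90)]
kept = gfs_keep(history)
print(len(history), "backups,", len(kept), "retained")
```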

Backup methods

Method	RPO	RTO	Storage	Suitable for
Full backup	Last full	Full restore time	Large	Small data, weekly
Incremental	Last incremental	Full + all incrementals	Small	Large data, daily
Differential	Last diff	Full + last diff	Medium	Compromise
Snapshot	Snapshot point-in-time	seconds	Copy-on-write	VM, storage array
Continuous (CDC)	< 1 s	Seconds	Log stream	DB (binlog, WAL)
PITR	Any point in time	Depends on volume	Full + WAL	RDS, PostgreSQL, SQL Server
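
The RTO column for incremental and differential backups follows from the restore chain each method needs. The sketch below computes that chain under the usual definitions: a differential holds all changes since the last full backup, and an incremental holds changes since the previous backup. The tuple format and function name are illustrative, and real backup tools may define chains differently.

```python
def restore_chain(backups, target):
    """Return the backups needed, in order, to restore to the latest point <= target.

    `backups` is a list of (timestamp, kind) tuples, where kind is "full",
    "incremental", or "differential". Timestamps are any comparable values.
    """
    usable = sorted(b for b in backups if b[0] <= target)
    fulls = [b for b in usable if b[1] == "full"]
    if not fulls:
        raise ValueError("no full backup at or before the target")
    base = fulls[-1]
    after_base = [b for b in usable if b[0] > base[0]]
    diffs = [b for b in after_base if b[1] == "differential"]
    chain = [base]
    if diffs:
        # A differential contains every change since the last full backup.
        chain.append(diffs[-1])
        after_base = [b for b in after_base if b[0] > diffs[-1][0]]
    # Incrementals each depend on the previous backup, so all of them are needed.
    chain.extend(b for b in after_base if b[1] == "incremental")
    return chain


incremental_week = [(0, "full"), (1, "incremental"), (2, "incremental"), (3, "incremental")]
differential_week = [(0, "full"), (1, "differential"), (2, "differential"), (3, "differential")]
print(len(restore_chain(incremental_week, 3)))   # 4 backups to apply
print(len(restore_chain(differential_week, 3)))  # 2 backups to apply
```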

Backup immutability

Key protection against ransomware:

Technique	Description
Object Lock (WORM)	Backup cannot be deleted or overwritten for a defined retention period (S3 Object Lock, Azure Blob Storage immutability policies)
Air gap	Backup is physically separated from the production network (offline disk, tape, cloud without VPN)
Isolated backup network	Backup traffic goes through a dedicated network without access from production VLAN
Out-of-band access	Backup management console is not accessible from the production network
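
As a sketch of how a retention period might be attached to a backup object, the helper below builds the keyword arguments for an S3 upload with Object Lock in compliance mode. It only constructs a dictionary and sends nothing to AWS. The bucket name, key, and 30-day retention are placeholders, and the parameter names should be checked against the current boto3 documentation before use.

```python
from datetime import datetime, timedelta, timezone


def object_lock_params(bucket, key, retention_days, now=None):
    """Build keyword arguments for an S3 upload with a compliance-mode retention date.

    The returned dict is intended for boto3's `put_object`; the bucket must
    have Object Lock enabled. Nothing is sent to AWS by this function.
    """
    now = now or datetime.now(timezone.utc)
    return {
        "Bucket": bucket,
        "Key": key,
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": now + timedelta(days=retention_days),
    }


params = object_lock_params("example-backups", "db/2026-06-16.dump", retention_days=30)
print(params["ObjectLockRetainUntilDate"].isoformat())
```

With boto3 installed and credentials configured, the dictionary could be passed along with the payload, for example `s3.put_object(Body=data, **params)`.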

DR architectures

Multi-AZ (Single region)

Region ┌────────────────────────────────────┐
       │  AZ-1              AZ-2            │
       │  ┌──────────┐     ┌──────────┐     │
       │  │  App     │     │  App     │     │
       │  └─────┬────┘     └─────┬────┘     │
       │        │                │          │
       │  ┌─────▼────────────────▼─────┐    │
       │  │  Load Balancer (cross-AZ)  │    │
       │  └─────────────┬──────────────┘    │
       │                │                   │
       │  ┌─────────────▼──────────────┐    │
       │  │  DB Primary (AZ-1)         │    │
       │  │  DB Standby (AZ-2)         │    │
       │  │  Synchronous replication   │    │
       │  └────────────────────────────┘    │
       └────────────────────────────────────┘

RTO: minutes (automatic failover)
RPO: 0 (sync replication)
Protection: against AZ failure, not region failure

Multi-Region

Region A (Primary)                    Region B (DR)
┌─────────────────────┐              ┌─────────────────────┐
│  ┌───────────────┐  │              │  ┌───────────────┐  │
│  │  App + DB     │  │              │  │  App + DB     │  │
│  │  Active       │──┼──Async───────┼─►│  Standby      │  │
│  └───────────────┘  │  replication │  └───────────────┘  │
│         │           │              │         │           │
│  ┌──────▼────────┐  │              │  ┌──────▼────────┐  │
│  │  DNS / GSLB   │  │              │  │  DNS / GSLB   │  │
│  └──────┬────────┘  │              │  └──────┬────────┘  │
└─────────┼───────────┘              └─────────┼───────────┘
          │                                    │
          └──────────── Traffic Manager ───────┘

Variant	RTO	RPO	Cost	Failover
Active-Passive	minutes–hours	seconds	Medium	Manual / auto
Active-Active	seconds	< 1 s	High	Automatic (DNS)
Pilot Light	tens of minutes	minutes	Low	Manual scaling
Warm Standby	minutes	seconds	High	Auto (reduced copy)
Backup & Restore	hours	24 h	Low	Manual
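
Choosing a variant amounts to finding the cheapest option that still meets the business RTO and RPO. The sketch below shows that selection. The numeric upper bounds and the cost ranking are placeholder assumptions made up for illustration; the table only gives qualitative ranges, so real values would have to come from measurements and test failovers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    worst_rto_min: float  # assumed upper bound, minutes
    worst_rpo_min: float  # assumed upper bound, minutes
    cost_rank: int        # 1 = cheapest; assumed ordering


# Upper bounds below are placeholder assumptions, not measurements.
VARIANTS = [
    Variant("Backup & Restore", 24 * 60, 24 * 60, 1),
    Variant("Pilot Light", 60, 15, 2),
    Variant("Active-Passive", 240, 1, 3),
    Variant("Warm Standby", 15, 1, 4),
    Variant("Active-Active", 1, 1 / 60, 5),
]


def cheapest_variant(required_rto_min, required_rpo_min):
    """Return the lowest-cost variant whose assumed bounds meet both targets, or None."""
    candidates = [
        v for v in VARIANTS
        if v.worst_rto_min <= required_rto_min and v.worst_rpo_min <= required_rpo_min
    ]
    return min(candidates, key=lambda v: v.cost_rank, default=None)


choice = cheapest_variant(required_rto_min=30, required_rpo_min=5)
print(choice.name if choice else "no variant meets the targets")
```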

On-prem → Cloud DR (Hybrid)

On-prem DC                              Cloud (DR)
┌─────────────────────┐              ┌─────────────────────┐
│  ┌───────────────┐  │              │  ┌───────────────┐  │
│  │  Application  │  │              │  │  VM / App     │  │
│  │  + DB         │  │              │  │  + DB replica │  │
│  └───────┬───────┘  │              │  └───────┬───────┘  │
│          │          │              │          │          │
│  ┌───────▼───────┐  │  site-to-site│  ┌───────▼───────┐  │
│  │  Backup proxy │──┼────VPN───────┼─►│  Backup store │  │
│  └───────────────┘  │              │  └───────────────┘  │
│                     │              │                     │
│  ┌───────────────┐  │              │  ┌───────────────┐  │
│  │  Tape / NAS   │  │              │  │  Veeam / Zerto│  │
│  └───────────────┘  │              │  └───────────────┘  │
└─────────────────────┘              └─────────────────────┘

RTO: tens of minutes (depends on VM startup)
RPO: minutes–hours (depends on replication tool)
Tools: Veeam, Zerto, Azure Site Recovery, AWS Elastic Disaster Recovery (DRS), Commvault
Use case: enterprise with on-prem DC that needs DR without a second DC

DR testing

Test types

Type	Description	Frequency	Risk
Tabletop exercise	Manual scenario walkthrough, no impact on production	Monthly	None
Walkthrough	Runbook verification, ensure everyone knows what to do	Quarterly	None
Component test	Test of a single component (e.g., restore one DB)	Monthly	Low
Integrated test	Test of the entire stack in isolated environment	Quarterly	Low
Full failover test	Production failover to DR site	Annually	High
Chaos experiment	Targeted fault injection into production	Continuous	Medium

Runbook structure

Each DR scenario should have a runbook:

scenario: "Region A failure"
triggers:
  - "CloudWatch alarm: Region A health check 5× timeout"
  - "PagerDuty incident P0"
decision_tree: |
  1. Verify: is Region A really unavailable? (check from 3 different locations)
  2. Decide: is RTO at risk? If less than 30 % of the RTO budget remains → failover
  3. Failover: run playbook `dr-failover-region-b`
  4. Verification: smoke tests in Region B
  5. Communication: status page + stakeholders
rollback: |
  1. After Region A recovery → replicate changes from B back to A
  2. Verify data consistency
  3. Repoint DNS to A
  4. Shut down Region B (or keep as hot standby)
contacts:
  primary: "on-call@example.com"
  escalation: "infra-lead@example.com"
  management: "vp-engineering@example.com"
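
The first two steps of the decision tree can be expressed as code. This is a minimal sketch, assuming health probes from three independent locations and the 30 % threshold from the runbook; the probe names and function names are hypothetical, and a real implementation would call the monitoring system instead of reading a dictionary.

```python
def outage_confirmed(probe_results, quorum=3):
    """Treat the region as down only if at least `quorum` independent probes failed."""
    failed = sum(1 for healthy in probe_results.values() if not healthy)
    return failed >= quorum


def should_failover(probe_results, elapsed_min, rto_min, threshold=0.30):
    """Fail over when the outage is confirmed and less than `threshold` of the RTO remains."""
    if not outage_confirmed(probe_results):
        return False
    remaining_fraction = (rto_min - elapsed_min) / rto_min
    return remaining_fraction < threshold


# Hypothetical probe locations; True means the health check succeeded.
probes = {"eu-probe": False, "us-probe": False, "ap-probe": False}
print(should_failover(probes, elapsed_min=45, rto_min=60))  # True: 25 % of the RTO remains
print(should_failover(probes, elapsed_min=10, rto_min=60))  # False: about 83 % remains
```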

Best practices

Test recovery, not backup — a backup without tested recovery is not a backup
Automate DR — Terraform / Ansible for DR environment spin-up, DNS failover
Document runbooks — every scenario, contact, decision tree
Expect failure — design for failure, don't expect everything to work
Don't underestimate WRT — service recovery does not mean full operations (data warming, cache, connections)
Align RTO/RPO with business — technical capabilities must match business requirements
Monitor SLI — without data, SLO cannot be verified
DR is not just IT — communication, PR, legal, compliance

Related

CLOUD.md — cloud DR strategy, AWS/Azure/GCP specific
DATACENTERS.md — DC redundancy, Tier classification
MONITORING.md — alerting, SLI/SLO/SLA
CICD.md — deployment strategy, rollback
STORAGE.md — backup storage, replication

Sources

Links, books, and standards: sources/infrastructure/sources.md

Last revised: 2026-06-11
