# 🔄 Disaster Recovery and Business Continuity

## Terminology

| Abbreviation | Meaning | Description |
|---------|--------|-------|
| **RTO** | Recovery Time Objective | Maximum time from outage to service recovery |
| **RPO** | Recovery Point Objective | Maximum acceptable data loss, measured as the time back to the last recoverable backup |
| **MTD** | Maximum Tolerable Downtime | Total outage duration an organization can survive |
| **WRT** | Work Recovery Time | Time needed for full operations recovery after IT restoration |
| **MTBF** | Mean Time Between Failures | Average operating time between two consecutive failures |
| **MTTR** | Mean Time To Repair | Average time needed to repair a failed component |
| **SLA** | Service Level Agreement | Contractual availability commitment |
| **SLO** | Service Level Objective | Internal availability target |
| **SLI** | Service Level Indicator | Measured availability value |

### Relationship between RTO, RPO, MTD, WRT

```
Last backup ──── RPO ────► Outage ──── RTO ────► Service running ──── WRT ────► Full operations
                  │                     │                              │
                  ▼                     ▼                              ▼
              Lost data       Time without service           Time to full capacity

MTD = RTO + WRT (max. time the business tolerates)
```

---

## Uptime calculation

### Nines table

| Level | Uptime | Downtime / year | Downtime / month (30 days) | Downtime / week |
|--------|--------|---------------|------------------|------------------|
| 90 % (one nine) | 0.9 | 36.5 days | 72 h | 16.8 h |
| 99 % (two nines) | 0.99 | 3.65 days | 7.2 h | 1.68 h |
| 99.5 % | 0.995 | 1.83 days | 3.6 h | 50.4 min |
| 99.9 % (three nines) | 0.999 | 8.76 h | 43.2 min | 10.1 min |
| 99.95 % | 0.9995 | 4.38 h | 21.6 min | 5.04 min |
| 99.99 % (four nines) | 0.9999 | 52.6 min | 4.32 min | 1.01 min |
| 99.995 % | 0.99995 | 26.3 min | 2.16 min | 30.2 s |
| 99.999 % (five nines) | 0.99999 | 5.26 min | 25.9 s | 6.05 s |
| 99.9999 % (six nines) | 0.999999 | 31.6 s | 2.59 s | 0.605 s |

### Calculation

```
Availability = (Total time - Downtime) / Total time × 100 %

Example:
  Year = 365 × 24 × 60 = 525,600 minutes
  Target: 99.9 % → allowed downtime = 525,600 × (1 - 0.999) = 525.6 minutes = 8.76 h

Combined availability (chain of dependencies):
  A_web = 99.9 %    (3 nines)
  A_api = 99.99 %   (4 nines)
  A_db  = 99.999 %  (5 nines)
  A_total = 0.999 × 0.9999 × 0.99999 = 0.99889 ≈ 99.89 % (less than 3 nines!)

Parallel availability (redundancy):
  A_total = 1 - (1 - A_1) × (1 - A_2) × ... × (1 - A_n)
  Example: 2 servers with 99 % availability
  A_total = 1 - (1 - 0.99) × (1 - 0.99) = 1 - 0.01 × 0.01 = 0.9999 (99.99 %)
```

### Calculator

```python
def uptime_percent_to_downtime(pct, period_days=365):
    """Convert an uptime percentage (e.g. 99.9) to allowed downtime in minutes."""
    total_minutes = period_days * 24 * 60
    allowed_downtime = total_minutes * (1 - pct / 100)
    return allowed_downtime  # minutes


def downtime_to_uptime_percent(downtime_minutes, period_days=365):
    """Convert downtime in minutes to an uptime percentage."""
    total_minutes = period_days * 24 * 60
    return (1 - downtime_minutes / total_minutes) * 100


def combined_availability(availabilities):
    """Combined availability of series-connected components (fractions, e.g. 0.999)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant_availability(availabilities):
    """Availability of parallel (redundant) components (fractions, e.g. 0.99)."""
    result = 1.0
    for a in availabilities:
        result *= (1 - a)  # probability that every component is down at once
    return 1 - result
```

### Calculation fallacies

- **Combined availability is not a sum**: every additional serial dependency with less than 100 % availability reduces total availability
- **Redundancy is not free**: a standby component only helps when failure detection and failover work (MTTR does not improve automatically)
- **SLA is not a guarantee**: providers often calculate SLA as a monthly average, not per incident
- **Measurement is key**: without an SLI, an SLO cannot be verified; "unmeasured availability does not exist"
- **Planned maintenance**: sometimes counted as uptime, sometimes not (depends on the SLA definition)

---

## DR scenarios

### Classification

| Category | Scenario | Typical RTO | Typical RPO | Frequency |
|-----------|--------|-------------|-------------|-----------|
| **Site** | Entire DC / region outage | hours | minutes | Low |
| **Infrastructure** | HW failure (storage, switch, server) | minutes–hours | seconds | Medium |
| **Software** | OS, application, DB failure | minutes | seconds | High |
| **Data** | Data corruption, deletion, cryptolocker | hours | last backup point | Low–medium |
| **Human** | Wrong deployment, config change | minutes–hours | seconds | Medium |
| **Security** | Attack, breach, ransomware | days | last backup before the attack | Low |
| **Network** | Connectivity outage, DDoS | minutes–hours | N/A | Medium |
| **Cloud provider** | Regional outage (AWS, Azure, GCP) | hours | minutes | Very low |

### Scenario details

#### Site / Region failure

| Aspect | Description |
|--------|-------|
| **Cause** | Blackout, fire, flood, earthquake, cloud provider outage |
| **Prevention** | Multi-AZ architecture, multi-region deployment, active-active |
| **Mitigation** | Automatic DNS failover (Route53, Azure Traffic Manager), replica in DR region |
| **Testing** | Game day: shut down primary region, verify automatic failover |

#### Data corruption / human error

| Aspect | Description |
|--------|-------|
| **Cause** | Wrong SQL command (DELETE without WHERE), accidentally deleted bucket, bad migration |
| **Prevention** | RBAC, MFA for destructive operations, change management, SQL peer review |
| **Mitigation** | Point-in-time recovery (PITR), transaction log replay, immutable backups |
| **Testing** | Restore backup to isolated environment, verify data integrity |

#### Ransomware / cyber attack

| Aspect | Description |
|--------|-------|
| **Cause** | Attack on production systems, data encryption, exfiltration |
| **Prevention** | Immutable backups (object lock), air-gapped backups, network segmentation |
| **Mitigation** | Restore from clean backup, rebuild infrastructure from IaC |
| **Testing** | Regular restore in isolated network, verify backup is not infected |

---

## Prevention: strategies

### Backup strategies

| Approach | Description | Use case |
|---------|-------|----------|
| **3-2-1 rule** | 3 copies, 2 different media, 1 off-site | Universal |
| **3-2-1-0** | 3-2-1 plus 0 errors after a tested restore | Enterprise, compliance |
| **GFS (Grandfather-Father-Son)** | Daily, weekly, monthly rotation | Long-term archive |
| **Incremental forever** | One full backup, then only changes | Large data volumes |
| **Reverse incremental** | Full + incremental, full is always current | Fast recovery |

### Backup methods

| Method | RPO | RTO | Storage | Suitable for |
|--------|-----|-----|----------|------------|
| **Full backup** | Last full | Full restore time | Large | Small data, weekly |
| **Incremental** | Last incremental | Full + all incrementals | Small | Large data, daily |
| **Differential** | Last diff | Full + last diff | Medium | Compromise |
| **Snapshot** | Snapshot point in time | Seconds | Copy-on-write | VM, storage array |
| **Continuous (CDC)** | < 1 s | Seconds | Log stream | DB (binlog, WAL) |
| **PITR** | Any point in time | Depends on volume | Full + WAL | RDS, PostgreSQL, SQL Server |

### Backup immutability

Key protection against ransomware:

| Technique | Description |
|----------|-------|
| **Object Lock (WORM)** | Backup cannot be deleted or overwritten for a defined retention period (S3 Object Lock, Azure Blob Immutable) |
| **Air gap** | Backup is physically separated from the production network (offline disk, tape, cloud without VPN) |
| **Isolated backup network** | Backup traffic goes through a dedicated network without access from production VLAN |
| **Out-of-band access** | Backup management console is not accessible from the production network |

---

## DR architectures

### Multi-AZ (Single region)

```
Region
┌────────────────────────────────────┐
│  AZ-1             AZ-2             │
│  ┌──────────┐     ┌──────────┐     │
│  │   App    │     │   App    │     │
│  └─────┬────┘     └─────┬────┘     │
│        │                │          │
│  ┌─────▼────────────────▼─────┐    │
│  │  Load Balancer (cross-AZ)  │    │
│  └─────────────┬──────────────┘    │
│                │                   │
│  ┌─────────────▼──────────────┐    │
│  │  DB Primary (AZ-1)         │    │
│  │  DB Standby (AZ-2)         │    │
│  │  Synchronous replication   │    │
│  └────────────────────────────┘    │
└────────────────────────────────────┘
```

- RTO: minutes (automatic failover)
- RPO: 0 (sync replication)
- Protection: against AZ failure, not region failure

### Multi-Region

```
Region A (Primary)                   Region B (DR)
┌─────────────────────┐              ┌─────────────────────┐
│  ┌───────────────┐  │              │  ┌───────────────┐  │
│  │  App + DB     │  │              │  │  App + DB     │  │
│  │  Active       │──┼──Async───────┼─►│  Standby      │  │
│  └───────────────┘  │ replication  │  └───────────────┘  │
│         │           │              │         │           │
│  ┌──────▼───────┐   │              │  ┌──────▼───────┐   │
│  │  DNS / GSLB  │   │              │  │  DNS / GSLB  │   │
│  └──────┬───────┘   │              │  └──────┬───────┘   │
└─────────┼───────────┘              └─────────┼───────────┘
          │                                    │
          └────────── Traffic Manager ─────────┘
```

| Variant | RTO | RPO | Cost | Failover |
|----------|-----|-----|---------|----------|
| **Active-Passive** | minutes–hours | seconds | Medium | Manual / auto |
| **Active-Active** | seconds | < 1 s | High | Automatic (DNS) |
| **Pilot Light** | tens of minutes | minutes | Low | Manual scaling |
| **Warm Standby** | minutes | seconds | High | Auto (reduced copy) |
| **Backup & Restore** | hours | ≤ 24 h | Low | Manual |

### On-prem → Cloud DR (Hybrid)

```
On-prem DC                           Cloud (DR)
┌─────────────────────┐              ┌─────────────────────┐
│  ┌───────────────┐  │              │  ┌───────────────┐  │
│  │  Application  │  │              │  │  VM / App     │  │
│  │  + DB         │  │              │  │  + DB replica │  │
│  └───────┬───────┘  │              │  └───────┬───────┘  │
│          │          │              │          │          │
│  ┌───────▼───────┐  │ site-to-site │  ┌───────▼───────┐  │
│  │  Backup proxy │──┼────VPN───────┼─►│  Backup store │  │
│  └───────────────┘  │              │  └───────────────┘  │
│                     │              │                     │
│  ┌───────────────┐  │              │  ┌───────────────┐  │
│  │  Tape / NAS   │  │              │  │  Veeam / Zerto│  │
│  └───────────────┘  │              │  └───────────────┘  │
└─────────────────────┘              └─────────────────────┘
```

- **RTO**: tens of minutes (depends on VM startup)
- **RPO**: minutes–hours (depends on replication tool)
- **Tools**: Veeam, Zerto, Azure Site Recovery, AWS MGN, Commvault
- **Use case**: enterprise with on-prem DC that needs DR without a second DC

---

## DR testing

### Test types

| Type | Description | Frequency | Risk |
|-----|-------|-----------|--------|
| **Tabletop exercise** | Manual scenario walkthrough, no impact on production | Monthly | None |
| **Walkthrough** | Runbook verification, ensure everyone knows what to do | Quarterly | None |
| **Component test** | Test of a single component (e.g., restore one DB) | Monthly | Low |
| **Integrated test** | Test of the entire stack in isolated environment | Quarterly | Low |
| **Full failover test** | Production failover to DR site | Annually | High |
| **Chaos experiment** | Targeted fault injection into production | Continuous | Medium |

### Runbook structure

Each DR scenario should have a runbook:

```yaml
scenario: "Region A failure"
triggers:
  - "CloudWatch alarm: Region A health check 5× timeout"
  - "PagerDuty incident P0"
decision_tree: |
  1. Verify: is Region A really unavailable? (check from 3 different locations)
  2. Decide: is RTO at risk? If < 30 % RTO remaining → failover
  3. Failover: run playbook `dr-failover-region-b`
  4. Verification: smoke tests in Region B
  5. Communication: status page + stakeholders
rollback: |
  1. After Region A recovery → replicate changes from B back to A
  2. Verify data consistency
  3. Repoint DNS to A
  4. Shut down Region B (or keep as hot standby)
contacts:
  primary: "on-call@example.com"
  escalation: "infra-lead@example.com"
  management: "vp-engineering@example.com"
```

---

## Best practices

- **Test recovery, not backup**: a backup without tested recovery is not a backup
- **Automate DR**: Terraform / Ansible for DR environment spin-up, DNS failover
- **Document runbooks**: every scenario, contact, decision tree
- **Expect failure**: design for failure, don't expect everything to work
- **Don't underestimate WRT**: service recovery does not mean full operations (data warming, cache, connections)
- **Align RTO/RPO with business**: technical capabilities must match business requirements
- **Monitor SLI**: without data, an SLO cannot be verified
- **DR is not just IT**: communication, PR, legal, compliance

---

## Related

- [CLOUD.en.md](CLOUD.en.md): cloud DR strategy, AWS/Azure/GCP specific
- [DATACENTERS.en.md](DATACENTERS.en.md): DC redundancy, Tier classification
- [MONITORING.en.md](MONITORING.en.md): alerting, SLI/SLO/SLA
- [CICD.en.md](CICD.en.md): deployment strategy, rollback
- [STORAGE.en.md](STORAGE.en.md): backup storage, replication

## Sources

Links, books, and standards: [sources/infrastructure/sources.en.md](sources/infrastructure/sources.en.md)

*Last revised: 2026-06-11*
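
As a worked check of the series and parallel formulas from the Calculation section, the sketch below reproduces both examples from that section. It also derives availability from MTBF and MTTR with the standard steady-state relation A = MTBF / (MTBF + MTTR). The function names and the sample MTBF/MTTR values are illustrative assumptions, not part of the calculator shown earlier.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, the same year length the Nines table uses


def availability_from_mtbf(mtbf_hours, mttr_hours):
    """Steady-state availability as a fraction: A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series(availabilities):
    """Chain of dependencies: the service is up only if every component is up."""
    return math.prod(availabilities)


def parallel(availabilities):
    """Redundant components: the service is down only if all of them are down."""
    return 1 - math.prod(1 - a for a in availabilities)


def downtime_minutes_per_year(availability):
    """Allowed downtime per year for an availability given as a fraction."""
    return MINUTES_PER_YEAR * (1 - availability)


if __name__ == "__main__":
    chain = series([0.999, 0.9999, 0.99999])  # web -> api -> db
    pair = parallel([0.99, 0.99])             # two redundant servers
    # Hypothetical component: fails every 1,000 h on average, takes 1 h to repair.
    single = availability_from_mtbf(1000, 1)
    for name, a in [("chain", chain), ("pair", pair), ("single", single)]:
        print(f"{name}: {a:.5%}, {downtime_minutes_per_year(a):.1f} min/year")
```

Note that the parallel formula assumes independent failures and instant, reliable failover; as the fallacies list points out, real redundancy only reaches this figure if detection and failover actually work.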