🔄 Disaster Recovery and Business Continuity
Terminology
| Abbreviation | Meaning | Description |
| --- | --- | --- |
| RTO | Recovery Time Objective | Maximum time from outage to service recovery |
| RPO | Recovery Point Objective | Maximum acceptable data loss (time since the last backup) |
| MTD | Maximum Tolerable Downtime | Total outage duration an organization can survive |
| WRT | Work Recovery Time | Time needed to return to full operations after IT restoration |
| MTBF | Mean Time Between Failures | Average operating time between two consecutive failures |
| MTTR | Mean Time To Repair | Average time to restore a component after it fails |
| SLA | Service Level Agreement | Contractual availability commitment |
| SLO | Service Level Objective | Internal availability target |
| SLI | Service Level Indicator | Measured availability value |
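To make the SLI and SLO rows concrete, here is a minimal sketch that derives an availability SLI from request counts and compares it with an SLO. The function names and the request-based definition of availability are assumptions for illustration; availability can equally be measured from probe results or uptime minutes.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Measured availability (SLI) as the fraction of successful requests."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def meets_slo(sli: float, slo: float) -> bool:
    """True when the measured value reaches the internal target."""
    return sli >= slo


# Illustrative numbers: 500 failed requests out of one million.
sli = availability_sli(999_500, 1_000_000)
print(sli, meets_slo(sli, 0.999), meets_slo(sli, 0.9999))
```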
Relationship between RTO, RPO, MTD, WRT
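A common way to relate the four objectives is that IT recovery (RTO) plus the return to full operations (WRT) must fit within MTD, while RPO looks backward from the outage and limits how old the newest recoverable copy may be. This relationship is the usual convention rather than something the definitions above spell out. The sketch below checks both conditions; the class name, the field names, and the minute values are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryTargets:
    """Recovery objectives in minutes (values used below are illustrative)."""
    rto: float  # outage start -> IT service restored
    wrt: float  # IT service restored -> full business operations
    mtd: float  # longest outage the organization can survive
    rpo: float  # maximum acceptable data loss, as time before the outage

    def total_recovery(self) -> float:
        return self.rto + self.wrt

    def fits_mtd(self) -> bool:
        # RTO + WRT must not exceed MTD.
        return self.total_recovery() <= self.mtd

    def backup_interval_ok(self, interval: float) -> bool:
        # Worst-case data loss equals the gap between recoverable copies.
        return interval <= self.rpo


targets = RecoveryTargets(rto=60, wrt=30, mtd=120, rpo=15)
print(targets.total_recovery(), targets.fits_mtd(), targets.backup_interval_ok(60))
```

With these numbers the 90-minute recovery fits the 120-minute MTD, but an hourly backup would violate the 15-minute RPO.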
Uptime calculation
Nines table
| Level | Uptime | Downtime / year | Downtime / month | Downtime / week |
| --- | --- | --- | --- | --- |
| 90 % (one nine) | 0.9 | 36.5 days | 72 h | 16.8 h |
| 99 % (two nines) | 0.99 | 3.65 days | 7.2 h | 1.68 h |
| 99.5 % | 0.995 | 1.83 days | 3.6 h | 50.4 min |
| 99.9 % (three nines) | 0.999 | 8.76 h | 43.2 min | 10.1 min |
| 99.95 % | 0.9995 | 4.38 h | 21.6 min | 5.04 min |
| 99.99 % (four nines) | 0.9999 | 52.6 min | 4.32 min | 1.01 min |
| 99.995 % | 0.99995 | 26.3 min | 2.16 min | 30.2 s |
| 99.999 % (five nines) | 0.99999 | 5.26 min | 25.9 s | 6.05 s |
| 99.9999 % (six nines) | 0.999999 | 31.6 s | 2.59 s | 0.605 s |
Calculation
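The values in the nines table follow from downtime = (1 - availability) × period, using a 365-day year, a 30-day month, and a 7-day week. The sketch below shows that formula together with the textbook steady-state relation availability = MTBF / (MTBF + MTTR); the function names are assumptions for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(availability: float, period_days: float) -> float:
    """Downtime budget for a period, given availability as a fraction (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_days * SECONDS_PER_DAY


def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and repair time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Three nines over a year, in hours, and over a 30-day month, in minutes.
print(round(allowed_downtime_seconds(0.999, 365) / 3600, 2))
print(round(allowed_downtime_seconds(0.999, 30) / 60, 1))
```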
Calculator
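A small stand-in for a calculator, assuming availability is entered as a percentage and that a year has 365 days and a month 30 days. The helper names are hypothetical, and the unit selection is a simplification that will not always pick the same unit as the nines table (for example, it prints 3 days where the table shows 72 h).

```python
def format_duration(seconds: float) -> str:
    """Render a duration using the largest unit in which the value is at least 1."""
    for unit, size in (("days", 86400), ("h", 3600), ("min", 60)):
        if seconds >= size:
            return f"{seconds / size:.3g} {unit}"
    return f"{seconds:.3g} s"


def downtime_row(availability_percent: float) -> dict:
    """Allowed downtime per year, month, and week for an availability level."""
    unavailable = 1.0 - availability_percent / 100.0
    periods = {"year": 365, "month": 30, "week": 7}
    return {
        name: format_duration(unavailable * days * 86400)
        for name, days in periods.items()
    }


print(downtime_row(99.9))
# {'year': '8.76 h', 'month': '43.2 min', 'week': '10.1 min'}
```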
Calculation fallacies
- Combined availability is not a sum: for components in series it is the product of their availabilities, so every additional hard dependency lowers total availability
- Redundancy is not free: a standby component only helps if failure detection and failover work (MTTR does not improve automatically)
- SLA is not a guarantee: providers often calculate SLA as a monthly average, not per incident
- Measurement is key: without an SLI, an SLO cannot be verified; "unmeasured availability does not exist"
- Planned maintenance: sometimes counted as uptime, sometimes not (depends on the SLA definition)
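The first two points can be illustrated with the standard series and parallel formulas. This is a sketch under two loud assumptions: component failures are independent, and failover in the redundant case is instant and always succeeds, which is the optimistic case the redundancy bullet warns about.

```python
from math import prod


def serial_availability(components: list[float]) -> float:
    """All components are required: availabilities multiply."""
    return prod(components)


def parallel_availability(components: list[float]) -> float:
    """Any one component suffices: the system fails only if all of them fail."""
    return 1.0 - prod(1.0 - a for a in components)


# Three 99.9 % dependencies in series fall below 99.9 %.
print(round(serial_availability([0.999, 0.999, 0.999]), 6))
# Two 99 % replicas in parallel reach 99.99 % only with perfect failover.
print(round(parallel_availability([0.99, 0.99]), 6))
```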
DR scenarios
Classification
| Category | Scenario | Typical RTO | Typical RPO | Frequency |
| --- | --- | --- | --- | --- |
| Site | Entire DC / region outage | hours | minutes | Low |
| Infrastructure | HW failure (storage, switch, server) | minutes–hours | seconds | Medium |
| Software | OS, application, DB failure | minutes | seconds | High |
| Data | Data corruption, deletion, cryptolocker | hours | backup point | Low–medium |
| Human | Wrong deployment, config change | minutes–hours | seconds | Medium |
| Security | Attack, breach, ransomware | days | before attack | Low |
| Network | Connectivity outage, DDoS | minutes–hours | N/A | Medium |
| Cloud provider | Regional outage (AWS, Azure, GCP) | hours | minutes | Very low |
Scenario details
Site / Region failure
| Aspect | Description |
| --- | --- |
| Cause | Blackout, fire, flood, earthquake, cloud provider outage |
| Prevention | Multi-AZ architecture, multi-region deployment, active-active |
| Mitigation | Automatic DNS failover (Route 53, Azure Traffic Manager), replica in DR region |
| Testing | Game day: shut down primary region, verify automatic failover |
Data corruption / human error
| Aspect | Description |
| --- | --- |
| Cause | Wrong SQL command (DELETE without WHERE), accidentally deleted bucket, bad migration |
| Prevention | RBAC, MFA for destructive operations, change management, SQL peer review |
| Mitigation | Point-in-time recovery (PITR), transaction log replay, immutable backups |
| Testing | Restore backup to isolated environment, verify data integrity |
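One way to carry out the integrity check after a test restore is to compare restored files against checksums recorded at backup time. The sketch below assumes such a manifest exists as a mapping from relative path to SHA-256 hex digest; the manifest format and the function names are hypothetical, not a feature of any particular backup tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large restores do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(restored_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return relative paths that are missing or whose checksum differs."""
    failures = []
    for rel_path, expected in manifest.items():
        target = restored_dir / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(rel_path)
    return failures
```

An empty result means every file listed in the manifest was restored byte for byte. It does not prove the data was logically correct when the backup was taken.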
Ransomware / cyber attack
| Aspect | Description |
| --- | --- |
| Cause | Attack on production systems, data encryption, exfiltration |
| Prevention | Immutable backups (object lock), air-gapped backups, network segmentation |
| Mitigation | Restore from clean backup, rebuild infrastructure from IaC |
| Testing | Regular restore in isolated network, verify backup is not infected |
Prevention — strategies
Backup strategies
| Approach | Description | Use case |
| --- | --- | --- |
| 3-2-1 rule | 3 copies, 2 different media, 1 off-site | Universal |
| 3-2-1-0 | 3-2-1 plus 0 errors after a tested restore | Enterprise, compliance |
| GFS (Grandfather-Father-Son) | Daily, weekly, monthly rotation | Long-term archive |
| Incremental forever | One full backup, then only changes | Large data volumes |
| Reverse incremental | Full + incrementals, where the full is always the most recent state | Fast recovery |
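As an illustration of GFS rotation, the sketch below assigns each backup date to a tier. The schedule is an assumption: monthly (grandfather) backups on the 1st of the month, weekly (father) backups on Sundays, and daily (son) backups otherwise. Real schedules and retention periods vary by organization.

```python
from datetime import date


def gfs_tier(day: date) -> str:
    """Classify a backup date under an assumed GFS schedule."""
    if day.day == 1:
        return "grandfather"  # monthly backup, kept longest
    if day.weekday() == 6:    # Monday is 0, Sunday is 6
        return "father"       # weekly backup
    return "son"              # daily backup, rotated out first


for d in (date(2026, 6, 1), date(2026, 6, 7), date(2026, 6, 10)):
    print(d.isoformat(), gfs_tier(d))
```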
Backup methods
| Method | RPO | RTO | Storage | Suitable for |
| --- | --- | --- | --- | --- |
| Full backup | Last full | Full restore time | Large | Small data, weekly |
| Incremental | Last incremental | Full + all incrementals | Small | Large data, daily |
| Differential | Last diff | Full + last diff | Medium | Compromise |
| Snapshot | Snapshot point-in-time | Seconds | Copy-on-write | VM, storage array |
| Continuous (CDC) | < 1 s | Seconds | Log stream | DB (binlog, WAL) |
| PITR | Any point in time | Depends on volume | Full + WAL | RDS, PostgreSQL, SQL Server |
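The RTO column for incremental and differential backups depends on how many backup sets must be applied. The sketch below computes that restore chain for a backup history listed oldest first. The labels "full", "diff", and "incr" are hypothetical, and it assumes that an incremental taken after a differential builds on that differential.

```python
def restore_chain(backups: list[str]) -> list[int]:
    """Return indices (oldest first) of the backups needed for the newest state."""
    chain = []
    i = len(backups) - 1
    # Incrementals depend on every backup back to the previous diff or full.
    while i >= 0 and backups[i] == "incr":
        chain.append(i)
        i -= 1
    # A differential depends only on the most recent full before it.
    if i >= 0 and backups[i] == "diff":
        chain.append(i)
        while i >= 0 and backups[i] != "full":
            i -= 1
    if i < 0:
        raise ValueError("no full backup to restore from")
    chain.append(i)
    return chain[::-1]


print(restore_chain(["full", "incr", "incr", "incr"]))  # [0, 1, 2, 3]
print(restore_chain(["full", "diff", "diff", "diff"]))  # [0, 3]
```

The incremental history needs four restore steps where the differential history needs two, which is the trade-off against the larger storage footprint of differentials.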
Backup immutability
Key protection against ransomware:
| Technique | Description |
| --- | --- |
| Object Lock (WORM) | Backup cannot be deleted or overwritten for a defined retention period (S3 Object Lock, Azure immutable Blob Storage) |
| Air gap | Backup is physically separated from the production network (offline disk, tape, cloud without VPN) |
| Isolated backup network | Backup traffic goes through a dedicated network without access from production VLAN |
| Out-of-band access | Backup management console is not accessible from the production network |
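For S3 Object Lock, a default retention rule can be expressed as the configuration structure that boto3's `put_object_lock_configuration` accepts. The sketch below only builds that structure, so it runs without AWS access. The helper name, the 30-day retention, and the bucket name in the comment are illustrative assumptions, and the bucket is assumed to have versioning enabled, which Object Lock requires.

```python
def object_lock_config(retention_days: int, mode: str = "COMPLIANCE") -> dict:
    """Build a default-retention rule for an S3 bucket with Object Lock."""
    if mode not in ("COMPLIANCE", "GOVERNANCE"):
        raise ValueError("mode must be COMPLIANCE or GOVERNANCE")
    if retention_days < 1:
        raise ValueError("retention_days must be at least 1")
    return {
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": mode, "Days": retention_days}},
    }


# Applying it needs boto3 and AWS credentials, so the call is left commented out:
# import boto3
# boto3.client("s3").put_object_lock_configuration(
#     Bucket="example-backup-bucket",  # hypothetical bucket name
#     ObjectLockConfiguration=object_lock_config(30),
# )
print(object_lock_config(30))
```

In COMPLIANCE mode the retention cannot be shortened or removed by any user during the retention period, whereas GOVERNANCE mode allows specially privileged users to bypass it.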
DR architectures
Multi-AZ (Single region)
- RTO: minutes (automatic failover)
- RPO: 0 (sync replication)
- Protection: against AZ failure, not region failure
Multi-Region
| Variant | RTO | RPO | Cost | Failover |
| --- | --- | --- | --- | --- |
| Active-Passive | minutes–hours | seconds | Medium | Manual / auto |
| Active-Active | seconds | < 1 s | High | Automatic (DNS) |
| Pilot Light | tens of minutes | minutes | Low | Manual scaling |
| Warm Standby | minutes | seconds | High | Auto (reduced copy) |
| Backup & Restore | hours | 24 h | Low | Manual |
On-prem → Cloud DR (Hybrid)
- RTO: tens of minutes (depends on VM startup)
- RPO: minutes–hours (depends on replication tool)
- Tools: Veeam, Zerto, Azure Site Recovery, AWS MGN, Commvault
- Use case: enterprise with on-prem DC that needs DR without a second DC
DR testing
Test types
| Type | Description | Frequency | Risk |
| --- | --- | --- | --- |
| Tabletop exercise | Manual scenario walkthrough, no impact on production | Monthly | None |
| Walkthrough | Runbook verification, ensure everyone knows what to do | Quarterly | None |
| Component test | Test of a single component (e.g., restore one DB) | Monthly | Low |
| Integrated test | Test of the entire stack in isolated environment | Quarterly | Low |
| Full failover test | Production failover to DR site | Annually | High |
| Chaos experiment | Targeted fault injection into production | Continuous | Medium |
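The frequencies in the table can be turned into a simple overdue check. This sketch assumes day limits of 31, 92, and 366 for monthly, quarterly, and annual tests, and leaves out chaos experiments because they run continuously. The function name and the sample dates are illustrative.

```python
from datetime import date, timedelta

# Maximum allowed age of the last run per test type (assumed day counts).
MAX_AGE = {
    "Tabletop exercise": timedelta(days=31),
    "Walkthrough": timedelta(days=92),
    "Component test": timedelta(days=31),
    "Integrated test": timedelta(days=92),
    "Full failover test": timedelta(days=366),
}


def overdue_tests(last_run: dict[str, date], today: date) -> list[str]:
    """List test types that were never run or were last run too long ago."""
    return [
        name
        for name, max_age in MAX_AGE.items()
        if name not in last_run or today - last_run[name] > max_age
    ]


history = {
    "Tabletop exercise": date(2026, 6, 1),
    "Walkthrough": date(2026, 1, 1),
    "Component test": date(2026, 4, 1),
    "Integrated test": date(2026, 5, 1),
    "Full failover test": date(2025, 9, 1),
}
print(overdue_tests(history, date(2026, 6, 11)))  # ['Walkthrough', 'Component test']
```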
Runbook structure
Each DR scenario should have a runbook:
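One possible minimal structure, sketched as a Python data class. The field names are assumptions drawn from the practices this document recommends (scenario, contacts, decision tree, tested recovery), not a prescribed template.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Runbook:
    scenario: str                      # e.g. "Site / Region failure"
    rto_minutes: int                   # target agreed with the business
    rpo_minutes: int
    contacts: list[str] = field(default_factory=list)
    decision_points: list[str] = field(default_factory=list)    # who declares DR, and when
    recovery_steps: list[str] = field(default_factory=list)
    verification_steps: list[str] = field(default_factory=list)
    last_tested: Optional[str] = None  # ISO date of the last successful test

    def missing_sections(self) -> list[str]:
        """Name the sections that are still empty, so gaps show up in review."""
        sections = ("contacts", "decision_points", "recovery_steps", "verification_steps")
        missing = [name for name in sections if not getattr(self, name)]
        if self.last_tested is None:
            missing.append("last_tested")
        return missing


draft = Runbook(
    "Site / Region failure", 60, 5,
    contacts=["on-call SRE"],
    recovery_steps=["Fail over DNS to the DR region"],
)
print(draft.missing_sections())
# ['decision_points', 'verification_steps', 'last_tested']
```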
Best practices
- Test recovery, not backup — a backup without tested recovery is not a backup
- Automate DR — Terraform / Ansible for DR environment spin-up, DNS failover
- Document runbooks — every scenario, contact, decision tree
- Expect failure — design for failure, don't expect everything to work
- Don't underestimate WRT — service recovery does not mean full operations (data warming, cache, connections)
- Align RTO/RPO with business — technical capabilities must match business requirements
- Monitor SLI — without data, SLO cannot be verified
- DR is not just IT — communication, PR, legal, compliance
Related
Sources
Links, books, and standards: sources/infrastructure/sources.en.md
Last revised: 2026-06-11