knowledge-base/DR.en.md
Stanislav Hubacek ef3c2f75b1 18.6.2026
2026-06-18 16:25:33 +02:00

# 🔄 Disaster Recovery and Business Continuity
## Terminology
| Abbreviation | Meaning | Description |
|---------|--------|-------|
| **RTO** | Recovery Time Objective | Maximum time from outage to service recovery |
| **RPO** | Recovery Point Objective | Maximum acceptable data loss (time since last backup) |
| **MTD** | Maximum Tolerable Downtime | Total outage duration an organization can survive |
| **WRT** | Work Recovery Time | Time needed for full operations recovery after IT restoration |
| **MTBF** | Mean Time Between Failures | Average operating time between two consecutive failures |
| **MTTR** | Mean Time To Repair | Average time needed to repair a failed component and restore service |
| **SLA** | Service Level Agreement | Contractual availability commitment |
| **SLO** | Service Level Objective | Internal availability target |
| **SLI** | Service Level Indicator | Measured availability value |
### Relationship between RTO, RPO, MTD, WRT
```
Last backup ──── RPO ────► Outage ──── RTO ────► Service running ──── WRT ────► Full operations
                  │                     │                              │
                  ▼                     ▼                              ▼
              Lost data        Time without service          Time to full capacity
MTD = RTO + WRT (max. time the business tolerates)
```
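
The relationship `MTD = RTO + WRT` can be checked numerically when targets are agreed with the business. A minimal sketch, assuming all values are in minutes; the class and method names are hypothetical, not part of any standard tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryTargets:
    """Recovery targets in minutes (illustrative names)."""
    rto: float  # time from outage to service recovery
    wrt: float  # time from service recovery to full operations
    mtd: float  # maximum downtime the business tolerates

    def total_recovery(self) -> float:
        """RTO + WRT, the quantity that must not exceed MTD."""
        return self.rto + self.wrt

    def fits_mtd(self) -> bool:
        """True when the planned recovery fits inside the tolerated downtime."""
        return self.total_recovery() <= self.mtd


targets = RecoveryTargets(rto=60, wrt=30, mtd=120)
print(targets.total_recovery(), targets.fits_mtd())
```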
---
## Uptime calculation
### Nines table
| Level | Uptime | Downtime / year | Downtime / month | Downtime / week |
|--------|--------|---------------|------------------|------------------|
| 90 % (one nine) | 0.9 | 36.5 days | 72 h | 16.8 h |
| 99 % (two nines) | 0.99 | 3.65 days | 7.2 h | 1.68 h |
| 99.5 % | 0.995 | 1.83 days | 3.6 h | 50.4 min |
| 99.9 % (three nines) | 0.999 | 8.76 h | 43.2 min | 10.1 min |
| 99.95 % | 0.9995 | 4.38 h | 21.6 min | 5.04 min |
| 99.99 % (four nines) | 0.9999 | 52.6 min | 4.32 min | 1.01 min |
| 99.995 % | 0.99995 | 26.3 min | 2.16 min | 30.2 s |
| 99.999 % (five nines) | 0.99999 | 5.26 min | 25.9 s | 6.05 s |
| 99.9999 % (six nines) | 0.999999 | 31.6 s | 2.59 s | 0.605 s |
### Calculation
```
Availability = (Total time - Downtime) / Total time × 100 %

Example:
  Year = 365 × 24 × 60 = 525,600 minutes
  Target: 99.9 % → allowed downtime = 525,600 × (1 - 0.999) = 525.6 minutes = 8.76 h

Combined availability (chain of dependencies):
  A_total = A_1 × A_2 × ... × A_n
  A_web = 99.9 % (3 nines)
  A_api = 99.99 % (4 nines)
  A_db  = 99.999 % (5 nines)
  A_total = 0.999 × 0.9999 × 0.99999 = 0.99889 ≈ 99.89 % (less than 3 nines!)

Parallel availability (redundancy):
  A_total = 1 - (1 - A_1) × (1 - A_2) × ... × (1 - A_n)
  Example: 2 servers with 99 % availability
  A_total = 1 - (1 - 0.99) × (1 - 0.99) = 1 - 0.01 × 0.01 = 0.9999 (99.99 %)
```
### Calculator
```python
def uptime_percent_to_downtime(pct, period_days=365):
    """Convert an uptime percentage (e.g. 99.9) to allowed downtime in minutes."""
    total_minutes = period_days * 24 * 60
    allowed_downtime = total_minutes * (1 - pct / 100)
    return allowed_downtime  # minutes


def downtime_to_uptime_percent(downtime_minutes, period_days=365):
    """Convert downtime in minutes to an uptime percentage."""
    total_minutes = period_days * 24 * 60
    return (1 - downtime_minutes / total_minutes) * 100


def combined_availability(availabilities):
    """Availability of series-connected components (inputs are fractions, e.g. 0.999)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant_availability(availabilities):
    """Availability of parallel (redundant) components (inputs are fractions)."""
    result = 1.0  # probability that every component is down at once
    for a in availabilities:
        result *= (1 - a)
    return 1 - result
```
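
The parallel formula can also be read in the other direction: how many identical replicas are needed to reach a target. A minimal sketch, assuming independent failures and perfect failover (both optimistic); `replicas_for_target` is a hypothetical helper name.

```python
import math


def replicas_for_target(single_availability: float, target: float) -> int:
    """Smallest n with 1 - (1 - a)^n >= target, for fractions strictly in (0, 1)."""
    if not (0 < single_availability < 1 and 0 < target < 1):
        raise ValueError("availabilities must be fractions strictly between 0 and 1")
    # Solve (1 - a)^n <= 1 - target for n.
    n = math.log(1 - target) / math.log(1 - single_availability)
    # The small tolerance absorbs floating-point noise around exact integers.
    return max(1, math.ceil(n - 1e-9))


print(replicas_for_target(0.99, 0.9999))
```

Correlated failures (shared power, shared deployment pipeline) break the independence assumption, so the result is a lower bound on the replica count rather than a guarantee.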
### Calculation fallacies
- **Combined availability is not a sum** — adding another dependency always reduces total availability
- **Redundancy is not free** — adding a standby component requires failure detection + failover (MTTR does not improve automatically)
- **SLA is not a guarantee** — providers often calculate SLA as a monthly average, not per-incident
- **Measurement is key** — without SLI, SLO cannot be verified; "unmeasured availability does not exist"
- **Planned maintenance** — sometimes counted as uptime, sometimes not (depends on SLA definition)
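
To make the measurement point concrete: an SLI can be derived from probe counts and compared against the SLO. A minimal sketch, assuming availability is measured as the share of successful health probes; the function names and the "failure allowance" measure are illustrative.

```python
def sli_percent(successful: int, total: int) -> float:
    """Measured availability as a percentage of successful probes."""
    if total <= 0:
        raise ValueError("no measurements: availability is unknown, not 100 %")
    return successful / total * 100


def failure_allowance_left(successful: int, total: int, slo_percent: float) -> float:
    """Share of the failures permitted by the SLO that is still unused.

    1.0 means no failures yet, 0.0 means the SLO is exactly met,
    a negative value means the SLO is violated.
    """
    allowed_failures = total * (1 - slo_percent / 100)
    actual_failures = total - successful
    if allowed_failures == 0:
        return 0.0 if actual_failures == 0 else float("-inf")
    return 1 - actual_failures / allowed_failures


print(sli_percent(9995, 10000), failure_allowance_left(9995, 10000, 99.9))
```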
---
## DR scenarios
### Classification
| Category | Scenario | Typical RTO | Typical RPO | Frequency |
|-----------|--------|-------------|-------------|-----------|
| **Site** | Entire DC / region outage | hours | minutes | Low |
| **Infrastructure** | HW failure (storage, switch, server) | minutes to hours | seconds | Medium |
| **Software** | OS, application, DB failure | minutes | seconds | High |
| **Data** | Data corruption, deletion, cryptolocker | hours | backup point | Low to medium |
| **Human** | Wrong deployment, config change | minutes to hours | seconds | Medium |
| **Security** | Attack, breach, ransomware | days | before attack | Low |
| **Network** | Connectivity outage, DDoS | minutes to hours | N/A | Medium |
| **Cloud provider** | Regional outage (AWS, Azure, GCP) | hours | minutes | Very low |
### Scenario details
#### Site / Region failure
| Aspect | Description |
|--------|-------|
| **Cause** | Blackout, fire, flood, earthquake, cloud provider outage |
| **Prevention** | Multi-AZ architecture, multi-region deployment, active-active |
| **Mitigation** | Automatic DNS failover (Route53, Azure Traffic Manager), replica in DR region |
| **Testing** | Game day: shut down primary region, verify automatic failover |
#### Data corruption / human error
| Aspect | Description |
|--------|-------|
| **Cause** | Wrong SQL command (DELETE without WHERE), accidentally deleted bucket, bad migration |
| **Prevention** | RBAC, MFA for destructive operations, change management, SQL peer review |
| **Mitigation** | Point-in-time recovery (PITR), transaction log replay, immutable backups |
| **Testing** | Restore backup to isolated environment, verify data integrity |
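
The integrity check after a test restore can be automated with checksums. A minimal sketch, assuming a manifest of SHA-256 digests was recorded at backup time; the manifest format and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(manifest: dict[str, str], restore_root: Path) -> list[str]:
    """Return relative paths that are missing or whose checksum differs."""
    failures = []
    for rel_path, expected in manifest.items():
        restored = restore_root / rel_path
        if not restored.is_file() or sha256_of(restored) != expected:
            failures.append(rel_path)
    return failures
```

An empty result only proves the files match the backup; it does not prove the backup itself was taken before the corruption.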
#### Ransomware / cyber attack
| Aspect | Description |
|--------|-------|
| **Cause** | Attack on production systems, data encryption, exfiltration |
| **Prevention** | Immutable backups (object lock), air-gapped backups, network segmentation |
| **Mitigation** | Restore from clean backup, rebuild infrastructure from IaC |
| **Testing** | Regular restore in isolated network, verify backup is not infected |
---
## Prevention — strategies
### Backup strategies
| Approach | Description | Use case |
|---------|-------|----------|
| **3-2-1 rule** | 3 copies, 2 different media, 1 off-site | Universal |
| **3-2-1-0** | + 0 errors after restore (testing) | Enterprise, compliance |
| **GFS (Grandfather-Father-Son)** | Daily, weekly, monthly rotation | Long-term archive |
| **Incremental forever** | Full backup 1×, then only changes | Large data volumes |
| **Reverse incremental** | Full + incremental, full is always current | Fast recovery |
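
The 3-2-1 rule is easy to check against a backup inventory. A minimal sketch, assuming the inventory lists every copy of the data including the production one; `BackupCopy` and `satisfies_3_2_1` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    media: str     # e.g. "disk", "tape", "object-storage"
    offsite: bool  # stored outside the primary site


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """At least 3 copies, on at least 2 media types, at least 1 off-site."""
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


inventory = [
    BackupCopy("disk", offsite=False),           # production data
    BackupCopy("disk", offsite=False),           # local backup
    BackupCopy("object-storage", offsite=True),  # cloud copy
]
print(satisfies_3_2_1(inventory))
```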
### Backup methods
| Method | RPO | RTO | Storage | Suitable for |
|--------|-----|-----|----------|------------|
| **Full backup** | Last full | Full restore time | Large | Small data, weekly |
| **Incremental** | Last incremental | Full + all incrementals | Small | Large data, daily |
| **Differential** | Last diff | Full + last diff | Medium | Compromise |
| **Snapshot** | Time of last snapshot | Seconds | Copy-on-write | VM, storage array |
| **Continuous (CDC)** | < 1 s | Seconds | Log stream | DB (binlog, WAL) |
| **PITR** | Any point in time | Depends on volume | Full + WAL | RDS, PostgreSQL, SQL Server |
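
The RTO column for incremental and differential backups follows from the restore chain each method needs. A minimal sketch, assuming a differential holds all changes since the last full backup and an incremental holds changes since the previous backup of any kind (tools differ on this); the types and names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    kind: str      # "full", "incremental" or "differential"
    taken_at: int  # ordering key, e.g. a Unix timestamp


def restore_chain(backups: list[Backup], target: int) -> list[Backup]:
    """Backups to apply, in order, to restore the state as of `target`."""
    usable = sorted((b for b in backups if b.taken_at <= target),
                    key=lambda b: b.taken_at)
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup at or before the target time")
    base = fulls[-1]
    later = [b for b in usable if b.taken_at > base.taken_at]
    chain = [base]
    diffs = [b for b in later if b.kind == "differential"]
    if diffs:
        # The newest differential replaces everything between it and the full.
        chain.append(diffs[-1])
        later = [b for b in later if b.taken_at > diffs[-1].taken_at]
    chain.extend(b for b in later if b.kind == "incremental")
    return chain
```

With daily incrementals the chain grows by one element per day, while with differentials it never exceeds two elements, which is the trade-off the table summarizes.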
### Backup immutability
Key protection against ransomware:
| Technique | Description |
|----------|-------|
| **Object Lock (WORM)** | Backup cannot be deleted or overwritten for a defined retention period (S3 Object Lock, Azure immutable Blob Storage) |
| **Air gap** | Backup is physically separated from the production network (offline disk, tape, cloud account with no VPN or other network path from production) |
| **Isolated backup network** | Backup traffic goes through a dedicated network without access from production VLAN |
| **Out-of-band access** | Backup management console is not accessible from the production network |
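
For S3 Object Lock, retention is set per object at upload time. A minimal sketch that only builds the arguments for the S3 `PutObject` call; it assumes the bucket was created with Object Lock enabled, and the helper name, bucket and key are placeholders. `COMPLIANCE` mode is chosen here because retention in that mode cannot be shortened by any user.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def object_lock_put_kwargs(bucket: str, key: str, body: bytes,
                           retention_days: int,
                           now: Optional[datetime] = None) -> dict:
    """Arguments for an S3 PutObject call that locks the object for `retention_days`."""
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    now = now or datetime.now(timezone.utc)
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": now + timedelta(days=retention_days),
    }


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   kwargs = object_lock_put_kwargs("example-backups", "db/daily.dump", data, 30)
#   boto3.client("s3").put_object(**kwargs)
```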
---
## DR architectures
### Multi-AZ (Single region)
```
Region ┌────────────────────────────────────┐
       │       AZ-1             AZ-2        │
       │   ┌──────────┐     ┌──────────┐    │
       │   │   App    │     │   App    │    │
       │   └─────┬────┘     └─────┬────┘    │
       │         │                │         │
       │   ┌─────▼────────────────▼─────┐   │
       │   │  Load Balancer (cross-AZ)  │   │
       │   └─────────────┬──────────────┘   │
       │                 │                  │
       │   ┌─────────────▼──────────────┐   │
       │   │  DB Primary (AZ-1)         │   │
       │   │  DB Standby (AZ-2)         │   │
       │   │  Synchronous replication   │   │
       │   └────────────────────────────┘   │
       └────────────────────────────────────┘
```
- RTO: minutes (automatic failover)
- RPO: 0 (sync replication)
- Protection: against AZ failure, not region failure
### Multi-Region
```
Region A (Primary)                   Region B (DR)
┌─────────────────────┐              ┌─────────────────────┐
│  ┌───────────────┐  │              │  ┌───────────────┐  │
│  │   App + DB    │  │              │  │   App + DB    │  │
│  │    Active     │──┼──Async───────┼─►│    Standby    │  │
│  └───────────────┘  │ replication  │  └───────────────┘  │
│         │           │              │         │           │
│  ┌──────▼───────┐   │              │  ┌──────▼───────┐   │
│  │  DNS / GSLB  │   │              │  │  DNS / GSLB  │   │
│  └──────┬───────┘   │              │  └──────┬───────┘   │
└─────────┼───────────┘              └─────────┼───────────┘
          │                                    │
          └──────────── Traffic Manager ───────┘
```
| Variant | RTO | RPO | Cost | Failover |
|----------|-----|-----|---------|----------|
| **Active-Passive** | minutes to hours | seconds | Medium | Manual / auto |
| **Active-Active** | seconds | < 1 s | High | Automatic (DNS) |
| **Pilot Light** | tens of minutes | minutes | Low | Manual scaling |
| **Warm Standby** | minutes | seconds | High | Auto (reduced copy) |
| **Backup & Restore** | hours | 24 h | Low | Manual |
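
The table can be turned into a simple selection rule: pick the cheapest variant that still meets the required RTO and RPO. A minimal sketch; the numeric bounds are illustrative assumptions, since the table only gives orders of magnitude, and the names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    name: str
    rto_s: float  # assumed upper bound of achievable RTO, seconds
    rpo_s: float  # assumed upper bound of achievable RPO, seconds
    cost: int     # 1 = low, 2 = medium, 3 = high (from the table)


# Assumed bounds, e.g. "hours" = 8 h, "tens of minutes" = 60 min, "minutes" = 10 min.
VARIANTS = [
    Variant("Backup & Restore", rto_s=8 * 3600, rpo_s=24 * 3600, cost=1),
    Variant("Pilot Light", rto_s=3600, rpo_s=600, cost=1),
    Variant("Active-Passive", rto_s=4 * 3600, rpo_s=60, cost=2),
    Variant("Warm Standby", rto_s=600, rpo_s=60, cost=3),
    Variant("Active-Active", rto_s=60, rpo_s=1, cost=3),
]


def cheapest_variant(required_rto_s: float, required_rpo_s: float) -> Optional[Variant]:
    """Cheapest variant whose assumed RTO and RPO both meet the requirement."""
    candidates = [v for v in VARIANTS
                  if v.rto_s <= required_rto_s and v.rpo_s <= required_rpo_s]
    # min() keeps the first of equally cheap candidates (list order above).
    return min(candidates, key=lambda v: v.cost) if candidates else None


choice = cheapest_variant(required_rto_s=3600, required_rpo_s=600)
print(choice.name if choice else "no variant meets the requirement")
```

Real values should come from measured failover tests, not from the assumed bounds above.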
### On-prem → Cloud DR (Hybrid)
```
On-prem DC                           Cloud (DR)
┌─────────────────────┐              ┌─────────────────────┐
│  ┌───────────────┐  │              │  ┌───────────────┐  │
│  │  Application  │  │              │  │  VM / App     │  │
│  │  + DB         │  │              │  │  + DB replica │  │
│  └───────┬───────┘  │              │  └───────┬───────┘  │
│          │          │              │          │          │
│  ┌───────▼───────┐  │  site-to-site│  ┌───────▼───────┐  │
│  │ Backup proxy  │──┼────VPN───────┼─►│ Backup store  │  │
│  └───────────────┘  │              │  └───────────────┘  │
│                     │              │                     │
│  ┌───────────────┐  │              │  ┌───────────────┐  │
│  │  Tape / NAS   │  │              │  │  Veeam / Zerto│  │
│  └───────────────┘  │              │  └───────────────┘  │
└─────────────────────┘              └─────────────────────┘
```
- **RTO**: tens of minutes (depends on VM startup)
- **RPO**: minutes to hours (depends on replication tool)
- **Tools**: Veeam, Zerto, Azure Site Recovery, AWS MGN, Commvault
- **Use case**: enterprise with on-prem DC that needs DR without a second DC
---
## DR testing
### Test types
| Type | Description | Frequency | Risk |
|-----|-------|-----------|--------|
| **Tabletop exercise** | Manual scenario walkthrough, no impact on production | Monthly | None |
| **Walkthrough** | Runbook verification, ensure everyone knows what to do | Quarterly | None |
| **Component test** | Test of a single component (e.g., restore one DB) | Monthly | Low |
| **Integrated test** | Test of the entire stack in isolated environment | Quarterly | Low |
| **Full failover test** | Production failover to DR site | Annually | High |
| **Chaos experiment** | Targeted fault injection into production | Continuous | Medium |
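
The frequencies in the table can be enforced with a simple overdue check. A minimal sketch, assuming the last run date of each test type is tracked somewhere; the day counts are assumptions derived from the frequency column, and chaos experiments are left out because they run continuously.

```python
from datetime import date, timedelta

# Maximum interval between runs (assumed day counts for monthly/quarterly/annually).
MAX_INTERVAL_DAYS = {
    "tabletop": 31,
    "walkthrough": 92,
    "component": 31,
    "integrated": 92,
    "full_failover": 366,
}


def overdue_tests(last_run: dict[str, date], today: date) -> list[str]:
    """Test types that were never run or whose last run is older than allowed."""
    overdue = []
    for test, max_days in MAX_INTERVAL_DAYS.items():
        last = last_run.get(test)
        if last is None or today - last > timedelta(days=max_days):
            overdue.append(test)
    return overdue


print(overdue_tests({"tabletop": date(2026, 6, 1)}, today=date(2026, 6, 18)))
```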
### Runbook structure
Each DR scenario should have a runbook:
```yaml
scenario: "Region A failure"
triggers:
  - "CloudWatch alarm: Region A health check 5× timeout"
  - "PagerDuty incident P0"
decision_tree: |
  1. Verify: is Region A really unavailable? (check from 3 different locations)
  2. Decide: is RTO at risk? If less than 30 % of the RTO remains → failover
  3. Failover: run playbook `dr-failover-region-b`
  4. Verification: smoke tests in Region B
  5. Communication: status page + stakeholders
rollback: |
  1. After Region A recovery → replicate changes from B back to A
  2. Repoint DNS to A
  3. Verify data consistency
  4. Shut down Region B (or keep as hot standby)
contacts:
  primary: "on-call@example.com"
  escalation: "infra-lead@example.com"
  management: "vp-engineering@example.com"
```
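
Steps 1 and 2 of the decision tree can be expressed as a small function so that the on-call engineer does not have to do the arithmetic under pressure. A minimal sketch, assuming all three probe locations must confirm the outage and that "RTO at risk" means less than 30 % of the RTO remains; the function and parameter names are hypothetical.

```python
def should_fail_over(elapsed_minutes: float, rto_minutes: float,
                     confirming_locations: int,
                     required_locations: int = 3,
                     remaining_threshold: float = 0.30) -> bool:
    """True when the outage is confirmed and too little of the RTO is left."""
    if rto_minutes <= 0:
        raise ValueError("rto_minutes must be positive")
    # Step 1: the outage must be confirmed from enough independent locations.
    if confirming_locations < required_locations:
        return False
    # Step 2: share of the RTO that is still available.
    remaining = max(0.0, rto_minutes - elapsed_minutes) / rto_minutes
    return remaining < remaining_threshold


print(should_fail_over(elapsed_minutes=45, rto_minutes=60, confirming_locations=3))
```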
---
## Best practices
- **Test recovery, not backup** — a backup without tested recovery is not a backup
- **Automate DR** — Terraform / Ansible for DR environment spin-up, DNS failover
- **Document runbooks** — every scenario, contact, decision tree
- **Expect failure** — design for failure, don't expect everything to work
- **Don't underestimate WRT** — service recovery does not mean full operations (data warming, cache, connections)
- **Align RTO/RPO with business** — technical capabilities must match business requirements
- **Monitor SLI** — without data, SLO cannot be verified
- **DR is not just IT** — communication, PR, legal, compliance
---
## Related
- [CLOUD.en.md](CLOUD.en.md) — cloud DR strategy, AWS/Azure/GCP specific
- [DATACENTERS.en.md](DATACENTERS.en.md) — DC redundancy, Tier classification
- [MONITORING.en.md](MONITORING.en.md) — alerting, SLI/SLO/SLA
- [CICD.en.md](CICD.en.md) — deployment strategy, rollback
- [STORAGE.en.md](STORAGE.en.md) — backup storage, replication
## Sources
Links, books, and standards: [sources/infrastructure/sources.en.md](sources/infrastructure/sources.en.md)
*Last revised: 2026-06-11*