# 🔄 Disaster Recovery and Business Continuity

## Terminology

| Abbreviation | Meaning | Description |
|---------|--------|-------|
| **RTO** | Recovery Time Objective | Maximum time from outage to service recovery |
| **RPO** | Recovery Point Objective | Maximum acceptable data loss, measured as the time back to the last recoverable point (e.g., the last backup) |
| **MTD** | Maximum Tolerable Downtime | Total outage duration an organization can survive |
| **WRT** | Work Recovery Time | Time needed to return to full operations after IT systems are restored |
| **MTBF** | Mean Time Between Failures | Average operating time between two consecutive failures |
| **MTTR** | Mean Time To Repair | Average time needed to repair a failed component and return it to service |
| **SLA** | Service Level Agreement | Contractual availability commitment |
| **SLO** | Service Level Objective | Internal availability target |
| **SLI** | Service Level Indicator | Measured availability value |

### Relationship between RTO, RPO, MTD, WRT

```
Last backup ──── RPO ────► Outage ──── RTO ────► Service running ──── WRT ────► Full operations
                  │                     │                              │
                  ▼                     ▼                              ▼
              Lost data        Time without service          Time to full capacity

MTD = RTO + WRT   (max. time the business tolerates)
```

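The relationship `MTD = RTO + WRT` can be checked mechanically when recovery targets are agreed. The following is a minimal sketch: the `RecoveryTargets` class, its field names, and the example figures are hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTargets:
    """Recovery targets for one service (hypothetical structure)."""
    rto: timedelta  # time from outage to service recovery
    wrt: timedelta  # time from service recovery to full operations
    mtd: timedelta  # total outage duration the business can tolerate

    def total_recovery(self) -> timedelta:
        """RTO + WRT: the time until full operations are restored."""
        return self.rto + self.wrt

    def fits_mtd(self) -> bool:
        """True if RTO + WRT stays within the maximum tolerable downtime."""
        return self.total_recovery() <= self.mtd


# Made-up example values: 4 h RTO, 2 h WRT, 8 h MTD.
targets = RecoveryTargets(
    rto=timedelta(hours=4), wrt=timedelta(hours=2), mtd=timedelta(hours=8)
)
print(targets.total_recovery(), targets.fits_mtd())
```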
---

## Uptime calculation

### Nines table

Monthly figures assume a 30-day month; yearly figures assume 365 days.

| Level | Availability (fraction) | Downtime / year | Downtime / month | Downtime / week |
|--------|--------|---------------|------------------|------------------|
| 90 % (one nine) | 0.9 | 36.5 days | 72 h | 16.8 h |
| 99 % (two nines) | 0.99 | 3.65 days | 7.2 h | 1.68 h |
| 99.5 % | 0.995 | 1.83 days | 3.6 h | 50.4 min |
| 99.9 % (three nines) | 0.999 | 8.76 h | 43.2 min | 10.1 min |
| 99.95 % | 0.9995 | 4.38 h | 21.6 min | 5.04 min |
| 99.99 % (four nines) | 0.9999 | 52.6 min | 4.32 min | 1.01 min |
| 99.995 % | 0.99995 | 26.3 min | 2.16 min | 30.2 s |
| 99.999 % (five nines) | 0.99999 | 5.26 min | 25.9 s | 6.05 s |
| 99.9999 % (six nines) | 0.999999 | 31.5 s | 2.59 s | 0.605 s |

### Calculation

```
Availability = (Total time - Downtime) / Total time × 100 %

Example:
  Year = 365 × 24 × 60 = 525,600 minutes
  Target: 99.9 % → allowed downtime = 525,600 × (1 - 0.999) = 525.6 minutes = 8.76 h

Combined availability (chain of dependencies):
  A_web = 99.9 %    (3 nines)
  A_api = 99.99 %   (4 nines)
  A_db  = 99.999 %  (5 nines)

  A_total = 0.999 × 0.9999 × 0.99999 = 0.99889 ≈ 99.89 %  (less than 3 nines!)

Parallel availability (redundancy):
  A_total = 1 - (1 - A_1) × (1 - A_2) × ... × (1 - A_n)

Example: 2 servers with 99 % availability
  A_total = 1 - (1 - 0.99) × (1 - 0.99) = 1 - 0.01 × 0.01 = 0.9999  (99.99 %)
```

### Calculator

```python
def uptime_percent_to_downtime(pct, period_days=365):
    """Convert an uptime percentage to allowed downtime (minutes) in the period."""
    total_minutes = period_days * 24 * 60
    allowed_downtime = total_minutes * (1 - pct / 100)
    return allowed_downtime  # minutes


def downtime_to_uptime_percent(downtime_minutes, period_days=365):
    """Convert downtime in minutes to an uptime percentage for the period."""
    total_minutes = period_days * 24 * 60
    return (1 - downtime_minutes / total_minutes) * 100


def combined_availability(availabilities):
    """Combined availability of series-connected components.

    Takes fractions (0.999), not percentages (99.9).
    """
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant_availability(availabilities):
    """Availability of redundant (parallel) components, given as fractions."""
    result = 1.0
    for a in availabilities:
        result *= (1 - a)  # probability that every component is down
    return 1 - result
```

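Series and parallel formulas are usually combined, because real stacks have redundant tiers chained behind one another. The sketch below is self-contained and uses its own helper names (`series`, `parallel`); the stack and its figures are hypothetical, and the calculation assumes component failures are independent.

```python
import math


def series(availabilities):
    """All components must be up: multiply the availabilities (fractions 0..1)."""
    return math.prod(availabilities)


def parallel(availabilities):
    """At least one component must be up: 1 minus the product of failure probabilities."""
    return 1 - math.prod(1 - a for a in availabilities)


# Hypothetical stack: two redundant web servers in front of one API and one DB.
web_tier = parallel([0.999, 0.999])
total = series([web_tier, 0.9999, 0.99999])
downtime_min_per_year = 365 * 24 * 60 * (1 - total)

print(f"web tier: {web_tier:.6f}")
print(f"total:    {total:.6f}")
print(f"downtime: {downtime_min_per_year:.1f} min/year")
```

With these inputs the redundant web tier stops being the weakest link, and the single API instance dominates the roughly 58 minutes of expected yearly downtime.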
### Calculation fallacies

- **Combined availability is not a sum**: availabilities of serial dependencies multiply, so every additional dependency lowers the total (unless that dependency is 100 % available)
- **Redundancy is not free**: a standby component only helps if failure detection and failover work (MTTR does not improve automatically)
- **SLA is not a guarantee**: providers often calculate SLA as a monthly average, not per incident
- **Measurement is key**: without an SLI, an SLO cannot be verified; "unmeasured availability does not exist"
- **Planned maintenance**: sometimes counted as uptime, sometimes not (depends on the SLA definition)

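To make the measurement point concrete, here is a minimal sketch of deriving an SLI from probe results and comparing it with an SLO. The function names, the one-minute probe interval, and the sample data are assumptions for illustration; real SLIs are often request-based rather than probe-based.

```python
def availability_sli(probe_results):
    """Measured availability: share of successful probes (True = success)."""
    if not probe_results:
        raise ValueError("no measurements: the SLI is undefined")
    return sum(probe_results) / len(probe_results)


def error_budget_remaining(sli, slo):
    """Fraction of the error budget left; a negative value means the SLO is breached."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    budget = 1 - slo  # allowed share of failures
    spent = 1 - sli   # observed share of failures
    return (budget - spent) / budget


# Hypothetical 30-day month of one-minute probes with 30 failed probes.
probes = [True] * (43_200 - 30) + [False] * 30
sli = availability_sli(probes)
remaining = error_budget_remaining(sli, slo=0.999)
print(f"SLI = {sli:.5f}, error budget left = {remaining:.1%}")
```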
---

## DR scenarios

### Classification

| Category | Scenario | Typical RTO | Typical RPO | Frequency |
|-----------|--------|-------------|-------------|-----------|
| **Site** | Entire DC / region outage | hours | minutes | Low |
| **Infrastructure** | HW failure (storage, switch, server) | minutes–hours | seconds | Medium |
| **Software** | OS, application, DB failure | minutes | seconds | High |
| **Data** | Data corruption, deletion, cryptolocker | hours | backup point | Low–medium |
| **Human** | Wrong deployment, config change | minutes–hours | seconds | Medium |
| **Security** | Attack, breach, ransomware | days | before attack | Low |
| **Network** | Connectivity outage, DDoS | minutes–hours | N/A | Medium |
| **Cloud provider** | Regional outage (AWS, Azure, GCP) | hours | minutes | Very low |

### Scenario details

#### Site / Region failure

| Aspect | Description |
|--------|-------|
| **Cause** | Blackout, fire, flood, earthquake, cloud provider outage |
| **Prevention** | Multi-AZ architecture, multi-region deployment, active-active |
| **Mitigation** | Automatic DNS failover (Route 53, Azure Traffic Manager), replica in DR region |
| **Testing** | Game day: shut down primary region, verify automatic failover |

#### Data corruption / human error

| Aspect | Description |
|--------|-------|
| **Cause** | Wrong SQL command (`DELETE` without `WHERE`), accidentally deleted bucket, bad migration |
| **Prevention** | RBAC, MFA for destructive operations, change management, SQL peer review |
| **Mitigation** | Point-in-time recovery (PITR), transaction log replay, immutable backups |
| **Testing** | Restore backup to isolated environment, verify data integrity |

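The "verify data integrity" step can be partly automated for file-based backups. The sketch below assumes a checksum manifest is recorded when the backup is taken and compared with one built from the restored copy; the function names are hypothetical. For databases, file checksums only cover dump files, so logical checks (row counts, application-level queries) would still be needed.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large dumps do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_manifests(expected, restored):
    """Return files that are missing, unexpected, or changed after a restore."""
    return {
        "missing": sorted(expected.keys() - restored.keys()),
        "unexpected": sorted(restored.keys() - expected.keys()),
        "changed": sorted(
            name
            for name in expected.keys() & restored.keys()
            if expected[name] != restored[name]
        ),
    }
```

A restore test would then call `compare_manifests(manifest_at_backup_time, build_manifest(restore_dir))` and fail if any of the three lists is non-empty.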
#### Ransomware / cyber attack

| Aspect | Description |
|--------|-------|
| **Cause** | Attack on production systems, data encryption, exfiltration |
| **Prevention** | Immutable backups (object lock), air-gapped backups, network segmentation |
| **Mitigation** | Restore from clean backup, rebuild infrastructure from IaC |
| **Testing** | Regular restore in isolated network, verify backup is not infected |

---

## Prevention strategies

### Backup strategies

| Approach | Description | Use case |
|---------|-------|----------|
| **3-2-1 rule** | 3 copies, 2 different media, 1 off-site | Universal |
| **3-2-1-0** | 3-2-1 plus 0 errors after restore (verified by testing) | Enterprise, compliance |
| **GFS (Grandfather-Father-Son)** | Daily, weekly, monthly rotation | Long-term archive |
| **Incremental forever** | One initial full backup, then only changes | Large data volumes |
| **Reverse incremental** | Each increment is merged into the full backup, so the full is always the most recent restore point | Fast recovery |

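The 3-2-1 and 3-2-1-0 rules are simple enough to check against a backup inventory. This is a minimal sketch: the `BackupCopy` record and its fields are hypothetical, and it assumes the production copy counts as one of the three copies, which is the common reading of the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One copy of the data (hypothetical inventory record)."""
    name: str
    medium: str             # e.g. "disk", "tape", "object-storage"
    offsite: bool           # stored outside the primary site
    verified: bool = False  # last test restore finished without errors


def check_3_2_1(copies, require_zero_errors=False):
    """Return a list of rule violations; an empty list means the rule is met."""
    problems = []
    if len(copies) < 3:
        problems.append(f"only {len(copies)} copies, need 3")
    if len({c.medium for c in copies}) < 2:
        problems.append("copies use fewer than 2 different media")
    if not any(c.offsite for c in copies):
        problems.append("no off-site copy")
    if require_zero_errors and not all(c.verified for c in copies):
        problems.append("not every copy has a verified restore (3-2-1-0)")
    return problems


inventory = [
    BackupCopy("production volume", "disk", offsite=False, verified=True),
    BackupCopy("local NAS backup", "disk", offsite=False, verified=True),
    BackupCopy("cloud bucket", "object-storage", offsite=True),
]
print(check_3_2_1(inventory))                            # []
print(check_3_2_1(inventory, require_zero_errors=True))  # one violation: unverified copy
```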
### Backup methods

| Method | RPO | RTO | Storage | Suitable for |
|--------|-----|-----|----------|------------|
| **Full backup** | Last full | Full restore time | Large | Small data, weekly |
| **Incremental** | Last incremental | Full + all incrementals | Small | Large data, daily |
| **Differential** | Last differential | Full + last differential | Medium | Compromise |
| **Snapshot** | Snapshot point in time | Seconds | Copy-on-write | VM, storage array |
| **Continuous (CDC)** | < 1 s | Seconds | Log stream | DB (binlog, WAL) |
| **PITR** | Any point in time | Depends on volume | Full + WAL | RDS, PostgreSQL, SQL Server |

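The RTO column for incremental and differential backups follows from which backups have to be applied during a restore. The sketch below computes that restore chain under stated assumptions: a differential contains all changes since the last full backup, an incremental contains changes since the previous backup of any kind, and timestamps are simplified to day numbers. The `Backup` record and `restore_chain` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    taken_at: int  # simplified timestamp, e.g. day number
    kind: str      # "full", "incremental", or "differential"


def restore_chain(backups, target_time):
    """Return the backups to apply, in order, to restore to target_time."""
    usable = sorted(
        (b for b in backups if b.taken_at <= target_time),
        key=lambda b: b.taken_at,
    )
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup at or before the target time")
    base = fulls[-1]
    later = [b for b in usable if b.taken_at > base.taken_at]

    chain = [base]
    # A differential holds every change since the last full backup, so only
    # the newest one is needed; incrementals taken after it must all be applied.
    diffs = [b for b in later if b.kind == "differential"]
    if diffs:
        chain.append(diffs[-1])
        later = [b for b in later if b.taken_at > diffs[-1].taken_at]
    chain.extend(b for b in later if b.kind == "incremental")
    return chain


incremental_week = [Backup(0, "full")] + [Backup(d, "incremental") for d in range(1, 7)]
differential_week = [Backup(0, "full")] + [Backup(d, "differential") for d in range(1, 7)]
print(len(restore_chain(incremental_week, 4)))   # full + 4 incrementals
print(len(restore_chain(differential_week, 4)))  # full + newest differential
```

The incremental scheme needs five restore steps for day 4, the differential scheme only two, which is the storage versus restore-time trade-off shown in the table.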
### Backup immutability

Key protection against ransomware:

| Technique | Description |
|----------|-------|
| **Object Lock (WORM)** | Backup cannot be deleted or overwritten for a defined retention period (S3 Object Lock, Azure immutable Blob Storage) |
| **Air gap** | Backup is physically separated from the production network (offline disk, tape, cloud storage with no VPN link to production) |
| **Isolated backup network** | Backup traffic goes through a dedicated network that cannot be reached from the production VLAN |
| **Out-of-band access** | Backup management console is not accessible from the production network |

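As an illustration of Object Lock, the sketch below builds the retention arguments for an S3 upload. The helper `object_lock_kwargs` is hypothetical; `ObjectLockMode` and `ObjectLockRetainUntilDate` are parameters of the boto3 `put_object` call. The bucket name, key, and 30-day retention in the commented usage are placeholders, and the bucket is assumed to have been created with Object Lock enabled.

```python
from datetime import datetime, timedelta, timezone


def object_lock_kwargs(retention_days, mode="COMPLIANCE", now=None):
    """Build the Object Lock arguments for an S3 put_object call."""
    if mode not in ("COMPLIANCE", "GOVERNANCE"):
        raise ValueError("mode must be COMPLIANCE or GOVERNANCE")
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    now = now or datetime.now(timezone.utc)
    return {
        "ObjectLockMode": mode,
        "ObjectLockRetainUntilDate": now + timedelta(days=retention_days),
    }


# Usage with boto3 (not executed here; names are placeholders):
#
#   import boto3
#   s3 = boto3.client("s3")
#   with open("db.dump", "rb") as body:
#       s3.put_object(Bucket="example-backups", Key="db/db.dump",
#                     Body=body, **object_lock_kwargs(retention_days=30))
```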
---

## DR architectures

### Multi-AZ (Single region)

```
Region ┌────────────────────────────────────┐
       │   AZ-1             AZ-2            │
       │   ┌──────────┐     ┌──────────┐    │
       │   │   App    │     │   App    │    │
       │   └─────┬────┘     └─────┬────┘    │
       │         │                │         │
       │   ┌─────▼────────────────▼─────┐   │
       │   │  Load Balancer (cross-AZ)  │   │
       │   └─────────────┬──────────────┘   │
       │                 │                  │
       │   ┌─────────────▼──────────────┐   │
       │   │  DB Primary (AZ-1)         │   │
       │   │  DB Standby (AZ-2)         │   │
       │   │  Synchronous replication   │   │
       │   └────────────────────────────┘   │
       └────────────────────────────────────┘
```

- RTO: minutes (automatic failover)
- RPO: 0 (sync replication)
- Protection: against AZ failure, not region failure

### Multi-Region

```
Region A (Primary)                   Region B (DR)
┌─────────────────────┐              ┌─────────────────────┐
│  ┌───────────────┐  │              │  ┌───────────────┐  │
│  │   App + DB    │  │              │  │   App + DB    │  │
│  │    Active     │──┼──Async───────┼─►│    Standby    │  │
│  └───────────────┘  │ replication  │  └───────────────┘  │
│         │           │              │         │           │
│  ┌──────▼───────┐   │              │  ┌──────▼───────┐   │
│  │  DNS / GSLB  │   │              │  │  DNS / GSLB  │   │
│  └──────┬───────┘   │              │  └──────┬───────┘   │
└─────────┼───────────┘              └─────────┼───────────┘
          │                                    │
          └──────────── Traffic Manager ───────┘
```

| Variant | RTO | RPO | Cost | Failover |
|----------|-----|-----|---------|----------|
| **Active-Passive** | minutes–hours | seconds | Medium | Manual / auto |
| **Active-Active** | seconds | < 1 s | High | Automatic (DNS) |
| **Pilot Light** | tens of minutes | minutes | Low | Manual scaling |
| **Warm Standby** | minutes | seconds | High | Auto (reduced copy) |
| **Backup & Restore** | hours | 24 h | Low | Manual |

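The variants table can drive a first-pass choice of architecture: pick the cheapest variant that still meets the required RTO and RPO. The sketch below is illustrative only. The table gives qualitative ranges ("minutes", "hours"), so the concrete durations here are assumed worst-case values chosen for the example, not figures from any provider, and should be replaced with measured numbers.

```python
from datetime import timedelta

# (variant, assumed worst-case RTO, assumed worst-case RPO, cost rank: 1 = low, 3 = high)
# All durations are illustrative assumptions derived from the qualitative table.
VARIANTS = [
    ("Backup & Restore", timedelta(hours=8), timedelta(hours=24), 1),
    ("Pilot Light", timedelta(hours=1), timedelta(minutes=15), 1),
    ("Active-Passive", timedelta(hours=2), timedelta(minutes=1), 2),
    ("Warm Standby", timedelta(minutes=15), timedelta(minutes=1), 3),
    ("Active-Active", timedelta(minutes=1), timedelta(seconds=1), 3),
]


def cheapest_variant(required_rto, required_rpo):
    """Return the lowest-cost variant whose assumed RTO and RPO meet the targets."""
    candidates = [
        v for v in VARIANTS if v[1] <= required_rto and v[2] <= required_rpo
    ]
    if not candidates:
        return None  # no variant in the table meets the targets
    # min() keeps the first entry among equal cost ranks, so list order breaks ties.
    return min(candidates, key=lambda v: v[3])[0]


print(cheapest_variant(timedelta(hours=4), timedelta(hours=1)))
print(cheapest_variant(timedelta(minutes=30), timedelta(minutes=5)))
```

With the assumed figures, a 4 h RTO and 1 h RPO is met by Pilot Light, while a 30 min RTO and 5 min RPO needs Warm Standby.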
### On-prem → Cloud DR (Hybrid)

```
On-prem DC                           Cloud (DR)
┌─────────────────────┐              ┌─────────────────────┐
│  ┌───────────────┐  │              │  ┌───────────────┐  │
│  │  Application  │  │              │  │   VM / App    │  │
│  │     + DB      │  │              │  │ + DB replica  │  │
│  └───────┬───────┘  │              │  └───────┬───────┘  │
│          │          │              │          │          │
│  ┌───────▼───────┐  │ site-to-site │  ┌───────▼───────┐  │
│  │ Backup proxy  │──┼────VPN───────┼─►│ Backup store  │  │
│  └───────────────┘  │              │  └───────────────┘  │
│                     │              │                     │
│  ┌───────────────┐  │              │  ┌───────────────┐  │
│  │  Tape / NAS   │  │              │  │ Veeam / Zerto │  │
│  └───────────────┘  │              │  └───────────────┘  │
└─────────────────────┘              └─────────────────────┘
```

- **RTO**: tens of minutes (depends on VM startup)
- **RPO**: minutes–hours (depends on replication tool)
- **Tools**: Veeam, Zerto, Azure Site Recovery, AWS MGN, Commvault
- **Use case**: enterprise with on-prem DC that needs DR without a second DC

---

## DR testing

### Test types

| Type | Description | Frequency | Risk |
|-----|-------|-----------|--------|
| **Tabletop exercise** | Discussion-based scenario walkthrough, no impact on production | Monthly | None |
| **Walkthrough** | Step-by-step runbook review to confirm everyone knows what to do | Quarterly | None |
| **Component test** | Test of a single component (e.g., restore one DB) | Monthly | Low |
| **Integrated test** | Test of the entire stack in an isolated environment | Quarterly | Low |
| **Full failover test** | Production failover to the DR site | Annually | High |
| **Chaos experiment** | Targeted fault injection into production | Continuous | Medium |

### Runbook structure

Each DR scenario should have a runbook:

```yaml
scenario: "Region A failure"
triggers:
  - "CloudWatch alarm: Region A health check 5× timeout"
  - "PagerDuty incident P0"
decision_tree: |
  1. Verify: is Region A really unavailable? (check from 3 different locations)
  2. Decide: is RTO at risk? If < 30 % RTO remaining → failover
  3. Failover: run playbook `dr-failover-region-b`
  4. Verification: smoke tests in Region B
  5. Communication: status page + stakeholders
rollback: |
  1. After Region A recovery → replicate changes from B back to A
  2. Repoint DNS to A
  3. Verify data consistency
  4. Shut down Region B (or keep as hot standby)
contacts:
  primary: "on-call@example.com"
  escalation: "infra-lead@example.com"
  management: "vp-engineering@example.com"
```

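Steps 1 and 2 of the decision tree can be expressed as a small function so the failover decision is reproducible under pressure. This is a sketch under one interpretation of the runbook: "< 30 % RTO remaining" is read as the share of the RTO window still left since the outage started. The function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta


def should_fail_over(outage_start, now, rto, confirmed_by,
                     threshold=0.30, required_locations=3):
    """Apply steps 1 and 2 of the runbook decision tree."""
    # Step 1: the outage must be confirmed from enough independent locations.
    if confirmed_by < required_locations:
        return False
    # Step 2: fail over once less than `threshold` of the RTO window remains.
    remaining = rto - (now - outage_start)
    return remaining / rto < threshold


start = datetime(2026, 6, 11, 12, 0)
rto = timedelta(minutes=60)
print(should_fail_over(start, start + timedelta(minutes=30), rto, confirmed_by=3))  # False
print(should_fail_over(start, start + timedelta(minutes=45), rto, confirmed_by=3))  # True
print(should_fail_over(start, start + timedelta(minutes=45), rto, confirmed_by=2))  # False
```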
---

## Best practices

- **Test recovery, not backup**: a backup without tested recovery is not a backup
- **Automate DR**: use Terraform / Ansible for DR environment spin-up and DNS failover
- **Document runbooks**: every scenario, contact, and decision tree
- **Expect failure**: design for failure, don't expect everything to work
- **Don't underestimate WRT**: service recovery does not mean full operations (data warming, cache, connections)
- **Align RTO/RPO with business**: technical capabilities must match business requirements
- **Monitor SLI**: without data, an SLO cannot be verified
- **DR is not just IT**: it also involves communication, PR, legal, and compliance

---

## Related

- [CLOUD.md](CLOUD.md): cloud DR strategy, AWS/Azure/GCP specifics
- [DATACENTERS.md](DATACENTERS.md): DC redundancy, Tier classification
- [MONITORING.md](MONITORING.md): alerting, SLI/SLO/SLA
- [CICD.md](CICD.md): deployment strategy, rollback
- [STORAGE.md](STORAGE.md): backup storage, replication

## Sources

Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)

*Last revised: 2026-06-11*