new files

2026-06-16 15:47:45 +02:00
parent 3fa11ef0f6
commit b53714113c
11 changed files with 2298 additions and 7 deletions
--- a/DATACENTERS.en.md
+++ b/DATACENTERS.en.md
@@ -658,6 +658,281 @@ flowchart TD
    CLIM -->|"Cold (SE, NO)"| FC3["Free cooling 7000+ h/year<br/>Air-side economizer<br/>PUE < 1.2"]
 ```

+## Secondary data center topologies
+
+When planning a second DC, the key decision is the topology, which depends on distance, RPO/RTO, and budget.
+
+### Distance classification
+
+| Category | Distance | Latency (round-trip) | Use case |
+|-----------|-----------|---------------------|----------|
+| **Metro (Campus)** | 1–20 km | < 1 ms | Synchronous replication, stretched cluster |
+| **Metro** | 20–100 km | 1–5 ms | Metro cluster, mostly sync replication |
+| **Regional** | 100–500 km | 5–20 ms | Asynchronous replication, warm standby |
+| **Continent** | 500–3000 km | 20–100 ms | Asynchronous replication, cold standby |
+| **Global** | 3000+ km | > 100 ms | Async only, no real-time dependencies |
+
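The distance bands in the table above can be turned into a small lookup. This is only a sketch: the names `DistanceClass`, `BANDS`, and `classify_distance` are hypothetical, and treating each upper bound as exclusive is an assumption, since the table's ranges share their boundary values.

```python
from typing import NamedTuple


class DistanceClass(NamedTuple):
    name: str
    max_km: float      # upper bound of the band, treated as exclusive (assumption)
    typical_use: str


# Bands copied from the distance classification table.
BANDS = [
    DistanceClass("Metro (Campus)", 20, "synchronous replication, stretched cluster"),
    DistanceClass("Metro", 100, "metro cluster, mostly sync replication"),
    DistanceClass("Regional", 500, "asynchronous replication, warm standby"),
    DistanceClass("Continent", 3000, "asynchronous replication, cold standby"),
    DistanceClass("Global", float("inf"), "async only, no real-time dependencies"),
]


def classify_distance(distance_km: float) -> DistanceClass:
    """Return the first band whose upper bound exceeds the given distance."""
    if distance_km < 0:
        raise ValueError("distance must be non-negative")
    for band in BANDS:
        if distance_km < band.max_km:
            return band
    return BANDS[-1]


band = classify_distance(35)
print(f"{band.name}: {band.typical_use}")
```

A distance that sits exactly on a boundary (for example 100 km) falls into the farther band here, which is the more conservative reading of the table.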
+### Topologies by operational mode
+
+#### Active-Active (Hot-Hot)
+
+```
+DC-A (Primary)                 DC-B (Active)
+┌────────────────────┐        ┌────────────────────┐
+│  App Active        │        │  App Active        │
+│  DB Active         │◄─sync─►│  DB Active         │
+│  Users → LB → A    │        │  Users → LB → B    │
+└────────────────────┘        └────────────────────┘
+           │                         │
+           └──── Global Load Balancer ────┘
+```
+
+| Parameter | Value |
+|----------|---------|
+| **RTO** | 0–seconds (automatic failover, traffic is redirected) |
+| **RPO** | 0 (sync replication, commit is confirmed only after write to both DCs) |
+| **Max distance** | < 100 km (latency < 5 ms RTT for sync DB replication) |
+| **Operating costs** | 2× (both DCs fully active, both fully equipped) |
+| **Advantages** | Zero downtime, instant switchover, full utilization of both DCs |
+| **Disadvantages** | Requires synchronous replication → distance limit, complex networking, split-brain risk |
+
+**Split-brain solutions**: STONITH (Shoot The Other Node In The Head), watchdog, quorum (3rd node in 3rd location / cloud), fencing, SCSI-3 persistent reservation.
+
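The quorum approach mentioned above comes down to a strict-majority check. The sketch below is illustrative only (the function name `has_quorum` and the one-vote-per-site model are assumptions, not any particular cluster manager's API). It shows why a third vote in a third location matters.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """A partition may keep accepting writes only if it holds a strict majority."""
    if total_votes <= 0 or not 0 <= reachable_votes <= total_votes:
        raise ValueError("invalid vote counts")
    return 2 * reachable_votes > total_votes


# Two DCs plus a witness in a third location, inter-DC link down:
print(has_quorum(reachable_votes=2, total_votes=3))  # DC-A still reaches the witness
print(has_quorum(reachable_votes=1, total_votes=3))  # DC-B is isolated

# Two DCs without a witness: neither side has a majority after a link failure.
print(has_quorum(reachable_votes=1, total_votes=2))
```

With only two votes, a link failure leaves both sides without a majority, so the cluster either stops or risks split-brain. The third vote breaks the tie.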
+**Use case**: financial services, telco, payment gateways, where even a minute of downtime can cost millions.
+
+#### Active-Passive (Hot-Warm, MetroCluster)
+
+```
+DC-A (Primary)                 DC-B (Standby)
+┌────────────────────┐        ┌────────────────────┐
+│  App Active        │        │  App Standby       │
+│  DB Primary        │──sync──►│  DB Standby        │
+│  Users → LB → A    │        │  ~~~ (waiting) ~~~ │
+│  DNS: A-record     │        │  DNS: health check │
+└────────────────────┘        └────────────────────┘
+```
+
+| Parameter | Value |
+|----------|---------|
+| **RTO** | tens of seconds–minutes (DNS failover + App startup) |
+| **RPO** | 0 (sync) or seconds (async) |
+| **Max distance** | sync < 100 km, async unlimited |
+| **Operating costs** | 1.5–1.8× (second DC has reduced or idle compute) |
+| **MetroCluster** | Specific implementation: FC SAN over DWDM, sync mirror, automatic failover |
+
+**MetroCluster** (NetApp, Dell EMC, HPE):
+- Storage-based cluster with synchronous mirroring between DCs
+- Automatic failover on entire DC failure
+- Requires dedicated DWDM or dark fiber interconnection
+- Typical distance: up to 50 km (for latency < 1 ms RTT)
+- Use case: enterprise storage, primary+secondary DC in metropolitan area
+
+#### Hot-Cold (Warm Standby → Cold)
+
+```
+DC-A (Primary)                 DC-B (Cold Standby)
+┌────────────────────┐        ┌────────────────────┐
+│  App Active        │        │  ~~~ powered off ~~~│
+│  DB Active         │──async─►│  Backup storage    │
+│  Users → A         │        │  ~~~ no compute ~~~│
+└────────────────────┘        └────────────────────┘
+```
+
+| Parameter | Value |
+|----------|---------|
+| **RTO** | hours–days (purchase/rent HW, restore from backup) |
+| **RPO** | hours (last backup) |
+| **Max distance** | unlimited |
+| **Operating costs** | 1.1–1.3× (only storage and facility, compute only at failover) |
+| **Typical use case** | Low-cost DR, compliance, last resort |
+
+#### Pilot Light
+
+```
+DC-A (Primary)                 DC-B (Pilot Light)
+┌────────────────────┐        ┌────────────────────┐
+│  App Active        │        │  ~~~ off ~~~       │
+│  DB Active         │──async─►│  DB replica (mini) │
+│  All services      │        │  Core services only│
+│                    │        │  (DNS, LDAP, mon)  │
+└────────────────────┘        └────────────────────┘
+                              On DR: spin-up compute
+                              from IaC, rest from backup
+```
+
+- DC-B runs with minimum compute (only core services and DB replica)
+- Application layer is spun up from IaC (Terraform, Ansible) only during DR
+- Compromise between cost and RTO
+
+### Comparison table
+
+| Topology | RTO | RPO | Cost (× primary) | Max distance | Failover |
+|-----------|-----|-----|-------------------|-------------|----------|
+| **Active-Active** | 0–s | 0 | 2.0× | < 100 km | Auto (traffic) |
+| **MetroCluster** | s–min | 0 | 1.8–2.0× | < 50 km | Auto (storage) |
+| **Active-Passive (sync)** | min | 0 | 1.5–1.8× | < 100 km | Semi-auto |
+| **Active-Passive (async)** | min–h | s–min | 1.3–1.5× | unlimited | Semi-auto |
+| **Pilot Light** | h | min–h | 1.2–1.4× | unlimited | Manual |
+| **Warm Standby** | min–h | s–min | 1.5–1.8× | unlimited | Semi-auto |
+| **Cold Standby** | days | h | 1.1–1.3× | unlimited | Manual |
+
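The comparison table can also be used as data. The following sketch filters the rows by inter-DC distance and by an upper limit on the cost factor. The `Topology` class and the `feasible` function are hypothetical names, the cost value is the upper end of each range in the table, and `None` stands for "unlimited" distance.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Topology:
    name: str
    max_cost_factor: float             # upper end of the "Cost (x primary)" column
    max_distance_km: Optional[float]   # None means "unlimited"
    failover: str


# Rows copied from the comparison table.
TOPOLOGIES = [
    Topology("Active-Active", 2.0, 100, "Auto (traffic)"),
    Topology("MetroCluster", 2.0, 50, "Auto (storage)"),
    Topology("Active-Passive (sync)", 1.8, 100, "Semi-auto"),
    Topology("Active-Passive (async)", 1.5, None, "Semi-auto"),
    Topology("Pilot Light", 1.4, None, "Manual"),
    Topology("Warm Standby", 1.8, None, "Semi-auto"),
    Topology("Cold Standby", 1.3, None, "Manual"),
]


def feasible(distance_km: float, budget_factor: float) -> List[Topology]:
    """Topologies that fit the distance and cost limits, cheapest first."""
    matches = [
        t for t in TOPOLOGIES
        if (t.max_distance_km is None or distance_km < t.max_distance_km)
        and t.max_cost_factor <= budget_factor
    ]
    return sorted(matches, key=lambda t: t.max_cost_factor)


for t in feasible(distance_km=300, budget_factor=1.5):
    print(f"{t.name}: up to {t.max_cost_factor}x, failover {t.failover}")
```

RTO and RPO are left out on purpose: the table gives them as coarse ranges ("s–min", "h"), which would need an ordinal encoding before they could be compared.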
+### Stretched Cluster
+
+```
+┌──── Site A (50 km) ────┐    ┌──── Site B ──────────┐
+│  ┌──────────────────┐   │    │  ┌──────────────────┐ │
+│  │  ESXi / Hyper-V  │   │    │  │  ESXi / Hyper-V  │ │
+│  │  VM               │   │    │  │  VM (complement) │ │
+│  └────────┬─────────┘   │    │  └────────┬─────────┘ │
+│           │             │    │           │            │
+│  ┌────────▼─────────┐  │    │  ┌────────▼─────────┐  │
+│  │  Storage (SAN)   │──┼────┼──│  Storage (SAN)   │  │
+│  │  MetroCluster    │  │    │  │  MetroCluster    │  │
+│  └──────────────────┘  │    │  └──────────────────┘  │
+└────────────────────────┘    └────────────────────────┘
+                │
+          ┌─────▼──────┐
+          │  vCenter / │
+          │  Cluster   │
+          │  (single)  │
+          └────────────┘
+```
+
+- One cluster stretched across two sites (single management domain)
+- VMs can live-migrate between sites (vMotion over distance)
+- Storage synchronously mirrored (MetroCluster, VPLEX, vSAN Stretched Cluster)
+- **Requirements**: dark fiber / DWDM, low latency (< 5 ms), high link reliability
+- **Risks**: split-brain when the inter-site link fails (the cluster partitions into two sites), network dependency
+- **Use case**: enterprise with own dark fiber between two DCs in a metropolitan area
+
+### Decision tree
+
+```mermaid
+flowchart TD
+    Start(["Secondary DC"]) --> RPO{"Required RPO?"}
+    RPO -->|"0 (no data loss)"| SYNC{"Sync replication possible?"}
+    SYNC -->|"Yes, < 100 km"| ACT{"Want zero downtime?"}
+    ACT -->|"Yes"| AA["Active-Active<br/>RTO=0, RPO=0, 2× cost"]
+    ACT -->|"No"| AP["Active-Passive<br/>RTO=min, RPO=0, 1.5×"]
+    SYNC -->|"No, > 100 km"| ASYNC["Active-Passive (async)<br/>RTO=min, RPO=s, 1.3×"]
+
+    RPO -->|"minutes–hours"| WARM{"Want fast failover?"}
+    WARM -->|"Yes"| PILOT["Pilot Light<br/>RTO=h, RPO=min, 1.2×"]
+    WARM -->|"No"| COLD["Cold Standby<br/>RTO=days, RPO=h, 1.1×"]
+
+    Start --> DIST{"Distance between DCs"}
+    DIST -->|"< 50 km, own fiber"| MC["MetroCluster / Stretched Cluster<br/>Single management, sync storage"]
+    DIST -->|"50–300 km"| REG["Regional DR<br/>Active-Passive, async replication"]
+    DIST -->|"> 300 km"| GLOBAL["Global DR<br/>Cold standby, backup & restore"]
+```
+
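The RPO branch of the flowchart above can be written as a plain function. This is a sketch under stated assumptions: `choose_topology` and its parameters are hypothetical names, a distance of exactly 100 km is treated as too far for sync, and the separate distance branch of the chart is not modelled.

```python
def choose_topology(
    rpo_zero: bool,
    distance_km: float,
    zero_downtime: bool = False,
    fast_failover: bool = False,
) -> str:
    """Follow the RPO branch of the decision tree and return the leaf label."""
    if rpo_zero:
        # Sync replication is only considered possible below 100 km.
        if distance_km < 100:
            return "Active-Active" if zero_downtime else "Active-Passive (sync)"
        # Beyond that the chart falls back to async, where RPO becomes seconds.
        return "Active-Passive (async)"
    # RPO of minutes to hours is acceptable.
    return "Pilot Light" if fast_failover else "Cold Standby"


print(choose_topology(rpo_zero=True, distance_km=40, zero_downtime=True))
print(choose_topology(rpo_zero=True, distance_km=400))
print(choose_topology(rpo_zero=False, distance_km=400, fast_failover=True))
```

Note that the second call asks for RPO = 0 but ends on an async topology. The flowchart makes the same trade-off, so a strict zero-data-loss requirement over such a distance needs a different design or a closer site.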
+### Physical infrastructure for DC interconnection
+
+| Technology | Bandwidth | Max distance | Latency | Use case |
+|------------|-----------|-------------|---------|----------|
+| **Dark fiber** | 100 GbE–800 GbE | 10–80 km (single-mode) | < 0.1 ms | MetroCluster, stretched cluster |
+| **DWDM** | 400 GbE–1.6 TbE (per lambda) | 80–120 km (without amplifier) | < 0.5 ms | Metro, metro cluster |
+| **CWDM** | 10–25 GbE (per channel) | 10–40 km | < 0.3 ms | Campus, smaller metro |
+| **MPLS L2VPN** | 10–100 GbE | unlimited | 1–10 ms | Regional DR, async replication |
+| **Internet IPsec** | 1–10 GbE | unlimited | 5–50 ms | Cold standby, backup |
+
+### Impact of individual technologies on DC topology selection
+
+Choosing a secondary DC topology is not purely an infrastructure decision — each layer (DB, hypervisor, orchestration, messaging) brings its own constraints.
+
+#### Databases
+
+| DB technology | Sync replication | Max distance | Auto-failover | Split-brain handling | Note |
+|---------------|---------------|-------------|---------------|-------------------|----------|
+| **PostgreSQL** | Synchronous commit (synchronous_standby_names) | < 100 km (latency < 10 ms) | Patroni / repmgr + etcd | Quorum (etcd, 3+ nodes) | Streaming replication, needs wal_keep_size (wal_keep_segments before PostgreSQL 13) |
+| **MySQL** | Group Replication (multi-primary, single-primary) | < 100 km | MySQL InnoDB Cluster + MySQL Router | Paxos (Group Replication, 3+ nodes) | Semi-sync as compromise |
+| **Oracle** | Data Guard (SYNC/FASTSYNC/ASYNC), RAC extended | sync < 100 km, async unlimited | Data Guard Broker / FSFO (Fast Start Failover) | Observer (3rd node) | Far Sync for remote DCs |
+| **MSSQL** | AlwaysOn Availability Groups (SYNCHRONOUS_COMMIT) | < 100 km | AlwaysOn + Cluster quorum | File share majority / cloud witness | Multi-site cluster support |
+| **MongoDB** | Majority write concern + journaling | < 100 km | Replica set auto-election | Arbiter (voting member) | Priority-based failover |
+| **Cassandra** | N/A (multi-master, eventual consistency) | unlimited | Yes (peer-to-peer) | None (multi-master, gossip protocol) | Snitch-aware topology, NetworkTopologyStrategy |
+| **Redis** | Redis Sentinel / Redis Cluster (async) | unlimited (async) | Sentinel / Cluster failover | Quorum (Sentinel, majority) | PSYNC replication, replication lag |
+
+Key limitation for **sync replication**: latency < 5 ms RTT (a commit must wait for confirmation from both DCs). At 100 km the RTT is ~1 ms, which is fine. At 1000 km (~10 ms RTT) sync replication can reduce transaction throughput by 80+ %.
+
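The RTT figures above follow from the propagation delay in fiber. The sketch below assumes about 5 µs per km one way (roughly 200,000 km/s), ignores equipment and routing detours, and models a single session that commits serially and waits one full RTT per commit. The 2 ms local commit time is an illustrative assumption, not a measurement.

```python
FIBER_DELAY_US_PER_KM = 5.0  # assumption: ~200,000 km/s signal speed in fiber


def fiber_rtt_ms(distance_km: float) -> float:
    """Round-trip propagation delay over fiber, without equipment latency."""
    return 2 * distance_km * FIBER_DELAY_US_PER_KM / 1000.0


def throughput_loss(local_commit_ms: float, rtt_ms: float) -> float:
    """Fraction of throughput lost by one serial session that waits an RTT per commit."""
    return 1.0 - local_commit_ms / (local_commit_ms + rtt_ms)


for km in (100, 1000):
    rtt = fiber_rtt_ms(km)
    loss = throughput_loss(local_commit_ms=2.0, rtt_ms=rtt)
    print(f"{km} km: RTT {rtt:.1f} ms, serial throughput loss {loss:.0%}")
```

With the assumed 2 ms local commit, the 1000 km case works out to roughly 83 % lost throughput for a single serial session, in line with the 80+ % figure above. Concurrent sessions hide part of that cost, so real workloads need to be measured.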
+Suitable for **Active-Active**:
+- **Cassandra / ScyllaDB** — native multi-DC, eventual consistency, no split-brain
+- **MySQL Group Replication (multi-primary)** — 3+ DC for quorum
+- **CockroachDB / TiDB** — native multi-region, ACID across DCs
+- **Redis Enterprise** — Active-Active (CRDT-based)
+
+Suitable for **Active-Passive**:
+- **PostgreSQL + Patroni** — auto-failover, etcd quorum
+- **Oracle Data Guard** — FSFO, far sync for remote DCs
+- **MSSQL AlwaysOn** — cloud witness
+- **MongoDB Replica Set** — arbitration node in 3rd location
+
+#### Hypervisors
+
+| Hypervisor | Cluster technology | Stretched cluster | Max distance | Split-brain |
+|-----------|-------------------|-------------------|-------------|-------------|
+| **VMware vSphere** | vSAN Stretched Cluster, Metro vCenter, Site Recovery Manager | Yes (vSAN Stretched Cluster, Metro Cluster) | < 50 km (vSAN Stretched Cluster), < 10 ms RTT | Fencing (STONITH), witness host |
+| **Hyper-V** | Storage Replica + Failover Cluster | Yes (Cluster Sets) | < 50 km (sync), unlimited (async) | File share witness / cloud witness |
+| **Proxmox VE** | Proxmox HA + Ceph | Limited (Ceph stretch cluster) | < 50 km (Ceph sync) | Ceph monitor quorum (3+ DC) |
+| **XCP-ng / XenServer** | Xen Orchestra HA + SR (Storage Repository) replication | Limited | depends on storage replication | — |
+| **Nutanix AHV** | Metro Availability (sync), Async DR | Yes (Metro) | < 100 km (sync), unlimited (async) | Witness VM (cloud / 3rd site) |
+| **KVM / oVirt** | oVirt HA + GlusterFS / NFS | Limited | depends on storage replication | — |
+
+**vSAN Stretched Cluster specific requirements:**
+- Dedicated vSAN network (25 GbE min., < 5 ms RTT)
+- Witness host in 3rd location (or cloud witness)
+- All VM policies (FTT=1, mirroring striped)
+- Storage policy: `site-A + site-B + witness`
+
+#### Kubernetes and container platforms
+
+| Platform | Multi-cluster DR | Replication | Max distance | Failover |
+|-----------|-----------------|-----------|-------------|----------|
+| **Vanilla K8s** | KubeFed, Cluster API, Velero + Restic | Velero (backup/restore), Rook (Ceph) | unlimited | Manual (Velero restore) |
+| **OpenShift** | ACM (Advanced Cluster Management), Velero | OADP (OpenShift API for Data Protection) | unlimited | ACM failover (subscription) |
+| **Rancher** | Rancher Multi-Cluster App, Velero | Longhorn (sync/async DR), Velero | unlimited | Semi-auto |
+| **Google GKE** | Multi-cluster Services, Backup for GKE | Config Sync, Backup for GKE | unlimited | Manual |
+| **Azure AKS** | Azure Arc + Velero + Azure Traffic Manager | AKS backup (Velero), Azure Site Recovery | unlimited | Manual (Velero) |
+| **AWS EKS** | EKS multi-cluster, Velero + S3 cross-region | Velero (S3), Rook (EBS snapshots) | unlimited | Manual |
+
+**Key K8s DR principles:**
+- **Applications must be stateless** (or state externalized to DB/storage)
+- **Velero** — backup/restore entire cluster (PV, resources, helm releases)
+- **Rook/Ceph** — cross-region mirroring RBD volumes
+- **KubeFed / ACM** — subscription-based deploy to multiple clusters
+- **Ingress/Gateway API** — traffic routing between clusters
+- **External DNS** — DNS failover on cluster outage
+
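The DNS failover principle from the list above boils down to "point the record at the first healthy cluster in priority order". A minimal sketch, assuming the health results come from an external probe. The function `pick_active_cluster` and the cluster names are hypothetical, and no real DNS or Kubernetes API is called.

```python
from typing import Mapping, Optional, Sequence


def pick_active_cluster(
    priority: Sequence[str], healthy: Mapping[str, bool]
) -> Optional[str]:
    """Return the first cluster in priority order that reports healthy, else None."""
    for cluster in priority:
        # A cluster with no health report is treated as down.
        if healthy.get(cluster, False):
            return cluster
    return None


priority = ["dc-a", "dc-b"]  # hypothetical cluster names
print(pick_active_cluster(priority, {"dc-a": True, "dc-b": True}))
print(pick_active_cluster(priority, {"dc-a": False, "dc-b": True}))
print(pick_active_cluster(priority, {"dc-a": False}))
```

Treating a missing health report as "down" is a deliberate choice here: failing over on lost monitoring is usually safer than routing traffic to a cluster of unknown state, but it does make the monitoring path itself a dependency.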
+#### Messaging / streaming
+
+| Platform | Replication | Topology | DR support | Max distance |
+|-----------|-----------|-----------|------------|-------------|
+| **Apache Kafka** | MirrorMaker 2, Confluent Cluster Linking, KRaft quorum | Active-Passive (MM2), Active-Active (Cluster Linking) | MM2: async, Cluster Linking: async | unlimited |
+| **RabbitMQ** | Classic Queue Mirroring, Quorum Queues | Active-Passive (Warm Standby) | Federation / Shovel (async) | unlimited |
+| **Red Hat AMQ** | (Artemis) Cluster + HA | Active-Passive (shared store / replication) | Live-backup pair | < 100 km (sync) |
+| **NATS** | NATS JetStream (cluster + cross-account) | Active-Active (Leaf nodes, cross-account) | Super-cluster, failover | unlimited |
+| **Apache Pulsar** | BookKeeper (bookie rack-aware), geo-replication | Active-Active (geo-replication) | Built-in (cluster-level) | unlimited (async) |
+| **AWS SQS/SNS** | Managed, AWS region pairs | Active-Active (multi-region) | Built-in (AWS managed) | unlimited |
+| **Azure Service Bus** | Managed, paired region | Active-Passive (paired region) | Built-in (geo-recovery) | unlimited |
+| **Oracle Service Bus (OSB)** | Oracle WebLogic Cluster + JDBC store + AQ | Active-Passive (WebLogic Cluster + Data Guard) | OSB/WLS cluster + Oracle RAC/Data Guard sync | < 100 km (Data Guard sync), unlimited (async) |
+
+**Messaging DR recommendations:**
+- **Kafka**: use Cluster Linking for Active-Active, or MirrorMaker 2 for Active-Passive; replicate only critical topics
+- **RabbitMQ**: Quorum Queues + Federation upstream for DR; avoid Classic Queue Mirroring (deprecated)
+- **Pulsar**: native geo-replication, bookie rack-aware for stretched cluster; easiest DR among messaging platforms
+- **OSB**: WebLogic cluster + Oracle RAC/Data Guard; DR depends on DB layer, not on OSB itself
+
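"Replicate only critical topics" usually means selecting topics by name pattern. The sketch below shows the idea in plain Python with regular expressions. It does not call any Kafka or MirrorMaker API, and the topic names and the `select_topics` helper are made up for illustration.

```python
import re
from typing import Iterable, List


def select_topics(all_topics: Iterable[str], patterns: Iterable[str]) -> List[str]:
    """Keep the topics whose full name matches at least one regex pattern."""
    compiled = [re.compile(p) for p in patterns]
    return [t for t in all_topics if any(c.fullmatch(t) for c in compiled)]


topics = ["payments.orders", "payments.refunds", "logs.debug", "metrics"]
critical = select_topics(topics, [r"payments\..*"])
print(critical)
```

Keeping the pattern list short and explicit makes it easier to estimate the replication bandwidth and the lag on the DR side.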
+### Per-layer limitations summary table
+
+| Layer | Limiting factor for secondary DC | Max distance for sync | Impact on topology selection |
+|--------|-----------------------------------|----------------------|--------------------------|
+| **Storage** | Sync mirror latency, DWDM cost | < 50 km (MetroCluster) | Stretched cluster only in metro |
+| **Databases** | Commit wait for sync replication | < 100 km (5 ms RTT) | Active-Active only with multi-master DB |
+| **Hypervisor** | Stretched cluster quorum + fencing | < 50 km (vSAN, 5 ms) | MetroCluster / stretched cluster |
+| **Kubernetes** | Velero restore time, Rook mirror latency | unlimited (async) | Active-Passive, cold standby |
+| **Messaging** | Replication lag, offset management | unlimited (async) | Active-Active (Kafka, Pulsar, NATS) or Active-Passive |
+| **Network** | Dark fiber/DWDM cost, latency | < 100 km (metro fiber) | Limits sync replication options |
+| **Application** | Stateful/stateless, connection draining | depends on architecture | Stateless app → any topology |
+
 ## Disk monitoring — S.M.A.R.T.

 Self-Monitoring, Analysis and Reporting Technology — predictive monitoring of HDD/SSD.
--- a/DATACENTERS.md
+++ b/DATACENTERS.md
@@ -658,6 +658,281 @@ flowchart TD
    CLIM -->|"Chladná (SE, NO)"| FC3["Free cooling 7000+ h/rok<br/>Air-side economizer<br/>PUE < 1.2"]
 ```

+## Topologie sekundárního datového centra
+
+Při plánování druhého DC je klíčová volba topologie podle vzdálenosti, RPO/RTO a rozpočtu.
+
+### Klasifikace vzdáleností
+
+| Kategorie | Vzdálenost | Latence (round-trip) | Use case |
+|-----------|-----------|---------------------|----------|
+| **Metro (Campus)** | 1–20 km | < 1 ms | Synchronní replikace, stretched cluster |
+| **Metro** | 20–100 km | 1–5 ms | Metro cluster, většinou sync replikace |
+| **Regional** | 100–500 km | 5–20 ms | Asynchronní replikace, warm standby |
+| **Continent** | 500–3000 km | 20–100 ms | Asynchronní replikace, cold standby |
+| **Global** | 3000+ km | > 100 ms | Pouze async, žádné real-time závislosti |
+
+### Topologie podle provozního režimu
+
+#### Active-Active (Hot-Hot)
+
+```
+DC-A (Primary)                 DC-B (Active)
+┌────────────────────┐        ┌────────────────────┐
+│  App Active        │        │  App Active        │
+│  DB Active         │◄─sync─►│  DB Active         │
+│  Users → LB → A    │        │  Users → LB → B    │
+└────────────────────┘        └────────────────────┘
+           │                         │
+           └──── Global Load Balancer ────┘
+```
+
+| Parametr | Hodnota |
+|----------|---------|
+| **RTO** | 0–vteřiny (automatický failover, traffic se přesměruje) |
+| **RPO** | 0 (sync replikace, commit je potvrzen až po zápisu do obou DC) |
+| **Max distance** | < 100 km (latence < 5 ms RTT pro sync DB replikaci) |
+| **Provozní náklady** | 2× (obě DC plně aktivní, obě plně vybavené) |
+| **Výhody** | Nulový výpadek, okamžité přepnutí, plné využití obou DC |
+| **Nevýhody** | Nutná synchronní replikace → limit vzdálenosti, komplexní networking, split-brain risk |
+
+**Split-brain řešení**: STONITH (Shoot The Other Node In The Head), watchdog, quorum (3. node v 3. lokaci / cloud), fencing, SCSI-3 persistent reservation.
+
+**Use case**: Finanční služby, telco, platební brány — kde i minuta výpadku = miliony.
+
+#### Active-Passive (Hot-Warm, MetroCluster)
+
+```
+DC-A (Primary)                 DC-B (Standby)
+┌────────────────────┐        ┌────────────────────┐
+│  App Active        │        │  App Standby       │
+│  DB Primary        │──sync──►│  DB Standby        │
+│  Users → LB → A    │        │  ~~~ (čeká) ~~~    │
+│  DNS: A-record     │        │  DNS: health check │
+└────────────────────┘        └────────────────────┘
+```
+
+| Parametr | Hodnota |
+|----------|---------|
+| **RTO** | desítky vteřin–minuty (DNS failover + startup App) |
+| **RPO** | 0 (sync) nebo sekundy (async) |
+| **Max distance** | sync < 100 km, async neomezeně |
+| **Provozní náklady** | 1,5–1,8× (druhé DC má zmenšený nebo idle compute) |
+| **MetroCluster** | Specifická implementace: FC SAN přes DWDM, sync mirror, automatický failover |
+
+**MetroCluster** (NetApp, Dell EMC, HPE):
+- Storage-based cluster se synchronním mirroringem mezi DC
+- Automatic failover při selhání celého DC
+- Vyžaduje dedikované DWDM nebo dark fiber propojení
+- Typická vzdálenost: do 50 km (pro latenci < 1 ms RTT)
+- Use case: enterprise storage, primary+secondary DC v metropolitní oblasti
+
+#### Hot-Cold (Warm Standby → Cold)
+
+```
+DC-A (Primary)                 DC-B (Cold Standby)
+┌────────────────────┐        ┌────────────────────┐
+│  App Active        │        │  ~~~ powered off ~~~│
+│  DB Active         │──async─►│  Backup storage    │
+│  Users → A         │        │  ~~~ no compute ~~~│
+└────────────────────┘        └────────────────────┘
+```
+
+| Parametr | Hodnota |
+|----------|---------|
+| **RTO** | hodiny–dny (nákup/najmutí HW, obnova z backupu) |
+| **RPO** | hodiny (poslední backup) |
+| **Max distance** | neomezena |
+| **Provozní náklady** | 1,1–1,3× (jen storage a facility, compute až při failoveru) |
+| **Typ use case** | Low-cost DR, compliance, poslední záchrana |
+
+#### Pilot Light
+
+```
+DC-A (Primary)                 DC-B (Pilot Light)
+┌────────────────────┐        ┌────────────────────┐
+│  App Active        │        │  ~~~ off ~~~       │
+│  DB Active         │──async─►│  DB replica (mini) │
+│  Všechny služby    │        │  Core services jen │
+│                    │        │  (DNS, LDAP, mon)  │
+└────────────────────┘        └────────────────────┘
+                              Při DR: spin-up compute
+                              z IaC, zbytek z backupu
+```
+
+- DC-B běží s minimem compute (jen core služby a DB replica)
+- Aplikační vrstva se spin-up z IaC (Terraform, Ansible) až při DR
+- Kompromis mezi náklady a RTO
+
+### Srovnávací tabulka
+
+| Topologie | RTO | RPO | Náklady (× primár) | Max distance | Failover |
+|-----------|-----|-----|-------------------|-------------|----------|
+| **Active-Active** | 0–s | 0 | 2,0× | < 100 km | Auto (traffic) |
+| **MetroCluster** | s–min | 0 | 1,8–2,0× | < 50 km | Auto (storage) |
+| **Active-Passive (sync)** | min | 0 | 1,5–1,8× | < 100 km | Polo-auto |
+| **Active-Passive (async)** | min–h | s–min | 1,3–1,5× | neomezena | Polo-auto |
+| **Pilot Light** | h | min–h | 1,2–1,4× | neomezena | Manuální |
+| **Warm Standby** | min–h | s–min | 1,5–1,8× | neomezena | Polo-auto |
+| **Cold Standby** | dny | h | 1,1–1,3× | neomezena | Manuální |
+
+### Stretched Cluster
+
+```
+┌──── Site A (50 km) ────┐    ┌──── Site B ──────────┐
+│  ┌──────────────────┐   │    │  ┌──────────────────┐ │
+│  │  ESXi / Hyper-V  │   │    │  │  ESXi / Hyper-V  │ │
+│  │  VM               │   │    │  │  VM (komplement) │ │
+│  └────────┬─────────┘   │    │  └────────┬─────────┘ │
+│           │             │    │           │            │
+│  ┌────────▼─────────┐  │    │  ┌────────▼─────────┐  │
+│  │  Storage (SAN)   │──┼────┼──│  Storage (SAN)   │  │
+│  │  MetroCluster    │  │    │  │  MetroCluster    │  │
+│  └──────────────────┘  │    │  └──────────────────┘  │
+└────────────────────────┘    └────────────────────────┘
+                │
+          ┌─────▼──────┐
+          │  vCenter / │
+          │  Cluster   │
+          │  (single)  │
+          └────────────┘
+```
+
+- Jeden cluster roztažený přes dvě lokality (single management domain)
+- VM mohou live-migrovat mezi site (vMotion nad vzdálenost)
+- Storage synchronně mirrorovaná (MetroCluster, VPLEX, vSAN Stretched Cluster)
+- **Požadavky**: dark fiber / DWDM, nízká latence (< 5 ms), vysoká spolehlivost linky
+- **Riziko**: split-brain, brain drain (split-site cluster), závislost na síti
+- **Use case**: enterprise s vlastní dark fiber mezi dvěma DC v metropolitní oblasti
+
+### Rozhodovací strom
+
+```mermaid
+flowchart TD
+    Start(["Sekundární DC"]) --> RPO{"Požadované RPO?"}
+    RPO -->|"0 (žádná ztráta dat)"| SYNC{"Sync replikace možná?"}
+    SYNC -->|"Ano, < 100 km"| ACT{"Chceš nulový výpadek?"}
+    ACT -->|"Ano"| AA["Active-Active<br/>RTO=0, RPO=0, 2× náklady"]
+    ACT -->|"Ne"| AP["Active-Passive<br/>RTO=min, RPO=0, 1,5×"]
+    SYNC -->|"Ne, > 100 km"| ASYNC["Active-Passive (async)<br/>RTO=min, RPO=s, 1,3×"]
+
+    RPO -->|"minuty–hodiny"| WARM{"Chceš rychlý failover?"}
+    WARM -->|"Ano"| PILOT["Pilot Light<br/>RTO=h, RPO=min, 1,2×"]
+    WARM -->|"Ne"| COLD["Cold Standby<br/>RTO=dny, RPO=h, 1,1×"]
+
+    Start --> DIST{"Vzdálenost mezi DC"}
+    DIST -->|"< 50 km, vlastní fiber"| MC["MetroCluster / Stretched Cluster<br/>Single management, sync storage"]
+    DIST -->|"50–300 km"| REG["Regionální DR<br/>Active-Passive, async replikace"]
+    DIST -->|"> 300 km"| GLOBAL["Globální DR<br/>Cold standby, backup & restore"]
+```
+
+### Fyzická infrastruktura pro propojení DC
+
+| Technologie | Bandwidth | Max distance | Latence | Use case |
+|------------|-----------|-------------|---------|----------|
+| **Dark fiber** | 100 GbE–800 GbE | 10–80 km (single-mode) | < 0,1 ms | MetroCluster, stretched cluster |
+| **DWDM** | 400 GbE–1,6 TbE (per lambda) | 80–120 km (bez zesilovače) | < 0,5 ms | Metro, metro cluster |
+| **CWDM** | 10–25 GbE (per channel) | 10–40 km | < 0,3 ms | Campus, menší metro |
+| **MPLS L2VPN** | 10–100 GbE | neomezena | 1–10 ms | Regional DR, async replikace |
+| **Internet IPsec** | 1–10 GbE | neomezena | 5–50 ms | Cold standby, backup |
+
+### Vliv jednotlivých technologií na výběr DC topologie
+
+Volba topologie sekundárního DC není čistě infrastrukturní rozhodnutí — každá vrstva (DB, hypervisor, orchestrace, messaging) přináší vlastní omezení.
+
+#### Databáze
+
+| DB technologie | Sync replikace | Max distance | Auto-failover | Split-brain řešení | Poznámka |
+|---------------|---------------|-------------|---------------|-------------------|----------|
+| **PostgreSQL** | Synchronous commit (synchronous_standby_names) | < 100 km (latence < 10 ms) | Patroni / repmgr + etcd | Quorum (etcd, 3+ node) | Streaming replication, nutné wal_keep_size (před PostgreSQL 13 wal_keep_segments) |
+| **MySQL** | Group Replication (multi-primary, single-primary) | < 100 km | MySQL InnoDB Cluster + MySQL Router | Paxos (Group Replication, 3+ node) | Semi-sync jako kompromis |
+| **Oracle** | Data Guard (SYNC/FASTSYNC/ASYNC), RAC extended | sync < 100 km, async neomezena | Data Guard Broker / FSFO (Fast Start Failover) | Observer (3. node) | Far Sync pro vzdálená DC |
+| **MSSQL** | AlwaysOn Availability Groups (SYNCHRONOUS_COMMIT) | < 100 km | AlwaysOn + Cluster quorum | File share majority / cloud witness | Multi-site cluster podpora |
+| **MongoDB** | Majority write concern + journaling | < 100 km | Replica set auto-election | Arbitration node (voting member) | Priority-based failover |
+| **Cassandra** | N/A (multi-master, eventual consistency) | neomezena | Ano (peer-to-peer) | Žádné (multi-master, gossip protokol) | Snitch-aware topologie, NetworkTopologyStrategy |
+| **Redis** | Redis Sentinel / Redis Cluster (async) | neomezena (async) | Sentinel / Cluster failover | Quorum (Sentinel, majority) | PSYNC replikace, replication lag |
+
+Klíčové omezení pro **sync replikaci**: latence < 5 ms RTT (commit musí počkat na potvrzení z obou DC). Při 100 km je RTT ~1 ms – v pořádku. Při 1000 km (~10 ms RTT) sync replikace snižuje výkon transakcí o 80+ %.
+
+Pro **Active-Active** jsou vhodné:
+- **Cassandra / ScyllaDB** — nativní multi-DC, eventual consistency, žádný split-brain
+- **MySQL Group Replication (multi-primary)** — 3+ DC pro kvorum
+- **CockroachDB / TiDB** — nativní multi-region, ACID napříč DC
+- **Redis Enterprise** — Active-Active (CRDT-based)
+
+Pro **Active-Passive** jsou vhodné:
+- **PostgreSQL + Patroni** — auto-failover, etcd kvorum
+- **Oracle Data Guard** — FSFO, far sync pro vzdálené DC
+- **MSSQL AlwaysOn** — cloud witness
+- **MongoDB Replica Set** — arbitration node v 3. lokaci
+
+#### Hypervisory
+
+| Hypervisor | Cluster technologie | Stretched cluster | Max distance | Split-brain |
+|-----------|-------------------|-------------------|-------------|-------------|
+| **VMware vSphere** | vSAN Stretched Cluster, Metro vCenter, Site Recovery Manager | Ano (vSAN Stretched Cluster, Metro Cluster) | < 50 km (vSAN Stretched Cluster), < 10 ms RTT | Fencing (STONITH), witness host |
+| **Hyper-V** | Storage Replica + Failover Cluster | Ano (Cluster Sets) | < 50 km (sync), neomezena (async) | File share witness / cloud witness |
+| **Proxmox VE** | Proxmox HA + Ceph | Omezeně (Ceph stretch cluster) | < 50 km (Ceph sync) | Ceph monitor quorum (3+ DC) |
+| **XCP-ng / XenServer** | Xen Orchestra HA + SR (Storage Repository) replication | Omezeně | závisí na storage replikaci | — |
+| **Nutanix AHV** | Metro Availability (sync), Async DR | Ano (Metro) | < 100 km (sync), neomezena (async) | Witness VM (cloud / 3. site) |
+| **KVM / oVirt** | oVirt HA + GlusterFS / NFS | Omezeně | závisí na storage replikaci | — |
+
+**vSAN Stretched Cluster** specifické požadavky:
+- Dedikovaná síť pro vSAN (25 GbE min., < 5 ms RTT)
+- Witness host v 3. lokaci (nebo cloud witness)
+- Všechny VM protokoly (FTT=1, mirroring striped)
+- Storage policy: `site-A + site-B + witness`
+
+#### Kubernetes a kontejnerové platformy
+
+| Platforma | Multi-cluster DR | Replikace | Max distance | Failover |
+|-----------|-----------------|-----------|-------------|----------|
+| **Vanilla K8s** | KubeFed, Cluster API, Velero + Restic | Velero (backup/restore), Rook (Ceph) | neomezena | Manuální (Velero restore) |
+| **OpenShift** | ACM (Advanced Cluster Management), Velero | OADP (OpenShift API for Data Protection) | neomezena | ACM failover (subscription) |
+| **Rancher** | Rancher Multi-Cluster App, Velero | Longhorn (sync/async DR), Velero | neomezena | Polo-auto |
+| **Google GKE** | Multi-cluster Services, Backup for GKE | Config Sync, Backup for GKE | neomezena | Manuální |
+| **Azure AKS** | Azure Arc + Velero + Azure Traffic Manager | AKS backup (Velero), Azure Site Recovery | neomezena | Manuální (Velero) |
+| **AWS EKS** | EKS multi-cluster, Velero + S3 cross-region | Velero (S3), Rook (EBS snapshots) | neomezena | Manuální |
+
+**Klíčové principy K8s DR:**
+- **Aplikace musí být stateless** (nebo state externalizovaný do DB/storage)
+- **Velero** — backup/restore celého clusteru (PV, resources, helm releases)
+- **Rook/Ceph** — cross-region mirroring RBD volumes
+- **KubeFed / ACM** — subscription-based deploy do více clusterů
+- **Ingress/Gateway API** — traffic routing mezi clustery
+- **External DNS** — DNS failover při výpadku clusteru
+
+#### Messaging / streaming
+
+| Platforma | Replikace | Topologie | DR podpora | Max distance |
+|-----------|-----------|-----------|------------|-------------|
+| **Apache Kafka** | MirrorMaker 2, Confluent Cluster Linking, KRaft quorum | Active-Passive (MM2), Active-Active (Cluster Linking) | MM2: async, Cluster Linking: async | neomezena |
+| **RabbitMQ** | Classic Queue Mirroring, Quorum Queues | Active-Passive (Warm Standby) | Federation / Shovel (async) | neomezena |
+| **Red Hat AMQ** | (Artemis) Cluster + HA | Active-Passive (shared store / replication) | Live-backup pair | < 100 km (sync) |
+| **NATS** | NATS JetStream (cluster + cross-account) | Active-Active (Leaf nodes, cross-account) | Super-cluster, failover | neomezena |
+| **Apache Pulsar** | BookKeeper (bookie rack-aware), geo-replication | Active-Active (geo-replication) | Built-in (cluster-level) | neomezena (async) |
+| **AWS SQS/SNS** | Managed, AWS region pairs | Active-Active (multi-region) | Built-in (AWS managed) | neomezena |
+| **Azure Service Bus** | Managed, paired region | Active-Passive (paired region) | Built-in (geo-recovery) | neomezena |
+| **Oracle Service Bus (OSB)** | Oracle WebLogic Cluster + JDBC store + AQ | Active-Passive (WebLogic Cluster + Data Guard) | OSB/WLS cluster + Oracle RAC/Data Guard sync | < 100 km (Data Guard sync), neomezena (async) |
+
+**Doporučení pro DR messagingu:**
+- **Kafka**: použít Cluster Linking pro Active-Active, nebo MirrorMaker 2 pro Active-Passive; replikovat jen kritická témata
+- **RabbitMQ**: Quorum Queues + Federation upstream pro DR; vyhnout se Classic Queue Mirroring (deprecated)
+- **Pulsar**: nativní geo-replication, bookie rack-aware pro stretch cluster; nejjednodušší DR mezi messaging platformami
+- **OSB**: WebLogic cluster + Oracle RAC/Data Guard; DR závisí na DB vrstvě, ne na OSB samotném
+
+### Main constraints per layer (summary table)
+
+| Layer | Limiting factor for the secondary DC | Max distance for sync | Impact on topology choice |
+|--------|-----------------------------------|----------------------|--------------------------|
+| **Storage** | Sync mirror latency, DWDM cost | < 50 km (MetroCluster) | Stretched cluster only within a metro area |
+| **Database** | Commit wait for sync replication | < 100 km (5 ms RTT) | Active-Active only with a DB that supports multi-master |
+| **Hypervisor** | Stretched cluster quorum + fencing | < 50 km (vSAN, 5 ms) | MetroCluster / stretched cluster |
+| **Kubernetes** | Velero restore time, Rook mirror latency | unlimited (async) | Active-Passive, cold standby |
+| **Messaging** | Replication lag, offset management | unlimited (async) | Active-Active (Kafka, Pulsar, NATS) or Active-Passive |
+| **Network** | Dark fiber/DWDM cost, latency | < 100 km (metro fiber) | Limits the options for sync replication |
+| **Application** | Stateful/stateless, connection draining | depends on architecture | Stateless app → any topology |
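
The distance limits in the table follow largely from signal propagation time. A minimal sketch of that arithmetic, assuming light travels through fiber at roughly 200,000 km/s (about 5 µs per km one way); the function names and the `path_stretch` parameter are hypothetical, and switching, DWDM equipment, and protocol overhead are not modeled:

```python
# Sketch: propagation-only round-trip time over optical fiber.
# Assumption: ~200,000 km/s in fiber, i.e. 200 km per millisecond.
FIBER_KM_PER_MS = 200.0


def fiber_rtt_ms(distance_km: float, path_stretch: float = 1.0) -> float:
    """Round-trip propagation delay in milliseconds.

    path_stretch is a hypothetical multiplier for a fiber route that is
    longer than the nominal distance (1.0 means no detour).
    """
    one_way_ms = distance_km * path_stretch / FIBER_KM_PER_MS
    return 2 * one_way_ms


def fits_sync_budget(distance_km: float, budget_ms: float = 5.0) -> bool:
    """True if propagation alone stays within the given RTT budget."""
    return fiber_rtt_ms(distance_km) <= budget_ms


for km in (20, 50, 100, 1000):
    rtt = fiber_rtt_ms(km)
    print(f"{km:>5} km: {rtt:.2f} ms RTT, fits 5 ms budget: {fits_sync_budget(km)}")
```

Propagation uses only part of the 5 ms budget at 100 km; the rest is left for equipment latency, indirect fiber routes, and the multiple round trips a synchronous commit may need.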
+
 ## Disk monitoring: S.M.A.R.T.

 Self-Monitoring, Analysis and Reporting Technology: predictive monitoring of HDDs/SSDs.
@@ -785,4 +1060,4 @@ OpenStack přináší do DC softwarovou abstrakční vrstvu, která umožňuje m
 - Academic / HPC clusters (Ironic, Cyborg, Manila)
 - Government / regulated environments (on-prem, audit trail)

-*Last revision: 2026-06-03*
+*Last revision: 2026-06-12*
--- a/DC-MIGRATION.en.md
+++ b/DC-MIGRATION.en.md
@@ -0,0 +1,246 @@
+# 🏗️ Data Center Migration
+
+## Migration strategies
+
+| Strategy | RTO | RPO | Risk | Cost | Duration | Description |
+|-----------|-----|-----|--------|---------|-------------|-------|
+| **Cold / Big Bang** | hours–days | days | High | Low | days | Shut everything down, move, power up |
+| **Phased / Wave** | minutes (per wave) | minutes | Medium | Medium | weeks–months | Workloads moved in waves |
+| **Rolling** | 0 (live) | 0 | Low | High | months | Live migration per VM/service |
+| **Parallel Run** | 0 | 0 | Low | Very high | months | Both DCs operational, gradual cutover |
+| **Pilot Light** | hours | minutes | Medium | Low | weeks | Critical services in new DC, rest migrates |
+| **Lift & Shift** | hours | minutes | Medium | Low | weeks | VMs/servers moved without configuration changes |
+| **Re-platform** | hours | minutes | Low | Medium | months | Optimization during migration (OS upgrade, resize) |
+| **Re-architect** | 0 | 0 | Low | High | months–years | Application redesigned for new platform |
+
+---
+
+## Decision tree
+
+```mermaid
+flowchart TD
+    Start(["DC Migration"]) --> APP{"Application\nstateful?"}
+    APP -->|"Yes"| DOWNTIME{"Tolerates\ndowntime?"}
+    APP -->|"No"| ROLLING["Rolling / Parallel Run"]
+
+    DOWNTIME -->|"Yes, hours+"| COLD["Cold / Big Bang\nSimplest, cheapest\nRisk: all at once"]
+    DOWNTIME -->|"Yes, minutes"| PHASED["Phased / Wave\nBy application / business unit"]
+    DOWNTIME -->|"No (zero downtime)"| SYNC{"Sync replication\npossible?"}
+
+    SYNC -->|"Yes, < 100 km"| ROLLING
+    SYNC -->|"No"| PARALLEL["Parallel Run\nBoth DCs active, gradual cutover"]
+
+    ROLLING --> ROLL_HA{"VMware,\nHyper-V?"}
+    ROLL_HA -->|"Yes"| VMOTION["vMotion / Storage vMotion (VMware)\nLive Migration (Hyper-V), near-zero downtime"]
+    ROLL_HA -->|"No"| ROLL_REPL["Storage + DB replication\nGradual workload migration"]
+```
+
+---
+
+## Migration phases
+
+### 1. Discovery and assessment
+
+| Task | Tools | Output |
+|------|----------|--------|
+| HW and SW inventory | RVTools, NetBox, CMDB | Server, VM, and service list |
+| Dependency mapping | ServiceNow, AppDynamics, manual | Application dependency graph |
+| Traffic analysis | NetFlow, sFlow, vRNI | Bandwidth, latency, peak usage |
+| Performance baseline | Prometheus, Zabbix, vRealize | CPU/RAM/disk/network per workload |
+| License audit | Flexera, SAM | Licenses, support, compliance |
+
+**Output:** workload list with RTO/RPO, dependencies, and criticality.
+
+### 2. Planning
+
+- **Wave plan** — workload division into migration waves (10–50 VMs per wave)
+- **Dependency ordering** — DNS, NTP, LDAP, PKI first
+- **Cutover window** — time window for switching (typically weekend)
+- **Rollback plan** — conditions and procedure for reversal
+- **Test plan** — what and how to test post-migration
+- **Communication plan** — who, when, how is informed
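
Dependency ordering can be derived mechanically once the dependency graph from the discovery phase exists. The sketch below uses Python's standard `graphlib.TopologicalSorter`; the `deps` map and the `plan_waves` helper are hypothetical illustrations, not part of any tool listed above:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: service -> services it depends on.
deps = {
    "dns": set(),
    "ntp": set(),
    "ldap": {"dns", "ntp"},
    "pki": {"ldap"},
    "db": {"dns", "ldap"},
    "app": {"db", "pki"},
}


def plan_waves(dependencies):
    """Group services into waves; each wave depends only on earlier waves."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = plan_waves(deps)
for number, wave in enumerate(waves, start=1):
    print(f"Wave {number}: {', '.join(wave)}")
```

The sketch only orders by dependency. A real wave plan would additionally cap the wave size (10–50 VMs per wave, as above) and weigh criticality.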
+
+### 3. New DC preparation
+
+- **Infrastructure** — DNS, NTP, DHCP, LDAP/AD, PKI, monitoring (see [DATACENTERS.en.md](DATACENTERS.en.md) — deployment order)
+- **Network** — BGP peering, VXLAN/VLAN, firewall rules, load balancers
+- **Storage** — SAN zoning, NAS exports, Ceph cluster
+- **Virtualization** — vCenter, Hyper-V cluster, Proxmox
+
+### 4. Replication and synchronization
+
+| Layer | Method | Tools |
+|--------|--------|----------|
+| **Storage (block)** | SAN sync/async mirror, LUN replication | NetApp SnapMirror, Dell EMC RecoverPoint, Pure ActiveCluster |
+| **Storage (file)** | DFS-R, rsync, robocopy | Windows DFS, Rsync |
+| **Storage (object)** | Cross-region replication | MinIO replication, S3 CRR |
+| **Databases** | Log shipping, CDC, streaming replication | PostgreSQL Patroni, Oracle Data Guard, MSSQL AlwaysOn, MySQL Group Replication |
+| **VM** | Storage vMotion, replication | VMware vSphere Replication, Hyper-V Replica, Zerto |
+| **Kubernetes** | Velero + Restic, Rook Ceph mirror | Velero, Rook |
+
+### 5. Workload migration
+
+#### Wave migration (recommended for medium/large DCs)
+
+```mermaid
+gantt
+    title Wave migration
+    dateFormat  YYYY-MM-DD
+    section Wave 1 - Core
+    DNS, NTP, LDAP    :done, w1a, 2026-07-01, 3d
+    Monitoring + logging :done, w1b, after w1a, 2d
+    section Wave 2 - Network
+    Load balancers     :active, w2a, 2026-07-06, 2d
+    Firewalls          :active, w2b, 2026-07-08, 2d
+    section Wave 3 - Storage
+    NAS migration      :w3a, 2026-07-10, 5d
+    SAN replication    :w3b, 2026-07-10, 3d
+    section Wave 4 - Dev/Test
+    Dev VMs            :w4a, 2026-07-15, 5d
+    section Wave 5 - Prod tier 3
+    Internal apps      :w5a, 2026-07-22, 5d
+    section Wave 6 - Prod tier 2
+    Business apps      :w6a, 2026-07-29, 5d
+    section Wave 7 - Prod tier 1
+    Critical apps      :w7a, 2026-08-05, 5d
+```
+
+#### Typical single wave procedure:
+
+1. **Day -7**: Start data replication (initial seed)
+2. **Day -1**: Incremental sync, final test
+3. **Day 0 (cutover)**:
+   - Stop application in source DC
+   - Final sync (last delta)
+   - Start application in target DC
+   - DNS/Traffic switch
+   - Smoke test
+4. **Day +1**: Monitoring (performance, errors, lag)
+5. **Day +7**: Rollback window closes (migration confirmed successful)
+
+### 6. Network strategies
+
+#### IP re-addressing
+
+| Approach | Description | Pros | Cons |
+|---------|-------|--------|----------|
+| **Keep IP** | Same IPs, BGP anycast or stretch VLAN | No application config changes | Stretched VLAN/L2 limitations |
+| **Change IP** | New IP range, DNS/BGP routing change | Clean architecture | Config changes, DNS TTL |
+| **NAT translation** | NAT between old and new IP space | No application changes | Latency, troubleshooting complexity |
+
+**Keep IP** is only possible with:
+- L2 stretch between DCs (VXLAN, OTV) — distance limited
+- BGP anycast for VIPs (load balancers)
+- Applications that tolerate ARP cache changes
+
+#### DNS cutover
+
+```
+1. Lower TTL to 60–300 s (one week ahead)
+2. At cutover, change A/AAAA records to new IPs
+3. Wait for propagation (per TTL)
+4. Monitor traffic
+```
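
The timing behind these steps can be sketched as follows, assuming resolvers honor TTLs (some do not). The function names and the example timestamps are hypothetical:

```python
from datetime import datetime, timedelta


def ttl_change_deadline(cutover: datetime, old_ttl_s: int) -> datetime:
    """Latest moment to lower the TTL so that every cache still holding
    the record with the old TTL has expired it before cutover."""
    return cutover - timedelta(seconds=old_ttl_s)


def propagation_complete(cutover: datetime, new_ttl_s: int) -> datetime:
    """Earliest moment at which TTL-honoring resolvers no longer serve
    the old address after the record change."""
    return cutover + timedelta(seconds=new_ttl_s)


cutover = datetime(2026, 7, 4, 22, 0)  # hypothetical cutover time
deadline = ttl_change_deadline(cutover, old_ttl_s=86400)  # old TTL: 24 h
settled = propagation_complete(cutover, new_ttl_s=300)    # new TTL: 5 min

print("Lower TTL no later than:", deadline)
print("Old records expired by: ", settled)
```

Lowering the TTL a week ahead, as in step 1, leaves a wide margin over this minimum and gives time to confirm that the lower value is actually being served.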
+
+#### Traffic steering
+
+| Technique | Use case |
+|----------|----------|
+| **BGP** | Change AS path / local pref for traffic steering |
+| **DNS** | Lower TTL, change A records |
+| **Load balancer** | Change pool members, health check |
+| **GSLB** | Global Server Load Balancing (F5 GTM, NSX ALB) |
+| **Cloud DNS** | AWS Route53, Azure Traffic Manager, Google Cloud DNS |
+
+### 7. Database migration
+
+See individual DB files for details. Summary table:
+
+| DB | Method | RPO | RTO | Note |
+|----|--------|-----|-----|----------|
+| **PostgreSQL** | Streaming replication + Patroni switchover | 0 (sync) / a few MB of WAL (async) | minutes | Patroni auto-failover |
+| **MySQL** | Group Replication / async replication | 0 (sync) / seconds | minutes | InnoDB Cluster |
+| **Oracle** | Data Guard switchover | 0 (sync) | minutes | Far Sync for remote DCs |
+| **MSSQL** | Always On AG failover | 0 (sync) | minutes | Cloud witness |
+| **MongoDB** | Replica set election | seconds | < 1 minute | Priority-based failover |
+| **Cassandra** | Multi-DC replication | eventual | 0 | Native multi-master |
+
+### 8. Testing
+
+| Phase | What to test | Method |
+|------|-------------|--------|
+| **Pre-migration** | Application in new DC (isolated) | Dry run on replicated data |
+| **Cutover** | Functionality, availability, latency | Smoke test, synthetic transactions |
+| **Post-migration** | Performance, integration, monitoring | A/B comparison with baseline, canary traffic |
+| **Rollback** | Return to old DC | Tested rollback plan |
+
+### 9. Rollback plan
+
+Each wave must have a defined rollback:
+
+| Condition | Action |
+|----------|------|
+| Application fails to start in new DC | DNS switch back, stop replication |
+| Performance more than 20 % worse than baseline | Rollback, root cause analysis |
+| Integration failure (API timeout, DB connection) | Rollback, dependency check |
+| Security incident | Rollback, forensic analysis |
+
+Rollback must be tested **before** the real cutover.
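
The performance condition in the table can be turned into an automated check. A minimal sketch, assuming a "lower is better" metric such as latency in milliseconds; the function names and the default threshold parameter are illustrative:

```python
def degradation_pct(baseline: float, measured: float) -> float:
    """Relative slowdown in percent for a metric where lower is better."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (measured - baseline) / baseline * 100


def should_roll_back(baseline: float, measured: float,
                     threshold_pct: float = 20.0) -> bool:
    """True if the measured value is more than threshold_pct worse than baseline."""
    return degradation_pct(baseline, measured) > threshold_pct


# Hypothetical p95 latencies in milliseconds: baseline from the old DC,
# measured in the new DC after cutover.
print(should_roll_back(baseline=100.0, measured=125.0))
print(should_roll_back(baseline=100.0, measured=110.0))
```

For throughput-style metrics, where higher is better, the sign of the comparison would need to be inverted.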
+
+---
+
+## Special cases
+
+### Mainframe migration
+
+- **IBM z/OS** — GDPS (Geographically Dispersed Parallel Sysplex)
+- HyperSwap for storage mirroring
+- Cross-system coupling facility (XCF)
+- Often the last migrated component
+
+### COTS applications (Oracle EBS, SAP)
+
+- Require vendor-specific migration procedures
+- Oracle EBS: AutoConfig, cloning (Rapid Clone)
+- SAP: System Copy (Homogeneous / Heterogeneous), SWPM, SUM
+- Possible re-licensing when the hardware changes
+
+### Cloud migration (On-prem → Cloud)
+
+See [CLOUD.en.md](CLOUD.en.md) — migration strategies (6 Rs):
+
+| Strategy | Description |
+|-----------|-------|
+| **Re-host (Lift & Shift)** | VM → Cloud VM (AWS MGN, Azure Migrate) |
+| **Re-platform** | OS upgrade, managed DB (RDS, Cloud SQL) |
+| **Re-architect** | Application rewritten as cloud-native |
+| **Retire** | Decommission unnecessary applications |
+| **Retain** | Application stays on-prem (review later) |
+| **Repurchase** | SaaS replacement |
+
+---
+
+## Recommended approach per DC size
+
+| DC Size | VM Count | Recommended strategy | Duration | Team |
+|-------------|----------|---------------------|-------------|-----|
+| **Small** | < 50 | Big Bang (weekend) | 2–4 days | 3–5 people |
+| **Medium** | 50–500 | Phased (5–10 waves) | 2–8 weeks | 5–10 people |
+| **Large** | 500–5000 | Phased + Rolling | 3–12 months | 10–30 people |
+| **Enterprise** | 5000+ | Parallel Run / Rolling | 12–36 months | 30+ people |
+
+---
+
+## Related
+
+- [DATACENTERS.en.md](DATACENTERS.en.md) — DC topologies, secondary DC, deployment order
+- [CLOUD.en.md](CLOUD.en.md) — cloud migration strategies (6 Rs)
+- [DR.en.md](DR.en.md) — disaster recovery, RTO/RPO
+- [NETWORKING.en.md](NETWORKING.en.md) — BGP, DNS, VXLAN, traffic steering
+- [STORAGE.en.md](STORAGE.en.md) — storage replication
+
+## Sources
+
+Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
+
+*Last revision: 2026-06-12*
--- a/DC-MIGRATION.md
+++ b/DC-MIGRATION.md
@@ -0,0 +1,246 @@
+# 🏗️ Migrace datových center
+
+## Strategie migrace
+
+| Strategie | RTO | RPO | Riziko | Náklady | Doba trvání | Popis |
+|-----------|-----|-----|--------|---------|-------------|-------|
+| **Cold / Big Bang** | hodiny–dny | dny | Vysoké | Nízké | dny | Vše najednou vypnout, přesunout, zapnout |
+| **Phased / Wave** | minuty (per wave) | minuty | Střední | Střední | týdny–měsíce | Workloady po vlnách |
+| **Rolling** | 0 (live) | 0 | Nízké | Vysoké | měsíce | Live migration per VM/služba |
+| **Parallel Run** | 0 | 0 | Nízké | Velmi vysoké | měsíce | Oba DC v provozu, postupný přechod |
+| **Pilot Light** | hodiny | minuty | Střední | Nízké | týdny | Kritické služby v novém DC, ostatní se přesouvají |
+| **Lift & Shift** | hodiny | minuty | Střední | Nízké | týdny | VM/servery přesunuty bez změny konfigurace |
+| **Re-platform** | hodiny | minuty | Nízké | Střední | měsíce | Optimalizace během migrace (OS upgrade, resize) |
+| **Re-architect** | 0 | 0 | Nízké | Vysoké | měsíce–roky | Aplikace přepracována pro novou platformu |
+
+---
+
+## Rozhodovací strom
+
+```mermaid
+flowchart TD
+    Start(["Migrace DC"]) --> APP{"Aplikace\nstateful?"}
+    APP -->|"Ano"| DOWNTIME{"Toleruje\nvýpadek?"}
+    APP -->|"Ne"| ROLLING["Rolling / Parallel Run"]
+
+    DOWNTIME -->|"Ano, hodiny+"| COLD["Cold / Big Bang\nNejjednodušší, nejlevnější\nRiziko: vše najednou"]
+    DOWNTIME -->|"Ano, minuty"| PHASED["Phased / Wave\nPo aplikacích / byznys jednotkách"]
+    DOWNTIME -->|"Ne (zero downtime)"| SYNC{"Sync replikace\nmožná?"}
+
+    SYNC -->|"Ano, < 100 km"| ROLLING
+    SYNC -->|"Ne"| PARALLEL["Parallel Run\nOba DC aktivní, postupný cutover"]
+
+    ROLLING --> ROLL_HA{"VMware,\nHyper-V?"}
+    ROLL_HA -->|"Ano"| VMOTION["vMotion / Storage vMotion\nLive migration, 0 downtime"]
+    ROLL_HA -->|"Ne"| ROLL_REPL["Storage + DB replikace\nPostupný přesun workloadů"]
+```
+
+---
+
+## Fáze migrace
+
+### 1. Discovery a assessment
+
+| Úkol | Nástroje | Výstup |
+|------|----------|--------|
+| Inventarizace HW a SW | RVTools, NetBox, CMDB | Seznam všech serverů, VM, služeb |
+| Dependency mapping | ServiceNow, AppDynamics, manual | Aplikační dependency graf |
+| Traffic analysis | NetFlow, sFlow, vRNI | Bandwidth, latency, peak usage |
+| Výkonnostní baseline | Prometheus, Zabbix, vRealize | CPU/RAM/disk/network per workload |
+| Licenční audit | Flexera, SAM | Licence, support, compliance |
+
+**Výstupem je:** seznam workloadů s RTO/RPO, závislostmi a kritičností. Bez toho nelze naplánovat migraci.
+
+### 2. Plánování
+
+- **Wave plán** — rozdělení workloadů do migračních vln (10–50 VM na vlnu)
+- **Závislostní řazení** — DNS, NTP, LDAP, PKI musí být první
+- **Cutover okno** — časové okno pro přepnutí (typicky víkend)
+- **Rollback plán** — podmínky a postup pro vrácení
+- **Testovací plán** — co a jak testovat po migraci
+- **Komunikační plán** — kdo, kdy, jak je informován
+
+### 3. Příprava nového DC
+
+- **Infrastruktura** — DNS, NTP, DHCP, LDAP/AD, PKI, monitoring (viz [DATACENTERS.md](DATACENTERS.md) — deployment order)
+- **Network** — BGP peering, VXLAN/VLAN, firewall pravidla, load balancery
+- **Storage** — SAN zoning, NAS exports, Ceph cluster
+- **Virtualizace** — vCenter, Hyper-V cluster, Proxmox
+
+### 4. Replikace a synchronizace
+
+| Vrstva | Metoda | Nástroje |
+|--------|--------|----------|
+| **Storage (block)** | SAN sync/async mirror, LUN replication | NetApp SnapMirror, Dell EMC RecoverPoint, Pure ActiveCluster |
+| **Storage (file)** | DFS-R, rsync, robocopy | Windows DFS, Rsync |
+| **Storage (object)** | Cross-region replication | MinIO replication, S3 CRR |
+| **Databáze** | Log shipping, CDC, streaming replication | PostgreSQL Patroni, Oracle Data Guard, MSSQL AlwaysOn, MySQL Group Replication |
+| **VM** | Storage vMotion, replication | VMware vSphere Replication, Hyper-V Replica, Zerto |
+| **Kubernetes** | Velero + Restic, Rook Ceph mirror | Velero, Rook |
+
+### 5. Migrace workloadů
+
+#### Wave migrace (doporučeno pro střední/větší DC)
+
+```mermaid
+gantt
+    title Wave migrace
+    dateFormat  YYYY-MM-DD
+    section Wave 1 - Core
+    DNS, NTP, LDAP    :done, w1a, 2026-07-01, 3d
+    Monitoring + logging :done, w1b, after w1a, 2d
+    section Wave 2 - Network
+    Load balancers     :active, w2a, 2026-07-06, 2d
+    Firewalls          :active, w2b, 2026-07-08, 2d
+    section Wave 3 - Storage
+    NAS migrace        :w3a, 2026-07-10, 5d
+    SAN replication    :w3b, 2026-07-10, 3d
+    section Wave 4 - Dev/Test
+    Dev VMs            :w4a, 2026-07-15, 5d
+    section Wave 5 - Prod tier 3
+    Internal apps      :w5a, 2026-07-22, 5d
+    section Wave 6 - Prod tier 2
+    Business apps      :w6a, 2026-07-29, 5d
+    section Wave 7 - Prod tier 1
+    Critical apps      :w7a, 2026-08-05, 5d
+```
+
+#### Typický postup jedné vlny:
+
+1. **Den -7**: Sync replikace dat (initial seed)
+2. **Den -1**: Incremental sync, final test
+3. **Den 0 (cutover)**:
+   - Zastavení aplikace ve zdrojovém DC
+   - Final sync (poslední delta)
+   - Start aplikace v cílovém DC
+   - DNS/Traffic switch
+   - Smoke test
+4. **Den +1**: Monitorování (výkon, chyby, lag)
+5. **Den +7**: Rollback window end (potvrzení úspěchu)
+
+### 6. Síťové strategie
+
+#### IP re-addressing
+
+| Přístup | Popis | Výhody | Nevýhody |
+|---------|-------|--------|----------|
+| **Keep IP** | Stejné IP, BGP anycast nebo stretch VLAN | Není třeba měnit konfiguraci aplikací | Stretched VLAN/L2 omezení |
+| **Change IP** | Nový IP rozsah, DNS/BGP routing změna | Čistá architektura | Změny konfigurací, DNS TTL |
+| **NAT překlad** | NAT mezi starým a novým IP spacem | Bez změny aplikací | Latence, komplexita troubleshooting |
+
+**Keep IP** je možný jen:
+- L2 stretch mezi DC (VXLAN, OTV) — omezeno vzdáleností
+- BGP anycast pro VIP (load balancery)
+- Aplikace tolerující ARP cache změny
+
+#### DNS cutover
+
+```
+1. Snížit TTL na 60–300 s (týden předem)
+2. Při cutoveru změnit A/AAAA záznamy na nové IP
+3. Počkat na propagaci (dle TTL)
+4. Monitorovat traffic
+```
+
+#### Traffic steering
+
+| Technika | Use case |
+|----------|----------|
+| **BGP** | Změna AS path / local pref pro přesměrování trafficu |
+| **DNS** | Snížení TTL, change A records |
+| **Load balancer** | Změna pool members, health check |
+| **GSLB** | Global Server Load Balancing (F5 GTM, NSX ALB) |
+| **Cloud DNS** | AWS Route53, Azure Traffic Manager, Google Cloud DNS |
+
+### 7. Databázová migrace
+
+Viz detail v jednotlivých DB souborech. Tabulka shrnutí:
+
+| DB | Metoda | RPO | RTO | Poznámka |
+|----|--------|-----|-----|----------|
+| **PostgreSQL** | Streaming replication + Patroni switchover | 0 (sync) / ~MB (async) | min | Patroni auto-failover |
+| **MySQL** | Group Replication / async replication | 0 (sync) / sekundy | min | InnoDB Cluster |
+| **Oracle** | Data Guard switchover | 0 (sync) | min | Far sync pro vzdálené DC |
+| **MSSQL** | AlwaysOn AG failover | 0 (sync) | min | Cloud witness |
+| **MongoDB** | Replica set election | sekundy | < 1 min | Priority-based failover |
+| **Cassandra** | Multi-DC replication | eventual | 0 | Nativní multi-master |
+
+### 8. Testování
+
+| Fáze | Co testovat | Metoda |
+|------|-------------|--------|
+| **Pre-migrace** | Aplikace v novém DC (izolovaně) | Dry run na replikovaných datech |
+| **Cutover** | Funkčnost, dostupnost, latence | Smoke test, synthetic transactions |
+| **Post-migrace** | Výkon, integrace, monitoring | A/B comparison s baseline, canary traffic |
+| **Rollback** | Návrat ke starému DC | Testovaný rollback plán |
+
+### 9. Rollback plán
+
+Každá vlna musí mít definovaný rollback:
+
+| Podmínka | Akce |
+|----------|------|
+| Aplikace nestartuje v novém DC | Přepnutí DNS zpět, zastavení replikace |
+| Výkon horší než baseline (o > 20 %) | Rollback, analýza příčiny |
+| Integrační selhání (API timeout, DB connection) | Rollback, dependency check |
+| Bezpečnostní incident | Rollback, forenzní analýza |
+
+Rollback by měl být otestován **před** reálným cutoverem.
+
+---
+
+## Speciální případy
+
+### Mainframe migrace
+
+- **IBM z/OS** — GDPS (Geographically Dispersed Parallel Sysplex)
+- HyperSwap pro storage mirroring
+- Cross-system coupling facility (XCF)
+- Často poslední migrovaná komponenta
+
+### COTS aplikace (Oracle EBS, SAP)
+
+- Vyžadují specifické migrační postupy výrobce
+- Oracle EBS: Autoconfig, cloning (ADXLC)
+- SAP: System Copy (Homogeneous / Heterogeneous), SWPM, SUM
+- Licenční re-licensing při změně HW
+
+### Cloud migrace (On-prem → Cloud)
+
+Viz [CLOUD.md](CLOUD.md) — migrační strategie (6 Rs):
+
+| Strategie | Popis |
+|-----------|-------|
+| **Re-host (Lift & Shift)** | VM → Cloud VM (AWS MGN, Azure Migrate) |
+| **Re-platform** | OS upgrade, managed DB (RDS, Cloud SQL) |
+| **Re-architect** | Aplikace přepsána na cloud-native |
+| **Retire** | Zastavení nepotřebných aplikací |
+| **Retain** | Aplikace zůstává on-prem (revize později) |
+| **Repurchase** | SaaS náhrada |
+
+---
+
+## Doporučený postup per velikost DC
+
+| Velikost DC | Počet VM | Doporučená strategie | Doba trvání | Tým |
+|-------------|----------|---------------------|-------------|-----|
+| **Small** | < 50 | Big Bang (víkend) | 2–4 dny | 3–5 lidí |
+| **Medium** | 50–500 | Phased (5–10 wave) | 2–8 týdnů | 5–10 lidí |
+| **Large** | 500–5000 | Phased + Rolling | 3–12 měsíců | 10–30 lidí |
+| **Enterprise** | 5000+ | Parallel Run / Rolling | 12–36 měsíců | 30+ lidí |
+
+---
+
+## Související
+
+- [DATACENTERS.md](DATACENTERS.md) — DC topologie, sekundární DC, deployment order
+- [CLOUD.md](CLOUD.md) — cloud migrační strategie (6 Rs)
+- [DR.md](DR.md) — disaster recovery, RTO/RPO
+- [NETWORKING.md](NETWORKING.md) — BGP, DNS, VXLAN, traffic steering
+- [STORAGE.md](STORAGE.md) — storage replikace
+
+## Zdroje
+
+Odkazy, knihy a standardy: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
+
+*Poslední revize: 2026-06-12*
--- a/DR.en.md
+++ b/DR.en.md
@@ -0,0 +1,336 @@
+# 🔄 Disaster Recovery and Business Continuity
+
+## Terminology
+
+| Abbreviation | Meaning | Description |
+|---------|--------|-------|
+| **RTO** | Recovery Time Objective | Maximum time from outage to service recovery |
+| **RPO** | Recovery Point Objective | Maximum acceptable data loss, measured as time since the last recoverable copy |
+| **MTD** | Maximum Tolerable Downtime | Total outage duration an organization can survive |
+| **WRT** | Work Recovery Time | Time needed for full operations recovery after IT restoration |
+| **MTBF** | Mean Time Between Failures | Average operating time between two consecutive failures |
+| **MTTR** | Mean Time To Repair | Average time needed to repair a failed component and restore service |
+| **SLA** | Service Level Agreement | Contractual availability commitment |
+| **SLO** | Service Level Objective | Internal availability target |
+| **SLI** | Service Level Indicator | Measured availability value |
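
MTBF and MTTR connect to availability through the standard steady-state formula `A = MTBF / (MTBF + MTTR)`. A short sketch with hypothetical numbers:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as a fraction (0..1)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical component: fails on average every 1000 h, takes 1 h to repair.
a = availability(mtbf_hours=1000, mttr_hours=1)
print(f"Availability: {a:.4%}")

# Halving MTTR improves availability without touching MTBF.
a_fast = availability(mtbf_hours=1000, mttr_hours=0.5)
print(f"With faster repair: {a_fast:.4%}")
```

The formula shows why shortening repair time (automation, runbooks, spare parts) can raise availability as effectively as buying more reliable hardware.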
+
+### Relationship between RTO, RPO, MTD, WRT
+
+```
+Last backup ── RPO ──► Outage ──── RTO ────► Service running ──── WRT ────► Full operations
+                │                   │                              │
+                ▼                   ▼                              ▼
+            Lost data     Time without service           Time to full capacity
+
+       MTD = RTO + WRT (max. time the business tolerates)
+```
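
The relation `MTD = RTO + WRT` can be used to derive an RTO budget from business requirements. A minimal sketch; the function names and the hour values are hypothetical:

```python
def rto_budget(mtd_hours: float, wrt_hours: float) -> float:
    """RTO that remains once WRT is reserved out of MTD."""
    if wrt_hours > mtd_hours:
        raise ValueError("WRT alone already exceeds MTD")
    return mtd_hours - wrt_hours


def within_mtd(rto_hours: float, wrt_hours: float, mtd_hours: float) -> bool:
    """True if the planned RTO plus WRT fits within the tolerated downtime."""
    return rto_hours + wrt_hours <= mtd_hours


# Hypothetical: the business tolerates 8 h in total, and 2 h are needed
# to verify data and catch up on the backlog after IT is restored.
print(rto_budget(mtd_hours=8, wrt_hours=2))
print(within_mtd(rto_hours=7, wrt_hours=2, mtd_hours=8))
```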
+
+---
+
+## Uptime calculation
+
+### Nines table
+
+| Level | Uptime | Downtime / year | Downtime / month | Downtime / week |
+|--------|--------|---------------|------------------|------------------|
+| 90 % (one nine) | 0.9 | 36.5 days | 72 h | 16.8 h |
+| 99 % (two nines) | 0.99 | 3.65 days | 7.2 h | 1.68 h |
+| 99.5 % | 0.995 | 1.83 days | 3.6 h | 50.4 min |
+| 99.9 % (three nines) | 0.999 | 8.76 h | 43.2 min | 10.1 min |
+| 99.95 % | 0.9995 | 4.38 h | 21.6 min | 5.04 min |
+| 99.99 % (four nines) | 0.9999 | 52.6 min | 4.32 min | 1.01 min |
+| 99.995 % | 0.99995 | 26.3 min | 2.16 min | 30.2 s |
+| 99.999 % (five nines) | 0.99999 | 5.26 min | 25.9 s | 6.05 s |
+| 99.9999 % (six nines) | 0.999999 | 31.6 s | 2.59 s | 0.605 s |
+
+### Calculation
+
+```
+Availability = (Total time - Downtime) / Total time × 100 %
+
+Example:
+  Year = 365 × 24 × 60 = 525,600 minutes
+  Target: 99.9 % → allowed downtime = 525,600 × (1 - 0.999) = 525.6 minutes = 8.76 h
+
+Combined availability (chain of dependencies):
+  A_web = 99.9 % (3 nines)
+  A_api  = 99.99 % (4 nines)
+  A_db   = 99.999 % (5 nines)
+
+  A_total = 0.999 × 0.9999 × 0.99999 = 0.99889 ≈ 99.89 % (less than 3 nines!)
+
+Parallel availability (redundancy):
+  A_total = 1 - (1 - A_1) × (1 - A_2) × ... × (1 - A_n)
+
+  Example: 2 servers with 99% availability
+  A_total = 1 - (1-0.99) × (1-0.99) = 1 - 0.01 × 0.01 = 0.9999 (99.99 %)
+```
+
+### Calculator
+
+```python
+def uptime_percent_to_downtime(pct, period_days=365):
+    """Convert uptime percentage to downtime in given period."""
+    total_minutes = period_days * 24 * 60
+    allowed_downtime = total_minutes * (1 - pct / 100)
+    return allowed_downtime  # minutes
+
+def downtime_to_uptime_percent(downtime_minutes, period_days=365):
+    """Convert downtime in minutes to uptime percentage."""
+    total_minutes = period_days * 24 * 60
+    return (1 - downtime_minutes / total_minutes) * 100
+
+def combined_availability(availabilities):
+    """Combined availability (series-connected components)."""
+    result = 1.0
+    for a in availabilities:
+        result *= a
+    return result
+
+def redundant_availability(availabilities):
+    """Redundant availability (parallel components)."""
+    result = 1.0
+    for a in availabilities:
+        result *= (1 - a)
+    return 1 - result
+```
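
A usage sketch that reproduces the worked figures from the Calculation section. It restates the formulas compactly with `math.prod` so that it runs on its own; the short function names are illustrative, not the names used above:

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes(uptime_pct: float,
                     period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Allowed downtime for an uptime target given in PERCENT (e.g. 99.9)."""
    return period_minutes * (1 - uptime_pct / 100)


def series(availabilities) -> float:
    """Chain of dependencies; inputs are FRACTIONS (e.g. 0.999)."""
    return math.prod(availabilities)


def parallel(availabilities) -> float:
    """Redundant components; inputs are FRACTIONS (e.g. 0.99)."""
    return 1 - math.prod(1 - a for a in availabilities)


print(round(downtime_minutes(99.9), 1))             # minutes per year at 99.9 %
print(round(series([0.999, 0.9999, 0.99999]), 5))   # web -> api -> db chain
print(round(parallel([0.99, 0.99]), 4))             # two redundant 99 % servers
```

Note the unit difference: the downtime conversion takes a percentage, while the series and parallel functions take fractions between 0 and 1. Mixing the two is an easy mistake to make.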
+
+### Calculation fallacies
+
+- **Combined availability is not a sum** — adding another dependency always reduces total availability
+- **Redundancy is not free** — adding a standby component requires failure detection + failover (MTTR does not improve automatically)
+- **SLA is not a guarantee** — providers often calculate SLA as a monthly average, not per-incident
+- **Measurement is key** — without SLI, SLO cannot be verified; "unmeasured availability does not exist"
+- **Planned maintenance** — sometimes counted as uptime, sometimes not (depends on SLA definition)
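
To illustrate the "measurement is key" point: a sketch that computes a measured availability (SLI) from an outage log and compares it with an SLO. The outage list, the dates, and the function name are hypothetical, and outages are assumed not to overlap:

```python
from datetime import datetime


def measured_availability(outages, period_start, period_end) -> float:
    """SLI as a fraction: share of the period not covered by outages.

    outages: list of (start, end) datetimes, assumed non-overlapping.
    Outages are clipped to the measurement period.
    """
    period_s = (period_end - period_start).total_seconds()
    down_s = 0.0
    for start, end in outages:
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            down_s += (clipped_end - clipped_start).total_seconds()
    return 1 - down_s / period_s


period_start = datetime(2026, 6, 1)
period_end = datetime(2026, 7, 1)  # 30-day month
outages = [
    (datetime(2026, 6, 10, 2, 0), datetime(2026, 6, 10, 2, 30)),    # 30 min
    (datetime(2026, 6, 20, 14, 0), datetime(2026, 6, 20, 14, 20)),  # 20 min
]

slo = 0.999  # three nines allow 43.2 min per 30-day month
sli = measured_availability(outages, period_start, period_end)
print(f"SLI = {sli:.4%}, SLO met: {sli >= slo}")
```

Fifty minutes of downtime in a 30-day month exceeds the 43.2 minutes that three nines allow, so this SLO is missed even though each incident was short.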
+
+---
+
+## DR scenarios
+
+### Classification
+
+| Category | Scenario | Typical RTO | Typical RPO | Frequency |
+|-----------|--------|-------------|-------------|-----------|
+| **Site** | Entire DC / region outage | hours | minutes | Low |
+| **Infrastructure** | HW failure (storage, switch, server) | minutes–hours | seconds | Medium |
+| **Software** | OS, application, DB failure | minutes | seconds | High |
+| **Data** | Data corruption, deletion, cryptolocker | hours | backup point | Low–medium |
+| **Human** | Wrong deployment, config change | minutes–hours | seconds | Medium |
+| **Security** | Attack, breach, ransomware | days | before attack | Low |
+| **Network** | Connectivity outage, DDoS | minutes–hours | N/A | Medium |
+| **Cloud provider** | Regional outage (AWS, Azure, GCP) | hours | minutes | Very low |
+
+### Scenario details
+
+#### Site / Region failure
+
+| Aspect | Description |
+|--------|-------|
+| **Cause** | Blackout, fire, flood, earthquake, cloud provider outage |
+| **Prevention** | Multi-AZ architecture, multi-region deployment, active-active |
+| **Mitigation** | Automatic DNS failover (Route53, Azure Traffic Manager), replica in DR region |
+| **Testing** | Game day: shut down primary region, verify automatic failover |
+
+#### Data corruption / human error
+
+| Aspect | Description |
+|--------|-------|
+| **Cause** | Wrong SQL command (DELETE without WHERE), accidentally deleted bucket, bad migration |
+| **Prevention** | RBAC, MFA for destructive operations, change management, SQL peer review |
+| **Mitigation** | Point-in-time recovery (PITR), transaction log replay, immutable backups |
+| **Testing** | Restore backup to isolated environment, verify data integrity |
+
+#### Ransomware / cyber attack
+
+| Aspect | Description |
+|--------|-------|
+| **Cause** | Attack on production systems, data encryption, exfiltration |
+| **Prevention** | Immutable backups (object lock), air-gapped backups, network segmentation |
+| **Mitigation** | Restore from clean backup, rebuild infrastructure from IaC |
+| **Testing** | Regular restore in isolated network, verify backup is not infected |
+
+---
+
+## Prevention — strategies
+
+### Backup strategies
+
+| Approach | Description | Use case |
+|---------|-------|----------|
+| **3-2-1 rule** | 3 copies, 2 different media, 1 off-site | Universal |
+| **3-2-1-0** | + 0 errors after restore (testing) | Enterprise, compliance |
+| **GFS (Grandfather-Father-Son)** | Daily, weekly, monthly rotation | Long-term archive |
+| **Incremental forever** | Full backup 1×, then only changes | Large data volumes |
+| **Reverse incremental** | Full + incremental, full is always current | Fast recovery |
+
+### Backup methods
+
+| Method | RPO | RTO | Storage | Suitable for |
+|--------|-----|-----|----------|------------|
+| **Full backup** | Last full | Full restore time | Large | Small data, weekly |
+| **Incremental** | Last incremental | Full + all incrementals | Small | Large data, daily |
+| **Differential** | Last diff | Full + last diff | Medium | Compromise |
+| **Snapshot** | Snapshot point-in-time | seconds | Copy-on-write | VM, storage array |
+| **Continuous (CDC)** | < 1 s | Seconds | Log stream | DB (binlog, WAL) |
+| **PITR** | Any point in time | Depends on volume | Full + WAL | RDS, PostgreSQL, SQL Server |
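
The RTO column for incremental and differential backups reflects how many backup sets a restore has to read. A sketch of the restore-chain selection, assuming one backup per day and assuming that an incremental references the most recent backup of any kind; the `Backup` type and both example plans are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    day: int
    kind: str  # "full", "incremental" or "differential"


def restore_chain(backups, target_day):
    """Backups needed, in order, to restore the state as of target_day."""
    usable = sorted((b for b in backups if b.day <= target_day),
                    key=lambda b: b.day)
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup at or before the target day")
    base = fulls[-1]
    later = [b for b in usable if b.day > base.day]
    chain = [base]
    # A differential holds all changes since the last full,
    # so only the newest one is needed.
    diffs = [b for b in later if b.kind == "differential"]
    if diffs:
        chain.append(diffs[-1])
        later = [b for b in later if b.day > diffs[-1].day]
    # Every remaining incremental is required, in order.
    chain.extend(b for b in later if b.kind == "incremental")
    return chain


incremental_plan = [Backup(0, "full")] + [Backup(d, "incremental") for d in range(1, 7)]
differential_plan = [Backup(0, "full")] + [Backup(d, "differential") for d in range(1, 7)]

print([b.day for b in restore_chain(incremental_plan, 3)])
print([b.day for b in restore_chain(differential_plan, 3)])
```

The incremental plan needs the full backup plus every incremental up to the target day, while the differential plan needs only the full backup and the latest differential. That is the trade-off between storage size and restore time shown in the table.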
+
+### Backup immutability
+
+Key protection against ransomware:
+
+| Technique | Description |
+|----------|-------|
+| **Object Lock (WORM)** | Backup cannot be deleted or overwritten for a defined retention period (S3 Object Lock, Azure Blob Immutable) |
+| **Air gap** | Backup is physically separated from the production network (offline disk, tape, cloud without VPN) |
+| **Isolated backup network** | Backup traffic goes through a dedicated network without access from production VLAN |
+| **Out-of-band access** | Backup management console is not accessible from the production network |
+
+---
+
+## DR architectures
+
+### Multi-AZ (Single region)
+
+```
+Region ┌────────────────────────────────────┐
+       │  AZ-1              AZ-2            │
+       │  ┌──────────┐     ┌──────────┐     │
+       │  │  App     │     │  App     │     │
+       │  └─────┬────┘     └─────┬────┘     │
+       │        │                │          │
+       │  ┌─────▼────────────────▼─────┐    │
+       │  │  Load Balancer (cross-AZ)  │    │
+       │  └─────────────┬──────────────┘    │
+       │                │                   │
+       │  ┌─────────────▼──────────────┐    │
+       │  │  DB Primary (AZ-1)         │    │
+       │  │  DB Standby (AZ-2)         │    │
+       │  │  Synchronous replication   │    │
+       │  └────────────────────────────┘    │
+       └────────────────────────────────────┘
+```
+
+- RTO: minutes (automatic failover)
+- RPO: 0 (sync replication)
+- Protection: against AZ failure, not region failure
+
+### Multi-Region
+
+```
+Region A (Primary)                    Region B (DR)
+┌─────────────────────┐              ┌─────────────────────┐
+│  ┌───────────────┐  │              │  ┌───────────────┐  │
+│  │  App + DB     │  │              │  │  App + DB     │  │
+│  │  Active       │──┼──Async───────┼─►│  Standby      │  │
+│  └───────────────┘  │  replication │  └───────────────┘  │
+│         │           │              │         │           │
+│  ┌──────▼───────┐   │              │  ┌──────▼───────┐   │
+│  │  DNS / GSLB  │   │              │  │  DNS / GSLB  │   │
+│  └──────┬───────┘   │              │  └──────┬───────┘   │
+└─────────┼───────────┘              └─────────┼───────────┘
+          │                                    │
+          └──────────── Traffic Manager ───────┘
+```
+
+| Variant | RTO | RPO | Cost | Failover |
+|----------|-----|-----|---------|----------|
+| **Active-Passive** | minutes–hours | seconds | Medium | Manual / auto |
+| **Active-Active** | seconds | < 1 s | High | Automatic (DNS) |
+| **Pilot Light** | tens of minutes | minutes | Low | Manual scaling |
+| **Warm Standby** | minutes | seconds | High | Auto (reduced copy) |
+| **Backup & Restore** | hours | 24 h | Low | Manual |
+
+### On-prem → Cloud DR (Hybrid)
+
+```
+On-prem DC                              Cloud (DR)
+┌─────────────────────┐              ┌─────────────────────┐
+│  ┌───────────────┐  │              │  ┌───────────────┐  │
+│  │  Application  │  │              │  │  VM / App     │  │
+│  │  + DB         │  │              │  │  + DB replica │  │
+│  └───────┬───────┘  │              │  └───────┬───────┘  │
+│          │          │              │          │          │
+│  ┌───────▼───────┐  │  site-to-site│  ┌───────▼───────┐  │
+│  │  Backup proxy │──┼────VPN───────┼─►│  Backup store │  │
+│  └───────────────┘  │              │  └───────────────┘  │
+│                     │              │                     │
+│  ┌───────────────┐  │              │  ┌───────────────┐  │
+│  │  Tape / NAS   │  │              │  │  Veeam / Zerto│  │
+│  └───────────────┘  │              │  └───────────────┘  │
+└─────────────────────┘              └─────────────────────┘
+```
+
+- **RTO**: tens of minutes (depends on VM startup)
+- **RPO**: minutes–hours (depends on replication tool)
+- **Tools**: Veeam, Zerto, Azure Site Recovery, AWS MGN, Commvault
+- **Use case**: enterprise with on-prem DC that needs DR without a second DC
+
+---
+
+## DR testing
+
+### Test types
+
+| Type | Description | Frequency | Risk |
+|-----|-------|-----------|--------|
+| **Tabletop exercise** | Manual scenario walkthrough, no impact on production | Monthly | None |
+| **Walkthrough** | Runbook verification; confirms that everyone knows what to do | Quarterly | None |
+| **Component test** | Test of a single component (e.g., restore one DB) | Monthly | Low |
+| **Integrated test** | Test of the entire stack in isolated environment | Quarterly | Low |
+| **Full failover test** | Production failover to DR site | Annually | High |
+| **Chaos experiment** | Targeted fault injection into production | Continuous | Medium |
+
+### Runbook structure
+
+Each DR scenario should have a runbook:
+
+```yaml
+scenario: "Region A failure"
+triggers:
+  - "CloudWatch alarm: Region A health check 5× timeout"
+  - "PagerDuty incident P0"
+decision_tree: |
+  1. Verify: is Region A really unavailable? (check from 3 different locations)
+  2. Decide: is RTO at risk? If < 30 % RTO remaining → failover
+  3. Failover: run playbook `dr-failover-region-b`
+  4. Verification: smoke tests in Region B
+  5. Communication: status page + stakeholders
+rollback: |
+  1. After Region A recovery → replicate changes from B back to A
+  2. Repoint DNS to A
+  3. Verify data consistency
+  4. Shut down Region B (or keep as hot standby)
+contacts:
+  primary: "on-call@example.com"
+  escalation: "infra-lead@example.com"
+  management: "vp-engineering@example.com"
+```
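
Step 2 of the decision tree can be written down as a small check. The following is a minimal sketch, assuming the elapsed outage time and the RTO are both known in minutes; the function name `should_fail_over` and the 0.30 default are illustrative and not part of any tool named in the runbook.

```python
def should_fail_over(elapsed_min: float, rto_min: float, threshold: float = 0.30) -> bool:
    """Return True when less than `threshold` of the RTO budget remains.

    Hypothetical helper mirroring runbook step 2; all names are illustrative.
    """
    if rto_min <= 0:
        raise ValueError("rto_min must be positive")
    remaining_fraction = (rto_min - elapsed_min) / rto_min
    return remaining_fraction < threshold


# With a 60-minute RTO, the 30 % mark is reached 42 minutes into the outage.
print(should_fail_over(elapsed_min=30, rto_min=60))  # half of the budget is left
print(should_fail_over(elapsed_min=45, rto_min=60))  # a quarter of the budget is left
```

Expressing the rule as a fraction of the RTO keeps the same threshold usable for services with different RTO values.
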
+
+---
+
+## Best practices
+
+- **Test recovery, not backup** — a backup without tested recovery is not a backup
+- **Automate DR** — Terraform / Ansible for DR environment spin-up, DNS failover
+- **Document runbooks** — every scenario, contact, decision tree
+- **Expect failure** — design for failure, don't expect everything to work
+- **Don't underestimate WRT** — service recovery does not mean full operations (data warming, cache, connections)
+- **Align RTO/RPO with business** — technical capabilities must match business requirements
+- **Monitor SLI** — without data, SLO cannot be verified
+- **DR is not just IT** — communication, PR, legal, compliance
+
+---
+
+## Related
+
+- [CLOUD.md](CLOUD.md) — cloud DR strategy, AWS/Azure/GCP specific
+- [DATACENTERS.md](DATACENTERS.md) — DC redundancy, Tier classification
+- [MONITORING.md](MONITORING.md) — alerting, SLI/SLO/SLA
+- [CICD.md](CICD.md) — deployment strategy, rollback
+- [STORAGE.md](STORAGE.md) — backup storage, replication
+
+## Sources
+
+Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
+
+*Last revised: 2026-06-11*
--- a/DR.md
+++ b/DR.md
@@ -0,0 +1,336 @@
+# 🔄 Disaster Recovery a Business Continuity
+
+## Terminologie
+
+| Zkratka | Význam | Popis |
+|---------|--------|-------|
+| **RTO** | Recovery Time Objective | Maximální doba od výpadku do obnovení služby |
+| **RPO** | Recovery Point Objective | Maximální přípustná ztráta dat (čas od poslední zálohy) |
+| **MTD** | Maximum Tolerable Downtime | Celková doba výpadku, kterou organizace přežije |
+| **WRT** | Work Recovery Time | Čas potřebný k plnému obnovení provozu po obnovení IT |
+| **MTBF** | Mean Time Between Failures | Střední doba mezi poruchami |
+| **MTTR** | Mean Time To Repair | Střední doba opravy |
+| **SLA** | Service Level Agreement | Smluvní závazek dostupnosti |
+| **SLO** | Service Level Objective | Interní cíl dostupnosti |
+| **SLI** | Service Level Indicator | Naměřená hodnota dostupnosti |
+
+### Vztah RTO, RPO, MTD, WRT
+
+```
+Poslední záloha ──── RPO ────► Výpadek ──── RTO ────► Služba běží ──── WRT ────► Plný provoz
+                      │                      │                          │
+                      ▼                      ▼                          ▼
+                Ztracená data         Čas bez služby          Čas do plného výkonu
+
+       MTD = RTO + WRT (max. doba, kterou firma toleruje)
+```
+
+---
+
+## Výpočet uptimu
+
+### Tabulka devítek
+
+| Úroveň | Uptime | Downtime / rok | Downtime / měsíc | Downtime / týden |
+|--------|--------|---------------|------------------|------------------|
+| 90 % (jedna devítka) | 0.9 | 36,5 dne | 72 h | 16,8 h |
+| 99 % (dvě devítky) | 0.99 | 3,65 dne | 7,2 h | 1,68 h |
+| 99,5 % | 0.995 | 1,83 dne | 3,6 h | 50,4 min |
+| 99,9 % (tři devítky) | 0.999 | 8,76 h | 43,2 min | 10,1 min |
+| 99,95 % | 0.9995 | 4,38 h | 21,6 min | 5,04 min |
+| 99,99 % (čtyři devítky) | 0.9999 | 52,6 min | 4,32 min | 1,01 min |
+| 99,995 % | 0.99995 | 26,3 min | 2,16 min | 30,2 s |
+| 99,999 % (pět devítek) | 0.99999 | 5,26 min | 25,9 s | 6,05 s |
+| 99,9999 % (šest devítek) | 0.999999 | 31,6 s | 2,59 s | 0,605 s |
+
+### Výpočet
+
+```
+Dostupnost = (Celkový čas - Downtime) / Celkový čas × 100 %
+
+Příklad:
+  Rok = 365 × 24 × 60 = 525 600 minut
+  Cíl: 99,9 % → povolený downtime = 525 600 × (1 - 0,999) = 525,6 minut = 8,76 h
+
+Složená dostupnost (řetězec závislostí):
+  A_web = 99,9 % (3 devítky)
+  A_api  = 99,99 % (4 devítky)
+  A_db   = 99,999 % (5 devítek)
+
+  A_celkem = 0,999 × 0,9999 × 0,99999 = 0,99889 ≈ 99,89 % (méně než 3 devítky!)
+
+Paralelní dostupnost (redundance):
+  A_celkem = 1 - (1 - A_1) × (1 - A_2) × ... × (1 - A_n)
+
+  Příklad: 2 servery s 99% dostupností
+  A_celkem = 1 - (1-0,99) × (1-0,99) = 1 - 0,01 × 0,01 = 0,9999 (99,99 %)
+```
+
+### Kalkulačka
+
+```python
+def uptime_percent_to_downtime(pct, period_days=365):
+    """Převede procento uptimu na downtime v daném období."""
+    total_minutes = period_days * 24 * 60
+    allowed_downtime = total_minutes * (1 - pct / 100)
+    return allowed_downtime  # minutes
+
+def downtime_to_uptime_percent(downtime_minutes, period_days=365):
+    """Převede downtime v minutách na procento uptimu."""
+    total_minutes = period_days * 24 * 60
+    return (1 - downtime_minutes / total_minutes) * 100
+
+def combined_availability(availabilities):
+    """Složená dostupnost (sériově zapojené komponenty)."""
+    result = 1.0
+    for a in availabilities:
+        result *= a
+    return result
+
+def redundant_availability(availabilities):
+    """Paralelní dostupnost (redundantní komponenty)."""
+    result = 1.0
+    for a in availabilities:
+        result *= (1 - a)
+    return 1 - result
+```
+
+### Fallacies výpočtu
+
+- **Složená dostupnost není součet** — přidání další závislosti vždy snižuje celkovou dostupnost
+- **Redundance není zadarmo** — přidání standby komponenty vyžaduje detekci selhání + failover (MTTR se nezlepší automaticky)
+- **SLA není garance** — poskytovatelé často počítají SLA jako měsíční průměr, ne per-incident
+- **Měření je klíčové** — bez SLI nelze ověřit SLO; "nedoměřená dostupnost neexistuje"
+- **Plánovaná odstávka** — někdy se počítá do uptimu, někdy ne (záleží na definici SLA)
+
+---
+
+## DR scénáře
+
+### Klasifikace
+
+| Kategorie | Scénář | Typický RTO | Typické RPO | Frekvence |
+|-----------|--------|-------------|-------------|-----------|
+| **Site** | Výpadek celého DC / regionu | hodiny | minuty | Nízká |
+| **Infrastructure** | Selhání HW (storage, switch, server) | minuty–hodiny | sekundy | Střední |
+| **Software** | Selhání OS, aplikace, DB | minuty | vteřiny | Vysoká |
+| **Data** | Poškození dat, delete, cryptolocker | hodiny | okamžik zálohy | Nízká–střední |
+| **Human** | Chybný deployment, config change | minuty–hodiny | vteřiny | Střední |
+| **Security** | Útok, breach, ransomware | dny | před útokem | Nízká |
+| **Network** | Výpadek konektivity, DDoS | minuty–hodiny | N/A | Střední |
+| **Cloud provider** | Regionální výpadek (AWS, Azure, GCP) | hodiny | minuty | Velmi nízká |
+
+### Detail scénářů
+
+#### Site / Region failure
+
+| Aspekt | Popis |
+|--------|-------|
+| **Příčina** | Blackout, požár, povodeň, zemětřesení, výpadek cloud providera |
+| **Prevence** | Multi-AZ architektura, multi-region deployment, active-active |
+| **Mitigace** | Automatický DNS failover (Route53, Azure Traffic Manager), replica v DR regionu |
+| **Testování** | Game day: vypnout primární region, ověřit automatický failover |
+
+#### Data corruption / human error
+
+| Aspekt | Popis |
+|--------|-------|
+| **Příčina** | Chybný SQL příkaz (DELETE bez WHERE), omylem smazaný bucket, chybná migrace |
+| **Prevence** | RBAC, MFA pro destructive operace, change management, peer review SQL |
+| **Mitigace** | Point-in-time recovery (PITR), transaction log replay, immutable backups |
+| **Testování** | Obnova zálohy do izolovaného prostředí, ověření integrity dat |
+
+#### Ransomware / cyber attack
+
+| Aspekt | Popis |
+|--------|-------|
+| **Příčina** | Útok na produkční systémy, zašifrování dat, exfiltrace |
+| **Prevence** | Immutable backups (object lock), air-gapped backups, network segmentation |
+| **Mitigace** | Obnova z čisté zálohy, re-build infrastructure from IaC |
+| **Testování** | Pravidelná obnova v izolované síti, ověření, že backup není infikován |
+
+---
+
+## Prevence — strategie
+
+### Backup strategie
+
+| Přístup | Popis | Use case |
+|---------|-------|----------|
+| **3-2-1 pravidlo** | 3 kopie, 2 různá média, 1 off-site | Univerzální |
+| **3-2-1-0** | + 0 chyb po obnově (testování) | Enterprise, compliance |
+| **GFS (Grandfather-Father-Son)** | Denní, týdenní, měsíční rotace | Dlouhodobý archiv |
+| **Incremental forever** | Plná záloha 1×, pak jen změny | Velké objemy dat |
+| **Reverse incremental** | Plná + inkrementální, plná je vždy aktuální | Rychlá obnova |
+
+### Zálohovací metody
+
+| Metoda | RPO | RTO | Úložiště | Vhodné pro |
+|--------|-----|-----|----------|------------|
+| **Full backup** | Poslední full | Doba obnovy full | Velké | Malá data, weekly |
+| **Incremental** | Poslední inkrement | Full + všechny inkrementy | Malé | Velká data, daily |
+| **Differential** | Poslední diff | Full + poslední diff | Střední | Kompromis |
+| **Snapshot** | Okamžik snapshotu | vteřiny | Copy-on-write | VM, storage array |
+| **Continuous (CDC)** | < 1 s | Sekundy | Log stream | DB (binlog, WAL) |
+| **PITR** | Libovolný bod v čase | Dle objemu | Full + WAL | RDS, PostgreSQL, SQL Server |
+
+### Immutabilita backupů
+
+Klíčová ochrana proti ransomwaru:
+
+| Technika | Popis |
+|----------|-------|
+| **Object Lock (WORM)** | Backup nelze smazat ani přepsat po dobu definované retence (S3 Object Lock, Azure Blob immutable storage) |
+| **Air gap** | Backup je fyzicky oddělený od produkční sítě (offline disk, tape, cloud bez VPN) |
+| **Isolated backup network** | Backup traffic jde přes dedikovanou síť bez přístupu z produkční VLAN |
+| **Out-of-band access** | Backup management console není dostupná z produkční sítě |
+
+---
+
+## DR architektury
+
+### Multi-AZ (Single region)
+
+```
+Region ┌────────────────────────────────────┐
+       │  AZ-1              AZ-2            │
+       │  ┌──────────┐     ┌──────────┐     │
+       │  │  App     │     │  App     │     │
+       │  └─────┬────┘     └─────┬────┘     │
+       │        │                │          │
+       │  ┌─────▼────────────────▼─────┐    │
+       │  │  Load Balancer (cross-AZ)  │    │
+       │  └─────────────┬──────────────┘    │
+       │                │                   │
+       │  ┌─────────────▼──────────────┐    │
+       │  │  DB Primary (AZ-1)         │    │
+       │  │  DB Standby (AZ-2)         │    │
+       │  │  Synchronous replication   │    │
+       │  └────────────────────────────┘    │
+       └────────────────────────────────────┘
+```
+
+- RTO: minuty (automatický failover)
+- RPO: 0 (sync replication)
+- Ochrana: proti selhání AZ, nikoliv regionu
+
+### Multi-Region
+
+```
+Region A (Primary)                    Region B (DR)
+┌─────────────────────┐              ┌─────────────────────┐
+│  ┌───────────────┐  │              │  ┌───────────────┐  │
+│  │  App + DB     │  │              │  │  App + DB     │  │
+│  │  Active       │──┼──Async───────┼─►│  Standby      │  │
+│  └───────────────┘  │   replikace  │  └───────────────┘  │
+│         │           │              │         │           │
+│  ┌──────▼────────┐  │              │  ┌──────▼────────┐  │
+│  │  DNS / GSLB   │  │              │  │  DNS / GSLB   │  │
+│  └──────┬────────┘  │              │  └──────┬────────┘  │
+└─────────┼───────────┘              └─────────┼───────────┘
+          │                                    │
+          └──────────── Traffic Manager ───────┘
+```
+
+| Varianta | RTO | RPO | Náklady | Failover |
+|----------|-----|-----|---------|----------|
+| **Active-Passive** | minuty–hodiny | sekundy | Střední | Manuální / auto |
+| **Active-Active** | sekundy | < 1 s | Vysoké | Automatický (DNS) |
+| **Pilot Light** | desítky minut | minuty | Nízké | Manuální škálování |
+| **Warm Standby** | minuty | sekundy | Střední–vysoké | Auto (zmenšená kopie) |
+| **Backup & Restore** | hodiny | 24 h | Nízké | Manuální |
+
+### On-prem → Cloud DR (Hybrid)
+
+```
+On-prem DC                              Cloud (DR)
+┌─────────────────────┐              ┌─────────────────────┐
+│  ┌───────────────┐  │              │  ┌───────────────┐  │
+│  │  Aplikace     │  │              │  │  VM / Aplikace│  │
+│  │  + DB         │  │              │  │  + DB replica │  │
+│  └───────┬───────┘  │              │  └───────┬───────┘  │
+│          │          │              │          │          │
+│  ┌───────▼───────┐  │  site-to-site│  ┌───────▼───────┐  │
+│  │  Backup proxy │──┼────VPN───────┼─►│  Backup store │  │
+│  └───────────────┘  │              │  └───────────────┘  │
+│                     │              │                     │
+│  ┌───────────────┐  │              │  ┌───────────────┐  │
+│  │  Tape / NAS   │  │              │  │  Veeam / Zerto│  │
+│  └───────────────┘  │              │  └───────────────┘  │
+└─────────────────────┘              └─────────────────────┘
+```
+
+- **RTO**: desítky minut (závisí na startup VM)
+- **RPO**: minuty–hodiny (závisí na replikačním nástroji)
+- **Nástroje**: Veeam, Zerto, Azure Site Recovery, AWS MGN, Commvault
+- **Use case**: enterprise s on-prem DC, které potřebuje DR bez druhého DC
+
+---
+
+## DR testování
+
+### Typy testů
+
+| Typ | Popis | Frekvence | Riziko |
+|-----|-------|-----------|--------|
+| **Tabletop exercise** | Manuální procházení scénáře, žádný dopad na produkci | Měsíčně | Žádné |
+| **Walkthrough** | Verifikace runbooku, kontrola, že všichni vědí, co dělat | Kvartálně | Žádné |
+| **Component test** | Test jedné komponenty (např. obnova jedné DB) | Měsíčně | Nízké |
+| **Integrated test** | Test celého stacku v izolovaném prostředí | Kvartálně | Nízké |
+| **Full failover test** | Produkční failover do DR site | Ročně | Vysoké |
+| **Chaos experiment** | Cílené vnášení poruch do produkce | Průběžně | Střední |
+
+### Runbook struktura
+
+Každý DR scénář by měl mít runbook:
+
+```yaml
+scenario: "Region A failure"
+triggers:
+  - "CloudWatch alarm: Region A health check 5× timeout"
+  - "PagerDuty incident P0"
+decision_tree: |
+  1. Ověřit: je Region A opravdu nedostupný? (check z 3 různých lokací)
+  2. Rozhodnout: je RTO v ohrožení? Pokud zbývá < 30 % RTO → failover
+  3. Failover: spustit playbook `dr-failover-region-b`
+  4. Verifikace: smoke testy v Region B
+  5. Komunikace: status page + stakeholders
+rollback: |
+  1. Po obnovení Region A → replikace změn z B zpět do A
+  2. Repoint DNS na A
+  3. Ověřit konzistenci dat
+  4. Vypnout Region B (nebo ponechat jako hot standby)
+contacts:
+  primary: "on-call@example.com"
+  escalation: "infra-lead@example.com"
+  management: "vp-engineering@example.com"
+```
+
+---
+
+## Best practices
+
+- **Testuj obnovu, ne zálohu** — backup bez testované obnovy není backup
+- **Automatizuj DR** — Terraform / Ansible pro spin-up DR prostředí, DNS failover
+- **Dokumentuj runbooky** — každý scénář, kontakt, rozhodovací strom
+- **Počítej se selháním** — design for failure, nečekej, že všechno poběží
+- **Nepodceňuj WRT** — obnova služby neznamená plný provoz (data warming, cache, connections)
+- **Slaď RTO/RPO s businessem** — technické možnosti musí odpovídat obchodním požadavkům
+- **Monitoruj SLI** — bez dat nelze ověřit SLO
+- **DR není jen IT** — komunikace, PR, právní, regulace
+
+---
+
+## Související
+
+- [CLOUD.md](CLOUD.md) — cloud DR strategie, AWS/Azure/GCP specific
+- [DATACENTERS.md](DATACENTERS.md) — DC redundance, Tier klasifikace
+- [MONITORING.md](MONITORING.md) — alerting, SLI/SLO/SLA
+- [CICD.md](CICD.md) — deployment strategie, rollback
+- [STORAGE.md](STORAGE.md) — backup storage, replication
+
+## Zdroje
+
+Odkazy, knihy a standardy: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
+
+*Poslední revize: 2026-06-11*
--- a/MESSAGING.en.md
+++ b/MESSAGING.en.md
@@ -0,0 +1,275 @@
+# 📨 Messaging and streaming platforms
+
+## Platform overview
+
+| Platform | Type | Language | Protocol | Persistence | Use case |
+|-----------|-----|-------|----------|-------------|----------|
+| **Apache Kafka** | Distributed event store | Java/Scala | Binary (TCP) | Disk (log) | Event streaming, data pipeline, log aggregation |
+| **RabbitMQ** | Message broker | Erlang | AMQP 0-9-1, MQTT, STOMP | Disk / RAM | Application messaging, task queue, RPC |
+| **Apache Pulsar** | Distributed messaging + streaming | Java | Binary (TCP) + REST | Disk (segmented log) | Streaming + queue in one, multi-tenant |
+| **NATS** | Lightweight messaging | Go | NATS protocol (TCP) | Memory / JetStream (disk) | Microservices, IoT, edge, low-latency |
+| **AWS SQS** | Managed queue | — | HTTPS | Managed | Decoupling services, serverless |
+| **AWS SNS** | Managed pub/sub | — | HTTPS, SQS, Lambda, email | Managed | Push notifications, fanout |
+| **Azure Service Bus** | Managed messaging | — | AMQP, HTTPS | Managed | Enterprise messaging, sessions, transactions |
+| **Google Pub/Sub** | Managed streaming | — | gRPC, REST | Managed | Event-driven, data pipeline |
+| **Red Hat AMQ 7** (Artemis) | Message broker | Java | AMQP, MQTT, STOMP, OpenWire | Disk | Enterprise, JMS, high-availability |
+| **Oracle Service Bus (OSB)** | Enterprise ESB | Java | HTTP/S, JMS, SOAP, REST, MQ, FTP, AQ | Managed (WebLogic) | Enterprise integration, SOA, protocol mediation, routing |
+
+---
+
+## Platform details
+
+### Apache Kafka
+
+**Architecture:**
+
+```
+Producer ──► Topic ──► Partition ──► Consumer Group
+                │
+                ├── Partition 0 ──► leader: Broker 1 (followers: Broker 2, 3)
+                ├── Partition 1 ──► leader: Broker 2 (followers: Broker 1, 3)
+                └── Partition 2 ──► leader: Broker 3 (followers: Broker 1, 2)
+```
+
+| Concept | Description |
+|---------|-------|
+| **Topic** | Logical message category |
+| **Partition** | Append-only log, ordered sequence of messages |
+| **Broker** | Server in Kafka cluster |
+| **Producer** | Publishes messages to topic |
+| **Consumer** | Reads messages from partition (within consumer group) |
+| **Consumer Group** | Group of consumers sharing topic reading |
+| **Offset** | Position in partition (tracked by consumer) |
+| **KRaft** | Controller quorum built into Kafka; replaces ZooKeeper (production-ready since Kafka 3.3, ZooKeeper mode removed in 4.0) |
+
+**Replication and HA:**
+
+| Parameter | Value |
+|----------|---------|
+| Replication factor | 2–3 (typically 3 for production) |
+| ISR (In-Sync Replicas) | Set of replicas (leader included) that are caught up with the leader |
+| Min ISR | Minimum ISR for acknowledging writes (acks=all) |
+| acks=0 | Fire-and-forget (fastest, possible data loss) |
+| acks=1 | Write acknowledged by leader (compromise) |
+| acks=all | Write acknowledged by all ISR (safest) |
+| Leader failover | Automatic election of new leader from ISR |
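
The interaction between `acks` and `min.insync.replicas` in the table above can be modelled in a few lines. This is a simplified sketch, not Kafka client code: `write_outcome` and its return strings are made up for illustration, and real brokers also deal with timeouts, retries and leader changes.

```python
def write_outcome(acks: str, isr_size: int, min_insync_replicas: int) -> str:
    """Simplified model of how a partition leader answers a produce request.

    Illustrative only; the function and its messages are hypothetical.
    """
    if acks not in {"0", "1", "all"}:
        raise ValueError(f"unsupported acks value: {acks!r}")
    if acks == "0":
        # The producer does not wait for any broker response.
        return "sent, no acknowledgement"
    if acks == "1":
        # The leader acknowledges after its own write; followers may lag.
        return "acknowledged by leader"
    # acks=all: min.insync.replicas is enforced only in this mode.
    if isr_size < min_insync_replicas:
        return "rejected: not enough in-sync replicas"
    return f"acknowledged by all {isr_size} in-sync replicas"


# Replication factor 3 with min.insync.replicas=2:
print(write_outcome("all", isr_size=2, min_insync_replicas=2))  # one replica down
print(write_outcome("all", isr_size=1, min_insync_replicas=2))  # two replicas down
```

With replication factor 3 and `min.insync.replicas=2`, the topic keeps accepting `acks=all` writes after losing one replica and rejects them after losing two, which is the usual reason for pairing those two values.
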
+
+**Important configuration:**
+
+```properties
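+# Production (broker defaults for newly created topics)
+# A topic's replication factor is set at creation time,
+# e.g. kafka-topics.sh --create --replication-factor 3
+default.replication.factor=3
+min.insync.replicas=2
+
+# Retention (comments must be on their own line in .properties files)
+# 7 days
+log.retention.hours=168
+# -1 = unlimited; set a byte limit to cap disk usage
+log.retention.bytes=-1
+# 1 GB per segment
+log.segment.bytes=1073741824
+
+# Performance
+# default partition count for new topics; adjust per need (scale-out)
+num.partitions=3
+# snappy, gzip, lz4, zstd
+compression.type=snappy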
+```
+
+**Partitioning strategies:**
+
+| Strategy | Key | Advantage | Disadvantage |
+|----------|------|--------|----------|
+| Round-robin | null | Even distribution | Per-key ordering lost |
+| Key-based | user_id, order_id | Same key → same partition | Uneven distribution (hot keys) |
+| Custom partitioner | Custom logic | Per use-case optimization | More complex maintenance |
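
Key-based partitioning comes down to a stable hash of the key taken modulo the partition count. The sketch below assumes string keys and uses SHA-256 from the standard library purely as a stand-in; Kafka's default partitioner uses murmur2, so the partition numbers here will not match a real cluster. `partition_for` is a hypothetical name.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index with a stable hash.

    SHA-256 is a stand-in chosen because it is stable across processes;
    Kafka itself uses murmur2 for keyed messages.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


# The same key always lands on the same partition, which preserves per-key order.
first = partition_for("user-42", 3)
second = partition_for("user-42", 3)
print(first == second)
```

Because the result depends on `num_partitions`, adding partitions later remaps existing keys, which is one reason the partition count is usually chosen with headroom.
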
+
+### RabbitMQ
+
+**Architecture:**
+
+```
+Producer ──► Exchange ──► Binding ──► Queue ──► Consumer
+                  │
+      ┌───────────┼───────────┐
+      ▼           ▼           ▼
+  Direct      Topic      Fanout
+  Exchange   Exchange   Exchange
+```
+
+| Concept | Description |
+|---------|-------|
+| **Exchange** | Receives messages from producer, routes to queue |
+| **Binding** | Exchange → queue link with routing key |
+| **Queue** | FIFO message queue (consumed by consumer) |
+| **Virtual Host (vhost)** | Tenant isolation within a single cluster |
+| **Publisher Confirm** | Broker acknowledges message receipt |
+| **Consumer Ack** | Consumer acknowledges message processing |
+
+**Exchange types:**
+
+| Type | Routing | Use case |
+|-----|---------|----------|
+| **Direct** | routing_key = binding_key | Task queue, point-to-point |
+| **Topic** | routing_key match binding pattern (wildcard `*`, `#`) | Pub/sub with filtering |
+| **Fanout** | All bound queues | Broadcast, event notification |
+| **Headers** | AMQP headers match | Complex routing (not routing key dependent) |
+
+**Queue types:**
+
+```properties
+# Classic Queue (not replicated; classic mirrored queues are deprecated and removed in RabbitMQ 4.0)
+x-queue-type: classic
+
+# Quorum Queue (recommended for production)
+x-queue-type: quorum
+x-quorum-initial-group-size: 3
+x-dead-letter-exchange: dlx
+
+# Stream Queue (for large backlogs)
+x-queue-type: stream
+x-max-length-bytes: 1073741824
+```
+
+**HA and clustering:**
+
+| Mode | Description | Use case |
+|-------|-------|----------|
+| **Quorum Queues** | Raft-based replication (3–5 nodes), auto failover | Production, HA messaging |
+| **Federation** | Async message forwarding between independent RabbitMQ clusters | Multi-region, DR |
+| **Shovel** | Point-to-point message forwarding (Federation at queue level) | Migration, specific routing |
+| **Warm Standby (DR)** | Secondary cluster kept running and replicated, promoted on failover | Warm DR |
+
+### Apache Pulsar
+
+**Unique architecture (compute/storage separation):**
+
+```
+┌──────────────┐    ┌──────────────┐    ┌──────────────┐
+│  Producer    │    │  Consumer    │    │  Consumer    │
+└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
+       │                   │                   │
+┌──────▼───────────────────▼───────────────────▼──────┐
+│               Broker (stateless)                    │
+│         Subscription: Exclusive / Shared / Failover │
+└──────────────────────┬──────────────────────────────┘
+                       │
+┌──────────────────────▼──────────────────────────────┐
+│           BookKeeper (stateful storage)              │
+│  ├── Bookie 1  ├── Bookie 2  ├── Bookie 3  ├── ... │
+│  └── Ledger (append-only, segmented log)            │
+└─────────────────────────────────────────────────────┘
+```
+
+| Concept | Description |
+|---------|-------|
+| **Topic** | Logical category (partitioned or non-partitioned) |
+| **Subscription** | Delivery mode (Exclusive, Shared, Failover, Key_Shared) |
+| **Ledger** | Storage unit in BookKeeper (append-only) |
+| **Bookie** | Storage node (BookKeeper) |
+| **Managed Ledger** | Segmented log with cache and retention |
+
+**Advantages over Kafka:**
+- Compute/storage separation — independent scaling
+- Geo-replication built-in (native)
+- Multi-tenant (namespaces, isolation)
+- TTL, retry, dead letter topic (built-in)
+- At-least-once and effectively-once delivery
+
+### NATS
+
+| Feature | Description |
+|---------|-------|
+| **Core NATS** | Pub/sub, request-reply, < 1 ms latency |
+| **JetStream** | Persistence, exactly-once, key-value store, object store |
+| **Leaf nodes** | Hierarchical cluster connection |
+| **Super-cluster** | Multi-region clustering (global) |
+
+**Use case:** IoT, edge computing, microservices communication, low-latency messaging.
+
+### Oracle Service Bus (OSB)
+
+Part of Oracle SOA Suite, runs on WebLogic Server. Enterprise service bus for integration in Oracle-heavy environments.
+
+| Concept | Description |
+|---------|-------|
+| **Proxy Service** | Inbound endpoint (HTTP, JMS, MQ, SOAP, REST) |
+| **Business Service** | Target backend service |
+| **Pipeline** | Message processing — routing, transformation, validation |
+| **Split-Join** | Parallel/sequential orchestration of multiple services |
+| **Reporting** | Message tracking, SLA monitoring |
+
+**Key features:**
+- **Protocol mediation** — translation between SOAP/REST/JMS/MQ/FTP
+- **Message transformation** — XSLT, XQuery, MFL (non-XML)
+- **Throttling, SLA, alerting** — built-in
+- **Oracle AQ (Advanced Queuing)** — integration with Oracle DB queues
+- **XPath, XQuery, XSLT 2.0/3.0** — native support
+- **Error handling** — fault policies, error queues, retry
+
+**Use case:** Enterprise SOA, Oracle DB → Kafka bridging, legacy mainframe wrapping, B2B integration.
+
+**Alternatives:** IBM Integration Bus (IIB), MuleSoft Anypoint, WSO2 EI, Apache Camel / ServiceMix.
+
+---
+
+## Platform comparison
+
+### Performance and scaling
+
+| Platform | Max throughput | Latency (P99) | Messages/s (1 broker) | Scaling |
+|-----------|--------------|---------------|-------------------------|-----------|
+| **Kafka** | > 1 GB/s | 2–10 ms | ~1,000,000 | Partitions (horizontal) |
+| **Pulsar** | > 1 GB/s | 5–15 ms | ~1,000,000 | Brokers + Bookies |
+| **RabbitMQ** | ~100 MB/s | < 1 ms (RAM) | ~100,000 | Clustering (node) |
+| **NATS** | > 10 GB/s | < 0.5 ms | ~10,000,000 | Clustering + Leaf nodes |
+| **OSB** | < 1 GB/s | 10–100 ms | ~10,000 | Vertical (WebLogic cluster) |
+
+### Delivery guarantees
+
+| Platform | At most once | At least once | Exactly once | Ordering |
+|-----------|-------------|---------------|-------------|----------|
+| **Kafka** | Yes | Yes (acks=all + min.insync) | Yes (idempotent + transactional) | Per partition |
+| **Pulsar** | Yes | Yes | Yes (dedup + transactional) | Per partition |
+| **RabbitMQ** | Yes | Yes (Publisher Confirm + Consumer Ack) | Limited | Per queue |
+| **NATS** | Yes | Yes (JetStream) | Limited | Per subject |
+| **OSB** | Yes | Yes (XA transactions) | Yes (XA + WS-AT) | Per pipeline |
+
+### When to use what
+
+| Use case | Recommended platform | Reasoning |
+|----------|---------------------|------------|
+| **Event sourcing / audit log** | Kafka, Pulsar | Append-only log, high throughput, replay |
+| **CDC (Change Data Capture)** | Kafka (Kafka Connect + Debezium) | Connector ecosystem |
+| **Task queue (job processing)** | RabbitMQ, SQS | Dead letter, retry, priority, scheduling |
+| **API messaging / microservices** | NATS, RabbitMQ | Low latency, simplicity |
+| **Data pipeline (ETL)** | Kafka (ksqlDB, Kafka Streams) | Stream processing built into the platform |
+| **IoT / Edge** | NATS, MQTT (RabbitMQ) | Lightweight, leaf nodes |
+| **Enterprise SOA / EAI** | OSB, IBM IIB, MuleSoft | Protocol mediation, XA, B2B, legacy wrapping |
+| **Multi-tenant cloud** | Pulsar | Native multi-tenant, geo-replication |
+| **Serverless / event-driven** | SQS/SNS, Pub/Sub | Managed, auto-scaling |
+
+---
+
+## DR and high availability
+
+See [DATACENTERS.en.md](DATACENTERS.en.md) — section "Impact of individual technologies on DC topology selection" for detailed DR mapping per platform.
+
+### Best practices
+
+- **Don't lose messages in queue** — prefer acknowledgement-based consumption (not auto-ack)
+- **Dead letter queue** — every main queue has a DLQ for undeliverable messages
+- **Monitor lag** — consumer lag is a key metric (Kafka: `records-lag-max` under `kafka.consumer:type=consumer-fetch-manager-metrics`)
+- **Idempotent consumer** — same message may be delivered twice
+- **Retry with backoff** — exponential backoff on processing failure
+- **Schema registry** — avoid deserialization errors (Avro, Protobuf, JSON Schema)
+- **Encryption** — TLS in transit, encryption at rest (Kafka has no built-in at-rest encryption; use disk or volume encryption, or encrypt payloads client-side)
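
Two of the practices above, the idempotent consumer and retry with backoff, are easy to show in isolation. The following is a minimal sketch under stated assumptions: every message carries a unique ID, the set of seen IDs lives in memory (a real consumer would persist it), and the names `backoff_delays` and `IdempotentConsumer` are hypothetical.

```python
from typing import List, Set


def backoff_delays(base_s: float, retries: int, factor: float = 2.0,
                   cap_s: float = 60.0) -> List[float]:
    """Delay before each retry: base, base*factor, ... capped at cap_s."""
    return [min(base_s * factor ** attempt, cap_s) for attempt in range(retries)]


class IdempotentConsumer:
    """Applies each message at most once, keyed by a message ID.

    The in-memory set is a stand-in for a durable deduplication store.
    """

    def __init__(self) -> None:
        self.seen_ids: Set[str] = set()
        self.applied: List[str] = []

    def handle(self, message_id: str, payload: str) -> bool:
        if message_id in self.seen_ids:
            return False  # duplicate delivery: skip the side effects
        self.applied.append(payload)
        self.seen_ids.add(message_id)
        return True


consumer = IdempotentConsumer()
consumer.handle("msg-1", "charge order 1001")
consumer.handle("msg-1", "charge order 1001")  # redelivery of the same message
print(consumer.applied)
print(backoff_delays(base_s=1.0, retries=5))
```

The cap on the delay keeps a long outage from pushing retries arbitrarily far apart; adding random jitter is a common refinement that is left out here for brevity.
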
+
+---
+
+## Related
+
+- [DATACENTERS.en.md](DATACENTERS.en.md) — DR topology, per-platform mapping
+- [CLOUD.en.md](CLOUD.en.md) — managed messaging (SQS, SNS, Service Bus, Pub/Sub)
+
+## Sources
+
+Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
+
+*Last revision: 2026-06-12*
--- a/MESSAGING.md
+++ b/MESSAGING.md
@@ -0,0 +1,275 @@
+# 📨 Messaging a streaming platformy
+
+## Přehled platforem
+
+| Platforma | Typ | Jazyk | Protokol | Persistence | Use case |
+|-----------|-----|-------|----------|-------------|----------|
+| **Apache Kafka** | Distributed event store | Java/Scala | Binary (TCP) | Disk (log) | Event streaming, data pipeline, log aggregation |
+| **RabbitMQ** | Message broker | Erlang | AMQP 0-9-1, MQTT, STOMP | Disk / RAM | Aplikační messaging, task queue, RPC |
+| **Apache Pulsar** | Distributed messaging + streaming | Java | Binary (TCP) + REST | Disk (segmented log) | Streaming + queue v jednom, multi-tenant |
+| **NATS** | Lightweight messaging | Go | NATS protocol (TCP) | Memory / JetStream (disk) | Microservices, IoT, edge, low-latency |
+| **AWS SQS** | Managed queue | — | HTTPS | Managed | Decoupling services, serverless |
+| **AWS SNS** | Managed pub/sub | — | HTTPS, SQS, Lambda, email | Managed | Push notifications, fanout |
+| **Azure Service Bus** | Managed messaging | — | AMQP, HTTPS | Managed | Enterprise messaging, sessions, transactions |
+| **Google Pub/Sub** | Managed streaming | — | gRPC, REST | Managed | Event-driven, data pipeline |
+| **Red Hat AMQ 7** (Artemis) | Message broker | Java | AMQP, MQTT, STOMP, OpenWire | Disk | Enterprise, JMS, high-availability |
+| **Oracle Service Bus (OSB)** | Enterprise ESB | Java | HTTP/S, JMS, SOAP, REST, MQ, FTP, AQ | Managed (WebLogic) | Enterprise integration, SOA, protocol mediation, routing |
+
+---
+
+## Detail platforem
+
+### Apache Kafka
+
+**Architektura:**
+
+```
+Producer ──► Topic ──► Partition ──► Consumer Group
+                │
+                ├── Partition 0 ──► leader: Broker 1 (followers: Broker 2, 3)
+                ├── Partition 1 ──► leader: Broker 2 (followers: Broker 1, 3)
+                └── Partition 2 ──► leader: Broker 3 (followers: Broker 1, 2)
+```
+
+| Koncept | Popis |
+|---------|-------|
+| **Topic** | Logická kategorie zpráv |
+| **Partition** | Append-only log, uspořádaná sekvence zpráv |
+| **Broker** | Server v Kafka clusteru |
+| **Producer** | Publikuje zprávy do topicu |
+| **Consumer** | Čte zprávy z partition (v rámci consumer group) |
+| **Consumer Group** | Skupina consumerů sdílejících čtení topicu |
+| **Offset** | Pozice v partition (sledovaná consumerem) |
+| **KRaft** | Controller quorum vestavěný v Kafce; nahrazuje ZooKeeper (produkčně použitelný od Kafka 3.3, režim se ZooKeeperem odstraněn ve verzi 4.0) |
+
+**Replikace a HA:**
+
+| Parametr | Hodnota |
+|----------|---------|
+| Replication factor | 2–3 (typicky 3 pro produkci) |
+| ISR (In-Sync Replicas) | Množina replik (včetně leadera), které drží krok s leaderem |
+| Min ISR | Minimální počet ISR pro potvrzení zápisu (acks=all) |
+| acks=0 | Fire-and-forget (nejrychlejší, možná ztráta dat) |
+| acks=1 | Zápis potvrzen leaderem (kompromis) |
+| acks=all | Zápis potvrzen všemi ISR (nejbezpečnější) |
+| Leader failover | Automatický výběr nového leadera z ISR |
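+
+The interplay of `acks` and Min ISR in the table above can be sketched as a small decision function. This is an illustrative model only, not Kafka client code; the name `write_outcome` and its return strings are assumptions made for this example.
+
+```python
+def write_outcome(acks: str, isr_size: int, min_insync_replicas: int) -> str:
+    """Model how a produce request is treated for a given acks setting."""
+    if acks == "0":
+        # The producer does not wait for any broker response
+        return "not acknowledged (fire-and-forget)"
+    if acks == "1":
+        # Only the partition leader has to persist the write
+        return "acknowledged by leader only"
+    if acks == "all":
+        # min.insync.replicas is enforced only for acks=all
+        if isr_size < min_insync_replicas:
+            return "rejected (not enough in-sync replicas)"
+        return f"acknowledged by all {isr_size} in-sync replicas"
+    raise ValueError(f"unknown acks value: {acks!r}")
+
+
+# Replication factor 3, min.insync.replicas=2: one lost replica is tolerated
+print(write_outcome("all", isr_size=2, min_insync_replicas=2))
+# Two lost replicas: acks=all writes are refused rather than under-replicated
+print(write_outcome("all", isr_size=1, min_insync_replicas=2))
+```
+
+With a replication factor of 3 and `min.insync.replicas=2`, the cluster keeps accepting `acks=all` writes after losing one replica and rejects them after losing two.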
+
+**Important configuration:**
+
+```properties
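+# Production (broker defaults; the per-topic replication factor is set at topic creation)
+default.replication.factor=3
+min.insync.replicas=2
+
+# Retention (comments go on their own line: a trailing "# ..." would become part of the value)
+# 168 h = 7 days
+log.retention.hours=168
+# -1 = unlimited (or set a byte limit)
+log.retention.bytes=-1
+# 1 GiB per segment
+log.segment.bytes=1073741824
+
+# Performance
+# default partition count for new topics; size it to the required parallelism
+num.partitions=3
+# one of: gzip, snappy, lz4, zstd, uncompressed, producer
+compression.type=snappy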
+```
+
+**Partitioning strategies:**
+
+| Strategy | Key | Advantage | Disadvantage |
+|----------|-----|-----------|--------------|
+| Round-robin | null | Even distribution | No per-key ordering |
+| Key-based | user_id, order_id | Messages with the same key → same partition | Uneven distribution (hot keys) |
+| Custom partitioner | Custom logic | Optimized per use case | Harder to maintain |
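+
+A minimal sketch of the first two strategies in the table above. Kafka's default partitioner hashes the key with murmur2; `zlib.crc32` is used here only as a stand-in stable hash, and `partition_for` is a hypothetical helper, not a Kafka API.
+
+```python
+import zlib
+from collections import Counter
+from typing import Optional
+
+
+def partition_for(key: Optional[bytes], num_partitions: int, sequence: int = 0) -> int:
+    """Pick a partition: round-robin for null keys, stable hash for keyed messages."""
+    if key is None:
+        # No key: spread messages evenly, at the cost of per-key ordering
+        return sequence % num_partitions
+    # The same key always maps to the same partition, which preserves per-key order
+    return zlib.crc32(key) % num_partitions
+
+
+# A hot key sends most of the traffic to a single partition
+keys = [b"user-1"] * 8 + [b"user-2", b"user-3"]
+load = Counter(partition_for(k, 3) for k in keys)
+print(load)
+```
+
+The skew visible in `load` is the hot-key disadvantage: adding partitions does not help when one key dominates the traffic.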
+
+### RabbitMQ
+
+**Architecture:**
+
+```
+Producer ──► Exchange ──► Binding ──► Queue ──► Consumer
+                  │
+      ┌───────────┼───────────┐
+      ▼           ▼           ▼
+  Direct      Topic      Fanout
+  Exchange   Exchange   Exchange
+```
+
+| Concept | Description |
+|---------|-------------|
+| **Exchange** | Receives messages from producers and routes them to queues |
+| **Binding** | Link from an exchange to a queue, with a routing key |
+| **Queue** | FIFO message queue (consumers read from it) |
+| **Virtual Host (vhost)** | Tenant isolation within a single cluster |
+| **Publisher Confirm** | Confirmation that the broker has accepted the message |
+| **Consumer Ack** | Confirmation that the consumer has processed the message |
+
+**Exchange types:**
+
+| Type | Routing | Use case |
+|------|---------|----------|
+| **Direct** | routing_key = binding_key | Task queue, point-to-point |
+| **Topic** | routing_key matches the binding pattern (wildcards: `*` = exactly one word, `#` = zero or more words) | Pub/sub with filtering |
+| **Fanout** | To all bound queues | Broadcast, event notification |
+| **Headers** | Match on AMQP message headers | Complex routing (independent of the routing key) |
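+
+The wildcard rules of the topic exchange can be illustrated with a short matcher. This is a sketch of the matching semantics (`*` matches exactly one dot-separated word, `#` matches zero or more), not RabbitMQ's implementation; `topic_matches` is a hypothetical name.
+
+```python
+def topic_matches(pattern: str, routing_key: str) -> bool:
+    """Return True if a topic-exchange binding pattern matches a routing key."""
+
+    def match(p, k):
+        if not p:
+            # Pattern exhausted: match only if the key is exhausted too
+            return not k
+        head, rest = p[0], p[1:]
+        if head == "#":
+            # '#' absorbs zero or more words; try every possible split
+            return any(match(rest, k[i:]) for i in range(len(k) + 1))
+        if not k:
+            return False
+        if head == "*" or head == k[0]:
+            # '*' consumes exactly one word; a literal must be equal
+            return match(rest, k[1:])
+        return False
+
+    return match(pattern.split("."), routing_key.split("."))
+
+
+print(topic_matches("order.*.created", "order.eu.created"))
+print(topic_matches("order.*", "order.eu.created"))
+print(topic_matches("order.#", "order.eu.created"))
+```
+
+A binding of `order.#` therefore receives every order event, while `order.*` receives only keys with exactly two words.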
+
+**Queue types:**
+
+```properties
+# Classic queue (no replication; classic mirrored queues were removed in RabbitMQ 4.0)
+x-queue-type: classic
+
+# Quorum queue (recommended for production)
+x-queue-type: quorum
+x-quorum-initial-group-size: 3
+x-dead-letter-exchange: dlx
+
+# Stream (for large backlogs)
+x-queue-type: stream
+x-max-length-bytes: 1073741824
+```
+
+**HA and clustering:**
+
+| Mode | Description | Use case |
+|------|-------------|----------|
+| **Quorum Queues** | Raft-based replication (3–5 nodes), automatic failover | Production, HA messaging |
+| **Federation** | Asynchronous message forwarding between independent RabbitMQ clusters | Multi-region, DR |
+| **Shovel** | Point-to-point message forwarding from a source queue to a destination (lower-level than Federation) | Migration, specific routing |
+| **Warm Standby (DR)** | Second cluster that takes over only on failover | DR standby site |
+
+### Apache Pulsar
+
+**Unique architecture (compute/storage separation):**
+
+```
+┌──────────────┐    ┌──────────────┐    ┌──────────────┐
+│  Producer    │    │  Consumer    │    │  Consumer    │
+└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
+       │                   │                   │
+┌──────▼───────────────────▼───────────────────▼──────┐
+│               Broker (stateless)                    │
+│         Subscription: Exclusive / Shared / Failover │
+└──────────────────────┬──────────────────────────────┘
+                       │
+┌──────────────────────▼──────────────────────────────┐
+│           BookKeeper (stateful storage)              │
+│  ├── Bookie 1  ├── Bookie 2  ├── Bookie 3  ├── ... │
+│  └── Ledger (append-only, segmented log)            │
+└─────────────────────────────────────────────────────┘
+```
+
+| Concept | Description |
+|---------|-------------|
+| **Topic** | Logical category (partitioned or non-partitioned) |
+| **Subscription** | Delivery mode (Exclusive, Shared, Failover, Key_Shared) |
+| **Ledger** | Storage unit in BookKeeper (append-only) |
+| **Bookie** | Storage node (BookKeeper) |
+| **Managed Ledger** | Segmented log with cache and retention |
+
+**Advantages over Kafka:**
+- Compute/storage separation: brokers and bookies scale independently
+- Built-in (native) geo-replication
+- Multi-tenancy (namespaces, isolation)
+- TTL, retry, and dead letter topics (built-in)
+- At-least-once and effectively-once delivery
+
+### NATS
+
+| Feature | Description |
+|---------|-------------|
+| **Core NATS** | Pub/sub, request-reply, < 1 ms latency |
+| **JetStream** | Persistence, exactly-once, key-value store, object store |
+| **Leaf nodes** | Hierarchical interconnection of clusters |
+| **Super-cluster** | Multi-region clustering (global) |
+
+**Use case:** IoT, edge computing, microservices communication, low-latency messaging.
+
+### Oracle Service Bus (OSB)
+
+Part of Oracle SOA Suite, running on WebLogic Server. An enterprise service bus for integration in Oracle-heavy environments.
+
+| Concept | Description |
+|---------|-------------|
+| **Proxy Service** | Inbound endpoint (HTTP, JMS, MQ, SOAP, REST) |
+| **Business Service** | Target backend service |
+| **Pipeline** | Message processing: routing, transformation, validation |
+| **Split-Join** | Parallel/sequential orchestration of multiple services |
+| **Reporting** | Message tracking, SLA monitoring |
+
+**Key features:**
+- **Protocol mediation**: translation between SOAP/REST/JMS/MQ/FTP
+- **Message transformation**: XSLT, XQuery, MFL (non-XML)
+- **Throttling, SLA, alerting**: built-in
+- **Oracle AQ (Advanced Queuing)**: integration with Oracle DB queues
+- **XPath, XQuery, XSLT 2.0/3.0**: native support
+- **Error handling**: fault policies, error queues, retry
+
+**Use case:** Enterprise SOA, Oracle DB → Kafka bridging, legacy mainframe wrapping, B2B integration.
+
+**Alternatives:** IBM Integration Bus (IIB), MuleSoft Anypoint, WSO2 EI, Apache Camel / ServiceMix.
+
+---
+
+## Platform comparison
+
+### Performance and scaling
+
+| Platform | Max throughput | Latency (P99) | Messages/s (1 broker) | Scaling |
+|----------|----------------|---------------|-----------------------|---------|
+| **Kafka** | > 1 GB/s | 2–10 ms | ~1,000,000 | Partitions (horizontal) |
+| **Pulsar** | > 1 GB/s | 5–15 ms | ~1,000,000 | Brokers + Bookies |
+| **RabbitMQ** | ~100 MB/s | < 1 ms (RAM) | ~100,000 | Clustering (nodes) |
+| **NATS** | > 10 GB/s | < 0.5 ms | ~10,000,000 | Clustering + leaf nodes |
+| **OSB** | < 1 GB/s | 10–100 ms | ~10,000 | Vertical (WebLogic cluster) |
+
+### Delivery guarantees
+
+| Platform | At most once | At least once | Exactly once | Ordering |
+|----------|--------------|---------------|--------------|----------|
+| **Kafka** | Yes | Yes (acks=all + `min.insync.replicas`) | Yes (idempotent producer + transactions) | Per partition |
+| **Pulsar** | Yes | Yes | Yes (deduplication + transactions) | Per partition |
+| **RabbitMQ** | Yes | Yes (Publisher Confirm + Consumer Ack) | Limited | Per queue |
+| **NATS** | Yes | Yes (JetStream) | Limited | Per subject |
+| **OSB** | Yes | Yes (XA transactions) | Yes (XA + WS-AT) | Per pipeline |
+
+### When to use what
+
+| Use case | Recommended platform | Rationale |
+|----------|----------------------|-----------|
+| **Event sourcing / audit log** | Kafka, Pulsar | Append-only log, high throughput, replay |
+| **CDC (Change Data Capture)** | Kafka (Kafka Connect + Debezium) | Connector ecosystem |
+| **Task queue (job processing)** | RabbitMQ, SQS | Dead letter, retry, priority, scheduling |
+| **API messaging / microservices** | NATS, RabbitMQ | Low latency, simplicity |
+| **Data pipeline (ETL)** | Kafka (ksqlDB, Kafka Streams) | Stream processing built into the platform |
+| **IoT / Edge** | NATS, MQTT (RabbitMQ) | Lightweight, leaf nodes |
+| **Enterprise SOA / EAI** | OSB, IBM IIB, MuleSoft | Protocol mediation, XA, B2B, legacy wrapping |
+| **Multi-tenant cloud** | Pulsar | Native multi-tenancy, geo-replication |
+| **Serverless / event-driven** | SQS/SNS, Pub/Sub | Managed, auto-scaling |
+
+---
+
+## DR and high availability
+
+See [DATACENTERS.md](DATACENTERS.md), section "Vliv jednotlivých technologií na výběr DC topologie", for the detailed per-platform DR mapping.
+
+### Best practices
+
+- **Do not lose messages in the queue**: prefer acknowledgement-based consumption (not auto-ack)
+- **Dead letter queue**: every main queue has a DLQ for undeliverable messages
+- **Lag monitoring**: consumer lag is a key metric (Kafka: the consumer's `records-lag-max` metric)
+- **Idempotent consumer**: the same message may be delivered twice
+- **Retry with backoff**: exponential backoff when processing fails
+- **Schema registry**: avoid deserialization errors (Avro, Protobuf, JSON Schema)
+- **Encryption**: TLS in transit, encryption at rest (Kafka brokers do not encrypt stored data themselves; use disk or volume encryption, or client-side payload encryption)
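+
+Three of the practices above (idempotent consumer, retry with backoff, dead letter queue) combine naturally in one consumer loop. The following is a broker-agnostic sketch: `process_with_retry`, the in-memory `seen_ids` set, and the `dead_letters` list are illustrative assumptions standing in for a real deduplication store and a real DLQ.
+
+```python
+import time
+
+
+def process_with_retry(message_id, payload, handler, seen_ids, dead_letters,
+                       max_attempts=4, base_delay=0.1, sleep=time.sleep):
+    """Process one message at most once, retrying failures with exponential backoff."""
+    if message_id in seen_ids:
+        # Idempotent consumer: a redelivered message is acknowledged but not reprocessed
+        return "duplicate"
+    for attempt in range(max_attempts):
+        try:
+            handler(payload)
+        except Exception:
+            if attempt == max_attempts - 1:
+                # Retries exhausted: park the message in the dead letter queue
+                dead_letters.append((message_id, payload))
+                return "dead-lettered"
+            # Exponential backoff: base_delay, 2x, 4x, ...
+            sleep(base_delay * 2 ** attempt)
+        else:
+            seen_ids.add(message_id)
+            return "processed"
+
+
+seen, dlq = set(), []
+print(process_with_retry("m-1", {"order": 42}, lambda p: None, seen, dlq))
+print(process_with_retry("m-1", {"order": 42}, lambda p: None, seen, dlq))
+```
+
+In production the set of seen IDs would live in durable storage shared by all consumers, and the message would be acknowledged to the broker only after a `processed`, `duplicate`, or `dead-lettered` outcome.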
+
+---
+
+## Related
+
+- [DATACENTERS.md](DATACENTERS.md): DR topologies, per-platform mapping
+- [CLOUD.md](CLOUD.md): managed messaging (SQS, SNS, Service Bus, Pub/Sub)
+
+## Sources
+
+Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
+
+*Last revised: 2026-06-12*
--- a/README.en.md
+++ b/README.en.md
@@ -52,9 +52,10 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
 | 🌐 Network architecture | [NETWORKING.md](NETWORKING.md) | DNS, BGP, VPC, Zero Trust, EVPN VXLAN, TLS | CLOUD |
 | 📊 Monitoring & observability | [MONITORING.md](MONITORING.md) | Prometheus, Grafana, OTel, logging, alerting | — |
 | 🔄 CI/CD & DevOps | [CICD.md](CICD.md) | Pipelines, GitOps, IaC (Terraform), deployment | — |
+| 🔄 Disaster Recovery | [DR.md](DR.md) | RTO, RPO, scenarios, prevention, uptime calculation | CLOUD, DATACENTERS, MONITORING |
 | 🗄️ Database architecture | [DATABASES.md](DATABASES.md) | Classification, sharding, replication, caching | POSTGRESQL, MYSQL, ORACLE, MONGODB, REDIS, CASSANDRA, VEKTOROVE-DB, DATABAZOVE-ENGINY |
 | 🖥️ Hypervisors | [HYPERVISORS.md](HYPERVISORS.md) | VMware, Hyper-V, KVM, Proxmox, migration | STORAGE, SERVER-HW |
-| 🏭 Data centers | [DATACENTERS.md](DATACENTERS.md) | Tier, power, cooling, layout, DC services | MONITORING |
+| 🏭 Data centers | [DATACENTERS.md](DATACENTERS.md) | Tier, power, cooling, layout, DC services, secondary DC topologies | MONITORING |
 | 💾 Storage | [STORAGE.md](STORAGE.md) | SAN/NAS/object, RAID, SDS, Ceph, OpenStack Cinder/Swift/Manila | — |
 | 🔌 Server connectivity | [CONNECTIVITY.md](CONNECTIVITY.md) | Ethernet, FC SAN, iSCSI, NVMe-oF, SAS | — |
 | 🔧 Server hardware | [SERVER-HW.md](SERVER-HW.md) | CPU, RAM, PCIe, NUMA, BMC | CONNECTIVITY |
@@ -89,9 +90,10 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
 | 🌐 Network architecture | [NETWORKING.en.md](NETWORKING.en.md) | DNS, BGP, VPC, Zero Trust, EVPN VXLAN, TLS | CLOUD |
 | 📊 Monitoring & observability | [MONITORING.en.md](MONITORING.en.md) | Prometheus, Grafana, OTel, logging, alerting | — |
 | 🔄 CI/CD & DevOps | [CICD.en.md](CICD.en.md) | Pipelines, GitOps, IaC (Terraform), deployment | — |
+| 🔄 Disaster Recovery | [DR.en.md](DR.en.md) | RTO, RPO, scenarios, prevention, uptime calculation | CLOUD, DATACENTERS, MONITORING |
 | 🗄️ Database architecture | [DATABASES.en.md](DATABASES.en.md) | Classification, sharding, replication, caching | POSTGRESQL, MYSQL, ORACLE, MONGODB, REDIS, CASSANDRA, VECTOR-DBS, DATABASE-ENGINES |
 | 🖥️ Hypervisors | [HYPERVISORS.en.md](HYPERVISORS.en.md) | VMware, Hyper-V, KVM, Proxmox, migration | STORAGE, SERVER-HW |
-| 🏭 Data centers | [DATACENTERS.en.md](DATACENTERS.en.md) | Tier, power, cooling, layout, DC services | MONITORING |
+| 🏭 Data centers | [DATACENTERS.en.md](DATACENTERS.en.md) | Tier, power, cooling, layout, DC services, secondary DC topologies | MONITORING |
 | 💾 Storage | [STORAGE.en.md](STORAGE.en.md) | SAN/NAS/object, RAID, SDS, Ceph | — |
 | 🔌 Server connectivity | [CONNECTIVITY.en.md](CONNECTIVITY.en.md) | Ethernet, FC SAN, iSCSI, NVMe-oF, SAS | — |
 | 🔧 Server hardware | [SERVER-HW.en.md](SERVER-HW.en.md) | CPU, RAM, PCIe, NUMA, BMC | CONNECTIVITY |
@@ -136,6 +138,7 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
 | `DATACENTERS.md` / `DATACENTERS.en.md` | [`MONITORING.md`](MONITORING.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
 | `MONITORING.md` / `MONITORING.en.md` | [`sources/monitoring/sources.md`](sources/monitoring/sources.md) |
 | `CICD.md` / `CICD.en.md` | [`sources/cicd/sources.md`](sources/cicd/sources.md) |
+| `DR.md` / `DR.en.md` | [`CLOUD.md`](CLOUD.md), [`DATACENTERS.md`](DATACENTERS.md), [`MONITORING.md`](MONITORING.md), [`CICD.md`](CICD.md), [`STORAGE.md`](STORAGE.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
 | `PROVISIONING.md` / `PROVISIONING.en.md` | [`CICD.md`](CICD.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
 | `STORAGE.md` / `STORAGE.en.md` | [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
 | `GPU.md` / `GPU.en.md` | [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
--- a/README.md
+++ b/README.md
@@ -52,15 +52,18 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
 | 🌐 Síťová architektura | [NETWORKING.md](NETWORKING.md) | DNS, BGP, VPC, Zero Trust, EVPN VXLAN, TLS | CLOUD |
 | 📊 Monitoring a observabilita | [MONITORING.md](MONITORING.md) | Prometheus, Grafana, OTel, logging, alerting, SLO | — |
 | 🔄 CI/CD a DevOps | [CICD.md](CICD.md) | Pipelines, GitOps, IaC (Terraform), deployment strategie | — |
+| 🔄 Disaster Recovery | [DR.md](DR.md) | RTO, RPO, scénáře, prevence, výpočet uptimu | CLOUD, DATACENTERS, MONITORING |
 | 🗄️ Databázová architektura | [DATABASES.md](DATABASES.md) | Klasifikace, sharding, replikace, caching | POSTGRESQL, MYSQL, ORACLE, MONGODB, REDIS, CASSANDRA, VEKTOROVE-DB, DATABAZOVE-ENGINY |
 | 🖥️ Hypervisory | [HYPERVISORS.md](HYPERVISORS.md) | VMware, Hyper-V, KVM, Proxmox, migrace | STORAGE, SERVER-HW |
-| 🏭 Datová centra | [DATACENTERS.md](DATACENTERS.md) | Tier, power, cooling, layout, DC služby | MONITORING |
+| 🏭 Datová centra | [DATACENTERS.md](DATACENTERS.md) | Tier, power, cooling, layout, DC služby, sekundární DC topologie | MONITORING, MESSAGING |
 | 💾 Storage | [STORAGE.md](STORAGE.md) | SAN/NAS/object, RAID, SDS, Ceph, OpenStack Cinder/Swift/Manila | — |
 | 🔌 Server connectivity | [CONNECTIVITY.md](CONNECTIVITY.md) | Ethernet, FC SAN, iSCSI, NVMe-oF, SAS | — |
 | 🔧 Server hardware | [SERVER-HW.md](SERVER-HW.md) | CPU, RAM, PCIe, NUMA, BMC | CONNECTIVITY |
 | 🎮 GPU | [GPU.md](GPU.md) | NVIDIA/AMD, NVLink, MIG/vGPU, AI, Cyborg | — |
 | ⚙️ Server config | [SERVER-CONFIG.md](SERVER-CONFIG.md) | BIOS tuning, DB/hypervisor/K8s/storage best practices | — |
 | 📦 Provisioning | [PROVISIONING.md](PROVISIONING.md) | PXE, Redfish, Terraform, Ironic, OpenStack deploy | CICD |
+| 📨 Messaging & streaming | [MESSAGING.md](MESSAGING.md) | Kafka, RabbitMQ, Pulsar, NATS, managed queue/pubsub | DATACENTERS, CLOUD |
+| 🏗️ Migrace DC | [DC-MIGRATION.md](DC-MIGRATION.md) | Strategie, fáze, network, DB, rollback | DATACENTERS, CLOUD, DR, NETWORKING, STORAGE |
 | 📋 Původní rozcestník | [HARDWARE.md](HARDWARE.md) | Legacy index → SERVER-HW, GPU, SERVER-CONFIG, PROVISIONING | SERVER-HW, GPU, SERVER-CONFIG, PROVISIONING |
 | 📋 Původní infrastruktura | [INFRASTRUCTURE.md](INFRASTRUCTURE.md) | Legacy index → HYPERVISORS, DATACENTERS, STORAGE, HARDWARE | HYPERVISORS, DATACENTERS, STORAGE, HARDWARE |
 | 📋 Review workflow | [REVIEW.md](REVIEW.md) | Proces oponentury a kontroly obsahu | — |
@@ -89,15 +92,18 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
 | 🌐 Network architecture | [NETWORKING.en.md](NETWORKING.en.md) | DNS, BGP, VPC, Zero Trust, EVPN VXLAN, TLS | CLOUD |
 | 📊 Monitoring & observability | [MONITORING.en.md](MONITORING.en.md) | Prometheus, Grafana, OTel, logging, alerting | — |
 | 🔄 CI/CD & DevOps | [CICD.en.md](CICD.en.md) | Pipelines, GitOps, IaC (Terraform), deployment | — |
+| 🔄 Disaster Recovery | [DR.en.md](DR.en.md) | RTO, RPO, scenarios, prevention, uptime calculation | CLOUD, DATACENTERS, MONITORING |
 | 🗄️ Database architecture | [DATABASES.en.md](DATABASES.en.md) | Classification, sharding, replication, caching | POSTGRESQL, MYSQL, ORACLE, MONGODB, REDIS, CASSANDRA, VECTOR-DBS, DATABASE-ENGINES |
 | 🖥️ Hypervisors | [HYPERVISORS.en.md](HYPERVISORS.en.md) | VMware, Hyper-V, KVM, Proxmox, migration | STORAGE, SERVER-HW |
-| 🏭 Data centers | [DATACENTERS.en.md](DATACENTERS.en.md) | Tier, power, cooling, layout, DC services | MONITORING |
+| 🏭 Data centers | [DATACENTERS.en.md](DATACENTERS.en.md) | Tier, power, cooling, layout, DC services, secondary DC topologies | MONITORING, MESSAGING |
 | 💾 Storage | [STORAGE.en.md](STORAGE.en.md) | SAN/NAS/object, RAID, SDS, Ceph | — |
 | 🔌 Server connectivity | [CONNECTIVITY.en.md](CONNECTIVITY.en.md) | Ethernet, FC SAN, iSCSI, NVMe-oF, SAS | — |
 | 🔧 Server hardware | [SERVER-HW.en.md](SERVER-HW.en.md) | CPU, RAM, PCIe, NUMA, BMC | CONNECTIVITY |
 | 🎮 GPU | [GPU.en.md](GPU.en.md) | NVIDIA/AMD, NVLink, MIG/vGPU, AI, Cyborg | — |
 | ⚙️ Server config | [SERVER-CONFIG.en.md](SERVER-CONFIG.en.md) | BIOS tuning, DB/hypervisor/K8s/storage best practices | — |
 | 📦 Provisioning | [PROVISIONING.en.md](PROVISIONING.en.md) | PXE, Redfish, Terraform, Ironic, OpenStack deploy | CICD |
+| 📨 Messaging & streaming | [MESSAGING.en.md](MESSAGING.en.md) | Kafka, RabbitMQ, Pulsar, NATS, managed queue/pubsub | DATACENTERS, CLOUD |
+| 🏗️ DC Migration | [DC-MIGRATION.en.md](DC-MIGRATION.en.md) | Strategies, phases, network, DB, rollback | DATACENTERS, CLOUD, DR, NETWORKING, STORAGE |
 | 📋 Legacy index | [HARDWARE.en.md](HARDWARE.en.md) | → SERVER-HW, GPU, SERVER-CONFIG, PROVISIONING | SERVER-HW, GPU, SERVER-CONFIG, PROVISIONING |
 | 📋 Legacy infra | [INFRASTRUCTURE.en.md](INFRASTRUCTURE.en.md) | → HYPERVISORS, DATACENTERS, STORAGE, HARDWARE | HYPERVISORS, DATACENTERS, STORAGE, HARDWARE |
 | 📋 Review workflow | [REVIEW.en.md](REVIEW.en.md) | Review and content control process | — |
@@ -136,6 +142,9 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
 | `DATACENTERS.md` / `DATACENTERS.en.md` | [`MONITORING.md`](MONITORING.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
 | `MONITORING.md` / `MONITORING.en.md` | [`sources/monitoring/sources.md`](sources/monitoring/sources.md) |
 | `CICD.md` / `CICD.en.md` | [`sources/cicd/sources.md`](sources/cicd/sources.md) |
+| `DR.md` / `DR.en.md` | [`CLOUD.md`](CLOUD.md), [`DATACENTERS.md`](DATACENTERS.md), [`MONITORING.md`](MONITORING.md), [`CICD.md`](CICD.md), [`STORAGE.md`](STORAGE.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
+| `MESSAGING.md` / `MESSAGING.en.md` | [`DATACENTERS.md`](DATACENTERS.md), [`CLOUD.md`](CLOUD.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
+| `DC-MIGRATION.md` / `DC-MIGRATION.en.md` | [`DATACENTERS.md`](DATACENTERS.md), [`CLOUD.md`](CLOUD.md), [`DR.md`](DR.md), [`NETWORKING.md`](NETWORKING.md), [`STORAGE.md`](STORAGE.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
 | `PROVISIONING.md` / `PROVISIONING.en.md` | [`CICD.md`](CICD.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
 | `STORAGE.md` / `STORAGE.en.md` | [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
 | `GPU.md` / `GPU.en.md` | [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
@@ -187,4 +196,4 @@ Raw referenční data (dokumentace, knihy, standardy) podle oblastí:

 ---

-*Rozcestník je automaticky udržován agentem `kb-index`. Poslední aktualizace: 2026-06-11.*
+*Rozcestník je automaticky udržován agentem `kb-index`. Poslední aktualizace: 2026-06-12.*
--- a/sources/infrastructure/sources.md
+++ b/sources/infrastructure/sources.md
@@ -111,7 +111,22 @@ Rozděleno do samostatných souborů:
 | VMware Migration in 2026: Proxmox, KVM, XCP-ng & Veeam — StarWind | https://starwindsoftware.com/blog/vmware-migration-to-proxmox-kvm-xcp-ng-2026 | `[done]` |
 | Complete guide to modern vSphere alternatives — Spectro Cloud | https://www.spectrocloud.com/blog/vsphere-alternatives | `[done]` |
 | Broadcom VMware Acquisition: What's Next — Sayers | https://www.sayers.com/blog/after-the-deal-whats-next-for-vmware-customers | `[done]` |
-| Stanford University migration from VMware to Proxmox | https://itcommunity.stanford.edu/news/enterprise-technology-completes-successful-virtual-infrastructure-migration-vmware-proxmox | `[done]` |
+ | Stanford University migration from VMware to Proxmox | https://itcommunity.stanford.edu/news/enterprise-technology-completes-successful-virtual-infrastructure-migration-vmware-proxmox | `[done]` |
+| | **Messaging / streaming** | |
+| Apache Kafka docs | https://kafka.apache.org/documentation/ | `[done]` |
+| RabbitMQ docs | https://www.rabbitmq.com/documentation.html | `[done]` |
+| Apache Pulsar docs | https://pulsar.apache.org/docs/ | `[done]` |
+| NATS docs | https://docs.nats.io/ | `[done]` |
+| Designing Event-Driven Systems (Confluent) | https://www.confluent.io/designing-event-driven-systems/ | `[done]` |
+| Kafka: The Definitive Guide (2nd ed.) — Confluent | https://www.confluent.io/resources/kafka-the-definitive-guide/ | `[done]` |
+| Enterprise Integration Patterns — Hohpe & Woolf | https://www.enterpriseintegrationpatterns.com/ | `[done]` |
+| | **DC migration** | |
+| AWS Cloud Migration — 6 Strategies for Migrating to the Cloud | https://aws.amazon.com/blogs/enterprise-strategy/6-strategies-for-migrating-applications-to-the-cloud/ | `[done]` |
+| Azure Cloud Migration — Microsoft Cloud Adoption Framework | https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/ | `[done]` |
+| Gartner 5 Rs of Cloud Migration | https://www.gartner.com/en/documents/3984835 | `[done]` |
+| VMware Site Recovery Manager — documentation | https://docs.vmware.com/en/Site-Recovery-Manager/ | `[done]` |
+| Zerto — Disaster Recovery & Migration | https://www.zerto.com/resources/ | `[done]` |
+| The Phoenix Project — IT Ops & Migration patterns | https://itrevolution.com/product/the-phoenix-project/ | `[done]` |

 ## Výrobci hardware