From b53714113ca99b4b6eced507461309dc3c9bf35d Mon Sep 17 00:00:00 2001 From: Stanislav Hubacek Date: Tue, 16 Jun 2026 15:47:45 +0200 Subject: [PATCH] new files --- DATACENTERS.en.md | 275 ++++++++++++++++++++++++ DATACENTERS.md | 277 +++++++++++++++++++++++- DC-MIGRATION.en.md | 246 ++++++++++++++++++++++ DC-MIGRATION.md | 246 ++++++++++++++++++++++ DR.en.md | 336 ++++++++++++++++++++++++++++++ DR.md | 336 ++++++++++++++++++++++++++++++ MESSAGING.en.md | 275 ++++++++++++++++++++++++ MESSAGING.md | 275 ++++++++++++++++++++++++ README.en.md | 7 +- README.md | 15 +- sources/infrastructure/sources.md | 17 +- 11 files changed, 2298 insertions(+), 7 deletions(-) create mode 100644 DC-MIGRATION.en.md create mode 100644 DC-MIGRATION.md create mode 100644 DR.en.md create mode 100644 DR.md create mode 100644 MESSAGING.en.md create mode 100644 MESSAGING.md diff --git a/DATACENTERS.en.md b/DATACENTERS.en.md index 5eee5b8..bbb732a 100644 --- a/DATACENTERS.en.md +++ b/DATACENTERS.en.md @@ -658,6 +658,281 @@ flowchart TD CLIM -->|"Cold (SE, NO)"| FC3["Free cooling 7000+ h/year
Air-side economizer
PUE < 1.2"] ``` +## Secondary data center topologies + +When planning a second DC, the choice of topology is key based on distance, RPO/RTO, and budget. + +### Distance classification + +| Category | Distance | Latency (round-trip) | Use case | +|-----------|-----------|---------------------|----------| +| **Metro (Campus)** | 1–20 km | < 1 ms | Synchronous replication, stretched cluster | +| **Metro** | 20–100 km | 1–5 ms | Metro cluster, mostly sync replication | +| **Regional** | 100–500 km | 5–20 ms | Asynchronous replication, warm standby | +| **Continent** | 500–3000 km | 20–100 ms | Asynchronous replication, cold standby | +| **Global** | 3000+ km | > 100 ms | Async only, no real-time dependencies | + +### Topologies by operational mode + +#### Active-Active (Hot-Hot) + +``` +DC-A (Primary) DC-B (Active) +┌────────────────────┐ ┌────────────────────┐ +│ App Active │ │ App Active │ +│ DB Active │◄─sync─►│ DB Active │ +│ Users → LB → A │ │ Users → LB → B │ +└────────────────────┘ └────────────────────┘ + │ │ + └──── Global Load Balancer ────┘ +``` + +| Parameter | Value | +|----------|---------| +| **RTO** | 0–seconds (automatic failover, traffic is redirected) | +| **RPO** | 0 (sync replication, commit is confirmed only after write to both DCs) | +| **Max distance** | < 100 km (latency < 5 ms RTT for sync DB replication) | +| **Operating costs** | 2× (both DCs fully active, both fully equipped) | +| **Advantages** | Zero downtime, instant switchover, full utilization of both DCs | +| **Disadvantages** | Requires synchronous replication → distance limit, complex networking, split-brain risk | + +**Split-brain solutions**: STONITH (Shoot The Other Node In The Head), watchdog, quorum (3rd node in 3rd location / cloud), fencing, SCSI-3 persistent reservation. + +**Use case**: Financial services, telco, payment gateways — where even a minute of downtime = millions. 
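
The distance limits above follow largely from propagation delay in fiber. A minimal sketch of that arithmetic, assuming light travels at about 200,000 km/s in optical fiber (roughly 5 µs per km one way): the function names and the `route_factor` parameter are hypothetical, and switching, DWDM and protocol overhead are deliberately left out. The tighter < 100 km limit in the tables presumably leaves headroom for that overhead.

```python
"""Back-of-the-envelope fiber latency estimate for a DC interconnect."""

# Assumption: light travels at roughly 200,000 km/s in optical fiber
# (about two thirds of c), i.e. about 5 microseconds per km one way.
FIBER_SPEED_KM_PER_S = 200_000.0


def fiber_rtt_ms(distance_km: float, route_factor: float = 1.0) -> float:
    """Return the propagation-only round-trip time in milliseconds.

    route_factor is a hypothetical multiplier for fiber routes that are
    longer than the straight-line distance (1.0 means a straight line).
    Equipment and protocol overhead are not included.
    """
    if distance_km < 0 or route_factor < 1.0:
        raise ValueError("distance_km must be >= 0 and route_factor >= 1.0")
    one_way_s = distance_km * route_factor / FIBER_SPEED_KM_PER_S
    return 2 * one_way_s * 1000.0


def fits_sync_budget(distance_km: float, budget_ms: float = 5.0,
                     route_factor: float = 1.0) -> bool:
    """Check the propagation RTT against a sync-replication RTT budget."""
    return fiber_rtt_ms(distance_km, route_factor) < budget_ms


def max_serial_commits_per_s(rtt_ms: float) -> float:
    """Upper bound for one session committing serially.

    Each synchronous commit waits at least one round trip, so a single
    session cannot exceed 1000 / rtt_ms commits per second.
    """
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    return 1000.0 / rtt_ms


if __name__ == "__main__":
    for km in (20, 100, 1000, 3000):
        rtt = fiber_rtt_ms(km)
        print(f"{km:>5} km  RTT ~{rtt:.2f} ms  "
              f"within 5 ms budget: {fits_sync_budget(km)}  "
              f"serial commits/s <= {max_serial_commits_per_s(rtt):.0f}")
```

Real links add latency on top of this estimate, so the result is a lower bound; measure the actual RTT on the planned path before committing to synchronous replication.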
+ +#### Active-Passive (Hot-Warm, MetroCluster) + +``` +DC-A (Primary) DC-B (Standby) +┌────────────────────┐ ┌────────────────────┐ +│ App Active │ │ App Standby │ +│ DB Primary │──sync──►│ DB Standby │ +│ Users → LB → A │ │ ~~~ (waiting) ~~~ │ +│ DNS: A-record │ │ DNS: health check │ +└────────────────────┘ └────────────────────┘ +``` + +| Parameter | Value | +|----------|---------| +| **RTO** | tens of seconds–minutes (DNS failover + App startup) | +| **RPO** | 0 (sync) or seconds (async) | +| **Max distance** | sync < 100 km, async unlimited | +| **Operating costs** | 1.5–1.8× (second DC has reduced or idle compute) | +| **MetroCluster** | Specific implementation: FC SAN over DWDM, sync mirror, automatic failover | + +**MetroCluster** (NetApp, Dell EMC, HPE): +- Storage-based cluster with synchronous mirroring between DCs +- Automatic failover on entire DC failure +- Requires dedicated DWDM or dark fiber interconnection +- Typical distance: up to 50 km (for latency < 1 ms RTT) +- Use case: enterprise storage, primary+secondary DC in metropolitan area + +#### Hot-Cold (Warm Standby → Cold) + +``` +DC-A (Primary) DC-B (Cold Standby) +┌────────────────────┐ ┌────────────────────┐ +│ App Active │ │ ~~~ powered off ~~~│ +│ DB Active │──async─►│ Backup storage │ +│ Users → A │ │ ~~~ no compute ~~~│ +└────────────────────┘ └────────────────────┘ +``` + +| Parameter | Value | +|----------|---------| +| **RTO** | hours–days (purchase/rent HW, restore from backup) | +| **RPO** | hours (last backup) | +| **Max distance** | unlimited | +| **Operating costs** | 1.1–1.3× (only storage and facility, compute only at failover) | +| **Typical use case** | Low-cost DR, compliance, last resort | + +#### Pilot Light + +``` +DC-A (Primary) DC-B (Pilot Light) +┌────────────────────┐ ┌────────────────────┐ +│ App Active │ │ ~~~ off ~~~ │ +│ DB Active │──async─►│ DB replica (mini) │ +│ All services │ │ Core services only│ +│ │ │ (DNS, LDAP, mon) │ +└────────────────────┘ 
└────────────────────┘ + On DR: spin-up compute + from IaC, rest from backup +``` + +- DC-B runs with minimum compute (only core services and DB replica) +- Application layer is spun up from IaC (Terraform, Ansible) only during DR +- Compromise between cost and RTO + +### Comparison table + +| Topology | RTO | RPO | Cost (× primary) | Max distance | Failover | +|-----------|-----|-----|-------------------|-------------|----------| +| **Active-Active** | 0–s | 0 | 2.0× | < 100 km | Auto (traffic) | +| **MetroCluster** | s–min | 0 | 1.8–2.0× | < 50 km | Auto (storage) | +| **Active-Passive (sync)** | min | 0 | 1.5–1.8× | < 100 km | Semi-auto | +| **Active-Passive (async)** | min–h | s–min | 1.3–1.5× | unlimited | Semi-auto | +| **Pilot Light** | h | min–h | 1.2–1.4× | unlimited | Manual | +| **Warm Standby** | min–h | s–min | 1.5–1.8× | unlimited | Semi-auto | +| **Cold Standby** | days | h | 1.1–1.3× | unlimited | Manual | + +### Stretched Cluster + +``` +┌──── Site A (50 km) ────┐ ┌──── Site B ──────────┐ +│ ┌──────────────────┐ │ │ ┌──────────────────┐ │ +│ │ ESXi / Hyper-V │ │ │ │ ESXi / Hyper-V │ │ +│ │ VM │ │ │ │ VM (complement) │ │ +│ └────────┬─────────┘ │ │ └────────┬─────────┘ │ +│ │ │ │ │ │ +│ ┌────────▼─────────┐ │ │ ┌────────▼─────────┐ │ +│ │ Storage (SAN) │──┼────┼──│ Storage (SAN) │ │ +│ │ MetroCluster │ │ │ │ MetroCluster │ │ +│ └──────────────────┘ │ │ └──────────────────┘ │ +└────────────────────────┘ └────────────────────────┘ + │ + ┌─────▼──────┐ + │ vCenter / │ + │ Cluster │ + │ (single) │ + └────────────┘ +``` + +- One cluster stretched across two sites (single management domain) +- VMs can live-migrate between sites (vMotion over distance) +- Storage synchronously mirrored (MetroCluster, VPLEX, vSAN Stretched Cluster) +- **Requirements**: dark fiber / DWDM, low latency (< 5 ms), high link reliability +- **Risks**: split-brain when the inter-site link is partitioned, network dependency +- **Use case**: enterprise with own dark fiber between two DCs in a metropolitan 
area + +### Decision tree + +```mermaid +flowchart TD + Start(["Secondary DC"]) --> RPO{"Required RPO?"} + RPO -->|"0 (no data loss)"| SYNC{"Sync replication possible?"} + SYNC -->|"Yes, < 100 km"| ACT{"Want zero downtime?"} + ACT -->|"Yes"| AA["Active-Active
RTO=0, RPO=0, 2× cost"] + ACT -->|"No"| AP["Active-Passive
RTO=min, RPO=0, 1.5×"] + SYNC -->|"No, > 100 km"| ASYNC["Active-Passive (async)
RTO=min, RPO=s, 1.3×"] + + RPO -->|"minutes–hours"| WARM{"Want fast failover?"} + WARM -->|"Yes"| PILOT["Pilot Light
RTO=h, RPO=min, 1.2×"] + WARM -->|"No"| COLD["Cold Standby
RTO=days, RPO=h, 1.1×"] + + Start --> DIST{"Distance between DCs"} + DIST -->|"< 50 km, own fiber"| MC["MetroCluster / Stretched Cluster
Single management, sync storage"] + DIST -->|"50–300 km"| REG["Regional DR
Active-Passive, async replication"] + DIST -->|"> 300 km"| GLOBAL["Global DR
Cold standby, backup & restore"] +``` + +### Physical infrastructure for DC interconnection + +| Technology | Bandwidth | Max distance | Latency | Use case | +|------------|-----------|-------------|---------|----------| +| **Dark fiber** | 100 GbE–800 GbE | 10–80 km (single-mode) | < 0.1 ms | MetroCluster, stretched cluster | +| **DWDM** | 400 GbE–1.6 TbE (per lambda) | 80–120 km (without amplifier) | < 0.5 ms | Metro, metro cluster | +| **CWDM** | 10–25 GbE (per channel) | 10–40 km | < 0.3 ms | Campus, smaller metro | +| **MPLS L2VPN** | 10–100 GbE | unlimited | 1–10 ms | Regional DR, async replication | +| **Internet IPsec** | 1–10 GbE | unlimited | 5–50 ms | Cold standby, backup | + +### Impact of individual technologies on DC topology selection + +Choosing a secondary DC topology is not purely an infrastructure decision — each layer (DB, hypervisor, orchestration, messaging) brings its own constraints. + +#### Databases + +| DB technology | Sync replication | Max distance | Auto-failover | Split-brain handling | Note | +|---------------|---------------|-------------|---------------|-------------------|----------| +| **PostgreSQL** | Synchronous commit (synchronous_standby_names) | < 100 km (latency < 10 ms) | Patroni / repmgr + etcd | Quorum (etcd, 3+ nodes) | Streaming replication, needs wal_keep_size (wal_keep_segments before PostgreSQL 13) | +| **MySQL** | Group Replication (multi-primary, single-primary) | < 100 km | MySQL InnoDB Cluster + MySQL Router | Paxos (Group Replication, 3+ nodes) | Semi-sync as compromise | +| **Oracle** | Data Guard (SYNC/FASTSYNC/ASYNC), RAC extended | sync < 100 km, async unlimited | Data Guard Broker / FSFO (Fast-Start Failover) | Observer (3rd node) | Far Sync for remote DCs | +| **MSSQL** | AlwaysOn Availability Groups (SYNCHRONOUS_COMMIT) | < 100 km | AlwaysOn + Cluster quorum | File share majority / cloud witness | Multi-site cluster support | +| **MongoDB** | Majority write concern + journaling | < 100 km | Replica set auto-election | Arbiter (voting 
member) | Priority-based failover | +| **Cassandra** | N/A (multi-master, eventual consistency) | unlimited | Yes (peer-to-peer) | None (multi-master, gossip protocol) | Snitch-aware topology, NetworkTopologyStrategy | +| **Redis** | Redis Sentinel / Redis Cluster (async) | unlimited (async) | Sentinel / Cluster failover | Quorum (Sentinel, majority) | PSYNC replication, replication lag | + +Key limitation for **sync replication**: latency < 5 ms RTT (commit must wait for confirmation from both DCs). At 100 km the RTT is ~1 ms, which is acceptable. At 1000 km (~10 ms RTT) every commit waits for that round trip, and sync replication can reduce transaction throughput by 80+ %. + +Suitable for **Active-Active**: +- **Cassandra / ScyllaDB** — native multi-DC, eventual consistency, no split-brain +- **MySQL Group Replication (multi-primary)** — 3+ DCs for quorum +- **CockroachDB / TiDB** — native multi-region, ACID across DCs +- **Redis Enterprise** — Active-Active (CRDT-based) + +Suitable for **Active-Passive**: +- **PostgreSQL + Patroni** — auto-failover, etcd quorum +- **Oracle Data Guard** — FSFO, far sync for remote DCs +- **MSSQL AlwaysOn** — cloud witness +- **MongoDB Replica Set** — arbiter in a 3rd location + +#### Hypervisors + +| Hypervisor | Cluster technology | Stretched cluster | Max distance | Split-brain | +|-----------|-------------------|-------------------|-------------|-------------| +| **VMware vSphere** | vSAN Stretched Cluster, Metro vCenter, Site Recovery Manager | Yes (vSAN Stretched Cluster, Metro Cluster) | < 50 km (vSAN Stretched Cluster), < 10 ms RTT | Fencing (STONITH), witness host | +| **Hyper-V** | Storage Replica + Failover Cluster | Yes (Cluster Sets) | < 50 km (sync), unlimited (async) | File share witness / cloud witness | +| **Proxmox VE** | Proxmox HA + Ceph | Limited (Ceph stretch cluster) | < 50 km (Ceph sync) | Ceph monitor quorum (3+ DC) | +| **XCP-ng / XenServer** | Xen Orchestra HA + SR (Storage Repository) replication | Limited | depends on storage replication | — | +| **Nutanix AHV** | Metro Availability (sync), Async DR | Yes 
(Metro) | < 100 km (sync), unlimited (async) | Witness VM (cloud / 3rd site) | +| **KVM / oVirt** | oVirt HA + GlusterFS / NFS | Limited | depends on storage replication | — | + +**vSAN Stretched Cluster specific requirements:** +- Dedicated vSAN network (25 GbE min., < 5 ms RTT) +- Witness host in 3rd location (or cloud witness) +- VM storage policy on all VMs (FTT=1, mirroring, striping) +- Storage policy: `site-A + site-B + witness` + +#### Kubernetes and container platforms + +| Platform | Multi-cluster DR | Replication | Max distance | Failover | +|-----------|-----------------|-----------|-------------|----------| +| **Vanilla K8s** | KubeFed, Cluster API, Velero + Restic | Velero (backup/restore), Rook (Ceph) | unlimited | Manual (Velero restore) | +| **OpenShift** | ACM (Advanced Cluster Management), Velero | OADP (OpenShift API for Data Protection) | unlimited | ACM failover (subscription) | +| **Rancher** | Rancher Multi-Cluster App, Velero | Longhorn (sync/async DR), Velero | unlimited | Semi-auto | +| **Google GKE** | Multi-cluster Services, Backup for GKE | Config Sync, Backup for GKE | unlimited | Manual | +| **Azure AKS** | Azure Arc + Velero + Azure Traffic Manager | AKS backup (Velero), Azure Site Recovery | unlimited | Manual (Velero) | +| **AWS EKS** | EKS multi-cluster, Velero + S3 cross-region | Velero (S3), Rook (EBS snapshots) | unlimited | Manual | + +**Key K8s DR principles:** +- **Applications must be stateless** (or state externalized to DB/storage) +- **Velero** — backup/restore of the entire cluster (PV, resources, Helm releases) +- **Rook/Ceph** — cross-region mirroring of RBD volumes +- **KubeFed / ACM** — subscription-based deployment to multiple clusters +- **Ingress/Gateway API** — traffic routing between clusters +- **External DNS** — DNS failover on cluster outage + +#### Messaging / streaming + +| Platform | Replication | Topology | DR support | Max distance | +|-----------|-----------|-----------|------------|-------------| +| **Apache Kafka** | MirrorMaker 2, Confluent 
Cluster Linking, KRaft quorum | Active-Passive (MM2), Active-Active (Cluster Linking) | MM2: async, Cluster Linking: async | unlimited | +| **RabbitMQ** | Classic Queue Mirroring, Quorum Queues | Active-Passive (Warm Standby) | Federation / Shovel (async) | unlimited | +| **Red Hat AMQ** | (Artemis) Cluster + HA | Active-Passive (shared store / replication) | Live-backup pair | < 100 km (sync) | +| **NATS** | NATS JetStream (cluster + cross-account) | Active-Active (Leaf nodes, cross-account) | Super-cluster, failover | unlimited | +| **Apache Pulsar** | BookKeeper (bookie rack-aware), geo-replication | Active-Active (geo-replication) | Built-in (cluster-level) | unlimited (async) | +| **AWS SQS/SNS** | Managed, AWS region pairs | Active-Active (multi-region) | Built-in (AWS managed) | unlimited | +| **Azure Service Bus** | Managed, paired region | Active-Passive (paired region) | Built-in (geo-recovery) | unlimited | +| **Oracle Service Bus (OSB)** | Oracle WebLogic Cluster + JDBC store + AQ | Active-Passive (WebLogic Cluster + Data Guard) | OSB/WLS cluster + Oracle RAC/Data Guard sync | < 100 km (Data Guard sync), unlimited (async) | + +**Messaging DR recommendations:** +- **Kafka**: use Cluster Linking for Active-Active, or MirrorMaker 2 for Active-Passive; replicate only critical topics +- **RabbitMQ**: Quorum Queues + Federation upstream for DR; avoid Classic Queue Mirroring (deprecated) +- **Pulsar**: native geo-replication, bookie rack-aware for stretched cluster; easiest DR among messaging platforms +- **OSB**: WebLogic cluster + Oracle RAC/Data Guard; DR depends on DB layer, not on OSB itself + +### Per-layer limitations summary table + +| Layer | Limiting factor for secondary DC | Max distance for sync | Impact on topology selection | +|--------|-----------------------------------|----------------------|--------------------------| +| **Storage** | Sync mirror latency, DWDM cost | < 50 km (MetroCluster) | Stretched cluster only in metro | +| **Databases** 
| Commit wait for sync replication | < 100 km (5 ms RTT) | Active-Active only with multi-master DB | +| **Hypervisor** | Stretched cluster quorum + fencing | < 50 km (vSAN, 5 ms) | MetroCluster / stretched cluster | +| **Kubernetes** | Velero restore time, Rook mirror latency | unlimited (async) | Active-Passive, cold standby | +| **Messaging** | Replication lag, offset management | unlimited (async) | Active-Active (Kafka, Pulsar, NATS) or Active-Passive | +| **Network** | Dark fiber/DWDM cost, latency | < 100 km (metro fiber) | Limits sync replication options | +| **Application** | Stateful/stateless, connection draining | depends on architecture | Stateless app → any topology | + ## Disk monitoring — S.M.A.R.T. Self-Monitoring, Analysis and Reporting Technology — predictive monitoring of HDD/SSD. diff --git a/DATACENTERS.md b/DATACENTERS.md index 00dd318..8c97401 100644 --- a/DATACENTERS.md +++ b/DATACENTERS.md @@ -658,6 +658,281 @@ flowchart TD CLIM -->|"Chladná (SE, NO)"| FC3["Free cooling 7000+ h/rok
Air-side economizer
PUE < 1.2"] ``` +## Topologie sekundárního datového centra + +Při plánování druhého DC je klíčová volba topologie podle vzdálenosti, RPO/RTO a rozpočtu. + +### Klasifikace vzdáleností + +| Kategorie | Vzdálenost | Latence (round-trip) | Use case | +|-----------|-----------|---------------------|----------| +| **Metro (Campus)** | 1–20 km | < 1 ms | Synchronní replikace, stretched cluster | +| **Metro** | 20–100 km | 1–5 ms | Metro cluster, většinou sync replikace | +| **Regional** | 100–500 km | 5–20 ms | Asynchronní replikace, warm standby | +| **Continent** | 500–3000 km | 20–100 ms | Asynchronní replikace, cold standby | +| **Global** | 3000+ km | > 100 ms | Pouze async, žádné real-time závislosti | + +### Topologie podle provozního režimu + +#### Active-Active (Hot-Hot) + +``` +DC-A (Primary) DC-B (Active) +┌────────────────────┐ ┌────────────────────┐ +│ App Active │ │ App Active │ +│ DB Active │◄─sync─►│ DB Active │ +│ Users → LB → A │ │ Users → LB → B │ +└────────────────────┘ └────────────────────┘ + │ │ + └──── Global Load Balancer ────┘ +``` + +| Parametr | Hodnota | +|----------|---------| +| **RTO** | 0–vteřiny (automatický failover, traffic se přesměruje) | +| **RPO** | 0 (sync replikace, commit je potvrzen až po zápisu do obou DC) | +| **Max distance** | < 100 km (latence < 5 ms RTT pro sync DB replikaci) | +| **Provozní náklady** | 2× (obě DC plně aktivní, obě plně vybavené) | +| **Výhody** | Nulový výpadek, okamžité přepnutí, plné využití obou DC | +| **Nevýhody** | Nutná synchronní replikace → limit vzdálenosti, komplexní networking, split-brain risk | + +**Split-brain řešení**: STONITH (Shoot The Other Node In The Head), watchdog, quorum (3. node v 3. lokaci / cloud), fencing, SCSI-3 persistent reservation. + +**Use case**: Finanční služby, telco, platební brány — kde i minuta výpadku = miliony. 
+ +#### Active-Passive (Hot-Warm, MetroCluster) + +``` +DC-A (Primary) DC-B (Standby) +┌────────────────────┐ ┌────────────────────┐ +│ App Active │ │ App Standby │ +│ DB Primary │──sync──►│ DB Standby │ +│ Users → LB → A │ │ ~~~ (čeká) ~~~ │ +│ DNS: A-record │ │ DNS: health check │ +└────────────────────┘ └────────────────────┘ +``` + +| Parametr | Hodnota | +|----------|---------| +| **RTO** | desítky vteřin–minuty (DNS failover + startup App) | +| **RPO** | 0 (sync) nebo sekundy (async) | +| **Max distance** | sync < 100 km, async neomezeně | +| **Provozní náklady** | 1,5–1,8× (druhé DC má zmenšený nebo idle compute) | +| **MetroCluster** | Specifická implementace: FC SAN přes DWDM, sync mirror, automatický failover | + +**MetroCluster** (NetApp, Dell EMC, HPE): +- Storage-based cluster se synchronním mirroringem mezi DC +- Automatic failover při selhání celého DC +- Vyžaduje dedikované DWDM nebo dark fiber propojení +- Typická vzdálenost: do 50 km (pro latenci < 1 ms RTT) +- Use case: enterprise storage, primary+secondary DC v metropolitní oblasti + +#### Hot-Cold (Warm Standby → Cold) + +``` +DC-A (Primary) DC-B (Cold Standby) +┌────────────────────┐ ┌────────────────────┐ +│ App Active │ │ ~~~ powered off ~~~│ +│ DB Active │──async─►│ Backup storage │ +│ Users → A │ │ ~~~ no compute ~~~│ +└────────────────────┘ └────────────────────┘ +``` + +| Parametr | Hodnota | +|----------|---------| +| **RTO** | hodiny–dny (nákup/najmutí HW, obnova z backupu) | +| **RPO** | hodiny (poslední backup) | +| **Max distance** | neomezena | +| **Provozní náklady** | 1,1–1,3× (jen storage a facility, compute až při failoveru) | +| **Typ use case** | Low-cost DR, compliance, poslední záchrana | + +#### Pilot Light + +``` +DC-A (Primary) DC-B (Pilot Light) +┌────────────────────┐ ┌────────────────────┐ +│ App Active │ │ ~~~ off ~~~ │ +│ DB Active │──async─►│ DB replica (mini) │ +│ Všechny služby │ │ Core services jen │ +│ │ │ (DNS, LDAP, mon) │ +└────────────────────┘ 
└────────────────────┘ + Při DR: spin-up compute + z IaC, zbytek z backupu +``` + +- DC-B běží s minimem compute (jen core služby a DB replica) +- Aplikační vrstva se spin-up z IaC (Terraform, Ansible) až při DR +- Kompromis mezi náklady a RTO + +### Srovnávací tabulka + +| Topologie | RTO | RPO | Náklady (× primár) | Max distance | Failover | +|-----------|-----|-----|-------------------|-------------|----------| +| **Active-Active** | 0–s | 0 | 2,0× | < 100 km | Auto (traffic) | +| **MetroCluster** | s–min | 0 | 1,8–2,0× | < 50 km | Auto (storage) | +| **Active-Passive (sync)** | min | 0 | 1,5–1,8× | < 100 km | Polo-auto | +| **Active-Passive (async)** | min–h | s–min | 1,3–1,5× | neomezena | Polo-auto | +| **Pilot Light** | h | min–h | 1,2–1,4× | neomezena | Manuální | +| **Warm Standby** | min–h | s–min | 1,5–1,8× | neomezena | Polo-auto | +| **Cold Standby** | dny | h | 1,1–1,3× | neomezena | Manuální | + +### Stretched Cluster + +``` +┌──── Site A (50 km) ────┐ ┌──── Site B ──────────┐ +│ ┌──────────────────┐ │ │ ┌──────────────────┐ │ +│ │ ESXi / Hyper-V │ │ │ │ ESXi / Hyper-V │ │ +│ │ VM │ │ │ │ VM (komplement) │ │ +│ └────────┬─────────┘ │ │ └────────┬─────────┘ │ +│ │ │ │ │ │ +│ ┌────────▼─────────┐ │ │ ┌────────▼─────────┐ │ +│ │ Storage (SAN) │──┼────┼──│ Storage (SAN) │ │ +│ │ MetroCluster │ │ │ │ MetroCluster │ │ +│ └──────────────────┘ │ │ └──────────────────┘ │ +└────────────────────────┘ └────────────────────────┘ + │ + ┌─────▼──────┐ + │ vCenter / │ + │ Cluster │ + │ (single) │ + └────────────┘ +``` + +- Jeden cluster roztažený přes dvě lokality (single management domain) +- VM mohou live-migrovat mezi site (vMotion nad vzdálenost) +- Storage synchronně mirrorovaná (MetroCluster, VPLEX, vSAN延伸) +- **Požadavky**: dark fiber / DWDM, nízká latence (< 5 ms), vysoká spolehlivost linky +- **Riziko**: split-brain, brain drain (split-site cluster), závislost na síti +- **Use case**: enterprise s vlastní dark fiber mezi dvěma DC v metropolitní oblasti + 
+### Rozhodovací strom + +```mermaid +flowchart TD + Start(["Sekundární DC"]) --> RPO{"Požadované RPO?"} + RPO -->|"0 (žádná ztráta dat)"| SYNC{"Sync replikace možná?"} + SYNC -->|"Ano, < 100 km"| ACT{"Chceš nulový výpadek?"} + ACT -->|"Ano"| AA["Active-Active
RTO=0, RPO=0, 2× náklady"] + ACT -->|"Ne"| AP["Active-Passive
RTO=min, RPO=0, 1,5×"] + SYNC -->|"Ne, > 100 km"| ASYNC["Active-Passive (async)
RTO=min, RPO=s, 1,3×"] + + RPO -->|"minuty–hodiny"| WARM{"Chceš rychlý failover?"} + WARM -->|"Ano"| PILOT["Pilot Light
RTO=h, RPO=min, 1,2×"] + WARM -->|"Ne"| COLD["Cold Standby
RTO=dny, RPO=h, 1,1×"] + + Start --> DIST{"Vzdálenost mezi DC"} + DIST -->|"< 50 km, vlastní fiber"| MC["MetroCluster / Stretched Cluster
Single management, sync storage"] + DIST -->|"50–300 km"| REG["Regionální DR
Active-Passive, async replikace"] + DIST -->|"> 300 km"| GLOBAL["Globální DR
Cold standby, backup & restore"] +``` + +### Fyzická infrastruktura pro propojení DC + +| Technologie | Bandwidth | Max distance | Latence | Use case | +|------------|-----------|-------------|---------|----------| +| **Dark fiber** | 100 GbE–800 GbE | 10–80 km (single-mode) | < 0,1 ms | MetroCluster, stretched cluster | +| **DWDM** | 400 GbE–1,6 TbE (per lambda) | 80–120 km (bez zesilovače) | < 0,5 ms | Metro, metro cluster | +| **CWDM** | 10–25 GbE (per channel) | 10–40 km | < 0,3 ms | Campus, menší metro | +| **MPLS L2VPN** | 10–100 GbE | neomezena | 1–10 ms | Regional DR, async replikace | +| **Internet IPsec** | 1–10 GbE | neomezena | 5–50 ms | Cold standby, backup | + +### Vliv jednotlivých technologií na výběr DC topologie + +Volba topologie sekundárního DC není čistě infrastrukturní rozhodnutí — každá vrstva (DB, hypervisor, orchestrace, messaging) přináší vlastní omezení. + +#### Databáze + +| DB technologie | Sync replikace | Max distance | Auto-failover | Split-brain řešení | Poznámka | +|---------------|---------------|-------------|---------------|-------------------|----------| +| **PostgreSQL** | Synchronous commit (synchronous_standby_names) | < 100 km (latence < 10 ms) | Patroni / repmgr + etcd | Quorum (etcd, 3+ node) | Streaming replication, nutné wal_keep_segments | +| **MySQL** | Group Replication (multi-primary, single-primary) | < 100 km | MySQL InnoDB Cluster + MySQL Router | Paxos (Group Replication, 3+ node) | Semi-sync jako kompromis | +| **Oracle** | Data Guard (SYNC/FASTSYNC/ASYNC), RAC extended | sync < 100 km, async neomezena | Data Guard Broker / FSFO (Fast Start Failover) | Observer (3. 
node) | Far Sync pro vzdálená DC | +| **MSSQL** | AlwaysOn Availability Groups (SYNCHRONOUS_COMMIT) | < 100 km | AlwaysOn + Cluster quorum | File share majority / cloud witness | Multi-site cluster podpora | +| **MongoDB** | Majority write concern + journaling | < 100 km | Replica set auto-election | Arbitration node (voting member) | Priority-based failover | +| **Cassandra** | N/A (multi-master, eventual consistency) | neomezena | Ano (peer-to-peer) | Žádné (multi-master, gossip protokol) | Snitch-aware topologie, NetworkTopologyStrategy | +| **Redis** | Redis Sentinel / Redis Cluster (async) | neomezena (async) | Sentinel / Cluster failover | Quorum (Sentinel, majority) | PSYNC replikace, replication lag | + +Klíčové omezení pro **sync replikaci**: latence < 5 ms RTT (commit musí počkat na potvrzení z obou DC). Při 100 km je RTT ~1 ms – v pořádku. Při 1000 km (~10 ms RTT) sync replikace snižuje výkon transakcí o 80+ %. + +Pro **Active-Active** jsou vhodné: +- **Cassandra / ScyllaDB** — nativní multi-DC, eventual consistency, žádný split-brain +- **MySQL Group Replication (multi-primary)** — 3+ DC pro kvorum +- **CockroachDB / TiDB** — nativní multi-region, ACID napříč DC +- **Redis Enterprise** — Active-Active (CRDT-based) + +Pro **Active-Passive** jsou vhodné: +- **PostgreSQL + Patroni** — auto-failover, etcd kvorum +- **Oracle Data Guard** — FSFO, far sync pro vzdálené DC +- **MSSQL AlwaysOn** — cloud witness +- **MongoDB Replica Set** — arbitration node v 3. 
lokaci + +#### Hypervisory + +| Hypervisor | Cluster technologie | Stretched cluster | Max distance | Split-brain | +|-----------|-------------------|-------------------|-------------|-------------| +| **VMware vSphere** | vSAN延伸, Metro vCenter, Site Recovery Manager | Ano (vSAN延伸, Metro Cluster) | < 50 km (vSAN延伸), < 10 ms RTT | Fencing (STONITH), witness host | +| **Hyper-V** | Storage Replica + Failover Cluster | Ano (Cluster Sets) | < 50 km (sync), neomezena (async) | File share witness / cloud witness | +| **Proxmox VE** | Proxmox HA + Ceph | Omezeně (Ceph stretch cluster) | < 50 km (Ceph sync) | Ceph monitor quorum (3+ DC) | +| **XCP-ng / XenServer** | Xen Orchestra HA + SR (Storage Repository) replication | Omezeně | závisí na storage replikaci | — | +| **Nutanix AHV** | Metro Availability (sync), Async DR | Ano (Metro) | < 100 km (sync), neomezena (async) | Witness VM (cloud / 3. site) | +| **KVM / oVirt** | oVirt HA + GlusterFS / NFS | Omezeně | závisí na storage replikaci | — | + +**vSAN延伸** specifické požadavky: +- Dedikovaná síť pro vSAN (25 GbE min., < 5 ms RTT) +- Witness host v 3. 
lokaci (nebo cloud witness) +- Všechny VM protokoly (FTT=1, mirroring striped) +- Storage policy: `site-A + site-B + witness` + +#### Kubernetes a kontejnerové platformy + +| Platforma | Multi-cluster DR | Replikace | Max distance | Failover | +|-----------|-----------------|-----------|-------------|----------| +| **Vanilla K8s** | KubeFed, Cluster API, Velero + Restic | Velero (backup/restore), Rook (Ceph) | neomezena | Manuální (Velero restore) | +| **OpenShift** | ACM (Advanced Cluster Management), Velero | OADP (OpenShift API for Data Protection) | neomezena | ACM failover (subscription) | +| **Rancher** | Rancher Multi-Cluster App, Velero | Longhorn (sync/async DR), Velero | neomezena | Polo-auto | +| **Google GKE** | Multi-cluster Services, Backup for GKE | Config Sync, Backup for GKE | neomezena | Manuální | +| **Azure AKS** | Azure ARC + Velero + Azure Traffic Manager | AKS backup (velero), Azure Site Recovery | neomezena | Manuální (Velero) | +| **AWS EKS** | EKS multi-cluster, Velero + S3 cross-region | Velero (S3), Rook (EBS snapshots) | neomezena | Manuální | + +**Klíčové principy K8s DR:** +- **Aplikace musí být stateless** (nebo state externalizovaný do DB/storage) +- **Velero** — backup/restore celého clusteru (PV, resources, helm releases) +- **Rook/Ceph** — cross-region mirroring RBD volumes +- **KubeFed / ACM** — subscription-based deploy do více clusterů +- **Ingress/Gateway API** — traffic routing mezi clustery +- **External DNS** — DNS failover při výpadku clusteru + +#### Messaging / streaming + +| Platforma | Replikace | Topologie | DR podpora | Max distance | +|-----------|-----------|-----------|------------|-------------| +| **Apache Kafka** | MirrorMaker 2, Confluent Cluster Linking, KRaft quorum | Active-Passive (MM2), Active-Active (Cluster Linking) | MM2: async, Cluster Linking: async | neomezena | +| **RabbitMQ** | Classic Queue Mirroring, Quorum Queues | Active-Passive (Warm Standby) | Federation / Shovel (async) | neomezena | +| 
**Red Hat AMQ** | (Artemis) Cluster + HA | Active-Passive (shared store / replication) | Live-backup pair | < 100 km (sync) | +| **NATS** | NATS JetStream (cluster + cross-account) | Active-Active (Leaf nodes, cross-account) | Super-cluster, failover | neomezena | +| **Apache Pulsar** | BookKeeper (bookie rack-aware), geo-replication | Active-Active (geo-replication) | Built-in (cluster-level) | neomezena (async) | +| **AWS SQS/SNS** | Managed, AWS region pairs | Active-Active (multi-region) | Built-in (AWS managed) | neomezena | +| **Azure Service Bus** | Managed, paired region | Active-Passive (paired region) | Built-in (geo-recovery) | neomezena | +| **Oracle Service Bus (OSB)** | Oracle WebLogic Cluster + JDBC store + AQ | Active-Passive (WebLogic Cluster + Data Guard) | OSB/WLS cluster + Oracle RAC/Data Guard sync | < 100 km (Data Guard sync), neomezena (async) | + +**Doporučení pro DR messagingu:** +- **Kafka**: použít Cluster Linking pro Active-Active, nebo MirrorMaker 2 pro Active-Passive; replikovat jen kritická témata +- **RabbitMQ**: Quorum Queues + Federation upstream pro DR; vyhnout se Classic Queue Mirroring (deprecated) +- **Pulsar**: nativní geo-replication, bookie rack-aware pro stretch cluster; nejjednodušší DR mezi messaging platformami +- **OSB**: WebLogic cluster + Oracle RAC/Data Guard; DR závisí na DB vrstvě, ne na OSB samotném + +### Hlavní omezení per vrstva (shrnující tabulka) + +| Vrstva | Omezující faktor pro sekundární DC | Max distance pro sync | Dopad na výběr topologie | +|--------|-----------------------------------|----------------------|--------------------------| +| **Storage** | Latence sync mirroru, DWDM náklady | < 50 km (MetroCluster) | Stretched cluster jen v metru | +| **Databáze** | Commit wait pro sync replikaci | < 100 km (5 ms RTT) | Active-Active jen s DB podporující multi-master | +| **Hypervisor** | Stretched cluster quorum + fencing | < 50 km (vSAN, 5 ms) | MetroCluster / stretched cluster | +| **Kubernetes** | 
Velero restore time, Rook mirror latency | neomezena (async) | Active-Passive, cold standby | +| **Messaging** | Replication lag, offset management | neomezena (async) | Active-Active (Kafka, Pulsar, NATS) nebo Active-Passive | +| **Network** | Dark fiber/DWDM náklady, latency | < 100 km (metro fiber) | Omezuje možnosti sync replikace | +| **Aplikace** | Stateful/stateless, connection draining | závisí na architektuře | Stateless app → libovolná topologie | + ## Monitoring disků — S.M.A.R.T. Self-Monitoring, Analysis and Reporting Technology — prediktivní monitoring HDD/SSD. @@ -785,4 +1060,4 @@ OpenStack přináší do DC softwarovou abstrakční vrstvu, která umožňuje m - Akademické / HPC clustery (Ironic, Cyborg, Manila) - Government / regulated prostředí (on-prem, audit trail) -*Poslední revize: 2026-06-03* +*Poslední revize: 2026-06-12* diff --git a/DC-MIGRATION.en.md b/DC-MIGRATION.en.md new file mode 100644 index 0000000..5b904dc --- /dev/null +++ b/DC-MIGRATION.en.md @@ -0,0 +1,246 @@ +# 🏗️ Data Center Migration + +## Migration strategies + +| Strategy | RTO | RPO | Risk | Cost | Duration | Description | +|-----------|-----|-----|--------|---------|-------------|-------| +| **Cold / Big Bang** | hours–days | days | High | Low | days | Shut everything down, move, power up | +| **Phased / Wave** | minutes (per wave) | minutes | Medium | Medium | weeks–months | Workloads moved in waves | +| **Rolling** | 0 (live) | 0 | Low | High | months | Live migration per VM/service | +| **Parallel Run** | 0 | 0 | Low | Very high | months | Both DCs operational, gradual cutover | +| **Pilot Light** | hours | minutes | Medium | Low | weeks | Critical services in new DC, rest migrates | +| **Lift & Shift** | hours | minutes | Medium | Low | weeks | VMs/servers moved without configuration changes | +| **Re-platform** | hours | minutes | Low | Medium | months | Optimization during migration (OS upgrade, resize) | +| **Re-architect** | 0 | 0 | Low | High | months–years | Application 
redesigned for new platform | + +--- + +## Decision tree + +```mermaid +flowchart TD + Start(["DC Migration"]) --> APP{"Application\nstateful?"} + APP -->|"Yes"| DOWNTIME{"Tolerates\ndowntime?"} + APP -->|"No"| ROLLING["Rolling / Parallel Run"] + + DOWNTIME -->|"Yes, hours+"| COLD["Cold / Big Bang\nSimplest, cheapest\nRisk: all at once"] + DOWNTIME -->|"Yes, minutes"| PHASED["Phased / Wave\nBy application / business unit"] + DOWNTIME -->|"No (zero downtime)"| SYNC{"Sync replication\npossible?"} + + SYNC -->|"Yes, < 100 km"| ROLLING + SYNC -->|"No"| PARALLEL["Parallel Run\nBoth DCs active, gradual cutover"] + + ROLLING --> ROLL_HA{"VMware,\nHyper-V?"} + ROLL_HA -->|"Yes"| VMOTION["vMotion / Storage vMotion\nLive migration, 0 downtime"] + ROLL_HA -->|"No"| ROLL_REPL["Storage + DB replication\nGradual workload migration"] +``` + +--- + +## Migration phases + +### 1. Discovery and assessment + +| Task | Tools | Output | +|------|----------|--------| +| HW and SW inventory | RVTools, NetBox, CMDB | Server, VM, and service list | +| Dependency mapping | ServiceNow, AppDynamics, manual | Application dependency graph | +| Traffic analysis | NetFlow, sFlow, vRNI | Bandwidth, latency, peak usage | +| Performance baseline | Prometheus, Zabbix, vRealize | CPU/RAM/disk/network per workload | +| License audit | Flexera, SAM | Licenses, support, compliance | + +**Output:** workload list with RTO/RPO, dependencies, and criticality. + +### 2. Planning + +- **Wave plan** — division of workloads into migration waves (10–50 VMs per wave) +- **Dependency ordering** — DNS, NTP, LDAP, PKI first +- **Cutover window** — time window for switching (typically a weekend) +- **Rollback plan** — conditions and procedure for reversal +- **Test plan** — what to test post-migration and how +- **Communication plan** — who is informed, when, and how + +### 3. 
New DC preparation + +- **Infrastructure** — DNS, NTP, DHCP, LDAP/AD, PKI, monitoring (see [DATACENTERS.en.md](DATACENTERS.en.md) — deployment order) +- **Network** — BGP peering, VXLAN/VLAN, firewall rules, load balancers +- **Storage** — SAN zoning, NAS exports, Ceph cluster +- **Virtualization** — vCenter, Hyper-V cluster, Proxmox + +### 4. Replication and synchronization + +| Layer | Method | Tools | +|--------|--------|----------| +| **Storage (block)** | SAN sync/async mirror, LUN replication | NetApp SnapMirror, Dell EMC RecoverPoint, Pure ActiveCluster | +| **Storage (file)** | DFS-R, rsync, robocopy | Windows DFS, Rsync | +| **Storage (object)** | Cross-region replication | MinIO replication, S3 CRR | +| **Databases** | Log shipping, CDC, streaming replication | PostgreSQL Patroni, Oracle Data Guard, MSSQL AlwaysOn, MySQL Group Replication | +| **VM** | Storage vMotion, replication | VMware vSphere Replication, Hyper-V Replica, Zerto | +| **Kubernetes** | Velero + Restic, Rook Ceph mirror | Velero, Rook | + +### 5. Workload migration + +#### Wave migration (recommended for medium/large DCs) + +```mermaid +gantt + title Wave migration + dateFormat YYYY-MM-DD + section Wave 1 - Core + DNS, NTP, LDAP :done, w1a, 2026-07-01, 3d + Monitoring + logging :done, w1b, after w1a, 2d + section Wave 2 - Network + Load balancers :active, w2a, 2026-07-06, 2d + Firewalls :active, w2b, 2026-07-08, 2d + section Wave 3 - Storage + NAS migration :w3a, 2026-07-10, 5d + SAN replication :w3b, 2026-07-10, 3d + section Wave 4 - Dev/Test + Dev VMs :w4a, 2026-07-15, 5d + section Wave 5 - Prod tier 3 + Internal apps :w5a, 2026-07-22, 5d + section Wave 6 - Prod tier 2 + Business apps :w6a, 2026-07-29, 5d + section Wave 7 - Prod tier 1 + Critical apps :w7a, 2026-08-05, 5d +``` + +#### Typical procedure for a single wave: + +1. **Day -7**: Start data replication (initial seed) +2. **Day -1**: Incremental sync, final test +3. 
**Day 0 (cutover)**: + - Stop application in source DC + - Final sync (last delta) + - Start application in target DC + - DNS/Traffic switch + - Smoke test +4. **Day +1**: Monitoring (performance, errors, lag) +5. **Day +7**: Rollback window end (success confirmation) + +### 6. Network strategies + +#### IP re-addressing + +| Approach | Description | Pros | Cons | +|---------|-------|--------|----------| +| **Keep IP** | Same IPs, BGP anycast or stretch VLAN | No application config changes | Stretched VLAN/L2 limitations | +| **Change IP** | New IP range, DNS/BGP routing change | Clean architecture | Config changes, DNS TTL | +| **NAT translation** | NAT between old and new IP space | No application changes | Latency, troubleshooting complexity | + +**Keep IP** is only possible with: +- L2 stretch between DCs (VXLAN, OTV) — distance limited +- BGP anycast for VIPs (load balancers) +- Applications tolerant to ARP cache changes + +#### DNS cutover + +``` +1. Lower TTL to 60–300 s (one week ahead) +2. At cutover, change A/AAAA records to new IPs +3. Wait for propagation (per TTL) +4. Monitor traffic +``` + +#### Traffic steering + +| Technique | Use case | +|----------|----------| +| **BGP** | Change AS path / local pref for traffic steering | +| **DNS** | Lower TTL, change A records | +| **Load balancer** | Change pool members, health check | +| **GSLB** | Global Server Load Balancing (F5 GTM, NSX ALB) | +| **Cloud DNS** | AWS Route53, Azure Traffic Manager, Google Cloud DNS | + +### 7. Database migration + +See individual DB files for details. 
Summary table: + +| DB | Method | RPO | RTO | Note | +|----|--------|-----|-----|----------| +| **PostgreSQL** | Streaming replication + Patroni switchover | 0 (sync) / ~MB (async) | min | Patroni auto-failover | +| **MySQL** | Group Replication / async replication | 0 (sync) / seconds | min | InnoDB Cluster | +| **Oracle** | Data Guard switchover | 0 (sync) | min | Far sync for remote DCs | +| **MSSQL** | AlwaysOn AG failover | 0 (sync) | min | Cloud witness | +| **MongoDB** | Replica set election | seconds | < 1 min | Priority-based failover | +| **Cassandra** | Multi-DC replication | eventual | 0 | Native multi-master | + +### 8. Testing + +| Phase | What to test | Method | +|------|-------------|--------| +| **Pre-migration** | Application in new DC (isolated) | Dry run on replicated data | +| **Cutover** | Functionality, availability, latency | Smoke test, synthetic transactions | +| **Post-migration** | Performance, integration, monitoring | A/B comparison with baseline, canary traffic | +| **Rollback** | Return to old DC | Tested rollback plan | + +### 9. Rollback plan + +Each wave must have a defined rollback: + +| Condition | Action | +|----------|------| +| Application fails to start in new DC | DNS switch back, stop replication | +| Performance worse than baseline (> 20 %) | Rollback, root cause analysis | +| Integration failure (API timeout, DB connection) | Rollback, dependency check | +| Security incident | Rollback, forensic analysis | + +Rollback must be tested **before** the real cutover. 
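
The rollback conditions in the table above can be turned into an automated post-cutover check. The following is a minimal sketch under stated assumptions, not part of the original runbook: the `WaveMetrics` fields, the `rollback_reasons` helper, and the choice of latency as the performance metric are hypothetical. Only the 20 % degradation threshold comes from the table.

```python
from dataclasses import dataclass


# Hypothetical post-cutover snapshot; all field names are illustrative assumptions.
@dataclass
class WaveMetrics:
    app_started: bool
    baseline_latency_ms: float
    current_latency_ms: float
    integration_errors: int
    security_incident: bool


def rollback_reasons(m: WaveMetrics, max_degradation: float = 0.20) -> list[str]:
    """Return the rollback conditions from the table that currently hold."""
    reasons = []
    if not m.app_started:
        reasons.append("application failed to start")
    # Latency stands in for "performance" here; 20 % is the threshold from the table.
    if m.current_latency_ms > m.baseline_latency_ms * (1 + max_degradation):
        reasons.append("performance worse than baseline")
    if m.integration_errors > 0:
        reasons.append("integration failure")
    if m.security_incident:
        reasons.append("security incident")
    return reasons
```

An empty list means no rollback condition is met. Any non-empty result would trigger the action listed for that condition, for example switching DNS back to the old DC.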
+ +--- + +## Special cases + +### Mainframe migration + +- **IBM z/OS** — GDPS (Geographically Dispersed Parallel Sysplex) +- HyperSwap for storage mirroring +- Cross-system coupling facility (XCF) +- Often the last migrated component + +### COTS applications (Oracle EBS, SAP) + +- Require vendor-specific migration procedures +- Oracle EBS: Autoconfig, cloning (ADXLC) +- SAP: System Copy (Homogeneous / Heterogeneous), SWPM, SUM +- License re-licensing on HW change + +### Cloud migration (On-prem → Cloud) + +See [CLOUD.en.md](CLOUD.en.md) — migration strategies (6 Rs): + +| Strategy | Description | +|-----------|-------| +| **Re-host (Lift & Shift)** | VM → Cloud VM (AWS MGN, Azure Migrate) | +| **Re-platform** | OS upgrade, managed DB (RDS, Cloud SQL) | +| **Re-architect** | Application rewritten as cloud-native | +| **Retire** | Decommission unnecessary applications | +| **Retain** | Application stays on-prem (review later) | +| **Repurchase** | SaaS replacement | + +--- + +## Recommended approach per DC size + +| DC Size | VM Count | Recommended strategy | Duration | Team | +|-------------|----------|---------------------|-------------|-----| +| **Small** | < 50 | Big Bang (weekend) | 2–4 days | 3–5 people | +| **Medium** | 50–500 | Phased (5–10 waves) | 2–8 weeks | 5–10 people | +| **Large** | 500–5000 | Phased + Rolling | 3–12 months | 10–30 people | +| **Enterprise** | 5000+ | Parallel Run / Rolling | 12–36 months | 30+ people | + +--- + +## Related + +- [DATACENTERS.en.md](DATACENTERS.en.md) — DC topologies, secondary DC, deployment order +- [CLOUD.en.md](CLOUD.en.md) — cloud migration strategies (6 Rs) +- [DR.en.md](DR.en.md) — disaster recovery, RTO/RPO +- [NETWORKING.en.md](NETWORKING.en.md) — BGP, DNS, VXLAN, traffic steering +- [STORAGE.en.md](STORAGE.en.md) — storage replication + +## Sources + +Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md) + +*Last revision: 2026-06-12* \ No newline at end of file 
diff --git a/DC-MIGRATION.md b/DC-MIGRATION.md new file mode 100644 index 0000000..6ec0bfe --- /dev/null +++ b/DC-MIGRATION.md @@ -0,0 +1,246 @@ +# 🏗️ Data center migration + +## Migration strategies + +| Strategy | RTO | RPO | Risk | Cost | Duration | Description | +|-----------|-----|-----|--------|---------|-------------|-------| +| **Cold / Big Bang** | hours–days | days | High | Low | days | Shut everything down at once, move, power up | +| **Phased / Wave** | minutes (per wave) | minutes | Medium | Medium | weeks–months | Workloads moved in waves | +| **Rolling** | 0 (live) | 0 | Low | High | months | Live migration per VM/service | +| **Parallel Run** | 0 | 0 | Low | Very high | months | Both DCs in operation, gradual transition | +| **Pilot Light** | hours | minutes | Medium | Low | weeks | Critical services in the new DC, the rest are moved over | +| **Lift & Shift** | hours | minutes | Medium | Low | weeks | VMs/servers moved without configuration changes | +| **Re-platform** | hours | minutes | Low | Medium | months | Optimization during migration (OS upgrade, resize) | +| **Re-architect** | 0 | 0 | Low | High | months–years | Application redesigned for the new platform | + +--- + +## Decision tree + +```mermaid +flowchart TD + Start(["DC migration"]) --> APP{"Application\nstateful?"} + APP -->|"Yes"| DOWNTIME{"Tolerates\ndowntime?"} + APP -->|"No"| ROLLING["Rolling / Parallel Run"] + + DOWNTIME -->|"Yes, hours+"| COLD["Cold / Big Bang\nSimplest, cheapest\nRisk: everything at once"] + DOWNTIME -->|"Yes, minutes"| PHASED["Phased / Wave\nBy application / business unit"] + DOWNTIME -->|"No (zero downtime)"| SYNC{"Sync replication\npossible?"} + + SYNC -->|"Yes, < 100 km"| ROLLING + SYNC -->|"No"| PARALLEL["Parallel Run\nBoth DCs active, gradual cutover"] + + ROLLING --> ROLL_HA{"VMware,\nHyper-V?"} + ROLL_HA -->|"Yes"| VMOTION["vMotion / Storage vMotion\nLive migration, 0 downtime"] + ROLL_HA -->|"No"| ROLL_REPL["Storage + DB replication\nGradual workload move"] 
+``` + +--- + +## Migration phases + +### 1. Discovery and assessment + +| Task | Tools | Output | +|------|----------|--------| +| HW and SW inventory | RVTools, NetBox, CMDB | List of all servers, VMs, and services | +| Dependency mapping | ServiceNow, AppDynamics, manual | Application dependency graph | +| Traffic analysis | NetFlow, sFlow, vRNI | Bandwidth, latency, peak usage | +| Performance baseline | Prometheus, Zabbix, vRealize | CPU/RAM/disk/network per workload | +| License audit | Flexera, SAM | Licenses, support, compliance | + +**Output:** a list of workloads with RTO/RPO, dependencies, and criticality. Without it, the migration cannot be planned. + +### 2. Planning + +- **Wave plan** — division of workloads into migration waves (10–50 VMs per wave) +- **Dependency ordering** — DNS, NTP, LDAP, PKI must come first +- **Cutover window** — time window for switching (typically a weekend) +- **Rollback plan** — conditions and procedure for reversal +- **Test plan** — what to test after the migration and how +- **Communication plan** — who is informed, when, and how + +### 3. New DC preparation + +- **Infrastructure** — DNS, NTP, DHCP, LDAP/AD, PKI, monitoring (see [DATACENTERS.md](DATACENTERS.md) — deployment order) +- **Network** — BGP peering, VXLAN/VLAN, firewall rules, load balancers +- **Storage** — SAN zoning, NAS exports, Ceph cluster +- **Virtualization** — vCenter, Hyper-V cluster, Proxmox + +### 4. 
Replication and synchronization + +| Layer | Method | Tools | +|--------|--------|----------| +| **Storage (block)** | SAN sync/async mirror, LUN replication | NetApp SnapMirror, Dell EMC RecoverPoint, Pure ActiveCluster | +| **Storage (file)** | DFS-R, rsync, robocopy | Windows DFS, Rsync | +| **Storage (object)** | Cross-region replication | MinIO replication, S3 CRR | +| **Databases** | Log shipping, CDC, streaming replication | PostgreSQL Patroni, Oracle Data Guard, MSSQL AlwaysOn, MySQL Group Replication | +| **VM** | Storage vMotion, replication | VMware vSphere Replication, Hyper-V Replica, Zerto | +| **Kubernetes** | Velero + Restic, Rook Ceph mirror | Velero, Rook | + +### 5. Workload migration + +#### Wave migration (recommended for medium/large DCs) + +```mermaid +gantt + title Wave migration + dateFormat YYYY-MM-DD + section Wave 1 - Core + DNS, NTP, LDAP :done, w1a, 2026-07-01, 3d + Monitoring + logging :done, w1b, after w1a, 2d + section Wave 2 - Network + Load balancers :active, w2a, 2026-07-06, 2d + Firewalls :active, w2b, 2026-07-08, 2d + section Wave 3 - Storage + NAS migration :w3a, 2026-07-10, 5d + SAN replication :w3b, 2026-07-10, 3d + section Wave 4 - Dev/Test + Dev VMs :w4a, 2026-07-15, 5d + section Wave 5 - Prod tier 3 + Internal apps :w5a, 2026-07-22, 5d + section Wave 6 - Prod tier 2 + Business apps :w6a, 2026-07-29, 5d + section Wave 7 - Prod tier 1 + Critical apps :w7a, 2026-08-05, 5d +``` + +#### Typical procedure for a single wave: + +1. **Day -7**: Start data replication (initial seed) +2. **Day -1**: Incremental sync, final test +3. **Day 0 (cutover)**: + - Stop the application in the source DC + - Final sync (last delta) + - Start the application in the target DC + - DNS/Traffic switch + - Smoke test +4. **Day +1**: Monitoring (performance, errors, lag) +5. **Day +7**: Rollback window end (success confirmation) + +### 6. 
Network strategies + +#### IP re-addressing + +| Approach | Description | Pros | Cons | +|---------|-------|--------|----------| +| **Keep IP** | Same IPs, BGP anycast or stretch VLAN | No need to change application configuration | Stretched VLAN/L2 limitations | +| **Change IP** | New IP range, DNS/BGP routing change | Clean architecture | Configuration changes, DNS TTL | +| **NAT translation** | NAT between the old and new IP space | No application changes | Latency, troubleshooting complexity | + +**Keep IP** is only possible with: +- L2 stretch between DCs (VXLAN, OTV) — limited by distance +- BGP anycast for VIPs (load balancers) +- Applications that tolerate ARP cache changes + +#### DNS cutover + +``` +1. Lower TTL to 60–300 s (one week ahead) +2. At cutover, change A/AAAA records to the new IPs +3. Wait for propagation (per TTL) +4. Monitor traffic +``` + +#### Traffic steering + +| Technique | Use case | +|----------|----------| +| **BGP** | Change AS path / local pref to redirect traffic | +| **DNS** | Lower TTL, change A records | +| **Load balancer** | Change pool members, health check | +| **GSLB** | Global Server Load Balancing (F5 GTM, NSX ALB) | +| **Cloud DNS** | AWS Route53, Azure Traffic Manager, Google Cloud DNS | + +### 7. Database migration + +See the individual DB files for details. Summary table: + +| DB | Method | RPO | RTO | Note | +|----|--------|-----|-----|----------| +| **PostgreSQL** | Streaming replication + Patroni switchover | 0 (sync) / ~MB (async) | min | Patroni auto-failover | +| **MySQL** | Group Replication / async replication | 0 (sync) / seconds | min | InnoDB Cluster | +| **Oracle** | Data Guard switchover | 0 (sync) | min | Far sync for remote DCs | +| **MSSQL** | AlwaysOn AG failover | 0 (sync) | min | Cloud witness | +| **MongoDB** | Replica set election | seconds | < 1 min | Priority-based failover | +| **Cassandra** | Multi-DC replication | eventual | 0 | Native multi-master | + +### 8. 
Testing + +| Phase | What to test | Method | +|------|-------------|--------| +| **Pre-migration** | Application in the new DC (isolated) | Dry run on replicated data | +| **Cutover** | Functionality, availability, latency | Smoke test, synthetic transactions | +| **Post-migration** | Performance, integration, monitoring | A/B comparison with baseline, canary traffic | +| **Rollback** | Return to the old DC | Tested rollback plan | + +### 9. Rollback plan + +Each wave must have a defined rollback: + +| Condition | Action | +|----------|------| +| Application fails to start in the new DC | Switch DNS back, stop replication | +| Performance worse than baseline (by > 20 %) | Rollback, root cause analysis | +| Integration failure (API timeout, DB connection) | Rollback, dependency check | +| Security incident | Rollback, forensic analysis | + +Rollback should be tested **before** the real cutover. + +--- + +## Special cases + +### Mainframe migration + +- **IBM z/OS** — GDPS (Geographically Dispersed Parallel Sysplex) +- HyperSwap for storage mirroring +- Cross-system coupling facility (XCF) +- Often the last component to be migrated + +### COTS applications (Oracle EBS, SAP) + +- Require vendor-specific migration procedures +- Oracle EBS: Autoconfig, cloning (ADXLC) +- SAP: System Copy (Homogeneous / Heterogeneous), SWPM, SUM +- Re-licensing on HW change + +### Cloud migration (On-prem → Cloud) + +See [CLOUD.md](CLOUD.md) — migration strategies (6 Rs): + +| Strategy | Description | +|-----------|-------| +| **Re-host (Lift & Shift)** | VM → Cloud VM (AWS MGN, Azure Migrate) | +| **Re-platform** | OS upgrade, managed DB (RDS, Cloud SQL) | +| **Re-architect** | Application rewritten as cloud-native | +| **Retire** | Shut down unneeded applications | +| **Retain** | Application stays on-prem (review later) | +| **Repurchase** | SaaS replacement | + +--- + +## Recommended approach per DC size + +| DC size | VM count | Recommended strategy | Duration | Team | 
+|-------------|----------|---------------------|-------------|-----| +| **Small** | < 50 | Big Bang (weekend) | 2–4 days | 3–5 people | +| **Medium** | 50–500 | Phased (5–10 waves) | 2–8 weeks | 5–10 people | +| **Large** | 500–5000 | Phased + Rolling | 3–12 months | 10–30 people | +| **Enterprise** | 5000+ | Parallel Run / Rolling | 12–36 months | 30+ people | + +--- + +## Related + +- [DATACENTERS.md](DATACENTERS.md) — DC topologies, secondary DC, deployment order +- [CLOUD.md](CLOUD.md) — cloud migration strategies (6 Rs) +- [DR.md](DR.md) — disaster recovery, RTO/RPO +- [NETWORKING.md](NETWORKING.md) — BGP, DNS, VXLAN, traffic steering +- [STORAGE.md](STORAGE.md) — storage replication + +## Sources + +Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md) + +*Last revision: 2026-06-12* \ No newline at end of file diff --git a/DR.en.md b/DR.en.md new file mode 100644 index 0000000..95b291a --- /dev/null +++ b/DR.en.md @@ -0,0 +1,336 @@ +# 🔄 Disaster Recovery and Business Continuity + +## Terminology + +| Abbreviation | Meaning | Description | +|---------|--------|-------| +| **RTO** | Recovery Time Objective | Maximum time from outage to service recovery | +| **RPO** | Recovery Point Objective | Maximum acceptable data loss (time since last backup) | +| **MTD** | Maximum Tolerable Downtime | Total outage duration an organization can survive | +| **WRT** | Work Recovery Time | Time needed for full operations recovery after IT restoration | +| **MTBF** | Mean Time Between Failures | Average operating time between consecutive failures | +| **MTTR** | Mean Time To Repair | Average time needed to repair a failure | +| **SLA** | Service Level Agreement | Contractual availability commitment | +| **SLO** | Service Level Objective | Internal availability target | +| **SLI** | Service Level Indicator | Measured availability value | + +### Relationship between RTO, RPO, MTD, WRT + +``` +Outage ──── RPO ────► Data restored ──── RTO ────► Service running ──── WRT ────► Full operations 
+ │ │ │ + ▼ ▼ ▼ + Lost data Time without service Time to full capacity + + MTD = RTO + WRT (max. time the business tolerates) +``` + +--- + +## Uptime calculation + +### Nines table + +| Level | Uptime | Downtime / year | Downtime / month | Downtime / week | +|--------|--------|---------------|------------------|------------------| +| 90 % (one nine) | 0.9 | 36.5 days | 72 h | 16.8 h | +| 99 % (two nines) | 0.99 | 3.65 days | 7.2 h | 1.68 h | +| 99.5 % | 0.995 | 1.83 days | 3.6 h | 50.4 min | +| 99.9 % (three nines) | 0.999 | 8.76 h | 43.2 min | 10.1 min | +| 99.95 % | 0.9995 | 4.38 h | 21.6 min | 5.04 min | +| 99.99 % (four nines) | 0.9999 | 52.6 min | 4.32 min | 1.01 min | +| 99.995 % | 0.99995 | 26.3 min | 2.16 min | 30.2 s | +| 99.999 % (five nines) | 0.99999 | 5.26 min | 25.9 s | 6.05 s | +| 99.9999 % (six nines) | 0.999999 | 31.6 s | 2.59 s | 0.605 s | + +### Calculation + +``` +Availability = (Total time - Downtime) / Total time × 100 % + +Example: + Year = 365 × 24 × 60 = 525,600 minutes + Target: 99.9 % → allowed downtime = 525,600 × (1 - 0.999) = 525.6 minutes = 8.76 h + +Combined availability (chain of dependencies): + A_web = 99.9 % (3 nines) + A_api = 99.99 % (4 nines) + A_db = 99.999 % (5 nines) + + A_total = 0.999 × 0.9999 × 0.99999 = 0.99889 ≈ 99.89 % (less than 3 nines!) + +Parallel availability (redundancy): + A_total = 1 - (1 - A_1) × (1 - A_2) × ... 
× (1 - A_n) + + Example: 2 servers with 99% availability + A_total = 1 - (1-0.99) × (1-0.99) = 1 - 0.01 × 0.01 = 0.9999 (99.99 %) +``` + +### Calculator + +```python +def uptime_percent_to_downtime(pct, period_days=365): + """Convert uptime percentage to downtime in given period.""" + total_minutes = period_days * 24 * 60 + allowed_downtime = total_minutes * (1 - pct / 100) + return allowed_downtime # minutes + +def downtime_to_uptime_percent(downtime_minutes, period_days=365): + """Convert downtime in minutes to uptime percentage.""" + total_minutes = period_days * 24 * 60 + return (1 - downtime_minutes / total_minutes) * 100 + +def combined_availability(availabilities): + """Combined availability (series-connected components).""" + result = 1.0 + for a in availabilities: + result *= a + return result + +def redundant_availability(availabilities): + """Redundant availability (parallel components).""" + result = 1.0 + for a in availabilities: + result *= (1 - a) + return 1 - result +``` + +### Calculation fallacies + +- **Combined availability is not a sum** — adding another dependency always reduces total availability +- **Redundancy is not free** — adding a standby component requires failure detection + failover (MTTR does not improve automatically) +- **SLA is not a guarantee** — providers often calculate SLA as a monthly average, not per-incident +- **Measurement is key** — without SLI, SLO cannot be verified; "unmeasured availability does not exist" +- **Planned maintenance** — sometimes counted as uptime, sometimes not (depends on SLA definition) + +--- + +## DR scenarios + +### Classification + +| Category | Scenario | Typical RTO | Typical RPO | Frequency | +|-----------|--------|-------------|-------------|-----------| +| **Site** | Entire DC / region outage | hours | minutes | Low | +| **Infrastructure** | HW failure (storage, switch, server) | minutes–hours | seconds | Medium | +| **Software** | OS, application, DB failure | minutes | seconds | High | +| 
**Data** | Data corruption, deletion, cryptolocker | hours | backup point | Low–medium | +| **Human** | Wrong deployment, config change | minutes–hours | seconds | Medium | +| **Security** | Attack, breach, ransomware | days | before attack | Low | +| **Network** | Connectivity outage, DDoS | minutes–hours | N/A | Medium | +| **Cloud provider** | Regional outage (AWS, Azure, GCP) | hours | minutes | Very low | + +### Scenario details + +#### Site / Region failure + +| Aspect | Description | +|--------|-------| +| **Cause** | Blackout, fire, flood, earthquake, cloud provider outage | +| **Prevention** | Multi-AZ architecture, multi-region deployment, active-active | +| **Mitigation** | Automatic DNS failover (Route53, Azure Traffic Manager), replica in DR region | +| **Testing** | Game day: shut down primary region, verify automatic failover | + +#### Data corruption / human error + +| Aspect | Description | +|--------|-------| +| **Cause** | Wrong SQL command (DELETE without WHERE), accidentally deleted bucket, bad migration | +| **Prevention** | RBAC, MFA for destructive operations, change management, SQL peer review | +| **Mitigation** | Point-in-time recovery (PITR), transaction log replay, immutable backups | +| **Testing** | Restore backup to isolated environment, verify data integrity | + +#### Ransomware / cyber attack + +| Aspect | Description | +|--------|-------| +| **Cause** | Attack on production systems, data encryption, exfiltration | +| **Prevention** | Immutable backups (object lock), air-gapped backups, network segmentation | +| **Mitigation** | Restore from clean backup, rebuild infrastructure from IaC | +| **Testing** | Regular restore in isolated network, verify backup is not infected | + +--- + +## Prevention — strategies + +### Backup strategies + +| Approach | Description | Use case | +|---------|-------|----------| +| **3-2-1 rule** | 3 copies, 2 different media, 1 off-site | Universal | +| **3-2-1-0** | + 0 errors after restore (testing) | 
Enterprise, compliance | +| **GFS (Grandfather-Father-Son)** | Daily, weekly, monthly rotation | Long-term archive | +| **Incremental forever** | Full backup 1×, then only changes | Large data volumes | +| **Reverse incremental** | Full + incremental, full is always current | Fast recovery | + +### Backup methods + +| Method | RPO | RTO | Storage | Suitable for | +|--------|-----|-----|----------|------------| +| **Full backup** | Last full | Full restore time | Large | Small data, weekly | +| **Incremental** | Last incremental | Full + all incrementals | Small | Large data, daily | +| **Differential** | Last diff | Full + last diff | Medium | Compromise | +| **Snapshot** | Snapshot point-in-time | seconds | Copy-on-write | VM, storage array | +| **Continuous (CDC)** | < 1 s | Seconds | Log stream | DB (binlog, WAL) | +| **PITR** | Any point in time | Depends on volume | Full + WAL | RDS, PostgreSQL, SQL Server | + +### Backup immutability + +Key protection against ransomware: + +| Technique | Description | +|----------|-------| +| **Object Lock (WORM)** | Backup cannot be deleted or overwritten for a defined retention period (S3 Object Lock, Azure Blob Immutable) | +| **Air gap** | Backup is physically separated from the production network (offline disk, tape, cloud without VPN) | +| **Isolated backup network** | Backup traffic goes through a dedicated network without access from production VLAN | +| **Out-of-band access** | Backup management console is not accessible from the production network | + +--- + +## DR architectures + +### Multi-AZ (Single region) + +``` +Region ┌────────────────────────────────────┐ + │ AZ-1 AZ-2 │ + │ ┌──────────┐ ┌──────────┐ │ + │ │ App │ │ App │ │ + │ └─────┬────┘ └─────┬────┘ │ + │ │ │ │ + │ ┌─────▼────────────────▼─────┐ │ + │ │ Load Balancer (cross-AZ) │ │ + │ └─────────────┬──────────────┘ │ + │ │ │ + │ ┌─────────────▼──────────────┐ │ + │ │ DB Primary (AZ-1) │ │ + │ │ DB Standby (AZ-2) │ │ + │ │ Synchronous replication │ │ + │ 
└────────────────────────────┘ │ + └────────────────────────────────────┘ +``` + +- RTO: minutes (automatic failover) +- RPO: 0 (sync replication) +- Protection: against AZ failure, not region failure + +### Multi-Region + +``` +Region A (Primary) Region B (DR) +┌─────────────────────┐ ┌─────────────────────┐ +│ ┌───────────────┐ │ │ ┌───────────────┐ │ +│ │ App + DB │ │ │ │ App + DB │ │ +│ │ Active │──┼──Async───────┼─►│ Standby │ │ +│ └───────────────┘ │ replication │ └───────────────┘ │ +│ │ │ │ │ │ +│ ┌──────▼───────┐ │ │ ┌──────▼───────┐ │ +│ │ DNS / GSLB │ │ │ │ DNS / GSLB │ │ +│ └──────┬───────┘ │ │ └──────┬───────┘ │ +└─────────┼──────────┘ └─────────┼──────────┘ + │ │ + └──────────── Traffic Manager ───────┘ +``` + +| Variant | RTO | RPO | Cost | Failover | +|----------|-----|-----|---------|----------| +| **Active-Passive** | minutes–hours | seconds | Medium | Manual / auto | +| **Active-Active** | seconds | < 1 s | High | Automatic (DNS) | +| **Pilot Light** | tens of minutes | minutes | Low | Manual scaling | +| **Warm Standby** | minutes | seconds | High | Auto (reduced copy) | +| **Backup & Restore** | hours | 24 h | Low | Manual | + +### On-prem → Cloud DR (Hybrid) + +``` +On-prem DC Cloud (DR) +┌─────────────────────┐ ┌─────────────────────┐ +│ ┌───────────────┐ │ │ ┌───────────────┐ │ +│ │ Application │ │ │ │ VM / App │ │ +│ │ + DB │ │ │ │ + DB replica │ │ +│ └───────┬───────┘ │ │ └───────┬───────┘ │ +│ │ │ │ │ │ +│ ┌───────▼───────┐ │ site-to-site│ ┌───────▼───────┐ │ +│ │ Backup proxy │──┼────VPN───────┼─►│ Backup store │ │ +│ └───────────────┘ │ │ └───────────────┘ │ +│ │ │ │ +│ ┌───────────────┐ │ │ ┌───────────────┐ │ +│ │ Tape / NAS │ │ │ │ Veeam / Zerto│ │ +│ └───────────────┘ │ │ └───────────────┘ │ +└─────────────────────┘ └─────────────────────┘ +``` + +- **RTO**: tens of minutes (depends on VM startup) +- **RPO**: minutes–hours (depends on replication tool) +- **Tools**: Veeam, Zerto, Azure Site Recovery, AWS MGN, Commvault +- **Use 
case**: enterprise with on-prem DC that needs DR without a second DC + +--- + +## DR testing + +### Test types + +| Type | Description | Frequency | Risk | +|-----|-------|-----------|--------| +| **Tabletop exercise** | Manual scenario walkthrough, no impact on production | Monthly | None | +| **Walkthrough** | Runbook verification, ensure everyone knows what to do | Quarterly | None | +| **Component test** | Test of a single component (e.g., restore one DB) | Monthly | Low | +| **Integrated test** | Test of the entire stack in isolated environment | Quarterly | Low | +| **Full failover test** | Production failover to DR site | Annually | High | +| **Chaos experiment** | Targeted fault injection into production | Continuous | Medium | + +### Runbook structure + +Each DR scenario should have a runbook: + +```yaml +scenario: "Region A failure" +triggers: + - "CloudWatch alarm: Region A health check 5× timeout" + - "PagerDuty incident P0" +decision_tree: | + 1. Verify: is Region A really unavailable? (check from 3 different locations) + 2. Decide: is RTO at risk? If < 30 % RTO remaining → failover + 3. Failover: run playbook `dr-failover-region-b` + 4. Verification: smoke tests in Region B + 5. Communication: status page + stakeholders +rollback: | + 1. After Region A recovery → replicate changes from B back to A + 2. Repoint DNS to A + 3. Verify data consistency + 4. 
Shut down Region B (or keep as hot standby) +contacts: + primary: "on-call@example.com" + escalation: "infra-lead@example.com" + management: "vp-engineering@example.com" +``` + +--- + +## Best practices + +- **Test recovery, not backup** — a backup without tested recovery is not a backup +- **Automate DR** — Terraform / Ansible for DR environment spin-up, DNS failover +- **Document runbooks** — every scenario, contact, decision tree +- **Expect failure** — design for failure, don't expect everything to work +- **Don't underestimate WRT** — service recovery does not mean full operations (data warming, cache, connections) +- **Align RTO/RPO with business** — technical capabilities must match business requirements +- **Monitor SLI** — without data, SLO cannot be verified +- **DR is not just IT** — communication, PR, legal, compliance + +--- + +## Related + +- [CLOUD.md](CLOUD.md) — cloud DR strategy, AWS/Azure/GCP specific +- [DATACENTERS.md](DATACENTERS.md) — DC redundancy, Tier classification +- [MONITORING.md](MONITORING.md) — alerting, SLI/SLO/SLA +- [CICD.md](CICD.md) — deployment strategy, rollback +- [STORAGE.md](STORAGE.md) — backup storage, replication + +## Sources + +Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md) + +*Last revised: 2026-06-11* diff --git a/DR.md b/DR.md new file mode 100644 index 0000000..9844763 --- /dev/null +++ b/DR.md @@ -0,0 +1,336 @@ +# 🔄 Disaster Recovery and Business Continuity + +## Terminology + +| Abbreviation | Meaning | Description | +|---------|--------|-------| +| **RTO** | Recovery Time Objective | Maximum time from outage to service recovery | +| **RPO** | Recovery Point Objective | Maximum acceptable data loss (time since the last backup) | +| **MTD** | Maximum Tolerable Downtime | Total outage duration the organization can survive | +| **WRT** | Work Recovery Time | Time needed to fully restore operations after IT recovery | +| **MTBF** | Mean Time Between Failures | Average operating time between consecutive failures 
| +| **MTTR** | Mean Time To Repair | Střední doba opravy | +| **SLA** | Service Level Agreement | Smluvní závazek dostupnosti | +| **SLO** | Service Level Objective | Interní cíl dostupnosti | +| **SLI** | Service Level Indicator | Naměřená hodnota dostupnosti | + +### Vztah RTO, RPO, MTD, WRT + +``` +Výpadek ──── RPO ────► Obnova dat ──── RTO ────► Služba běží ──── WRT ────► Plný provoz + │ │ │ + ▼ ▼ ▼ + Ztracená data Čas bez služby Čas do plného výkonu + + MTD = RTO + WRT (max. doba, kterou firma toleruje) +``` + +--- + +## Výpočet uptimu + +### Tabulka devítek + +| Úroveň | Uptime | Downtime / rok | Downtime / měsíc | Downtime / týden | +|--------|--------|---------------|------------------|------------------| +| 90 % (jedna devítka) | 0.9 | 36,5 dne | 72 h | 16,8 h | +| 99 % (dvě devítky) | 0.99 | 3,65 dne | 7,2 h | 1,68 h | +| 99,5 % | 0.995 | 1,83 dne | 3,6 h | 50,4 min | +| 99,9 % (tři devítky) | 0.999 | 8,76 h | 43,2 min | 10,1 min | +| 99,95 % | 0.9995 | 4,38 h | 21,6 min | 5,04 min | +| 99,99 % (čtyři devítky) | 0.9999 | 52,6 min | 4,32 min | 1,01 min | +| 99,995 % | 0.99995 | 26,3 min | 2,16 min | 30,2 s | +| 99,999 % (pět devítek) | 0.99999 | 5,26 min | 25,9 s | 6,05 s | +| 99,9999 % (šest devítek) | 0.999999 | 31,6 s | 2,59 s | 0,605 s | + +### Výpočet + +``` +Dostupnost = (Celkový čas - Downtime) / Celkový čas × 100 % + +Příklad: + Rok = 365 × 24 × 60 = 525 600 minut + Cíl: 99,9 % → povolený downtime = 525 600 × (1 - 0,999) = 525,6 minut = 8,76 h + +Složená dostupnost (řetězec závislostí): + A_web = 99,9 % (3 devítky) + A_api = 99,99 % (4 devítky) + A_db = 99,999 % (5 devítek) + + A_celkem = 0,999 × 0,9999 × 0,99999 = 0,99889 ≈ 99,89 % (méně než 3 devítky!) + +Paralelní dostupnost (redundance): + A_celkem = 1 - (1 - A_1) × (1 - A_2) × ... 
× (1 - A_n) + + Příklad: 2 servery s 99% dostupností + A_celkem = 1 - (1-0,99) × (1-0,99) = 1 - 0,01 × 0,01 = 0,9999 (99,99 %) +``` + +### Kalkulačka + +```python +def uptime_percent_to_downtime(pct, period_days=365): + """Převede procento uptimu na downtime v daném období.""" + total_minutes = period_days * 24 * 60 + allowed_downtime = total_minutes * (1 - pct / 100) + return allowed_downtime # minutes + +def downtime_to_uptime_percent(downtime_minutes, period_days=365): + """Převede downtime v minutách na procento uptimu.""" + total_minutes = period_days * 24 * 60 + return (1 - downtime_minutes / total_minutes) * 100 + +def combined_availability(availabilities): + """Složená dostupnost (sériově zapojené komponenty).""" + result = 1.0 + for a in availabilities: + result *= a + return result + +def redundant_availability(availabilities): + """Paralelní dostupnost (redundantní komponenty).""" + result = 1.0 + for a in availabilities: + result *= (1 - a) + return 1 - result +``` + +### Fallacies výpočtu + +- **Složená dostupnost není součet** — přidání další závislosti vždy snižuje celkovou dostupnost +- **Redundance není zadarmo** — přidání standby komponenty vyžaduje detekci selhání + failover (MTTR se nezlepší automaticky) +- **SLA není garance** — poskytovatelé často počítají SLA jako měsíční průměr, ne per-incident +- **Měření je klíčové** — bez SLI nelze ověřit SLO; "nedoměřená dostupnost neexistuje" +- **Plánovaná odstávka** — někdy se počítá do uptimu, někdy ne (záleží na definici SLA) + +--- + +## DR scénáře + +### Klasifikace + +| Kategorie | Scénář | Typický RTO | Typické RPO | Frekvence | +|-----------|--------|-------------|-------------|-----------| +| **Site** | Výpadek celého DC / regionu | hodiny | minuty | Nízká | +| **Infrastructure** | Selhání HW (storage, switch, server) | minuty–hodiny | sekundy | Střední | +| **Software** | Selhání OS, aplikace, DB | minuty | vteřiny | Vysoká | +| **Data** | Poškození dat, delete, cryptolocker | hodiny | 
okamžik zálohy | Nízká–střední | +| **Human** | Chybný deployment, config change | minuty–hodiny | vteřiny | Střední | +| **Security** | Útok, breach, ransomware | dny | před útokem | Nízká | +| **Network** | Výpadek konektivity, DDoS | minuty–hodiny | N/A | Střední | +| **Cloud provider** | Regionální výpadek (AWS, Azure, GCP) | hodiny | minuty | Velmi nízká | + +### Detail scénářů + +#### Site / Region failure + +| Aspekt | Popis | +|--------|-------| +| **Příčina** | Blackout, požár, povodeň, zemětřesení, výpadek cloud providera | +| **Prevence** | Multi-AZ architektura, multi-region deployment, active-active | +| **Mitigace** | Automatický DNS failover (Route53, Azure Traffic Manager), replica v DR regionu | +| **Testování** | Game day: vypnout primární region, ověřit automatický failover | + +#### Data corruption / human error + +| Aspekt | Popis | +|--------|-------| +| **Příčina** | Chybný SQL příkaz (DELETE bez WHERE), omylem smazaný bucket, chybná migrace | +| **Prevence** | RBAC, MFA pro destructive operace, change management, peer review SQL | +| **Mitigace** | Point-in-time recovery (PITR), transaction log replay, immutable backups | +| **Testování** | Obnova zálohy do izolovaného prostředí, ověření integrity dat | + +#### Ransomware / cyber attack + +| Aspekt | Popis | +|--------|-------| +| **Příčina** | Útok na produkční systémy, zašifrování dat, exfiltrace | +| **Prevence** | Immutable backups (object lock), air-gapped backups, network segmentation | +| **Mitigace** | Obnova z čisté zálohy, re-build infrastructure from IaC | +| **Testování** | Pravidelná obnova v izolované síti, ověření, že backup není infikován | + +--- + +## Prevence — strategie + +### Backup strategie + +| Přístup | Popis | Use case | +|---------|-------|----------| +| **3-2-1 pravidlo** | 3 kopie, 2 různá média, 1 off-site | Univerzální | +| **3-2-1-0** | + 0 chyb po obnově (testování) | Enterprise, compliance | +| **GFS (Grandfather-Father-Son)** | Denní, týdenní, měsíční rotace 
| Dlouhodobý archiv | +| **Incremental forever** | Plná záloha 1×, pak jen změny | Velké objemy dat | +| **Reverse incremental** | Plná + inkrementální, plná je vždy aktuální | Rychlá obnova | + +### Zálohovací metody + +| Metoda | RPO | RTO | Úložiště | Vhodné pro | +|--------|-----|-----|----------|------------| +| **Full backup** | Poslední full | Doba obnovy full | Velké | Malá data, weekly | +| **Incremental** | Poslední inkrement | Full + všechny inkrementy | Malé | Velká data, daily | +| **Differential** | Poslední diff | Full + poslední diff | Střední | Kompromis | +| **Snapshot** | Okamžik snapshotu | vteřiny | Copy-on-write | VM, storage array | +| **Continuous (CDC)** | < 1 s | Sekundy | Log stream | DB (binlog, WAL) | +| **PITR** | Libovolný bod v čase | Dle objemu | Full + WAL | RDS, PostgreSQL, SQL Server | + +### Imutabilita backupů + +Klíčová ochrana proti ransomwaru: + +| Technika | Popis | +|----------|-------| +| **Object Lock (WORM)** | Backup nelze smazat ani přepsat po dobu definované retence (S3 Object Lock, Azure Blob Immutable) | +| **Air gap** | Backup je fyzicky oddělený od produkční sítě (offline disk, tape, cloud bez VPN) | +| **Isolated backup network** | Backup traffic jde přes dedikovanou síť bez přístupu z produkční VLAN | +| **Out-of-band access** | Backup management console není dostupná z produkční sítě | + +--- + +## DR architektury + +### Multi-AZ (Single region) + +``` +Region ┌────────────────────────────────────┐ + │ AZ-1 AZ-2 │ + │ ┌──────────┐ ┌──────────┐ │ + │ │ App │ │ App │ │ + │ └─────┬────┘ └─────┬────┘ │ + │ │ │ │ + │ ┌─────▼────────────────▼─────┐ │ + │ │ Load Balancer (cross-AZ) │ │ + │ └─────────────┬──────────────┘ │ + │ │ │ + │ ┌─────────────▼──────────────┐ │ + │ │ DB Primary (AZ-1) │ │ + │ │ DB Standby (AZ-2) │ │ + │ │ Synchronous replication │ │ + │ └────────────────────────────┘ │ + └────────────────────────────────────┘ +``` + +- RTO: minuty (automatický failover) +- RPO: 0 (sync replication) +- Ochrana: 
proti selhání AZ, nikoliv regionu + +### Multi-Region + +``` +Region A (Primary) Region B (DR) +┌─────────────────────┐ ┌─────────────────────┐ +│ ┌───────────────┐ │ │ ┌───────────────┐ │ +│ │ App + DB │ │ │ │ App + DB │ │ +│ │ Active │──┼──Async───────┼─►│ Standby │ │ +│ └───────────────┘ │ replikace │ └───────────────┘ │ +│ │ │ │ │ │ +│ ┌──────▼───────┐ │ │ ┌──────▼───────┐ │ +│ │ DNS / GSLB │ │ │ │ DNS / GSLB │ │ +│ └──────┬───────┘ │ │ └──────┬───────┘ │ +└─────────┼──────────┘ └─────────┼──────────┘ + │ │ + └──────────── Traffic Manager ───────┘ +``` + +| Varianta | RTO | RPO | Náklady | Failover | +|----------|-----|-----|---------|----------| +| **Active-Passive** | minuty–hodiny | sekundy | Střední | Manuální / auto | +| **Active-Active** | sekundy | < 1 s | Vysoké | Automatický (DNS) | +| **Pilot Light** | desítky minut | minuty | Nízké | Manuální škálování | +| **Warm Standby** | minuty | sekundy | Vysoké | Auto (zmenšená kopie) | +| **Backup & Restore** | hodiny | 24 h | Nízké | Manuální | + +### On-prem → Cloud DR (Hybrid) + +``` +On-prem DC Cloud (DR) +┌─────────────────────┐ ┌─────────────────────┐ +│ ┌───────────────┐ │ │ ┌───────────────┐ │ +│ │ Aplikace │ │ │ │ VM / Aplikace│ │ +│ │ + DB │ │ │ │ + DB replica │ │ +│ └───────┬───────┘ │ │ └───────┬───────┘ │ +│ │ │ │ │ │ +│ ┌───────▼───────┐ │ site-to-site│ ┌───────▼───────┐ │ +│ │ Backup proxy │──┼────VPN───────┼─►│ Backup store │ │ +│ └───────────────┘ │ │ └───────────────┘ │ +│ │ │ │ +│ ┌───────────────┐ │ │ ┌───────────────┐ │ +│ │ Tape / NAS │ │ │ │ Veeam / Zerto│ │ +│ └───────────────┘ │ │ └───────────────┘ │ +└─────────────────────┘ └─────────────────────┘ +``` + +- **RTO**: desítky minut (závisí na startup VM) +- **RPO**: minuty–hodiny (závisí na replikačním nástroji) +- **Nástroje**: Veeam, Zerto, Azure Site Recovery, AWS MGN, Commvault +- **Use case**: enterprise s on-prem DC, které potřebuje DR bez druhého DC + +--- + +## DR testování + +### Typy testů + +| Typ | Popis | Frekvence | 
Riziko | +|-----|-------|-----------|--------| +| **Tabletop exercise** | Manuální procházení scénáře, žádný dopad na produkci | Měsíčně | Žádné | +| **Walkthrough** | Verifikace runbooku, kontrola že všichni ví co dělat | Kvartálně | Žádné | +| **Component test** | Test jedné komponenty (např. obnova jedné DB) | Měsíčně | Nízké | +| **Integrated test** | Test celého stacku v izolovaném prostředí | Kvartálně | Nízké | +| **Full failover test** | Produkční failover do DR site | Ročně | Vysoké | +| **Chaos experiment** | Cílené vnášení poruch do produkce | Průběžně | Střední | + +### Runbook struktura + +Každý DR scénář by měl mít runbook: + +```yaml +scenario: "Region A failure" +triggers: + - "CloudWatch alarm: Region A health check 5× timeout" + - "PagerDuty incident P0" +decision_tree: | + 1. Ověřit: je Region A opravdu nedostupný? (check z 3 různých lokací) + 2. Rozhodnout: je RTO v ohrožení? Pokud zbývá < 30 % RTO → failover + 3. Failover: spustit playbook `dr-failover-region-b` + 4. Verifikace: smoke testy v Region B + 5. Komunikace: status page + stakeholders +rollback: | + 1. Po obnovení Region A → replikace změn z B zpět do A + 2. Repoint DNS na A + 3. Ověřit konzistenci dat + 4. 
Vypnout Region B (nebo ponechat jako hot standby) +contacts: + primary: "on-call@example.com" + escalation: "infra-lead@example.com" + management: "vp-engineering@example.com" +``` + +--- + +## Best practices + +- **Testuj obnovu, ne zálohu** — backup bez testované obnovy není backup +- **Automatizuj DR** — Terraform / Ansible pro spin-up DR prostředí, DNS failover +- **Dokumentuj runbooky** — každý scénář, kontakt, rozhodovací strom +- **Počítej se selháním** — design for failure, nečekej že všechno poběží +- **Nepodceňuj WRT** — obnova služby neznamená plný provoz (data warming, cache, connections) +- **Slaď RTO/RPO s businessem** — technické možnosti musí odpovídat obchodním požadavkům +- **Monitoruj SLI** — bez dat nelze ověřit SLO +- **DR není jen IT** — komunikace, PR, právní, regulace + +--- + +## Související + +- [CLOUD.md](CLOUD.md) — cloud DR strategie, AWS/Azure/GCP specific +- [DATACENTERS.md](DATACENTERS.md) — DC redundance, Tier klasifikace +- [MONITORING.md](MONITORING.md) — alerting, SLI/SLO/SLA +- [CICD.md](CICD.md) — deployment strategie, rollback +- [STORAGE.md](STORAGE.md) — backup storage, replication + +## Zdroje + +Odkazy, knihy a standardy: [sources/infrastructure/sources.md](sources/infrastructure/sources.md) + +*Poslední revize: 2026-06-11* \ No newline at end of file diff --git a/MESSAGING.en.md b/MESSAGING.en.md new file mode 100644 index 0000000..68e5669 --- /dev/null +++ b/MESSAGING.en.md @@ -0,0 +1,275 @@ +# 📨 Messaging and streaming platforms + +## Platform overview + +| Platform | Type | Language | Protocol | Persistence | Use case | +|-----------|-----|-------|----------|-------------|----------| +| **Apache Kafka** | Distributed event store | Java/Scala | Binary (TCP) | Disk (log) | Event streaming, data pipeline, log aggregation | +| **RabbitMQ** | Message broker | Erlang | AMQP 0-9-1, MQTT, STOMP | Disk / RAM | Application messaging, task queue, RPC | +| **Apache Pulsar** | Distributed messaging + streaming | Java | Binary (TCP) 
+ REST | Disk (segmented log) | Streaming + queue in one, multi-tenant | +| **NATS** | Lightweight messaging | Go | NATS protocol (TCP) | Memory / JetStream (disk) | Microservices, IoT, edge, low-latency | +| **AWS SQS** | Managed queue | — | HTTPS | Managed | Decoupling services, serverless | +| **AWS SNS** | Managed pub/sub | — | HTTPS, SQS, Lambda, email | Managed | Push notifications, fanout | +| **Azure Service Bus** | Managed messaging | — | AMQP, HTTPS | Managed | Enterprise messaging, sessions, transactions | +| **Google Pub/Sub** | Managed streaming | — | gRPC, REST | Managed | Event-driven, data pipeline | +| **Red Hat AMQ 7** (Artemis) | Message broker | Java | AMQP, MQTT, STOMP, OpenWire | Disk | Enterprise, JMS, high-availability | +| **Oracle Service Bus (OSB)** | Enterprise ESB | Java | HTTP/S, JMS, SOAP, REST, MQ, FTP, AQ | Managed (WebLogic) | Enterprise integration, SOA, protocol mediation, routing | + +--- + +## Platform details + +### Apache Kafka + +**Architecture:** + +``` +Producer ──► Topic ──► Partition ──► Consumer Group + │ + ├── Partition 0 (Leader) ──► Broker 1 + ├── Partition 1 (Follower) ──► Broker 2 + └── Partition 2 (Follower) ──► Broker 3 +``` + +| Concept | Description | +|---------|-------| +| **Topic** | Logical message category | +| **Partition** | Append-only log, ordered sequence of messages | +| **Broker** | Server in Kafka cluster | +| **Producer** | Publishes messages to topic | +| **Consumer** | Reads messages from partition (within consumer group) | +| **Consumer Group** | Group of consumers sharing topic reading | +| **Offset** | Position in partition (tracked by consumer) | +| **KRaft** | Controller quorum (replaces Zookeeper from Kafka 3.x) | + +**Replication and HA:** + +| Parameter | Value | +|----------|---------| +| Replication factor | 2–3 (typically 3 for production) | +| ISR (In-Sync Replicas) | Number of replicas keeping up with leader | +| Min ISR | Minimum ISR for acknowledging writes (acks=all) | +| acks=0 
| Fire-and-forget (fastest, possible data loss) | +| acks=1 | Write acknowledged by leader (compromise) | +| acks=all | Write acknowledged by all ISR (safest) | +| Leader failover | Automatic election of new leader from ISR | + +**Important configuration:** + +```properties +# Production +replication.factor=3 +min.insync.replicas=2 +default.replication.factor=3 + +# Retention +log.retention.hours=168 # 7 days +log.retention.bytes=-1 # unlimited (or limit) +log.segment.bytes=1073741824 # 1 GB per segment + +# Performance +num.partitions=3 # adjust per need (scale-out) +compression.type=snappy # (snappy, gzip, lz4, zstd) +``` + +**Partitioning strategies:** + +| Strategy | Key | Advantage | Disadvantage | +|----------|------|--------|----------| +| Round-robin | null | Even distribution | Per-key ordering lost | +| Key-based | user_id, order_id | Same key → same partition | Uneven distribution (hot keys) | +| Custom partitioner | Custom logic | Per use-case optimization | More complex maintenance | + +### RabbitMQ + +**Architecture:** + +``` +Producer ──► Exchange ──► Binding ──► Queue ──► Consumer + │ + ┌───────────┼───────────┐ + ▼ ▼ ▼ + Direct Topic Fanout + Exchange Exchange Exchange +``` + +| Concept | Description | +|---------|-------| +| **Exchange** | Receives messages from producer, routes to queue | +| **Binding** | Exchange → queue link with routing key | +| **Queue** | FIFO message queue (consumed by consumer) | +| **Virtual Host (vhost)** | Tenant isolation within a single cluster | +| **Publisher Confirm** | Broker acknowledges message receipt | +| **Consumer Ack** | Consumer acknowledges message processing | + +**Exchange types:** + +| Type | Routing | Use case | +|-----|---------|----------| +| **Direct** | routing_key = binding_key | Task queue, point-to-point | +| **Topic** | routing_key match binding pattern (wildcard `*`, `#`) | Pub/sub with filtering | +| **Fanout** | All bound queues | Broadcast, event notification | +| **Headers** | AMQP 
headers match | Complex routing (not routing key dependent) | + +**Queue types:** + +```properties +# Classic Queue (deprecated in production) +x-queue-type: classic + +# Quorum Queue (recommended for production) +x-queue-type: quorum +x-quorum-initial-group-size: 3 +x-dead-letter-exchange: dlx + +# Stream Queue (for large backlogs) +x-queue-type: stream +x-max-length-bytes: 1073741824 +``` + +**HA and clustering:** + +| Mode | Description | Use case | +|-------|-------|----------| +| **Quorum Queues** | Raft-based replication (3–5 node), auto failover | Production, HA messaging | +| **Federation** | Async message forwarding between independent RabbitMQ clusters | Multi-region, DR | +| **Shovel** | Point-to-point message forwarding (Federation at queue level) | Migration, specific routing | +| **Warm Standby (DR)** | Secondary cluster, started on failover | Cold DR | + +### Apache Pulsar + +**Unique architecture (compute/storage separation):** + +``` +┌──────────────┐ ┌──────────────┐ ┌──────────────┐ +│ Producer │ │ Consumer │ │ Consumer │ +└──────┬───────┘ └──────┬───────┘ └──────┬───────┘ + │ │ │ +┌──────▼───────────────────▼───────────────────▼──────┐ +│ Broker (stateless) │ +│ Subscription: Exclusive / Shared / Failover │ +└──────────────────────┬──────────────────────────────┘ + │ +┌──────────────────────▼──────────────────────────────┐ +│ BookKeeper (stateful storage) │ +│ ├── Bookie 1 ├── Bookie 2 ├── Bookie 3 ├── ... 
│ +│ └── Ledger (append-only, segmented log) │ +└─────────────────────────────────────────────────────┘ +``` + +| Concept | Description | +|---------|-------| +| **Topic** | Logical category (partitioned or non-partitioned) | +| **Subscription** | Delivery mode (Exclusive, Shared, Failover, Key_Shared) | +| **Ledger** | Storage unit in BookKeeper (append-only) | +| **Bookie** | Storage node (BookKeeper) | +| **Managed Ledger** | Segmented log with cache and retention | + +**Advantages over Kafka:** +- Compute/storage separation — independent scaling +- Geo-replication built-in (native) +- Multi-tenant (namespaces, isolation) +- TTL, retry, dead letter topic (built-in) +- Read-at-least-once / effectively-once + +### NATS + +| Feature | Description | +|---------|-------| +| **Core NATS** | Pub/sub, request-reply, < 1 ms latency | +| **JetStream** | Persistence, exactly-once, key-value store, object store | +| **Leaf nodes** | Hierarchical cluster connection | +| **Super-cluster** | Multi-region clustering (global) | + +**Use case:** IoT, edge computing, microservices communication, low-latency messaging. + +### Oracle Service Bus (OSB) + +Part of Oracle SOA Suite, runs on WebLogic Server. Enterprise service bus for integration in Oracle-heavy environments. 
+ +| Concept | Description | +|---------|-------| +| **Proxy Service** | Inbound endpoint (HTTP, JMS, MQ, SOAP, REST) | +| **Business Service** | Target backend service | +| **Pipeline** | Message processing — routing, transformation, validation | +| **Split-Join** | Parallel/sequential orchestration of multiple services | +| **Reporting** | Message tracking, SLA monitoring | + +**Key features:** +- **Protocol mediation** — translation between SOAP/REST/JMS/MQ/FTP +- **Message transformation** — XSLT, XQuery, MFL (non-XML) +- **Throttling, SLA, alerting** — built-in +- **Oracle AQ (Advanced Queuing)** — integration with Oracle DB queues +- **XPath, XQuery, XSLT 2.0/3.0** — native support +- **Error handling** — fault policies, error queues, retry + +**Use case:** Enterprise SOA, Oracle DB → Kafka bridging, legacy mainframe wrapping, B2B integration. + +**Alternatives:** IBM Integration Bus (IIB), MuleSoft Anypoint, WSO2 EI, Apache Camel / ServiceMix. + +--- + +## Platform comparison + +### Performance and scaling + +| Platform | Max throughput | Latency (P99) | Messages/s (1 broker) | Scaling | +|-----------|--------------|---------------|-------------------------|-----------| +| **Kafka** | > 1 GB/s | 2–10 ms | ~1,000,000 | Partitions (horizontal) | +| **Pulsar** | > 1 GB/s | 5–15 ms | ~1,000,000 | Brokers + Bookies | +| **RabbitMQ** | ~100 MB/s | < 1 ms (RAM) | ~100,000 | Clustering (node) | +| **NATS** | > 10 GB/s | < 0.5 ms | ~10,000,000 | Clustering + Leaf nodes | +| **OSB** | < 1 GB/s | 10–100 ms | ~10,000 | Vertical (WebLogic cluster) | + +### Delivery guarantees + +| Platform | At most once | At least once | Exactly once | Ordering | +|-----------|-------------|---------------|-------------|----------| +| **Kafka** | Yes | Yes (acks=all + min.insync) | Yes (idempotent + transactional) | Per partition | +| **Pulsar** | Yes | Yes | Yes (dedup + transactional) | Per partition | +| **RabbitMQ** | Yes | Yes (Publisher Confirm + Consumer Ack) | Limited | Per queue 
| +| **NATS** | Yes | Yes (JetStream) | Limited | Per subject | +| **OSB** | Yes | Yes (XA transactions, exactly-once delivery) | Yes (XA + WS-AT) | Per pipeline | + +### When to use what + +| Use case | Recommended platform | Reasoning | +|----------|---------------------|------------| +| **Event sourcing / audit log** | Kafka, Pulsar | Append-only log, high throughput, replay | +| **CDC (Change Data Capture)** | Kafka (Kafka Connect + Debezium) | Connector ecosystem | +| **Task queue (job processing)** | RabbitMQ, SQS | Dead letter, retry, priority, scheduling | +| **API messaging / microservices** | NATS, RabbitMQ | Low latency, simplicity | +| **Data pipeline (ETL)** | Kafka (KSQL, Kafka Streams) | Stream processing in platform | +| **IoT / Edge** | NATS, MQTT (RabbitMQ) | Lightweight, leaf nodes | +| **Enterprise SOA / EAI** | OSB, IBM IIB, MuleSoft | Protocol mediation, XA, B2B, legacy wrapping | +| **Multi-tenant cloud** | Pulsar | Native multi-tenant, geo-replication | +| **Serverless / event-driven** | SQS/SNS, Pub/Sub | Managed, auto-scaling | + +--- + +## DR and high availability + +See [DATACENTERS.en.md](DATACENTERS.en.md) — section "Impact of individual technologies on DC topology selection" for detailed DR mapping per platform. 
+ +### Best practices + +- **Don't lose messages in queue** — prefer acknowledgement-based consumption (not auto-ack) +- **Dead letter queue** — every main queue has a DLQ for undeliverable messages +- **Monitor lag** — consumer lag is a key metric (Kafka: `kafka.consumer:consumer_lag`) +- **Idempotent consumer** — same message may be delivered twice +- **Retry with backoff** — exponential backoff on processing failure +- **Schema registry** — avoid deserialization errors (Avro, Protobuf, JSON Schema) +- **Encryption** — TLS in transit, encryption at rest (Kafka: cluster-side + topic-level) + +--- + +## Related + +- [DATACENTERS.en.md](DATACENTERS.en.md) — DR topology, per-platform mapping +- [CLOUD.en.md](CLOUD.en.md) — managed messaging (SQS, SNS, Service Bus, Pub/Sub) + +## Sources + +Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md) + +*Last revision: 2026-06-12* \ No newline at end of file diff --git a/MESSAGING.md b/MESSAGING.md new file mode 100644 index 0000000..2c7a332 --- /dev/null +++ b/MESSAGING.md @@ -0,0 +1,275 @@ +# 📨 Messaging a streaming platformy + +## Přehled platforem + +| Platforma | Typ | Jazyk | Protokol | Persistence | Use case | +|-----------|-----|-------|----------|-------------|----------| +| **Apache Kafka** | Distributed event store | Java/Scala | Binary (TCP) | Disk (log) | Event streaming, data pipeline, log aggregation | +| **RabbitMQ** | Message broker | Erlang | AMQP 0-9-1, MQTT, STOMP | Disk / RAM | Aplikační messaging, task queue, RPC | +| **Apache Pulsar** | Distributed messaging + streaming | Java | Binary (TCP) + REST | Disk (segmented log) | Streaming + queue v jednom, multi-tenant | +| **NATS** | Lightweight messaging | Go | NATS protocol (TCP) | Memory / JetStream (disk) | Microservices, IoT, edge, low-latency | +| **AWS SQS** | Managed queue | — | HTTPS | Managed | Decoupling services, serverless | +| **AWS SNS** | Managed pub/sub | — | HTTPS, SQS, Lambda, email | Managed 
| Push notifications, fanout | +| **Azure Service Bus** | Managed messaging | — | AMQP, HTTPS | Managed | Enterprise messaging, sessions, transactions | +| **Google Pub/Sub** | Managed streaming | — | gRPC, REST | Managed | Event-driven, data pipeline | +| **Red Hat AMQ 7** (Artemis) | Message broker | Java | AMQP, MQTT, STOMP, OpenWire | Disk | Enterprise, JMS, high-availability | +| **Oracle Service Bus (OSB)** | Enterprise ESB | Java | HTTP/S, JMS, SOAP, REST, MQ, FTP, AQ | Managed (WebLogic) | Enterprise integration, SOA, protocol mediation, routing | + +--- + +## Detail platforem + +### Apache Kafka + +**Architektura:** + +``` +Producer ──► Topic ──► Partition ──► Consumer Group + │ + ├── Partition 0 (Leader) ──► Broker 1 + ├── Partition 1 (Follower) ──► Broker 2 + └── Partition 2 (Follower) ──► Broker 3 +``` + +| Koncept | Popis | +|---------|-------| +| **Topic** | Logická kategorie zpráv | +| **Partition** | Append-only log, ordered sequence of messages | +| **Broker** | Server v Kafka clusteru | +| **Producer** | Publikuje zprávy do topicu | +| **Consumer** | Čte zprávy z partition (v rámci consumer group) | +| **Consumer Group** | Skupina consumerů sdílejících čtení topicu | +| **Offset** | Pozice v partition (sledovaná consumerem) | +| **KRaft** | Controller quorum (nahrazuje Zookeeper od Kafka 3.x) | + +**Replikace a HA:** + +| Parametr | Hodnota | +|----------|---------| +| Replication factor | 2–3 (typicky 3 pro produkci) | +| ISR (In-Sync Replicas) | Počet replik, které drží krok s leaderem | +| Min ISR | Minimální počet ISR pro potvrzení zápisu (acks=all) | +| acks=0 | Fire-and-forget (nejrychlejší, možná ztráta dat) | +| acks=1 | Zápis potvrzen leaderem (kompromis) | +| acks=all | Zápis potvrzen všemi ISR (nejbezpečnější) | +| Leader failover | Automatický výběr nového leadera z ISR | + +**Důležité konfigurace:** + +```properties +# Produkce +replication.factor=3 +min.insync.replicas=2 +default.replication.factor=3 + +# Retention 
+log.retention.hours=168 # 7 dní +log.retention.bytes=-1 # neomezeno (nebo limit) +log.segment.bytes=1073741824 # 1 GB per segment + +# Performance +num.partitions=3 # podle potřeb (scale-out) +compression.type=snappy # (snappy, gzip, lz4, zstd) +``` + +**Partitioning strategies:** + +| Strategy | Klíč | Výhoda | Nevýhoda | +|----------|------|--------|----------| +| Round-robin | null | Rovnoměrné rozložení | Ztráta pořadí per klíč | +| Key-based | user_id, order_id | Zprávy se stejným klíčem → stejná partition | Nerovnoměrné rozložení (hot keys) | +| Custom partitioner | Vlastní logika | Optimalizace per use case | Složitější na údržbu | + +### RabbitMQ + +**Architektura:** + +``` +Producer ──► Exchange ──► Binding ──► Queue ──► Consumer + │ + ┌───────────┼───────────┐ + ▼ ▼ ▼ + Direct Topic Fanout + Exchange Exchange Exchange +``` + +| Koncept | Popis | +|---------|-------| +| **Exchange** | Přijímá zprávy od producera, routuje do queue | +| **Binding** | Vazba exchange → queue s routing key | +| **Queue** | FIFO fronta zpráv (consumer čte) | +| **Virtual Host (vhost)** | Izolace tenantů v rámci jednoho clusteru | +| **Publisher Confirm** | Potvrzení že broker zprávu přijal | +| **Consumer Ack** | Potvrzení že consumer zprávu zpracoval | + +**Exchange typy:** + +| Typ | Routing | Use case | +|-----|---------|----------| +| **Direct** | routing_key = binding_key | Task queue, point-to-point | +| **Topic** | routing_key match binding pattern (wildcard `*`, `#`) | Pub/sub s filtrováním | +| **Fanout** | Všem bindovaným queue | Broadcast, event notification | +| **Headers** | AMQP headers match | Komplexní routing (není závislý na routing key) | + +**Queue typy:** + +```properties +# Classic Queue (deprecated v produkci) +x-queue-type: classic + +# Quorum Queue (doporučeno pro produkci) +x-queue-type: quorum +x-quorum-initial-group-size: 3 +x-dead-letter-exchange: dlx + +# Stream Queue (pro large backlogs) +x-queue-type: stream +x-max-length-bytes: 1073741824 +``` + 
+**HA a clustering:** + +| Režim | Popis | Use case | +|-------|-------|----------| +| **Quorum Queues** | Raft-based replikace (3–5 node), auto failover | Produkce, HA messaging | +| **Federation** | Async forwarding zpráv mezi nezávislými RabbitMQ clustery | Multi-region, DR | +| **Shovel** | Point-to-point forwarding zpráv (Federation na úrovni queue) | Migrace, specifický routing | +| **Warm Standby (DR)** | Druhý cluster, start až při failoveru | Cold DR | + +### Apache Pulsar + +**Unikátní architektura (compute/storage separation):** + +``` +┌──────────────┐ ┌──────────────┐ ┌──────────────┐ +│ Producer │ │ Consumer │ │ Consumer │ +└──────┬───────┘ └──────┬───────┘ └──────┬───────┘ + │ │ │ +┌──────▼───────────────────▼───────────────────▼──────┐ +│ Broker (stateless) │ +│ Subscription: Exclusive / Shared / Failover │ +└──────────────────────┬──────────────────────────────┘ + │ +┌──────────────────────▼──────────────────────────────┐ +│ BookKeeper (stateful storage) │ +│ ├── Bookie 1 ├── Bookie 2 ├── Bookie 3 ├── ... 
│ +│ └── Ledger (append-only, segmented log) │ +└─────────────────────────────────────────────────────┘ +``` + +| Koncept | Popis | +|---------|-------| +| **Topic** | Logická kategorie (partitioned nebo non-partitioned) | +| **Subscription** | Způsob doručení (Exclusive, Shared, Failover, Key_Shared) | +| **Ledger** | Storage unit v BookKeeper (append-only) | +| **Bookie** | Storage node (BookKeeper) | +| **Managed Ledger** | Segmentovaný log s cache a retention | + +**Výhody oproti Kafce:** +- Compute/storage separation — nezávislé škálování +- Geo-replication built-in (nativní) +- Multi-tenant (namespaces, isolation) +- TTL, retry, dead letter topic (built-in) +- Read-at-least-once / effectively-once + +### NATS + +| Feature | Popis | +|---------|-------| +| **Core NATS** | Pub/sub, request-reply, < 1 ms latence | +| **JetStream** | Persistence, exactly-once, key-value store, object store | +| **Leaf nodes** | Hierarchické propojení clusterů | +| **Super-cluster** | Multi-region clustering (global) | + +**Use case:** IoT, edge computing, microservices communication, low-latency messaging. + +### Oracle Service Bus (OSB) + +Součást Oracle SOA Suite, běží na WebLogic Serveru. Enterprise service bus pro integraci v Oracle-heavy prostředích. 
+ +| Koncept | Popis | +|---------|-------| +| **Proxy Service** | Vstupní endpoint (HTTP, JMS, MQ, SOAP, REST) | +| **Business Service** | Cílový backend service | +| **Pipeline** | Message processing — routing, transformation, validation | +| **Split-Join** | Parallel/sequential orchestration více služeb | +| **Reporting** | Message tracking, SLA monitoring | + +**Klíčové vlastnosti:** +- **Protocol mediation** — překlad mezi SOAP/REST/JMS/MQ/FTP +- **Message transformation** — XSLT, XQuery, MFL (neXML) +- **Throttling, SLA, alerting** — built-in +- **Oracle AQ (Advanced Queuing)** — integrace s Oracle DB frontami +- **XPath, XQuery, XSLT 2.0/3.0** — nativní podpora +- **Error handling** — fault policies, error queues, retry + +**Use case:** Enterprise SOA, Oracle DB → Kafka bridging, legacy mainframe wrapping, B2B integration. + +**Alternativy:** IBM Integration Bus (IIB), MuleSoft Anypoint, WSO2 EI, Apache Camel / ServiceMix. + +--- + +## Srovnání platforem + +### Výkon a škálování + +| Platforma | Max throughput | Latence (P99) | Počet zpráv/s (1 broker) | Škálování | +|-----------|--------------|---------------|-------------------------|-----------| +| **Kafka** | > 1 GB/s | 2–10 ms | ~1 000 000 | Partitions (horizontální) | +| **Pulsar** | > 1 GB/s | 5–15 ms | ~1 000 000 | Brokers + Bookies | +| **RabbitMQ** | ~100 MB/s | < 1 ms (RAM) | ~100 000 | Clustering (node) | +| **NATS** | > 10 GB/s | < 0,5 ms | ~10 000 000 | Clustering + Leaf nodes | +| **OSB** | < 1 GB/s | 10–100 ms | ~10 000 | Vertikální (WebLogic cluster) | + +### Delivery guarantees + +| Platforma | At most once | At least once | Exactly once | Ordering | +|-----------|-------------|---------------|-------------|----------| +| **Kafka** | Ano | Ano (acks=all + min.insync) | Ano (idempotent + transactional) | Per partition | +| **Pulsar** | Ano | Ano | Ano (dedup + transactional) | Per partition | +| **RabbitMQ** | Ano | Ano (Publisher Confirm + Consumer Ack) | Omezeně | Per queue | +| **NATS** | 
Ano | Ano (JetStream) | Omezeně | Per subject | +| **OSB** | Ano | Ano (XA transactions, exactly-once delivery) | Ano (XA + WS-AT) | Per pipeline | + +### Kdy co použít + +| Use case | Doporučená platforma | Zdůvodnění | +|----------|---------------------|------------| +| **Event sourcing / audit log** | Kafka, Pulsar | Append-only log, high throughput, replay | +| **CDC (Change Data Capture)** | Kafka (Kafka Connect + Debezium) | Ekosystém konektorů | +| **Task queue (job processing)** | RabbitMQ, SQS | Dead letter, retry, priority, scheduling | +| **API messaging / microservices** | NATS, RabbitMQ | Nízká latence, jednoduchost | +| **Data pipeline (ETL)** | Kafka (KSQL, Kafka Streams) | Stream processing v platformě | +| **IoT / Edge** | NATS, MQTT (RabbitMQ) | Lightweight, leaf nodes | +| **Enterprise SOA / EAI** | OSB, IBM IIB, MuleSoft | Protocol mediation, XA, B2B, legacy wrapping | +| **Multi-tenant cloud** | Pulsar | Nativní multi-tenant, geo-replication | +| **Serverless / event-driven** | SQS/SNS, Pub/Sub | Managed, auto-scaling | + +--- + +## DR a vysoká dostupnost + +Viz [DATACENTERS.md](DATACENTERS.md) — sekce "Vliv jednotlivých technologií na výběr DC topologie" pro detail DR mapping per platforma. 
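
The delivery guarantees table above marks exactly-once as only partially available for RabbitMQ and NATS. A common workaround is to accept at-least-once delivery and make the consumer idempotent. The following is a minimal sketch of that idea, not any broker's API: `Message`, `IdempotentConsumer`, and `deliver_at_least_once` are hypothetical names, the broker is simulated in memory, and the deduplication set is assumed to fit in process memory (a real system would typically persist it).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    message_id: str  # hypothetical unique ID assigned by the producer
    payload: str


@dataclass
class IdempotentConsumer:
    """Applies each message once, even if the broker redelivers it."""

    seen_ids: set = field(default_factory=set)
    applied: list = field(default_factory=list)

    def handle(self, msg: Message) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        if msg.message_id in self.seen_ids:
            return False
        self.applied.append(msg.payload)
        self.seen_ids.add(msg.message_id)
        return True


def deliver_at_least_once(messages, consumer, lose_ack_for=()):
    """Simulate a broker that redelivers a message whenever its ack is lost."""
    deliveries = 0
    for msg in messages:
        consumer.handle(msg)
        deliveries += 1
        if msg.message_id in lose_ack_for:
            # The ack never reached the broker, so the same message is sent again.
            consumer.handle(msg)
            deliveries += 1
    return deliveries


consumer = IdempotentConsumer()
batch = [Message("m1", "debit 10"), Message("m2", "credit 5"), Message("m3", "debit 7")]
deliveries = deliver_at_least_once(batch, consumer, lose_ack_for={"m2"})
print(deliveries, consumer.applied)
```

In this simulation the broker makes four deliveries for three messages, yet each payload is applied only once. The trade-off is that the consumer has to store message IDs for at least as long as a redelivery can still arrive.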
+
+### Best practices
+
+- **Never lose a queued message**: prefer acknowledgement-based consumption (not auto-ack)
+- **Dead letter queue**: every main queue has a DLQ for undeliverable messages
+- **Lag monitoring**: consumer lag is the key metric (Kafka: `records-lag-max` under `kafka.consumer:type=consumer-fetch-manager-metrics`)
+- **Idempotent consumer**: the same message may be delivered twice
+- **Retry with backoff**: exponential backoff when processing fails
+- **Schema registry**: avoids deserialization errors (Avro, Protobuf, JSON Schema)
+- **Encryption**: TLS in transit, encryption at rest (Kafka: cluster-side + topic-level)
+
+---
+
+## Related
+
+- [DATACENTERS.md](DATACENTERS.md): DR topologies, per-platform mapping
+- [CLOUD.md](CLOUD.md): managed messaging (SQS, SNS, Service Bus, Pub/Sub)
+
+## Sources
+
+Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
+
+*Last revised: 2026-06-12*
\ No newline at end of file
diff --git a/README.en.md b/README.en.md
index 6db2db2..d94b69f 100644
--- a/README.en.md
+++ b/README.en.md
@@ -52,9 +52,10 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
 | 🌐 Network architecture | [NETWORKING.md](NETWORKING.md) | DNS, BGP, VPC, Zero Trust, EVPN VXLAN, TLS | CLOUD |
 | 📊 Monitoring & observability | [MONITORING.md](MONITORING.md) | Prometheus, Grafana, OTel, logging, alerting | — |
 | 🔄 CI/CD & DevOps | [CICD.md](CICD.md) | Pipelines, GitOps, IaC (Terraform), deployment | — |
+| 🔄 Disaster Recovery | [DR.md](DR.md) | RTO, RPO, scenarios, prevention, uptime calculation | CLOUD, DATACENTERS, MONITORING |
 | 🗄️ Database architecture | [DATABASES.md](DATABASES.md) | Classification, sharding, replication, caching | POSTGRESQL, MYSQL, ORACLE, MONGODB, REDIS, CASSANDRA, VEKTOROVE-DB, DATABAZOVE-ENGINY |
 | 🖥️ Hypervisors | [HYPERVISORS.md](HYPERVISORS.md) | VMware, Hyper-V, KVM, Proxmox, migration | STORAGE, SERVER-HW |
-| 🏭 Data centers | [DATACENTERS.md](DATACENTERS.md) | Tier, power, cooling, layout, DC services | MONITORING |
+| 🏭 Data centers | [DATACENTERS.md](DATACENTERS.md) | Tier, power, cooling, layout, DC services, secondary DC topologies | MONITORING |
 | 💾 Storage | [STORAGE.md](STORAGE.md) | SAN/NAS/object, RAID, SDS, Ceph, OpenStack Cinder/Swift/Manila | — |
 | 🔌 Server connectivity | [CONNECTIVITY.md](CONNECTIVITY.md) | Ethernet, FC SAN, iSCSI, NVMe-oF, SAS | — |
 | 🔧 Server hardware | [SERVER-HW.md](SERVER-HW.md) | CPU, RAM, PCIe, NUMA, BMC | CONNECTIVITY |
@@ -89,9 +90,10 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
 | 🌐 Network architecture | [NETWORKING.en.md](NETWORKING.en.md) | DNS, BGP, VPC, Zero Trust, EVPN VXLAN, TLS | CLOUD |
 | 📊 Monitoring & observability | [MONITORING.en.md](MONITORING.en.md) | Prometheus, Grafana, OTel, logging, alerting | — |
 | 🔄 CI/CD & DevOps | [CICD.en.md](CICD.en.md) | Pipelines, GitOps, IaC (Terraform), deployment | — |
+| 🔄 Disaster Recovery | [DR.en.md](DR.en.md) | RTO, RPO, scenarios, prevention, uptime calculation | CLOUD, DATACENTERS, MONITORING |
 | 🗄️ Database architecture | [DATABASES.en.md](DATABASES.en.md) | Classification, sharding, replication, caching | POSTGRESQL, MYSQL, ORACLE, MONGODB, REDIS, CASSANDRA, VECTOR-DBS, DATABASE-ENGINES |
 | 🖥️ Hypervisors | [HYPERVISORS.en.md](HYPERVISORS.en.md) | VMware, Hyper-V, KVM, Proxmox, migration | STORAGE, SERVER-HW |
-| 🏭 Data centers | [DATACENTERS.en.md](DATACENTERS.en.md) | Tier, power, cooling, layout, DC services | MONITORING |
+| 🏭 Data centers | [DATACENTERS.en.md](DATACENTERS.en.md) | Tier, power, cooling, layout, DC services, secondary DC topologies | MONITORING |
 | 💾 Storage | [STORAGE.en.md](STORAGE.en.md) | SAN/NAS/object, RAID, SDS, Ceph | — |
 | 🔌 Server connectivity | [CONNECTIVITY.en.md](CONNECTIVITY.en.md) | Ethernet, FC SAN, iSCSI, NVMe-oF, SAS | — |
 | 🔧 Server hardware | [SERVER-HW.en.md](SERVER-HW.en.md) | CPU, RAM, PCIe, NUMA, BMC | CONNECTIVITY |
@@ -136,6 +138,7 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
 | `DATACENTERS.md` / `DATACENTERS.en.md` | [`MONITORING.md`](MONITORING.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
 | `MONITORING.md` / `MONITORING.en.md` | [`sources/monitoring/sources.md`](sources/monitoring/sources.md) |
 | `CICD.md` / `CICD.en.md` | [`sources/cicd/sources.md`](sources/cicd/sources.md) |
+| `DR.md` / `DR.en.md` | [`CLOUD.md`](CLOUD.md), [`DATACENTERS.md`](DATACENTERS.md), [`MONITORING.md`](MONITORING.md), [`CICD.md`](CICD.md), [`STORAGE.md`](STORAGE.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
 | `PROVISIONING.md` / `PROVISIONING.en.md` | [`CICD.md`](CICD.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
 | `STORAGE.md` / `STORAGE.en.md` | [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
 | `GPU.md` / `GPU.en.md` | [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
diff --git a/README.md b/README.md
index d59971d..63b9c04 100644
--- a/README.md
+++ b/README.md
@@ -52,15 +52,18 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
 | 🌐 Network architecture | [NETWORKING.md](NETWORKING.md) | DNS, BGP, VPC, Zero Trust, EVPN VXLAN, TLS | CLOUD |
 | 📊 Monitoring and observability | [MONITORING.md](MONITORING.md) | Prometheus, Grafana, OTel, logging, alerting, SLO | — |
 | 🔄 CI/CD and DevOps | [CICD.md](CICD.md) | Pipelines, GitOps, IaC (Terraform), deployment strategies | — |
+| 🔄 Disaster Recovery | [DR.md](DR.md) | RTO, RPO, scenarios, prevention, uptime calculation | CLOUD, DATACENTERS, MONITORING |
 | 🗄️ Database architecture | [DATABASES.md](DATABASES.md) | Classification, sharding, replication, caching | POSTGRESQL, MYSQL, ORACLE, MONGODB, REDIS, CASSANDRA, VEKTOROVE-DB, DATABAZOVE-ENGINY |
 | 🖥️ Hypervisors | [HYPERVISORS.md](HYPERVISORS.md) | VMware, Hyper-V, KVM, Proxmox, migration | STORAGE, SERVER-HW |
-| 🏭 Data centers | [DATACENTERS.md](DATACENTERS.md) | Tier, power, cooling, layout, DC services | MONITORING |
+| 🏭 Data centers | [DATACENTERS.md](DATACENTERS.md) | Tier, power, cooling, layout, DC services, secondary DC topologies | MONITORING, MESSAGING |
 | 💾 Storage | [STORAGE.md](STORAGE.md) | SAN/NAS/object, RAID, SDS, Ceph, OpenStack Cinder/Swift/Manila | — |
 | 🔌 Server connectivity | [CONNECTIVITY.md](CONNECTIVITY.md) | Ethernet, FC SAN, iSCSI, NVMe-oF, SAS | — |
 | 🔧 Server hardware | [SERVER-HW.md](SERVER-HW.md) | CPU, RAM, PCIe, NUMA, BMC | CONNECTIVITY |
 | 🎮 GPU | [GPU.md](GPU.md) | NVIDIA/AMD, NVLink, MIG/vGPU, AI, Cyborg | — |
 | ⚙️ Server config | [SERVER-CONFIG.md](SERVER-CONFIG.md) | BIOS tuning, DB/hypervisor/K8s/storage best practices | — |
 | 📦 Provisioning | [PROVISIONING.md](PROVISIONING.md) | PXE, Redfish, Terraform, Ironic, OpenStack deploy | CICD |
+| 📨 Messaging & streaming | [MESSAGING.md](MESSAGING.md) | Kafka, RabbitMQ, Pulsar, NATS, managed queue/pubsub | DATACENTERS, CLOUD |
+| 🏗️ DC migration | [DC-MIGRATION.md](DC-MIGRATION.md) | Strategies, phases, network, DB, rollback | DATACENTERS, CLOUD, DR, NETWORKING, STORAGE |
 | 📋 Legacy index | [HARDWARE.md](HARDWARE.md) | Legacy index → SERVER-HW, GPU, SERVER-CONFIG, PROVISIONING | SERVER-HW, GPU, SERVER-CONFIG, PROVISIONING |
 | 📋 Legacy infrastructure | [INFRASTRUCTURE.md](INFRASTRUCTURE.md) | Legacy index → HYPERVISORS, DATACENTERS, STORAGE, HARDWARE | HYPERVISORS, DATACENTERS, STORAGE, HARDWARE |
 | 📋 Review workflow | [REVIEW.md](REVIEW.md) | Peer review and content control process | — |
@@ -89,15 +92,18 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
 | 🌐 Network architecture | [NETWORKING.en.md](NETWORKING.en.md) | DNS, BGP, VPC, Zero Trust, EVPN VXLAN, TLS | CLOUD |
 | 📊 Monitoring & observability | [MONITORING.en.md](MONITORING.en.md) | Prometheus, Grafana, OTel, logging, alerting | — |
 | 🔄 CI/CD & DevOps | [CICD.en.md](CICD.en.md) | Pipelines, GitOps, IaC (Terraform), deployment | — |
+| 🔄 Disaster Recovery | [DR.en.md](DR.en.md) | RTO, RPO, scenarios, prevention, uptime calculation | CLOUD, DATACENTERS, MONITORING |
 | 🗄️ Database architecture | [DATABASES.en.md](DATABASES.en.md) | Classification, sharding, replication, caching | POSTGRESQL, MYSQL, ORACLE, MONGODB, REDIS, CASSANDRA, VECTOR-DBS, DATABASE-ENGINES |
 | 🖥️ Hypervisors | [HYPERVISORS.en.md](HYPERVISORS.en.md) | VMware, Hyper-V, KVM, Proxmox, migration | STORAGE, SERVER-HW |
-| 🏭 Data centers | [DATACENTERS.en.md](DATACENTERS.en.md) | Tier, power, cooling, layout, DC services | MONITORING |
+| 🏭 Data centers | [DATACENTERS.en.md](DATACENTERS.en.md) | Tier, power, cooling, layout, DC services, secondary DC topologies | MONITORING, MESSAGING |
 | 💾 Storage | [STORAGE.en.md](STORAGE.en.md) | SAN/NAS/object, RAID, SDS, Ceph | — |
 | 🔌 Server connectivity | [CONNECTIVITY.en.md](CONNECTIVITY.en.md) | Ethernet, FC SAN, iSCSI, NVMe-oF, SAS | — |
 | 🔧 Server hardware | [SERVER-HW.en.md](SERVER-HW.en.md) | CPU, RAM, PCIe, NUMA, BMC | CONNECTIVITY |
 | 🎮 GPU | [GPU.en.md](GPU.en.md) | NVIDIA/AMD, NVLink, MIG/vGPU, AI, Cyborg | — |
 | ⚙️ Server config | [SERVER-CONFIG.en.md](SERVER-CONFIG.en.md) | BIOS tuning, DB/hypervisor/K8s/storage best practices | — |
 | 📦 Provisioning | [PROVISIONING.en.md](PROVISIONING.en.md) | PXE, Redfish, Terraform, Ironic, OpenStack deploy | CICD |
+| 📨 Messaging & streaming | [MESSAGING.en.md](MESSAGING.en.md) | Kafka, RabbitMQ, Pulsar, NATS, managed queue/pubsub | DATACENTERS, CLOUD |
+| 🏗️ DC Migration | [DC-MIGRATION.en.md](DC-MIGRATION.en.md) | Strategies, phases, network, DB, rollback | DATACENTERS, CLOUD, DR, NETWORKING, STORAGE |
 | 📋 Legacy index | [HARDWARE.en.md](HARDWARE.en.md) | → SERVER-HW, GPU, SERVER-CONFIG, PROVISIONING | SERVER-HW, GPU, SERVER-CONFIG, PROVISIONING |
 | 📋 Legacy infra | [INFRASTRUCTURE.en.md](INFRASTRUCTURE.en.md) | → HYPERVISORS, DATACENTERS, STORAGE, HARDWARE | HYPERVISORS, DATACENTERS, STORAGE, HARDWARE |
 | 📋 Review workflow | [REVIEW.en.md](REVIEW.en.md) | Review and content control process | — |
@@ -136,6 +142,9 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
 | `DATACENTERS.md` / `DATACENTERS.en.md` | [`MONITORING.md`](MONITORING.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
 | `MONITORING.md` / `MONITORING.en.md` | [`sources/monitoring/sources.md`](sources/monitoring/sources.md) |
 | `CICD.md` / `CICD.en.md` | [`sources/cicd/sources.md`](sources/cicd/sources.md) |
+| `DR.md` / `DR.en.md` | [`CLOUD.md`](CLOUD.md), [`DATACENTERS.md`](DATACENTERS.md), [`MONITORING.md`](MONITORING.md), [`CICD.md`](CICD.md), [`STORAGE.md`](STORAGE.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
+| `MESSAGING.md` / `MESSAGING.en.md` | [`DATACENTERS.md`](DATACENTERS.md), [`CLOUD.md`](CLOUD.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
+| `DC-MIGRATION.md` / `DC-MIGRATION.en.md` | [`DATACENTERS.md`](DATACENTERS.md), [`CLOUD.md`](CLOUD.md), [`DR.md`](DR.md), [`NETWORKING.md`](NETWORKING.md), [`STORAGE.md`](STORAGE.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
 | `PROVISIONING.md` / `PROVISIONING.en.md` | [`CICD.md`](CICD.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
 | `STORAGE.md` / `STORAGE.en.md` | [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
 | `GPU.md` / `GPU.en.md` | [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
@@ -187,4 +196,4 @@ Raw reference data (documentation, books, standards) by area:
 
 ---
 
-*The index is maintained automatically by the `kb-index` agent. Last updated: 2026-06-11.*
+*The index is maintained automatically by the `kb-index` agent. Last updated: 2026-06-12.*
diff --git a/sources/infrastructure/sources.md b/sources/infrastructure/sources.md
index cddb3dc..937bc01 100644
--- a/sources/infrastructure/sources.md
+++ b/sources/infrastructure/sources.md
@@ -111,7 +111,22 @@ Split into separate files:
 | VMware Migration in 2026: Proxmox, KVM, XCP-ng & Veeam — StarWind | https://starwindsoftware.com/blog/vmware-migration-to-proxmox-kvm-xcp-ng-2026 | `[done]` |
 | Complete guide to modern vSphere alternatives — Spectro Cloud | https://www.spectrocloud.com/blog/vsphere-alternatives | `[done]` |
 | Broadcom VMware Acquisition: What's Next — Sayers | https://www.sayers.com/blog/after-the-deal-whats-next-for-vmware-customers | `[done]` |
-| Stanford University migration from VMware to Proxmox | https://itcommunity.stanford.edu/news/enterprise-technology-completes-successful-virtual-infrastructure-migration-vmware-proxmox | `[done]` |
+ | Stanford University migration from VMware to Proxmox | https://itcommunity.stanford.edu/news/enterprise-technology-completes-successful-virtual-infrastructure-migration-vmware-proxmox | `[done]` |
+| | **Messaging / streaming** | |
+| Apache Kafka docs | https://kafka.apache.org/documentation/ | `[done]` |
+| RabbitMQ docs | https://www.rabbitmq.com/documentation.html | `[done]` |
+| Apache Pulsar docs | https://pulsar.apache.org/docs/ | `[done]` |
+| NATS docs | https://docs.nats.io/ | `[done]` |
+| Designing Event-Driven Systems (Confluent) | https://www.confluent.io/designing-event-driven-systems/ | `[done]` |
+| Kafka: The Definitive Guide (2nd ed.) — Confluent | https://www.confluent.io/resources/kafka-the-definitive-guide/ | `[done]` |
+| Enterprise Integration Patterns — Hohpe & Woolf | https://www.enterpriseintegrationpatterns.com/ | `[done]` |
+| | **DC migration** | |
+| AWS Cloud Migration — 6 Strategies for Migrating to the Cloud | https://aws.amazon.com/blogs/enterprise-strategy/6-strategies-for-migrating-applications-to-the-cloud/ | `[done]` |
+| Azure Cloud Migration — Microsoft Cloud Adoption Framework | https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/ | `[done]` |
+| Gartner 5 Rs of Cloud Migration | https://www.gartner.com/en/documents/3984835 | `[done]` |
+| VMware Site Recovery Manager — documentation | https://docs.vmware.com/en/Site-Recovery-Manager/ | `[done]` |
+| Zerto — Disaster Recovery & Migration | https://www.zerto.com/resources/ | `[done]` |
+| The Phoenix Project — IT Ops & Migration patterns | https://itrevolution.com/product/the-phoenix-project/ | `[done]` |
 
 ## Hardware vendors