new files

Stanislav Hubacek
2026-06-16 15:47:45 +02:00
parent 3fa11ef0f6
commit b53714113c
11 changed files with 2298 additions and 7 deletions


@@ -658,6 +658,281 @@ flowchart TD
CLIM -->|"Cold (SE, NO)"| FC3["Free cooling 7000+ h/year<br/>Air-side economizer<br/>PUE < 1.2"]
```
## Secondary data center topologies
When planning a second DC, the key decision is the topology, chosen according to distance, RPO/RTO targets, and budget.
### Distance classification
| Category | Distance | Latency (round-trip) | Use case |
|-----------|-----------|---------------------|----------|
| **Metro (Campus)** | 1–20 km | < 1 ms | Synchronous replication, stretched cluster |
| **Metro** | 20–100 km | 1–5 ms | Metro cluster, mostly sync replication |
| **Regional** | 100–500 km | 5–20 ms | Asynchronous replication, warm standby |
| **Continent** | 500–3000 km | 20–100 ms | Asynchronous replication, cold standby |
| **Global** | 3000+ km | > 100 ms | Async only, no real-time dependencies |
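As a rough illustration, the bands in the table can be encoded as a lookup. The sketch below is in Python; the names `DistanceClass`, `BANDS`, and `classify_distance` are hypothetical, and treating each upper bound as inclusive is an assumption, since neighbouring bands in the table share their boundary values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DistanceClass:
    name: str
    max_km: float      # upper bound of the band, treated as inclusive (assumption)
    replication: str   # replication mode suggested by the table


# Bands copied from the distance classification table.
BANDS = [
    DistanceClass("Metro (Campus)", 20, "sync"),
    DistanceClass("Metro", 100, "mostly sync"),
    DistanceClass("Regional", 500, "async"),
    DistanceClass("Continent", 3000, "async"),
    DistanceClass("Global", float("inf"), "async only"),
]


def classify_distance(km: float) -> DistanceClass:
    """Return the first band whose upper bound covers the given distance."""
    if km < 0:
        raise ValueError("distance must be non-negative")
    for band in BANDS:
        if km <= band.max_km:
            return band
    raise AssertionError("unreachable: the last band is unbounded")
```

For example, `classify_distance(250)` falls into the Regional band, which the table associates with asynchronous replication.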
### Topologies by operational mode
#### Active-Active (Hot-Hot)
```
DC-A (Primary) DC-B (Active)
┌────────────────────┐ ┌────────────────────┐
│ App Active │ │ App Active │
│ DB Active │◄─sync─►│ DB Active │
│ Users → LB → A │ │ Users → LB → B │
└────────────────────┘ └────────────────────┘
│ │
└──── Global Load Balancer ────┘
```
| Parameter | Value |
|----------|---------|
| **RTO** | 0 to seconds (automatic failover, traffic is redirected) |
| **RPO** | 0 (sync replication, commit is confirmed only after write to both DCs) |
| **Max distance** | < 100 km (latency < 5 ms RTT for sync DB replication) |
| **Operating costs** | 2× (both DCs fully active, both fully equipped) |
| **Advantages** | Zero downtime, instant switchover, full utilization of both DCs |
| **Disadvantages** | Requires synchronous replication → distance limit, complex networking, split-brain risk |
**Split-brain solutions**: STONITH (Shoot The Other Node In The Head), watchdog, quorum (3rd node in 3rd location / cloud), fencing, SCSI-3 persistent reservation.
**Use case**: Financial services, telco, payment gateways — where even a minute of downtime = millions.
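The quorum approach listed above can be illustrated with a minimal majority-vote sketch in Python. It assumes one vote per site plus a witness in a third location; the function names and site labels are hypothetical, and real cluster managers add fencing and timeouts on top of this rule.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """A partition may keep serving writes only with a strict majority of votes."""
    return reachable_votes > total_votes // 2


def surviving_partitions(partitions: list[set[str]], voters: set[str]) -> list[set[str]]:
    """Return the partitions that still hold quorum after a network split."""
    return [p for p in partitions if has_quorum(len(p & voters), len(voters))]


# Inter-DC link fails, but the witness still reaches DC-A: only that side stays active.
with_witness = surviving_partitions(
    [{"dc-a", "witness"}, {"dc-b"}], {"dc-a", "dc-b", "witness"}
)

# Two sites and no witness: neither half has a majority, so both must stop writing.
without_witness = surviving_partitions([{"dc-a"}, {"dc-b"}], {"dc-a", "dc-b"})
```

The second case shows why a two-site design needs a third vote: without it, a link failure leaves no side allowed to continue, or, if the rule is relaxed, both sides continue and split-brain follows.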
#### Active-Passive (Hot-Warm, MetroCluster)
```
DC-A (Primary) DC-B (Standby)
┌────────────────────┐ ┌────────────────────┐
│ App Active │ │ App Standby │
│ DB Primary │──sync──►│ DB Standby │
│ Users → LB → A │ │ ~~~ (waiting) ~~~ │
│ DNS: A-record │ │ DNS: health check │
└────────────────────┘ └────────────────────┘
```
| Parameter | Value |
|----------|---------|
| **RTO** | tens of seconds to minutes (DNS failover + app startup) |
| **RPO** | 0 (sync) or seconds (async) |
| **Max distance** | sync < 100 km, async unlimited |
| **Operating costs** | 1.5–1.8× (second DC has reduced or idle compute) |
| **MetroCluster** | Specific implementation: FC SAN over DWDM, sync mirror, automatic failover |
**MetroCluster** (NetApp, Dell EMC, HPE):
- Storage-based cluster with synchronous mirroring between DCs
- Automatic failover on entire DC failure
- Requires dedicated DWDM or dark fiber interconnection
- Typical distance: up to 50 km (for latency < 1 ms RTT)
- Use case: enterprise storage, primary+secondary DC in metropolitan area
#### Hot-Cold (Warm Standby → Cold)
```
DC-A (Primary) DC-B (Cold Standby)
┌────────────────────┐ ┌────────────────────┐
│ App Active │ │ ~~~ powered off ~~~│
│ DB Active │──async─►│ Backup storage │
│ Users → A │ │ ~~~ no compute ~~~│
└────────────────────┘ └────────────────────┘
```
| Parameter | Value |
|----------|---------|
| **RTO** | hours to days (purchase/rent HW, restore from backup) |
| **RPO** | hours (last backup) |
| **Max distance** | unlimited |
| **Operating costs** | 1.1–1.3× (only storage and facility, compute only at failover) |
| **Typical use case** | Low-cost DR, compliance, last resort |
#### Pilot Light
```
DC-A (Primary) DC-B (Pilot Light)
┌────────────────────┐ ┌────────────────────┐
│ App Active │ │ ~~~ off ~~~ │
│ DB Active │──async─►│ DB replica (mini) │
│ All services │ │ Core services only│
│ │ │ (DNS, LDAP, mon) │
└────────────────────┘ └────────────────────┘
On DR: spin-up compute
from IaC, rest from backup
```
- DC-B runs with minimum compute (only core services and DB replica)
- Application layer is spun up from IaC (Terraform, Ansible) only during DR
- Compromise between cost and RTO
### Comparison table
| Topology | RTO | RPO | Cost (× primary) | Max distance | Failover |
|-----------|-----|-----|-------------------|-------------|----------|
| **Active-Active** | 0–s | 0 | 2.0× | < 100 km | Auto (traffic) |
| **MetroCluster** | s–min | 0 | 1.8–2.0× | < 50 km | Auto (storage) |
| **Active-Passive (sync)** | min | 0 | 1.5–1.8× | < 100 km | Semi-auto |
| **Active-Passive (async)** | min–h | s–min | 1.3–1.5× | unlimited | Semi-auto |
| **Pilot Light** | h | min–h | 1.2–1.4× | unlimited | Manual |
| **Warm Standby** | min–h | s–min | 1.5–1.8× | unlimited | Semi-auto |
| **Cold Standby** | days | h | 1.1–1.3× | unlimited | Manual |
### Stretched Cluster
```
┌──── Site A (50 km) ────┐ ┌──── Site B ──────────┐
│ ┌──────────────────┐ │ │ ┌──────────────────┐ │
│ │ ESXi / Hyper-V │ │ │ │ ESXi / Hyper-V │ │
│ │ VM │ │ │ │ VM (complement) │ │
│ └────────┬─────────┘ │ │ └────────┬─────────┘ │
│ │ │ │ │ │
│ ┌────────▼─────────┐ │ │ ┌────────▼─────────┐ │
│ │ Storage (SAN) │──┼────┼──│ Storage (SAN) │ │
│ │ MetroCluster │ │ │ │ MetroCluster │ │
│ └──────────────────┘ │ │ └──────────────────┘ │
└────────────────────────┘ └────────────────────────┘
┌─────▼──────┐
│ vCenter / │
│ Cluster │
│ (single) │
└────────────┘
```
- One cluster stretched across two sites (single management domain)
- VMs can live-migrate between sites (vMotion over distance)
- Storage synchronously mirrored (MetroCluster, VPLEX, vSAN Stretched Cluster)
- **Requirements**: dark fiber / DWDM, low latency (< 5 ms), high link reliability
- **Risks**: split-brain, brain drain (split-site cluster), network dependency
- **Use case**: enterprise with own dark fiber between two DCs in a metropolitan area
### Decision tree
```mermaid
flowchart TD
Start(["Secondary DC"]) --> RPO{"Required RPO?"}
RPO -->|"0 (no data loss)"| SYNC{"Sync replication possible?"}
SYNC -->|"Yes, < 100 km"| ACT{"Want zero downtime?"}
ACT -->|"Yes"| AA["Active-Active<br/>RTO=0, RPO=0, 2× cost"]
ACT -->|"No"| AP["Active-Passive<br/>RTO=min, RPO=0, 1.5×"]
SYNC -->|"No, > 100 km"| ASYNC["Active-Passive (async)<br/>RTO=min, RPO=s, 1.3×"]
RPO -->|"minutes–hours"| WARM{"Want fast failover?"}
WARM -->|"Yes"| PILOT["Pilot Light<br/>RTO=h, RPO=min, 1.2×"]
WARM -->|"No"| COLD["Cold Standby<br/>RTO=days, RPO=h, 1.1×"]
Start --> DIST{"Distance between DCs"}
DIST -->|"< 50 km, own fiber"| MC["MetroCluster / Stretched Cluster<br/>Single management, sync storage"]
DIST -->|"50–300 km"| REG["Regional DR<br/>Active-Passive, async replication"]
DIST -->|"> 300 km"| GLOBAL["Global DR<br/>Cold standby, backup & restore"]
```
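The RPO branch of the decision tree can also be read as a small function. This is a sketch of the diagram only, not a sizing tool: `choose_topology` and its parameters are hypothetical names, and the separate distance branch of the diagram is left out.

```python
def choose_topology(
    rpo_zero: bool,
    distance_km: float,
    zero_downtime: bool = False,
    fast_failover: bool = False,
) -> str:
    """Follow the RPO branch of the decision tree and return the suggested topology."""
    if rpo_zero:
        if distance_km < 100:  # sync replication is considered possible below 100 km
            return "Active-Active" if zero_downtime else "Active-Passive (sync)"
        # Beyond 100 km the tree falls back to async, which relaxes RPO to seconds.
        return "Active-Passive (async)"
    # RPO of minutes to hours is acceptable.
    return "Pilot Light" if fast_failover else "Cold Standby"
```

A call such as `choose_topology(rpo_zero=True, distance_km=40, zero_downtime=True)` lands on Active-Active, while the same requirement at 400 km can only be met approximately with asynchronous Active-Passive.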
### Physical infrastructure for DC interconnection
| Technology | Bandwidth | Max distance | Latency | Use case |
|------------|-----------|-------------|---------|----------|
| **Dark fiber** | 100 GbE–800 GbE | 10–80 km (single-mode) | < 0.1 ms | MetroCluster, stretched cluster |
| **DWDM** | 400 GbE–1.6 TbE (per lambda) | 80–120 km (without amplifier) | < 0.5 ms | Metro, metro cluster |
| **CWDM** | 10–25 GbE (per channel) | 10–40 km | < 0.3 ms | Campus, smaller metro |
| **MPLS L2VPN** | 10–100 GbE | unlimited | 1–10 ms | Regional DR, async replication |
| **Internet IPsec** | 1–10 GbE | unlimited | 5–50 ms | Cold standby, backup |
### Impact of individual technologies on DC topology selection
Choosing a secondary DC topology is not purely an infrastructure decision — each layer (DB, hypervisor, orchestration, messaging) brings its own constraints.
#### Databases
| DB technology | Sync replication | Max distance | Auto-failover | Split-brain handling | Note |
|---------------|---------------|-------------|---------------|-------------------|----------|
| **PostgreSQL** | Synchronous commit (synchronous_standby_names) | < 100 km (latency < 10 ms) | Patroni / repmgr + etcd | Quorum (etcd, 3+ node) | Streaming replication, needs wal_keep_size (wal_keep_segments before PostgreSQL 13) or replication slots |
| **MySQL** | Group Replication (multi-primary, single-primary) | < 100 km | MySQL InnoDB Cluster + MySQL Router | Paxos (Group Replication, 3+ node) | Semi-sync as compromise |
| **Oracle** | Data Guard (SYNC/FASTSYNC/ASYNC), RAC extended | sync < 100 km, async unlimited | Data Guard Broker / FSFO (Fast Start Failover) | Observer (3rd node) | Far Sync for remote DCs |
| **MSSQL** | AlwaysOn Availability Groups (SYNCHRONOUS_COMMIT) | < 100 km | AlwaysOn + Cluster quorum | File share majority / cloud witness | Multi-site cluster support |
| **MongoDB** | Majority write concern + journaling | < 100 km | Replica set auto-election | Arbiter (voting member) | Priority-based failover |
| **Cassandra** | N/A (multi-master, eventual consistency) | unlimited | Yes (peer-to-peer) | None (multi-master, gossip protocol) | Snitch-aware topology, NetworkTopologyStrategy |
| **Redis** | Redis Sentinel / Redis Cluster (async) | unlimited (async) | Sentinel / Cluster failover | Quorum (Sentinel, majority) | PSYNC replication, replication lag |
Key limitation for **sync replication**: keep latency below 5 ms RTT, because every commit waits for confirmation from both DCs. At 100 km the fiber RTT is about 1 ms, which is fine. At 1000 km (about 10 ms RTT), sync replication can cut the throughput of serially committing transactions by 80 % or more.
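The arithmetic behind these figures can be sketched in Python. The propagation speed of roughly 200 km per millisecond in fiber and the 2 ms local commit time are assumptions chosen for illustration; real links add equipment and routing delay, and the function names are hypothetical.

```python
FIBER_KM_PER_MS = 200.0  # assumed propagation speed in fiber, about 2/3 of c


def fiber_rtt_ms(distance_km: float) -> float:
    """Round-trip propagation delay over fiber, ignoring equipment latency."""
    return 2 * distance_km / FIBER_KM_PER_MS


def serial_commit_rate(local_commit_ms: float, rtt_ms: float) -> float:
    """Commits per second for one session that waits for each commit in turn."""
    return 1000.0 / (local_commit_ms + rtt_ms)


def throughput_drop(local_commit_ms: float, distance_km: float) -> float:
    """Fraction of serial commit throughput lost to the remote acknowledgement."""
    local_only = serial_commit_rate(local_commit_ms, 0.0)
    with_remote = serial_commit_rate(local_commit_ms, fiber_rtt_ms(distance_km))
    return 1 - with_remote / local_only


for km in (100, 1000):
    print(f"{km} km: RTT {fiber_rtt_ms(km):.1f} ms, drop {throughput_drop(2.0, km):.0%}")
```

With the assumed 2 ms local commit, the loss at 1000 km comes out at about 83 %, in line with the 80 % figure above. Workloads that commit in parallel or in batches are affected less.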
Suitable for **Active-Active**:
- **Cassandra / ScyllaDB** — native multi-DC, eventual consistency, no split-brain
- **MySQL Group Replication (multi-primary)** — 3+ DC for quorum
- **CockroachDB / TiDB** — native multi-region, ACID across DCs
- **Redis Enterprise** — Active-Active (CRDT-based)
Suitable for **Active-Passive**:
- **PostgreSQL + Patroni** — auto-failover, etcd quorum
- **Oracle Data Guard** — FSFO, far sync for remote DCs
- **MSSQL AlwaysOn** — cloud witness
- **MongoDB Replica Set** — arbiter in a 3rd location
#### Hypervisors
| Hypervisor | Cluster technology | Stretched cluster | Max distance | Split-brain |
|-----------|-------------------|-------------------|-------------|-------------|
| **VMware vSphere** | vSAN Stretched Cluster, Metro vCenter, Site Recovery Manager | Yes (vSAN Stretched Cluster, Metro Cluster) | < 50 km (vSAN Stretched Cluster), < 10 ms RTT | Fencing (STONITH), witness host |
| **Hyper-V** | Storage Replica + Failover Cluster | Yes (Cluster Sets) | < 50 km (sync), unlimited (async) | File share witness / cloud witness |
| **Proxmox VE** | Proxmox HA + Ceph | Limited (Ceph stretch cluster) | < 50 km (Ceph sync) | Ceph monitor quorum (3+ DC) |
| **XCP-ng / XenServer** | Xen Orchestra HA + SR (Storage Repository) replication | Limited | depends on storage replication | — |
| **Nutanix AHV** | Metro Availability (sync), Async DR | Yes (Metro) | < 100 km (sync), unlimited (async) | Witness VM (cloud / 3rd site) |
| **KVM / oVirt** | oVirt HA + GlusterFS / NFS | Limited | depends on storage replication | — |
**vSAN Stretched Cluster specific requirements:**
- Dedicated vSAN network (25 GbE min., < 5 ms RTT)
- Witness host in 3rd location (or cloud witness)
- All VM policies (FTT=1, mirroring striped)
- Storage policy: `site-A + site-B + witness`
#### Kubernetes and container platforms
| Platform | Multi-cluster DR | Replication | Max distance | Failover |
|-----------|-----------------|-----------|-------------|----------|
| **Vanilla K8s** | KubeFed, Cluster API, Velero + Restic | Velero (backup/restore), Rook (Ceph) | unlimited | Manual (Velero restore) |
| **OpenShift** | ACM (Advanced Cluster Management), Velero | OADP (OpenShift API for Data Protection) | unlimited | ACM failover (subscription) |
| **Rancher** | Rancher Multi-Cluster App, Velero | Longhorn (sync/async DR), Velero | unlimited | Semi-auto |
| **Google GKE** | Multi-cluster Services, Backup for GKE | Config Sync, Backup for GKE | unlimited | Manual |
| **Azure AKS** | Azure Arc + Velero + Azure Traffic Manager | AKS backup (Velero), Azure Site Recovery | unlimited | Manual (Velero) |
| **AWS EKS** | EKS multi-cluster, Velero + S3 cross-region | Velero (S3), Rook (EBS snapshots) | unlimited | Manual |
**Key K8s DR principles:**
- **Applications must be stateless** (or state externalized to DB/storage)
- **Velero** — backup/restore entire cluster (PV, resources, helm releases)
- **Rook/Ceph** — cross-region mirroring RBD volumes
- **KubeFed / ACM** — subscription-based deploy to multiple clusters
- **Ingress/Gateway API** — traffic routing between clusters
- **External DNS** — DNS failover on cluster outage
#### Messaging / streaming
| Platform | Replication | Topology | DR support | Max distance |
|-----------|-----------|-----------|------------|-------------|
| **Apache Kafka** | MirrorMaker 2, Confluent Cluster Linking, KRaft quorum | Active-Passive (MM2), Active-Active (Cluster Linking) | MM2: async, Cluster Linking: async | unlimited |
| **RabbitMQ** | Classic Queue Mirroring, Quorum Queues | Active-Passive (Warm Standby) | Federation / Shovel (async) | unlimited |
| **Red Hat AMQ** | (Artemis) Cluster + HA | Active-Passive (shared store / replication) | Live-backup pair | < 100 km (sync) |
| **NATS** | NATS JetStream (cluster + cross-account) | Active-Active (Leaf nodes, cross-account) | Super-cluster, failover | unlimited |
| **Apache Pulsar** | BookKeeper (bookie rack-aware), geo-replication | Active-Active (geo-replication) | Built-in (cluster-level) | unlimited (async) |
| **AWS SQS/SNS** | Managed, AWS region pairs | Active-Active (multi-region) | Built-in (AWS managed) | unlimited |
| **Azure Service Bus** | Managed, paired region | Active-Passive (paired region) | Built-in (geo-recovery) | unlimited |
| **Oracle Service Bus (OSB)** | Oracle WebLogic Cluster + JDBC store + AQ | Active-Passive (WebLogic Cluster + Data Guard) | OSB/WLS cluster + Oracle RAC/Data Guard sync | < 100 km (Data Guard sync), unlimited (async) |
**Messaging DR recommendations:**
- **Kafka**: use Cluster Linking for Active-Active, or MirrorMaker 2 for Active-Passive; replicate only critical topics
- **RabbitMQ**: Quorum Queues + Federation upstream for DR; avoid Classic Queue Mirroring (deprecated)
- **Pulsar**: native geo-replication, bookie rack-aware for stretched cluster; easiest DR among messaging platforms
- **OSB**: WebLogic cluster + Oracle RAC/Data Guard; DR depends on DB layer, not on OSB itself
### Per-layer limitations summary table
| Layer | Limiting factor for secondary DC | Max distance for sync | Impact on topology selection |
|--------|-----------------------------------|----------------------|--------------------------|
| **Storage** | Sync mirror latency, DWDM cost | < 50 km (MetroCluster) | Stretched cluster only in metro |
| **Databases** | Commit wait for sync replication | < 100 km (5 ms RTT) | Active-Active only with multi-master DB |
| **Hypervisor** | Stretched cluster quorum + fencing | < 50 km (vSAN, 5 ms) | MetroCluster / stretched cluster |
| **Kubernetes** | Velero restore time, Rook mirror latency | unlimited (async) | Active-Passive, cold standby |
| **Messaging** | Replication lag, offset management | unlimited (async) | Active-Active (Kafka, Pulsar, NATS) or Active-Passive |
| **Network** | Dark fiber/DWDM cost, latency | < 100 km (metro fiber) | Limits sync replication options |
| **Application** | Stateful/stateless, connection draining | depends on architecture | Stateless app → any topology |
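One way to use the summary table is to take the tightest sync limit among the layers a deployment actually relies on. The Python sketch below does that; the dictionary simply restates the table's distance column, `math.inf` stands in for the async-only layers, and the function names are hypothetical.

```python
import math

# Sync distance limits in km, restated from the summary table.
# math.inf marks layers that the table lists as async (no sync limit).
SYNC_LIMIT_KM = {
    "storage": 50,
    "databases": 100,
    "hypervisor": 50,
    "kubernetes": math.inf,
    "messaging": math.inf,
    "network": 100,
}


def binding_layers(layers: list[str]) -> tuple[float, list[str]]:
    """Return the tightest sync limit and the layers that impose it."""
    if not layers:
        raise ValueError("at least one layer is required")
    limit = min(SYNC_LIMIT_KM[layer] for layer in layers)
    return limit, sorted(layer for layer in layers if SYNC_LIMIT_KM[layer] == limit)


def sync_feasible(layers: list[str], distance_km: float) -> bool:
    """True if every listed layer can replicate synchronously at this distance."""
    limit, _ = binding_layers(layers)
    return distance_km < limit
```

For a site 80 km away, `sync_feasible(["databases", "network"], 80)` holds, but adding `"storage"` to the list makes the 50 km MetroCluster limit the binding constraint.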
## Disk monitoring — S.M.A.R.T.
Self-Monitoring, Analysis and Reporting Technology — predictive monitoring of HDD/SSD.


@@ -658,6 +658,281 @@ flowchart TD
CLIM -->|"Cold (SE, NO)"| FC3["Free cooling 7000+ h/year<br/>Air-side economizer<br/>PUE < 1.2"]
```
## Secondary data center topologies
When planning a second DC, the key decision is the topology, chosen according to distance, RPO/RTO targets, and budget.
### Distance classification
| Category | Distance | Latency (round-trip) | Use case |
|-----------|-----------|---------------------|----------|
| **Metro (Campus)** | 1–20 km | < 1 ms | Synchronous replication, stretched cluster |
| **Metro** | 20–100 km | 1–5 ms | Metro cluster, mostly sync replication |
| **Regional** | 100–500 km | 5–20 ms | Asynchronous replication, warm standby |
| **Continent** | 500–3000 km | 20–100 ms | Asynchronous replication, cold standby |
| **Global** | 3000+ km | > 100 ms | Async only, no real-time dependencies |
### Topologies by operational mode
#### Active-Active (Hot-Hot)
```
DC-A (Primary) DC-B (Active)
┌────────────────────┐ ┌────────────────────┐
│ App Active │ │ App Active │
│ DB Active │◄─sync─►│ DB Active │
│ Users → LB → A │ │ Users → LB → B │
└────────────────────┘ └────────────────────┘
│ │
└──── Global Load Balancer ────┘
```
| Parameter | Value |
|----------|---------|
| **RTO** | 0 to seconds (automatic failover, traffic is redirected) |
| **RPO** | 0 (sync replication, commit is confirmed only after write to both DCs) |
| **Max distance** | < 100 km (latency < 5 ms RTT for sync DB replication) |
| **Operating costs** | 2× (both DCs fully active, both fully equipped) |
| **Advantages** | Zero downtime, instant switchover, full utilization of both DCs |
| **Disadvantages** | Requires synchronous replication → distance limit, complex networking, split-brain risk |
**Split-brain solutions**: STONITH (Shoot The Other Node In The Head), watchdog, quorum (3rd node in 3rd location / cloud), fencing, SCSI-3 persistent reservation.
**Use case**: Financial services, telco, payment gateways — where even a minute of downtime = millions.
#### Active-Passive (Hot-Warm, MetroCluster)
```
DC-A (Primary) DC-B (Standby)
┌────────────────────┐ ┌────────────────────┐
│ App Active │ │ App Standby │
│ DB Primary │──sync──►│ DB Standby │
│ Users → LB → A │ │ ~~~ (waiting) ~~~ │
│ DNS: A-record │ │ DNS: health check │
└────────────────────┘ └────────────────────┘
```
| Parameter | Value |
|----------|---------|
| **RTO** | tens of seconds to minutes (DNS failover + app startup) |
| **RPO** | 0 (sync) or seconds (async) |
| **Max distance** | sync < 100 km, async unlimited |
| **Operating costs** | 1.5–1.8× (second DC has reduced or idle compute) |
| **MetroCluster** | Specific implementation: FC SAN over DWDM, sync mirror, automatic failover |
**MetroCluster** (NetApp, Dell EMC, HPE):
- Storage-based cluster with synchronous mirroring between DCs
- Automatic failover on entire DC failure
- Requires dedicated DWDM or dark fiber interconnection
- Typical distance: up to 50 km (for latency < 1 ms RTT)
- Use case: enterprise storage, primary+secondary DC in metropolitan area
#### Hot-Cold (Warm Standby → Cold)
```
DC-A (Primary) DC-B (Cold Standby)
┌────────────────────┐ ┌────────────────────┐
│ App Active │ │ ~~~ powered off ~~~│
│ DB Active │──async─►│ Backup storage │
│ Users → A │ │ ~~~ no compute ~~~│
└────────────────────┘ └────────────────────┘
```
| Parameter | Value |
|----------|---------|
| **RTO** | hours to days (purchase/rent HW, restore from backup) |
| **RPO** | hours (last backup) |
| **Max distance** | unlimited |
| **Operating costs** | 1.1–1.3× (only storage and facility, compute only at failover) |
| **Typical use case** | Low-cost DR, compliance, last resort |
#### Pilot Light
```
DC-A (Primary) DC-B (Pilot Light)
┌────────────────────┐ ┌────────────────────┐
│ App Active │ │ ~~~ off ~~~ │
│ DB Active │──async─►│ DB replica (mini) │
│ All services │ │ Core services only│
│ │ │ (DNS, LDAP, mon) │
└────────────────────┘ └────────────────────┘
On DR: spin-up compute
from IaC, rest from backup
```
- DC-B runs with minimum compute (only core services and DB replica)
- Application layer is spun up from IaC (Terraform, Ansible) only during DR
- Compromise between cost and RTO
### Comparison table
| Topology | RTO | RPO | Cost (× primary) | Max distance | Failover |
|-----------|-----|-----|-------------------|-------------|----------|
| **Active-Active** | 0–s | 0 | 2.0× | < 100 km | Auto (traffic) |
| **MetroCluster** | s–min | 0 | 1.8–2.0× | < 50 km | Auto (storage) |
| **Active-Passive (sync)** | min | 0 | 1.5–1.8× | < 100 km | Semi-auto |
| **Active-Passive (async)** | min–h | s–min | 1.3–1.5× | unlimited | Semi-auto |
| **Pilot Light** | h | min–h | 1.2–1.4× | unlimited | Manual |
| **Warm Standby** | min–h | s–min | 1.5–1.8× | unlimited | Semi-auto |
| **Cold Standby** | days | h | 1.1–1.3× | unlimited | Manual |
### Stretched Cluster
```
┌──── Site A (50 km) ────┐ ┌──── Site B ──────────┐
│ ┌──────────────────┐ │ │ ┌──────────────────┐ │
│ │ ESXi / Hyper-V │ │ │ │ ESXi / Hyper-V │ │
│ │ VM │ │ │ │ VM (complement) │ │
│ └────────┬─────────┘ │ │ └────────┬─────────┘ │
│ │ │ │ │ │
│ ┌────────▼─────────┐ │ │ ┌────────▼─────────┐ │
│ │ Storage (SAN) │──┼────┼──│ Storage (SAN) │ │
│ │ MetroCluster │ │ │ │ MetroCluster │ │
│ └──────────────────┘ │ │ └──────────────────┘ │
└────────────────────────┘ └────────────────────────┘
┌─────▼──────┐
│ vCenter / │
│ Cluster │
│ (single) │
└────────────┘
```
- One cluster stretched across two sites (single management domain)
- VMs can live-migrate between sites (vMotion over distance)
- Storage synchronously mirrored (MetroCluster, VPLEX, vSAN Stretched Cluster)
- **Requirements**: dark fiber / DWDM, low latency (< 5 ms), high link reliability
- **Risks**: split-brain, brain drain (split-site cluster), network dependency
- **Use case**: enterprise with own dark fiber between two DCs in a metropolitan area
### Decision tree
```mermaid
flowchart TD
Start(["Secondary DC"]) --> RPO{"Required RPO?"}
RPO -->|"0 (no data loss)"| SYNC{"Sync replication possible?"}
SYNC -->|"Yes, < 100 km"| ACT{"Want zero downtime?"}
ACT -->|"Yes"| AA["Active-Active<br/>RTO=0, RPO=0, 2× cost"]
ACT -->|"No"| AP["Active-Passive<br/>RTO=min, RPO=0, 1.5×"]
SYNC -->|"No, > 100 km"| ASYNC["Active-Passive (async)<br/>RTO=min, RPO=s, 1.3×"]
RPO -->|"minutes–hours"| WARM{"Want fast failover?"}
WARM -->|"Yes"| PILOT["Pilot Light<br/>RTO=h, RPO=min, 1.2×"]
WARM -->|"No"| COLD["Cold Standby<br/>RTO=days, RPO=h, 1.1×"]
Start --> DIST{"Distance between DCs"}
DIST -->|"< 50 km, own fiber"| MC["MetroCluster / Stretched Cluster<br/>Single management, sync storage"]
DIST -->|"50–300 km"| REG["Regional DR<br/>Active-Passive, async replication"]
DIST -->|"> 300 km"| GLOBAL["Global DR<br/>Cold standby, backup & restore"]
```
### Physical infrastructure for DC interconnection
| Technology | Bandwidth | Max distance | Latency | Use case |
|------------|-----------|-------------|---------|----------|
| **Dark fiber** | 100 GbE–800 GbE | 10–80 km (single-mode) | < 0.1 ms | MetroCluster, stretched cluster |
| **DWDM** | 400 GbE–1.6 TbE (per lambda) | 80–120 km (without amplifier) | < 0.5 ms | Metro, metro cluster |
| **CWDM** | 10–25 GbE (per channel) | 10–40 km | < 0.3 ms | Campus, smaller metro |
| **MPLS L2VPN** | 10–100 GbE | unlimited | 1–10 ms | Regional DR, async replication |
| **Internet IPsec** | 1–10 GbE | unlimited | 5–50 ms | Cold standby, backup |
### Impact of individual technologies on DC topology selection
Choosing a secondary DC topology is not purely an infrastructure decision — each layer (DB, hypervisor, orchestration, messaging) brings its own constraints.
#### Databases
| DB technology | Sync replication | Max distance | Auto-failover | Split-brain handling | Note |
|---------------|---------------|-------------|---------------|-------------------|----------|
| **PostgreSQL** | Synchronous commit (synchronous_standby_names) | < 100 km (latency < 10 ms) | Patroni / repmgr + etcd | Quorum (etcd, 3+ node) | Streaming replication, needs wal_keep_size (wal_keep_segments before PostgreSQL 13) or replication slots |
| **MySQL** | Group Replication (multi-primary, single-primary) | < 100 km | MySQL InnoDB Cluster + MySQL Router | Paxos (Group Replication, 3+ node) | Semi-sync as compromise |
| **Oracle** | Data Guard (SYNC/FASTSYNC/ASYNC), RAC extended | sync < 100 km, async unlimited | Data Guard Broker / FSFO (Fast Start Failover) | Observer (3rd node) | Far Sync for remote DCs |
| **MSSQL** | AlwaysOn Availability Groups (SYNCHRONOUS_COMMIT) | < 100 km | AlwaysOn + Cluster quorum | File share majority / cloud witness | Multi-site cluster support |
| **MongoDB** | Majority write concern + journaling | < 100 km | Replica set auto-election | Arbiter (voting member) | Priority-based failover |
| **Cassandra** | N/A (multi-master, eventual consistency) | unlimited | Yes (peer-to-peer) | None (multi-master, gossip protocol) | Snitch-aware topology, NetworkTopologyStrategy |
| **Redis** | Redis Sentinel / Redis Cluster (async) | unlimited (async) | Sentinel / Cluster failover | Quorum (Sentinel, majority) | PSYNC replication, replication lag |
Key limitation for **sync replication**: keep latency below 5 ms RTT, because every commit waits for confirmation from both DCs. At 100 km the fiber RTT is about 1 ms, which is fine. At 1000 km (about 10 ms RTT), sync replication can cut the throughput of serially committing transactions by 80 % or more.
Suitable for **Active-Active**:
- **Cassandra / ScyllaDB** — native multi-DC, eventual consistency, no split-brain
- **MySQL Group Replication (multi-primary)** — 3+ DC for quorum
- **CockroachDB / TiDB** — native multi-region, ACID across DCs
- **Redis Enterprise** — Active-Active (CRDT-based)
Suitable for **Active-Passive**:
- **PostgreSQL + Patroni** — auto-failover, etcd quorum
- **Oracle Data Guard** — FSFO, far sync for remote DCs
- **MSSQL AlwaysOn** — cloud witness
- **MongoDB Replica Set** — arbiter in a 3rd location
#### Hypervisors
| Hypervisor | Cluster technology | Stretched cluster | Max distance | Split-brain |
|-----------|-------------------|-------------------|-------------|-------------|
| **VMware vSphere** | vSAN Stretched Cluster, Metro vCenter, Site Recovery Manager | Yes (vSAN Stretched Cluster, Metro Cluster) | < 50 km (vSAN Stretched Cluster), < 10 ms RTT | Fencing (STONITH), witness host |
| **Hyper-V** | Storage Replica + Failover Cluster | Yes (Cluster Sets) | < 50 km (sync), unlimited (async) | File share witness / cloud witness |
| **Proxmox VE** | Proxmox HA + Ceph | Limited (Ceph stretch cluster) | < 50 km (Ceph sync) | Ceph monitor quorum (3+ DC) |
| **XCP-ng / XenServer** | Xen Orchestra HA + SR (Storage Repository) replication | Limited | depends on storage replication | — |
| **Nutanix AHV** | Metro Availability (sync), Async DR | Yes (Metro) | < 100 km (sync), unlimited (async) | Witness VM (cloud / 3rd site) |
| **KVM / oVirt** | oVirt HA + GlusterFS / NFS | Limited | depends on storage replication | — |
**vSAN Stretched Cluster specific requirements:**
- Dedicated vSAN network (25 GbE min., < 5 ms RTT)
- Witness host in 3rd location (or cloud witness)
- All VM policies (FTT=1, mirroring striped)
- Storage policy: `site-A + site-B + witness`
#### Kubernetes and container platforms
| Platform | Multi-cluster DR | Replication | Max distance | Failover |
|-----------|-----------------|-----------|-------------|----------|
| **Vanilla K8s** | KubeFed, Cluster API, Velero + Restic | Velero (backup/restore), Rook (Ceph) | unlimited | Manual (Velero restore) |
| **OpenShift** | ACM (Advanced Cluster Management), Velero | OADP (OpenShift API for Data Protection) | unlimited | ACM failover (subscription) |
| **Rancher** | Rancher Multi-Cluster App, Velero | Longhorn (sync/async DR), Velero | unlimited | Semi-auto |
| **Google GKE** | Multi-cluster Services, Backup for GKE | Config Sync, Backup for GKE | unlimited | Manual |
| **Azure AKS** | Azure Arc + Velero + Azure Traffic Manager | AKS backup (Velero), Azure Site Recovery | unlimited | Manual (Velero) |
| **AWS EKS** | EKS multi-cluster, Velero + S3 cross-region | Velero (S3), Rook (EBS snapshots) | unlimited | Manual |
**Key K8s DR principles:**
- **Applications must be stateless** (or state externalized to DB/storage)
- **Velero** — backup/restore entire cluster (PV, resources, helm releases)
- **Rook/Ceph** — cross-region mirroring RBD volumes
- **KubeFed / ACM** — subscription-based deploy to multiple clusters
- **Ingress/Gateway API** — traffic routing between clusters
- **External DNS** — DNS failover on cluster outage
#### Messaging / streaming
| Platform | Replication | Topology | DR support | Max distance |
|-----------|-----------|-----------|------------|-------------|
| **Apache Kafka** | MirrorMaker 2, Confluent Cluster Linking, KRaft quorum | Active-Passive (MM2), Active-Active (Cluster Linking) | MM2: async, Cluster Linking: async | unlimited |
| **RabbitMQ** | Classic Queue Mirroring, Quorum Queues | Active-Passive (Warm Standby) | Federation / Shovel (async) | unlimited |
| **Red Hat AMQ** | (Artemis) Cluster + HA | Active-Passive (shared store / replication) | Live-backup pair | < 100 km (sync) |
| **NATS** | NATS JetStream (cluster + cross-account) | Active-Active (Leaf nodes, cross-account) | Super-cluster, failover | unlimited |
| **Apache Pulsar** | BookKeeper (bookie rack-aware), geo-replication | Active-Active (geo-replication) | Built-in (cluster-level) | unlimited (async) |
| **AWS SQS/SNS** | Managed, AWS region pairs | Active-Active (multi-region) | Built-in (AWS managed) | unlimited |
| **Azure Service Bus** | Managed, paired region | Active-Passive (paired region) | Built-in (geo-recovery) | unlimited |
| **Oracle Service Bus (OSB)** | Oracle WebLogic Cluster + JDBC store + AQ | Active-Passive (WebLogic Cluster + Data Guard) | OSB/WLS cluster + Oracle RAC/Data Guard sync | < 100 km (Data Guard sync), unlimited (async) |
**Messaging DR recommendations:**
- **Kafka**: use Cluster Linking for Active-Active, or MirrorMaker 2 for Active-Passive; replicate only critical topics
- **RabbitMQ**: Quorum Queues + Federation upstream for DR; avoid Classic Queue Mirroring (deprecated)
- **Pulsar**: native geo-replication, bookie rack-aware for stretched cluster; easiest DR among messaging platforms
- **OSB**: WebLogic cluster + Oracle RAC/Data Guard; DR depends on DB layer, not on OSB itself
### Per-layer limitations summary table
| Layer | Limiting factor for secondary DC | Max distance for sync | Impact on topology selection |
|--------|-----------------------------------|----------------------|--------------------------|
| **Storage** | Sync mirror latency, DWDM cost | < 50 km (MetroCluster) | Stretched cluster only in metro |
| **Databases** | Commit wait for sync replication | < 100 km (5 ms RTT) | Active-Active only with multi-master DB |
| **Hypervisor** | Stretched cluster quorum + fencing | < 50 km (vSAN, 5 ms) | MetroCluster / stretched cluster |
| **Kubernetes** | Velero restore time, Rook mirror latency | unlimited (async) | Active-Passive, cold standby |
| **Messaging** | Replication lag, offset management | unlimited (async) | Active-Active (Kafka, Pulsar, NATS) or Active-Passive |
| **Network** | Dark fiber/DWDM cost, latency | < 100 km (metro fiber) | Limits sync replication options |
| **Application** | Stateful/stateless, connection draining | depends on architecture | Stateless app → any topology |
## Disk monitoring — S.M.A.R.T.
Self-Monitoring, Analysis and Reporting Technology — predictive monitoring of HDD/SSD.
@@ -785,4 +1060,4 @@ OpenStack přináší do DC softwarovou abstrakční vrstvu, která umožňuje m
- Academic / HPC clusters (Ironic, Cyborg, Manila)
- Government / regulated environments (on-prem, audit trail)
*Last revised: 2026-06-03*
*Last revised: 2026-06-12*

DC-MIGRATION.en.md Normal file

@@ -0,0 +1,246 @@
# 🏗️ Data Center Migration
## Migration strategies
| Strategy | RTO | RPO | Risk | Cost | Duration | Description |
|-----------|-----|-----|--------|---------|-------------|-------|
| **Cold / Big Bang** | hours–days | days | High | Low | days | Shut everything down, move, power up |
| **Phased / Wave** | minutes (per wave) | minutes | Medium | Medium | weeks–months | Workloads moved in waves |
| **Rolling** | 0 (live) | 0 | Low | High | months | Live migration per VM/service |
| **Parallel Run** | 0 | 0 | Low | Very high | months | Both DCs operational, gradual cutover |
| **Pilot Light** | hours | minutes | Medium | Low | weeks | Critical services in new DC, rest migrates |
| **Lift & Shift** | hours | minutes | Medium | Low | weeks | VMs/servers moved without configuration changes |
| **Re-platform** | hours | minutes | Low | Medium | months | Optimization during migration (OS upgrade, resize) |
| **Re-architect** | 0 | 0 | Low | High | months–years | Application redesigned for new platform |
---
## Decision tree
```mermaid
flowchart TD
Start(["DC Migration"]) --> APP{"Application\nstateful?"}
APP -->|"Yes"| DOWNTIME{"Tolerates\ndowntime?"}
APP -->|"No"| ROLLING["Rolling / Parallel Run"]
DOWNTIME -->|"Yes, hours+"| COLD["Cold / Big Bang\nSimplest, cheapest\nRisk: all at once"]
DOWNTIME -->|"Yes, minutes"| PHASED["Phased / Wave\nBy application / business unit"]
DOWNTIME -->|"No (zero downtime)"| SYNC{"Sync replication\npossible?"}
SYNC -->|"Yes, < 100 km"| ROLLING
SYNC -->|"No"| PARALLEL["Parallel Run\nBoth DCs active, gradual cutover"]
ROLLING --> ROLL_HA{"VMware,\nHyper-V?"}
ROLL_HA -->|"Yes"| VMOTION["vMotion / Storage vMotion (VMware)\nLive Migration (Hyper-V), near-zero downtime"]
ROLL_HA -->|"No"| ROLL_REPL["Storage + DB replication\nGradual workload migration"]
```
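
The tree above can also be read as a small function. The following is a minimal sketch rather than part of the original guide: the `Workload` fields, the 60-minute threshold standing in for "hours+", and the returned labels are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    stateful: bool
    tolerated_downtime_minutes: float  # 0 means zero downtime is required
    sync_replication_possible: bool    # e.g. sites closer than ~100 km
    live_migration_hypervisor: bool    # VMware or Hyper-V


def choose_strategy(w: Workload) -> str:
    """Walk the decision tree top-down and return the leaf label."""
    if w.stateful:
        if w.tolerated_downtime_minutes >= 60:  # assumed meaning of "hours+"
            return "Cold / Big Bang"
        if w.tolerated_downtime_minutes > 0:
            return "Phased / Wave"
        if not w.sync_replication_possible:
            return "Parallel Run"
    # Stateless workloads and stateful ones with sync replication go Rolling.
    if w.live_migration_hypervisor:
        return "Rolling: hypervisor live migration"
    return "Rolling: storage + DB replication"
```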
---
## Migration phases
### 1. Discovery and assessment
| Task | Tools | Output |
|------|----------|--------|
| HW and SW inventory | RVTools, NetBox, CMDB | Server, VM, and service list |
| Dependency mapping | ServiceNow, AppDynamics, manual | Application dependency graph |
| Traffic analysis | NetFlow, sFlow, vRNI | Bandwidth, latency, peak usage |
| Performance baseline | Prometheus, Zabbix, vRealize | CPU/RAM/disk/network per workload |
| License audit | Flexera, SAM | Licenses, support, compliance |

**Output:** a workload list with RTO/RPO, dependencies, and criticality for each entry.
### 2. Planning
- **Wave plan**: division of workloads into migration waves (10–50 VMs per wave)
- **Dependency ordering**: DNS, NTP, LDAP, and PKI go first
- **Cutover window**: time window for the switch (typically a weekend)
- **Rollback plan**: conditions and procedure for reverting
- **Test plan**: what to test after migration, and how
- **Communication plan**: who is informed, when, and how
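
Dependency ordering and wave sizing can be sketched with a topological sort. This example assumes Python 3.9+ (`graphlib` is in the standard library); the service names and the `plan_waves` helper are hypothetical and only illustrate the idea.

```python
from graphlib import TopologicalSorter


def plan_waves(dependencies: dict[str, set[str]], max_per_wave: int = 50) -> list[list[str]]:
    """Group services into waves so every dependency lands in an earlier wave.

    `dependencies` maps a service to the set of services it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        # Split oversized batches so no wave exceeds the size limit.
        for i in range(0, len(ready), max_per_wave):
            waves.append(ready[i:i + max_per_wave])
        sorter.done(*ready)
    return waves


deps = {
    "dns": set(),
    "ntp": set(),
    "ldap": {"dns", "ntp"},
    "pki": {"ldap"},
    "erp-db": {"dns", "ldap"},
    "erp-app": {"erp-db", "pki"},
}
waves = plan_waves(deps)
```

With this input the core services (`dns`, `ntp`) form the first wave and `erp-app` the last, which mirrors the "core first" rule in the list above.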
### 3. New DC preparation
- **Infrastructure** — DNS, NTP, DHCP, LDAP/AD, PKI, monitoring (see [DATACENTERS.en.md](DATACENTERS.en.md) — deployment order)
- **Network** — BGP peering, VXLAN/VLAN, firewall rules, load balancers
- **Storage** — SAN zoning, NAS exports, Ceph cluster
- **Virtualization** — vCenter, Hyper-V cluster, Proxmox
### 4. Replication and synchronization
| Layer | Method | Tools |
|--------|--------|----------|
| **Storage (block)** | SAN sync/async mirror, LUN replication | NetApp SnapMirror, Dell EMC RecoverPoint, Pure ActiveCluster |
| **Storage (file)** | DFS-R, rsync, robocopy | Windows DFS, Rsync |
| **Storage (object)** | Cross-region replication | MinIO replication, S3 CRR |
| **Databases** | Log shipping, CDC, streaming replication | PostgreSQL Patroni, Oracle Data Guard, MSSQL AlwaysOn, MySQL Group Replication |
| **VM** | Storage vMotion, replication | VMware vSphere Replication, Hyper-V Replica, Zerto |
| **Kubernetes** | Velero + Restic, Rook Ceph mirror | Velero, Rook |
### 5. Workload migration
#### Wave migration (recommended for medium/large DCs)
```mermaid
gantt
title Wave migration
dateFormat YYYY-MM-DD
section Wave 1 - Core
DNS, NTP, LDAP :done, w1a, 2026-07-01, 3d
Monitoring + logging :done, w1b, after w1a, 2d
section Wave 2 - Network
Load balancers :active, w2a, 2026-07-06, 2d
Firewalls :active, w2b, 2026-07-08, 2d
section Wave 3 - Storage
NAS migration :w3a, 2026-07-10, 5d
SAN replication :w3b, 2026-07-10, 3d
section Wave 4 - Dev/Test
Dev VMs :w4a, 2026-07-15, 5d
section Wave 5 - Prod tier 3
Internal apps :w5a, 2026-07-22, 5d
section Wave 6 - Prod tier 2
Business apps :w6a, 2026-07-29, 5d
section Wave 7 - Prod tier 1
Critical apps :w7a, 2026-08-05, 5d
```
#### Typical single wave procedure:
1. **Day -7**: Start data replication (initial seed)
2. **Day -1**: Incremental sync, final test
3. **Day 0 (cutover)**:
   - Stop application in source DC
   - Final sync (last delta)
   - Start application in target DC
   - DNS/Traffic switch
   - Smoke test
4. **Day +1**: Monitoring (performance, errors, lag)
5. **Day +7**: Rollback window end (success confirmation)
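
The Day 0 sequence is strictly ordered, and a failed step should stop the run. A minimal sketch of that control flow follows; the step names come from the list above, while `run_cutover` and the placeholder lambdas are assumptions (real actions would call whatever orchestration tooling is in use).

```python
from typing import Callable

# A step is a name plus a callable that returns True on success.
Step = tuple[str, Callable[[], bool]]


def run_cutover(steps: list[Step]) -> tuple[bool, list[str]]:
    """Run steps in order and stop at the first failure.

    Returns (success, names of completed steps). On failure the completed
    list tells the rollback procedure what has to be undone.
    """
    completed: list[str] = []
    for name, action in steps:
        if not action():
            return False, completed
        completed.append(name)
    return True, completed


steps = [
    ("stop application in source DC", lambda: True),
    ("final sync", lambda: True),
    ("start application in target DC", lambda: True),
    ("switch DNS/traffic", lambda: True),
    ("smoke test", lambda: False),  # simulate a failing smoke test
]
ok, done = run_cutover(steps)
```

Here `ok` is `False` and `done` holds the four steps that ran before the smoke test failed.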
### 6. Network strategies
#### IP re-addressing
| Approach | Description | Pros | Cons |
|---------|-------|--------|----------|
| **Keep IP** | Same IPs, BGP anycast or stretch VLAN | No application config changes | Stretched VLAN/L2 limitations |
| **Change IP** | New IP range, DNS/BGP routing change | Clean architecture | Config changes, DNS TTL |
| **NAT translation** | NAT between old and new IP space | No application changes | Latency, troubleshooting complexity |
**Keep IP** is only possible with:
- L2 stretch between DCs (VXLAN, OTV) — distance limited
- BGP anycast for VIPs (load balancers)
- Applications tolerant to ARP cache changes
#### DNS cutover
```
1. Lower TTL to 60–300 s (one week ahead)
2. At cutover, change A/AAAA records to new IPs
3. Wait for propagation (per TTL)
4. Monitor traffic
```
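
The timing behind these steps can be worked out from the TTL values. The sketch below is an illustration only: `dns_cutover_schedule` and its `safety_factor` parameter are assumptions, and it computes the minimum lead time, which is usually much shorter than the one-week margin recommended above.

```python
from datetime import datetime, timedelta


def dns_cutover_schedule(cutover, old_ttl_s, new_ttl_s=300, safety_factor=2.0):
    """Return the two timestamps that bracket a TTL-based DNS cutover."""
    return {
        # Latest moment to publish the short TTL: records cached just before
        # this point still expire (with margin) before the cutover.
        "lower_ttl_by": cutover - timedelta(seconds=old_ttl_s * safety_factor),
        # Moment after which resolvers that honor the TTL have dropped
        # the old A/AAAA records.
        "old_records_expired": cutover + timedelta(seconds=new_ttl_s * safety_factor),
    }


plan = dns_cutover_schedule(datetime(2026, 7, 4, 22, 0), old_ttl_s=86400)
print(plan["lower_ttl_by"])         # 2026-07-02 22:00:00
print(plan["old_records_expired"])  # 2026-07-04 22:10:00
```

Resolvers that ignore TTLs are not covered by this arithmetic, which is one reason to keep monitoring traffic to the old IPs after the cutover.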
#### Traffic steering
| Technique | Use case |
|----------|----------|
| **BGP** | Change AS path / local pref for traffic steering |
| **DNS** | Lower TTL, change A records |
| **Load balancer** | Change pool members, health check |
| **GSLB** | Global Server Load Balancing (F5 GTM, NSX ALB) |
| **Cloud DNS** | AWS Route53, Azure Traffic Manager, Google Cloud DNS |
### 7. Database migration
See individual DB files for details. Summary table:
| DB | Method | RPO | RTO | Note |
|----|--------|-----|-----|----------|
| **PostgreSQL** | Streaming replication + Patroni switchover | 0 (sync) / ~MB (async) | min | Patroni auto-failover |
| **MySQL** | Group Replication / async replication | 0 (sync) / seconds | min | InnoDB Cluster |
| **Oracle** | Data Guard switchover | 0 (sync) | min | Far sync for remote DCs |
| **MSSQL** | AlwaysOn AG failover | 0 (sync) | min | Cloud witness |
| **MongoDB** | Replica set election | seconds | < 1 min | Priority-based failover |
| **Cassandra** | Multi-DC replication | eventual | 0 | Native multi-master |
### 8. Testing
| Phase | What to test | Method |
|------|-------------|--------|
| **Pre-migration** | Application in new DC (isolated) | Dry run on replicated data |
| **Cutover** | Functionality, availability, latency | Smoke test, synthetic transactions |
| **Post-migration** | Performance, integration, monitoring | A/B comparison with baseline, canary traffic |
| **Rollback** | Return to old DC | Tested rollback plan |
### 9. Rollback plan
Each wave must have a defined rollback:
| Condition | Action |
|----------|------|
| Application fails to start in new DC | DNS switch back, stop replication |
| Performance more than 20 % worse than baseline | Rollback, root cause analysis |
| Integration failure (API timeout, DB connection) | Rollback, dependency check |
| Security incident | Rollback, forensic analysis |
Rollback must be tested **before** the real cutover.
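
The conditions in the table can be encoded as an explicit check that runs after each wave. This is a sketch under assumptions: the `WaveHealth` fields and the use of latency as the performance metric are hypothetical, while the 20 % threshold comes from the table.

```python
from dataclasses import dataclass


@dataclass
class WaveHealth:
    app_started: bool
    baseline_latency_ms: float
    current_latency_ms: float
    integration_errors: int
    security_incident: bool


def rollback_reasons(h: WaveHealth, max_degradation: float = 0.20) -> list[str]:
    """Return every rollback condition that is currently met (empty = keep going)."""
    reasons = []
    if not h.app_started:
        reasons.append("application failed to start")
    degradation = (h.current_latency_ms - h.baseline_latency_ms) / h.baseline_latency_ms
    if degradation > max_degradation:
        reasons.append(f"performance {degradation:.0%} worse than baseline")
    if h.integration_errors > 0:
        reasons.append("integration failure")
    if h.security_incident:
        reasons.append("security incident")
    return reasons
```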
---
## Special cases
### Mainframe migration
- **IBM z/OS** — GDPS (Geographically Dispersed Parallel Sysplex)
- HyperSwap for storage mirroring
- Cross-system coupling facility (XCF)
- Often the last migrated component
### COTS applications (Oracle EBS, SAP)
- Require vendor-specific migration procedures
- Oracle EBS: Autoconfig, cloning (ADXLC)
- SAP: System Copy (Homogeneous / Heterogeneous), SWPM, SUM
- Licenses may need to be re-issued when the hardware changes
### Cloud migration (On-prem → Cloud)
See [CLOUD.en.md](CLOUD.en.md) — migration strategies (6 Rs):
| Strategy | Description |
|-----------|-------|
| **Re-host (Lift & Shift)** | VM → Cloud VM (AWS MGN, Azure Migrate) |
| **Re-platform** | OS upgrade, managed DB (RDS, Cloud SQL) |
| **Re-architect** | Application rewritten as cloud-native |
| **Retire** | Decommission unnecessary applications |
| **Retain** | Application stays on-prem (review later) |
| **Repurchase** | SaaS replacement |
---
## Recommended approach per DC size
| DC Size | VM Count | Recommended strategy | Duration | Team |
|-------------|----------|---------------------|-------------|-----|
| **Small** | < 50 | Big Bang (weekend) | 2–4 days | 3–5 people |
| **Medium** | 50–500 | Phased (5–10 waves) | 2–8 weeks | 5–10 people |
| **Large** | 500–5000 | Phased + Rolling | 3–12 months | 10–30 people |
| **Enterprise** | 5000+ | Parallel Run / Rolling | 12–36 months | 30+ people |
---
## Related
- [DATACENTERS.en.md](DATACENTERS.en.md) — DC topologies, secondary DC, deployment order
- [CLOUD.en.md](CLOUD.en.md) — cloud migration strategies (6 Rs)
- [DR.en.md](DR.en.md) — disaster recovery, RTO/RPO
- [NETWORKING.en.md](NETWORKING.en.md) — BGP, DNS, VXLAN, traffic steering
- [STORAGE.en.md](STORAGE.en.md) — storage replication
## Sources
Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
*Last revision: 2026-06-12*

DC-MIGRATION.md Normal file
@@ -0,0 +1,246 @@
# 🏗️ Data Center Migration
## Migration strategies
| Strategy | RTO | RPO | Risk | Cost | Duration | Description |
|-----------|-----|-----|--------|---------|-------------|-------|
| **Cold / Big Bang** | hours–days | days | High | Low | days | Shut everything down at once, move, power up |
| **Phased / Wave** | minutes (per wave) | minutes | Medium | Medium | weeks–months | Workloads moved in waves |
| **Rolling** | 0 (live) | 0 | Low | High | months | Live migration per VM/service |
| **Parallel Run** | 0 | 0 | Low | Very high | months | Both DCs in operation, gradual transition |
| **Pilot Light** | hours | minutes | Medium | Low | weeks | Critical services in the new DC, the rest follows |
| **Lift & Shift** | hours | minutes | Medium | Low | weeks | VMs/servers moved without configuration changes |
| **Re-platform** | hours | minutes | Low | Medium | months | Optimization during migration (OS upgrade, resize) |
| **Re-architect** | 0 | 0 | Low | High | months–years | Application redesigned for the new platform |
---
## Decision tree
```mermaid
flowchart TD
Start(["DC Migration"]) --> APP{"Application\nstateful?"}
APP -->|"Yes"| DOWNTIME{"Tolerates\ndowntime?"}
APP -->|"No"| ROLLING["Rolling / Parallel Run"]
DOWNTIME -->|"Yes, hours+"| COLD["Cold / Big Bang\nSimplest, cheapest\nRisk: all at once"]
DOWNTIME -->|"Yes, minutes"| PHASED["Phased / Wave\nBy application / business unit"]
DOWNTIME -->|"No (zero downtime)"| SYNC{"Sync replication\npossible?"}
SYNC -->|"Yes, < 100 km"| ROLLING
SYNC -->|"No"| PARALLEL["Parallel Run\nBoth DCs active, gradual cutover"]
ROLLING --> ROLL_HA{"VMware,\nHyper-V?"}
ROLL_HA -->|"Yes"| VMOTION["vMotion / Storage vMotion (VMware)\nLive Migration (Hyper-V), near-zero downtime"]
ROLL_HA -->|"No"| ROLL_REPL["Storage + DB replication\nGradual workload migration"]
```
---
## Migration phases
### 1. Discovery and assessment
| Task | Tools | Output |
|------|----------|--------|
| HW and SW inventory | RVTools, NetBox, CMDB | List of all servers, VMs, and services |
| Dependency mapping | ServiceNow, AppDynamics, manual | Application dependency graph |
| Traffic analysis | NetFlow, sFlow, vRNI | Bandwidth, latency, peak usage |
| Performance baseline | Prometheus, Zabbix, vRealize | CPU/RAM/disk/network per workload |
| License audit | Flexera, SAM | Licenses, support, compliance |

**Output:** a list of workloads with RTO/RPO, dependencies, and criticality. Without it, the migration cannot be planned.
### 2. Planning
- **Wave plan**: division of workloads into migration waves (10–50 VMs per wave)
- **Dependency ordering**: DNS, NTP, LDAP, and PKI must come first
- **Cutover window**: time window for the switch (typically a weekend)
- **Rollback plan**: conditions and procedure for reverting
- **Test plan**: what to test after migration, and how
- **Communication plan**: who is informed, when, and how
### 3. New DC preparation
- **Infrastructure**: DNS, NTP, DHCP, LDAP/AD, PKI, monitoring (see [DATACENTERS.md](DATACENTERS.md), deployment order)
- **Network**: BGP peering, VXLAN/VLAN, firewall rules, load balancers
- **Storage**: SAN zoning, NAS exports, Ceph cluster
- **Virtualization**: vCenter, Hyper-V cluster, Proxmox
### 4. Replication and synchronization
| Layer | Method | Tools |
|--------|--------|----------|
| **Storage (block)** | SAN sync/async mirror, LUN replication | NetApp SnapMirror, Dell EMC RecoverPoint, Pure ActiveCluster |
| **Storage (file)** | DFS-R, rsync, robocopy | Windows DFS, Rsync |
| **Storage (object)** | Cross-region replication | MinIO replication, S3 CRR |
| **Databases** | Log shipping, CDC, streaming replication | PostgreSQL Patroni, Oracle Data Guard, MSSQL AlwaysOn, MySQL Group Replication |
| **VM** | Storage vMotion, replication | VMware vSphere Replication, Hyper-V Replica, Zerto |
| **Kubernetes** | Velero + Restic, Rook Ceph mirror | Velero, Rook |
### 5. Workload migration
#### Wave migration (recommended for medium and large DCs)
```mermaid
gantt
title Wave migration
dateFormat YYYY-MM-DD
section Wave 1 - Core
DNS, NTP, LDAP :done, w1a, 2026-07-01, 3d
Monitoring + logging :done, w1b, after w1a, 2d
section Wave 2 - Network
Load balancers :active, w2a, 2026-07-06, 2d
Firewalls :active, w2b, 2026-07-08, 2d
section Wave 3 - Storage
NAS migration :w3a, 2026-07-10, 5d
SAN replication :w3b, 2026-07-10, 3d
section Wave 4 - Dev/Test
Dev VMs :w4a, 2026-07-15, 5d
section Wave 5 - Prod tier 3
Internal apps :w5a, 2026-07-22, 5d
section Wave 6 - Prod tier 2
Business apps :w6a, 2026-07-29, 5d
section Wave 7 - Prod tier 1
Critical apps :w7a, 2026-08-05, 5d
```
#### Typical procedure for a single wave:
1. **Day -7**: Start data replication (initial seed)
2. **Day -1**: Incremental sync, final test
3. **Day 0 (cutover)**:
   - Stop the application in the source DC
   - Final sync (last delta)
   - Start the application in the target DC
   - DNS/traffic switch
   - Smoke test
4. **Day +1**: Monitoring (performance, errors, lag)
5. **Day +7**: End of rollback window (success confirmed)
### 6. Network strategies
#### IP re-addressing
| Approach | Description | Pros | Cons |
|---------|-------|--------|----------|
| **Keep IP** | Same IPs, BGP anycast or stretched VLAN | No application configuration changes | Stretched VLAN/L2 limitations |
| **Change IP** | New IP range, DNS/BGP routing change | Clean architecture | Configuration changes, DNS TTL |
| **NAT translation** | NAT between the old and new IP space | No application changes | Latency, troubleshooting complexity |

**Keep IP** is only possible with:
- L2 stretch between DCs (VXLAN, OTV), limited by distance
- BGP anycast for VIPs (load balancers)
- Applications that tolerate ARP cache changes
#### DNS cutover
```
1. Lower TTL to 60–300 s (one week ahead)
2. At cutover, change A/AAAA records to the new IPs
3. Wait for propagation (per TTL)
4. Monitor traffic
```
#### Traffic steering
| Technique | Use case |
|----------|----------|
| **BGP** | Change AS path / local pref to redirect traffic |
| **DNS** | Lower TTL, change A records |
| **Load balancer** | Změna pool members, health check |
| **GSLB** | Global Server Load Balancing (F5 GTM, NSX ALB) |
| **Cloud DNS** | AWS Route53, Azure Traffic Manager, Google Cloud DNS |
### 7. Database migration
See the individual DB files for details. Summary table:
| DB | Method | RPO | RTO | Note |
|----|--------|-----|-----|----------|
| **PostgreSQL** | Streaming replication + Patroni switchover | 0 (sync) / ~MB (async) | min | Patroni auto-failover |
| **MySQL** | Group Replication / async replication | 0 (sync) / seconds | min | InnoDB Cluster |
| **Oracle** | Data Guard switchover | 0 (sync) | min | Far sync for remote DCs |
| **MSSQL** | AlwaysOn AG failover | 0 (sync) | min | Cloud witness |
| **MongoDB** | Replica set election | seconds | < 1 min | Priority-based failover |
| **Cassandra** | Multi-DC replication | eventual | 0 | Native multi-master |
### 8. Testing
| Phase | What to test | Method |
|------|-------------|--------|
| **Pre-migration** | Application in the new DC (isolated) | Dry run on replicated data |
| **Cutover** | Functionality, availability, latency | Smoke test, synthetic transactions |
| **Post-migration** | Performance, integration, monitoring | A/B comparison with baseline, canary traffic |
| **Rollback** | Return to the old DC | Tested rollback plan |
### 9. Rollback plan
Each wave must have a defined rollback:
| Condition | Action |
|----------|------|
| Application fails to start in the new DC | Switch DNS back, stop replication |
| Performance more than 20 % worse than baseline | Rollback, root cause analysis |
| Integration failure (API timeout, DB connection) | Rollback, dependency check |
| Security incident | Rollback, forensic analysis |

Rollback should be tested **before** the real cutover.
---
## Special cases
### Mainframe migration
- **IBM z/OS**: GDPS (Geographically Dispersed Parallel Sysplex)
- HyperSwap for storage mirroring
- Cross-system coupling facility (XCF)
- Often the last component to be migrated
### COTS applications (Oracle EBS, SAP)
- Require vendor-specific migration procedures
- Oracle EBS: Autoconfig, cloning (ADXLC)
- SAP: System Copy (Homogeneous / Heterogeneous), SWPM, SUM
- Licenses may need to be re-issued when the hardware changes
### Cloud migration (On-prem → Cloud)
See [CLOUD.md](CLOUD.md) for migration strategies (6 Rs):
| Strategy | Description |
|-----------|-------|
| **Re-host (Lift & Shift)** | VM → Cloud VM (AWS MGN, Azure Migrate) |
| **Re-platform** | OS upgrade, managed DB (RDS, Cloud SQL) |
| **Re-architect** | Application rewritten as cloud-native |
| **Retire** | Decommission unnecessary applications |
| **Retain** | Application stays on-prem (review later) |
| **Repurchase** | SaaS replacement |
---
## Recommended approach per DC size
| DC size | VM count | Recommended strategy | Duration | Team |
|-------------|----------|---------------------|-------------|-----|
| **Small** | < 50 | Big Bang (weekend) | 2–4 days | 3–5 people |
| **Medium** | 50–500 | Phased (5–10 waves) | 2–8 weeks | 5–10 people |
| **Large** | 500–5000 | Phased + Rolling | 3–12 months | 10–30 people |
| **Enterprise** | 5000+ | Parallel Run / Rolling | 12–36 months | 30+ people |
---
## Related
- [DATACENTERS.md](DATACENTERS.md): DC topologies, secondary DC, deployment order
- [CLOUD.md](CLOUD.md): cloud migration strategies (6 Rs)
- [DR.md](DR.md): disaster recovery, RTO/RPO
- [NETWORKING.md](NETWORKING.md): BGP, DNS, VXLAN, traffic steering
- [STORAGE.md](STORAGE.md): storage replication
## Sources
Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
*Last revision: 2026-06-12*

DR.en.md Normal file
@@ -0,0 +1,336 @@
# 🔄 Disaster Recovery and Business Continuity
## Terminology
| Abbreviation | Meaning | Description |
|---------|--------|-------|
| **RTO** | Recovery Time Objective | Maximum time from outage to service recovery |
| **RPO** | Recovery Point Objective | Maximum acceptable data loss (time since last backup) |
| **MTD** | Maximum Tolerable Downtime | Total outage duration an organization can survive |
| **WRT** | Work Recovery Time | Time needed for full operations recovery after IT restoration |
| **MTBF** | Mean Time Between Failures | Average operating time between two consecutive failures |
| **MTTR** | Mean Time To Repair | Average time needed to return a failed component to service |
| **SLA** | Service Level Agreement | Contractual availability commitment |
| **SLO** | Service Level Objective | Internal availability target |
| **SLI** | Service Level Indicator | Measured availability value |
### Relationship between RTO, RPO, MTD, WRT
```
Outage ──── RPO ────► Data restored ──── RTO ────► Service running ──── WRT ────► Full operations
             │                            │                              │
             ▼                            ▼                              ▼
         Lost data              Time without service           Time to full capacity

MTD = RTO + WRT (max. time the business tolerates)
```
---
## Uptime calculation
### Nines table
| Level | Uptime | Downtime / year | Downtime / month | Downtime / week |
|--------|--------|---------------|------------------|------------------|
| 90 % (one nine) | 0.9 | 36.5 days | 72 h | 16.8 h |
| 99 % (two nines) | 0.99 | 3.65 days | 7.2 h | 1.68 h |
| 99.5 % | 0.995 | 1.83 days | 3.6 h | 50.4 min |
| 99.9 % (three nines) | 0.999 | 8.76 h | 43.2 min | 10.1 min |
| 99.95 % | 0.9995 | 4.38 h | 21.6 min | 5.04 min |
| 99.99 % (four nines) | 0.9999 | 52.6 min | 4.32 min | 1.01 min |
| 99.995 % | 0.99995 | 26.3 min | 2.16 min | 30.2 s |
| 99.999 % (five nines) | 0.99999 | 5.26 min | 25.9 s | 6.05 s |
| 99.9999 % (six nines) | 0.999999 | 31.5 s | 2.59 s | 0.605 s |
### Calculation
```
Availability = (Total time - Downtime) / Total time × 100 %

Example:
  Year = 365 × 24 × 60 = 525,600 minutes
  Target: 99.9 % → allowed downtime = 525,600 × (1 - 0.999) = 525.6 minutes = 8.76 h

Combined availability (chain of dependencies):
  A_web = 99.9 % (3 nines)
  A_api = 99.99 % (4 nines)
  A_db  = 99.999 % (5 nines)
  A_total = 0.999 × 0.9999 × 0.99999 = 0.99889 ≈ 99.89 % (less than 3 nines!)

Parallel availability (redundancy):
  A_total = 1 - (1 - A_1) × (1 - A_2) × ... × (1 - A_n)
  Example: 2 servers with 99% availability
  A_total = 1 - (1-0.99) × (1-0.99) = 1 - 0.01 × 0.01 = 0.9999 (99.99 %)
```
### Calculator
```python
def uptime_percent_to_downtime(pct, period_days=365):
    """Convert an uptime percentage to allowed downtime (minutes) in the period."""
    total_minutes = period_days * 24 * 60
    allowed_downtime = total_minutes * (1 - pct / 100)
    return allowed_downtime  # minutes


def downtime_to_uptime_percent(downtime_minutes, period_days=365):
    """Convert downtime in minutes to an uptime percentage."""
    total_minutes = period_days * 24 * 60
    return (1 - downtime_minutes / total_minutes) * 100


def combined_availability(availabilities):
    """Combined availability of series-connected components (fractions 0..1)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant_availability(availabilities):
    """Availability of redundant (parallel) components (fractions 0..1)."""
    result = 1.0
    for a in availabilities:
        result *= (1 - a)  # probability that every component is down at once
    return 1 - result
```
### Calculation fallacies
- **Combined availability is not a sum** — adding another dependency always reduces total availability
- **Redundancy is not free** — adding a standby component requires failure detection + failover (MTTR does not improve automatically)
- **SLA is not a guarantee** — providers often calculate SLA as a monthly average, not per-incident
- **Measurement is key** — without SLI, SLO cannot be verified; "unmeasured availability does not exist"
- **Planned maintenance** — sometimes counted as uptime, sometimes not (depends on SLA definition)
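
The "redundancy is not free" point can be made concrete with a simplified model in which the standby only helps when detection and failover actually succeed. This is an illustrative sketch, not a formula from the original text: `pair_availability` and its `failover_success` parameter are assumptions.

```python
def pair_availability(a_primary, a_standby, failover_success=1.0):
    """Availability of a primary/standby pair (all inputs are fractions 0..1).

    The standby contributes only during the share of primary outages in
    which detection and failover work (failover_success).
    """
    return a_primary + (1 - a_primary) * failover_success * a_standby


ideal = pair_availability(0.99, 0.99)           # same as 1 - 0.01 * 0.01
realistic = pair_availability(0.99, 0.99, 0.9)  # one failover in ten does not work
print(f"{ideal:.4%} vs {realistic:.4%}")
```

With perfect failover the model reduces to the parallel formula (about 99.99 %); with a 90 % failover success rate the same pair reaches only about 99.89 %.
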
---
## DR scenarios
### Classification
| Category | Scenario | Typical RTO | Typical RPO | Frequency |
|-----------|--------|-------------|-------------|-----------|
| **Site** | Entire DC / region outage | hours | minutes | Low |
| **Infrastructure** | HW failure (storage, switch, server) | minutes–hours | seconds | Medium |
| **Software** | OS, application, DB failure | minutes | seconds | High |
| **Data** | Data corruption, deletion, cryptolocker | hours | backup point | Low–medium |
| **Human** | Wrong deployment, config change | minutes–hours | seconds | Medium |
| **Security** | Attack, breach, ransomware | days | before attack | Low |
| **Network** | Connectivity outage, DDoS | minutes–hours | N/A | Medium |
| **Cloud provider** | Regional outage (AWS, Azure, GCP) | hours | minutes | Very low |
### Scenario details
#### Site / Region failure
| Aspect | Description |
|--------|-------|
| **Cause** | Blackout, fire, flood, earthquake, cloud provider outage |
| **Prevention** | Multi-AZ architecture, multi-region deployment, active-active |
| **Mitigation** | Automatic DNS failover (Route53, Azure Traffic Manager), replica in DR region |
| **Testing** | Game day: shut down primary region, verify automatic failover |
#### Data corruption / human error
| Aspect | Description |
|--------|-------|
| **Cause** | Wrong SQL command (DELETE without WHERE), accidentally deleted bucket, bad migration |
| **Prevention** | RBAC, MFA for destructive operations, change management, SQL peer review |
| **Mitigation** | Point-in-time recovery (PITR), transaction log replay, immutable backups |
| **Testing** | Restore backup to isolated environment, verify data integrity |
#### Ransomware / cyber attack
| Aspect | Description |
|--------|-------|
| **Cause** | Attack on production systems, data encryption, exfiltration |
| **Prevention** | Immutable backups (object lock), air-gapped backups, network segmentation |
| **Mitigation** | Restore from clean backup, rebuild infrastructure from IaC |
| **Testing** | Regular restore in isolated network, verify backup is not infected |
---
## Prevention — strategies
### Backup strategies
| Approach | Description | Use case |
|---------|-------|----------|
| **3-2-1 rule** | 3 copies, 2 different media, 1 off-site | Universal |
| **3-2-1-0** | + 0 errors after restore (testing) | Enterprise, compliance |
| **GFS (Grandfather-Father-Son)** | Daily, weekly, monthly rotation | Long-term archive |
| **Incremental forever** | Full backup 1×, then only changes | Large data volumes |
| **Reverse incremental** | Full + incremental, full is always current | Fast recovery |
### Backup methods
| Method | RPO | RTO | Storage | Suitable for |
|--------|-----|-----|----------|------------|
| **Full backup** | Last full | Full restore time | Large | Small data, weekly |
| **Incremental** | Last incremental | Full + all incrementals | Small | Large data, daily |
| **Differential** | Last diff | Full + last diff | Medium | Compromise |
| **Snapshot** | Snapshot point-in-time | seconds | Copy-on-write | VM, storage array |
| **Continuous (CDC)** | < 1 s | Seconds | Log stream | DB (binlog, WAL) |
| **PITR** | Any point in time | Depends on volume | Full + WAL | RDS, PostgreSQL, SQL Server |
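
The RTO column for incremental and differential backups follows from how many backup sets a restore has to read. The sketch below derives that restore chain; `restore_chain` and the day labels are hypothetical, and the handling of schedules that mix incrementals and differentials is a simplification, since the exact semantics depend on the backup tool.

```python
def restore_chain(backups):
    """Return the backup labels needed to restore to the newest backup.

    `backups` is a chronological list of (label, kind) tuples, where kind is
    "full", "incremental", or "differential".
    """
    chain = []
    skip_to_full = False
    for label, kind in reversed(backups):
        if kind == "full":
            chain.append(label)
            break
        if skip_to_full:
            continue  # already covered by a newer differential
        chain.append(label)
        if kind == "differential":
            skip_to_full = True  # a differential holds everything since the full
    else:
        raise ValueError("no full backup in history")
    chain.reverse()
    return chain


incremental_week = [("sun", "full"), ("mon", "incremental"),
                    ("tue", "incremental"), ("wed", "incremental")]
differential_week = [("sun", "full"), ("mon", "differential"),
                     ("tue", "differential"), ("wed", "differential")]
```

For the incremental schedule the chain is all four sets; for the differential schedule it is only `sun` and `wed`, which is why differentials restore faster but take more storage.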
### Backup immutability
Key protection against ransomware:
| Technique | Description |
|----------|-------|
| **Object Lock (WORM)** | Backup cannot be deleted or overwritten for a defined retention period (S3 Object Lock, Azure Blob Immutable) |
| **Air gap** | Backup is physically separated from the production network (offline disk, tape, cloud without VPN) |
| **Isolated backup network** | Backup traffic goes through a dedicated network without access from production VLAN |
| **Out-of-band access** | Backup management console is not accessible from the production network |
---
## DR architectures
### Multi-AZ (Single region)
```
Region ┌────────────────────────────────────┐
       │       AZ-1             AZ-2        │
       │   ┌──────────┐     ┌──────────┐    │
       │   │   App    │     │   App    │    │
       │   └─────┬────┘     └─────┬────┘    │
       │         │                │         │
       │   ┌─────▼────────────────▼─────┐   │
       │   │  Load Balancer (cross-AZ)  │   │
       │   └─────────────┬──────────────┘   │
       │                 │                  │
       │   ┌─────────────▼──────────────┐   │
       │   │  DB Primary (AZ-1)         │   │
       │   │  DB Standby (AZ-2)         │   │
       │   │  Synchronous replication   │   │
       │   └────────────────────────────┘   │
       └────────────────────────────────────┘
```
- RTO: minutes (automatic failover)
- RPO: 0 (sync replication)
- Protection: against AZ failure, not region failure
### Multi-Region
```
  Region A (Primary)                   Region B (DR)
┌─────────────────────┐              ┌─────────────────────┐
│  ┌───────────────┐  │              │  ┌───────────────┐  │
│  │   App + DB    │  │              │  │   App + DB    │  │
│  │    Active     │──┼──Async───────┼─►│    Standby    │  │
│  └───────────────┘  │  replication │  └───────────────┘  │
│         │           │              │         │           │
│  ┌──────▼───────┐   │              │  ┌──────▼───────┐   │
│  │  DNS / GSLB  │   │              │  │  DNS / GSLB  │   │
│  └──────┬───────┘   │              │  └──────┬───────┘   │
└─────────┼───────────┘              └─────────┼───────────┘
          │                                    │
          └──────────── Traffic Manager ───────┘
```
| Variant | RTO | RPO | Cost | Failover |
|----------|-----|-----|---------|----------|
| **Active-Passive** | minutes–hours | seconds | Medium | Manual / auto |
| **Active-Active** | seconds | < 1 s | High | Automatic (DNS) |
| **Pilot Light** | tens of minutes | minutes | Low | Manual scaling |
| **Warm Standby** | minutes | seconds | High | Auto (reduced copy) |
| **Backup & Restore** | hours | 24 h | Low | Manual |
### On-prem → Cloud DR (Hybrid)
```
  On-prem DC                           Cloud (DR)
┌─────────────────────┐              ┌─────────────────────┐
│  ┌───────────────┐  │              │  ┌───────────────┐  │
│  │  Application  │  │              │  │   VM / App    │  │
│  │     + DB      │  │              │  │ + DB replica  │  │
│  └───────┬───────┘  │              │  └───────┬───────┘  │
│          │          │              │          │          │
│  ┌───────▼───────┐  │  site-to-site│  ┌───────▼───────┐  │
│  │ Backup proxy  │──┼────VPN───────┼─►│ Backup store  │  │
│  └───────────────┘  │              │  └───────────────┘  │
│                     │              │                     │
│  ┌───────────────┐  │              │  ┌───────────────┐  │
│  │  Tape / NAS   │  │              │  │ Veeam / Zerto │  │
│  └───────────────┘  │              │  └───────────────┘  │
└─────────────────────┘              └─────────────────────┘
```
- **RTO**: tens of minutes (depends on VM startup)
- **RPO**: minutes–hours (depends on replication tool)
- **Tools**: Veeam, Zerto, Azure Site Recovery, AWS MGN, Commvault
- **Use case**: enterprise with on-prem DC that needs DR without a second DC
---
## DR testing
### Test types
| Type | Description | Frequency | Risk |
|-----|-------|-----------|--------|
| **Tabletop exercise** | Manual scenario walkthrough, no impact on production | Monthly | None |
| **Walkthrough** | Runbook verification, ensure everyone knows what to do | Quarterly | None |
| **Component test** | Test of a single component (e.g., restore one DB) | Monthly | Low |
| **Integrated test** | Test of the entire stack in isolated environment | Quarterly | Low |
| **Full failover test** | Production failover to DR site | Annually | High |
| **Chaos experiment** | Targeted fault injection into production | Continuous | Medium |
### Runbook structure
Each DR scenario should have a runbook:
```yaml
scenario: "Region A failure"
triggers:
  - "CloudWatch alarm: Region A health check 5× timeout"
  - "PagerDuty incident P0"
decision_tree: |
  1. Verify: is Region A really unavailable? (check from 3 different locations)
  2. Decide: is RTO at risk? If < 30 % RTO remaining → failover
  3. Failover: run playbook `dr-failover-region-b`
  4. Verification: smoke tests in Region B
  5. Communication: status page + stakeholders
rollback: |
  1. After Region A recovery → replicate changes from B back to A
  2. Repoint DNS to A
  3. Verify data consistency
  4. Shut down Region B (or keep as hot standby)
contacts:
  primary: "on-call@example.com"
  escalation: "infra-lead@example.com"
  management: "vp-engineering@example.com"
```
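
Steps 1 and 2 of the runbook's decision tree can be expressed as a small check. This is a sketch under assumptions: `should_fail_over` and its parameter names are hypothetical, while the three-location confirmation and the 30 % threshold are taken from the runbook above.

```python
def should_fail_over(outage_minutes, rto_minutes, confirmed_from_locations,
                     min_locations=3, remaining_threshold=0.30):
    """Return True when the outage is confirmed and the RTO budget is nearly spent."""
    if confirmed_from_locations < min_locations:
        return False  # step 1: outage not confirmed from enough vantage points
    remaining = (rto_minutes - outage_minutes) / rto_minutes
    return remaining < remaining_threshold  # step 2: less than 30 % of RTO left


# RTO of 60 minutes, outage running for 45 minutes: 25 % of the budget is left.
decision = should_fail_over(outage_minutes=45, rto_minutes=60, confirmed_from_locations=3)
```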
---
## Best practices
- **Test recovery, not backup** — a backup without tested recovery is not a backup
- **Automate DR** — Terraform / Ansible for DR environment spin-up, DNS failover
- **Document runbooks** — every scenario, contact, decision tree
- **Expect failure** — design for failure, don't expect everything to work
- **Don't underestimate WRT** — service recovery does not mean full operations (data warming, cache, connections)
- **Align RTO/RPO with business** — technical capabilities must match business requirements
- **Monitor SLI** — without data, SLO cannot be verified
- **DR is not just IT** — communication, PR, legal, compliance
---
## Related
- [CLOUD.md](CLOUD.md) — cloud DR strategy, AWS/Azure/GCP specific
- [DATACENTERS.md](DATACENTERS.md) — DC redundancy, Tier classification
- [MONITORING.md](MONITORING.md) — alerting, SLI/SLO/SLA
- [CICD.md](CICD.md) — deployment strategy, rollback
- [STORAGE.md](STORAGE.md) — backup storage, replication
## Sources
Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
*Last revised: 2026-06-11*

DR.md Normal file
@@ -0,0 +1,336 @@
# 🔄 Disaster Recovery and Business Continuity
## Terminology
| Abbreviation | Meaning | Description |
|---------|--------|-------|
| **RTO** | Recovery Time Objective | Maximum time from outage to service recovery |
| **RPO** | Recovery Point Objective | Maximum acceptable data loss (time since last backup) |
| **MTD** | Maximum Tolerable Downtime | Total outage duration an organization can survive |
| **WRT** | Work Recovery Time | Time needed to return to full operations after IT is restored |
| **MTBF** | Mean Time Between Failures | Average operating time between two consecutive failures |
| **MTTR** | Mean Time To Repair | Average time needed to return a failed component to service |
| **SLA** | Service Level Agreement | Contractual availability commitment |
| **SLO** | Service Level Objective | Internal availability target |
| **SLI** | Service Level Indicator | Measured availability value |
### Relationship between RTO, RPO, MTD, WRT
```
Outage ──── RPO ────► Data restored ──── RTO ────► Service running ──── WRT ────► Full operations
             │                            │                              │
             ▼                            ▼                              ▼
         Lost data              Time without service           Time to full capacity

MTD = RTO + WRT (max. time the business tolerates)
```
---
## Uptime calculation
### Nines table
| Level | Uptime | Downtime / year | Downtime / month | Downtime / week |
|--------|--------|---------------|------------------|------------------|
| 90 % (one nine) | 0.9 | 36.5 days | 72 h | 16.8 h |
| 99 % (two nines) | 0.99 | 3.65 days | 7.2 h | 1.68 h |
| 99.5 % | 0.995 | 1.83 days | 3.6 h | 50.4 min |
| 99.9 % (three nines) | 0.999 | 8.76 h | 43.2 min | 10.1 min |
| 99.95 % | 0.9995 | 4.38 h | 21.6 min | 5.04 min |
| 99.99 % (four nines) | 0.9999 | 52.6 min | 4.32 min | 1.01 min |
| 99.995 % | 0.99995 | 26.3 min | 2.16 min | 30.2 s |
| 99.999 % (five nines) | 0.99999 | 5.26 min | 25.9 s | 6.05 s |
| 99.9999 % (six nines) | 0.999999 | 31.5 s | 2.59 s | 0.605 s |
### Výpočet
```
Availability = (Total time - Downtime) / Total time × 100 %
Example:
  Year = 365 × 24 × 60 = 525,600 minutes
  Target: 99.9 % → allowed downtime = 525,600 × (1 - 0.999) = 525.6 minutes = 8.76 h
Composite availability (chain of dependencies):
  A_web   = 99.9 %   (3 nines)
  A_api   = 99.99 %  (4 nines)
  A_db    = 99.999 % (5 nines)
  A_total = 0.999 × 0.9999 × 0.99999 = 0.99889 ≈ 99.89 % (less than 3 nines!)
Parallel availability (redundancy):
  A_total = 1 - (1 - A_1) × (1 - A_2) × ... × (1 - A_n)
  Example: 2 servers with 99 % availability each
  A_total = 1 - (1 - 0.99) × (1 - 0.99) = 1 - 0.01 × 0.01 = 0.9999 (99.99 %)
```
### Calculator
```python
def uptime_percent_to_downtime(pct, period_days=365):
    """Convert an uptime percentage to the allowed downtime (minutes) in the period."""
    total_minutes = period_days * 24 * 60
    allowed_downtime = total_minutes * (1 - pct / 100)
    return allowed_downtime  # minutes


def downtime_to_uptime_percent(downtime_minutes, period_days=365):
    """Convert downtime in minutes to an uptime percentage."""
    total_minutes = period_days * 24 * 60
    return (1 - downtime_minutes / total_minutes) * 100


def combined_availability(availabilities):
    """Composite availability of components in series (inputs are fractions, e.g. 0.999)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant_availability(availabilities):
    """Parallel availability of redundant components (inputs are fractions, e.g. 0.99)."""
    result = 1.0
    for a in availabilities:
        result *= (1 - a)
    return 1 - result
```
### Calculation fallacies
- **Composite availability is not a sum** — every added dependency lowers the overall availability
- **Redundancy is not free** — adding a standby component requires failure detection + failover (MTTR does not improve automatically)
- **An SLA is not a guarantee** — providers often compute the SLA as a monthly average, not per incident
- **Measurement is key** — without an SLI the SLO cannot be verified; "availability that is not measured does not exist"
- **Planned downtime** — sometimes counted as uptime, sometimes not (depends on the SLA definition)
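The point that redundancy depends on failure detection and failover can be made concrete with a simple coverage model. The sketch below is an assumption layered on the parallel formula, not something the source defines: the function name and the `failover_success` parameter (the probability that detection and switchover actually work) are hypothetical.

```python
def standby_pair_availability(a_primary, a_standby, failover_success):
    """Availability of a primary + standby pair.

    The pair is up when the primary is up, or when the primary is down,
    the failover succeeds, and the standby is up.
    """
    return a_primary + (1 - a_primary) * failover_success * a_standby


ideal = standby_pair_availability(0.99, 0.99, failover_success=1.0)
realistic = standby_pair_availability(0.99, 0.99, failover_success=0.9)
print(f"{ideal:.5f}")      # 0.99990
print(f"{realistic:.5f}")  # 0.99891
```

With a perfect failover the result matches the parallel formula (99.99 %). If one failover in ten fails, the pair drops below three nines, which is why the failover path needs its own testing.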
---
## DR scenarios
### Classification
| Category | Scenario | Typical RTO | Typical RPO | Frequency |
|-----------|--------|-------------|-------------|-----------|
| **Site** | Outage of an entire DC / region | hours | minutes | Low |
| **Infrastructure** | HW failure (storage, switch, server) | minutes–hours | seconds | Medium |
| **Software** | OS, application, or DB failure | minutes | seconds | High |
| **Data** | Data corruption, accidental delete, cryptolocker | hours | time of the last backup | Low–medium |
| **Human** | Faulty deployment, config change | minutes–hours | seconds | Medium |
| **Security** | Attack, breach, ransomware | days | last state before the attack | Low |
| **Network** | Connectivity outage, DDoS | minutes–hours | N/A | Medium |
| **Cloud provider** | Regional outage (AWS, Azure, GCP) | hours | minutes | Very low |
### Scenario details
#### Site / Region failure
| Aspect | Description |
|--------|-------|
| **Cause** | Blackout, fire, flood, earthquake, cloud provider outage |
| **Prevention** | Multi-AZ architecture, multi-region deployment, active-active |
| **Mitigation** | Automatic DNS failover (Route 53, Azure Traffic Manager), replica in the DR region |
| **Testing** | Game day: shut down the primary region, verify automatic failover |
#### Data corruption / human error
| Aspect | Description |
|--------|-------|
| **Cause** | Faulty SQL statement (DELETE without WHERE), accidentally deleted bucket, faulty migration |
| **Prevention** | RBAC, MFA for destructive operations, change management, SQL peer review |
| **Mitigation** | Point-in-time recovery (PITR), transaction log replay, immutable backups |
| **Testing** | Restore a backup into an isolated environment, verify data integrity |
#### Ransomware / cyber attack
| Aspect | Description |
|--------|-------|
| **Cause** | Attack on production systems, data encryption, exfiltration |
| **Prevention** | Immutable backups (object lock), air-gapped backups, network segmentation |
| **Mitigation** | Restore from a clean backup, rebuild infrastructure from IaC |
| **Testing** | Regular restore in an isolated network, verify that the backup is not infected |
---
## Prevention — strategies
### Backup strategies
| Approach | Description | Use case |
|---------|-------|----------|
| **3-2-1 rule** | 3 copies, 2 different media, 1 off-site | Universal |
| **3-2-1-0** | + 0 errors after restore (verified by testing) | Enterprise, compliance |
| **GFS (Grandfather-Father-Son)** | Daily, weekly, monthly rotation | Long-term archive |
| **Incremental forever** | One full backup, then changes only | Large data volumes |
| **Reverse incremental** | Full + incrementals, the full is always the most recent | Fast restore |
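As a rough illustration, the 3-2-1 rule can be checked mechanically against a backup inventory. The `BackupCopy` structure and the sample inventory below are hypothetical, and the sketch assumes the production copy counts as one of the three copies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    media: str      # e.g. "disk", "tape", "object-storage"
    offsite: bool   # stored outside the primary site


def satisfies_3_2_1(copies):
    """At least 3 copies, on at least 2 media types, with at least 1 off-site."""
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


inventory = [
    BackupCopy("disk", offsite=False),           # production data
    BackupCopy("disk", offsite=False),           # local backup repository
    BackupCopy("object-storage", offsite=True),  # cloud copy
]
print(satisfies_3_2_1(inventory))      # True
print(satisfies_3_2_1(inventory[:2]))  # False: two copies, one medium, nothing off-site
```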
### Backup methods
| Method | RPO | RTO | Storage | Suitable for |
|--------|-----|-----|----------|------------|
| **Full backup** | Last full | Time to restore the full | Large | Small data, weekly |
| **Incremental** | Last increment | Full + all increments | Small | Large data, daily |
| **Differential** | Last diff | Full + last diff | Medium | Compromise |
| **Snapshot** | Moment of the snapshot | Seconds | Copy-on-write | VM, storage array |
| **Continuous (CDC)** | < 1 s | Seconds | Log stream | DB (binlog, WAL) |
| **PITR** | Any point in time | Depends on volume | Full + WAL | RDS, PostgreSQL, SQL Server |
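The RTO column for incremental and differential backups follows from which backup sets must be restored. A minimal sketch, assuming a schedule that uses either incrementals or differentials after each full (not a mix); the function name and the string labels are hypothetical.

```python
def restore_chain(history):
    """Return the indexes of the backups to restore, oldest first.

    history lists backup kinds from oldest to newest:
    "full", "incr" (changes since the previous backup) or
    "diff" (changes since the last full).
    """
    last_full = len(history) - 1 - history[::-1].index("full")
    after_full = range(last_full + 1, len(history))
    diffs = [i for i in after_full if history[i] == "diff"]
    if diffs:
        # A differential holds every change since the full, so only the newest is needed.
        return [last_full, diffs[-1]]
    # Incrementals must all be replayed in order.
    return [last_full, *[i for i in after_full if history[i] == "incr"]]


print(restore_chain(["full", "incr", "incr", "incr"]))  # [0, 1, 2, 3]
print(restore_chain(["full", "diff", "diff", "diff"]))  # [0, 3]
```

The incremental chain grows with every backup since the last full, while the differential chain always has two members, which is the trade-off between storage and restore time shown in the table.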
### Backup immutability
Key protection against ransomware:
| Technique | Description |
|----------|-------|
| **Object Lock (WORM)** | The backup cannot be deleted or overwritten until the defined retention period expires (S3 Object Lock, Azure Blob immutable storage) |
| **Air gap** | The backup is physically separated from the production network (offline disk, tape, cloud without VPN) |
| **Isolated backup network** | Backup traffic uses a dedicated network with no access from the production VLAN |
| **Out-of-band access** | The backup management console is not reachable from the production network |
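For the Object Lock row, the sketch below builds the retention parameters for an upload. It assumes the parameter names of the boto3 S3 `put_object` call (`ObjectLockMode`, `ObjectLockRetainUntilDate`) and a bucket created with Object Lock enabled; the helper name, bucket, and key are hypothetical, and the code only builds the dictionary, it does not contact AWS.

```python
from datetime import datetime, timedelta, timezone


def object_lock_params(bucket, key, retention_days, now=None):
    """Build put_object keyword arguments for a write-once backup object."""
    now = now or datetime.now(timezone.utc)
    return {
        "Bucket": bucket,
        "Key": key,
        # COMPLIANCE mode: retention cannot be shortened or removed, even by admins
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": now + timedelta(days=retention_days),
    }


params = object_lock_params(
    "backups-example", "db/2026-01-01.dump", retention_days=30,
    now=datetime(2026, 1, 1, tzinfo=timezone.utc),
)
print(params["ObjectLockRetainUntilDate"].date())  # 2026-01-31
```

In real use the dictionary would be passed to an S3 client together with the object body, for example `s3.put_object(Body=data, **params)`.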
---
## DR architectures
### Multi-AZ (Single region)
```
Region ┌────────────────────────────────────┐
       │       AZ-1             AZ-2        │
       │   ┌──────────┐     ┌──────────┐    │
       │   │   App    │     │   App    │    │
       │   └─────┬────┘     └─────┬────┘    │
       │         │                │         │
       │   ┌─────▼────────────────▼─────┐   │
       │   │  Load Balancer (cross-AZ)  │   │
       │   └─────────────┬──────────────┘   │
       │                 │                  │
       │   ┌─────────────▼──────────────┐   │
       │   │  DB Primary (AZ-1)         │   │
       │   │  DB Standby (AZ-2)         │   │
       │   │  Synchronous replication   │   │
       │   └────────────────────────────┘   │
       └────────────────────────────────────┘
```
- RTO: minutes (automatic failover)
- RPO: 0 (sync replication)
- Protection: against an AZ failure, not a region failure
### Multi-Region
```
  Region A (Primary)                      Region B (DR)
┌─────────────────────┐              ┌─────────────────────┐
│  ┌───────────────┐  │              │  ┌───────────────┐  │
│  │   App + DB    │  │              │  │   App + DB    │  │
│  │    Active     │──┼──Async───────┼─►│    Standby    │  │
│  └───────────────┘  │  replication │  └───────────────┘  │
│         │           │              │         │           │
│  ┌──────▼───────┐   │              │  ┌──────▼───────┐   │
│  │  DNS / GSLB  │   │              │  │  DNS / GSLB  │   │
│  └──────┬───────┘   │              │  └──────┬───────┘   │
└─────────┼───────────┘              └─────────┼───────────┘
          │                                    │
          └──────────── Traffic Manager ───────┘
```
| Variant | RTO | RPO | Cost | Failover |
|----------|-----|-----|---------|----------|
| **Active-Passive** | minutes–hours | seconds | Medium | Manual / auto |
| **Active-Active** | seconds | < 1 s | High | Automatic (DNS) |
| **Pilot Light** | tens of minutes | minutes | Low | Manual scale-up |
| **Warm Standby** | minutes | seconds | High | Auto (scaled-down copy) |
| **Backup & Restore** | hours | 24 h | Low | Manual |
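One way to read the table is as a lookup: pick the cheapest variant whose RTO and RPO still fit the business requirement. The sketch below does that under loud assumptions: the numeric bounds are illustrative translations of the table's qualitative ranges (for example "tens of minutes" taken as 60 minutes), and the cost ranks 1 to 3 stand for Low, Medium, High.

```python
# (name, assumed worst-case RTO in minutes, assumed worst-case RPO in minutes, cost rank)
VARIANTS = [
    ("Backup & Restore", 24 * 60, 24 * 60, 1),
    ("Pilot Light", 60, 15, 1),
    ("Active-Passive", 4 * 60, 1, 2),
    ("Warm Standby", 15, 1, 3),
    ("Active-Active", 1, 1 / 60, 3),
]


def cheapest_variant(required_rto_min, required_rpo_min):
    """Return the lowest-cost variant meeting both targets, or None if none does."""
    candidates = [
        v for v in VARIANTS
        if v[1] <= required_rto_min and v[2] <= required_rpo_min
    ]
    if not candidates:
        return None
    # min() keeps the first entry on ties, so list order acts as the tie-breaker
    return min(candidates, key=lambda v: v[3])[0]


print(cheapest_variant(120, 30))  # Pilot Light
print(cheapest_variant(30, 5))    # Warm Standby
```

Real selection also depends on failover automation and operational maturity, so the output is a starting point for discussion rather than a decision.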
### On-prem → Cloud DR (Hybrid)
```
      On-prem DC                           Cloud (DR)
┌─────────────────────┐              ┌─────────────────────┐
│  ┌───────────────┐  │              │  ┌───────────────┐  │
│  │  Application  │  │              │  │   VM / App    │  │
│  │     + DB      │  │              │  │ + DB replica  │  │
│  └───────┬───────┘  │              │  └───────┬───────┘  │
│          │          │              │          │          │
│  ┌───────▼───────┐  │ site-to-site │  ┌───────▼───────┐  │
│  │ Backup proxy  │──┼────VPN───────┼─►│ Backup store  │  │
│  └───────────────┘  │              │  └───────────────┘  │
│                     │              │                     │
│  ┌───────────────┐  │              │  ┌───────────────┐  │
│  │  Tape / NAS   │  │              │  │ Veeam / Zerto │  │
│  └───────────────┘  │              │  └───────────────┘  │
└─────────────────────┘              └─────────────────────┘
```
- **RTO**: tens of minutes (depends on VM startup)
- **RPO**: minutes–hours (depends on the replication tool)
- **Tools**: Veeam, Zerto, Azure Site Recovery, AWS MGN, Commvault
- **Use case**: enterprise with an on-prem DC that needs DR without a second DC
---
## DR testing
### Test types
| Type | Description | Frequency | Risk |
|-----|-------|-----------|--------|
| **Tabletop exercise** | Manual walk-through of a scenario, no impact on production | Monthly | None |
| **Walkthrough** | Runbook verification, check that everyone knows what to do | Quarterly | None |
| **Component test** | Test of a single component (e.g. restore of one DB) | Monthly | Low |
| **Integrated test** | Test of the whole stack in an isolated environment | Quarterly | Low |
| **Full failover test** | Production failover to the DR site | Yearly | High |
| **Chaos experiment** | Deliberate fault injection in production | Continuously | Medium |
### Runbook structure
Every DR scenario should have a runbook:
```yaml
scenario: "Region A failure"
triggers:
  - "CloudWatch alarm: Region A health check 5× timeout"
  - "PagerDuty incident P0"
decision_tree: |
  1. Verify: is Region A really unavailable? (check from 3 different locations)
  2. Decide: is the RTO at risk? If less than 30 % of the RTO remains → failover
  3. Failover: run the playbook `dr-failover-region-b`
  4. Verification: smoke tests in Region B
  5. Communication: status page + stakeholders
rollback: |
  1. After Region A is restored → replicate changes from B back to A
  2. Repoint DNS to A
  3. Verify data consistency
  4. Shut down Region B (or keep it as a hot standby)
contacts:
  primary: "on-call@example.com"
  escalation: "infra-lead@example.com"
  management: "vp-engineering@example.com"
```
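The first two steps of the decision tree can be expressed as small checks. This is a sketch under assumptions: the function names and the probe result format are hypothetical, and the 30 % threshold and the three probe locations are taken from the runbook above.

```python
def region_confirmed_down(probe_results, required_locations=3):
    """probe_results maps a probe location to True if the region answered.

    The region counts as down only when enough independent locations were
    checked and none of them got a response.
    """
    return (
        len(probe_results) >= required_locations
        and not any(probe_results.values())
    )


def should_fail_over(elapsed_min, rto_min, threshold=0.30):
    """Fail over once less than `threshold` of the RTO budget remains."""
    remaining = max(rto_min - elapsed_min, 0)
    return remaining < threshold * rto_min


probes = {"eu-probe": False, "us-probe": False, "ap-probe": False}
print(region_confirmed_down(probes))         # True
print(should_fail_over(45, rto_min=60))      # True: 15 min left, threshold is 18 min
print(should_fail_over(10, rto_min=60))      # False
```

Requiring several probe locations guards against failing over because of a problem on the monitoring side rather than in the region itself.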
---
## Best practices
- **Test the restore, not the backup** — a backup without a tested restore is not a backup
- **Automate DR** — Terraform / Ansible to spin up the DR environment, DNS failover
- **Document runbooks** — every scenario, contact, decision tree
- **Design for failure** — do not assume everything will keep running
- **Don't underestimate WRT** — service recovery does not mean full operations (data warming, cache, connections)
- **Align RTO/RPO with business** — technical capabilities must match business requirements
- **Monitor SLI** — without data, SLO cannot be verified
- **DR is not just IT** — communication, PR, legal, compliance
---
## Related
- [CLOUD.md](CLOUD.md) — cloud DR strategy, AWS/Azure/GCP specific
- [DATACENTERS.md](DATACENTERS.md) — DC redundancy, Tier classification
- [MONITORING.md](MONITORING.md) — alerting, SLI/SLO/SLA
- [CICD.md](CICD.md) — deployment strategy, rollback
- [STORAGE.md](STORAGE.md) — backup storage, replication
## Sources
Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
*Last revised: 2026-06-11*

275
MESSAGING.en.md Normal file

@@ -0,0 +1,275 @@
# 📨 Messaging and streaming platforms
## Platform overview
| Platform | Type | Language | Protocol | Persistence | Use case |
|-----------|-----|-------|----------|-------------|----------|
| **Apache Kafka** | Distributed event store | Java/Scala | Binary (TCP) | Disk (log) | Event streaming, data pipeline, log aggregation |
| **RabbitMQ** | Message broker | Erlang | AMQP 0-9-1, MQTT, STOMP | Disk / RAM | Application messaging, task queue, RPC |
| **Apache Pulsar** | Distributed messaging + streaming | Java | Binary (TCP) + REST | Disk (segmented log) | Streaming + queue in one, multi-tenant |
| **NATS** | Lightweight messaging | Go | NATS protocol (TCP) | Memory / JetStream (disk) | Microservices, IoT, edge, low-latency |
| **AWS SQS** | Managed queue | — | HTTPS | Managed | Decoupling services, serverless |
| **AWS SNS** | Managed pub/sub | — | HTTPS, SQS, Lambda, email | Managed | Push notifications, fanout |
| **Azure Service Bus** | Managed messaging | — | AMQP, HTTPS | Managed | Enterprise messaging, sessions, transactions |
| **Google Pub/Sub** | Managed streaming | — | gRPC, REST | Managed | Event-driven, data pipeline |
| **Red Hat AMQ 7** (Artemis) | Message broker | Java | AMQP, MQTT, STOMP, OpenWire | Disk | Enterprise, JMS, high-availability |
| **Oracle Service Bus (OSB)** | Enterprise ESB | Java | HTTP/S, JMS, SOAP, REST, MQ, FTP, AQ | Managed (WebLogic) | Enterprise integration, SOA, protocol mediation, routing |
---
## Platform details
### Apache Kafka
**Architecture:**
```
Producer ──► Topic ──► Partition ──► Consumer Group
               ├── Partition 0 (Leader)   ──► Broker 1
               ├── Partition 1 (Follower) ──► Broker 2
               └── Partition 2 (Follower) ──► Broker 3
```
| Concept | Description |
|---------|-------|
| **Topic** | Logical message category |
| **Partition** | Append-only log, ordered sequence of messages |
| **Broker** | Server in Kafka cluster |
| **Producer** | Publishes messages to topic |
| **Consumer** | Reads messages from partition (within consumer group) |
| **Consumer Group** | Group of consumers sharing topic reading |
| **Offset** | Position in partition (tracked by consumer) |
| **KRaft** | Raft-based controller quorum (replaces ZooKeeper; production-ready since Kafka 3.3, ZooKeeper mode removed in 4.0) |
**Replication and HA:**
| Parameter | Value |
|----------|---------|
| Replication factor | 2–3 (typically 3 for production) |
| ISR (In-Sync Replicas) | Number of replicas keeping up with leader |
| Min ISR | Minimum ISR for acknowledging writes (acks=all) |
| acks=0 | Fire-and-forget (fastest, possible data loss) |
| acks=1 | Write acknowledged by leader (compromise) |
| acks=all | Write acknowledged by all ISR (safest) |
| Leader failover | Automatic election of new leader from ISR |
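How `acks` and the minimum ISR interact can be shown with a deliberately simplified model. The function names below are hypothetical, and the model ignores leader election and timing: it only captures that with `acks=all` the leader rejects a write when fewer than `min.insync.replicas` replicas are in sync.

```python
def write_accepted(acks, isr_size, min_insync_replicas):
    """Simplified model of whether a produce request succeeds.

    acks is "0", "1" or "all"; isr_size counts in-sync replicas including the leader.
    """
    if acks == "all":
        return isr_size >= min_insync_replicas
    # acks=0 and acks=1 only need a live leader in this model
    return isr_size >= 1


def tolerated_broker_failures(replication_factor, min_insync_replicas):
    """Brokers that can fail while acks=all writes are still accepted."""
    return replication_factor - min_insync_replicas


print(tolerated_broker_failures(3, 2))                         # 1
print(write_accepted("all", isr_size=1, min_insync_replicas=2))  # False
print(write_accepted("1", isr_size=1, min_insync_replicas=2))    # True
```

With replication factor 3 and a minimum ISR of 2, one broker can be lost without blocking safe writes. A second failure makes `acks=all` producers fail, while `acks=1` producers keep writing to a single unreplicated copy.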
**Important configuration:**
```properties
# Production
# The replication factor is set per topic at creation; this is the broker-wide default.
default.replication.factor=3
min.insync.replicas=2
# Retention (comments must be on their own line in .properties files)
# 7 days
log.retention.hours=168
# -1 = unlimited (or set a byte limit)
log.retention.bytes=-1
# 1 GB per segment
log.segment.bytes=1073741824
# Performance
# adjust per need (scale-out)
num.partitions=3
# one of: snappy, gzip, lz4, zstd
compression.type=snappy
```
**Partitioning strategies:**
| Strategy | Key | Advantage | Disadvantage |
|----------|------|--------|----------|
| Round-robin | null | Even distribution | Per-key ordering lost |
| Key-based | user_id, order_id | Same key → same partition | Uneven distribution (hot keys) |
| Custom partitioner | Custom logic | Per use-case optimization | More complex maintenance |
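The first two strategies can be sketched as follows. Note the assumption: Kafka's default partitioner hashes keys with murmur2, while this example uses `zlib.crc32` from the standard library as a stand-in, so the partition numbers differ from what a real client would choose. The function names are hypothetical.

```python
import zlib
from itertools import cycle


def key_based_partition(key: bytes, num_partitions: int) -> int:
    """Same key always maps to the same partition (for a fixed partition count)."""
    return zlib.crc32(key) % num_partitions


def round_robin_partitions(num_partitions: int):
    """Endless iterator of partition numbers for messages without a key."""
    return cycle(range(num_partitions))


first = key_based_partition(b"user-42", 3)
second = key_based_partition(b"user-42", 3)
print(first == second)  # True: per-key ordering is preserved

rr = round_robin_partitions(3)
print([next(rr) for _ in range(5)])  # [0, 1, 2, 0, 1]
```

Because the mapping depends on the partition count, adding partitions later moves existing keys to different partitions, which is worth planning for before relying on per-key ordering.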
### RabbitMQ
**Architecture:**
```
Producer ──► Exchange ──► Binding ──► Queue ──► Consumer
    ┌───────────┼───────────┐
    ▼           ▼           ▼
 Direct       Topic       Fanout
Exchange    Exchange    Exchange
```
| Concept | Description |
|---------|-------|
| **Exchange** | Receives messages from producer, routes to queue |
| **Binding** | Exchange → queue link with routing key |
| **Queue** | FIFO message queue (consumed by consumer) |
| **Virtual Host (vhost)** | Tenant isolation within a single cluster |
| **Publisher Confirm** | Broker acknowledges message receipt |
| **Consumer Ack** | Consumer acknowledges message processing |
**Exchange types:**
| Type | Routing | Use case |
|-----|---------|----------|
| **Direct** | routing_key = binding_key | Task queue, point-to-point |
| **Topic** | routing_key matches the binding pattern (wildcards: `*` = exactly one word, `#` = zero or more words) | Pub/sub with filtering |
| **Fanout** | All bound queues | Broadcast, event notification |
| **Headers** | AMQP headers match | Complex routing (not routing key dependent) |
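Topic exchange matching is easy to get wrong, so here is a small sketch of the wildcard rules: words are separated by dots, `*` matches exactly one word, and `#` matches zero or more words. The function is an illustration of the rules, not RabbitMQ's implementation.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Return True if routing_key matches an AMQP topic binding pattern."""

    def match(p, k):
        if not p:
            return not k
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' may consume zero or more words
            return any(match(rest, k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        return (head == "*" or head == k[0]) and match(rest, k[1:])

    return match(pattern.split("."), routing_key.split("."))


print(topic_matches("orders.*.created", "orders.eu.created"))  # True
print(topic_matches("orders.*", "orders.eu.created"))          # False
print(topic_matches("orders.#", "orders"))                     # True
```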
**Queue types:**
```properties
# Classic Queue (single replica, no HA; classic mirrored queues were removed in RabbitMQ 4.0)
x-queue-type: classic
# Quorum Queue (recommended for production)
x-queue-type: quorum
x-quorum-initial-group-size: 3
x-dead-letter-exchange: dlx
# Stream Queue (for large backlogs)
x-queue-type: stream
x-max-length-bytes: 1073741824
```
**HA and clustering:**
| Mode | Description | Use case |
|-------|-------|----------|
| **Quorum Queues** | Raft-based replication (3–5 nodes), auto failover | Production, HA messaging |
| **Federation** | Async message forwarding between independent RabbitMQ clusters | Multi-region, DR |
| **Shovel** | Point-to-point message forwarding (Federation at queue level) | Migration, specific routing |
| **Warm Standby (DR)** | Secondary cluster, started on failover | Cold DR |
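The 3–5 node recommendation for quorum queues follows from majority arithmetic, which applies to any Raft-based system. A minimal sketch with hypothetical function names:

```python
def quorum_size(nodes: int) -> int:
    """Smallest majority of the cluster."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """Nodes that can fail while a majority is still available."""
    return (nodes - 1) // 2


for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
```

A fourth node adds no fault tolerance over three, which is why odd cluster sizes are the usual choice.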
### Apache Pulsar
**Unique architecture (compute/storage separation):**
```
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Producer   │    │   Consumer   │    │   Consumer   │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
┌──────▼───────────────────▼───────────────────▼──────┐
│                 Broker (stateless)                  │
│     Subscription: Exclusive / Shared / Failover     │
└──────────────────────┬──────────────────────────────┘
┌──────────────────────▼──────────────────────────────┐
│            BookKeeper (stateful storage)            │
│  ├── Bookie 1  ├── Bookie 2  ├── Bookie 3  ├── ...  │
│  └── Ledger (append-only, segmented log)            │
└─────────────────────────────────────────────────────┘
```
| Concept | Description |
|---------|-------|
| **Topic** | Logical category (partitioned or non-partitioned) |
| **Subscription** | Delivery mode (Exclusive, Shared, Failover, Key_Shared) |
| **Ledger** | Storage unit in BookKeeper (append-only) |
| **Bookie** | Storage node (BookKeeper) |
| **Managed Ledger** | Segmented log with cache and retention |
**Advantages over Kafka:**
- Compute/storage separation — independent scaling
- Geo-replication built-in (native)
- Multi-tenant (namespaces, isolation)
- TTL, retry, dead letter topic (built-in)
- At-least-once / effectively-once delivery
### NATS
| Feature | Description |
|---------|-------|
| **Core NATS** | Pub/sub, request-reply, < 1 ms latency |
| **JetStream** | Persistence, exactly-once, key-value store, object store |
| **Leaf nodes** | Hierarchical cluster connection |
| **Super-cluster** | Multi-region clustering (global) |
**Use case:** IoT, edge computing, microservices communication, low-latency messaging.
### Oracle Service Bus (OSB)
Part of Oracle SOA Suite, runs on WebLogic Server. Enterprise service bus for integration in Oracle-heavy environments.
| Concept | Description |
|---------|-------|
| **Proxy Service** | Inbound endpoint (HTTP, JMS, MQ, SOAP, REST) |
| **Business Service** | Target backend service |
| **Pipeline** | Message processing — routing, transformation, validation |
| **Split-Join** | Parallel/sequential orchestration of multiple services |
| **Reporting** | Message tracking, SLA monitoring |
**Key features:**
- **Protocol mediation** — translation between SOAP/REST/JMS/MQ/FTP
- **Message transformation** — XSLT, XQuery, MFL (non-XML)
- **Throttling, SLA, alerting** — built-in
- **Oracle AQ (Advanced Queuing)** — integration with Oracle DB queues
- **XPath, XQuery, XSLT 2.0/3.0** — native support
- **Error handling** — fault policies, error queues, retry
**Use case:** Enterprise SOA, Oracle DB → Kafka bridging, legacy mainframe wrapping, B2B integration.
**Alternatives:** IBM Integration Bus (IIB), MuleSoft Anypoint, WSO2 EI, Apache Camel / ServiceMix.
---
## Platform comparison
### Performance and scaling
| Platform | Max throughput | Latency (P99) | Messages/s (1 broker) | Scaling |
|-----------|--------------|---------------|-------------------------|-----------|
| **Kafka** | > 1 GB/s | 2–10 ms | ~1,000,000 | Partitions (horizontal) |
| **Pulsar** | > 1 GB/s | 5–15 ms | ~1,000,000 | Brokers + Bookies |
| **RabbitMQ** | ~100 MB/s | < 1 ms (RAM) | ~100,000 | Clustering (node) |
| **NATS** | > 10 GB/s | < 0.5 ms | ~10,000,000 | Clustering + Leaf nodes |
| **OSB** | < 1 GB/s | 10–100 ms | ~10,000 | Vertical (WebLogic cluster) |
### Delivery guarantees
| Platform | At most once | At least once | Exactly once | Ordering |
|-----------|-------------|---------------|-------------|----------|
| **Kafka** | Yes | Yes (acks=all + min.insync) | Yes (idempotent + transactional) | Per partition |
| **Pulsar** | Yes | Yes | Yes (dedup + transactional) | Per partition |
| **RabbitMQ** | Yes | Yes (Publisher Confirm + Consumer Ack) | Limited | Per queue |
| **NATS** | Yes | Yes (JetStream) | Limited | Per subject |
| **OSB** | Yes | Yes (XA transactions, exactly-once delivery) | Yes (XA + WS-AT) | Per pipeline |
### When to use what
| Use case | Recommended platform | Reasoning |
|----------|---------------------|------------|
| **Event sourcing / audit log** | Kafka, Pulsar | Append-only log, high throughput, replay |
| **CDC (Change Data Capture)** | Kafka (Kafka Connect + Debezium) | Connector ecosystem |
| **Task queue (job processing)** | RabbitMQ, SQS | Dead letter, retry, priority, scheduling |
| **API messaging / microservices** | NATS, RabbitMQ | Low latency, simplicity |
| **Data pipeline (ETL)** | Kafka (KSQL, Kafka Streams) | Stream processing in platform |
| **IoT / Edge** | NATS, MQTT (RabbitMQ) | Lightweight, leaf nodes |
| **Enterprise SOA / EAI** | OSB, IBM IIB, MuleSoft | Protocol mediation, XA, B2B, legacy wrapping |
| **Multi-tenant cloud** | Pulsar | Native multi-tenant, geo-replication |
| **Serverless / event-driven** | SQS/SNS, Pub/Sub | Managed, auto-scaling |
---
## DR and high availability
See [DATACENTERS.en.md](DATACENTERS.en.md) — section "Impact of individual technologies on DC topology selection" for detailed DR mapping per platform.
### Best practices
- **Don't lose messages in queue** — prefer acknowledgement-based consumption (not auto-ack)
- **Dead letter queue** — every main queue has a DLQ for undeliverable messages
- **Monitor lag** — consumer lag is a key metric (Kafka: `records-lag-max` under `kafka.consumer:type=consumer-fetch-manager-metrics`)
- **Idempotent consumer** — same message may be delivered twice
- **Retry with backoff** — exponential backoff on processing failure
- **Schema registry** — avoid deserialization errors (Avro, Protobuf, JSON Schema)
- **Encryption** — TLS in transit, encryption at rest (Apache Kafka has no built-in at-rest encryption: use disk/volume encryption or client-side payload encryption)
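Two of the practices above, the idempotent consumer and retry with backoff, can be sketched without any broker. The class and function names are hypothetical, and the in-memory set of seen IDs is an assumption for brevity: a real consumer would keep it in durable storage shared by all instances.

```python
def backoff_delays(base=0.5, factor=2.0, cap=30.0, attempts=5):
    """Exponential backoff schedule in seconds, capped at `cap`."""
    return [min(cap, base * factor ** n) for n in range(attempts)]


class IdempotentConsumer:
    """Skips messages whose ID has already been processed successfully."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()

    def handle(self, message_id, payload):
        if message_id in self._seen:
            return False  # duplicate delivery, nothing to do
        self._handler(payload)
        # Mark as seen only after success, so a failed attempt can be retried
        self._seen.add(message_id)
        return True


processed = []
consumer = IdempotentConsumer(processed.append)
consumer.handle("msg-1", "order created")
consumer.handle("msg-1", "order created")  # redelivery is ignored
print(processed)          # ['order created']
print(backoff_delays())   # [0.5, 1.0, 2.0, 4.0, 8.0]
```

Adding random jitter to each delay is a common refinement so that many failing consumers do not retry at the same instant.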
---
## Related
- [DATACENTERS.en.md](DATACENTERS.en.md) — DR topology, per-platform mapping
- [CLOUD.en.md](CLOUD.en.md) — managed messaging (SQS, SNS, Service Bus, Pub/Sub)
## Sources
Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
*Last revision: 2026-06-12*

275
MESSAGING.md Normal file

@@ -0,0 +1,275 @@
# 📨 Messaging and streaming platforms
## Platform overview
| Platform | Type | Language | Protocol | Persistence | Use case |
|-----------|-----|-------|----------|-------------|----------|
| **Apache Kafka** | Distributed event store | Java/Scala | Binary (TCP) | Disk (log) | Event streaming, data pipeline, log aggregation |
| **RabbitMQ** | Message broker | Erlang | AMQP 0-9-1, MQTT, STOMP | Disk / RAM | Application messaging, task queue, RPC |
| **Apache Pulsar** | Distributed messaging + streaming | Java | Binary (TCP) + REST | Disk (segmented log) | Streaming + queue in one, multi-tenant |
| **NATS** | Lightweight messaging | Go | NATS protocol (TCP) | Memory / JetStream (disk) | Microservices, IoT, edge, low-latency |
| **AWS SQS** | Managed queue | — | HTTPS | Managed | Decoupling services, serverless |
| **AWS SNS** | Managed pub/sub | — | HTTPS, SQS, Lambda, email | Managed | Push notifications, fanout |
| **Azure Service Bus** | Managed messaging | — | AMQP, HTTPS | Managed | Enterprise messaging, sessions, transactions |
| **Google Pub/Sub** | Managed streaming | — | gRPC, REST | Managed | Event-driven, data pipeline |
| **Red Hat AMQ 7** (Artemis) | Message broker | Java | AMQP, MQTT, STOMP, OpenWire | Disk | Enterprise, JMS, high-availability |
| **Oracle Service Bus (OSB)** | Enterprise ESB | Java | HTTP/S, JMS, SOAP, REST, MQ, FTP, AQ | Managed (WebLogic) | Enterprise integration, SOA, protocol mediation, routing |
---
## Platform details
### Apache Kafka
**Architecture:**
```
Producer ──► Topic ──► Partition ──► Consumer Group
               ├── Partition 0 (Leader)   ──► Broker 1
               ├── Partition 1 (Follower) ──► Broker 2
               └── Partition 2 (Follower) ──► Broker 3
```
| Concept | Description |
|---------|-------|
| **Topic** | Logical message category |
| **Partition** | Append-only log, ordered sequence of messages |
| **Broker** | Server in a Kafka cluster |
| **Producer** | Publishes messages to a topic |
| **Consumer** | Reads messages from a partition (within a consumer group) |
| **Consumer Group** | Group of consumers sharing the reading of a topic |
| **Offset** | Position in a partition (tracked by the consumer) |
| **KRaft** | Raft-based controller quorum (replaces ZooKeeper; production-ready since Kafka 3.3, ZooKeeper mode removed in 4.0) |
**Replication and HA:**
| Parameter | Value |
|----------|---------|
| Replication factor | 2–3 (typically 3 for production) |
| ISR (In-Sync Replicas) | Number of replicas keeping up with the leader |
| Min ISR | Minimum number of ISR required to acknowledge a write (acks=all) |
| acks=0 | Fire-and-forget (fastest, possible data loss) |
| acks=1 | Write acknowledged by the leader (compromise) |
| acks=all | Write acknowledged by all ISR (safest) |
| Leader failover | Automatic election of a new leader from the ISR |
**Important configuration:**
```properties
# Production
# The replication factor is set per topic at creation; this is the broker-wide default.
default.replication.factor=3
min.insync.replicas=2
# Retention (comments must be on their own line in .properties files)
# 7 days
log.retention.hours=168
# -1 = unlimited (or set a byte limit)
log.retention.bytes=-1
# 1 GB per segment
log.segment.bytes=1073741824
# Performance
# adjust per need (scale-out)
num.partitions=3
# one of: snappy, gzip, lz4, zstd
compression.type=snappy
```
**Partitioning strategies:**
| Strategy | Key | Advantage | Disadvantage |
|----------|------|--------|----------|
| Round-robin | null | Even distribution | Per-key ordering lost |
| Key-based | user_id, order_id | Same key → same partition | Uneven distribution (hot keys) |
| Custom partitioner | Custom logic | Per use-case optimization | More complex maintenance |
### RabbitMQ
**Architecture:**
```
Producer ──► Exchange ──► Binding ──► Queue ──► Consumer
    ┌───────────┼───────────┐
    ▼           ▼           ▼
 Direct       Topic       Fanout
Exchange    Exchange    Exchange
```
| Concept | Description |
|---------|-------|
| **Exchange** | Receives messages from the producer, routes them to queues |
| **Binding** | Exchange → queue link with a routing key |
| **Queue** | FIFO message queue (read by the consumer) |
| **Virtual Host (vhost)** | Tenant isolation within a single cluster |
| **Publisher Confirm** | Broker acknowledges message receipt |
| **Consumer Ack** | Consumer acknowledges message processing |
**Exchange types:**
| Type | Routing | Use case |
|-----|---------|----------|
| **Direct** | routing_key = binding_key | Task queue, point-to-point |
| **Topic** | routing_key matches the binding pattern (wildcards: `*` = exactly one word, `#` = zero or more words) | Pub/sub with filtering |
| **Fanout** | All bound queues | Broadcast, event notification |
| **Headers** | AMQP headers match | Complex routing (not routing key dependent) |
**Queue types:**
```properties
# Classic Queue (single replica, no HA; classic mirrored queues were removed in RabbitMQ 4.0)
x-queue-type: classic
# Quorum Queue (recommended for production)
x-queue-type: quorum
x-quorum-initial-group-size: 3
x-dead-letter-exchange: dlx
# Stream Queue (for large backlogs)
x-queue-type: stream
x-max-length-bytes: 1073741824
```
**HA and clustering:**
| Mode | Description | Use case |
|-------|-------|----------|
| **Quorum Queues** | Raft-based replication (3–5 nodes), auto failover | Production, HA messaging |
| **Federation** | Async message forwarding between independent RabbitMQ clusters | Multi-region, DR |
| **Shovel** | Point-to-point message forwarding (Federation at queue level) | Migration, specific routing |
| **Warm Standby (DR)** | Secondary cluster, started on failover | Cold DR |
### Apache Pulsar
**Unique architecture (compute/storage separation):**
```
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Producer   │    │   Consumer   │    │   Consumer   │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
┌──────▼───────────────────▼───────────────────▼──────┐
│                 Broker (stateless)                  │
│     Subscription: Exclusive / Shared / Failover     │
└──────────────────────┬──────────────────────────────┘
┌──────────────────────▼──────────────────────────────┐
│            BookKeeper (stateful storage)            │
│  ├── Bookie 1  ├── Bookie 2  ├── Bookie 3  ├── ...  │
│  └── Ledger (append-only, segmented log)            │
└─────────────────────────────────────────────────────┘
```
| Concept | Description |
|---------|-------|
| **Topic** | Logical category (partitioned or non-partitioned) |
| **Subscription** | Delivery mode (Exclusive, Shared, Failover, Key_Shared) |
| **Ledger** | Storage unit in BookKeeper (append-only) |
| **Bookie** | Storage node (BookKeeper) |
| **Managed Ledger** | Segmented log with cache and retention |
**Advantages over Kafka:**
- Compute/storage separation — independent scaling
- Geo-replication built-in (native)
- Multi-tenant (namespaces, isolation)
- TTL, retry, dead letter topic (built-in)
- At-least-once / effectively-once delivery
### NATS
| Feature | Description |
|---------|-------|
| **Core NATS** | Pub/sub, request-reply, < 1 ms latency |
| **JetStream** | Persistence, exactly-once, key-value store, object store |
| **Leaf nodes** | Hierarchical cluster connection |
| **Super-cluster** | Multi-region clustering (global) |
**Use case:** IoT, edge computing, microservices communication, low-latency messaging.
### Oracle Service Bus (OSB)
Part of Oracle SOA Suite, runs on WebLogic Server. Enterprise service bus for integration in Oracle-heavy environments.
| Concept | Description |
|---------|-------|
| **Proxy Service** | Inbound endpoint (HTTP, JMS, MQ, SOAP, REST) |
| **Business Service** | Target backend service |
| **Pipeline** | Message processing — routing, transformation, validation |
| **Split-Join** | Parallel/sequential orchestration of multiple services |
| **Reporting** | Message tracking, SLA monitoring |
**Key features:**
- **Protocol mediation** — translation between SOAP/REST/JMS/MQ/FTP
- **Message transformation** — XSLT, XQuery, MFL (non-XML)
- **Throttling, SLA, alerting** — built-in
- **Oracle AQ (Advanced Queuing)** — integration with Oracle DB queues
- **XPath, XQuery, XSLT 2.0/3.0** — native support
- **Error handling** — fault policies, error queues, retry
**Use case:** Enterprise SOA, Oracle DB → Kafka bridging, legacy mainframe wrapping, B2B integration.
**Alternatives:** IBM Integration Bus (IIB), MuleSoft Anypoint, WSO2 EI, Apache Camel / ServiceMix.
---
## Platform comparison
### Performance and scaling
| Platform | Max throughput | Latency (P99) | Messages/s (1 broker) | Scaling |
|-----------|--------------|---------------|-------------------------|-----------|
| **Kafka** | > 1 GB/s | 2–10 ms | ~1,000,000 | Partitions (horizontal) |
| **Pulsar** | > 1 GB/s | 5–15 ms | ~1,000,000 | Brokers + Bookies |
| **RabbitMQ** | ~100 MB/s | < 1 ms (RAM) | ~100,000 | Clustering (node) |
| **NATS** | > 10 GB/s | < 0.5 ms | ~10,000,000 | Clustering + Leaf nodes |
| **OSB** | < 1 GB/s | 10–100 ms | ~10,000 | Vertical (WebLogic cluster) |
### Delivery guarantees
| Platform | At most once | At least once | Exactly once | Ordering |
|-----------|-------------|---------------|-------------|----------|
| **Kafka** | Yes | Yes (acks=all + min.insync) | Yes (idempotent + transactional) | Per partition |
| **Pulsar** | Yes | Yes | Yes (dedup + transactional) | Per partition |
| **RabbitMQ** | Yes | Yes (Publisher Confirm + Consumer Ack) | Limited | Per queue |
| **NATS** | Yes | Yes (JetStream) | Limited | Per subject |
| **OSB** | Yes | Yes (XA transactions, exactly-once delivery) | Yes (XA + WS-AT) | Per pipeline |
### When to use what
| Use case | Recommended platform | Rationale |
|----------|---------------------|------------|
| **Event sourcing / audit log** | Kafka, Pulsar | Append-only log, high throughput, replay |
| **CDC (Change Data Capture)** | Kafka (Kafka Connect + Debezium) | Connector ecosystem |
| **Task queue (job processing)** | RabbitMQ, SQS | Dead letter, retry, priority, scheduling |
| **API messaging / microservices** | NATS, RabbitMQ | Low latency, simplicity |
| **Data pipeline (ETL)** | Kafka (KSQL, Kafka Streams) | Stream processing within the platform |
| **IoT / Edge** | NATS, MQTT (RabbitMQ) | Lightweight, leaf nodes |
| **Enterprise SOA / EAI** | OSB, IBM IIB, MuleSoft | Protocol mediation, XA, B2B, legacy wrapping |
| **Multi-tenant cloud** | Pulsar | Native multi-tenancy, geo-replication |
| **Serverless / event-driven** | SQS/SNS, Pub/Sub | Managed, auto-scaling |
---
## DR and high availability
See [DATACENTERS.md](DATACENTERS.md), section "Vliv jednotlivých technologií na výběr DC topologie" (impact of individual technologies on the choice of DC topology), for the detailed per-platform DR mapping.
### Best practices
- **Don't lose messages in the queue**: prefer acknowledgement-based consumption (not auto-ack)
- **Dead letter queue**: every main queue has a DLQ for undeliverable messages
- **Monitor lag**: consumer lag is a key metric (Kafka: `records-lag-max` under `kafka.consumer:type=consumer-fetch-manager-metrics`)
- **Idempotent consumer**: the same message may be delivered twice
- **Retry with backoff**: exponential backoff when processing fails
- **Schema registry**: avoid deserialization errors (Avro, Protobuf, JSON Schema)
- **Encryption**: TLS in transit, encryption at rest (Kafka: cluster-side + topic-level)
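
Several of the practices above (explicit acknowledgement, idempotent consumer, retry with exponential backoff, dead letter queue) can be sketched together in a broker-agnostic way. This is a minimal illustration under stated assumptions, not a client for any particular platform: the class name, the in-memory `processed_ids` set (standing in for a durable deduplication store), and the `dead_letters` list (standing in for a real DLQ) are all hypothetical.

```python
import time
from typing import Callable


class IdempotentConsumer:
    """Processes each message id at most once, retrying failures with backoff."""

    def __init__(self, handler: Callable[[str], None], max_attempts: int = 3,
                 base_delay: float = 0.1,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.handler = handler
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.sleep = sleep
        self.processed_ids: set[str] = set()              # stand-in for a durable store
        self.dead_letters: list[tuple[str, str]] = []     # stand-in for a DLQ

    def consume(self, message_id: str, payload: str) -> str:
        # Idempotency: a redelivered message is acknowledged without reprocessing.
        if message_id in self.processed_ids:
            return "duplicate"
        for attempt in range(self.max_attempts):
            try:
                self.handler(payload)
            except Exception:
                # Exponential backoff: base, 2*base, 4*base, ... between attempts.
                if attempt < self.max_attempts - 1:
                    self.sleep(self.base_delay * 2 ** attempt)
                continue
            # Acknowledge only after successful processing (no auto-ack).
            self.processed_ids.add(message_id)
            return "acked"
        # All attempts failed: park the message instead of dropping it.
        self.dead_letters.append((message_id, payload))
        return "dead-lettered"


if __name__ == "__main__":
    consumer = IdempotentConsumer(handler=print)
    print(consumer.consume("m1", "hello"))
    print(consumer.consume("m1", "hello"))
```

In a real deployment the deduplication state would need to survive consumer restarts, and production backoff usually adds jitter and an upper bound on the delay.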
---
## Related
- [DATACENTERS.md](DATACENTERS.md): DR topologies, per-platform mapping
- [CLOUD.md](CLOUD.md): managed messaging (SQS, SNS, Service Bus, Pub/Sub)
## Sources
Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
*Last revised: 2026-06-12*

View File

@@ -52,9 +52,10 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
| 🌐 Network architecture | [NETWORKING.md](NETWORKING.md) | DNS, BGP, VPC, Zero Trust, EVPN VXLAN, TLS | CLOUD |
| 📊 Monitoring & observability | [MONITORING.md](MONITORING.md) | Prometheus, Grafana, OTel, logging, alerting | — |
| 🔄 CI/CD & DevOps | [CICD.md](CICD.md) | Pipelines, GitOps, IaC (Terraform), deployment | — |
| 🔄 Disaster Recovery | [DR.md](DR.md) | RTO, RPO, scenarios, prevention, uptime calculation | CLOUD, DATACENTERS, MONITORING |
| 🗄️ Database architecture | [DATABASES.md](DATABASES.md) | Classification, sharding, replication, caching | POSTGRESQL, MYSQL, ORACLE, MONGODB, REDIS, CASSANDRA, VEKTOROVE-DB, DATABAZOVE-ENGINY |
| 🖥️ Hypervisors | [HYPERVISORS.md](HYPERVISORS.md) | VMware, Hyper-V, KVM, Proxmox, migration | STORAGE, SERVER-HW |
| 🏭 Data centers | [DATACENTERS.md](DATACENTERS.md) | Tier, power, cooling, layout, DC services | MONITORING |
| 🏭 Data centers | [DATACENTERS.md](DATACENTERS.md) | Tier, power, cooling, layout, DC services, secondary DC topologies | MONITORING |
| 💾 Storage | [STORAGE.md](STORAGE.md) | SAN/NAS/object, RAID, SDS, Ceph, OpenStack Cinder/Swift/Manila | — |
| 🔌 Server connectivity | [CONNECTIVITY.md](CONNECTIVITY.md) | Ethernet, FC SAN, iSCSI, NVMe-oF, SAS | — |
| 🔧 Server hardware | [SERVER-HW.md](SERVER-HW.md) | CPU, RAM, PCIe, NUMA, BMC | CONNECTIVITY |
@@ -89,9 +90,10 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
| 🌐 Network architecture | [NETWORKING.en.md](NETWORKING.en.md) | DNS, BGP, VPC, Zero Trust, EVPN VXLAN, TLS | CLOUD |
| 📊 Monitoring & observability | [MONITORING.en.md](MONITORING.en.md) | Prometheus, Grafana, OTel, logging, alerting | — |
| 🔄 CI/CD & DevOps | [CICD.en.md](CICD.en.md) | Pipelines, GitOps, IaC (Terraform), deployment | — |
| 🔄 Disaster Recovery | [DR.en.md](DR.en.md) | RTO, RPO, scenarios, prevention, uptime calculation | CLOUD, DATACENTERS, MONITORING |
| 🗄️ Database architecture | [DATABASES.en.md](DATABASES.en.md) | Classification, sharding, replication, caching | POSTGRESQL, MYSQL, ORACLE, MONGODB, REDIS, CASSANDRA, VECTOR-DBS, DATABASE-ENGINES |
| 🖥️ Hypervisors | [HYPERVISORS.en.md](HYPERVISORS.en.md) | VMware, Hyper-V, KVM, Proxmox, migration | STORAGE, SERVER-HW |
| 🏭 Data centers | [DATACENTERS.en.md](DATACENTERS.en.md) | Tier, power, cooling, layout, DC services | MONITORING |
| 🏭 Data centers | [DATACENTERS.en.md](DATACENTERS.en.md) | Tier, power, cooling, layout, DC services, secondary DC topologies | MONITORING |
| 💾 Storage | [STORAGE.en.md](STORAGE.en.md) | SAN/NAS/object, RAID, SDS, Ceph | — |
| 🔌 Server connectivity | [CONNECTIVITY.en.md](CONNECTIVITY.en.md) | Ethernet, FC SAN, iSCSI, NVMe-oF, SAS | — |
| 🔧 Server hardware | [SERVER-HW.en.md](SERVER-HW.en.md) | CPU, RAM, PCIe, NUMA, BMC | CONNECTIVITY |
@@ -136,6 +138,7 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
| `DATACENTERS.md` / `DATACENTERS.en.md` | [`MONITORING.md`](MONITORING.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
| `MONITORING.md` / `MONITORING.en.md` | [`sources/monitoring/sources.md`](sources/monitoring/sources.md) |
| `CICD.md` / `CICD.en.md` | [`sources/cicd/sources.md`](sources/cicd/sources.md) |
| `DR.md` / `DR.en.md` | [`CLOUD.md`](CLOUD.md), [`DATACENTERS.md`](DATACENTERS.md), [`MONITORING.md`](MONITORING.md), [`CICD.md`](CICD.md), [`STORAGE.md`](STORAGE.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
| `PROVISIONING.md` / `PROVISIONING.en.md` | [`CICD.md`](CICD.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
| `STORAGE.md` / `STORAGE.en.md` | [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
| `GPU.md` / `GPU.en.md` | [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |

View File

@@ -52,15 +52,18 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
| 🌐 Network architecture | [NETWORKING.md](NETWORKING.md) | DNS, BGP, VPC, Zero Trust, EVPN VXLAN, TLS | CLOUD |
| 📊 Monitoring and observability | [MONITORING.md](MONITORING.md) | Prometheus, Grafana, OTel, logging, alerting, SLO | — |
| 🔄 CI/CD and DevOps | [CICD.md](CICD.md) | Pipelines, GitOps, IaC (Terraform), deployment strategies | — |
| 🔄 Disaster Recovery | [DR.md](DR.md) | RTO, RPO, scenarios, prevention, uptime calculation | CLOUD, DATACENTERS, MONITORING |
| 🗄️ Database architecture | [DATABASES.md](DATABASES.md) | Classification, sharding, replication, caching | POSTGRESQL, MYSQL, ORACLE, MONGODB, REDIS, CASSANDRA, VEKTOROVE-DB, DATABAZOVE-ENGINY |
| 🖥️ Hypervisors | [HYPERVISORS.md](HYPERVISORS.md) | VMware, Hyper-V, KVM, Proxmox, migration | STORAGE, SERVER-HW |
| 🏭 Data centers | [DATACENTERS.md](DATACENTERS.md) | Tier, power, cooling, layout, DC services | MONITORING |
| 🏭 Data centers | [DATACENTERS.md](DATACENTERS.md) | Tier, power, cooling, layout, DC services, secondary DC topologies | MONITORING, MESSAGING |
| 💾 Storage | [STORAGE.md](STORAGE.md) | SAN/NAS/object, RAID, SDS, Ceph, OpenStack Cinder/Swift/Manila | — |
| 🔌 Server connectivity | [CONNECTIVITY.md](CONNECTIVITY.md) | Ethernet, FC SAN, iSCSI, NVMe-oF, SAS | — |
| 🔧 Server hardware | [SERVER-HW.md](SERVER-HW.md) | CPU, RAM, PCIe, NUMA, BMC | CONNECTIVITY |
| 🎮 GPU | [GPU.md](GPU.md) | NVIDIA/AMD, NVLink, MIG/vGPU, AI, Cyborg | — |
| ⚙️ Server config | [SERVER-CONFIG.md](SERVER-CONFIG.md) | BIOS tuning, DB/hypervisor/K8s/storage best practices | — |
| 📦 Provisioning | [PROVISIONING.md](PROVISIONING.md) | PXE, Redfish, Terraform, Ironic, OpenStack deploy | CICD |
| 📨 Messaging & streaming | [MESSAGING.md](MESSAGING.md) | Kafka, RabbitMQ, Pulsar, NATS, managed queue/pubsub | DATACENTERS, CLOUD |
| 🏗️ DC migration | [DC-MIGRATION.md](DC-MIGRATION.md) | Strategies, phases, network, DB, rollback | DATACENTERS, CLOUD, DR, NETWORKING, STORAGE |
| 📋 Legacy index | [HARDWARE.md](HARDWARE.md) | Legacy index → SERVER-HW, GPU, SERVER-CONFIG, PROVISIONING | SERVER-HW, GPU, SERVER-CONFIG, PROVISIONING |
| 📋 Legacy infrastructure | [INFRASTRUCTURE.md](INFRASTRUCTURE.md) | Legacy index → HYPERVISORS, DATACENTERS, STORAGE, HARDWARE | HYPERVISORS, DATACENTERS, STORAGE, HARDWARE |
| 📋 Review workflow | [REVIEW.md](REVIEW.md) | Review and content control process | — |
@@ -89,15 +92,18 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
| 🌐 Network architecture | [NETWORKING.en.md](NETWORKING.en.md) | DNS, BGP, VPC, Zero Trust, EVPN VXLAN, TLS | CLOUD |
| 📊 Monitoring & observability | [MONITORING.en.md](MONITORING.en.md) | Prometheus, Grafana, OTel, logging, alerting | — |
| 🔄 CI/CD & DevOps | [CICD.en.md](CICD.en.md) | Pipelines, GitOps, IaC (Terraform), deployment | — |
| 🔄 Disaster Recovery | [DR.en.md](DR.en.md) | RTO, RPO, scenarios, prevention, uptime calculation | CLOUD, DATACENTERS, MONITORING |
| 🗄️ Database architecture | [DATABASES.en.md](DATABASES.en.md) | Classification, sharding, replication, caching | POSTGRESQL, MYSQL, ORACLE, MONGODB, REDIS, CASSANDRA, VECTOR-DBS, DATABASE-ENGINES |
| 🖥️ Hypervisors | [HYPERVISORS.en.md](HYPERVISORS.en.md) | VMware, Hyper-V, KVM, Proxmox, migration | STORAGE, SERVER-HW |
| 🏭 Data centers | [DATACENTERS.en.md](DATACENTERS.en.md) | Tier, power, cooling, layout, DC services | MONITORING |
| 🏭 Data centers | [DATACENTERS.en.md](DATACENTERS.en.md) | Tier, power, cooling, layout, DC services, secondary DC topologies | MONITORING, MESSAGING |
| 💾 Storage | [STORAGE.en.md](STORAGE.en.md) | SAN/NAS/object, RAID, SDS, Ceph | — |
| 🔌 Server connectivity | [CONNECTIVITY.en.md](CONNECTIVITY.en.md) | Ethernet, FC SAN, iSCSI, NVMe-oF, SAS | — |
| 🔧 Server hardware | [SERVER-HW.en.md](SERVER-HW.en.md) | CPU, RAM, PCIe, NUMA, BMC | CONNECTIVITY |
| 🎮 GPU | [GPU.en.md](GPU.en.md) | NVIDIA/AMD, NVLink, MIG/vGPU, AI, Cyborg | — |
| ⚙️ Server config | [SERVER-CONFIG.en.md](SERVER-CONFIG.en.md) | BIOS tuning, DB/hypervisor/K8s/storage best practices | — |
| 📦 Provisioning | [PROVISIONING.en.md](PROVISIONING.en.md) | PXE, Redfish, Terraform, Ironic, OpenStack deploy | CICD |
| 📨 Messaging & streaming | [MESSAGING.en.md](MESSAGING.en.md) | Kafka, RabbitMQ, Pulsar, NATS, managed queue/pubsub | DATACENTERS, CLOUD |
| 🏗️ DC Migration | [DC-MIGRATION.en.md](DC-MIGRATION.en.md) | Strategies, phases, network, DB, rollback | DATACENTERS, CLOUD, DR, NETWORKING, STORAGE |
| 📋 Legacy index | [HARDWARE.en.md](HARDWARE.en.md) | → SERVER-HW, GPU, SERVER-CONFIG, PROVISIONING | SERVER-HW, GPU, SERVER-CONFIG, PROVISIONING |
| 📋 Legacy infra | [INFRASTRUCTURE.en.md](INFRASTRUCTURE.en.md) | → HYPERVISORS, DATACENTERS, STORAGE, HARDWARE | HYPERVISORS, DATACENTERS, STORAGE, HARDWARE |
| 📋 Review workflow | [REVIEW.en.md](REVIEW.en.md) | Review and content control process | — |
@@ -136,6 +142,9 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
| `DATACENTERS.md` / `DATACENTERS.en.md` | [`MONITORING.md`](MONITORING.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
| `MONITORING.md` / `MONITORING.en.md` | [`sources/monitoring/sources.md`](sources/monitoring/sources.md) |
| `CICD.md` / `CICD.en.md` | [`sources/cicd/sources.md`](sources/cicd/sources.md) |
| `DR.md` / `DR.en.md` | [`CLOUD.md`](CLOUD.md), [`DATACENTERS.md`](DATACENTERS.md), [`MONITORING.md`](MONITORING.md), [`CICD.md`](CICD.md), [`STORAGE.md`](STORAGE.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
| `MESSAGING.md` / `MESSAGING.en.md` | [`DATACENTERS.md`](DATACENTERS.md), [`CLOUD.md`](CLOUD.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
| `DC-MIGRATION.md` / `DC-MIGRATION.en.md` | [`DATACENTERS.md`](DATACENTERS.md), [`CLOUD.md`](CLOUD.md), [`DR.md`](DR.md), [`NETWORKING.md`](NETWORKING.md), [`STORAGE.md`](STORAGE.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
| `PROVISIONING.md` / `PROVISIONING.en.md` | [`CICD.md`](CICD.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
| `STORAGE.md` / `STORAGE.en.md` | [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
| `GPU.md` / `GPU.en.md` | [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
@@ -187,4 +196,4 @@ Raw reference data (documentation, books, standards) by area:
---
*The index is maintained automatically by the `kb-index` agent. Last updated: 2026-06-11.*
*The index is maintained automatically by the `kb-index` agent. Last updated: 2026-06-12.*

View File

@@ -112,6 +112,21 @@ Split into separate files:
| Complete guide to modern vSphere alternatives — Spectro Cloud | https://www.spectrocloud.com/blog/vsphere-alternatives | `[done]` |
| Broadcom VMware Acquisition: What's Next — Sayers | https://www.sayers.com/blog/after-the-deal-whats-next-for-vmware-customers | `[done]` |
| Stanford University migration from VMware to Proxmox | https://itcommunity.stanford.edu/news/enterprise-technology-completes-successful-virtual-infrastructure-migration-vmware-proxmox | `[done]` |
| | **Messaging / streaming** | |
| Apache Kafka docs | https://kafka.apache.org/documentation/ | `[done]` |
| RabbitMQ docs | https://www.rabbitmq.com/documentation.html | `[done]` |
| Apache Pulsar docs | https://pulsar.apache.org/docs/ | `[done]` |
| NATS docs | https://docs.nats.io/ | `[done]` |
| Designing Event-Driven Systems (Confluent) | https://www.confluent.io/designing-event-driven-systems/ | `[done]` |
| Kafka: The Definitive Guide (2nd ed.) — Confluent | https://www.confluent.io/resources/kafka-the-definitive-guide/ | `[done]` |
| Enterprise Integration Patterns — Hohpe & Woolf | https://www.enterpriseintegrationpatterns.com/ | `[done]` |
| | **DC migration** | |
| AWS Cloud Migration — 6 Strategies for Migrating to the Cloud | https://aws.amazon.com/blogs/enterprise-strategy/6-strategies-for-migrating-applications-to-the-cloud/ | `[done]` |
| Azure Cloud Migration — Microsoft Cloud Adoption Framework | https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/ | `[done]` |
| Gartner 5 Rs of Cloud Migration | https://www.gartner.com/en/documents/3984835 | `[done]` |
| VMware Site Recovery Manager — documentation | https://docs.vmware.com/en/Site-Recovery-Manager/ | `[done]` |
| Zerto — Disaster Recovery & Migration | https://www.zerto.com/resources/ | `[done]` |
| The Phoenix Project — IT Ops & Migration patterns | https://itrevolution.com/product/the-phoenix-project/ | `[done]` |
## Hardware vendors