new files
@@ -658,6 +658,281 @@ flowchart TD
    CLIM -->|"Cold (SE, NO)"| FC3["Free cooling 7000+ h/year<br/>Air-side economizer<br/>PUE < 1.2"]
```

## Secondary data center topologies

When planning a second DC, the key decision is the topology, which depends on the distance between the sites, the RPO/RTO targets, and the budget.

### Distance classification

| Category | Distance | Latency (round-trip) | Use case |
|----------|----------|----------------------|----------|
| **Metro (Campus)** | 1–20 km | < 1 ms | Synchronous replication, stretched cluster |
| **Metro** | 20–100 km | 1–5 ms | Metro cluster, mostly sync replication |
| **Regional** | 100–500 km | 5–20 ms | Asynchronous replication, warm standby |
| **Continental** | 500–3000 km | 20–100 ms | Asynchronous replication, cold standby |
| **Global** | 3000+ km | > 100 ms | Async only, no real-time dependencies |

### Topologies by operational mode

#### Active-Active (Hot-Hot)

```
    DC-A (Primary)                DC-B (Active)
┌────────────────────┐        ┌────────────────────┐
│  App    Active     │        │  App    Active     │
│  DB     Active     │◄─sync─►│  DB     Active     │
│  Users → LB → A    │        │  Users → LB → B    │
└────────────────────┘        └────────────────────┘
           │                              │
           └──── Global Load Balancer ────┘
```

| Parameter | Value |
|-----------|-------|
| **RTO** | 0 to a few seconds (automatic failover, traffic is redirected) |
| **RPO** | 0 (sync replication; a commit is confirmed only after it is written in both DCs) |
| **Max distance** | < 100 km (latency < 5 ms RTT for sync DB replication) |
| **Operating costs** | 2× (both DCs fully active and fully equipped) |
| **Advantages** | Zero downtime, instant switchover, full utilization of both DCs |
| **Disadvantages** | Requires synchronous replication (hence the distance limit), complex networking, split-brain risk |

**Split-brain mitigation**: fencing such as STONITH (Shoot The Other Node In The Head), watchdog timers, a quorum with a third vote in a third location or in the cloud, and SCSI-3 persistent reservations.

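The quorum approach can be illustrated with a minimal Python sketch. It assumes one vote per data center plus one witness vote in a third location; the voter names and the `has_quorum` / `may_serve` helpers are hypothetical and not tied to any particular cluster manager.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """A partition may keep serving writes only with a strict majority of votes."""
    return reachable_votes > total_votes // 2


# Hypothetical layout: one vote per DC plus a witness in a third location.
VOTERS = {"dc-a", "dc-b", "witness"}


def may_serve(visible: set[str]) -> bool:
    """`visible` is the set of voters this partition can reach, itself included."""
    return has_quorum(len(visible & VOTERS), len(VOTERS))


# Inter-DC link fails, but DC-A still reaches the witness: DC-A keeps serving.
print(may_serve({"dc-a", "witness"}))  # True
# DC-B sees only itself: it must stop (or be fenced) to avoid split-brain.
print(may_serve({"dc-b"}))             # False
```

With only two votes (no witness), neither side of a partition would hold a strict majority, which is why the third vote matters.
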
**Use case**: Financial services, telco, and payment gateways, where even a minute of downtime costs millions.

#### Active-Passive (Hot-Warm, MetroCluster)

```
    DC-A (Primary)                  DC-B (Standby)
┌────────────────────┐         ┌────────────────────┐
│  App    Active     │         │  App    Standby    │
│  DB     Primary    │──sync──►│  DB     Standby    │
│  Users → LB → A    │         │  ~~~ (waiting) ~~~ │
│  DNS: A-record     │         │  DNS: health check │
└────────────────────┘         └────────────────────┘
```

| Parameter | Value |
|-----------|-------|
| **RTO** | Tens of seconds to minutes (DNS failover + application startup) |
| **RPO** | 0 (sync) or seconds (async) |
| **Max distance** | sync < 100 km, async unlimited |
| **Operating costs** | 1.5–1.8× (second DC has reduced or idle compute) |
| **MetroCluster** | Specific implementation: FC SAN over DWDM, sync mirror, automatic failover |

**MetroCluster** (a NetApp product name; Dell EMC and HPE offer comparable metro storage clusters):
- Storage-based cluster with synchronous mirroring between DCs
- Automatic failover when an entire DC fails
- Requires a dedicated DWDM or dark fiber interconnect
- Typical distance: up to 50 km (for latency < 1 ms RTT)
- Use case: enterprise storage, primary + secondary DC in a metropolitan area

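The Active-Passive RTO row above combines failure detection, DNS propagation, and application startup. A back-of-the-envelope sketch in Python follows; the helper name and every parameter value are illustrative assumptions, not measured figures.

```python
def estimate_rto_seconds(check_interval_s: float,
                         failures_before_failover: int,
                         dns_ttl_s: float,
                         app_startup_s: float) -> float:
    """Worst case: detect the outage, wait for cached DNS answers to expire, start the app."""
    detection_s = check_interval_s * failures_before_failover
    return detection_s + dns_ttl_s + app_startup_s


# Assumed values: 10 s health checks, 3 consecutive failures, 60 s TTL, 45 s startup.
print(estimate_rto_seconds(10, 3, 60, 45))  # 135
```

Under these assumptions the result lands in the "tens of seconds to minutes" range; a shorter DNS TTL or a pre-started standby application shortens it.
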
#### Hot-Cold (Warm Standby → Cold)

```
    DC-A (Primary)                DC-B (Cold Standby)
┌────────────────────┐         ┌────────────────────┐
│  App    Active     │         │ ~~~ powered off ~~~│
│  DB     Active     │──async─►│  Backup storage    │
│  Users → A         │         │ ~~~ no compute ~~~ │
└────────────────────┘         └────────────────────┘
```

| Parameter | Value |
|-----------|-------|
| **RTO** | Hours to days (purchase or rent hardware, restore from backup) |
| **RPO** | Hours (last backup) |
| **Max distance** | unlimited |
| **Operating costs** | 1.1–1.3× (only storage and facility; compute only at failover) |
| **Typical use case** | Low-cost DR, compliance, last resort |

#### Pilot Light

```
    DC-A (Primary)                DC-B (Pilot Light)
┌────────────────────┐         ┌────────────────────┐
│  App    Active     │         │  ~~~ off ~~~       │
│  DB     Active     │──async─►│  DB replica (mini) │
│  All services      │         │  Core services only│
│                    │         │  (DNS, LDAP, mon)  │
└────────────────────┘         └────────────────────┘
                                 On DR: spin up compute
                                 from IaC, rest from backup
```

- DC-B runs with minimal compute (only core services and a DB replica)
- The application layer is spun up from IaC (Terraform, Ansible) only during DR
- A compromise between cost and RTO

### Comparison table

| Topology | RTO | RPO | Cost (× primary) | Max distance | Failover |
|----------|-----|-----|------------------|--------------|----------|
| **Active-Active** | 0–s | 0 | 2.0× | < 100 km | Auto (traffic) |
| **MetroCluster** | s–min | 0 | 1.8–2.0× | < 50 km | Auto (storage) |
| **Active-Passive (sync)** | min | 0 | 1.5–1.8× | < 100 km | Semi-auto |
| **Active-Passive (async)** | min–h | s–min | 1.3–1.5× | unlimited | Semi-auto |
| **Pilot Light** | h | min–h | 1.2–1.4× | unlimited | Manual |
| **Warm Standby** | min–h | s–min | 1.5–1.8× | unlimited | Semi-auto |
| **Cold Standby** | days | h | 1.1–1.3× | unlimited | Manual |

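The comparison table can be treated as data and filtered against project constraints. The Python sketch below copies the lower cost bounds, distance limits, and RPO = 0 rows from the table; the `Topology` type and the `candidates` helper are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Topology:
    name: str
    cost_min: float                  # lower bound of cost, as a multiple of the primary DC
    max_distance_km: Optional[int]   # None means "unlimited" in the table
    zero_rpo: bool                   # True where the table lists RPO = 0


# Rows copied from the comparison table above.
TABLE = [
    Topology("Active-Active", 2.0, 100, True),
    Topology("MetroCluster", 1.8, 50, True),
    Topology("Active-Passive (sync)", 1.5, 100, True),
    Topology("Active-Passive (async)", 1.3, None, False),
    Topology("Pilot Light", 1.2, None, False),
    Topology("Warm Standby", 1.5, None, False),
    Topology("Cold Standby", 1.1, None, False),
]


def candidates(distance_km: float, budget_factor: float, need_zero_rpo: bool) -> list[str]:
    """Return the topologies whose distance limit, minimum cost and RPO fit the constraints."""
    return [
        t.name
        for t in TABLE
        if (t.max_distance_km is None or distance_km < t.max_distance_km)
        and t.cost_min <= budget_factor
        and (t.zero_rpo or not need_zero_rpo)
    ]


print(candidates(distance_km=30, budget_factor=1.6, need_zero_rpo=True))
# ['Active-Passive (sync)']
print(candidates(distance_km=400, budget_factor=1.25, need_zero_rpo=False))
# ['Pilot Light', 'Cold Standby']
```

The filter uses only the lower cost bound, so it is optimistic; a stricter variant would compare the budget against the upper bound of each range.
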
### Stretched Cluster

```
┌──── Site A (50 km) ────┐    ┌──── Site B ────────────┐
│  ┌──────────────────┐  │    │  ┌──────────────────┐  │
│  │  ESXi / Hyper-V  │  │    │  │  ESXi / Hyper-V  │  │
│  │  VM              │  │    │  │  VM (complement) │  │
│  └────────┬─────────┘  │    │  └────────┬─────────┘  │
│           │            │    │           │            │
│  ┌────────▼─────────┐  │    │  ┌────────▼─────────┐  │
│  │  Storage (SAN)   │──┼────┼──│  Storage (SAN)   │  │
│  │  MetroCluster    │  │    │  │  MetroCluster    │  │
│  └──────────────────┘  │    │  └──────────────────┘  │
└────────────────────────┘    └────────────────────────┘
                           │
                     ┌─────▼──────┐
                     │ vCenter /  │
                     │ Cluster    │
                     │ (single)   │
                     └────────────┘
```

- One cluster stretched across two sites (single management domain)
- VMs can live-migrate between sites (long-distance vMotion)
- Storage is synchronously mirrored (MetroCluster, VPLEX, vSAN Stretched Cluster)
- **Requirements**: dark fiber / DWDM, low latency (< 5 ms), high link reliability
- **Risks**: split-brain when the inter-site link partitions the cluster, dependence on the network
- **Use case**: enterprise with its own dark fiber between two DCs in a metropolitan area

### Decision tree

```mermaid
flowchart TD
    Start(["Secondary DC"]) --> RPO{"Required RPO?"}
    RPO -->|"0 (no data loss)"| SYNC{"Sync replication possible?"}
    SYNC -->|"Yes, < 100 km"| ACT{"Want zero downtime?"}
    ACT -->|"Yes"| AA["Active-Active<br/>RTO=0, RPO=0, 2× cost"]
    ACT -->|"No"| AP["Active-Passive<br/>RTO=min, RPO=0, 1.5×"]
    SYNC -->|"No, > 100 km"| ASYNC["Active-Passive (async)<br/>RTO=min, RPO=s, 1.3×"]

    RPO -->|"minutes–hours"| WARM{"Want fast failover?"}
    WARM -->|"Yes"| PILOT["Pilot Light<br/>RTO=h, RPO=min, 1.2×"]
    WARM -->|"No"| COLD["Cold Standby<br/>RTO=days, RPO=h, 1.1×"]

    Start --> DIST{"Distance between DCs"}
    DIST -->|"< 50 km, own fiber"| MC["MetroCluster / Stretched Cluster<br/>Single management, sync storage"]
    DIST -->|"50–300 km"| REG["Regional DR<br/>Active-Passive, async replication"]
    DIST -->|"> 300 km"| GLOBAL["Global DR<br/>Cold standby, backup & restore"]
```

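The RPO branch of the decision tree can also be written as a small function. This is a sketch in Python that mirrors the flowchart's labels and its 100 km threshold; the function and parameter names are hypothetical.

```python
def choose_topology(rpo_zero: bool, distance_km: float,
                    zero_downtime: bool = False, fast_failover: bool = False) -> str:
    """Walk the RPO branch of the decision tree above."""
    if rpo_zero:
        if distance_km < 100:  # synchronous replication is feasible
            return "Active-Active" if zero_downtime else "Active-Passive (sync)"
        # Beyond ~100 km the tree falls back to async, so RPO becomes seconds, not 0.
        return "Active-Passive (async)"
    return "Pilot Light" if fast_failover else "Cold Standby"


print(choose_topology(rpo_zero=True, distance_km=40, zero_downtime=True))    # Active-Active
print(choose_topology(rpo_zero=True, distance_km=250))                       # Active-Passive (async)
print(choose_topology(rpo_zero=False, distance_km=800, fast_failover=True))  # Pilot Light
```

Note that the second call shows the tree's own compromise: a strict RPO of 0 cannot be kept once the sites are too far apart for synchronous replication.
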
### Physical infrastructure for DC interconnection

| Technology | Bandwidth | Max distance | Latency | Use case |
|------------|-----------|--------------|---------|----------|
| **Dark fiber** | 100 GbE–800 GbE | 10–80 km (single-mode) | < 0.1 ms | MetroCluster, stretched cluster |
| **DWDM** | 400 GbE–1.6 TbE (per lambda) | 80–120 km (without amplifier) | < 0.5 ms | Metro, metro cluster |
| **CWDM** | 10–25 GbE (per channel) | 10–40 km | < 0.3 ms | Campus, smaller metro |
| **MPLS L2VPN** | 10–100 GbE | unlimited | 1–10 ms | Regional DR, async replication |
| **Internet IPsec** | 1–10 GbE | unlimited | 5–50 ms | Cold standby, backup |

### Impact of individual technologies on DC topology selection

Choosing a secondary DC topology is not purely an infrastructure decision: each layer (DB, hypervisor, orchestration, messaging) brings its own constraints.

#### Databases

| DB technology | Sync replication | Max distance | Auto-failover | Split-brain handling | Note |
|---------------|------------------|--------------|---------------|----------------------|------|
| **PostgreSQL** | Synchronous commit (`synchronous_standby_names`) | < 100 km (latency < 10 ms) | Patroni / repmgr + etcd | Quorum (etcd, 3+ nodes) | Streaming replication; retain WAL with `wal_keep_size` (`wal_keep_segments` before PostgreSQL 13) or replication slots |
| **MySQL** | Group Replication (multi-primary or single-primary) | < 100 km | MySQL InnoDB Cluster + MySQL Router | Paxos-based consensus (Group Replication, 3+ nodes) | Semi-sync replication as a compromise |
| **Oracle** | Data Guard (SYNC/FASTSYNC/ASYNC), Extended RAC | sync < 100 km, async unlimited | Data Guard Broker / FSFO (Fast-Start Failover) | Observer (3rd node) | Far Sync for remote DCs |
| **MSSQL** | Always On Availability Groups (`SYNCHRONOUS_COMMIT`) | < 100 km | Always On + cluster quorum | File share witness / cloud witness | Multi-site cluster support |
| **MongoDB** | Majority write concern + journaling | < 100 km | Replica set automatic election | Arbiter (voting member) | Priority-based failover |
| **Cassandra** | N/A (multi-master, eventual consistency) | unlimited | Yes (peer-to-peer) | None (multi-master, gossip protocol) | Snitch-aware topology, `NetworkTopologyStrategy` |
| **Redis** | Redis Sentinel / Redis Cluster (async) | unlimited (async) | Sentinel / Cluster failover | Quorum (Sentinel, majority) | PSYNC replication, replication lag |

The key limit for **sync replication** is latency: keep it below 5 ms RTT, because every commit waits for confirmation from both DCs. At 100 km the fiber RTT is about 1 ms, which is fine. At 1000 km (about 10 ms RTT), sync replication can cut transaction throughput by 80 % or more for sessions that commit serially.

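The RTT figures above follow from the propagation speed of light in fiber. The Python sketch below reproduces them; the refractive index of 1.468 and the 2 ms local commit time are assumptions chosen for illustration, and the model ignores equipment delay and non-straight fiber routes.

```python
C_VACUUM_KM_PER_S = 299_792.458   # speed of light in vacuum
FIBER_REFRACTIVE_INDEX = 1.468    # assumed typical value for single-mode fiber


def fiber_rtt_ms(distance_km: float) -> float:
    """Round-trip propagation delay over fiber, ignoring equipment and routing detours."""
    speed_km_per_s = C_VACUUM_KM_PER_S / FIBER_REFRACTIVE_INDEX
    return 2 * distance_km / speed_km_per_s * 1000


def serial_commit_slowdown(distance_km: float, local_commit_ms: float) -> float:
    """Fraction of throughput lost by one session that commits serially and waits for the remote ack."""
    rtt = fiber_rtt_ms(distance_km)
    return rtt / (local_commit_ms + rtt)


for km in (100, 1000):
    print(f"{km} km: RTT ≈ {fiber_rtt_ms(km):.2f} ms, "
          f"slowdown ≈ {serial_commit_slowdown(km, local_commit_ms=2.0):.0%}")
```

With these assumptions the RTT comes out at roughly 1 ms for 100 km and roughly 10 ms for 1000 km, and the serial-commit slowdown at 1000 km exceeds 80 %. Workloads with many concurrent sessions or group commit lose less aggregate throughput, although each individual commit still pays the full round trip.
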
Suitable for **Active-Active**:
- **Cassandra / ScyllaDB**: native multi-DC, eventual consistency, no split-brain
- **MySQL Group Replication (multi-primary)**: needs 3+ DCs for quorum
- **CockroachDB / TiDB**: native multi-region, ACID across DCs
- **Redis Enterprise**: Active-Active (CRDT-based)

Suitable for **Active-Passive**:
- **PostgreSQL + Patroni**: auto-failover, etcd quorum
- **Oracle Data Guard**: FSFO, Far Sync for remote DCs
- **MSSQL Always On**: cloud witness
- **MongoDB Replica Set**: arbiter in a 3rd location

#### Hypervisors

| Hypervisor | Cluster technology | Stretched cluster | Max distance | Split-brain |
|------------|--------------------|-------------------|--------------|-------------|
| **VMware vSphere** | vSAN Stretched Cluster, vSphere Metro Storage Cluster (vMSC), Site Recovery Manager | Yes (vSAN Stretched Cluster, vMSC) | < 50 km, < 5 ms RTT (vSAN Stretched Cluster) | Fencing (STONITH), witness host |
| **Hyper-V** | Storage Replica + Failover Cluster | Yes (Cluster Sets) | < 50 km (sync), unlimited (async) | File share witness / cloud witness |
| **Proxmox VE** | Proxmox HA + Ceph | Limited (Ceph stretch cluster) | < 50 km (Ceph sync) | Ceph monitor quorum (3+ DCs) |
| **XCP-ng / XenServer** | Xen Orchestra HA + SR (Storage Repository) replication | Limited | depends on storage replication | n/a |
| **Nutanix AHV** | Metro Availability (sync), Async DR | Yes (Metro) | < 100 km (sync), unlimited (async) | Witness VM (cloud / 3rd site) |
| **KVM / oVirt** | oVirt HA + GlusterFS / NFS | Limited | depends on storage replication | n/a |

**vSAN Stretched Cluster specific requirements:**
- Dedicated vSAN network (25 GbE min., < 5 ms RTT)
- Witness host in a 3rd location (or cloud witness)
- All VMs use a storage policy with FTT=1 (mirroring)
- Storage policy: `site-A + site-B + witness`

#### Kubernetes and container platforms

| Platform | Multi-cluster DR | Replication | Max distance | Failover |
|----------|------------------|-------------|--------------|----------|
| **Vanilla K8s** | KubeFed, Cluster API, Velero + Restic | Velero (backup/restore), Rook (Ceph) | unlimited | Manual (Velero restore) |
| **OpenShift** | ACM (Advanced Cluster Management), Velero | OADP (OpenShift API for Data Protection) | unlimited | ACM failover (subscription) |
| **Rancher** | Rancher Multi-Cluster App, Velero | Longhorn (sync/async DR), Velero | unlimited | Semi-auto |
| **Google GKE** | Multi-cluster Services, Backup for GKE | Config Sync, Backup for GKE | unlimited | Manual |
| **Azure AKS** | Azure Arc + Velero + Azure Traffic Manager | AKS Backup (Velero), Azure Site Recovery | unlimited | Manual (Velero) |
| **AWS EKS** | EKS multi-cluster, Velero + S3 cross-region | Velero (S3), Rook (EBS snapshots) | unlimited | Manual |

**Key K8s DR principles:**
- **Applications must be stateless** (or keep their state externalized in a DB or storage)
- **Velero**: backup and restore of the entire cluster (PVs, resources, Helm releases)
- **Rook/Ceph**: cross-region mirroring of RBD volumes
- **KubeFed / ACM**: subscription-based deployment to multiple clusters
- **Ingress / Gateway API**: traffic routing between clusters
- **External DNS**: DNS failover on cluster outage

#### Messaging / streaming

| Platform | Replication | Topology | DR support | Max distance |
|----------|-------------|----------|------------|--------------|
| **Apache Kafka** | MirrorMaker 2, Confluent Cluster Linking, KRaft quorum | Active-Passive (MM2), Active-Active (Cluster Linking) | MM2: async, Cluster Linking: async | unlimited |
| **RabbitMQ** | Classic Queue Mirroring, Quorum Queues | Active-Passive (Warm Standby) | Federation / Shovel (async) | unlimited |
| **Red Hat AMQ** | (Artemis) Cluster + HA | Active-Passive (shared store / replication) | Live-backup pair | < 100 km (sync) |
| **NATS** | NATS JetStream (cluster + cross-account) | Active-Active (Leaf nodes, cross-account) | Super-cluster, failover | unlimited |
| **Apache Pulsar** | BookKeeper (rack-aware bookie placement), geo-replication | Active-Active (geo-replication) | Built-in (cluster-level) | unlimited (async) |
| **AWS SQS/SNS** | Managed, AWS region pairs | Active-Active (multi-region) | Built-in (AWS managed) | unlimited |
| **Azure Service Bus** | Managed, paired region | Active-Passive (paired region) | Built-in (geo-recovery) | unlimited |
| **Oracle Service Bus (OSB)** | Oracle WebLogic Cluster + JDBC store + AQ | Active-Passive (WebLogic Cluster + Data Guard) | OSB/WLS cluster + Oracle RAC/Data Guard sync | < 100 km (Data Guard sync), unlimited (async) |

**Messaging DR recommendations:**
- **Kafka**: use Cluster Linking for Active-Active or MirrorMaker 2 for Active-Passive; replicate only critical topics
- **RabbitMQ**: Quorum Queues + Federation upstream for DR; avoid Classic Queue Mirroring (deprecated)
- **Pulsar**: native geo-replication, rack-aware bookie placement for a stretched cluster; the easiest DR among the messaging platforms listed
- **OSB**: WebLogic cluster + Oracle RAC/Data Guard; DR depends on the DB layer, not on OSB itself

### Per-layer limitations summary table

| Layer | Limiting factor for secondary DC | Max distance for sync | Impact on topology selection |
|-------|----------------------------------|-----------------------|------------------------------|
| **Storage** | Sync mirror latency, DWDM cost | < 50 km (MetroCluster) | Stretched cluster only in metro |
| **Databases** | Commit wait for sync replication | < 100 km (5 ms RTT) | Active-Active only with multi-master DB |
| **Hypervisor** | Stretched cluster quorum + fencing | < 50 km (vSAN, 5 ms) | MetroCluster / stretched cluster |
| **Kubernetes** | Velero restore time, Rook mirror latency | unlimited (async) | Active-Passive, cold standby |
| **Messaging** | Replication lag, offset management | unlimited (async) | Active-Active (Kafka, Pulsar, NATS) or Active-Passive |
| **Network** | Dark fiber/DWDM cost, latency | < 100 km (metro fiber) | Limits sync replication options |
| **Application** | Stateful/stateless, connection draining | depends on architecture | Stateless app → any topology |

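Read together, the rows above imply that the tightest layer caps the synchronous distance of the whole stack. A minimal Python sketch of that reading follows; the limits are copied from the table, and `binding_layer` is a hypothetical helper name.

```python
# Synchronous distance limits in km, copied from the summary table above.
# Layers listed as "unlimited (async)" or architecture-dependent are left out.
SYNC_LIMIT_KM = {
    "Storage": 50,
    "Databases": 100,
    "Hypervisor": 50,
    "Network": 100,
}


def binding_layer(limits: dict[str, int]) -> tuple[str, int]:
    """Return the layer with the tightest limit; it caps the whole stack."""
    layer = min(limits, key=limits.get)
    return layer, limits[layer]


print(binding_layer(SYNC_LIMIT_KM))  # ('Storage', 50)
# Without a stretched storage/hypervisor layer, the database limit applies instead.
print(binding_layer({"Databases": 100, "Network": 100}))  # ('Databases', 100)
```

When two layers share the same limit, the helper reports the first one listed, so the dictionary order decides the label but not the distance.
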
## Disk monitoring — S.M.A.R.T.

S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) provides predictive failure monitoring for HDDs and SSDs.

277
DATACENTERS.md
@@ -658,6 +658,281 @@ flowchart TD
|
||||
CLIM -->|"Chladná (SE, NO)"| FC3["Free cooling 7000+ h/rok<br/>Air-side economizer<br/>PUE < 1.2"]
|
||||
```
|
||||
|
||||
## Topologie sekundárního datového centra
|
||||
|
||||
Při plánování druhého DC je klíčová volba topologie podle vzdálenosti, RPO/RTO a rozpočtu.
|
||||
|
||||
### Klasifikace vzdáleností
|
||||
|
||||
| Kategorie | Vzdálenost | Latence (round-trip) | Use case |
|
||||
|-----------|-----------|---------------------|----------|
|
||||
| **Metro (Campus)** | 1–20 km | < 1 ms | Synchronní replikace, stretched cluster |
|
||||
| **Metro** | 20–100 km | 1–5 ms | Metro cluster, většinou sync replikace |
|
||||
| **Regional** | 100–500 km | 5–20 ms | Asynchronní replikace, warm standby |
|
||||
| **Continent** | 500–3000 km | 20–100 ms | Asynchronní replikace, cold standby |
|
||||
| **Global** | 3000+ km | > 100 ms | Pouze async, žádné real-time závislosti |
|
||||
|
||||
### Topologie podle provozního režimu
|
||||
|
||||
#### Active-Active (Hot-Hot)
|
||||
|
||||
```
|
||||
DC-A (Primary) DC-B (Active)
|
||||
┌────────────────────┐ ┌────────────────────┐
|
||||
│ App Active │ │ App Active │
|
||||
│ DB Active │◄─sync─►│ DB Active │
|
||||
│ Users → LB → A │ │ Users → LB → B │
|
||||
└────────────────────┘ └────────────────────┘
|
||||
│ │
|
||||
└──── Global Load Balancer ────┘
|
||||
```
|
||||
|
||||
| Parametr | Hodnota |
|
||||
|----------|---------|
|
||||
| **RTO** | 0–vteřiny (automatický failover, traffic se přesměruje) |
|
||||
| **RPO** | 0 (sync replikace, commit je potvrzen až po zápisu do obou DC) |
|
||||
| **Max distance** | < 100 km (latence < 5 ms RTT pro sync DB replikaci) |
|
||||
| **Provozní náklady** | 2× (obě DC plně aktivní, obě plně vybavené) |
|
||||
| **Výhody** | Nulový výpadek, okamžité přepnutí, plné využití obou DC |
|
||||
| **Nevýhody** | Nutná synchronní replikace → limit vzdálenosti, komplexní networking, split-brain risk |
|
||||
|
||||
**Split-brain řešení**: STONITH (Shoot The Other Node In The Head), watchdog, quorum (3. node v 3. lokaci / cloud), fencing, SCSI-3 persistent reservation.
|
||||
|
||||
**Use case**: Finanční služby, telco, platební brány — kde i minuta výpadku = miliony.
|
||||
|
||||
#### Active-Passive (Hot-Warm, MetroCluster)
|
||||
|
||||
```
|
||||
DC-A (Primary) DC-B (Standby)
|
||||
┌────────────────────┐ ┌────────────────────┐
|
||||
│ App Active │ │ App Standby │
|
||||
│ DB Primary │──sync──►│ DB Standby │
|
||||
│ Users → LB → A │ │ ~~~ (čeká) ~~~ │
|
||||
│ DNS: A-record │ │ DNS: health check │
|
||||
└────────────────────┘ └────────────────────┘
|
||||
```
|
||||
|
||||
| Parametr | Hodnota |
|
||||
|----------|---------|
|
||||
| **RTO** | desítky vteřin–minuty (DNS failover + startup App) |
|
||||
| **RPO** | 0 (sync) nebo sekundy (async) |
|
||||
| **Max distance** | sync < 100 km, async neomezeně |
|
||||
| **Provozní náklady** | 1,5–1,8× (druhé DC má zmenšený nebo idle compute) |
|
||||
| **MetroCluster** | Specifická implementace: FC SAN přes DWDM, sync mirror, automatický failover |
|
||||
|
||||
**MetroCluster** (NetApp, Dell EMC, HPE):
|
||||
- Storage-based cluster se synchronním mirroringem mezi DC
|
||||
- Automatic failover při selhání celého DC
|
||||
- Vyžaduje dedikované DWDM nebo dark fiber propojení
|
||||
- Typická vzdálenost: do 50 km (pro latenci < 1 ms RTT)
|
||||
- Use case: enterprise storage, primary+secondary DC v metropolitní oblasti
|
||||
|
||||
#### Hot-Cold (Warm Standby → Cold)
|
||||
|
||||
```
|
||||
DC-A (Primary) DC-B (Cold Standby)
|
||||
┌────────────────────┐ ┌────────────────────┐
|
||||
│ App Active │ │ ~~~ powered off ~~~│
|
||||
│ DB Active │──async─►│ Backup storage │
|
||||
│ Users → A │ │ ~~~ no compute ~~~│
|
||||
└────────────────────┘ └────────────────────┘
|
||||
```
|
||||
|
||||
| Parametr | Hodnota |
|
||||
|----------|---------|
|
||||
| **RTO** | hodiny–dny (nákup/najmutí HW, obnova z backupu) |
|
||||
| **RPO** | hodiny (poslední backup) |
|
||||
| **Max distance** | neomezena |
|
||||
| **Provozní náklady** | 1,1–1,3× (jen storage a facility, compute až při failoveru) |
|
||||
| **Typ use case** | Low-cost DR, compliance, poslední záchrana |
|
||||
|
||||
#### Pilot Light
|
||||
|
||||
```
|
||||
DC-A (Primary) DC-B (Pilot Light)
|
||||
┌────────────────────┐ ┌────────────────────┐
|
||||
│ App Active │ │ ~~~ off ~~~ │
|
||||
│ DB Active │──async─►│ DB replica (mini) │
|
||||
│ Všechny služby │ │ Core services jen │
|
||||
│ │ │ (DNS, LDAP, mon) │
|
||||
└────────────────────┘ └────────────────────┘
|
||||
Při DR: spin-up compute
|
||||
z IaC, zbytek z backupu
|
||||
```
|
||||
|
||||
- DC-B běží s minimem compute (jen core služby a DB replica)
|
||||
- Aplikační vrstva se spin-up z IaC (Terraform, Ansible) až při DR
|
||||
- Kompromis mezi náklady a RTO
|
||||
|
||||
### Srovnávací tabulka
|
||||
|
||||
| Topologie | RTO | RPO | Náklady (× primár) | Max distance | Failover |
|
||||
|-----------|-----|-----|-------------------|-------------|----------|
|
||||
| **Active-Active** | 0–s | 0 | 2,0× | < 100 km | Auto (traffic) |
|
||||
| **MetroCluster** | s–min | 0 | 1,8–2,0× | < 50 km | Auto (storage) |
|
||||
| **Active-Passive (sync)** | min | 0 | 1,5–1,8× | < 100 km | Polo-auto |
|
||||
| **Active-Passive (async)** | min–h | s–min | 1,3–1,5× | neomezena | Polo-auto |
|
||||
| **Pilot Light** | h | min–h | 1,2–1,4× | neomezena | Manuální |
|
||||
| **Warm Standby** | min–h | s–min | 1,5–1,8× | neomezena | Polo-auto |
|
||||
| **Cold Standby** | dny | h | 1,1–1,3× | neomezena | Manuální |
|
||||
|
||||
### Stretched Cluster
|
||||
|
||||
```
|
||||
┌──── Site A (50 km) ────┐ ┌──── Site B ──────────┐
|
||||
│ ┌──────────────────┐ │ │ ┌──────────────────┐ │
|
||||
│ │ ESXi / Hyper-V │ │ │ │ ESXi / Hyper-V │ │
|
||||
│ │ VM │ │ │ │ VM (komplement) │ │
|
||||
│ └────────┬─────────┘ │ │ └────────┬─────────┘ │
|
||||
│ │ │ │ │ │
|
||||
│ ┌────────▼─────────┐ │ │ ┌────────▼─────────┐ │
|
||||
│ │ Storage (SAN) │──┼────┼──│ Storage (SAN) │ │
|
||||
│ │ MetroCluster │ │ │ │ MetroCluster │ │
|
||||
│ └──────────────────┘ │ │ └──────────────────┘ │
|
||||
└────────────────────────┘ └────────────────────────┘
|
||||
│
|
||||
┌─────▼──────┐
|
||||
│ vCenter / │
|
||||
│ Cluster │
|
||||
│ (single) │
|
||||
└────────────┘
|
||||
```
|
||||
|
||||
- Jeden cluster roztažený přes dvě lokality (single management domain)
|
||||
- VM mohou live-migrovat mezi site (vMotion nad vzdálenost)
|
||||
- Storage synchronně mirrorovaná (MetroCluster, VPLEX, vSAN延伸)
|
||||
- **Požadavky**: dark fiber / DWDM, nízká latence (< 5 ms), vysoká spolehlivost linky
|
||||
- **Riziko**: split-brain, brain drain (split-site cluster), závislost na síti
|
||||
- **Use case**: enterprise s vlastní dark fiber mezi dvěma DC v metropolitní oblasti
|
||||
|
||||
### Rozhodovací strom
|
||||
|
||||
```mermaid
|
||||
flowchart TD
|
||||
Start(["Sekundární DC"]) --> RPO{"Požadované RPO?"}
|
||||
RPO -->|"0 (žádná ztráta dat)"| SYNC{"Sync replikace možná?"}
|
||||
SYNC -->|"Ano, < 100 km"| ACT{"Chceš nulový výpadek?"}
|
||||
ACT -->|"Ano"| AA["Active-Active<br/>RTO=0, RPO=0, 2× náklady"]
|
||||
ACT -->|"Ne"| AP["Active-Passive<br/>RTO=min, RPO=0, 1,5×"]
|
||||
SYNC -->|"Ne, > 100 km"| ASYNC["Active-Passive (async)<br/>RTO=min, RPO=s, 1,3×"]
|
||||
|
||||
RPO -->|"minuty–hodiny"| WARM{"Chceš rychlý failover?"}
|
||||
WARM -->|"Ano"| PILOT["Pilot Light<br/>RTO=h, RPO=min, 1,2×"]
|
||||
WARM -->|"Ne"| COLD["Cold Standby<br/>RTO=dny, RPO=h, 1,1×"]
|
||||
|
||||
Start --> DIST{"Vzdálenost mezi DC"}
|
||||
DIST -->|"< 50 km, vlastní fiber"| MC["MetroCluster / Stretched Cluster<br/>Single management, sync storage"]
|
||||
DIST -->|"50–300 km"| REG["Regionální DR<br/>Active-Passive, async replikace"]
|
||||
DIST -->|"> 300 km"| GLOBAL["Globální DR<br/>Cold standby, backup & restore"]
|
||||
```
|
||||
|
||||
### Fyzická infrastruktura pro propojení DC
|
||||
|
||||
| Technologie | Bandwidth | Max distance | Latence | Use case |
|
||||
|------------|-----------|-------------|---------|----------|
|
||||
| **Dark fiber** | 100 GbE–800 GbE | 10–80 km (single-mode) | < 0,1 ms | MetroCluster, stretched cluster |
|
||||
| **DWDM** | 400 GbE–1,6 TbE (per lambda) | 80–120 km (bez zesilovače) | < 0,5 ms | Metro, metro cluster |
|
||||
| **CWDM** | 10–25 GbE (per channel) | 10–40 km | < 0,3 ms | Campus, menší metro |
|
||||
| **MPLS L2VPN** | 10–100 GbE | neomezena | 1–10 ms | Regional DR, async replikace |
|
||||
| **Internet IPsec** | 1–10 GbE | neomezena | 5–50 ms | Cold standby, backup |
|
||||
|
||||
### Vliv jednotlivých technologií na výběr DC topologie
|
||||
|
||||
Volba topologie sekundárního DC není čistě infrastrukturní rozhodnutí — každá vrstva (DB, hypervisor, orchestrace, messaging) přináší vlastní omezení.
|
||||
|
||||
#### Databáze
|
||||
|
||||
| DB technologie | Sync replikace | Max distance | Auto-failover | Split-brain řešení | Poznámka |
|
||||
|---------------|---------------|-------------|---------------|-------------------|----------|
|
||||
| **PostgreSQL** | Synchronous commit (synchronous_standby_names) | < 100 km (latence < 10 ms) | Patroni / repmgr + etcd | Quorum (etcd, 3+ node) | Streaming replication, nutné wal_keep_segments |
|
||||
| **MySQL** | Group Replication (multi-primary, single-primary) | < 100 km | MySQL InnoDB Cluster + MySQL Router | Paxos (Group Replication, 3+ node) | Semi-sync jako kompromis |
|
||||
| **Oracle** | Data Guard (SYNC/FASTSYNC/ASYNC), RAC extended | sync < 100 km, async neomezena | Data Guard Broker / FSFO (Fast Start Failover) | Observer (3. node) | Far Sync pro vzdálená DC |
|
||||
| **MSSQL** | AlwaysOn Availability Groups (SYNCHRONOUS_COMMIT) | < 100 km | AlwaysOn + Cluster quorum | File share majority / cloud witness | Multi-site cluster podpora |
|
||||
| **MongoDB** | Majority write concern + journaling | < 100 km | Replica set auto-election | Arbitration node (voting member) | Priority-based failover |
|
||||
| **Cassandra** | N/A (multi-master, eventual consistency) | neomezena | Ano (peer-to-peer) | Žádné (multi-master, gossip protokol) | Snitch-aware topologie, NetworkTopologyStrategy |
|
||||
| **Redis** | Redis Sentinel / Redis Cluster (async) | neomezena (async) | Sentinel / Cluster failover | Quorum (Sentinel, majority) | PSYNC replikace, replication lag |
|
||||
|
||||
Klíčové omezení pro **sync replikaci**: latence < 5 ms RTT (commit musí počkat na potvrzení z obou DC). Při 100 km je RTT ~1 ms – v pořádku. Při 1000 km (~10 ms RTT) sync replikace snižuje výkon transakcí o 80+ %.
|
||||
|
||||
Pro **Active-Active** jsou vhodné:
|
||||
- **Cassandra / ScyllaDB** — nativní multi-DC, eventual consistency, žádný split-brain
|
||||
- **MySQL Group Replication (multi-primary)** — 3+ DC pro kvorum
|
||||
- **CockroachDB / TiDB** — nativní multi-region, ACID napříč DC
|
||||
- **Redis Enterprise** — Active-Active (CRDT-based)
|
||||
|
||||
Pro **Active-Passive** jsou vhodné:
|
||||
- **PostgreSQL + Patroni** — auto-failover, etcd kvorum
|
||||
- **Oracle Data Guard** — FSFO, far sync pro vzdálené DC
|
||||
- **MSSQL AlwaysOn** — cloud witness
|
||||
- **MongoDB Replica Set** — arbitration node v 3. lokaci
|
||||
|
||||
#### Hypervisory
|
||||
|
||||
| Hypervisor | Cluster technologie | Stretched cluster | Max distance | Split-brain |
|
||||
|-----------|-------------------|-------------------|-------------|-------------|
|
||||
| **VMware vSphere** | vSAN延伸, Metro vCenter, Site Recovery Manager | Ano (vSAN延伸, Metro Cluster) | < 50 km (vSAN延伸), < 10 ms RTT | Fencing (STONITH), witness host |
|
||||
| **Hyper-V** | Storage Replica + Failover Cluster | Ano (Cluster Sets) | < 50 km (sync), neomezena (async) | File share witness / cloud witness |
|
||||
| **Proxmox VE** | Proxmox HA + Ceph | Omezeně (Ceph stretch cluster) | < 50 km (Ceph sync) | Ceph monitor quorum (3+ DC) |
|
||||
| **XCP-ng / XenServer** | Xen Orchestra HA + SR (Storage Repository) replication | Omezeně | závisí na storage replikaci | — |
|
||||
| **Nutanix AHV** | Metro Availability (sync), Async DR | Ano (Metro) | < 100 km (sync), neomezena (async) | Witness VM (cloud / 3. site) |
|
||||
| **KVM / oVirt** | oVirt HA + GlusterFS / NFS | Omezeně | závisí na storage replikaci | — |
|
||||
|
||||
**vSAN延伸** specifické požadavky:
|
||||
- Dedikovaná síť pro vSAN (25 GbE min., < 5 ms RTT)
|
||||
- Witness host v 3. lokaci (nebo cloud witness)
|
||||
- Všechny VM protokoly (FTT=1, mirroring striped)
|
||||
- Storage policy: `site-A + site-B + witness`
|
||||
|
||||
#### Kubernetes a kontejnerové platformy
|
||||
|
||||
| Platforma | Multi-cluster DR | Replikace | Max distance | Failover |
|
||||
|-----------|-----------------|-----------|-------------|----------|
|
||||
| **Vanilla K8s** | KubeFed, Cluster API, Velero + Restic | Velero (backup/restore), Rook (Ceph) | neomezena | Manuální (Velero restore) |
|
||||
| **OpenShift** | ACM (Advanced Cluster Management), Velero | OADP (OpenShift API for Data Protection) | neomezena | ACM failover (subscription) |
|
||||
| **Rancher** | Rancher Multi-Cluster App, Velero | Longhorn (sync/async DR), Velero | neomezena | Polo-auto |
|
||||
| **Google GKE** | Multi-cluster Services, Backup for GKE | Config Sync, Backup for GKE | neomezena | Manuální |
|
||||
| **Azure AKS** | Azure ARC + Velero + Azure Traffic Manager | AKS backup (velero), Azure Site Recovery | neomezena | Manuální (Velero) |
|
||||
| **AWS EKS** | EKS multi-cluster, Velero + S3 cross-region | Velero (S3), Rook (EBS snapshots) | neomezena | Manuální |
|
||||
|
||||
**Klíčové principy K8s DR:**
|
||||
- **Aplikace musí být stateless** (nebo state externalizovaný do DB/storage)
|
||||
- **Velero** — backup/restore celého clusteru (PV, resources, helm releases)
|
||||
- **Rook/Ceph** — cross-region mirroring RBD volumes
|
||||
- **KubeFed / ACM** — subscription-based deploy do více clusterů
|
||||
- **Ingress/Gateway API** — traffic routing mezi clustery
|
||||
- **External DNS** — DNS failover při výpadku clusteru
|
||||
|
||||
#### Messaging / streaming
|
||||
|
||||
| Platform | Replication | Topology | DR support | Max distance |
|-----------|-----------|-----------|------------|-------------|
| **Apache Kafka** | MirrorMaker 2, Confluent Cluster Linking, KRaft quorum | Active-Passive (MM2), Active-Active (Cluster Linking) | MM2: async, Cluster Linking: async | unlimited |
| **RabbitMQ** | Classic Queue Mirroring, Quorum Queues | Active-Passive (Warm Standby) | Federation / Shovel (async) | unlimited |
| **Red Hat AMQ** | (Artemis) Cluster + HA | Active-Passive (shared store / replication) | Live-backup pair | < 100 km (sync) |
| **NATS** | NATS JetStream (cluster + cross-account) | Active-Active (Leaf nodes, cross-account) | Super-cluster, failover | unlimited |
| **Apache Pulsar** | BookKeeper (bookie rack-aware), geo-replication | Active-Active (geo-replication) | Built-in (cluster-level) | unlimited (async) |
| **AWS SQS/SNS** | Managed, AWS region pairs | Active-Active (multi-region) | Built-in (AWS managed) | unlimited |
| **Azure Service Bus** | Managed, paired region | Active-Passive (paired region) | Built-in (geo-recovery) | unlimited |
| **Oracle Service Bus (OSB)** | Oracle WebLogic Cluster + JDBC store + AQ | Active-Passive (WebLogic Cluster + Data Guard) | OSB/WLS cluster + Oracle RAC/Data Guard sync | < 100 km (Data Guard sync), unlimited (async) |
|
||||
|
||||
**Recommendations for messaging DR:**

- **Kafka**: use Cluster Linking for Active-Active or MirrorMaker 2 for Active-Passive; replicate only critical topics
- **RabbitMQ**: Quorum Queues + Federation upstream for DR; avoid Classic Queue Mirroring (deprecated)
- **Pulsar**: native geo-replication, rack-aware bookies for a stretch cluster; the simplest DR among the messaging platforms
- **OSB**: WebLogic cluster + Oracle RAC/Data Guard; DR depends on the DB layer, not on OSB itself
|
||||
|
||||
### Main constraints per layer (summary table)

| Layer | Limiting factor for the secondary DC | Max distance for sync | Impact on topology choice |
|--------|-----------------------------------|----------------------|--------------------------|
| **Storage** | Sync mirror latency, DWDM cost | < 50 km (MetroCluster) | Stretched cluster only within a metro area |
| **Database** | Commit wait for sync replication | < 100 km (5 ms RTT) | Active-Active only with a DB that supports multi-master |
| **Hypervisor** | Stretched cluster quorum + fencing | < 50 km (vSAN, 5 ms) | MetroCluster / stretched cluster |
| **Kubernetes** | Velero restore time, Rook mirror latency | unlimited (async) | Active-Passive, cold standby |
| **Messaging** | Replication lag, offset management | unlimited (async) | Active-Active (Kafka, Pulsar, NATS) or Active-Passive |
| **Network** | Dark fiber/DWDM cost, latency | < 100 km (metro fiber) | Limits the options for sync replication |
| **Application** | Stateful/stateless, connection draining | depends on architecture | Stateless app → any topology |
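
The distance limits in the table can be sanity-checked against pure propagation delay. The sketch below is a rough estimate only: it assumes light travels through fiber at about 200,000 km/s (roughly 5 µs per km one way), and the function names and the `equipment_overhead_ms` parameter are hypothetical. Real links add switching, DWDM, and protocol overhead, and fiber routes are longer than the straight-line distance.

```python
# Assumption: ~5 microseconds per kilometre of fiber, one way.
FIBER_US_PER_KM = 5.0


def propagation_rtt_ms(fiber_km: float) -> float:
    """Lower bound on the round-trip time over a fiber path of the given length."""
    if fiber_km < 0:
        raise ValueError("fiber_km must be non-negative")
    return 2 * fiber_km * FIBER_US_PER_KM / 1000.0


def sync_replication_feasible(fiber_km: float, rtt_budget_ms: float,
                              equipment_overhead_ms: float = 0.0) -> bool:
    """True if propagation delay plus an assumed equipment overhead fits the RTT budget."""
    return propagation_rtt_ms(fiber_km) + equipment_overhead_ms <= rtt_budget_ms


if __name__ == "__main__":
    for km in (20, 50, 100, 500):
        print(f"{km:>4} km -> {propagation_rtt_ms(km):.2f} ms RTT (propagation only)")
```

Propagation alone uses only part of a 5 ms budget at 100 km; the rest is headroom for equipment latency and the extra round trips a synchronous commit may need.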
|
||||
|
||||
## Disk monitoring: S.M.A.R.T.

Self-Monitoring, Analysis and Reporting Technology: predictive monitoring of HDDs/SSDs.
|
||||
@@ -785,4 +1060,4 @@ OpenStack přináší do DC softwarovou abstrakční vrstvu, která umožňuje m
|
||||
- Academic / HPC clusters (Ironic, Cyborg, Manila)
- Government / regulated environments (on-prem, audit trail)
|
||||
|
||||
*Last revision: 2026-06-03*
*Last revision: 2026-06-12*
|
||||
|
||||
246
DC-MIGRATION.en.md
Normal file
@@ -0,0 +1,246 @@
|
||||
# 🏗️ Data Center Migration
|
||||
|
||||
## Migration strategies
|
||||
|
||||
| Strategy | RTO | RPO | Risk | Cost | Duration | Description |
|
||||
|-----------|-----|-----|--------|---------|-------------|-------|
|
||||
| **Cold / Big Bang** | hours–days | days | High | Low | days | Shut everything down, move, power up |
|
||||
| **Phased / Wave** | minutes (per wave) | minutes | Medium | Medium | weeks–months | Workloads moved in waves |
|
||||
| **Rolling** | 0 (live) | 0 | Low | High | months | Live migration per VM/service |
|
||||
| **Parallel Run** | 0 | 0 | Low | Very high | months | Both DCs operational, gradual cutover |
|
||||
| **Pilot Light** | hours | minutes | Medium | Low | weeks | Critical services in new DC, rest migrates |
|
||||
| **Lift & Shift** | hours | minutes | Medium | Low | weeks | VMs/servers moved without configuration changes |
|
||||
| **Re-platform** | hours | minutes | Low | Medium | months | Optimization during migration (OS upgrade, resize) |
|
||||
| **Re-architect** | 0 | 0 | Low | High | months–years | Application redesigned for new platform |
|
||||
|
||||
---
|
||||
|
||||
## Decision tree
|
||||
|
||||
```mermaid
flowchart TD
    Start(["DC Migration"]) --> APP{"Application<br/>stateful?"}
    APP -->|"Yes"| DOWNTIME{"Tolerates<br/>downtime?"}
    APP -->|"No"| ROLLING["Rolling / Parallel Run"]

    DOWNTIME -->|"Yes, hours+"| COLD["Cold / Big Bang<br/>Simplest, cheapest<br/>Risk: all at once"]
    DOWNTIME -->|"Yes, minutes"| PHASED["Phased / Wave<br/>By application / business unit"]
    DOWNTIME -->|"No (zero downtime)"| SYNC{"Sync replication<br/>possible?"}

    SYNC -->|"Yes, < 100 km"| ROLLING
    SYNC -->|"No"| PARALLEL["Parallel Run<br/>Both DCs active, gradual cutover"]

    ROLLING --> ROLL_HA{"VMware,<br/>Hyper-V?"}
    ROLL_HA -->|"Yes"| VMOTION["vMotion / Storage vMotion<br/>Live migration, 0 downtime"]
    ROLL_HA -->|"No"| ROLL_REPL["Storage + DB replication<br/>Gradual workload migration"]
```
|
||||
|
||||
---
|
||||
|
||||
## Migration phases
|
||||
|
||||
### 1. Discovery and assessment
|
||||
|
||||
| Task | Tools | Output |
|
||||
|------|----------|--------|
|
||||
| HW and SW inventory | RVTools, NetBox, CMDB | Server, VM, and service list |
|
||||
| Dependency mapping | ServiceNow, AppDynamics, manual | Application dependency graph |
|
||||
| Traffic analysis | NetFlow, sFlow, vRNI | Bandwidth, latency, peak usage |
|
||||
| Performance baseline | Prometheus, Zabbix, vRealize | CPU/RAM/disk/network per workload |
|
||||
| License audit | Flexera, SAM | Licenses, support, compliance |
|
||||
|
||||
**Output:** workload list with RTO/RPO, dependencies, and criticality.
|
||||
|
||||
### 2. Planning
|
||||
|
||||
- **Wave plan** — workload division into migration waves (10–50 VMs per wave)
|
||||
- **Dependency ordering** — DNS, NTP, LDAP, PKI first
|
||||
- **Cutover window** — time window for switching (typically weekend)
|
||||
- **Rollback plan** — conditions and procedure for reversal
|
||||
- **Test plan** — what and how to test post-migration
|
||||
- **Communication plan** — who, when, how is informed
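
The wave plan and the dependency ordering can be derived from the dependency graph produced during discovery. The following is a minimal sketch, assuming the graph is available as a mapping from each workload to the workloads it depends on; the function name `plan_waves` and the sample workloads are hypothetical. It uses the standard-library `graphlib` module (Python 3.9+) and splits any level that exceeds the per-wave limit.

```python
from graphlib import TopologicalSorter


def plan_waves(dependencies: dict[str, set[str]], max_per_wave: int = 50) -> list[list[str]]:
    """Group workloads into waves so each one migrates after everything it depends on."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves: list[list[str]] = []
    while sorter.is_active():
        # Everything returned here has all of its dependencies already migrated.
        ready = sorted(sorter.get_ready())
        for start in range(0, len(ready), max_per_wave):
            waves.append(ready[start:start + max_per_wave])
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    deps = {
        "dns": set(),
        "ntp": set(),
        "ldap": {"dns", "ntp"},
        "db": {"ldap"},
        "app": {"db", "ldap"},
    }
    for number, wave in enumerate(plan_waves(deps), start=1):
        print(f"Wave {number}: {', '.join(wave)}")
```

A circular dependency surfaces as an exception rather than a silently wrong order, which is usually the preferable failure mode when the input comes from an incomplete CMDB.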
|
||||
|
||||
### 3. New DC preparation
|
||||
|
||||
- **Infrastructure** — DNS, NTP, DHCP, LDAP/AD, PKI, monitoring (see [DATACENTERS.en.md](DATACENTERS.en.md) — deployment order)
|
||||
- **Network** — BGP peering, VXLAN/VLAN, firewall rules, load balancers
|
||||
- **Storage** — SAN zoning, NAS exports, Ceph cluster
|
||||
- **Virtualization** — vCenter, Hyper-V cluster, Proxmox
|
||||
|
||||
### 4. Replication and synchronization
|
||||
|
||||
| Layer | Method | Tools |
|
||||
|--------|--------|----------|
|
||||
| **Storage (block)** | SAN sync/async mirror, LUN replication | NetApp SnapMirror, Dell EMC RecoverPoint, Pure ActiveCluster |
|
||||
| **Storage (file)** | DFS-R, rsync, robocopy | Windows DFS, Rsync |
|
||||
| **Storage (object)** | Cross-region replication | MinIO replication, S3 CRR |
|
||||
| **Databases** | Log shipping, CDC, streaming replication | PostgreSQL Patroni, Oracle Data Guard, MSSQL AlwaysOn, MySQL Group Replication |
|
||||
| **VM** | Storage vMotion, replication | VMware vSphere Replication, Hyper-V Replica, Zerto |
|
||||
| **Kubernetes** | Velero + Restic, Rook Ceph mirror | Velero, Rook |
|
||||
|
||||
### 5. Workload migration
|
||||
|
||||
#### Wave migration (recommended for medium/large DCs)
|
||||
|
||||
```mermaid
gantt
    title Wave migration
    dateFormat YYYY-MM-DD
    section Wave 1 - Core
    DNS, NTP, LDAP :done, w1a, 2026-07-01, 3d
    Monitoring + logging :done, w1b, after w1a, 2d
    section Wave 2 - Network
    Load balancers :active, w2a, 2026-07-06, 2d
    Firewalls :active, w2b, 2026-07-08, 2d
    section Wave 3 - Storage
    NAS migration :w3a, 2026-07-10, 5d
    SAN replication :w3b, 2026-07-10, 3d
    section Wave 4 - Dev/Test
    Dev VMs :w4a, 2026-07-15, 5d
    section Wave 5 - Prod tier 3
    Internal apps :w5a, 2026-07-22, 5d
    section Wave 6 - Prod tier 2
    Business apps :w6a, 2026-07-29, 5d
    section Wave 7 - Prod tier 1
    Critical apps :w7a, 2026-08-05, 5d
```
|
||||
|
||||
#### Typical single wave procedure:

1. **Day -7**: Start data replication (initial seed)
2. **Day -1**: Incremental sync, final test
3. **Day 0 (cutover)**:
   - Stop application in source DC
   - Final sync (last delta)
   - Start application in target DC
   - DNS/Traffic switch
   - Smoke test
4. **Day +1**: Monitoring (performance, errors, lag)
5. **Day +7**: Rollback window end (success confirmation)
|
||||
|
||||
### 6. Network strategies
|
||||
|
||||
#### IP re-addressing
|
||||
|
||||
| Approach | Description | Pros | Cons |
|
||||
|---------|-------|--------|----------|
|
||||
| **Keep IP** | Same IPs, BGP anycast or stretch VLAN | No application config changes | Stretched VLAN/L2 limitations |
|
||||
| **Change IP** | New IP range, DNS/BGP routing change | Clean architecture | Config changes, DNS TTL |
|
||||
| **NAT translation** | NAT between old and new IP space | No application changes | Latency, troubleshooting complexity |
|
||||
|
||||
**Keep IP** is only possible with:
|
||||
- L2 stretch between DCs (VXLAN, OTV) — distance limited
|
||||
- BGP anycast for VIPs (load balancers)
|
||||
- Applications tolerant to ARP cache changes
|
||||
|
||||
#### DNS cutover
|
||||
|
||||
```
1. Lower TTL to 60–300 s (one week ahead)
2. At cutover, change A/AAAA records to new IPs
3. Wait for propagation (per TTL)
4. Monitor traffic
```
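
The timing behind these steps can be worked out from the TTL values. The sketch below assumes resolvers honor TTLs (some clients and middleboxes cache longer); the function names are hypothetical. The TTL must be lowered at least one old TTL before cutover, and after the record change the old address can linger for up to one lowered TTL.

```python
from datetime import datetime, timedelta


def latest_ttl_lowering_time(cutover: datetime, old_ttl_s: int) -> datetime:
    """Latest moment to lower the TTL so caches holding the old TTL expire before cutover."""
    return cutover - timedelta(seconds=old_ttl_s)


def old_record_gone_by(cutover: datetime, lowered_ttl_s: int) -> datetime:
    """Earliest time after which TTL-honoring resolvers no longer serve the old record."""
    return cutover + timedelta(seconds=lowered_ttl_s)


if __name__ == "__main__":
    cutover = datetime(2026, 7, 4, 22, 0)
    print("Lower TTL no later than:", latest_ttl_lowering_time(cutover, old_ttl_s=86400))
    print("Old record expired by:  ", old_record_gone_by(cutover, lowered_ttl_s=300))
```

Lowering the TTL a week ahead, as in step 1, leaves a wide margin over this minimum for any record whose original TTL is a day or less.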
|
||||
|
||||
#### Traffic steering
|
||||
|
||||
| Technique | Use case |
|
||||
|----------|----------|
|
||||
| **BGP** | Change AS path / local pref for traffic steering |
|
||||
| **DNS** | Lower TTL, change A records |
|
||||
| **Load balancer** | Change pool members, health check |
|
||||
| **GSLB** | Global Server Load Balancing (F5 GTM, NSX ALB) |
|
||||
| **Cloud DNS** | AWS Route53, Azure Traffic Manager, Google Cloud DNS |
|
||||
|
||||
### 7. Database migration
|
||||
|
||||
See individual DB files for details. Summary table:
|
||||
|
||||
| DB | Method | RPO | RTO | Note |
|
||||
|----|--------|-----|-----|----------|
|
||||
| **PostgreSQL** | Streaming replication + Patroni switchover | 0 (sync) / ~MB (async) | min | Patroni auto-failover |
|
||||
| **MySQL** | Group Replication / async replication | 0 (sync) / seconds | min | InnoDB Cluster |
|
||||
| **Oracle** | Data Guard switchover | 0 (sync) | min | Far sync for remote DCs |
|
||||
| **MSSQL** | AlwaysOn AG failover | 0 (sync) | min | Cloud witness |
|
||||
| **MongoDB** | Replica set election | seconds | < 1 min | Priority-based failover |
|
||||
| **Cassandra** | Multi-DC replication | eventual | 0 | Native multi-master |
|
||||
|
||||
### 8. Testing
|
||||
|
||||
| Phase | What to test | Method |
|
||||
|------|-------------|--------|
|
||||
| **Pre-migration** | Application in new DC (isolated) | Dry run on replicated data |
|
||||
| **Cutover** | Functionality, availability, latency | Smoke test, synthetic transactions |
|
||||
| **Post-migration** | Performance, integration, monitoring | A/B comparison with baseline, canary traffic |
|
||||
| **Rollback** | Return to old DC | Tested rollback plan |
|
||||
|
||||
### 9. Rollback plan
|
||||
|
||||
Each wave must have a defined rollback:
|
||||
|
||||
| Condition | Action |
|
||||
|----------|------|
|
||||
| Application fails to start in new DC | DNS switch back, stop replication |
|
||||
| Performance worse than baseline (> 20 %) | Rollback, root cause analysis |
|
||||
| Integration failure (API timeout, DB connection) | Rollback, dependency check |
|
||||
| Security incident | Rollback, forensic analysis |
|
||||
|
||||
Rollback must be tested **before** the real cutover.
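
The rollback conditions above can be encoded as an explicit check that runs after each cutover. This is only a sketch: the `WaveHealth` fields are hypothetical, and latency is used as a stand-in for "performance", which in practice would come from whatever baseline metrics were collected during discovery.

```python
from dataclasses import dataclass


@dataclass
class WaveHealth:
    """Post-cutover observations for one wave (all field names are hypothetical)."""
    app_started: bool
    baseline_latency_ms: float
    current_latency_ms: float
    integration_errors: int
    security_incident: bool


def rollback_reasons(health: WaveHealth, max_degradation: float = 0.20) -> list[str]:
    """Return the rollback conditions that are met; an empty list means no rollback."""
    reasons = []
    if not health.app_started:
        reasons.append("application failed to start")
    # Latency more than max_degradation above baseline counts as degraded performance.
    if health.current_latency_ms > health.baseline_latency_ms * (1 + max_degradation):
        reasons.append("performance worse than baseline")
    if health.integration_errors > 0:
        reasons.append("integration failure")
    if health.security_incident:
        reasons.append("security incident")
    return reasons


if __name__ == "__main__":
    observed = WaveHealth(True, 100.0, 130.0, 0, False)
    print(rollback_reasons(observed) or "no rollback needed")
```

Returning the list of reasons rather than a single boolean keeps the decision auditable in the post-migration report.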
|
||||
|
||||
---
|
||||
|
||||
## Special cases
|
||||
|
||||
### Mainframe migration
|
||||
|
||||
- **IBM z/OS** — GDPS (Geographically Dispersed Parallel Sysplex)
|
||||
- HyperSwap for storage mirroring
|
||||
- Cross-system coupling facility (XCF)
|
||||
- Often the last migrated component
|
||||
|
||||
### COTS applications (Oracle EBS, SAP)
|
||||
|
||||
- Require vendor-specific migration procedures
|
||||
- Oracle EBS: Autoconfig, cloning (ADXLC)
|
||||
- SAP: System Copy (Homogeneous / Heterogeneous), SWPM, SUM
|
||||
- License re-licensing on HW change
|
||||
|
||||
### Cloud migration (On-prem → Cloud)
|
||||
|
||||
See [CLOUD.en.md](CLOUD.en.md) — migration strategies (6 Rs):
|
||||
|
||||
| Strategy | Description |
|
||||
|-----------|-------|
|
||||
| **Re-host (Lift & Shift)** | VM → Cloud VM (AWS MGN, Azure Migrate) |
|
||||
| **Re-platform** | OS upgrade, managed DB (RDS, Cloud SQL) |
|
||||
| **Re-architect** | Application rewritten as cloud-native |
|
||||
| **Retire** | Decommission unnecessary applications |
|
||||
| **Retain** | Application stays on-prem (review later) |
|
||||
| **Repurchase** | SaaS replacement |
|
||||
|
||||
---
|
||||
|
||||
## Recommended approach per DC size
|
||||
|
||||
| DC Size | VM Count | Recommended strategy | Duration | Team |
|
||||
|-------------|----------|---------------------|-------------|-----|
|
||||
| **Small** | < 50 | Big Bang (weekend) | 2–4 days | 3–5 people |
|
||||
| **Medium** | 50–500 | Phased (5–10 waves) | 2–8 weeks | 5–10 people |
|
||||
| **Large** | 500–5000 | Phased + Rolling | 3–12 months | 10–30 people |
|
||||
| **Enterprise** | 5000+ | Parallel Run / Rolling | 12–36 months | 30+ people |
|
||||
|
||||
---
|
||||
|
||||
## Related
|
||||
|
||||
- [DATACENTERS.en.md](DATACENTERS.en.md) — DC topologies, secondary DC, deployment order
|
||||
- [CLOUD.en.md](CLOUD.en.md) — cloud migration strategies (6 Rs)
|
||||
- [DR.en.md](DR.en.md) — disaster recovery, RTO/RPO
|
||||
- [NETWORKING.en.md](NETWORKING.en.md) — BGP, DNS, VXLAN, traffic steering
|
||||
- [STORAGE.en.md](STORAGE.en.md) — storage replication
|
||||
|
||||
## Sources
|
||||
|
||||
Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
|
||||
|
||||
*Last revision: 2026-06-12*
|
||||
246
DC-MIGRATION.md
Normal file
@@ -0,0 +1,246 @@
|
||||
# 🏗️ Migrace datových center
|
||||
|
||||
## Strategie migrace
|
||||
|
||||
| Strategie | RTO | RPO | Riziko | Náklady | Doba trvání | Popis |
|
||||
|-----------|-----|-----|--------|---------|-------------|-------|
|
||||
| **Cold / Big Bang** | hodiny–dny | dny | Vysoké | Nízké | dny | Vše najednou vypnout, přesunout, zapnout |
|
||||
| **Phased / Wave** | minuty (per wave) | minuty | Střední | Střední | týdny–měsíce | Workloady po vlnách |
|
||||
| **Rolling** | 0 (live) | 0 | Nízké | Vysoké | měsíce | Live migration per VM/služba |
|
||||
| **Parallel Run** | 0 | 0 | Nízké | Velmi vysoké | měsíce | Oba DC v provozu, postupný přechod |
|
||||
| **Pilot Light** | hodiny | minuty | Střední | Nízké | týdny | Kritické služby v novém DC, ostatní se přesouvají |
|
||||
| **Lift & Shift** | hodiny | minuty | Střední | Nízké | týdny | VM/servery přesunuty bez změny konfigurace |
|
||||
| **Re-platform** | hodiny | minuty | Nízké | Střední | měsíce | Optimalizace během migrace (OS upgrade, resize) |
|
||||
| **Re-architect** | 0 | 0 | Nízké | Vysoké | měsíce–roky | Aplikace přepracována pro novou platformu |
|
||||
|
||||
---
|
||||
|
||||
## Rozhodovací strom
|
||||
|
||||
```mermaid
|
||||
flowchart TD
|
||||
Start(["Migrace DC"]) --> APP{"Aplikace\nstateful?"}
|
||||
APP -->|"Ano"| DOWNTIME{"Toleruje\nvýpadek?"}
|
||||
APP -->|"Ne"| ROLLING["Rolling / Parallel Run"]
|
||||
|
||||
DOWNTIME -->|"Ano, hodiny+"| COLD["Cold / Big Bang\nNejjednodušší, nejlevnější\nRiziko: vše najednou"]
|
||||
DOWNTIME -->|"Ano, minuty"| PHASED["Phased / Wave\nPo aplikacích / byznys jednotkách"]
|
||||
DOWNTIME -->|"Ne (zero downtime)"| SYNC{"Sync replikace\nmožná?"}
|
||||
|
||||
SYNC -->|"Ano, < 100 km"| ROLLING
|
||||
SYNC -->|"Ne"| PARALLEL["Parallel Run\nOba DC aktivní, postupný cutover"]
|
||||
|
||||
ROLLING --> ROLL_HA{"VMware,\nHyper-V?"}
|
||||
ROLL_HA -->|"Ano"| VMOTION["vMotion / Storage vMotion\nLive migration, 0 downtime"]
|
||||
ROLL_HA -->|"Ne"| ROLL_REPL["Storage + DB replikace\nPostupný přesun workloadů"]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Fáze migrace
|
||||
|
||||
### 1. Discovery a assessment
|
||||
|
||||
| Úkol | Nástroje | Výstup |
|
||||
|------|----------|--------|
|
||||
| Inventarizace HW a SW | RVTools, NetBox, CMDB | Seznam všech serverů, VM, služeb |
|
||||
| Dependency mapping | ServiceNow, AppDynamics, manual | Aplikační dependency graf |
|
||||
| Traffic analysis | NetFlow, sFlow, vRNI | BANDWIDTH, latency, peak usage |
|
||||
| Výkonnostní baseline | Prometheus, Zabbix, vRealize | CPU/RAM/disk/network per workload |
|
||||
| Licenční audit | Flexera, SAM | Licence, support, compliance |
|
||||
|
||||
**Výstupem je:** seznam workloadů s RTO/RPO, závislostmi a kritičností. Bez toho nelze naplánovat migraci.
|
||||
|
||||
### 2. Plánování
|
||||
|
||||
- **Wave plán** — rozdělení workloadů do migračních vln (10–50 VM na vlnu)
|
||||
- **Závislostní řazení** — DNS, NTP, LDAP, PKI musí být první
|
||||
- **Cutover okno** — časové okno pro přepnutí (typicky víkend)
|
||||
- **Rollback plán** — podmínky a postup pro vrácení
|
||||
- **Testovací plán** — co a jak testovat po migraci
|
||||
- **Komunikační plán** — kdo, kdy, jak je informován
|
||||
|
||||
### 3. Příprava nového DC
|
||||
|
||||
- **Infrastruktura** — DNS, NTP, DHCP, LDAP/AD, PKI, monitoring (viz [DATACENTERS.md](DATACENTERS.md) — deployment order)
|
||||
- **Network** — BGP peering, VXLAN/VLAN, firewall pravidla, load balancery
|
||||
- **Storage** — SAN zoning, NAS exports, Ceph cluster
|
||||
- **Virtualizace** — vCenter, Hyper-V cluster, Proxmox
|
||||
|
||||
### 4. Replikace a synchronizace
|
||||
|
||||
| Vrstva | Metoda | Nástroje |
|
||||
|--------|--------|----------|
|
||||
| **Storage (block)** | SAN sync/async mirror, LUN replication | NetApp SnapMirror, Dell EMC RecoverPoint, Pure ActiveCluster |
|
||||
| **Storage (file)** | DFS-R, rsync, robocopy | Windows DFS, Rsync |
|
||||
| **Storage (object)** | Cross-region replication | MinIO replication, S3 CRR |
|
||||
| **Databáze** | Log shipping, CDC, streaming replication | PostgreSQL Patroni, Oracle Data Guard, MSSQL AlwaysOn, MySQL Group Replication |
|
||||
| **VM** | Storage vMotion, replication | VMware vSphere Replication, Hyper-V Replica, Zerto |
|
||||
| **Kubernetes** | Velero + Restic, Rook Ceph mirror | Velero, Rook |
|
||||
|
||||
### 5. Migrace workloadů
|
||||
|
||||
#### Wave migrace (doporučeno pro střední/větší DC)
|
||||
|
||||
```mermaid
|
||||
gantt
|
||||
title Wave migrace
|
||||
dateFormat YYYY-MM-DD
|
||||
section Wave 1 - Core
|
||||
DNS, NTP, LDAP :done, w1a, 2026-07-01, 3d
|
||||
Monitoring + logging :done, w1b, after w1a, 2d
|
||||
section Wave 2 - Network
|
||||
Load balancers :active, w2a, 2026-07-06, 2d
|
||||
Firewalls :active, w2b, 2026-07-08, 2d
|
||||
section Wave 3 - Storage
|
||||
NAS migrace :w3a, 2026-07-10, 5d
|
||||
SAN replication :w3b, 2026-07-10, 3d
|
||||
section Wave 4 - Dev/Test
|
||||
Dev VMs :w4a, 2026-07-15, 5d
|
||||
section Wave 5 - Prod tier 3
|
||||
Internal apps :w5a, 2026-07-22, 5d
|
||||
section Wave 6 - Prod tier 2
|
||||
Business apps :w6a, 2026-07-29, 5d
|
||||
section Wave 7 - Prod tier 1
|
||||
Critical apps :w7a, 2026-08-05, 5d
|
||||
```
|
||||
|
||||
#### Typický postup jedné vlny:
|
||||
|
||||
1. **Den -7**: Sync replikace dat (initial seed)
|
||||
2. **Den -1**: Incremental sync, final test
|
||||
3. **Den 0 (cutover)**:
|
||||
- Zastavení aplikace ve zdrojovém DC
|
||||
- Final sync (poslední delta)
|
||||
- Start aplikace v cílovém DC
|
||||
- DNS/Traffic switch
|
||||
- Smoke test
|
||||
4. **Den +1**: Monitorování (výkon, chyby, lag)
|
||||
5. **Den +7**: Rollback window end (potvrzení úspěchu)
|
||||
|
||||
### 6. Síťové strategie
|
||||
|
||||
#### IP re-addressing
|
||||
|
||||
| Přístup | Popis | Výhody | Nevýhody |
|
||||
|---------|-------|--------|----------|
|
||||
| **Keep IP** | Stejné IP, BGP anycast nebo stretch VLAN | Není třeba měnit konfiguraci aplikací | Stretched VLAN/L2 omezení |
|
||||
| **Change IP** | Nový IP rozsah, DNS/BGP routing změna | Čistá architektura | Změny konfigurací, DNS TTL |
|
||||
| **NAT překlad** | NAT mezi starým a novým IP spacem | Bez změny aplikací | Latence, komplexita troubleshooting |
|
||||
|
||||
**Keep IP** je možný jen:
|
||||
- L2 stretch mezi DC (VXLAN, OTV) — omezeno vzdáleností
|
||||
- BGP anycast pro VIP (load balancery)
|
||||
- Aplikace tolerující ARP cache změny
|
||||
|
||||
#### DNS cutover
|
||||
|
||||
```
|
||||
1. Snížit TTL na 60–300 s (týden předem)
|
||||
2. Při cutoveru změnit A/AAAA záznamy na nové IP
|
||||
3. Počkat na propagaci (dle TTL)
|
||||
4. Monitorovat traffic
|
||||
```
|
||||
|
||||
#### Traffic steering
|
||||
|
||||
| Technika | Use case |
|
||||
|----------|----------|
|
||||
| **BGP** | Změna AS path / local pref pro přesměrování trafficu |
|
||||
| **DNS** | Snížení TTL, change A records |
|
||||
| **Load balancer** | Změna pool members, health check |
|
||||
| **GSLB** | Global Server Load Balancing (F5 GTM, NSX ALB) |
|
||||
| **Cloud DNS** | AWS Route53, Azure Traffic Manager, Google Cloud DNS |
|
||||
|
||||
### 7. Databázová migrace
|
||||
|
||||
Viz detail v jednotlivých DB souborech. Tabulka shrnutí:
|
||||
|
||||
| DB | Metoda | RPO | RTO | Poznámka |
|
||||
|----|--------|-----|-----|----------|
|
||||
| **PostgreSQL** | Streaming replication + Patroni switchover | 0 (sync) / ~MB (async) | min | Patroni auto-failover |
|
||||
| **MySQL** | Group Replication / async replication | 0 (sync) / sekundy | min | InnoDB Cluster |
|
||||
| **Oracle** | Data Guard switchover | 0 (sync) | min | Far sync pro vzdálené DC |
|
||||
| **MSSQL** | AlwaysOn AG failover | 0 (sync) | min | Cloud witness |
|
||||
| **MongoDB** | Replica set election | sekundy | < 1 min | Priority-based failover |
|
||||
| **Cassandra** | Multi-DC replication | eventual | 0 | Nativní multi-master |
|
||||
|
||||
### 8. Testování
|
||||
|
||||
| Fáze | Co testovat | Metoda |
|
||||
|------|-------------|--------|
|
||||
| **Pre-migrace** | Aplikace v novém DC (izolovaně) | Dry run na replikovaných datech |
|
||||
| **Cutover** | Funkčnost, dostupnost, latence | Smoke test, synthetic transactions |
|
||||
| **Post-migrace** | Výkon, integrace, monitoring | A/B comparison s baseline, canary traffic |
|
||||
| **Rollback** | Návrat ke starému DC | Testovaný rollback plán |
|
||||
|
||||
### 9. Rollback plán
|
||||
|
||||
Každá vlna musí mít definovaný rollback:
|
||||
|
||||
| Podmínka | Akce |
|
||||
|----------|------|
|
||||
| Aplikace nestartuje v novém DC | Přepnutí DNS zpět, zastavení replikace |
|
||||
| Výkon horší než baseline (o > 20 %) | Rollback, analýza příčiny |
|
||||
| Integrační selhání (API timeout, DB connection) | Rollback, dependency check |
|
||||
| Bezpečnostní incident | Rollback, forenzní analýza |
|
||||
|
||||
Rollback by měl být otestován **před** reálným cutoverem.
|
||||
|
||||
---
|
||||
|
||||
## Speciální případy
|
||||
|
||||
### Mainframe migrace
|
||||
|
||||
- **IBM z/OS** — GDPS (Geographically Dispersed Parallel Sysplex)
|
||||
- HyperSwap pro storage mirroring
|
||||
- Cross-system coupling facility (XCF)
|
||||
- Často poslední migrovaná komponenta
|
||||
|
||||
### COTS aplikace (Oracle EBS, SAP)
|
||||
|
||||
- Vyžadují specifické migrační postupy výrobce
|
||||
- Oracle EBS: Autoconfig, cloning (ADXLC)
|
||||
- SAP: System Copy (Homogeneous / Heterogeneous), SWPM, SUM
|
||||
- Licenční re-licensing při změně HW
|
||||
|
||||
### Cloud migrace (On-prem → Cloud)
|
||||
|
||||
Viz [CLOUD.md](CLOUD.md) — migrační strategie (6 Rs):
|
||||
|
||||
| Strategie | Popis |
|
||||
|-----------|-------|
|
||||
| **Re-host (Lift & Shift)** | VM → Cloud VM (AWS MGN, Azure Migrate) |
|
||||
| **Re-platform** | OS upgrade, managed DB (RDS, Cloud SQL) |
|
||||
| **Re-architect** | Aplikace přepsána na cloud-native |
|
||||
| **Retire** | Zastavení nepotřebných aplikací |
|
||||
| **Retain** | Aplikace zůstává on-prem (revize později) |
|
||||
| **Repurchase** | SaaS náhrada |
|
||||
|
||||
---
|
||||
|
||||
## Doporučený postup per velikost DC
|
||||
|
||||
| Velikost DC | Počet VM | Doporučená strategie | Doba trvání | Tým |
|
||||
|-------------|----------|---------------------|-------------|-----|
|
||||
| **Small** | < 50 | Big Bang (víkend) | 2–4 dny | 3–5 lidí |
|
||||
| **Medium** | 50–500 | Phased (5–10 wave) | 2–8 týdnů | 5–10 lidí |
|
||||
| **Large** | 500–5000 | Phased + Rolling | 3–12 měsíců | 10–30 lidí |
|
||||
| **Enterprise** | 5000+ | Parallel Run / Rolling | 12–36 měsíců | 30+ lidí |
|
||||
|
||||
---
|
||||
|
||||
## Související
|
||||
|
||||
- [DATACENTERS.md](DATACENTERS.md) — DC topologie, sekundární DC, deployment order
|
||||
- [CLOUD.md](CLOUD.md) — cloud migrační strategie (6 Rs)
|
||||
- [DR.md](DR.md) — disaster recovery, RTO/RPO
|
||||
- [NETWORKING.md](NETWORKING.md) — BGP, DNS, VXLAN, traffic steering
|
||||
- [STORAGE.md](STORAGE.md) — storage replikace
|
||||
|
||||
## Zdroje
|
||||
|
||||
Odkazy, knihy a standardy: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
|
||||
|
||||
*Poslední revize: 2026-06-12*
|
||||
336
DR.en.md
Normal file
@@ -0,0 +1,336 @@
|
||||
# 🔄 Disaster Recovery and Business Continuity
|
||||
|
||||
## Terminology
|
||||
|
||||
| Abbreviation | Meaning | Description |
|---------|--------|-------|
| **RTO** | Recovery Time Objective | Maximum time from outage to service recovery |
| **RPO** | Recovery Point Objective | Maximum acceptable data loss, measured as time back to the last recoverable copy |
| **MTD** | Maximum Tolerable Downtime | Total outage duration an organization can survive |
| **WRT** | Work Recovery Time | Time needed for full operations recovery after IT restoration |
| **MTBF** | Mean Time Between Failures | Average operating time between two consecutive failures |
| **MTTR** | Mean Time To Repair | Average time from a failure to restored service |
| **SLA** | Service Level Agreement | Contractual availability commitment |
| **SLO** | Service Level Objective | Internal availability target |
| **SLI** | Service Level Indicator | Measured availability value |
|
||||
|
||||
### Relationship between RTO, RPO, MTD, WRT
|
||||
|
||||
```
Last backup ──── RPO ────► Outage ──── RTO ────► Service running ──── WRT ────► Full operations
                  │                     │                              │
                  ▼                     ▼                              ▼
              Lost data       Time without service           Time to full capacity

MTD = RTO + WRT (max. time the business tolerates)
```
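
The relationship MTD = RTO + WRT gives a quick consistency check for recovery targets. A minimal sketch follows, with hypothetical function names and all durations in hours:

```python
def fits_mtd(rto_h: float, wrt_h: float, mtd_h: float) -> bool:
    """True if IT recovery plus work recovery stays within the tolerable downtime."""
    return rto_h + wrt_h <= mtd_h


def rto_budget(mtd_h: float, wrt_h: float) -> float:
    """Largest RTO that still leaves room for the work recovery time."""
    if wrt_h > mtd_h:
        raise ValueError("WRT alone already exceeds MTD")
    return mtd_h - wrt_h


if __name__ == "__main__":
    print(fits_mtd(rto_h=4, wrt_h=2, mtd_h=8))
    print(rto_budget(mtd_h=8, wrt_h=2))
```

Working backwards from MTD like this tends to produce tighter RTO targets than teams expect, because WRT is easy to overlook.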
|
||||
|
||||
---
|
||||
|
||||
## Uptime calculation
|
||||
|
||||
### Nines table
|
||||
|
||||
| Level | Uptime | Downtime / year | Downtime / month | Downtime / week |
|
||||
|--------|--------|---------------|------------------|------------------|
|
||||
| 90 % (one nine) | 0.9 | 36.5 days | 72 h | 16.8 h |
|
||||
| 99 % (two nines) | 0.99 | 3.65 days | 7.2 h | 1.68 h |
|
||||
| 99.5 % | 0.995 | 1.83 days | 3.6 h | 50.4 min |
|
||||
| 99.9 % (three nines) | 0.999 | 8.76 h | 43.2 min | 10.1 min |
|
||||
| 99.95 % | 0.9995 | 4.38 h | 21.6 min | 5.04 min |
|
||||
| 99.99 % (four nines) | 0.9999 | 52.6 min | 4.32 min | 1.01 min |
|
||||
| 99.995 % | 0.99995 | 26.3 min | 2.16 min | 30.2 s |
|
||||
| 99.999 % (five nines) | 0.99999 | 5.26 min | 25.9 s | 6.05 s |
|
||||
| 99.9999 % (six nines) | 0.999999 | 31.6 s | 2.59 s | 0.605 s |
|
||||
|
||||
### Calculation
|
||||
|
||||
```
Availability = (Total time - Downtime) / Total time × 100 %

Example:
  Year = 365 × 24 × 60 = 525,600 minutes
  Target: 99.9 % → allowed downtime = 525,600 × (1 - 0.999) = 525.6 minutes = 8.76 h

Combined availability (chain of dependencies):
  A_web = 99.9 %   (3 nines)
  A_api = 99.99 %  (4 nines)
  A_db  = 99.999 % (5 nines)

  A_total = 0.999 × 0.9999 × 0.99999 = 0.99889 ≈ 99.89 % (less than 3 nines!)

Parallel availability (redundancy):
  A_total = 1 - (1 - A_1) × (1 - A_2) × ... × (1 - A_n)

  Example: 2 servers with 99% availability
  A_total = 1 - (1-0.99) × (1-0.99) = 1 - 0.01 × 0.01 = 0.9999 (99.99 %)
```
|
||||
|
||||
### Calculator
|
||||
|
||||
```python
def uptime_percent_to_downtime(pct, period_days=365):
    """Convert uptime percentage to downtime in given period."""
    total_minutes = period_days * 24 * 60
    allowed_downtime = total_minutes * (1 - pct / 100)
    return allowed_downtime  # minutes


def downtime_to_uptime_percent(downtime_minutes, period_days=365):
    """Convert downtime in minutes to uptime percentage."""
    total_minutes = period_days * 24 * 60
    return (1 - downtime_minutes / total_minutes) * 100


def combined_availability(availabilities):
    """Combined availability (series-connected components)."""
    # Inputs are fractions (0.999), not percentages (99.9).
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant_availability(availabilities):
    """Redundant availability (parallel components)."""
    # The system is down only when every component is down at the same time.
    result = 1.0
    for a in availabilities:
        result *= (1 - a)
    return 1 - result
```
|
||||
|
||||
### Calculation fallacies
|
||||
|
||||
- **Combined availability is not a sum** — adding another dependency always reduces total availability
|
||||
- **Redundancy is not free** — adding a standby component requires failure detection + failover (MTTR does not improve automatically)
|
||||
- **SLA is not a guarantee** — providers often calculate SLA as a monthly average, not per-incident
|
||||
- **Measurement is key** — without SLI, SLO cannot be verified; "unmeasured availability does not exist"
|
||||
- **Planned maintenance** — sometimes counted as uptime, sometimes not (depends on SLA definition)
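
Two of the points above can be made concrete. Steady-state availability follows from MTBF and MTTR as MTBF / (MTBF + MTTR), and a redundant pair only reaches the ideal parallel figure if failover actually works. The sketch below uses hypothetical function names, and the failover model is a deliberate simplification: the standby helps only when the switchover succeeds and the standby itself is up.

```python
def availability_from_mtbf(mtbf_h: float, mttr_h: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_h / (mtbf_h + mttr_h)


def pair_availability(a: float, failover_success: float) -> float:
    """Availability of a primary/standby pair under a simplified failover model.

    Up if the primary is up, or if the primary is down, the failover succeeds,
    and the standby (same availability `a`) is up.
    """
    return a + (1 - a) * failover_success * a


if __name__ == "__main__":
    a = availability_from_mtbf(mtbf_h=999, mttr_h=1)
    print(f"single node:          {a:.5f}")
    print(f"pair, 100 % failover: {pair_availability(a, 1.0):.7f}")
    print(f"pair, 90 % failover:  {pair_availability(a, 0.9):.7f}")
```

With `failover_success=1.0` the result equals the parallel formula 1 - (1 - a)²; any lower value pulls the pair back toward the availability of a single node.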
|
||||
|
||||
---
|
||||
|
||||
## DR scenarios
|
||||
|
||||
### Classification
|
||||
|
||||
| Category | Scenario | Typical RTO | Typical RPO | Frequency |
|
||||
|-----------|--------|-------------|-------------|-----------|
|
||||
| **Site** | Entire DC / region outage | hours | minutes | Low |
|
||||
| **Infrastructure** | HW failure (storage, switch, server) | minutes–hours | seconds | Medium |
|
||||
| **Software** | OS, application, DB failure | minutes | seconds | High |
|
||||
| **Data** | Data corruption, deletion, cryptolocker | hours | backup point | Low–medium |
|
||||
| **Human** | Wrong deployment, config change | minutes–hours | seconds | Medium |
|
||||
| **Security** | Attack, breach, ransomware | days | before attack | Low |
|
||||
| **Network** | Connectivity outage, DDoS | minutes–hours | N/A | Medium |
|
||||
| **Cloud provider** | Regional outage (AWS, Azure, GCP) | hours | minutes | Very low |
|
||||
|
||||
### Scenario details
|
||||
|
||||
#### Site / Region failure
|
||||
|
||||
| Aspect | Description |
|
||||
|--------|-------|
|
||||
| **Cause** | Blackout, fire, flood, earthquake, cloud provider outage |
|
||||
| **Prevention** | Multi-AZ architecture, multi-region deployment, active-active |
|
||||
| **Mitigation** | Automatic DNS failover (Route53, Azure Traffic Manager), replica in DR region |
|
||||
| **Testing** | Game day: shut down primary region, verify automatic failover |
|
||||
|
||||
#### Data corruption / human error
|
||||
|
||||
| Aspect | Description |
|
||||
|--------|-------|
|
||||
| **Cause** | Wrong SQL command (DELETE without WHERE), accidentally deleted bucket, bad migration |
|
||||
| **Prevention** | RBAC, MFA for destructive operations, change management, SQL peer review |
|
||||
| **Mitigation** | Point-in-time recovery (PITR), transaction log replay, immutable backups |
|
||||
| **Testing** | Restore backup to isolated environment, verify data integrity |
|
||||
|
||||
#### Ransomware / cyber attack
|
||||
|
||||
| Aspect | Description |
|
||||
|--------|-------|
|
||||
| **Cause** | Attack on production systems, data encryption, exfiltration |
|
||||
| **Prevention** | Immutable backups (object lock), air-gapped backups, network segmentation |
|
||||
| **Mitigation** | Restore from clean backup, rebuild infrastructure from IaC |
|
||||
| **Testing** | Regular restore in isolated network, verify backup is not infected |
|
||||
|
||||
---
|
||||
|
||||
## Prevention — strategies
|
||||
|
||||
### Backup strategies
|
||||
|
||||
| Approach | Description | Use case |
|
||||
|---------|-------|----------|
|
||||
| **3-2-1 rule** | 3 copies, 2 different media, 1 off-site | Universal |
|
||||
| **3-2-1-0** | + 0 errors after restore (testing) | Enterprise, compliance |
|
||||
| **GFS (Grandfather-Father-Son)** | Daily, weekly, monthly rotation | Long-term archive |
|
||||
| **Incremental forever** | Full backup 1×, then only changes | Large data volumes |
|
||||
| **Reverse incremental** | Full + incremental, full is always current | Fast recovery |
|
||||
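The GFS rotation in the table above can be sketched as a retention filter. This is a minimal illustration, not any backup product's algorithm: the function name `gfs_keep` and the default counts (7 daily, 4 weekly, 12 monthly) are assumptions chosen for the example.

```python
from datetime import date, timedelta


def gfs_keep(backup_days, today, daily=7, weekly=4, monthly=12):
    """Return the backup dates to retain under a simple GFS policy."""
    days = sorted(d for d in backup_days if d <= today)
    keep = set()

    # Sons: every backup taken within the last `daily` days.
    keep.update(d for d in days if (today - d).days < daily)

    # Fathers: the newest backup of each of the last `weekly` ISO weeks.
    newest_per_week = {}
    for d in days:
        newest_per_week[tuple(d.isocalendar())[:2]] = d  # later dates overwrite
    keep.update(sorted(newest_per_week.values())[-weekly:])

    # Grandfathers: the newest backup of each of the last `monthly` months.
    newest_per_month = {}
    for d in days:
        newest_per_month[(d.year, d.month)] = d
    keep.update(sorted(newest_per_month.values())[-monthly:])

    return sorted(keep)
```

Everything not returned by `gfs_keep` is a candidate for pruning. A real policy would also need to respect immutability locks and legal retention requirements.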
|
||||
### Backup methods
|
||||
|
||||
| Method | RPO | RTO | Storage | Suitable for |
|
||||
|--------|-----|-----|----------|------------|
|
||||
| **Full backup** | Last full | Full restore time | Large | Small data, weekly |
|
||||
| **Incremental** | Last incremental | Full + all incrementals | Small | Large data, daily |
|
||||
| **Differential** | Last diff | Full + last diff | Medium | Compromise |
|
||||
| **Snapshot** | Snapshot point-in-time | seconds | Copy-on-write | VM, storage array |
|
||||
| **Continuous (CDC)** | < 1 s | Seconds | Log stream | DB (binlog, WAL) |
|
||||
| **PITR** | Any point in time | Depends on volume | Full + WAL | RDS, PostgreSQL, SQL Server |
|
||||
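The RTO column for incremental and differential backups follows from how many backup sets a restore has to read. A rough sketch, assuming a schedule that uses either incrementals or differentials after each full (not a mix); the function name `restore_chain` and the kind labels are made up for the example.

```python
def restore_chain(kinds, target):
    """Return the indexes of the backups needed to restore backups[target].

    `kinds` is a chronological list of "full", "incr" or "diff".
    Raises ValueError if no full backup exists at or before the target.
    """
    # The chain always starts at the most recent full backup.
    base = max(i for i in range(target + 1) if kinds[i] == "full")
    if kinds[target] == "full":
        return [target]
    if kinds[target] == "diff":
        # A differential contains all changes since the last full.
        return [base, target]
    # An incremental only contains changes since the previous backup,
    # so every incremental since the full has to be replayed in order.
    return [base] + [i for i in range(base + 1, target + 1) if kinds[i] == "incr"]
```

With one full and six daily incrementals, a restore of the last day reads seven sets; with differentials it reads two. That is the trade-off between storage use and restore time shown in the table.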
|
||||
### Backup immutability
|
||||
|
||||
Key protection against ransomware:
|
||||
|
||||
| Technique | Description |
|
||||
|----------|-------|
|
||||
| **Object Lock (WORM)** | Backup cannot be deleted or overwritten for a defined retention period (S3 Object Lock, Azure Blob Immutable) |
|
||||
| **Air gap** | Backup is physically separated from the production network (offline disk, tape, cloud without VPN) |
|
||||
| **Isolated backup network** | Backup traffic goes through a dedicated network without access from production VLAN |
|
||||
| **Out-of-band access** | Backup management console is not accessible from the production network |
|
||||
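The WORM behaviour of an object lock can be modelled in a few lines. This is only a model of the semantics described in the table, not the S3 or Azure API; `LockedBackup` and `delete_backup` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LockedBackup:
    key: str
    written_at: datetime
    retention_days: int

    @property
    def retain_until(self) -> datetime:
        return self.written_at + timedelta(days=self.retention_days)


def delete_backup(store: dict, key: str, now: datetime) -> bool:
    """Delete the backup only if its retention period has expired."""
    backup = store[key]
    if now < backup.retain_until:
        return False  # still locked: the delete request is refused
    del store[key]
    return True
```

The point of the model is that the refusal does not depend on who asks: an attacker with stolen admin credentials gets the same `False` until the retention date passes.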
|
||||
---
|
||||
|
||||
## DR architectures
|
||||
|
||||
### Multi-AZ (Single region)
|
||||
|
||||
```
|
||||
Region ┌────────────────────────────────────┐
|
||||
│ AZ-1 AZ-2 │
|
||||
│ ┌──────────┐ ┌──────────┐ │
|
||||
│ │ App │ │ App │ │
|
||||
│ └─────┬────┘ └─────┬────┘ │
|
||||
│ │ │ │
|
||||
│ ┌─────▼────────────────▼─────┐ │
|
||||
│ │ Load Balancer (cross-AZ) │ │
|
||||
│ └─────────────┬──────────────┘ │
|
||||
│ │ │
|
||||
│ ┌─────────────▼──────────────┐ │
|
||||
│ │ DB Primary (AZ-1) │ │
|
||||
│ │ DB Standby (AZ-2) │ │
|
||||
│ │ Synchronous replication │ │
|
||||
│ └────────────────────────────┘ │
|
||||
└────────────────────────────────────┘
|
||||
```
|
||||
|
||||
- RTO: minutes (automatic failover)
|
||||
- RPO: 0 (sync replication)
|
||||
- Protection: against AZ failure, not region failure
|
||||
|
||||
### Multi-Region
|
||||
|
||||
```
|
||||
Region A (Primary) Region B (DR)
|
||||
┌─────────────────────┐ ┌─────────────────────┐
|
||||
│ ┌───────────────┐ │ │ ┌───────────────┐ │
|
||||
│ │ App + DB │ │ │ │ App + DB │ │
|
||||
│ │ Active │──┼──Async───────┼─►│ Standby │ │
|
||||
│ └───────────────┘ │ replication │ └───────────────┘ │
|
||||
│ │ │ │ │ │
|
||||
│ ┌──────▼───────┐ │ │ ┌──────▼───────┐ │
|
||||
│ │ DNS / GSLB │ │ │ │ DNS / GSLB │ │
|
||||
│ └──────┬───────┘ │ │ └──────┬───────┘ │
|
||||
└─────────┼──────────┘ └─────────┼──────────┘
|
||||
│ │
|
||||
└──────────── Traffic Manager ───────┘
|
||||
```
|
||||
|
||||
| Variant | RTO | RPO | Cost | Failover |
|
||||
|----------|-----|-----|---------|----------|
|
||||
| **Active-Passive** | minutes–hours | seconds | Medium | Manual / auto |
|
||||
| **Active-Active** | seconds | < 1 s | High | Automatic (DNS) |
|
||||
| **Pilot Light** | tens of minutes | minutes | Low | Manual scaling |
|
||||
| **Warm Standby** | minutes | seconds | High | Auto (scaled-down copy) |
|
||||
| **Backup & Restore** | hours | 24 h | Low | Manual |
|
||||
|
||||
### On-prem → Cloud DR (Hybrid)
|
||||
|
||||
```
|
||||
On-prem DC Cloud (DR)
|
||||
┌─────────────────────┐ ┌─────────────────────┐
|
||||
│ ┌───────────────┐ │ │ ┌───────────────┐ │
|
||||
│ │ Application │ │ │ │ VM / App │ │
|
||||
│ │ + DB │ │ │ │ + DB replica │ │
|
||||
│ └───────┬───────┘ │ │ └───────┬───────┘ │
|
||||
│ │ │ │ │ │
|
||||
│ ┌───────▼───────┐ │ site-to-site│ ┌───────▼───────┐ │
|
||||
│ │ Backup proxy │──┼────VPN───────┼─►│ Backup store │ │
|
||||
│ └───────────────┘ │ │ └───────────────┘ │
|
||||
│ │ │ │
|
||||
│ ┌───────────────┐ │ │ ┌───────────────┐ │
|
||||
│ │ Tape / NAS │ │ │ │ Veeam / Zerto│ │
|
||||
│ └───────────────┘ │ │ └───────────────┘ │
|
||||
└─────────────────────┘ └─────────────────────┘
|
||||
```
|
||||
|
||||
- **RTO**: tens of minutes (depends on VM startup)
|
||||
- **RPO**: minutes–hours (depends on replication tool)
|
||||
- **Tools**: Veeam, Zerto, Azure Site Recovery, AWS MGN, Commvault
|
||||
- **Use case**: enterprise with on-prem DC that needs DR without a second DC
|
||||
|
||||
---
|
||||
|
||||
## DR testing
|
||||
|
||||
### Test types
|
||||
|
||||
| Type | Description | Frequency | Risk |
|
||||
|-----|-------|-----------|--------|
|
||||
| **Tabletop exercise** | Manual scenario walkthrough, no impact on production | Monthly | None |
|
||||
| **Walkthrough** | Runbook verification, ensure everyone knows what to do | Quarterly | None |
|
||||
| **Component test** | Test of a single component (e.g., restore one DB) | Monthly | Low |
|
||||
| **Integrated test** | Test of the entire stack in isolated environment | Quarterly | Low |
|
||||
| **Full failover test** | Production failover to DR site | Annually | High |
|
||||
| **Chaos experiment** | Targeted fault injection into production | Continuous | Medium |
|
||||
|
||||
### Runbook structure
|
||||
|
||||
Each DR scenario should have a runbook:
|
||||
|
||||
```yaml
|
||||
scenario: "Region A failure"
|
||||
triggers:
|
||||
- "CloudWatch alarm: Region A health check 5× timeout"
|
||||
- "PagerDuty incident P0"
|
||||
decision_tree: |
|
||||
1. Verify: is Region A really unavailable? (check from 3 different locations)
|
||||
2. Decide: is RTO at risk? If < 30 % RTO remaining → failover
|
||||
3. Failover: run playbook `dr-failover-region-b`
|
||||
4. Verification: smoke tests in Region B
|
||||
5. Communication: status page + stakeholders
|
||||
rollback: |
|
||||
1. After Region A recovery → replicate changes from B back to A
|
||||
2. Repoint DNS to A
|
||||
3. Verify data consistency
|
||||
4. Shut down Region B (or keep as hot standby)
|
||||
contacts:
|
||||
primary: "on-call@example.com"
|
||||
escalation: "infra-lead@example.com"
|
||||
management: "vp-engineering@example.com"
|
||||
```
|
||||
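Steps 1 and 2 of the decision tree can be expressed as a small check. A sketch under stated assumptions: the function name, the parameter names, and the reading of "< 30 % RTO remaining" as a fraction of the RTO budget are illustrative, not part of the runbook format.

```python
def should_fail_over(rto_minutes, elapsed_minutes, confirmed_locations,
                     required_locations=3, remaining_threshold=0.30):
    """Return True when the runbook's conditions for failover are met."""
    # Step 1: the outage must be confirmed from enough independent locations.
    if confirmed_locations < required_locations:
        return False
    # Step 2: fail over once less than 30 % of the RTO budget is left.
    remaining = max(rto_minutes - elapsed_minutes, 0) / rto_minutes
    return remaining < remaining_threshold
```

For a 60-minute RTO, the check turns true once more than 42 minutes have elapsed and the outage is confirmed from three locations. The failover itself also takes time, so a real threshold should leave room for the playbook's own duration.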
|
||||
---
|
||||
|
||||
## Best practices
|
||||
|
||||
- **Test recovery, not backup** — a backup without tested recovery is not a backup
|
||||
- **Automate DR** — Terraform / Ansible for DR environment spin-up, DNS failover
|
||||
- **Document runbooks** — every scenario, contact, decision tree
|
||||
- **Expect failure** — design for failure, don't expect everything to work
|
||||
- **Don't underestimate WRT** — service recovery does not mean full operations (data warming, cache, connections)
|
||||
- **Align RTO/RPO with business** — technical capabilities must match business requirements
|
||||
- **Monitor SLI** — without data, SLO cannot be verified
|
||||
- **DR is not just IT** — communication, PR, legal, compliance
|
||||
|
||||
---
|
||||
|
||||
## Related
|
||||
|
||||
- [CLOUD.md](CLOUD.md) — cloud DR strategy, AWS/Azure/GCP specific
|
||||
- [DATACENTERS.md](DATACENTERS.md) — DC redundancy, Tier classification
|
||||
- [MONITORING.md](MONITORING.md) — alerting, SLI/SLO/SLA
|
||||
- [CICD.md](CICD.md) — deployment strategy, rollback
|
||||
- [STORAGE.md](STORAGE.md) — backup storage, replication
|
||||
|
||||
## Sources
|
||||
|
||||
Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
|
||||
|
||||
*Last revised: 2026-06-11*
|
||||
336
DR.md
Normal file
@@ -0,0 +1,336 @@
|
||||
# 🔄 Disaster Recovery and Business Continuity
|
||||
|
||||
## Terminology

| Abbreviation | Meaning | Description |
|---------|--------|-------|
| **RTO** | Recovery Time Objective | Maximum time from outage to service restoration |
| **RPO** | Recovery Point Objective | Maximum acceptable data loss (time since the last backup) |
| **MTD** | Maximum Tolerable Downtime | Total downtime the organization can survive |
| **WRT** | Work Recovery Time | Time needed to fully resume operations after IT is restored |
| **MTBF** | Mean Time Between Failures | Mean time between failures |
| **MTTR** | Mean Time To Repair | Mean time to repair |
| **SLA** | Service Level Agreement | Contractual availability commitment |
| **SLO** | Service Level Objective | Internal availability target |
| **SLI** | Service Level Indicator | Measured availability value |
|
||||
|
||||
### Relationship between RTO, RPO, MTD, and WRT

```
Outage ──── RPO ────► Data restore ──── RTO ────► Service running ──── WRT ────► Full operations
                  │                      │                       │
                  ▼                      ▼                       ▼
            Lost data              Time without service    Time to full performance

MTD = RTO + WRT (max. downtime the company tolerates)
```
|
||||
|
||||
---
|
||||
|
||||
## Uptime calculation

### Table of nines

| Level | Uptime | Downtime / year | Downtime / month | Downtime / week |
|--------|--------|---------------|------------------|------------------|
| 90 % (one nine) | 0.9 | 36.5 days | 72 h | 16.8 h |
| 99 % (two nines) | 0.99 | 3.65 days | 7.2 h | 1.68 h |
| 99.5 % | 0.995 | 1.83 days | 3.6 h | 50.4 min |
| 99.9 % (three nines) | 0.999 | 8.76 h | 43.2 min | 10.1 min |
| 99.95 % | 0.9995 | 4.38 h | 21.6 min | 5.04 min |
| 99.99 % (four nines) | 0.9999 | 52.6 min | 4.32 min | 1.01 min |
| 99.995 % | 0.99995 | 26.3 min | 2.16 min | 30.2 s |
| 99.999 % (five nines) | 0.99999 | 5.26 min | 25.9 s | 6.05 s |
| 99.9999 % (six nines) | 0.999999 | 31.6 s | 2.59 s | 0.605 s |
|
||||
|
||||
### Calculation

```
Availability = (Total time - Downtime) / Total time × 100 %

Example:
  Year = 365 × 24 × 60 = 525,600 minutes
  Target: 99.9 % → allowed downtime = 525,600 × (1 - 0.999) = 525.6 minutes = 8.76 h

Composite availability (dependency chain):
  A_web = 99.9 %   (3 nines)
  A_api = 99.99 %  (4 nines)
  A_db = 99.999 % (5 nines)

  A_total = 0.999 × 0.9999 × 0.99999 = 0.99889 ≈ 99.89 % (less than 3 nines!)

Parallel availability (redundancy):
  A_total = 1 - (1 - A_1) × (1 - A_2) × ... × (1 - A_n)

  Example: 2 servers with 99 % availability
  A_total = 1 - (1-0.99) × (1-0.99) = 1 - 0.01 × 0.01 = 0.9999 (99.99 %)
```
|
||||
|
||||
### Calculator

```python
def uptime_percent_to_downtime(pct, period_days=365):
    """Convert an uptime percentage (e.g. 99.9) to allowed downtime in minutes."""
    total_minutes = period_days * 24 * 60
    allowed_downtime = total_minutes * (1 - pct / 100)
    return allowed_downtime  # minutes


def downtime_to_uptime_percent(downtime_minutes, period_days=365):
    """Convert downtime in minutes to an uptime percentage."""
    total_minutes = period_days * 24 * 60
    return (1 - downtime_minutes / total_minutes) * 100


def combined_availability(availabilities):
    """Composite availability of components in series (fractions, e.g. 0.999)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant_availability(availabilities):
    """Parallel availability of redundant components (fractions, e.g. 0.99)."""
    result = 1.0
    for a in availabilities:
        result *= (1 - a)  # probability that every component is down at once
    return 1 - result
```
|
||||
|
||||
### Calculation fallacies

- **Composite availability is not a sum** — adding another dependency always lowers overall availability
- **Redundancy is not free** — adding a standby component requires failure detection + failover (MTTR does not improve automatically)
- **SLA is not a guarantee** — providers often calculate SLA as a monthly average, not per incident
- **Measurement is key** — without SLI, SLO cannot be verified; "unmeasured availability does not exist"
- **Planned maintenance** — sometimes counted as uptime, sometimes not (depends on SLA definition)
|
||||
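The "redundancy is not free" point can be made concrete with a simplified model of an active/standby pair in which failover only succeeds with some probability. The formula and the `failover_success` parameter are assumptions for illustration; they are not part of the parallel-availability formula used earlier.

```python
def standby_pair_availability(a_primary, a_standby, failover_success=1.0):
    """Availability of an active/standby pair (all inputs are fractions).

    The service is up when the primary is up, or when the primary is down,
    the failover succeeds, and the standby is up.
    """
    return a_primary + (1 - a_primary) * failover_success * a_standby
```

With perfect failover, two 99 % servers give the textbook 99.99 %. If only 90 % of failovers succeed, the same pair reaches about 99.89 %, which is why failure detection and failover need to be tested as carefully as the components themselves.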
|
||||
---
|
||||
|
||||
## DR scenarios

### Classification

| Category | Scenario | Typical RTO | Typical RPO | Frequency |
|-----------|--------|-------------|-------------|-----------|
| **Site** | Entire DC / region outage | hours | minutes | Low |
| **Infrastructure** | HW failure (storage, switch, server) | minutes–hours | seconds | Medium |
| **Software** | OS, application, DB failure | minutes | seconds | High |
| **Data** | Data corruption, deletion, cryptolocker | hours | backup point | Low–medium |
| **Human** | Wrong deployment, config change | minutes–hours | seconds | Medium |
| **Security** | Attack, breach, ransomware | days | before attack | Low |
| **Network** | Connectivity outage, DDoS | minutes–hours | N/A | Medium |
| **Cloud provider** | Regional outage (AWS, Azure, GCP) | hours | minutes | Very low |
|
||||
|
||||
### Scenario details

#### Site / Region failure

| Aspect | Description |
|--------|-------|
| **Cause** | Blackout, fire, flood, earthquake, cloud provider outage |
| **Prevention** | Multi-AZ architecture, multi-region deployment, active-active |
| **Mitigation** | Automatic DNS failover (Route53, Azure Traffic Manager), replica in DR region |
| **Testing** | Game day: shut down primary region, verify automatic failover |

#### Data corruption / human error

| Aspect | Description |
|--------|-------|
| **Cause** | Wrong SQL command (DELETE without WHERE), accidentally deleted bucket, bad migration |
| **Prevention** | RBAC, MFA for destructive operations, change management, SQL peer review |
| **Mitigation** | Point-in-time recovery (PITR), transaction log replay, immutable backups |
| **Testing** | Restore backup to isolated environment, verify data integrity |

#### Ransomware / cyber attack

| Aspect | Description |
|--------|-------|
| **Cause** | Attack on production systems, data encryption, exfiltration |
| **Prevention** | Immutable backups (object lock), air-gapped backups, network segmentation |
| **Mitigation** | Restore from clean backup, rebuild infrastructure from IaC |
| **Testing** | Regular restore in isolated network, verify backup is not infected |
|
||||
|
||||
---
|
||||
|
||||
## Prevention — strategies

### Backup strategies

| Approach | Description | Use case |
|---------|-------|----------|
| **3-2-1 rule** | 3 copies, 2 different media, 1 off-site | Universal |
| **3-2-1-0** | + 0 errors after restore (testing) | Enterprise, compliance |
| **GFS (Grandfather-Father-Son)** | Daily, weekly, monthly rotation | Long-term archive |
| **Incremental forever** | Full backup 1×, then only changes | Large data volumes |
| **Reverse incremental** | Full + incremental, full is always current | Fast recovery |
|
||||
|
||||
### Backup methods

| Method | RPO | RTO | Storage | Suitable for |
|--------|-----|-----|----------|------------|
| **Full backup** | Last full | Full restore time | Large | Small data, weekly |
| **Incremental** | Last incremental | Full + all incrementals | Small | Large data, daily |
| **Differential** | Last diff | Full + last diff | Medium | Compromise |
| **Snapshot** | Snapshot point-in-time | seconds | Copy-on-write | VM, storage array |
| **Continuous (CDC)** | < 1 s | Seconds | Log stream | DB (binlog, WAL) |
| **PITR** | Any point in time | Depends on volume | Full + WAL | RDS, PostgreSQL, SQL Server |
|
||||
|
||||
### Backup immutability

Key protection against ransomware:

| Technique | Description |
|----------|-------|
| **Object Lock (WORM)** | Backup cannot be deleted or overwritten for a defined retention period (S3 Object Lock, Azure Blob Immutable) |
| **Air gap** | Backup is physically separated from the production network (offline disk, tape, cloud without VPN) |
| **Isolated backup network** | Backup traffic goes through a dedicated network without access from production VLAN |
| **Out-of-band access** | Backup management console is not accessible from the production network |
|
||||
|
||||
---
|
||||
|
||||
## DR architectures
|
||||
|
||||
### Multi-AZ (Single region)
|
||||
|
||||
```
|
||||
Region ┌────────────────────────────────────┐
|
||||
│ AZ-1 AZ-2 │
|
||||
│ ┌──────────┐ ┌──────────┐ │
|
||||
│ │ App │ │ App │ │
|
||||
│ └─────┬────┘ └─────┬────┘ │
|
||||
│ │ │ │
|
||||
│ ┌─────▼────────────────▼─────┐ │
|
||||
│ │ Load Balancer (cross-AZ) │ │
|
||||
│ └─────────────┬──────────────┘ │
|
||||
│ │ │
|
||||
│ ┌─────────────▼──────────────┐ │
|
||||
│ │ DB Primary (AZ-1) │ │
|
||||
│ │ DB Standby (AZ-2) │ │
|
||||
│ │ Synchronous replication │ │
|
||||
│ └────────────────────────────┘ │
|
||||
└────────────────────────────────────┘
|
||||
```
|
||||
|
||||
- RTO: minutes (automatic failover)
- RPO: 0 (sync replication)
- Protection: against AZ failure, not region failure
|
||||
|
||||
### Multi-Region
|
||||
|
||||
```
|
||||
Region A (Primary) Region B (DR)
|
||||
┌─────────────────────┐ ┌─────────────────────┐
|
||||
│ ┌───────────────┐ │ │ ┌───────────────┐ │
|
||||
│ │ App + DB │ │ │ │ App + DB │ │
|
||||
│ │ Active │──┼──Async───────┼─►│ Standby │ │
|
||||
│  └───────────────┘  │  replication │  └───────────────┘  │
|
||||
│ │ │ │ │ │
|
||||
│ ┌──────▼───────┐ │ │ ┌──────▼───────┐ │
|
||||
│ │ DNS / GSLB │ │ │ │ DNS / GSLB │ │
|
||||
│ └──────┬───────┘ │ │ └──────┬───────┘ │
|
||||
└─────────┼──────────┘ └─────────┼──────────┘
|
||||
│ │
|
||||
└──────────── Traffic Manager ───────┘
|
||||
```
|
||||
|
||||
| Variant | RTO | RPO | Cost | Failover |
|----------|-----|-----|---------|----------|
| **Active-Passive** | minutes–hours | seconds | Medium | Manual / auto |
| **Active-Active** | seconds | < 1 s | High | Automatic (DNS) |
| **Pilot Light** | tens of minutes | minutes | Low | Manual scaling |
| **Warm Standby** | minutes | seconds | High | Auto (scaled-down copy) |
| **Backup & Restore** | hours | 24 h | Low | Manual |
|
||||
|
||||
### On-prem → Cloud DR (Hybrid)
|
||||
|
||||
```
|
||||
On-prem DC Cloud (DR)
|
||||
┌─────────────────────┐ ┌─────────────────────┐
|
||||
│ ┌───────────────┐ │ │ ┌───────────────┐ │
|
||||
│  │  Application  │  │              │  │  VM / App     │  │
|
||||
│ │ + DB │ │ │ │ + DB replica │ │
|
||||
│ └───────┬───────┘ │ │ └───────┬───────┘ │
|
||||
│ │ │ │ │ │
|
||||
│ ┌───────▼───────┐ │ site-to-site│ ┌───────▼───────┐ │
|
||||
│ │ Backup proxy │──┼────VPN───────┼─►│ Backup store │ │
|
||||
│ └───────────────┘ │ │ └───────────────┘ │
|
||||
│ │ │ │
|
||||
│ ┌───────────────┐ │ │ ┌───────────────┐ │
|
||||
│ │ Tape / NAS │ │ │ │ Veeam / Zerto│ │
|
||||
│ └───────────────┘ │ │ └───────────────┘ │
|
||||
└─────────────────────┘ └─────────────────────┘
|
||||
```
|
||||
|
||||
- **RTO**: tens of minutes (depends on VM startup)
- **RPO**: minutes–hours (depends on replication tool)
- **Tools**: Veeam, Zerto, Azure Site Recovery, AWS MGN, Commvault
- **Use case**: enterprise with on-prem DC that needs DR without a second DC
|
||||
|
||||
---
|
||||
|
||||
## DR testing

### Test types

| Type | Description | Frequency | Risk |
|-----|-------|-----------|--------|
| **Tabletop exercise** | Manual scenario walkthrough, no impact on production | Monthly | None |
| **Walkthrough** | Runbook verification, ensure everyone knows what to do | Quarterly | None |
| **Component test** | Test of a single component (e.g., restore one DB) | Monthly | Low |
| **Integrated test** | Test of the entire stack in isolated environment | Quarterly | Low |
| **Full failover test** | Production failover to DR site | Annually | High |
| **Chaos experiment** | Targeted fault injection into production | Continuous | Medium |
|
||||
|
||||
### Runbook structure

Each DR scenario should have a runbook:

```yaml
scenario: "Region A failure"
triggers:
  - "CloudWatch alarm: Region A health check 5× timeout"
  - "PagerDuty incident P0"
decision_tree: |
  1. Verify: is Region A really unavailable? (check from 3 different locations)
  2. Decide: is RTO at risk? If < 30 % RTO remaining → failover
  3. Failover: run playbook `dr-failover-region-b`
  4. Verification: smoke tests in Region B
  5. Communication: status page + stakeholders
rollback: |
  1. After Region A recovery → replicate changes from B back to A
  2. Repoint DNS to A
  3. Verify data consistency
  4. Shut down Region B (or keep as hot standby)
contacts:
  primary: "on-call@example.com"
  escalation: "infra-lead@example.com"
  management: "vp-engineering@example.com"
```
|
||||
|
||||
---
|
||||
|
||||
## Best practices
|
||||
|
||||
- **Test recovery, not backup** — a backup without tested recovery is not a backup
- **Automate DR** — Terraform / Ansible for DR environment spin-up, DNS failover
- **Document runbooks** — every scenario, contact, decision tree
- **Expect failure** — design for failure, don't expect everything to work
- **Don't underestimate WRT** — service recovery does not mean full operations (data warming, cache, connections)
- **Align RTO/RPO with business** — technical capabilities must match business requirements
- **Monitor SLI** — without data, SLO cannot be verified
- **DR is not just IT** — communication, PR, legal, compliance
|
||||
|
||||
---
|
||||
|
||||
## Related

- [CLOUD.md](CLOUD.md) — cloud DR strategy, AWS/Azure/GCP specific
- [DATACENTERS.md](DATACENTERS.md) — DC redundancy, Tier classification
- [MONITORING.md](MONITORING.md) — alerting, SLI/SLO/SLA
- [CICD.md](CICD.md) — deployment strategy, rollback
- [STORAGE.md](STORAGE.md) — backup storage, replication
|
||||
|
||||
## Sources

Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)

*Last revised: 2026-06-11*
|
||||
275
MESSAGING.en.md
Normal file
@@ -0,0 +1,275 @@
|
||||
# 📨 Messaging and streaming platforms
|
||||
|
||||
## Platform overview
|
||||
|
||||
| Platform | Type | Language | Protocol | Persistence | Use case |
|
||||
|-----------|-----|-------|----------|-------------|----------|
|
||||
| **Apache Kafka** | Distributed event store | Java/Scala | Binary (TCP) | Disk (log) | Event streaming, data pipeline, log aggregation |
|
||||
| **RabbitMQ** | Message broker | Erlang | AMQP 0-9-1, MQTT, STOMP | Disk / RAM | Application messaging, task queue, RPC |
|
||||
| **Apache Pulsar** | Distributed messaging + streaming | Java | Binary (TCP) + REST | Disk (segmented log) | Streaming + queue in one, multi-tenant |
|
||||
| **NATS** | Lightweight messaging | Go | NATS protocol (TCP) | Memory / JetStream (disk) | Microservices, IoT, edge, low-latency |
|
||||
| **AWS SQS** | Managed queue | — | HTTPS | Managed | Decoupling services, serverless |
|
||||
| **AWS SNS** | Managed pub/sub | — | HTTPS, SQS, Lambda, email | Managed | Push notifications, fanout |
|
||||
| **Azure Service Bus** | Managed messaging | — | AMQP, HTTPS | Managed | Enterprise messaging, sessions, transactions |
|
||||
| **Google Pub/Sub** | Managed streaming | — | gRPC, REST | Managed | Event-driven, data pipeline |
|
||||
| **Red Hat AMQ 7** (Artemis) | Message broker | Java | AMQP, MQTT, STOMP, OpenWire | Disk | Enterprise, JMS, high-availability |
|
||||
| **Oracle Service Bus (OSB)** | Enterprise ESB | Java | HTTP/S, JMS, SOAP, REST, MQ, FTP, AQ | Managed (WebLogic) | Enterprise integration, SOA, protocol mediation, routing |
|
||||
|
||||
---
|
||||
|
||||
## Platform details
|
||||
|
||||
### Apache Kafka
|
||||
|
||||
**Architecture:**
|
||||
|
||||
```
|
||||
Producer ──► Topic ──► Partition ──► Consumer Group
|
||||
│
|
||||
├── Partition 0 (Leader) ──► Broker 1
|
||||
├── Partition 1 (Follower) ──► Broker 2
|
||||
└── Partition 2 (Follower) ──► Broker 3
|
||||
```
|
||||
|
||||
| Concept | Description |
|
||||
|---------|-------|
|
||||
| **Topic** | Logical message category |
|
||||
| **Partition** | Append-only log, ordered sequence of messages |
|
||||
| **Broker** | Server in Kafka cluster |
|
||||
| **Producer** | Publishes messages to topic |
|
||||
| **Consumer** | Reads messages from partition (within consumer group) |
|
||||
| **Consumer Group** | Group of consumers sharing topic reading |
|
||||
| **Offset** | Position in partition (tracked by consumer) |
|
||||
| **KRaft** | Raft-based controller quorum that replaces ZooKeeper (production-ready since Kafka 3.3; ZooKeeper mode removed in Kafka 4.0) |
|
||||
|
||||
**Replication and HA:**
|
||||
|
||||
| Parameter | Value |
|
||||
|----------|---------|
|
||||
| Replication factor | 2–3 (typically 3 for production) |
|
||||
| ISR (In-Sync Replicas) | Number of replicas keeping up with leader |
|
||||
| Min ISR | Minimum ISR for acknowledging writes (acks=all) |
|
||||
| acks=0 | Fire-and-forget (fastest, possible data loss) |
|
||||
| acks=1 | Write acknowledged by leader (compromise) |
|
||||
| acks=all | Write acknowledged by all ISR (safest) |
|
||||
| Leader failover | Automatic election of new leader from ISR |
|
||||
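How `acks` and `min.insync.replicas` interact can be sketched as a small decision function. This models the behaviour described in the table and is not Kafka client code; the function names are hypothetical.

```python
def write_accepted(acks, isr_size, min_insync_replicas):
    """Model whether the partition leader accepts a produce request."""
    if acks in ("0", "1"):
        # min.insync.replicas is not enforced; a live leader is enough.
        return isr_size >= 1
    if acks == "all":
        # The leader rejects the write when the ISR has shrunk below the minimum.
        return isr_size >= min_insync_replicas
    raise ValueError(f"unknown acks setting: {acks!r}")


def tolerated_broker_failures(replication_factor, min_insync_replicas):
    """Replicas that may fail while acks=all writes are still accepted."""
    return replication_factor - min_insync_replicas
```

With the common production setting of replication factor 3 and `min.insync.replicas=2`, one replica can be lost without blocking `acks=all` producers. Losing a second makes the partition reject those writes rather than risk data loss.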
|
||||
**Important configuration:**
|
||||
|
||||
```properties
|
||||
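# Production
# Broker default for auto-created topics. For explicitly created topics the
# replication factor is set per topic at creation time, not in this file.
default.replication.factor=3
min.insync.replicas=2

# Retention
# Comments must be on their own line: in a .properties file a trailing
# "# ..." becomes part of the value.
# 7 days
log.retention.hours=168
# -1 = unlimited (or set a byte limit)
log.retention.bytes=-1
# 1 GB per segment
log.segment.bytes=1073741824

# Performance
# Default partition count for new topics; adjust per need (scale-out)
num.partitions=3
# One of: snappy, gzip, lz4, zstd
compression.type=snappy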
|
||||
```
|
||||
|
||||
**Partitioning strategies:**
|
||||
|
||||
| Strategy | Key | Advantage | Disadvantage |
|
||||
|----------|------|--------|----------|
|
||||
| Round-robin | null | Even distribution | Per-key ordering lost |
|
||||
| Key-based | user_id, order_id | Same key → same partition | Uneven distribution (hot keys) |
|
||||
| Custom partitioner | Custom logic | Per use-case optimization | More complex maintenance |
|
||||
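Key-based partitioning boils down to hashing the key and taking the result modulo the partition count. The sketch below uses `zlib.crc32` as a stand-in hash; Kafka's default partitioner uses murmur2, so the actual partition numbers will differ. The function names are made up for the example.

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    """Map a message key to a partition (stand-in hash, not Kafka's murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def distribution(keys, num_partitions):
    """Count how many messages land on each partition."""
    return Counter(partition_for(k, num_partitions) for k in keys)
```

The same key always maps to the same partition, which preserves per-key ordering. A single very frequent key still lands on one partition, which is the hot-key problem named in the table. Changing the partition count changes the mapping for existing keys.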
|
||||
### RabbitMQ
|
||||
|
||||
**Architecture:**
|
||||
|
||||
```
|
||||
Producer ──► Exchange ──► Binding ──► Queue ──► Consumer
|
||||
│
|
||||
┌───────────┼───────────┐
|
||||
▼ ▼ ▼
|
||||
Direct Topic Fanout
|
||||
Exchange Exchange Exchange
|
||||
```
|
||||
|
||||
| Concept | Description |
|
||||
|---------|-------|
|
||||
| **Exchange** | Receives messages from producer, routes to queue |
|
||||
| **Binding** | Exchange → queue link with routing key |
|
||||
| **Queue** | FIFO message queue (consumed by consumer) |
|
||||
| **Virtual Host (vhost)** | Tenant isolation within a single cluster |
|
||||
| **Publisher Confirm** | Broker acknowledges message receipt |
|
||||
| **Consumer Ack** | Consumer acknowledges message processing |
|
||||
|
||||
**Exchange types:**
|
||||
|
||||
| Type | Routing | Use case |
|
||||
|-----|---------|----------|
|
||||
| **Direct** | routing_key = binding_key | Task queue, point-to-point |
|
||||
| **Topic** | routing_key match binding pattern (wildcard `*`, `#`) | Pub/sub with filtering |
|
||||
| **Fanout** | All bound queues | Broadcast, event notification |
|
||||
| **Headers** | AMQP headers match | Complex routing (not routing key dependent) |
|
||||
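The wildcard rules of a topic exchange can be illustrated with a small matcher: words are separated by dots, `*` matches exactly one word, and `#` matches zero or more words. This is an illustrative reimplementation of the matching rule, not RabbitMQ's own code; `topic_matches` is a hypothetical name.

```python
def topic_matches(pattern, routing_key):
    """Return True if an AMQP topic binding pattern matches the routing key."""
    p = pattern.split(".")
    k = routing_key.split(".") if routing_key else []

    def match(i, j):
        if i == len(p):
            return j == len(k)  # pattern exhausted: key must be exhausted too
        if p[i] == "#":
            # '#' may absorb zero or more words.
            return any(match(i + 1, j2) for j2 in range(j, len(k) + 1))
        if j == len(k):
            return False  # pattern still expects a word, key has none left
        return (p[i] == "*" or p[i] == k[j]) and match(i + 1, j + 1)

    return match(0, 0)
```

A queue bound with `orders.*.created` receives `orders.eu.created` but not `orders.eu.vip.created`, while a binding of `orders.#` receives both.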
|
||||
**Queue types:**
|
||||
|
||||
```properties
|
||||
# Classic Queue (not replicated; classic mirrored queues are deprecated and were removed in RabbitMQ 4.0)
|
||||
x-queue-type: classic
|
||||
|
||||
# Quorum Queue (recommended for production)
|
||||
x-queue-type: quorum
|
||||
x-quorum-initial-group-size: 3
|
||||
x-dead-letter-exchange: dlx
|
||||
|
||||
# Stream Queue (for large backlogs)
|
||||
x-queue-type: stream
|
||||
x-max-length-bytes: 1073741824
|
||||
```
|
||||
|
||||
**HA and clustering:**
|
||||
|
||||
| Mode | Description | Use case |
|
||||
|-------|-------|----------|
|
||||
| **Quorum Queues** | Raft-based replication (3–5 node), auto failover | Production, HA messaging |
|
||||
| **Federation** | Async message forwarding between independent RabbitMQ clusters | Multi-region, DR |
|
||||
| **Shovel** | Point-to-point message forwarding (Federation at queue level) | Migration, specific routing |
|
||||
| **Warm Standby (DR)** | Standby cluster that takes over on failover | Site-level DR |
|
||||
|
||||
### Apache Pulsar
|
||||
|
||||
**Unique architecture (compute/storage separation):**
|
||||
|
||||
```
|
||||
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
|
||||
│ Producer │ │ Consumer │ │ Consumer │
|
||||
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
|
||||
│ │ │
|
||||
┌──────▼───────────────────▼───────────────────▼──────┐
|
||||
│ Broker (stateless) │
|
||||
│ Subscription: Exclusive / Shared / Failover │
|
||||
└──────────────────────┬──────────────────────────────┘
|
||||
│
|
||||
┌──────────────────────▼──────────────────────────────┐
|
||||
│ BookKeeper (stateful storage) │
|
||||
│ ├── Bookie 1 ├── Bookie 2 ├── Bookie 3 ├── ... │
|
||||
│ └── Ledger (append-only, segmented log) │
|
||||
└─────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
| Concept | Description |
|
||||
|---------|-------|
|
||||
| **Topic** | Logical category (partitioned or non-partitioned) |
|
||||
| **Subscription** | Delivery mode (Exclusive, Shared, Failover, Key_Shared) |
|
||||
| **Ledger** | Storage unit in BookKeeper (append-only) |
|
||||
| **Bookie** | Storage node (BookKeeper) |
|
||||
| **Managed Ledger** | Segmented log with cache and retention |
|
||||
|
||||
**Advantages over Kafka:**
|
||||
- Compute/storage separation — independent scaling
|
||||
- Geo-replication built-in (native)
|
||||
- Multi-tenant (namespaces, isolation)
|
||||
- TTL, retry, dead letter topic (built-in)
|
||||
- At-least-once / effectively-once delivery
|
||||
|
||||
### NATS
|
||||
|
||||
| Feature | Description |
|
||||
|---------|-------|
|
||||
| **Core NATS** | Pub/sub, request-reply, < 1 ms latency |
|
||||
| **JetStream** | Persistence, exactly-once, key-value store, object store |
|
||||
| **Leaf nodes** | Hierarchical cluster connection |
|
||||
| **Super-cluster** | Multi-region clustering (global) |
|
||||
|
||||
**Use case:** IoT, edge computing, microservices communication, low-latency messaging.
|
||||
|
||||
### Oracle Service Bus (OSB)

Part of Oracle SOA Suite, runs on WebLogic Server. Enterprise service bus for integration in Oracle-heavy environments.

| Concept | Description |
|---------|-------------|
| **Proxy Service** | Inbound endpoint (HTTP, JMS, MQ, SOAP, REST) |
| **Business Service** | Target backend service |
| **Pipeline** | Message processing — routing, transformation, validation |
| **Split-Join** | Parallel/sequential orchestration of multiple services |
| **Reporting** | Message tracking, SLA monitoring |

**Key features:**
- **Protocol mediation** — translation between SOAP/REST/JMS/MQ/FTP
- **Message transformation** — XSLT, XQuery, MFL (non-XML)
- **Throttling, SLA, alerting** — built-in
- **Oracle AQ (Advanced Queuing)** — integration with Oracle DB queues
- **XPath, XQuery, XSLT 2.0/3.0** — native support
- **Error handling** — fault policies, error queues, retry

**Use case:** Enterprise SOA, Oracle DB → Kafka bridging, legacy mainframe wrapping, B2B integration.

**Alternatives:** IBM Integration Bus (IIB), MuleSoft Anypoint, WSO2 EI, Apache Camel / ServiceMix.

---

## Platform comparison

### Performance and scaling

| Platform | Max throughput | Latency (P99) | Messages/s (1 broker) | Scaling |
|----------|----------------|---------------|-----------------------|---------|
| **Kafka** | > 1 GB/s | 2–10 ms | ~1,000,000 | Partitions (horizontal) |
| **Pulsar** | > 1 GB/s | 5–15 ms | ~1,000,000 | Brokers + Bookies |
| **RabbitMQ** | ~100 MB/s | < 1 ms (RAM) | ~100,000 | Clustering (node) |
| **NATS** | > 10 GB/s | < 0.5 ms | ~10,000,000 | Clustering + Leaf nodes |
| **OSB** | < 1 GB/s | 10–100 ms | ~10,000 | Vertical (WebLogic cluster) |

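
As a rough sanity check on the table above, dividing throughput by message rate gives the average message size each row implies. The sketch below is plain arithmetic on the table's own order-of-magnitude figures, with bounds such as "> 1 GB/s" treated as point values; the function name `implied_message_size` is illustrative and the result is not a benchmark.

```python
def implied_message_size(bytes_per_second, messages_per_second):
    """Average message size in bytes implied by a throughput and a message rate."""
    return bytes_per_second / messages_per_second


# Figures copied from the table; "> 1 GB/s" and "~100 MB/s" are treated as point values.
rows = {
    "Kafka": (1_000_000_000, 1_000_000),
    "RabbitMQ": (100_000_000, 100_000),
    "NATS": (10_000_000_000, 10_000_000),
}

for platform, (throughput, rate) in rows.items():
    size = implied_message_size(throughput, rate)
    print(f"{platform}: ~{size:.0f} bytes per message")
```

Each of these rows works out to roughly 1 KB per message, so the messages/s column is best read as assuming payloads of about that size.
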
### Delivery guarantees

| Platform | At most once | At least once | Exactly once | Ordering |
|----------|--------------|---------------|--------------|----------|
| **Kafka** | Yes | Yes (`acks=all` + `min.insync.replicas`) | Yes (idempotent + transactional) | Per partition |
| **Pulsar** | Yes | Yes | Yes (dedup + transactional) | Per partition |
| **RabbitMQ** | Yes | Yes (Publisher Confirm + Consumer Ack) | Limited | Per queue |
| **NATS** | Yes | Yes (JetStream) | Limited | Per subject |
| **OSB** | Yes | Yes (XA transactions) | Yes (XA + WS-AT) | Per pipeline |

### When to use what

| Use case | Recommended platform | Reasoning |
|----------|----------------------|-----------|
| **Event sourcing / audit log** | Kafka, Pulsar | Append-only log, high throughput, replay |
| **CDC (Change Data Capture)** | Kafka (Kafka Connect + Debezium) | Connector ecosystem |
| **Task queue (job processing)** | RabbitMQ, SQS | Dead letter, retry, priority, scheduling |
| **API messaging / microservices** | NATS, RabbitMQ | Low latency, simplicity |
| **Data pipeline (ETL)** | Kafka (KSQL, Kafka Streams) | Stream processing in platform |
| **IoT / Edge** | NATS, MQTT (RabbitMQ) | Lightweight, leaf nodes |
| **Enterprise SOA / EAI** | OSB, IBM IIB, MuleSoft | Protocol mediation, XA, B2B, legacy wrapping |
| **Multi-tenant cloud** | Pulsar | Native multi-tenant, geo-replication |
| **Serverless / event-driven** | SQS/SNS, Pub/Sub | Managed, auto-scaling |

---

## DR and high availability

See [DATACENTERS.en.md](DATACENTERS.en.md) — section "Impact of individual technologies on DC topology selection" for detailed DR mapping per platform.

### Best practices

- **Don't lose messages in the queue** — use acknowledgement-based consumption (manual ack, not auto-ack)
- **Dead letter queue** — give every main queue a DLQ for undeliverable messages
- **Monitor lag** — consumer lag is a key metric (Kafka: `records-lag-max` in the consumer's `kafka.consumer:type=consumer-fetch-manager-metrics` MBean)
- **Idempotent consumer** — the same message may be delivered more than once
- **Retry with backoff** — exponential backoff on processing failure
- **Schema registry** — avoid deserialization errors (Avro, Protobuf, JSON Schema)
- **Encryption** — TLS in transit, encryption at rest (Kafka has no built-in at-rest encryption; use disk/volume encryption or client-side payload encryption)
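
Three of the practices above (idempotent consumer, retry with backoff, dead letter queue) fit together in a few lines. The following is a minimal, broker-agnostic sketch: the class `IdempotentConsumer`, the in-memory `seen` set, and the `dead_letters` list are illustrative stand-ins, not any platform's API. A real deployment would keep processed IDs in a durable store and publish failures to an actual DLQ.

```python
import random
import time


def backoff_delays(retries, base=0.5, cap=30.0):
    """Un-jittered delay in seconds before each retry: base * 2**attempt, capped."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]


class IdempotentConsumer:
    """Processes each message ID once and retries failing handlers with backoff."""

    def __init__(self, handler, retries=3, sleep=time.sleep):
        self.handler = handler
        self.retries = retries
        self.sleep = sleep
        self.seen = set()        # assumption: in production this is a durable store
        self.dead_letters = []   # assumption: stand-in for a dead letter queue

    def consume(self, message_id, payload):
        if message_id in self.seen:
            return "duplicate"   # redelivery: acknowledge without reprocessing
        delays = backoff_delays(self.retries)
        for attempt in range(self.retries + 1):
            try:
                self.handler(payload)
            except Exception:
                if attempt == self.retries:
                    self.dead_letters.append((message_id, payload))
                    return "dead-lettered"
                # Full jitter keeps many consumers from retrying in lockstep.
                self.sleep(random.uniform(0, delays[attempt]))
            else:
                self.seen.add(message_id)
                return "processed"
```

The message is marked as seen only after the handler succeeds, so a crash in between leads to a redelivery rather than a lost message, which is the at-least-once trade-off described above.
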

---

## Related

- [DATACENTERS.en.md](DATACENTERS.en.md) — DR topology, per-platform mapping
- [CLOUD.en.md](CLOUD.en.md) — managed messaging (SQS, SNS, Service Bus, Pub/Sub)

## Sources

Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)

*Last revision: 2026-06-12*
275
MESSAGING.md
Normal file
@@ -0,0 +1,275 @@
# 📨 Messaging and streaming platforms

## Platform overview

| Platform | Type | Language | Protocol | Persistence | Use case |
|----------|------|----------|----------|-------------|----------|
| **Apache Kafka** | Distributed event store | Java/Scala | Binary (TCP) | Disk (log) | Event streaming, data pipeline, log aggregation |
| **RabbitMQ** | Message broker | Erlang | AMQP 0-9-1, MQTT, STOMP | Disk / RAM | Application messaging, task queue, RPC |
| **Apache Pulsar** | Distributed messaging + streaming | Java | Binary (TCP) + REST | Disk (segmented log) | Streaming + queue in one, multi-tenant |
| **NATS** | Lightweight messaging | Go | NATS protocol (TCP) | Memory / JetStream (disk) | Microservices, IoT, edge, low-latency |
| **AWS SQS** | Managed queue | — | HTTPS | Managed | Decoupling services, serverless |
| **AWS SNS** | Managed pub/sub | — | HTTPS, SQS, Lambda, email | Managed | Push notifications, fanout |
| **Azure Service Bus** | Managed messaging | — | AMQP, HTTPS | Managed | Enterprise messaging, sessions, transactions |
| **Google Pub/Sub** | Managed streaming | — | gRPC, REST | Managed | Event-driven, data pipeline |
| **Red Hat AMQ 7** (Artemis) | Message broker | Java | AMQP, MQTT, STOMP, OpenWire | Disk | Enterprise, JMS, high-availability |
| **Oracle Service Bus (OSB)** | Enterprise ESB | Java | HTTP/S, JMS, SOAP, REST, MQ, FTP, AQ | Managed (WebLogic) | Enterprise integration, SOA, protocol mediation, routing |

---

## Platform details

### Apache Kafka

**Architecture:**

```
Producer ──► Topic ──► Partition ──► Consumer Group
              │
              ├── Partition 0 (Leader) ──► Broker 1
              ├── Partition 1 (Follower) ──► Broker 2
              └── Partition 2 (Follower) ──► Broker 3
```

| Concept | Description |
|---------|-------------|
| **Topic** | Logical category of messages |
| **Partition** | Append-only log, ordered sequence of messages |
| **Broker** | Server in a Kafka cluster |
| **Producer** | Publishes messages to a topic |
| **Consumer** | Reads messages from partitions (as part of a consumer group) |
| **Consumer Group** | Group of consumers that share the reading of a topic |
| **Offset** | Position in a partition (tracked by the consumer) |
| **KRaft** | Controller quorum (replaces ZooKeeper; production-ready since Kafka 3.3) |

**Replication and HA:**

| Parameter | Value |
|-----------|-------|
| Replication factor | 2–3 (typically 3 in production) |
| ISR (In-Sync Replicas) | Replicas that are keeping up with the leader |
| Min ISR | Minimum number of in-sync replicas required to acknowledge a write (with acks=all) |
| acks=0 | Fire-and-forget (fastest, data loss possible) |
| acks=1 | Write acknowledged by the leader only (compromise) |
| acks=all | Write acknowledged by all in-sync replicas (safest) |
| Leader failover | Automatic election of a new leader from the ISR |

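
How `acks=all` interacts with the minimum ISR can be sketched as a small availability check. This is a simplified model rather than Kafka code: it assumes every surviving replica is in sync, and the function names are illustrative.

```python
def write_available(replication_factor, min_insync_replicas, failed_brokers):
    """True if a producer using acks=all can still write to the partition.

    Simplified model: every replica on a surviving broker counts as in sync.
    """
    in_sync = replication_factor - failed_brokers
    return in_sync >= 1 and in_sync >= min_insync_replicas


def tolerated_failures(replication_factor, min_insync_replicas):
    """Replica failures a partition survives while still accepting acks=all writes."""
    return replication_factor - min_insync_replicas


for failed in range(3):
    state = "accepts" if write_available(3, 2, failed) else "rejects"
    print(f"RF=3, min ISR=2, {failed} broker(s) down: {state} acks=all writes")
```

With a replication factor of 3 and a minimum ISR of 2, one broker can fail without blocking `acks=all` producers; a second failure makes the partition reject such writes until a replica rejoins the ISR.
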
**Key configuration:**

```properties
# Comments go on their own lines: .properties files treat a trailing
# "# ..." as part of the value.

# Production (broker defaults). The replication factor of a topic is set
# when the topic is created, e.g. kafka-topics.sh --replication-factor 3.
default.replication.factor=3
min.insync.replicas=2

# Retention
# 168 h = 7 days
log.retention.hours=168
# -1 = unlimited (or set a byte limit per partition)
log.retention.bytes=-1
# 1 GiB per segment
log.segment.bytes=1073741824

# Performance
# Default partition count for auto-created topics; size for scale-out
num.partitions=3
# One of: producer (broker default), gzip, snappy, lz4, zstd
compression.type=snappy
```

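
The retention and replication settings translate directly into disk capacity. The sketch below is a back-of-the-envelope estimate only; the 10 MB/s ingress rate is a hypothetical workload, not a figure from this document, and compression and index overhead are ignored.

```python
def retention_disk_bytes(ingress_bytes_per_s, retention_hours, replication_factor):
    """Cluster-wide disk needed to hold `retention_hours` of data, before compression."""
    return ingress_bytes_per_s * retention_hours * 3600 * replication_factor


# Hypothetical workload: 10 MB/s produced, 7 days retention, replication factor 3.
needed = retention_disk_bytes(10_000_000, retention_hours=168, replication_factor=3)
print(f"{needed / 1e12:.1f} TB")
```

Under these assumptions the cluster needs about 18 TB in total, which is why time-based retention is often paired with a byte limit on high-volume topics.
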
**Partitioning strategies:**

| Strategy | Key | Advantage | Disadvantage |
|----------|-----|-----------|--------------|
| Round-robin | null | Even distribution | No per-key ordering |
| Key-based | user_id, order_id | Messages with the same key → same partition | Uneven distribution (hot keys) |
| Custom partitioner | Custom logic | Optimized per use case | Harder to maintain |

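
Key-based partitioning and its hot-key drawback can be illustrated with a stable hash. This is a sketch under an explicit assumption: it uses CRC32 from the standard library for brevity, whereas Kafka's default partitioner hashes keys with murmur2, so the partition numbers here will not match a real cluster.

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (CRC32 here, not Kafka's murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def distribution(keys, num_partitions):
    """Count how many messages land on each partition."""
    return Counter(partition_for(key, num_partitions) for key in keys)


# Hypothetical traffic: one hot key produces 90% of all messages.
keys = ["hot-customer"] * 90 + [f"user-{i}" for i in range(10)]
print(distribution(keys, num_partitions=3))
```

The same key always maps to the same partition, which preserves per-key ordering, but the partition holding the hot key receives at least 90 of the 100 messages. The mapping also changes when the partition count changes, so adding partitions to a keyed topic breaks ordering for existing keys.
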
### RabbitMQ

**Architecture:**

```
Producer ──► Exchange ──► Binding ──► Queue ──► Consumer
                │
    ┌───────────┼───────────┐
    ▼           ▼           ▼
 Direct       Topic      Fanout
Exchange    Exchange    Exchange
```

| Concept | Description |
|---------|-------------|
| **Exchange** | Receives messages from producers and routes them to queues |
| **Binding** | Exchange → queue link with a routing key |
| **Queue** | FIFO message queue (read by consumers) |
| **Virtual Host (vhost)** | Tenant isolation within a single cluster |
| **Publisher Confirm** | Acknowledgement that the broker accepted the message |
| **Consumer Ack** | Acknowledgement that the consumer processed the message |

**Exchange types:**

| Type | Routing | Use case |
|------|---------|----------|
| **Direct** | routing_key = binding_key | Task queue, point-to-point |
| **Topic** | routing_key matches binding pattern (wildcards `*`, `#`) | Pub/sub with filtering |
| **Fanout** | To all bound queues | Broadcast, event notification |
| **Headers** | AMQP headers match | Complex routing (independent of the routing key) |

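
The topic exchange wildcards follow the AMQP 0-9-1 rules: `*` stands for exactly one dot-separated word and `#` for zero or more words. The function below is an illustrative re-implementation of that matching rule for experimenting with binding patterns; it is not RabbitMQ's own code, and `topic_matches` is a hypothetical name.

```python
def topic_matches(pattern, routing_key):
    """AMQP 0-9-1 topic matching: '*' = exactly one word, '#' = zero or more words."""

    def match(p, k):
        if not p:
            return not k          # pattern exhausted: match only if key is too
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' may absorb anywhere from 0 to len(k) words
            return any(match(rest, k[i:]) for i in range(len(k) + 1))
        if not k:
            return False          # pattern still expects a word, key has none left
        if head == "*" or head == k[0]:
            return match(rest, k[1:])
        return False

    return match(pattern.split("."), routing_key.split("."))


print(topic_matches("orders.*.created", "orders.eu.created"))
print(topic_matches("orders.*", "orders.eu.created"))
print(topic_matches("orders.#", "orders.eu.created"))
```

A binding such as `orders.#` therefore receives everything under `orders`, including the bare key `orders`, while `orders.*` only receives keys with exactly one more word.
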
**Queue types:**

```properties
# Classic queue (single node, not replicated; classic mirrored queues were removed in RabbitMQ 4.0)
x-queue-type: classic

# Quorum queue (recommended for production)
x-queue-type: quorum
x-quorum-initial-group-size: 3
x-dead-letter-exchange: dlx

# Stream (for large backlogs)
x-queue-type: stream
x-max-length-bytes: 1073741824
```

**HA and clustering:**

| Mode | Description | Use case |
|------|-------------|----------|
| **Quorum Queues** | Raft-based replication (3–5 nodes), automatic failover | Production, HA messaging |
| **Federation** | Asynchronous forwarding of messages between independent RabbitMQ clusters | Multi-region, DR |
| **Shovel** | Point-to-point message forwarding (like Federation, at queue level) | Migration, specific routing |
| **Warm Standby (DR)** | Second cluster, started only on failover | Cold DR |

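
Because quorum queues replicate with Raft, a queue stays available only while a majority of its members is reachable. The arithmetic behind the "3–5 nodes" recommendation can be sketched as follows; the function names are illustrative.

```python
def quorum_size(members):
    """Smallest majority of a Raft group with `members` nodes."""
    return members // 2 + 1


def tolerated_node_failures(members):
    """Node failures the group survives while keeping a majority."""
    return (members - 1) // 2


for members in (3, 4, 5):
    print(
        f"{members} members: quorum {quorum_size(members)}, "
        f"tolerates {tolerated_node_failures(members)} failure(s)"
    )
```

A fourth member raises the quorum to three without tolerating any extra failure, which is why odd group sizes are the usual choice.
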
### Apache Pulsar

**Unique architecture (compute/storage separation):**

```
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Producer   │    │   Consumer   │    │   Consumer   │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
┌──────▼───────────────────▼───────────────────▼──────┐
│                 Broker (stateless)                  │
│     Subscription: Exclusive / Shared / Failover     │
└──────────────────────┬──────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────┐
│            BookKeeper (stateful storage)            │
│  ├── Bookie 1  ├── Bookie 2  ├── Bookie 3  ├── ...  │
│  └── Ledger (append-only, segmented log)            │
└─────────────────────────────────────────────────────┘
```

| Concept | Description |
|---------|-------------|
| **Topic** | Logical category (partitioned or non-partitioned) |
| **Subscription** | Delivery mode (Exclusive, Shared, Failover, Key_Shared) |
| **Ledger** | Storage unit in BookKeeper (append-only) |
| **Bookie** | Storage node (BookKeeper) |
| **Managed Ledger** | Segmented log with cache and retention |

**Advantages over Kafka:**
- Compute/storage separation — independent scaling
- Geo-replication built-in (native)
- Multi-tenant (namespaces, isolation)
- TTL, retry, dead letter topic (built-in)
- At-least-once delivery by default; effectively-once with broker-side deduplication

### NATS

| Feature | Description |
|---------|-------------|
| **Core NATS** | Pub/sub, request-reply, < 1 ms latency |
| **JetStream** | Persistence, exactly-once, key-value store, object store |
| **Leaf nodes** | Hierarchical cluster connection |
| **Super-cluster** | Multi-region clustering (global) |

**Use case:** IoT, edge computing, microservices communication, low-latency messaging.

### Oracle Service Bus (OSB)

Part of Oracle SOA Suite, runs on WebLogic Server. Enterprise service bus for integration in Oracle-heavy environments.

| Concept | Description |
|---------|-------------|
| **Proxy Service** | Inbound endpoint (HTTP, JMS, MQ, SOAP, REST) |
| **Business Service** | Target backend service |
| **Pipeline** | Message processing — routing, transformation, validation |
| **Split-Join** | Parallel/sequential orchestration of multiple services |
| **Reporting** | Message tracking, SLA monitoring |

**Key features:**
- **Protocol mediation** — translation between SOAP/REST/JMS/MQ/FTP
- **Message transformation** — XSLT, XQuery, MFL (non-XML)
- **Throttling, SLA, alerting** — built-in
- **Oracle AQ (Advanced Queuing)** — integration with Oracle DB queues
- **XPath, XQuery, XSLT 2.0/3.0** — native support
- **Error handling** — fault policies, error queues, retry

**Use case:** Enterprise SOA, Oracle DB → Kafka bridging, legacy mainframe wrapping, B2B integration.

**Alternatives:** IBM Integration Bus (IIB), MuleSoft Anypoint, WSO2 EI, Apache Camel / ServiceMix.

---

## Platform comparison

### Performance and scaling

| Platform | Max throughput | Latency (P99) | Messages/s (1 broker) | Scaling |
|----------|----------------|---------------|-----------------------|---------|
| **Kafka** | > 1 GB/s | 2–10 ms | ~1,000,000 | Partitions (horizontal) |
| **Pulsar** | > 1 GB/s | 5–15 ms | ~1,000,000 | Brokers + Bookies |
| **RabbitMQ** | ~100 MB/s | < 1 ms (RAM) | ~100,000 | Clustering (node) |
| **NATS** | > 10 GB/s | < 0.5 ms | ~10,000,000 | Clustering + Leaf nodes |
| **OSB** | < 1 GB/s | 10–100 ms | ~10,000 | Vertical (WebLogic cluster) |

### Delivery guarantees

| Platform | At most once | At least once | Exactly once | Ordering |
|----------|--------------|---------------|--------------|----------|
| **Kafka** | Yes | Yes (`acks=all` + `min.insync.replicas`) | Yes (idempotent + transactional) | Per partition |
| **Pulsar** | Yes | Yes | Yes (dedup + transactional) | Per partition |
| **RabbitMQ** | Yes | Yes (Publisher Confirm + Consumer Ack) | Limited | Per queue |
| **NATS** | Yes | Yes (JetStream) | Limited | Per subject |
| **OSB** | Yes | Yes (XA transactions) | Yes (XA + WS-AT) | Per pipeline |

### When to use what

| Use case | Recommended platform | Reasoning |
|----------|----------------------|-----------|
| **Event sourcing / audit log** | Kafka, Pulsar | Append-only log, high throughput, replay |
| **CDC (Change Data Capture)** | Kafka (Kafka Connect + Debezium) | Connector ecosystem |
| **Task queue (job processing)** | RabbitMQ, SQS | Dead letter, retry, priority, scheduling |
| **API messaging / microservices** | NATS, RabbitMQ | Low latency, simplicity |
| **Data pipeline (ETL)** | Kafka (KSQL, Kafka Streams) | Stream processing in platform |
| **IoT / Edge** | NATS, MQTT (RabbitMQ) | Lightweight, leaf nodes |
| **Enterprise SOA / EAI** | OSB, IBM IIB, MuleSoft | Protocol mediation, XA, B2B, legacy wrapping |
| **Multi-tenant cloud** | Pulsar | Native multi-tenant, geo-replication |
| **Serverless / event-driven** | SQS/SNS, Pub/Sub | Managed, auto-scaling |

---

## DR and high availability

See [DATACENTERS.md](DATACENTERS.md) — section "Impact of individual technologies on DC topology selection" for detailed DR mapping per platform.

### Best practices

- **Don't lose messages in the queue** — use acknowledgement-based consumption (manual ack, not auto-ack)
- **Dead letter queue** — give every main queue a DLQ for undeliverable messages
- **Monitor lag** — consumer lag is a key metric (Kafka: `records-lag-max` in the consumer's `kafka.consumer:type=consumer-fetch-manager-metrics` MBean)
- **Idempotent consumer** — the same message may be delivered more than once
- **Retry with backoff** — exponential backoff on processing failure
- **Schema registry** — avoid deserialization errors (Avro, Protobuf, JSON Schema)
- **Encryption** — TLS in transit, encryption at rest (Kafka has no built-in at-rest encryption; use disk/volume encryption or client-side payload encryption)

---

## Related

- [DATACENTERS.md](DATACENTERS.md) — DR topology, per-platform mapping
- [CLOUD.md](CLOUD.md) — managed messaging (SQS, SNS, Service Bus, Pub/Sub)

## Sources

Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)

*Last revision: 2026-06-12*
@@ -52,9 +52,10 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
| 🌐 Network architecture | [NETWORKING.md](NETWORKING.md) | DNS, BGP, VPC, Zero Trust, EVPN VXLAN, TLS | CLOUD |
| 📊 Monitoring & observability | [MONITORING.md](MONITORING.md) | Prometheus, Grafana, OTel, logging, alerting | — |
| 🔄 CI/CD & DevOps | [CICD.md](CICD.md) | Pipelines, GitOps, IaC (Terraform), deployment | — |
| 🔄 Disaster Recovery | [DR.md](DR.md) | RTO, RPO, scenarios, prevention, uptime calculation | CLOUD, DATACENTERS, MONITORING |
| 🗄️ Database architecture | [DATABASES.md](DATABASES.md) | Classification, sharding, replication, caching | POSTGRESQL, MYSQL, ORACLE, MONGODB, REDIS, CASSANDRA, VEKTOROVE-DB, DATABAZOVE-ENGINY |
| 🖥️ Hypervisors | [HYPERVISORS.md](HYPERVISORS.md) | VMware, Hyper-V, KVM, Proxmox, migration | STORAGE, SERVER-HW |
| 🏭 Data centers | [DATACENTERS.md](DATACENTERS.md) | Tier, power, cooling, layout, DC services | MONITORING |
| 🏭 Data centers | [DATACENTERS.md](DATACENTERS.md) | Tier, power, cooling, layout, DC services, secondary DC topologies | MONITORING |
| 💾 Storage | [STORAGE.md](STORAGE.md) | SAN/NAS/object, RAID, SDS, Ceph, OpenStack Cinder/Swift/Manila | — |
| 🔌 Server connectivity | [CONNECTIVITY.md](CONNECTIVITY.md) | Ethernet, FC SAN, iSCSI, NVMe-oF, SAS | — |
| 🔧 Server hardware | [SERVER-HW.md](SERVER-HW.md) | CPU, RAM, PCIe, NUMA, BMC | CONNECTIVITY |
@@ -89,9 +90,10 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
| 🌐 Network architecture | [NETWORKING.en.md](NETWORKING.en.md) | DNS, BGP, VPC, Zero Trust, EVPN VXLAN, TLS | CLOUD |
| 📊 Monitoring & observability | [MONITORING.en.md](MONITORING.en.md) | Prometheus, Grafana, OTel, logging, alerting | — |
| 🔄 CI/CD & DevOps | [CICD.en.md](CICD.en.md) | Pipelines, GitOps, IaC (Terraform), deployment | — |
| 🔄 Disaster Recovery | [DR.en.md](DR.en.md) | RTO, RPO, scenarios, prevention, uptime calculation | CLOUD, DATACENTERS, MONITORING |
| 🗄️ Database architecture | [DATABASES.en.md](DATABASES.en.md) | Classification, sharding, replication, caching | POSTGRESQL, MYSQL, ORACLE, MONGODB, REDIS, CASSANDRA, VECTOR-DBS, DATABASE-ENGINES |
| 🖥️ Hypervisors | [HYPERVISORS.en.md](HYPERVISORS.en.md) | VMware, Hyper-V, KVM, Proxmox, migration | STORAGE, SERVER-HW |
| 🏭 Data centers | [DATACENTERS.en.md](DATACENTERS.en.md) | Tier, power, cooling, layout, DC services | MONITORING |
| 🏭 Data centers | [DATACENTERS.en.md](DATACENTERS.en.md) | Tier, power, cooling, layout, DC services, secondary DC topologies | MONITORING |
| 💾 Storage | [STORAGE.en.md](STORAGE.en.md) | SAN/NAS/object, RAID, SDS, Ceph | — |
| 🔌 Server connectivity | [CONNECTIVITY.en.md](CONNECTIVITY.en.md) | Ethernet, FC SAN, iSCSI, NVMe-oF, SAS | — |
| 🔧 Server hardware | [SERVER-HW.en.md](SERVER-HW.en.md) | CPU, RAM, PCIe, NUMA, BMC | CONNECTIVITY |
@@ -136,6 +138,7 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
| `DATACENTERS.md` / `DATACENTERS.en.md` | [`MONITORING.md`](MONITORING.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
| `MONITORING.md` / `MONITORING.en.md` | [`sources/monitoring/sources.md`](sources/monitoring/sources.md) |
| `CICD.md` / `CICD.en.md` | [`sources/cicd/sources.md`](sources/cicd/sources.md) |
| `DR.md` / `DR.en.md` | [`CLOUD.md`](CLOUD.md), [`DATACENTERS.md`](DATACENTERS.md), [`MONITORING.md`](MONITORING.md), [`CICD.md`](CICD.md), [`STORAGE.md`](STORAGE.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
| `PROVISIONING.md` / `PROVISIONING.en.md` | [`CICD.md`](CICD.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
| `STORAGE.md` / `STORAGE.en.md` | [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
| `GPU.md` / `GPU.en.md` | [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |

15
README.md
@@ -52,15 +52,18 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
| 🌐 Network architecture | [NETWORKING.md](NETWORKING.md) | DNS, BGP, VPC, Zero Trust, EVPN VXLAN, TLS | CLOUD |
| 📊 Monitoring and observability | [MONITORING.md](MONITORING.md) | Prometheus, Grafana, OTel, logging, alerting, SLO | — |
| 🔄 CI/CD and DevOps | [CICD.md](CICD.md) | Pipelines, GitOps, IaC (Terraform), deployment strategies | — |
| 🔄 Disaster Recovery | [DR.md](DR.md) | RTO, RPO, scenarios, prevention, uptime calculation | CLOUD, DATACENTERS, MONITORING |
| 🗄️ Database architecture | [DATABASES.md](DATABASES.md) | Classification, sharding, replication, caching | POSTGRESQL, MYSQL, ORACLE, MONGODB, REDIS, CASSANDRA, VEKTOROVE-DB, DATABAZOVE-ENGINY |
| 🖥️ Hypervisors | [HYPERVISORS.md](HYPERVISORS.md) | VMware, Hyper-V, KVM, Proxmox, migration | STORAGE, SERVER-HW |
| 🏭 Data centers | [DATACENTERS.md](DATACENTERS.md) | Tier, power, cooling, layout, DC services | MONITORING |
| 🏭 Data centers | [DATACENTERS.md](DATACENTERS.md) | Tier, power, cooling, layout, DC services, secondary DC topologies | MONITORING, MESSAGING |
| 💾 Storage | [STORAGE.md](STORAGE.md) | SAN/NAS/object, RAID, SDS, Ceph, OpenStack Cinder/Swift/Manila | — |
| 🔌 Server connectivity | [CONNECTIVITY.md](CONNECTIVITY.md) | Ethernet, FC SAN, iSCSI, NVMe-oF, SAS | — |
| 🔧 Server hardware | [SERVER-HW.md](SERVER-HW.md) | CPU, RAM, PCIe, NUMA, BMC | CONNECTIVITY |
| 🎮 GPU | [GPU.md](GPU.md) | NVIDIA/AMD, NVLink, MIG/vGPU, AI, Cyborg | — |
| ⚙️ Server config | [SERVER-CONFIG.md](SERVER-CONFIG.md) | BIOS tuning, DB/hypervisor/K8s/storage best practices | — |
| 📦 Provisioning | [PROVISIONING.md](PROVISIONING.md) | PXE, Redfish, Terraform, Ironic, OpenStack deploy | CICD |
| 📨 Messaging & streaming | [MESSAGING.md](MESSAGING.md) | Kafka, RabbitMQ, Pulsar, NATS, managed queue/pubsub | DATACENTERS, CLOUD |
| 🏗️ DC migration | [DC-MIGRATION.md](DC-MIGRATION.md) | Strategies, phases, network, DB, rollback | DATACENTERS, CLOUD, DR, NETWORKING, STORAGE |
| 📋 Original index | [HARDWARE.md](HARDWARE.md) | Legacy index → SERVER-HW, GPU, SERVER-CONFIG, PROVISIONING | SERVER-HW, GPU, SERVER-CONFIG, PROVISIONING |
| 📋 Original infrastructure | [INFRASTRUCTURE.md](INFRASTRUCTURE.md) | Legacy index → HYPERVISORS, DATACENTERS, STORAGE, HARDWARE | HYPERVISORS, DATACENTERS, STORAGE, HARDWARE |
| 📋 Review workflow | [REVIEW.md](REVIEW.md) | Peer review and content control process | — |
@@ -89,15 +92,18 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
| 🌐 Network architecture | [NETWORKING.en.md](NETWORKING.en.md) | DNS, BGP, VPC, Zero Trust, EVPN VXLAN, TLS | CLOUD |
| 📊 Monitoring & observability | [MONITORING.en.md](MONITORING.en.md) | Prometheus, Grafana, OTel, logging, alerting | — |
| 🔄 CI/CD & DevOps | [CICD.en.md](CICD.en.md) | Pipelines, GitOps, IaC (Terraform), deployment | — |
| 🔄 Disaster Recovery | [DR.en.md](DR.en.md) | RTO, RPO, scenarios, prevention, uptime calculation | CLOUD, DATACENTERS, MONITORING |
| 🗄️ Database architecture | [DATABASES.en.md](DATABASES.en.md) | Classification, sharding, replication, caching | POSTGRESQL, MYSQL, ORACLE, MONGODB, REDIS, CASSANDRA, VECTOR-DBS, DATABASE-ENGINES |
| 🖥️ Hypervisors | [HYPERVISORS.en.md](HYPERVISORS.en.md) | VMware, Hyper-V, KVM, Proxmox, migration | STORAGE, SERVER-HW |
| 🏭 Data centers | [DATACENTERS.en.md](DATACENTERS.en.md) | Tier, power, cooling, layout, DC services | MONITORING |
| 🏭 Data centers | [DATACENTERS.en.md](DATACENTERS.en.md) | Tier, power, cooling, layout, DC services, secondary DC topologies | MONITORING, MESSAGING |
| 💾 Storage | [STORAGE.en.md](STORAGE.en.md) | SAN/NAS/object, RAID, SDS, Ceph | — |
| 🔌 Server connectivity | [CONNECTIVITY.en.md](CONNECTIVITY.en.md) | Ethernet, FC SAN, iSCSI, NVMe-oF, SAS | — |
| 🔧 Server hardware | [SERVER-HW.en.md](SERVER-HW.en.md) | CPU, RAM, PCIe, NUMA, BMC | CONNECTIVITY |
| 🎮 GPU | [GPU.en.md](GPU.en.md) | NVIDIA/AMD, NVLink, MIG/vGPU, AI, Cyborg | — |
| ⚙️ Server config | [SERVER-CONFIG.en.md](SERVER-CONFIG.en.md) | BIOS tuning, DB/hypervisor/K8s/storage best practices | — |
| 📦 Provisioning | [PROVISIONING.en.md](PROVISIONING.en.md) | PXE, Redfish, Terraform, Ironic, OpenStack deploy | CICD |
| 📨 Messaging & streaming | [MESSAGING.en.md](MESSAGING.en.md) | Kafka, RabbitMQ, Pulsar, NATS, managed queue/pubsub | DATACENTERS, CLOUD |
| 🏗️ DC Migration | [DC-MIGRATION.en.md](DC-MIGRATION.en.md) | Strategies, phases, network, DB, rollback | DATACENTERS, CLOUD, DR, NETWORKING, STORAGE |
| 📋 Legacy index | [HARDWARE.en.md](HARDWARE.en.md) | → SERVER-HW, GPU, SERVER-CONFIG, PROVISIONING | SERVER-HW, GPU, SERVER-CONFIG, PROVISIONING |
| 📋 Legacy infra | [INFRASTRUCTURE.en.md](INFRASTRUCTURE.en.md) | → HYPERVISORS, DATACENTERS, STORAGE, HARDWARE | HYPERVISORS, DATACENTERS, STORAGE, HARDWARE |
| 📋 Review workflow | [REVIEW.en.md](REVIEW.en.md) | Review and content control process | — |
@@ -136,6 +142,9 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
| `DATACENTERS.md` / `DATACENTERS.en.md` | [`MONITORING.md`](MONITORING.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
| `MONITORING.md` / `MONITORING.en.md` | [`sources/monitoring/sources.md`](sources/monitoring/sources.md) |
| `CICD.md` / `CICD.en.md` | [`sources/cicd/sources.md`](sources/cicd/sources.md) |
| `DR.md` / `DR.en.md` | [`CLOUD.md`](CLOUD.md), [`DATACENTERS.md`](DATACENTERS.md), [`MONITORING.md`](MONITORING.md), [`CICD.md`](CICD.md), [`STORAGE.md`](STORAGE.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
| `MESSAGING.md` / `MESSAGING.en.md` | [`DATACENTERS.md`](DATACENTERS.md), [`CLOUD.md`](CLOUD.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
| `DC-MIGRATION.md` / `DC-MIGRATION.en.md` | [`DATACENTERS.md`](DATACENTERS.md), [`CLOUD.md`](CLOUD.md), [`DR.md`](DR.md), [`NETWORKING.md`](NETWORKING.md), [`STORAGE.md`](STORAGE.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
| `PROVISIONING.md` / `PROVISIONING.en.md` | [`CICD.md`](CICD.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
| `STORAGE.md` / `STORAGE.en.md` | [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
| `GPU.md` / `GPU.en.md` | [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
@@ -187,4 +196,4 @@ Raw reference data (documentation, books, standards) by area:

---

*The index is maintained automatically by the `kb-index` agent. Last updated: 2026-06-11.*
*The index is maintained automatically by the `kb-index` agent. Last updated: 2026-06-12.*

@@ -111,7 +111,22 @@ Split into separate files:
| VMware Migration in 2026: Proxmox, KVM, XCP-ng & Veeam — StarWind | https://starwindsoftware.com/blog/vmware-migration-to-proxmox-kvm-xcp-ng-2026 | `[done]` |
| Complete guide to modern vSphere alternatives — Spectro Cloud | https://www.spectrocloud.com/blog/vsphere-alternatives | `[done]` |
| Broadcom VMware Acquisition: What's Next — Sayers | https://www.sayers.com/blog/after-the-deal-whats-next-for-vmware-customers | `[done]` |
| Stanford University migration from VMware to Proxmox | https://itcommunity.stanford.edu/news/enterprise-technology-completes-successful-virtual-infrastructure-migration-vmware-proxmox | `[done]` |
| Stanford University migration from VMware to Proxmox | https://itcommunity.stanford.edu/news/enterprise-technology-completes-successful-virtual-infrastructure-migration-vmware-proxmox | `[done]` |
| | **Messaging / streaming** | |
| Apache Kafka docs | https://kafka.apache.org/documentation/ | `[done]` |
| RabbitMQ docs | https://www.rabbitmq.com/documentation.html | `[done]` |
| Apache Pulsar docs | https://pulsar.apache.org/docs/ | `[done]` |
| NATS docs | https://docs.nats.io/ | `[done]` |
| Designing Event-Driven Systems (Confluent) | https://www.confluent.io/designing-event-driven-systems/ | `[done]` |
| Kafka: The Definitive Guide (2nd ed.) — Confluent | https://www.confluent.io/resources/kafka-the-definitive-guide/ | `[done]` |
| Enterprise Integration Patterns — Hohpe & Woolf | https://www.enterpriseintegrationpatterns.com/ | `[done]` |
| | **DC migration** | |
| AWS Cloud Migration — 6 Strategies for Migrating to the Cloud | https://aws.amazon.com/blogs/enterprise-strategy/6-strategies-for-migrating-applications-to-the-cloud/ | `[done]` |
| Azure Cloud Migration — Microsoft Cloud Adoption Framework | https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/ | `[done]` |
| Gartner 5 Rs of Cloud Migration | https://www.gartner.com/en/documents/3984835 | `[done]` |
| VMware Site Recovery Manager — documentation | https://docs.vmware.com/en/Site-Recovery-Manager/ | `[done]` |
| Zerto — Disaster Recovery & Migration | https://www.zerto.com/resources/ | `[done]` |
| The Phoenix Project — IT Ops & Migration patterns | https://itrevolution.com/product/the-phoenix-project/ | `[done]` |

## Hardware vendors
