new files

2026-06-16 15:47:45 +02:00
parent 3fa11ef0f6
commit b53714113c
11 changed files with 2298 additions and 7 deletions
--- a/DATACENTERS.en.md
+++ b/DATACENTERS.en.md
@@ -658,6 +658,281 @@ flowchart TD
    CLIM -->|"Cold (SE, NO)"| FC3["Free cooling 7000+ h/year<br/>Air-side economizer<br/>PUE < 1.2"]
 ```

+## Secondary data center topologies
+
+When planning a second DC, the choice of topology depends mainly on distance, RPO/RTO targets, and budget.
+
+### Distance classification
+
+| Category | Distance | Latency (round-trip) | Use case |
+|-----------|-----------|---------------------|----------|
+| **Metro (Campus)** | 1–20 km | < 1 ms | Synchronous replication, stretched cluster |
+| **Metro** | 20–100 km | 1–5 ms | Metro cluster, mostly sync replication |
+| **Regional** | 100–500 km | 5–20 ms | Asynchronous replication, warm standby |
+| **Continent** | 500–3000 km | 20–100 ms | Asynchronous replication, cold standby |
+| **Global** | 3000+ km | > 100 ms | Async only, no real-time dependencies |
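As a rough cross-check of the latency column, round-trip time can be estimated from distance alone. The sketch below assumes about 5 µs per km one way in fiber and ignores switching, amplification, and routing detours, so real RTTs will be higher. The function names are illustrative; the distance bounds are taken from the table above.

```python
# Propagation-only estimate; assumes ~5 microseconds per km one way in fiber.
FIBER_US_PER_KM = 5.0

# Upper distance bound (km) for each category, from the table above.
CATEGORIES = [
    (20, "Metro (Campus)"),
    (100, "Metro"),
    (500, "Regional"),
    (3000, "Continent"),
]


def fiber_rtt_ms(distance_km: float) -> float:
    """Round-trip propagation delay in milliseconds (no equipment overhead)."""
    return 2 * distance_km * FIBER_US_PER_KM / 1000.0


def classify(distance_km: float) -> str:
    """Map a distance to the category names used in the table."""
    for upper_km, name in CATEGORIES:
        if distance_km <= upper_km:
            return name
    return "Global"


if __name__ == "__main__":
    for km in (10, 80, 400, 2000, 8000):
        print(f"{km:>5} km  {classify(km):<15} ~{fiber_rtt_ms(km):.1f} ms RTT (propagation only)")
```

The table's latency ranges sit above these propagation-only values because they also have to cover equipment and non-direct fiber routes.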
+
+### Topologies by operational mode
+
+#### Active-Active (Hot-Hot)
+
+```
+DC-A (Primary)                 DC-B (Active)
+┌────────────────────┐        ┌────────────────────┐
+│  App Active        │        │  App Active        │
+│  DB Active         │◄─sync─►│  DB Active         │
+│  Users → LB → A    │        │  Users → LB → B    │
+└────────────────────┘        └────────────────────┘
+           │                         │
+           └──── Global Load Balancer ────┘
+```
+
+| Parameter | Value |
+|----------|---------|
+| **RTO** | 0–seconds (automatic failover, traffic is redirected) |
+| **RPO** | 0 (sync replication, commit is confirmed only after write to both DCs) |
+| **Max distance** | < 100 km (latency < 5 ms RTT for sync DB replication) |
+| **Operating costs** | 2× (both DCs fully active, both fully equipped) |
+| **Advantages** | Zero downtime, instant switchover, full utilization of both DCs |
+| **Disadvantages** | Requires synchronous replication → distance limit, complex networking, split-brain risk |
+
+**Split-brain solutions**: STONITH (Shoot The Other Node In The Head), watchdog, quorum (3rd node in 3rd location / cloud), fencing, SCSI-3 persistent reservation.
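The quorum approach can be illustrated with a minimal sketch: a partition may keep serving writes only while it sees a strict majority of voting members, which is why a third vote outside both DCs is needed. The member names and helper functions are hypothetical and do not correspond to any specific cluster manager.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """A partition may stay active only with a strict majority of votes."""
    return reachable_votes > total_votes // 2


def surviving_sides(partitions: list[set[str]], voters: set[str]) -> list[set[str]]:
    """Return the partitions that still hold quorum after a network split."""
    return [p for p in partitions if has_quorum(len(p & voters), len(voters))]


if __name__ == "__main__":
    # Two voters only: a split leaves 1 of 2 on each side, so nobody has quorum.
    two = {"dc-a", "dc-b"}
    print(surviving_sides([{"dc-a"}, {"dc-b"}], two))

    # With a witness in a third location, exactly one side keeps quorum.
    three = {"dc-a", "dc-b", "witness"}
    print(surviving_sides([{"dc-a", "witness"}, {"dc-b"}], three))
```

With only two voters, neither side can safely continue after a split. Adding the witness lets exactly one side continue, and the other side must be fenced.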
+
+**Use case**: Financial services, telco, payment gateways: anywhere a single minute of downtime can cost millions.
+
+#### Active-Passive (Hot-Warm, MetroCluster)
+
+```
+DC-A (Primary)                 DC-B (Standby)
+┌────────────────────┐        ┌────────────────────┐
+│  App Active        │        │  App Standby       │
+│  DB Primary        │──sync─►│  DB Standby        │
+│  Users → LB → A    │        │  ~~~ (waiting) ~~~ │
+│  DNS: A-record     │        │  DNS: health check │
+└────────────────────┘        └────────────────────┘
+```
+
+| Parameter | Value |
+|----------|---------|
+| **RTO** | tens of seconds–minutes (DNS failover + App startup) |
+| **RPO** | 0 (sync) or seconds (async) |
+| **Max distance** | sync < 100 km, async unlimited |
+| **Operating costs** | 1.5–1.8× (second DC has reduced or idle compute) |
+| **MetroCluster** | Specific implementation: FC SAN over DWDM, sync mirror, automatic failover |
+
+**MetroCluster** (NetApp; comparable offerings include Dell EMC VPLEX Metro and HPE Peer Persistence):
+- Storage-based cluster with synchronous mirroring between DCs
+- Automatic failover on entire DC failure
+- Requires dedicated DWDM or dark fiber interconnection
+- Typical distance: up to 50 km (for latency < 1 ms RTT)
+- Use case: enterprise storage, primary+secondary DC in metropolitan area
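The Active-Passive RTO of "tens of seconds–minutes (DNS failover + App startup)" can be broken down into its parts. The sketch below assumes the failure is detected by repeated failed health checks, then propagated through DNS, then followed by application startup in DC-B. All parameter names and sample numbers are illustrative assumptions, not measurements.

```python
def estimate_rto_seconds(
    check_interval_s: float,
    failures_to_trip: int,
    dns_ttl_s: float,
    app_startup_s: float,
) -> float:
    """Worst-case RTO: detection time + cached DNS expiry + standby app startup."""
    detection_s = check_interval_s * failures_to_trip
    return detection_s + dns_ttl_s + app_startup_s


if __name__ == "__main__":
    # Hypothetical values: 10 s checks, 3 misses, 60 s TTL, 45 s app start.
    print(estimate_rto_seconds(10, 3, 60, 45), "seconds")
```

With these sample values the estimate is 135 s. A lower DNS TTL or a pre-started standby application shortens RTO without changing the replication mode.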
+
+#### Hot-Cold (Warm Standby → Cold)
+
+```
+DC-A (Primary)                 DC-B (Cold Standby)
+┌────────────────────┐        ┌────────────────────┐
+│  App Active        │        │ ~~~ powered off ~~~│
+│  DB Active         │─async─►│  Backup storage    │
+│  Users → A         │        │  ~~~ no compute ~~~│
+└────────────────────┘        └────────────────────┘
+```
+
+| Parameter | Value |
+|----------|---------|
+| **RTO** | hours–days (purchase/rent HW, restore from backup) |
+| **RPO** | hours (last backup) |
+| **Max distance** | unlimited |
+| **Operating costs** | 1.1–1.3× (only storage and facility, compute only at failover) |
+| **Typical use case** | Low-cost DR, compliance, last resort |
+
+#### Pilot Light
+
+```
+DC-A (Primary)                 DC-B (Pilot Light)
+┌────────────────────┐        ┌────────────────────┐
+│  App Active        │        │  ~~~ off ~~~       │
+│  DB Active         │─async─►│  DB replica (mini) │
+│  All services      │        │  Core services only│
+│                    │        │  (DNS, LDAP, mon)  │
+└────────────────────┘        └────────────────────┘
+                              On DR: spin-up compute
+                              from IaC, rest from backup
+```
+
+- DC-B runs with minimum compute (only core services and DB replica)
+- Application layer is spun up from IaC (Terraform, Ansible) only during DR
+- Compromise between cost and RTO
+
+### Comparison table
+
+| Topology | RTO | RPO | Cost (× primary) | Max distance | Failover |
+|-----------|-----|-----|-------------------|-------------|----------|
+| **Active-Active** | 0–s | 0 | 2.0× | < 100 km | Auto (traffic) |
+| **MetroCluster** | s–min | 0 | 1.8–2.0× | < 50 km | Auto (storage) |
+| **Active-Passive (sync)** | min | 0 | 1.5–1.8× | < 100 km | Semi-auto |
+| **Active-Passive (async)** | min–h | s–min | 1.3–1.5× | unlimited | Semi-auto |
+| **Pilot Light** | h | min–h | 1.2–1.4× | unlimited | Manual |
+| **Warm Standby** | min–h | s–min | 1.5–1.8× | unlimited | Semi-auto |
+| **Cold Standby** | days | h | 1.1–1.3× | unlimited | Manual |
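The comparison table can be treated as data and filtered by requirements. The sketch below encodes each row by the worst end of its RTO/RPO range and the low end of its cost range. This encoding, the `Topology` class, and the `candidates` helper are illustrative assumptions rather than part of any tool.

```python
from dataclasses import dataclass
from typing import Optional

# Ordered time classes as used in the table ("0" best, "days" worst).
SCALE = ["0", "s", "min", "h", "days"]


@dataclass(frozen=True)
class Topology:
    name: str
    rto_worst: str            # upper end of the RTO range in the table
    rpo_worst: str            # upper end of the RPO range in the table
    cost_min: float           # lower end of the cost range (x primary)
    max_km: Optional[float]   # None means "unlimited"


TABLE = [
    Topology("Active-Active", "s", "0", 2.0, 100),
    Topology("MetroCluster", "min", "0", 1.8, 50),
    Topology("Active-Passive (sync)", "min", "0", 1.5, 100),
    Topology("Active-Passive (async)", "h", "min", 1.3, None),
    Topology("Pilot Light", "h", "h", 1.2, None),
    Topology("Warm Standby", "h", "min", 1.5, None),
    Topology("Cold Standby", "days", "h", 1.1, None),
]


def candidates(rto: str, rpo: str, distance_km: float) -> list[Topology]:
    """Topologies whose worst-case RTO/RPO class and distance limit fit, cheapest first."""
    fits = [
        t for t in TABLE
        if SCALE.index(t.rto_worst) <= SCALE.index(rto)
        and SCALE.index(t.rpo_worst) <= SCALE.index(rpo)
        and (t.max_km is None or distance_km < t.max_km)
    ]
    return sorted(fits, key=lambda t: t.cost_min)


if __name__ == "__main__":
    for t in candidates(rto="min", rpo="0", distance_km=40):
        print(f"{t.name}: from {t.cost_min}x")
```

Sorting by the low end of the cost range is only a first-pass ranking. The ranges in the table overlap, so the final choice still depends on the actual quotes.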
+
+### Stretched Cluster
+
+```
+┌──── Site A (50 km) ────┐    ┌──── Site B ──────────┐
+│  ┌──────────────────┐   │    │  ┌──────────────────┐ │
+│  │  ESXi / Hyper-V  │   │    │  │  ESXi / Hyper-V  │ │
+│  │  VM               │   │    │  │  VM (complement) │ │
+│  └────────┬─────────┘   │    │  └────────┬─────────┘ │
+│           │             │    │           │            │
+│  ┌────────▼─────────┐  │    │  ┌────────▼─────────┐  │
+│  │  Storage (SAN)   │──┼────┼──│  Storage (SAN)   │  │
+│  │  MetroCluster    │  │    │  │  MetroCluster    │  │
+│  └──────────────────┘  │    │  └──────────────────┘  │
+└────────────────────────┘    └────────────────────────┘
+                │
+          ┌─────▼──────┐
+          │  vCenter / │
+          │  Cluster   │
+          │  (single)  │
+          └────────────┘
+```
+
+- One cluster stretched across two sites (single management domain)
+- VMs can live-migrate between sites (vMotion over distance)
+- Storage synchronously mirrored (MetroCluster, VPLEX, vSAN Stretched Cluster)
+- **Requirements**: dark fiber / DWDM, low latency (< 5 ms), high link reliability
+- **Risks**: split-brain (site partition), dependency on the inter-site network
+- **Use case**: enterprise with own dark fiber between two DCs in a metropolitan area
+
+### Decision tree
+
+```mermaid
+flowchart TD
+    Start(["Secondary DC"]) --> RPO{"Required RPO?"}
+    RPO -->|"0 (no data loss)"| SYNC{"Sync replication possible?"}
+    SYNC -->|"Yes, < 100 km"| ACT{"Want zero downtime?"}
+    ACT -->|"Yes"| AA["Active-Active<br/>RTO=0, RPO=0, 2× cost"]
+    ACT -->|"No"| AP["Active-Passive<br/>RTO=min, RPO=0, 1.5×"]
+    SYNC -->|"No, > 100 km"| ASYNC["Active-Passive (async)<br/>RTO=min, RPO=s, 1.3×"]
+
+    RPO -->|"minutes–hours"| WARM{"Want fast failover?"}
+    WARM -->|"Yes"| PILOT["Pilot Light<br/>RTO=h, RPO=min, 1.2×"]
+    WARM -->|"No"| COLD["Cold Standby<br/>RTO=days, RPO=h, 1.1×"]
+
+    Start --> DIST{"Distance between DCs"}
+    DIST -->|"< 50 km, own fiber"| MC["MetroCluster / Stretched Cluster<br/>Single management, sync storage"]
+    DIST -->|"50–300 km"| REG["Regional DR<br/>Active-Passive, async replication"]
+    DIST -->|"> 300 km"| GLOBAL["Global DR<br/>Cold standby, backup & restore"]
+```
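The RPO branch of the flowchart above can also be written as a small function, which makes one detail explicit: when RPO = 0 is requested but the sites are more than 100 km apart, the chart falls back to asynchronous replication, so the achieved RPO is seconds rather than zero. The function name and parameters are hypothetical.

```python
def choose_topology(
    rpo_zero: bool,
    distance_km: float,
    zero_downtime: bool = False,
    fast_failover: bool = False,
) -> str:
    """Walk the RPO branch of the flowchart and return the leaf it ends on."""
    if rpo_zero:
        if distance_km < 100:  # sync replication is considered possible
            return "Active-Active" if zero_downtime else "Active-Passive"
        # Beyond 100 km the chart falls back to async, so RPO is seconds, not 0.
        return "Active-Passive (async)"
    return "Pilot Light" if fast_failover else "Cold Standby"


if __name__ == "__main__":
    print(choose_topology(rpo_zero=True, distance_km=30, zero_downtime=True))
    print(choose_topology(rpo_zero=True, distance_km=400))
    print(choose_topology(rpo_zero=False, distance_km=400, fast_failover=True))
```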
+
+### Physical infrastructure for DC interconnection
+
+| Technology | Bandwidth | Max distance | Latency | Use case |
+|------------|-----------|-------------|---------|----------|
+| **Dark fiber** | 100 GbE–800 GbE | 10–80 km (single-mode) | ~5 µs/km one-way (< 0.5 ms at 80 km) | MetroCluster, stretched cluster |
+| **DWDM** | 400 GbE–1.6 TbE (per lambda) | 80–120 km (without amplifier) | ~5 µs/km one-way (~0.6 ms at 120 km) | Metro, metro cluster |
+| **CWDM** | 10–25 GbE (per channel) | 10–40 km | < 0.3 ms | Campus, smaller metro |
+| **MPLS L2VPN** | 10–100 GbE | unlimited | 1–10 ms | Regional DR, async replication |
+| **Internet IPsec** | 1–10 GbE | unlimited | 5–50 ms | Cold standby, backup |
+
+### Impact of individual technologies on DC topology selection
+
+Choosing a secondary DC topology is not purely an infrastructure decision — each layer (DB, hypervisor, orchestration, messaging) brings its own constraints.
+
+#### Databases
+
+| DB technology | Sync replication | Max distance | Auto-failover | Split-brain handling | Note |
+|---------------|---------------|-------------|---------------|-------------------|----------|
+| **PostgreSQL** | Synchronous commit (synchronous_standby_names) | < 100 km (latency < 10 ms) | Patroni / repmgr + etcd | Quorum (etcd, 3+ nodes) | Streaming replication; needs wal_keep_size (wal_keep_segments before PostgreSQL 13) or replication slots |
+| **MySQL** | Group Replication (multi-primary, single-primary) | < 100 km | MySQL InnoDB Cluster + MySQL Router | Paxos (Group Replication, 3+ nodes) | Semi-sync as compromise |
+| **Oracle** | Data Guard (SYNC/FASTSYNC/ASYNC), RAC extended | sync < 100 km, async unlimited | Data Guard Broker / FSFO (Fast Start Failover) | Observer (3rd node) | Far Sync for remote DCs |
+| **MSSQL** | AlwaysOn Availability Groups (SYNCHRONOUS_COMMIT) | < 100 km | AlwaysOn + Cluster quorum | File share majority / cloud witness | Multi-site cluster support |
+| **MongoDB** | Majority write concern + journaling | < 100 km | Replica set auto-election | Arbiter (voting member) | Priority-based failover |
+| **Cassandra** | N/A (multi-master, eventual consistency) | unlimited | Yes (peer-to-peer) | None (multi-master, gossip protocol) | Snitch-aware topology, NetworkTopologyStrategy |
+| **Redis** | Redis Sentinel / Redis Cluster (async) | unlimited (async) | Sentinel / Cluster failover | Quorum (Sentinel, majority) | PSYNC replication, replication lag |
+
+Key limitation for **sync replication**: latency < 5 ms RTT (a commit must wait for confirmation from both DCs). At 100 km the fiber RTT is ~1 ms, which is acceptable. At 1000 km (~10 ms RTT) each commit waits at least 10 ms longer, which can reduce per-session transaction throughput by 80 % or more.
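The throughput drop quoted above follows from simple arithmetic if one session commits serially and every commit waits for one extra round trip. The sketch below assumes a 2 ms local commit time, which is an illustrative value and not a benchmark; the helper names are hypothetical.

```python
def commits_per_second(local_commit_ms: float, rtt_ms: float = 0.0) -> float:
    """Serial commit rate for one session: each commit waits local time + one RTT."""
    return 1000.0 / (local_commit_ms + rtt_ms)


def throughput_loss_pct(local_commit_ms: float, rtt_ms: float) -> float:
    """Relative drop in per-session commit rate caused by the added RTT."""
    base = commits_per_second(local_commit_ms)
    with_sync = commits_per_second(local_commit_ms, rtt_ms)
    return 100.0 * (1 - with_sync / base)


if __name__ == "__main__":
    # Assumed 2 ms local commit; RTTs for roughly 100 km and 1000 km of fiber.
    for rtt in (1.0, 10.0):
        print(f"RTT {rtt:>4} ms: {commits_per_second(2.0, rtt):6.1f} commits/s, "
              f"loss {throughput_loss_pct(2.0, rtt):.0f} %")
```

Under this assumption the loss is about 33 % at 1 ms RTT and about 83 % at 10 ms RTT. Many concurrent sessions can hide part of this in aggregate throughput, but each individual transaction still pays the full round trip.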
+
+Suitable for **Active-Active**:
+- **Cassandra / ScyllaDB** — native multi-DC, eventual consistency, no split-brain
+- **MySQL Group Replication (multi-primary)** — 3+ DC for quorum
+- **CockroachDB / TiDB** — native multi-region, ACID across DCs
+- **Redis Enterprise** — Active-Active (CRDT-based)
+
+Suitable for **Active-Passive**:
+- **PostgreSQL + Patroni** — auto-failover, etcd quorum
+- **Oracle Data Guard** — FSFO, far sync for remote DCs
+- **MSSQL AlwaysOn** — cloud witness
+- **MongoDB Replica Set** — arbiter in 3rd location
+
+#### Hypervisors
+
+| Hypervisor | Cluster technology | Stretched cluster | Max distance | Split-brain |
+|-----------|-------------------|-------------------|-------------|-------------|
+| **VMware vSphere** | vSAN Stretched Cluster, vSphere Metro Storage Cluster (vMSC), Site Recovery Manager | Yes (vSAN Stretched Cluster, vMSC) | < 50 km (vSAN Stretched Cluster), < 10 ms RTT | Fencing (STONITH), witness host |
+| **Hyper-V** | Storage Replica + Failover Cluster | Yes (Cluster Sets) | < 50 km (sync), unlimited (async) | File share witness / cloud witness |
+| **Proxmox VE** | Proxmox HA + Ceph | Limited (Ceph stretch cluster) | < 50 km (Ceph sync) | Ceph monitor quorum (3+ DC) |
+| **XCP-ng / XenServer** | Xen Orchestra HA + SR (Storage Repository) replication | Limited | depends on storage replication | — |
+| **Nutanix AHV** | Metro Availability (sync), Async DR | Yes (Metro) | < 100 km (sync), unlimited (async) | Witness VM (cloud / 3rd site) |
+| **KVM / oVirt** | oVirt HA + GlusterFS / NFS | Limited | depends on storage replication | — |
+
+**vSAN Stretched Cluster specific requirements:**
+- Dedicated vSAN network (25 GbE min., < 5 ms RTT)
+- Witness host in 3rd location (or cloud witness)
+- VM storage policies with FTT=1 (mirroring across sites)
+- Storage policy: `site-A + site-B + witness`
+
+#### Kubernetes and container platforms
+
+| Platform | Multi-cluster DR | Replication | Max distance | Failover |
+|-----------|-----------------|-----------|-------------|----------|
+| **Vanilla K8s** | KubeFed, Cluster API, Velero + Restic | Velero (backup/restore), Rook (Ceph) | unlimited | Manual (Velero restore) |
+| **OpenShift** | ACM (Advanced Cluster Management), Velero | OADP (OpenShift API for Data Protection) | unlimited | ACM failover (subscription) |
+| **Rancher** | Rancher Multi-Cluster App, Velero | Longhorn (sync/async DR), Velero | unlimited | Semi-auto |
+| **Google GKE** | Multi-cluster Services, Backup for GKE | Config Sync, Backup for GKE | unlimited | Manual |
+| **Azure AKS** | Azure Arc + Velero + Azure Traffic Manager | AKS backup (Velero), Azure Site Recovery | unlimited | Manual (Velero) |
+| **AWS EKS** | EKS multi-cluster, Velero + S3 cross-region | Velero (S3, EBS snapshots) | unlimited | Manual |
+
+**Key K8s DR principles:**
+- **Applications must be stateless** (or state externalized to DB/storage)
+- **Velero** — backup/restore entire cluster (PV, resources, helm releases)
+- **Rook/Ceph** — cross-region mirroring of RBD volumes
+- **KubeFed / ACM** — subscription-based deploy to multiple clusters
+- **Ingress/Gateway API** — traffic routing between clusters
+- **External DNS** — DNS failover on cluster outage
+
+#### Messaging / streaming
+
+| Platform | Replication | Topology | DR support | Max distance |
+|-----------|-----------|-----------|------------|-------------|
+| **Apache Kafka** | MirrorMaker 2, Confluent Cluster Linking, KRaft quorum | Active-Passive (MM2), Active-Active (Cluster Linking) | MM2: async, Cluster Linking: async | unlimited |
+| **RabbitMQ** | Classic Queue Mirroring, Quorum Queues | Active-Passive (Warm Standby) | Federation / Shovel (async) | unlimited |
+| **Red Hat AMQ** | (Artemis) Cluster + HA | Active-Passive (shared store / replication) | Live-backup pair | < 100 km (sync) |
+| **NATS** | NATS JetStream (cluster + cross-account) | Active-Active (Leaf nodes, cross-account) | Super-cluster, failover | unlimited |
+| **Apache Pulsar** | BookKeeper (bookie rack-aware), geo-replication | Active-Active (geo-replication) | Built-in (cluster-level) | unlimited (async) |
+| **AWS SQS/SNS** | Managed, AWS region pairs | Active-Active (multi-region) | Built-in (AWS managed) | unlimited |
+| **Azure Service Bus** | Managed, paired region | Active-Passive (paired region) | Built-in (geo-recovery) | unlimited |
+| **Oracle Service Bus (OSB)** | Oracle WebLogic Cluster + JDBC store + AQ | Active-Passive (WebLogic Cluster + Data Guard) | OSB/WLS cluster + Oracle RAC/Data Guard sync | < 100 km (Data Guard sync), unlimited (async) |
+
+**Messaging DR recommendations:**
+- **Kafka**: use Cluster Linking for Active-Active, or MirrorMaker 2 for Active-Passive; replicate only critical topics
+- **RabbitMQ**: Quorum Queues + Federation upstream for DR; avoid Classic Queue Mirroring (deprecated)
+- **Pulsar**: native geo-replication, bookie rack-aware for stretched cluster; easiest DR among messaging platforms
+- **OSB**: WebLogic cluster + Oracle RAC/Data Guard; DR depends on DB layer, not on OSB itself
+
+### Per-layer limitations summary table
+
+| Layer | Limiting factor for secondary DC | Max distance for sync | Impact on topology selection |
+|--------|-----------------------------------|----------------------|--------------------------|
+| **Storage** | Sync mirror latency, DWDM cost | < 50 km (MetroCluster) | Stretched cluster only in metro |
+| **Databases** | Commit wait for sync replication | < 100 km (5 ms RTT) | Active-Active only with multi-master DB |
+| **Hypervisor** | Stretched cluster quorum + fencing | < 50 km (vSAN, 5 ms) | MetroCluster / stretched cluster |
+| **Kubernetes** | Velero restore time, Rook mirror latency | unlimited (async) | Active-Passive, cold standby |
+| **Messaging** | Replication lag, offset management | unlimited (async) | Active-Active (Kafka, Pulsar, NATS) or Active-Passive |
+| **Network** | Dark fiber/DWDM cost, latency | < 100 km (metro fiber) | Limits sync replication options |
+| **Application** | Stateful/stateless, connection draining | depends on architecture | Stateless app → any topology |
+
 ## Disk monitoring — S.M.A.R.T.

 Self-Monitoring, Analysis and Reporting Technology — predictive monitoring of HDD/SSD.