18.6.2026

new files
2026-06-18 16:25:33 +02:00 · 2026-06-16 15:47:45 +02:00
47 changed files with 5923 additions and 124 deletions
--- a/AI-INFRASTRUCTURE.en.md
+++ b/AI-INFRASTRUCTURE.en.md
@@ -0,0 +1,600 @@
+# 🧠 AI/ML Infrastructure
+
+## Component overview
+
+```mermaid
+flowchart TD
+    subgraph Compute
+        GPU["GPU (H100/B200/Instinct)"]
+        CPU["CPU (AMD EPYC / Intel Xeon)"]
+        ASIC["ASIC (TPU, Trainium, Inferentia)"]
+    end
+    subgraph Network
+        IB["InfiniBand NDR/XDR"]
+        ROCE["RoCEv2"]
+        NVL["NVLink / NVSwitch"]
+    end
+    subgraph Storage
+        FS["Parallel FS (Lustre, GPFS, Weka)"]
+        OBJ["Object Store (S3, MinIO)"]
+        NVME["Local NVMe cache"]
+    end
+    subgraph Orchestration
+        S["Slurm"]
+        K["Kubernetes + Volcano/Kueue"]
+    end
+    subgraph Cooling
+        DLC["Direct-to-chip liquid"]
+        IMM["Immersion"]
+        AIR["Air (high-density)"]
+    end
+
+    Compute --> Network --> Storage
+    Orchestration --> Compute
+    Cooling --> Compute
+```
+
+---
+
+## GPU compute
+
+### NVIDIA
+
+| GPU | Architecture | FP8 | FP16/BF16 | FP64 | HBM | NVLink | TDP | Rack config |
+|-----|-------------|-----|-----------|------|-----|--------|-----|------|
+| **H100 SXM** | Hopper | 3,958 TFLOPS | 1,979 TFLOPS | 67 TFLOPS | 80 GB HBM3 | 900 GB/s | 700 W | 8× in DGX H100 |
+| **H200 SXM** | Hopper (HBM3e) | 3,958 TFLOPS | 1,979 TFLOPS | 67 TFLOPS | 141 GB HBM3e | 900 GB/s | 700 W | 8× in DGX H200 |
+| **B200** | Blackwell | ~9,000 TFLOPS | ~4,500 TFLOPS | ~40 TFLOPS | 192 GB HBM3e | 1,800 GB/s | 1,000 W | 8× in DGX B200 |
+| **GB200 Grace Blackwell** | Blackwell | ~18,000 TFLOPS | ~9,000 TFLOPS | — | 192 GB + 480 GB (Grace) | NVLink-C2C | 1,000 W (GPU) + 500 W (CPU) | DGX GB200 (36× GPU) |
+| **L40S** | Ada Lovelace | 733 TFLOPS | 367 TFLOPS | — | 48 GB GDDR6 | N/A | 350 W | Inference, enterprise |
+| **A100 SXM** | Ampere | — (no FP8; 1,248 TOPS INT8) | 624 TFLOPS | 19.5 TFLOPS | 80 GB HBM2e | 600 GB/s | 400 W | DGX A100 |
+
+### AMD
+
+| GPU | Architecture | FP8 | FP16/BF16 | FP64 | HBM | Infinity Fabric | TDP |
+|-----|-------------|-----|-----------|------|-----|----------------|-----|
+| **MI300X** | CDNA 3 | 2,615 TFLOPS | 1,307 TFLOPS | 81 TFLOPS | 192 GB HBM3 | 896 GB/s | 750 W |
+| **MI250** | CDNA 2 | — | 383 TFLOPS | 95.7 TFLOPS | 128 GB HBM2e | 400 GB/s | 500 W |
+
+### Intel
+
+| GPU | Architecture | FP16/BF16 | FP32 | HBM | TDP |
+|-----|-------------|-----------|------|-----|-----|
+| **Gaudi 3** | Custom | 1,835 TFLOPS | — | 128 GB HBM2e | 600 W |
+| **Max 1550** | Xe HPC | 600+ TFLOPS | 200 TFLOPS | 128 GB HBM2e | 600 W |
+
+### Cloud ASIC
+
+| ASIC | Provider | Use case | Performance |
+|------|----------|----------|-------|
+| **TPU v5p** | Google | Training | ~459 TFLOPS (BF16) per chip |
+| **Trainium 2** | AWS | Training | ~1,000 TFLOPS (BF16) per chip |
+| **Inferentia 2** | AWS | Inference | ~400 TOPS (INT8) per chip |
+| **Maia 100** | Microsoft | Training + inference | Custom, 800 W TDP |
+
+---
+
+## AI networking
+
+### Technology comparison
+
+| Technology | Bandwidth per link | Latency | Topology | Use case |
+|-------------|-------------------|---------|-----------|----------|
+| **InfiniBand NDR200** | 200 Gb/s | < 1 µs | Fat-tree, Dragonfly+ | Training (NVIDIA) |
+| **InfiniBand NDR400** | 400 Gb/s | < 1 µs | Fat-tree, Dragonfly+ | Training (NVIDIA) |
+| **InfiniBand XDR** | 800 Gb/s (planned) | < 1 µs | Dragonfly+ | Next-gen training |
+| **RoCEv2** (CX-7/8) | 200–400 Gb/s | 1–2 µs | Fat-tree, Spine-leaf | Training (AMD, Intel, open) |
+| **NVLink 4.0** | 900 GB/s per GPU | < 0.5 µs | NVSwitch full-mesh | Intra-node GPU comm |
+| **NVLink 5.0** | 1,800 GB/s per GPU | < 0.5 µs | NVSwitch full-mesh | Intra-node (Blackwell) |
+| **Ethernet (400 GbE)** | 400 Gb/s | 2–5 µs | Spine-leaf | Inference, data pipeline |
+
+### AI fabric principles
+
+- **Rail-optimized topology** — each GPU communicates on dedicated "rails" (same GPU indices across nodes connect to the same switch)
+- **Fat-tree (Clos)** — standard for InfiniBand and RoCE, non-blocking bisection bandwidth
+- **Dragonfly+** — reduces hop count while maintaining bandwidth (used in largest clusters)
+- **GPUDirect RDMA** — direct data path between GPU memory and the NIC (and thus GPU ↔ GPU across nodes) without staging through CPU memory; supported on InfiniBand and RoCE
+- **SHARP (Scalable Hierarchical Aggregation and Reduction Protocol)** — in-network reduction for AllReduce (InfiniBand only)
+
+### Bandwidth sizing
+
+```text
+Rule of thumb: InfiniBand bandwidth ≥ 50 % GPU HBM bandwidth for scalable training
+
+Example: H100 has 3.35 TB/s HBM
+  → Rule implies ~1.7 TB/s network bandwidth per GPU
+  → DGX H100: 1× NDR400 IB per GPU = 400 Gb/s = 50 GB/s ≈ 1.5 % HBM
+  → With 200 Gb/s (25 GB/s) per GPU: < 1 % HBM → bottleneck
+```
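+
+The gap can be written as a ratio of per-GPU network bandwidth to HBM bandwidth. This is a rough worked check, assuming one NDR 400 Gb/s port per GPU (as in the DGX H100 diagram in this document) and the 3.35 TB/s HBM figure above; the symbols are introduced here for illustration only:
+
+```math
+\frac{B_\text{net}}{B_\text{HBM}} = \frac{400\ \text{Gb/s} \div 8}{3350\ \text{GB/s}} = \frac{50\ \text{GB/s}}{3350\ \text{GB/s}} \approx 1.5\,\%
+```
+
+Under these assumptions the network delivers far less than the 50 % target, so in practice training relies on keeping most traffic on NVLink inside the node and on reducing inter-node traffic (for example with SHARP).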
+
+---
+
+## AI storage
+
+### Requirements
+
+| Dataset size | IO pattern | Recommended storage | Bandwidth |
+|-------------|-----------|-------------------|-----------|
+| < 10 TB | Sequential read (data loading) | Local NVMe | > 10 GB/s per node |
+| 10–100 TB | Random read (shuffled samples), sequential write (checkpointing) | Parallel FS (Lustre, Weka) | > 100 GB/s cluster-wide |
+| 100 TB–10 PB | Mixed (training + checkpoint) | Parallel FS + object store | > 500 GB/s |
+| 10 PB+ | Multi-modal, video, LLM | Tiered (NVMe cache + parallel FS + object) | > 1 TB/s |
+
+### Storage solution comparison
+
+| Solution | Type | Bandwidth (scope as noted) | Max capacity | Scaling | Use case |
+|--------|-----|-------------------|-------------|-----------|----------|
+| **Lustre** | Parallel FS (POSIX) | > 100 GB/s (cluster) | 100s PB | OST + MDS | HPC, LLM training (standard) |
+| **GPFS / StorageScale** | Parallel FS (POSIX) | > 100 GB/s | 100s PB | NSD servers | HPC, AI (IBM) |
+| **WekaFS** | Parallel FS (POSIX + NFS/SMB) | ~80 GB/s per 10 nodes | 10s PB | Container-native | AI/ML, NVIDIA DGX preferred |
+| **VAST Data** | Universal storage (NVMe + QLC) | ~100 GB/s per cluster | 10s PB | Scale-out | AI, checkpoint, data lake |
+| **Pure Storage//E** | All-flash (NVMe) | ~50 GB/s | ~30 PB | Scale-out | Enterprise AI, database |
+| **MinIO / S3** | Object store | ~20 GB/s per gateway | EB | Erasure coding | Dataset repository, checkpoint |
+| **NetApp AFF** | NAS + S3 | ~10 GB/s per controller | ~50 PB | HA pair | Enterprise, NFS baseline |
+
+### Checkpointing strategies
+
+| Strategy | RPO | Storage impact | Description |
+|-----------|-----|---------------|-------|
+| **Full checkpoint** | every N steps | High (stops training) | Full model + optimizer state |
+| **Async checkpoint** | every N steps | Medium (non-blocking) | Copy to staging buffer, async write |
+| **Distributed checkpoint** (NVIDIA NeMo) | every N steps | Low | Each rank writes its own shard |
+| **In-memory checkpoint** (IBM) | on failover | Minimal (DRAM) | Replication to another node's DRAM |
+| **Continuous checkpoint** (Microsoft) | every 1–5 min | Low (delta) | Changed shards only |
+
+---
+
+## AI cluster architecture
+
+### Physical topology — DGX H100 example
+
+```
+┌──────── DGX H100 (8× GPU) ────────┐
+│  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │
+│  │GPU 0│ │GPU 1│ │GPU 2│ │GPU 3│ │
+│  └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ │
+│  ┌──┴──┐ ┌──┴──┐ ┌──┴──┐ ┌──┴──┐ │
+│  │GPU 4│ │GPU 5│ │GPU 6│ │GPU 7│ │
+│  └─────┘ └─────┘ └─────┘ └─────┘ │
+│  NVSwitch (NVLink 4.0, 900 GB/s)  │
+│  InfiniBand CX-7: 8× NDR400       │
+└────────────────────────────────────┘
+         │ 8× IB rails
+    ┌────┴──────────────┐
+    │  IB NDR400 Switches │  (rail-optimized)
+    └────────────────────┘
+```
+
+### Kubernetes for AI
+
+| Component | Role |
+|-----------|------|
+| **Volcano** | Batch scheduling, gang scheduling, queue management |
+| **Kueue** | Multi-tenant admission, resource quotas, fair sharing |
+| **NVIDIA GPU Operator** | Driver, container toolkit, MIG, DCGM, monitoring |
+| **HAMi** (formerly k8s-vGPU-scheduler) | GPU sharing, MIG partitioning, fractional GPU |
+| **Node Feature Discovery** | GPU type detection, NUMA topology |
+| **Topology Manager** | NUMA-aware pod placement |
+| **DPDK / SR-IOV** | High-performance networking for GPU Direct RDMA |
+
+### Slurm for AI
+
+| Component | Role |
+|-----------|------|
+| **slurm.conf** | Partition for GPU nodes, GRES (Generic Resource) |
+| **gres.conf** | GPU type, GPU count per node |
+| **srun --gres=gpu:8** | Allocate 8 GPUs per job |
+| **sbatch --nodes=64 --ntasks=512** | 64 nodes, 512 ranks (8 GPU/node) |
+| **Pyxis** | NVIDIA container plugin for Slurm (used together with Enroot) |
+
+---
+
+## AI cluster cooling
+
+### Power density comparison
+
+| Configuration | TDP per node | Racks | kW/rack | Note |
+|-------------|-------------|-------|---------|----------|
+| Standard server (2U) | 1 kW | 20 | 5–10 | Typical DC |
+| GPU server (DGX H100, 6×) | 42 kW | 6 | 45–50 | Air cooling limit |
+| GPU server (DGX B200, 6×) | 72 kW | 6 | 90–100 | Liquid cooling required |
+| GPU server (GB200 NVL72) | 120 kW | — | ~120 | Liquid cooling mandatory |
+| NVIDIA NVL72 rack | 120 kW | 1 | 120 | Fully liquid cooled |
+
+### Cooling technologies
+
+| Method | Max kW/rack | CAPEX | OPEX | Complexity |
+|--------|-------------|-------|------|-----------|
+| **Air cooling (CRAC/CRAH)** | < 15 | Low | Medium | Low |
+| **Air cooling (in-row)** | 15–30 | Medium | Medium | Low |
+| **Rear-door heat exchanger** | 30–50 | Medium | Low | Medium |
+| **Direct-to-chip liquid (cold plate)** | 50–150 | High | Low | High |
+| **Immersion (single-phase)** | 100–200 | High | Low | High |
+| **Immersion (two-phase)** | 200+ | Very high | Low | Very high |
+
+---
+
+## Inference infrastructure
+
+### Inference server comparison
+
+| Tool | Frameworks | Optimization | Use case |
+|---------|-----------|-------------|----------|
+| **vLLM** | Megatron, HF, AWQ, GPTQ | PagedAttention, KV cache, continuous batching | LLM inference (open source) |
+| **TensorRT-LLM** | TensorRT | INT4/INT8/FP8, inflight batching, attention optimizations | Production (NVIDIA) |
+| **Triton Inference Server** | All (TensorRT, vLLM, PyTorch) | Model ensemble, model caching, concurrent execution | Enterprise, multi-model |
+| **SageMaker** | Managed | Auto-scaling, model parallelism | AWS managed |
+| **OpenAI API / TGI** | HF Transformers | Continuous batching, flash attention | Hosting |
+
+### Inference optimization
+
+| Technique | Latency improvement | Throughput improvement | Memory reduction |
+|----------|-----------------|---------------------|------------------|
+| **FP8/INT8 quantization** | — | 2× | 2× |
+| **INT4 quantization** | — | 4× | 4× |
+| **Flash Attention 2/3** | 2–4× | — | 50 % (KV cache) |
+| **PagedAttention** | — | 2–5× | 95 % (KV cache fragmentation) |
+| **Continuous batching** | — | 10–20× | — |
+| **Speculative decoding** | 2–3× | — | — |
+| **Multi-LoRA / S-LoRA** | — | 8–16× | — |
+
+---
+
+## Distributed training techniques
+
+| Technique | Description | Frameworks |
+|----------|-------|------------|
+| **Data Parallelism (DDP/FSDP)** | Each GPU has model copy, different batch | PyTorch DDP, FSDP |
+| **Tensor Parallelism (TP)** | Each layer's weight matrices split across GPUs (usually intra-node) | Megatron-LM, DeepSpeed |
+| **Pipeline Parallelism (PP)** | Consecutive groups of layers (stages) placed on different GPUs or nodes | Megatron-LM, DeepSpeed |
+| **Sequence Parallelism (SP)** | Sequence split across GPUs | Megatron-LM |
+| **Expert Parallelism (EP)** | Different expert subnets of a Mixture-of-Experts (MoE) model on different GPUs | Megatron-LM, DeepSpeed |
+| **3D Parallelism** | TP + PP + DP combination | Megatron-LM, NeMo |
+| **ZeRO (1/2/3)** | Optimizer/gradient/parameter sharding | DeepSpeed |
+| **NCCL / RCCL** | GPU collective communication library | NVIDIA/AMD |
+
+---
+
+## Operating systems for AI
+
+### Distribution comparison
+
+| OS | GPU driver | CUDA | Container toolkit | IB/RoCE | Lustre client | Production support |
+|----|-----------|------|-------------------|---------|--------------|-------------------|
+| **Ubuntu 22.04 LTS** | NVIDIA 525+ | 12.x | nvidia-container-toolkit | MLNX_OFED, rdma-core | Yes (lustre-client) | NVIDIA DGX standard |
+| **Ubuntu 24.04 LTS** | NVIDIA 550+ | 12.5+ | nvidia-container-toolkit | MLNX_OFED, rdma-core | Yes | Latest GPU support |
+| **RHEL 9 / Rocky 9** | NVIDIA 525+ | 12.x | nvidia-container-toolkit | MLNX_OFED | Yes (EL repo) | Red Hat, enterprise |
+| **DGX OS** (Ubuntu-based) | NVIDIA custom | 12.x | Pre-installed | Pre-configured | Yes | NVIDIA DGX only supported |
+| **SLES 15 SP5** | NVIDIA 525+ | 12.x | nvidia-container-toolkit | MLNX_OFED | Yes | HPC, some Lustre clusters |
+| **Debian 12** | NVIDIA 525+ | 12.x | nvidia-container-toolkit | rdma-core | Yes (backports) | Community, research |
+| **Flatcar / Bottlerocket** | Container-host | — | nvidia-container-toolkit | Limited | No | K8s-only, minimal footprint |
+
+### Limitations and constraints
+
+#### GPU drivers and CUDA
+
+| Constraint | Detail |
+|----------|--------|
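+| **Driver-CUDA compatibility** | The NVIDIA driver must be at least as new as the CUDA toolkit requires (driver ≥ CUDA req); it does not have to match the toolkit version. E.g., CUDA 12.4 ships with driver 550, and CUDA 12.x minor-version compatibility needs driver ≥ 525.60.13 |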
+| **Kernel version** | NVIDIA driver not compatible with all kernels. New kernel (6.8+) may require DKMS build or delayed support |
+| **Secure Boot** | NVIDIA driver requires signed module (MOK, shim) or disabled Secure Boot — common enterprise issue |
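+| **Open vs Proprietary driver** | NVIDIA open GPU kernel modules (`nvidia-open`, since R515) are open source. GPU support: Turing and newer → OK (required for Blackwell), older GPUs (Maxwell, Pascal, Volta) → proprietary required |
+| **nvidia-persistenced** | Keeps driver state loaded when no client is attached; without it the GPU is re-initialized at every job start, which adds latency and resets settings. Replaces the legacy `nvidia-smi -pm 1` persistence mode |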
+| **GPU reset** | After crashed training job, GPU may hang. `nvidia-smi --gpu-reset` or reboot node, sometimes power cycle |
+| **Multi-instance GPU (MIG)** | Requires a supported driver and MIG mode enabled on the GPU, which needs a GPU reset. The mode cannot be toggled while workloads are running. A30, A100, H100, H200, B200 only |
+
+#### Network (InfiniBand / RoCE)
+
+| Constraint | Detail |
+|----------|--------|
+| **MLNX_OFED vs rdma-core** | MLNX_OFED (NVIDIA) — full support, but own kernel modules, kernel version compatibility needed. `rdma-core` (open) — limited support, no custom modules |
+| **Kernel compatibility** | MLNX_OFED supports only specific kernel versions (major.minor). Kernel upgrade → MLNX_OFED rebuild required |
+| **NCCL** | NCCL version must be compatible with CUDA and IB firmware. `nccl-tests` for validation |
+| **SHARP** | In-network reduction requires specific MLNX_OFED + IB switch firmware combination |
+| **GPU Direct RDMA** | Requires `nvidia-peermem` module + MLNX_OFED. Does not work with all GPU and IB card combinations |
+| **RoCE PFC/ECN** | RoCE requires lossless fabric (PFC, ECN, DCQCN). Switch and host configuration — complex tuning |
+
+#### Storage
+
+| Constraint | Detail |
+|----------|--------|
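+| **Lustre client** | Client version must be compatible with the server (check the Lustre support matrix). Major server upgrade → plan client upgrades. Client modules are kernel-specific; prebuilt packages target RHEL, SLES, and Ubuntu kernels |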
+| **POSIX locking** | NFS and Lustre have different POSIX locking behavior. Distributed training relies on flock → problematic with mixed FS |
+| **Filesystem cache** | Page cache can mask IO bottlenecks. Training jobs often require `O_DIRECT` or sync IO |
+| **Local NVMe vs parallel FS** | Dataset staging on local NVMe eliminates network dependency but requires space and pre-fetch pipeline |
+
+#### Container runtime
+
+| Constraint | Detail |
+|----------|--------|
+| **Docker + GPU** | `nvidia-container-toolkit` (formerly nvidia-docker2). Requires runtime installation and config in `/etc/docker/daemon.json` |
+| **Podman + GPU** | Requires `nvidia-container-toolkit` + podman hook. Less tested than Docker |
+| **containerd + GPU** | Standard for K8s. Requires `cdi` (Container Device Interface) or `nvidia-container-runtime` |
+| **Enroot + Pyxis** | NVIDIA container stack for Slurm (Enroot = daemonless container runtime, Pyxis = Slurm plugin) |
+| **User namespace mapping** | Container GPU access requires device cgroup; rootless may fail (exception for /dev/dri and /dev/nvidia*) |
+
+#### Kernel parameters
+
+```text
+# AI workload recommended sysctl
+net.core.rmem_max = 134217728       # sufficient for NCCL
+net.core.wmem_max = 134217728
+net.ipv4.tcp_rmem = 4096 87380 134217728
+net.ipv4.tcp_wmem = 4096 65536 134217728
+net.core.netdev_budget = 600        # for high packet rate
+vm.max_map_count = 1048576          # PyTorch DataLoader workers
+kernel.numa_balancing = 0           # disable NUMA balancing (breaks locality)
+kernel.sched_min_granularity_ns = 10000000   # sysctl on kernels < 5.13 only; newer kernels expose it via debugfs
+
+# Kernel boot parameters (kernel command line, not sysctl)
+# Disabling security mitigations for perf: dedicated AI clusters only
+mitigations=off
+transparent_hugepage=never          # or madvise; THP may cause latency spikes
+intel_idle.max_cstate=1             # reduce C-state transition latency
+```
+
+#### Firmware and HW
+
+| Constraint | Detail |
+|----------|--------|
+| **GPU firmware (VBIOS)** | NVIDIA datacenter GPUs (H100, B200) have VBIOS updates via NVFlash. Without update → missing partitioning support or newer CUDA features |
+| **InfiniBand firmware** | IB switch and HCA firmware must be compatible. Mix old switch + new HCA → degraded perf |
+| **NVSwitch firmware** | DGX systems have NVSwitch firmware updatable only via NVIDIA DGX tools |
+| **Power capping (nvidia-smi)** | `nvidia-smi -pl <power>` — limit TDP for power budget management. Test impact on training throughput |
+| **GPU clock locking** | `nvidia-smi -ac <mem,graphics>` sets application clocks; `nvidia-smi -lgc <min,max>` locks the graphics clock range for stable benchmarks. Apply after `nvidia-persistenced` |
+| **PCIe Gen** | GPU in PCIe Gen4 slot (instead of Gen5) → bottleneck for CPU↔GPU data transfer. Important for FSDP sharding |
+
+### Recommended OS per use case
+
+| Use case | OS | Rationale |
+|----------|-----|-------|
+| **DGX cluster (production)** | DGX OS / Ubuntu 22.04 LTS | NVIDIA standard, best driver support |
+| **Enterprise K8s (OpenShift)** | RHEL 9 / RHCOS | Red Hat support, GPU Operator compatible |
+| **Vanilla K8s (on-prem)** | Ubuntu 22.04 LTS + Flatcar (workers) | Widest community support, Flatcar for minimal footprint |
+| **Slurm cluster (HPC/AI)** | Rocky Linux 9 / Ubuntu 22.04 LTS | EL ecosystem (Lustre, OFED) or Ubuntu (community) |
+| **Research / rapid prototyping** | Ubuntu 24.04 LTS | Latest CUDA, PyTorch, driver support |
+| **Edge inference** | NVIDIA JetPack / Ubuntu (ARM) | Embedded GPU (Jetson Orin, AGX) |
+
+---
+
+## AI-ready data center — check-list
+
+| Area | Requirement |
+|--------|-----------|
+| **Power** | 30–120 kW/rack, HVDC (400 V DC), UPS supporting GPU spikes |
+| **Cooling** | Liquid cooling ready (direct-to-chip), rear-door for 30+ kW |
+| **Network** | InfiniBand (NDR/XDR) or RoCEv2, rail-optimized fat-tree |
+| **Storage** | Parallel FS (Lustre/Weka), checkpoint bandwidth > 100 GB/s |
+| **GPU density** | Max GPU/rack, minimize NVSwitch hops |
+| **Physical** | Floor load 1,500+ kg/m², rack 52U–60U |
+| **Security** | Tenant isolation, network segmentation, data encryption |
+| **Monitoring** | DCGM, NCCL health checks, thermals, power capping |
+
+---
+
+## Model and throughput limitations
+
+### Model size per GPU
+
+Maximum model size fitting on a single GPU depends on HBM capacity and precision:
+
+| GPU | HBM | FP32 | FP16/BF16 | INT8 | INT4 |
+|-----|-----|------|-----------|------|------|
+| **H100 80GB** | 80 GB | ~20B | ~40B | ~80B | ~160B |
+| **H200 141GB** | 141 GB | ~35B | ~70B | ~140B | ~280B |
+| **B200 192GB** | 192 GB | ~48B | ~96B | ~192B | ~384B |
+| **MI300X 192GB** | 192 GB | ~48B | ~96B | ~192B | ~384B |
+| **A100 80GB** | 80 GB | ~20B | ~40B | ~80B | ~160B |
+| **GB200 (192+480)** | 192 GB GPU + 480 GB Grace | — | ~96B + CPU offload | — | — |
+
+*Approximate: 1B params ≈ 2 GB FP16 ≈ 4 GB FP32 ≈ 1 GB INT8 ≈ 0.5 GB INT4. Subtract ~10–15 % HBM for activations, KV cache, optimizer states.*
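+
+The table entries follow from dividing HBM capacity by bytes per parameter. A rough worked form, using the byte sizes from the note above and ignoring the 10–15 % headroom (the symbols are introduced here only for illustration):
+
+```math
+N_\text{max} \approx \frac{C_\text{HBM}}{b}, \qquad b = 4\ (\text{FP32}),\ 2\ (\text{FP16/BF16}),\ 1\ (\text{INT8}),\ 0.5\ (\text{INT4})\ \text{bytes per parameter}
+```
+
+```math
+\text{H100, FP16:}\quad \frac{80\ \text{GB}}{2\ \text{B/param}} = 40 \times 10^{9}\ \text{parameters}
+```
+
+With 15 % of HBM held back, the same GPU would hold about 0.85 × 40B = 34B parameters in FP16.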
+
+### Memory breakdown inference
+
+| Component | Llama 3 70B (FP16) | Llama 3 8B (FP16) |
+|------------|-------------------|-------------------|
+| Model weights | 140 GB | 16 GB |
+| KV cache (4K context, batch 1) | ~1.3 GB | ~0.5 GB |
+| KV cache (128K context, batch 1) | ~43 GB | ~17 GB |
+| Activations (peak) | ~5 GB | ~1 GB |
+| **Total 4K ctx** | ~146 GB | ~17.5 GB |
+| **Total 128K ctx** | ~188 GB | ~34 GB |
+
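+**Conclusion:** Llama 3 70B FP16 (~140 GB of weights) does not fit on a single H100 (80 GB). Options: INT8 (~70 GB of weights → 2× H100 or 1× H200, leaving room for KV cache), INT4 (~35 GB of weights → 1× H100), or tensor parallelism across several GPUs.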
+
+### Context length vs memory
+
+| Context | KV cache 70B (FP16) | KV cache 8B (FP16) | Note |
+|---------|-------------------|-------------------|------|
+| 4K | ~1.3 GB | ~0.5 GB | Typical chat |
+| 32K | ~11 GB | ~4.3 GB | Documents |
+| 128K | ~43 GB | ~17 GB | Long-context (Claude, Gemini) |
+| 1M | ~344 GB | ~137 GB | Experimental (Gemini 1.5 Pro) |
+
+KV cache grows **linearly with context length**, and also linearly with batch size, layer count, and the number of KV heads (not quadratically with head count). Critical for long-context inference.
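+
+A rough worked form of this scaling, assuming a standard multi-head or grouped-query attention layout. The architecture values plugged in for Llama 3 70B (80 layers, 8 KV heads, head dimension 128) and the 2-byte FP16 element size are assumptions for illustration, not figures taken from this document:
+
+```math
+M_\text{KV} = 2 \cdot n_\text{layers} \cdot n_\text{kv} \cdot d_\text{head} \cdot b_\text{elem} \cdot T \cdot B
+```
+
+```math
+\text{per token:}\quad 2 \cdot 80 \cdot 8 \cdot 128 \cdot 2\ \text{B} = 327{,}680\ \text{B} \approx 0.33\ \text{MB}
+```
+
+The leading 2 counts keys and values, T is the context length in tokens, and B is the batch size. Every factor enters linearly, so doubling the context or the batch doubles the cache; models without grouped-query attention have more KV heads and a proportionally larger cache.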
+
+### Throughput inference
+
+| Model | GPU | Precision | Batch size | Tokens/s | QPS (1K output) |
+|-------|-----|-----------|-----------|----------|-----------------|
+| Llama 3 8B | H100 | FP16 | 1 | ~800 | ~0.8 |
+| Llama 3 8B | H100 | FP16 | 128 | ~4,500 | ~4.5 |
+| Llama 3 8B | H100 | INT4 | 128 | ~8,000 | ~8 |
+| Llama 3 70B | 4× H100 | FP16 | 1 | ~180 | ~0.18 |
+| Llama 3 70B | 4× H100 | INT4 | 64 | ~1,200 | ~1.2 |
+| Llama 3 70B | 8× H100 | FP16 (TP=8) | 128 | ~2,500 | ~2.5 |
+| DeepSeek-R1 671B | 8× H200 | FP8 (MoE) | 64 | ~500 | ~0.5 |
+| GPT-4 class (est.) | — | — | — | ~100–300 | ~0.1–0.3 |
+
+**Notes:**
+- Tokens/s is aggregate across the batch; QPS (queries per second) assumes 1K output tokens per query, so QPS ≈ tokens/s ÷ 1,000
+- Larger batch increases throughput but also increases TTFT (time to first token)
+- Tensor Parallelism (TP) scales, but communication overhead grows with the TP degree
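+
+QPS relates to token throughput as follows. This is a minimal worked form, assuming tokens/s is the aggregate decode rate across the batch and every query produces the same number of output tokens (both are simplifying assumptions):
+
+```math
+\text{QPS} \approx \frac{\text{tokens/s}}{\text{output tokens per query}}, \qquad \text{e.g.}\quad \frac{800\ \text{tok/s}}{1000\ \text{tok/query}} = 0.8\ \text{QPS}
+```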
+
+### Training limits
+
+#### Scaling efficiency
+
+| GPU count | Model | Efficiency | Reason |
+|-----------|-------|-----------|-------|
+| 8 (1 node) | Llama 3 8B | ~95 % | NVLink intra-node |
+| 64 (8 nodes) | Llama 3 8B | ~85 % | IB inter-node |
+| 512 (64 nodes) | Llama 3 70B | ~75 % | Communication overhead |
+| 4,096 (512 nodes) | Llama 3 70B | ~60 % | Pipeline bubble, network |
+| 16,384 (2,048 nodes) | Llama 3 405B | ~45 % | Synchronous SGD overhead |
+
+**Note:** Efficiency = (actual throughput) / (GPU count × single-GPU throughput). In the table it falls roughly linearly with the logarithm of the GPU count.
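+
+Written out, with symbols introduced here only for illustration (T_N is the measured throughput on N GPUs, T_1 the throughput of a single GPU):
+
+```math
+\eta(N) = \frac{T_N}{N \cdot T_1}, \qquad \text{effective speedup} = \eta(N) \cdot N
+```
+
+For the 64-GPU row above, an efficiency of about 85 % corresponds to an effective speedup of roughly 0.85 × 64 ≈ 54× over a single GPU.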
+
+#### Memory breakdown training
+
+| Component | Llama 3 70B (BF16) | Llama 3 8B (BF16) |
+|------------|-------------------|-------------------|
+| Model weights | 140 GB | 16 GB |
+| Optimizer states (Adam) | 280 GB | 32 GB |
+| Gradients | 140 GB | 16 GB |
+| Activations (peak) | ~30 GB | ~4 GB |
+| **Total (DDP)** | ~590 GB | ~68 GB |
+| **Total (FSDP shard=8)** | ~74 GB | ~8.5 GB |
+
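+**Conclusion:** FSDP (Fully Sharded Data Parallel) or a comparable sharding scheme is required for training models > 10B. With Adam, training state is about 4× the weight memory (weights + gradients + two optimizer moments) before activations, versus weights only for inference.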
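+
+The DDP and FSDP totals in the table can be reproduced with a simple per-parameter budget. This is a sketch under the table's own assumptions (BF16 weights and gradients at 2 bytes each, two Adam moments at 2 bytes each); mixed-precision setups that keep FP32 master weights and optimizer states need more:
+
+```math
+M_\text{DDP} \approx P \cdot (b_w + b_g + b_\text{opt}) + A = 70 \times 10^{9} \cdot (2 + 2 + 4)\ \text{B} + 30\ \text{GB} = 590\ \text{GB}
+```
+
+```math
+M_\text{FSDP} \approx \frac{M_\text{DDP}}{N_\text{shards}} = \frac{590\ \text{GB}}{8} \approx 74\ \text{GB per GPU}
+```
+
+The second line follows the table in dividing everything by the shard count. Activations are not sharded by FSDP, so the real per-GPU figure is somewhat higher.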
+
+#### Time to train
+
+| Model | GPU count | GPU type | Precision | Time | Cost (on-prem estimate) |
+|-------|-----------|---------|-----------|------|---------------------|
+| Llama 3 8B | 64 | H100 | BF16 | ~3 days | ~$5,000 |
+| Llama 3 70B | 512 | H100 | BF16 | ~14 days | ~$100,000 |
+| Llama 3 405B | 16,384 | H100 | BF16 | ~60 days | ~$14 M |
+| DeepSeek-R1 671B (MoE) | 2,048 | H800 | BF16 | ~30 days | ~$6 M |
+| GPT-4 (est.) | 25,000 | A100/H100 | Mixed | ~90–100 days | ~$100 M |
+
+### Power and thermal limits
+
+| Configuration | TDP limit | Throughput loss | Reason |
+|-------------|-----------|------------------|--------|
+| H100 SXM | 700 W (default) | 0 % | Nominal |
+| H100 SXM | 600 W (-15 %) | ~5–8 % | Power capping |
+| H100 SXM | 500 W (-30 %) | ~15–25 % | Significant throttling |
+| H100 SXM | 400 W (-43 %) | ~30–50 % | Emergency only |
+| DGX H100 (8×) | 5.6 kW (max) | 0 % | Liquid cooling required |
+| DGX H100 (8×) | 4.5 kW (air) | ~10–15 % | Rear-door heat exchanger |
+
+GPU throttles when exceeding TDP or temperature (85°C+). Power capping correlates linearly with frequency but non-linearly with throughput.
+
+### API and operational limits
+
+| Limit | Description | Typical value |
+|-------|-------|-----------------|
+| **Rate limit** | Max requests per minute/hour | 100–10 000 RPM (per tier) |
+| **Tokens per minute (TPM)** | Max tokens per minute | 1M–300M (per model) |
+| **Context window** | Max input tokens | 4K–2M (per model) |
+| **Max output tokens** | Max generated tokens | 4K–32K (per model) |
+| **Concurrent requests** | Parallel request count | 10–10 000 (per backend) |
+| **Batch window** | Time to accumulate batch | 0–20 s (vLLM, TGI) |
+| **TTFT timeout** | Max latency to first token | 30–120 s |
+| **Idle timeout** | GPU idle → scale to 0 | 5–15 min (cloud) |
+
+### Limits per deployment model
+
+| Dimension | On-prem HW | Managed cloud (SageMaker, Vertex) | API (OpenAI, Anthropic) |
+|-----------|--------------|----------------------------------|------------------------|
+| **Model size** | Limited by HBM (max 192 GB/GPU) | Unlimited (cluster scaling) | Unlimited |
+| **Queries** | Limited by GPU count | Auto-scaling | Rate limit (per tier) |
+| **Latency** | < 10 ms (same node) | 10–100 ms (network hop) | 100 ms – 10 s |
+| **Customization** | Full (fine-tuning, quantization) | Managed (SageMaker, Bedrock) | Prompt engineering only |
+| **Data privacy** | Yes (on-prem) | Contractual (region, encryption) | Limited |
+| **Cost per 1M tokens** | ~$0.10–0.50 (FP16 inference) | ~$0.20–1.00 | ~$0.15–15.00 |
+| **Max context** | 128K+ (depending on GPU count) | 128K+ | 32K–2M |
+| **Cold start** | 0 (always-on) | 30 s – 5 min | 0 (shared infra) |
+
+---
+
+## GPU pricing and price/performance (2026)
+
+> Prices are approximate — NVIDIA does not publish official datacenter GPU price lists. Cloud prices from public providers (Q2 2026). HW purchase prices vary by volume, reseller, and region.
+
+### Purchase price (buy)
+
+| GPU | Price/GPU | Price 8× GPU baseboard | $/PFLOPS (FP16) | Note |
+|-----|---------|----------------------|----------------|------|
+| **H100 SXM** | $27,000–40,000 | ~$200,000 | $25,000 | Scarcity 2023–2024, now stabilized |
+| **H200 SXM** | $35,000–50,000 | ~$280,000 | ~$35,000 | H100 upgrade, HBM3e |
+| **B200** | ~$60,000–70,000 | ~$500,000+ | ~$31,000 | Blackwell, FP4 support |
+| **B100** | ~$30,000 | ~$240,000 | ~$20,000 | Lower price than B200, similar FP8 perf |
+| **GB200** (Grace+Blackwell) | ~$70,000–100,000 | ~$2,000,000 (rack) | — | CPU+GPU unified, high-density |
+| **A100 80GB** | ~$10,000–15,000 | ~$120,000 | ~$19,200 | Previous gen, still relevant |
+| **MI300X** | ~$12,000–18,000 | ~$100,000 | ~$9,600 | AMD, 192 GB HBM3 |
+| **Gaudi 3** | ~$15,625 | ~$125,000 | **$8,515** | Intel, best $/PFLOPS |
+| **L40S** | ~$8,000–10,000 | — | — | Inference, enterprise |
+
+### Cloud pricing (on-demand $/GPU/hr)
+
+| GPU | Cheapest | Mid-range (CoreWeave, Lambda) | Hyperscaler (AWS, GCP, Azure) |
+|-----|----------|-----------------------------|-------------------------------|
+| **H100 SXM** | $1.38 (Thunder) | $2.89–3.29 | $4.15–6.88 |
+| **H100 PCIe** | $2.01 (Spheron) | $2.50 | — |
+| **H200 SXM** | $3.89 (Spheron) | $4.54 | $5.00+ |
+| **B200** | **$3.39** (Spheron) | $6.02 | $14.24 (AWS) |
+| **B200 spot** | **$2.12** (Spheron) | — | — |
+| **GB200** | $3.50 (Runcrate) | $5.85 (Oracle) | $6.95 (GCP) |
+| **MI300X** | **$1.50** (TensorWave) | $1.85 (Vultr) | $7.86 (Azure) |
+| **A100 80GB** | $1.07 (Spheron) | $1.50–2.00 | $3.00+ |
+| **Gaudi 3** | ~$1.50–2.50 | — | — |
+| **L40S** | $0.91 (Spheron) | $1.50–2.00 | — |
+
+### Inference cost ($/M tokens)
+
+| GPU | Provider | $/hr | Est. tok/s | $/M tok |
+|-----|----------|------|-----------|--------|
+| **B200** | Spheron | $3.39 | ~4,000 | **$0.24** |
+| **B200 spot** | Spheron | $2.12 | ~4,000 | **$0.15** |
+| **H100 PCIe** | Spheron | $2.01 | ~1,200 | $0.47 |
+| **A100 80GB** | Spheron | $1.07 | ~520 | $0.57 |
+| **H100 SXM** | AWS | $6.88 | ~1,200 | $1.59 |
+| **H200 SXM** | Spheron | $4.54 | ~1,800 | $0.70 |
+| **L40S** | Spheron | $0.91 | ~450 | $0.56 |
+
+*Values for Llama 3 70B (INT8, batch=1, output 1K tok). Actual values vary by batch size, context, and quantization.*
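+
+The last column follows from the hourly price and the token rate. A minimal worked form, assuming the GPU is fully utilized for the whole billed hour (idle time raises the effective cost):
+
+```math
+\text{USD per M tok} = \frac{\text{USD/hr}}{\text{tok/s} \times 3600\ \text{s/hr}} \times 10^{6}
+```
+
+```math
+\text{H100 PCIe:}\quad \frac{2.01}{1200 \times 3600} \times 10^{6} \approx 0.47\ \text{USD per M tokens}
+```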
+
+### Cost per GB HBM
+
+| GPU | HBM | Price/hr cloud | $/GB/hr | Best for memory-bound workloads |
+|-----|-----|-------------|--------|--------------------------------|
+| **MI300X** | 192 GB | $1.50 | **$0.0078** | ✅ Best |
+| **B200** | 192 GB | $3.39 | $0.0177 | ✅ Good |
+| **H200** | 141 GB | $3.89 | $0.0276 | ⚠️ |
+| **H100 SXM** | 80 GB | $1.38 | $0.0173 | ⚠️ Only up to 70B models |
+| **GB200** | 384 GB | $3.50 | $0.0091 | ✅✅ (2× MI300X capacity) |
+
+### Price/performance by scenario
+
+| Scenario | Winner | Rationale |
+|----------|--------|-----------|
+| **Absolute performance** (cost no object) | **GB200 NVL72** | 72× GPU; ~18 PFLOPS FP8 and 384 GB HBM per GB200 superchip (2 GPUs) |
+| **Cloud inference** — best $/token | **B200 spot** | $0.15/M tok; ~3× H100 throughput at lower cost per token |
+| **Cloud inference** — on-demand | **B200** | $0.24/M tok |
+| **Cloud inference** — budget | **A100 / L40S** | $0.56–0.57/M tok |
+| **Training** — price/perf on purchase | **Gaudi 3** | $8,515/PFLOPS, 2.5–3× better than H100 |
+| **Training** — cloud | **H100 SXM** | $1.38/hr, CUDA ecosystem, NCCL |
+| **Memory-bound** — long context, 70B+ | **MI300X / GB200** | 192–384 GB, $0.0078–0.0091/GB |
+| **Ecosystem + safe choice** | **H100/H200** | CUDA, widest SW, NVIDIA tools |
+| **Spot / preemptible** — lowest cost | **A100 / H100** | $1.07–1.38/hr, 50–90% off on-demand |
+
+### 2026 Trends
+
+- **H100** — price dropped 64% from peak $8/hr to $1.38–2.89/hr, then 40% rebound from inference demand
+- **B200** — new high-end, $3.39/hr cloud → ~$0.15/M tok on spot — new inference benchmark
+- **MI300X** — supply growing (TensorWave, Vultr, CoreWeave, Oracle, Azure), from $1.50/hr
+- **Gaudi 3** — best $/PFLOPS on purchase, but narrow ecosystem and limited cloud availability
+- **Market bifurcation** — prior gen (H100, A100) commoditizing, new gen (B200, GB200) commanding premium
+
+- [GPU.en.md](GPU.en.md) — GPU architecture, NVIDIA/AMD, vGPU, MIG
+- [NETWORKING.en.md](NETWORKING.en.md) — InfiniBand, RoCE, network topology
+- [STORAGE.en.md](STORAGE.en.md) — parallel filesystem, object store
+- [DATACENTERS.en.md](DATACENTERS.en.md) — DC layout, power, cooling
+- [CLOUD.en.md](CLOUD.en.md) — cloud AI services (SageMaker, Vertex AI)
+
+## Sources
+
+Links, books, and standards: [sources/infrastructure/sources.en.md](sources/infrastructure/sources.en.md)
+
+*Last revision: 2026-06-18*
--- a/AI-INFRASTRUCTURE.md
+++ b/AI-INFRASTRUCTURE.md
@@ -0,0 +1,602 @@
+# 🧠 AI/ML Infrastructure
+
+## Component overview
+
+```mermaid
+flowchart TD
+    subgraph Compute
+        GPU["GPU (H100/B200/Instinct)"]
+        CPU["CPU (AMD EPYC / Intel Xeon)"]
+        ASIC["ASIC (TPU, Trainium, Inferentia)"]
+    end
+    subgraph Network
+        IB["InfiniBand NDR/XDR"]
+        ROCE["RoCEv2"]
+        NVL["NVLink / NVSwitch"]
+    end
+    subgraph Storage
+        FS["Parallel FS (Lustre, GPFS, Weka)"]
+        OBJ["Object Store (S3, MinIO)"]
+        NVME["Local NVMe cache"]
+    end
+    subgraph Orchestration
+        S["Slurm"]
+        K["Kubernetes + Volcano/Kueue"]
+    end
+    subgraph Cooling
+        DLC["Direct-to-chip liquid"]
+        IMM["Immersion"]
+        AIR["Air (high-density)"]
+    end
+
+    Compute --> Network --> Storage
+    Orchestration --> Compute
+    Cooling --> Compute
+```
+
+---
+
+## GPU compute
+
+### NVIDIA
+
+| GPU | Architecture | FP8 | FP16/BF16 | FP64 | HBM | NVLink | TDP | Rack config |
+|-----|-------------|-----|-----------|------|-----|--------|-----|------|
+| **H100 SXM** | Hopper | 3,958 TFLOPS | 1,979 TFLOPS | 67 TFLOPS | 80 GB HBM3 | 900 GB/s | 700 W | 8× in DGX H100 |
+| **H200 SXM** | Hopper (HBM3e) | 3,958 TFLOPS | 1,979 TFLOPS | 67 TFLOPS | 141 GB HBM3e | 900 GB/s | 700 W | 8× in DGX H200 |
+| **B200** | Blackwell | ~9,000 TFLOPS | ~4,500 TFLOPS | ~40 TFLOPS | 192 GB HBM3e | 1,800 GB/s | 1,000 W | 8× in DGX B200 |
+| **GB200 Grace Blackwell** | Blackwell | ~18,000 TFLOPS | ~9,000 TFLOPS | — | 192 GB + 480 GB (Grace) | NVLink-C2C | 1,000 W (GPU) + 500 W (CPU) | DGX GB200 (36× GPU) |
+| **L40S** | Ada Lovelace | 733 TFLOPS | 367 TFLOPS | — | 48 GB GDDR6 | N/A | 350 W | Inference, enterprise |
+| **A100 SXM** | Ampere | — (no FP8; 1,248 TOPS INT8) | 624 TFLOPS | 19.5 TFLOPS | 80 GB HBM2e | 600 GB/s | 400 W | DGX A100 |
+
+### AMD
+
+| GPU | Architecture | FP8 | FP16/BF16 | FP64 | HBM | Infinity Fabric | TDP |
+|-----|-------------|-----|-----------|------|-----|----------------|-----|
+| **MI300X** | CDNA 3 | 2,615 TFLOPS | 1,307 TFLOPS | 81 TFLOPS | 192 GB HBM3 | 896 GB/s | 750 W |
+| **MI250** | CDNA 2 | — | 383 TFLOPS | 95.7 TFLOPS | 128 GB HBM2e | 400 GB/s | 500 W |
+
+### Intel
+
+| GPU | Architecture | FP16/BF16 | FP32 | HBM | TDP |
+|-----|-------------|-----------|------|-----|-----|
+| **Gaudi 3** | Custom | 1,835 TFLOPS | — | 128 GB HBM2e | 600 W |
+| **Max 1550** | Xe HPC | 600+ TFLOPS | 200 TFLOPS | 128 GB HBM2e | 600 W |
+
+### Cloud ASIC
+
+| ASIC | Provider | Use case | Performance |
+|------|----------|----------|-------|
+| **TPU v5p** | Google | Training | ~459 TFLOPS (BF16) per chip |
+| **Trainium 2** | AWS | Training | ~1,000 TFLOPS (BF16) per chip |
+| **Inferentia 2** | AWS | Inference | ~400 TOPS (INT8) per chip |
+| **Maia 100** | Microsoft | Training + inference | Custom, 800 W TDP |
+
+---
+
+## AI networking
+
+### Technology comparison
+
+| Technology | Bandwidth per link | Latency | Topology | Use case |
+|-------------|-------------------|---------|-----------|----------|
+| **InfiniBand NDR200** | 200 Gb/s | < 1 µs | Fat-tree, Dragonfly+ | Training (NVIDIA) |
+| **InfiniBand NDR400** | 400 Gb/s | < 1 µs | Fat-tree, Dragonfly+ | Training (NVIDIA) |
+| **InfiniBand XDR** | 800 Gb/s (planned) | < 1 µs | Dragonfly+ | Next-gen training |
+| **RoCEv2** (CX-7/8) | 200–400 Gb/s | 1–2 µs | Fat-tree, Spine-leaf | Training (AMD, Intel, open) |
+| **NVLink 4.0** | 900 GB/s per GPU | < 0.5 µs | NVSwitch full-mesh | Intra-node GPU comm |
+| **NVLink 5.0** | 1,800 GB/s per GPU | < 0.5 µs | NVSwitch full-mesh | Intra-node (Blackwell) |
+| **Ethernet (400 GbE)** | 400 Gb/s | 2–5 µs | Spine-leaf | Inference, data pipeline |
+
+### AI fabric principles
+
+- **Rail-optimized topology** — each GPU communicates on a dedicated "rail": GPUs with the same index across nodes attach to the same switch
+- **Fat-tree (Clos)** — the standard for InfiniBand and RoCE, with non-blocking bisection bandwidth
+- **Dragonfly+** — fewer hops at the same bandwidth (used in the largest clusters)
+- **GPU Direct RDMA** — direct GPU ↔ GPU communication without CPU involvement, supported on both InfiniBand and RoCE
+- **SHARP (Scalable Hierarchical Aggregation and Reduction Protocol)** — in-network reduction for AllReduce (InfiniBand only)
+
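The rail-optimized principle can be sketched as a mapping from a GPU's position to a leaf switch. This is only an illustration under stated assumptions: the names `rail_switch_for` and `build_rails`, the `rail-N` naming scheme, and the choice of 8 GPUs per node with one NIC per GPU are hypothetical and do not come from any vendor tool.

```python
from collections import defaultdict

GPUS_PER_NODE = 8  # assumption: one NIC (rail) per GPU in an 8-GPU node


def rail_switch_for(node: int, local_gpu: int) -> str:
    """Return the (hypothetical) leaf switch name for a GPU's NIC.

    In a rail-optimized fabric the switch depends only on the local GPU
    index, so GPU k of every node attaches to the same rail switch.
    """
    if node < 0:
        raise ValueError("node must be non-negative")
    if not 0 <= local_gpu < GPUS_PER_NODE:
        raise ValueError(f"local_gpu must be in [0, {GPUS_PER_NODE})")
    return f"rail-{local_gpu}"


def build_rails(num_nodes: int) -> dict[str, list[tuple[int, int]]]:
    """Group every (node, gpu) pair by the rail switch it attaches to."""
    rails: dict[str, list[tuple[int, int]]] = defaultdict(list)
    for node in range(num_nodes):
        for gpu in range(GPUS_PER_NODE):
            rails[rail_switch_for(node, gpu)].append((node, gpu))
    return dict(rails)


if __name__ == "__main__":
    for switch, members in build_rails(num_nodes=4).items():
        print(switch, members)
```

With this layout, a collective between GPUs of the same index on different nodes stays on a single leaf switch, which is the property the topology is designed for.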
+### Bandwidth sizing
+
+```text
+Rule of thumb: InfiniBand bandwidth ≥ 50 % of GPU HBM bandwidth for scalable training
+
+Example: H100 has 3.35 TB/s of HBM bandwidth
+  → The rule calls for at least ~1.6 TB/s of network bandwidth per GPU
+  → 4× NDR400 IB per GPU would give 4 × 50 GB/s = 200 GB/s ≈ 6 % of HBM bandwidth
+  → In practice: DGX H100 has 8× NDR400 per node, one 400 Gb/s (50 GB/s) link per GPU ≈ 1.5 % of HBM bandwidth → bottleneck
+```
+
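The sizing arithmetic above boils down to the ratio of per-GPU network bandwidth to HBM bandwidth. A minimal sketch, assuming the H100 figure of 3.35 TB/s quoted in the example and 400 Gb/s links; the function name `network_to_hbm_ratio` is made up for illustration:

```python
def network_to_hbm_ratio(link_gbit_s: float, links_per_gpu: int, hbm_tb_s: float) -> float:
    """Fraction of HBM bandwidth that the per-GPU network links can deliver."""
    if links_per_gpu < 0 or hbm_tb_s <= 0:
        raise ValueError("links_per_gpu must be >= 0 and hbm_tb_s must be > 0")
    network_gb_s = link_gbit_s / 8 * links_per_gpu  # Gb/s -> GB/s
    hbm_gb_s = hbm_tb_s * 1000                      # TB/s -> GB/s
    return network_gb_s / hbm_gb_s


if __name__ == "__main__":
    # One 400 Gb/s link per GPU versus a hypothetical four links per GPU.
    for links in (1, 4):
        ratio = network_to_hbm_ratio(link_gbit_s=400, links_per_gpu=links, hbm_tb_s=3.35)
        print(f"{links} link(s) per GPU: {ratio:.1%} of HBM bandwidth")
```

Either way the ratio stays far below the 50 % target, which is why collectives are overlapped with compute rather than sized to match HBM.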
+---
+
+## AI storage
+
+### Requirements
+
+| Dataset size | I/O pattern | Recommended storage | Bandwidth |
+|-------------|-----------|-------------------|-----------|
+| < 10 TB | Sequential read (data loading) | Local NVMe | > 10 GB/s per node |
+| 10–100 TB | Random read (checkpointing) | Parallel FS (Lustre, Weka) | > 100 GB/s cluster-wide |
+| 100 TB–10 PB | Mixed (training + checkpoint) | Parallel FS + object store | > 500 GB/s |
+| 10 PB+ | Multi-modal, video, LLM | Tiered (NVMe cache + parallel FS + object) | > 1 TB/s |
+
+### Storage solution comparison
+
+| Solution | Type | Bandwidth per node | Max capacity | Scaling | Use case |
+|--------|-----|-------------------|-------------|-----------|----------|
+| **Lustre** | Parallel FS (POSIX) | > 100 GB/s (cluster) | 100s PB | OST + MDS | HPC, LLM training (standard) |
+| **GPFS / StorageScale** | Parallel FS (POSIX) | > 100 GB/s | 100s PB | NSD servers | HPC, AI (IBM) |
+| **WekaFS** | Parallel FS (POSIX + NFS/SMB) | ~80 GB/s per 10 nodes | 10s PB | Container-native | AI/ML, NVIDIA DGX preferred |
+| **VAST Data** | Universal storage (NVMe + QLC) | ~100 GB/s per cluster | 10s PB | Scale-out | AI, checkpoint, data lake |
+| **Pure Storage//E** | All-flash (NVMe) | ~50 GB/s | ~30 PB | Scale-out | Enterprise AI, database |
+| **MinIO / S3** | Object store | ~20 GB/s per gateway | EB | Erasure coding | Dataset repository, checkpoint |
+| **NetApp AFF** | NAS + S3 | ~10 GB/s per controller | ~50 PB | HA pair | Enterprise, NFS baseline |
+
+### Checkpointing strategies
+
+| Strategy | RPO | Storage impact | Description |
+|-----------|-----|---------------|-------|
+| **Full checkpoint** | every N steps | High (pauses training) | Entire model + optimizer state |
+| **Async checkpoint** | every N steps | Medium (non-blocking) | Copy to a staging buffer, write in the background |
+| **Distributed checkpoint** (NVIDIA NeMo) | every N steps | Low | Each rank writes its own shard |
+| **In-memory checkpoint** (IBM) | on failover | Minimal (DRAM) | Replication into the DRAM of another node |
+| **Continuous checkpoint** (Microsoft) | every 1–5 min | Low (delta) | Only changed shards |
+
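The async strategy (snapshot into a staging buffer, then write in the background) can be sketched with the standard library alone. This is not how any particular training framework implements it: `pickle` stands in for a real checkpoint format, the function name `async_checkpoint` is hypothetical, and a real job would first copy GPU tensors into host memory.

```python
import pickle
import tempfile
import threading
from pathlib import Path


def async_checkpoint(state: dict, path: Path) -> threading.Thread:
    """Snapshot `state` in memory, then write it to `path` in the background."""
    # Blocking part: serialize into an in-memory staging buffer.
    staging = pickle.dumps(state)

    def _write() -> None:
        tmp = path.with_suffix(".tmp")
        tmp.write_bytes(staging)
        tmp.replace(path)  # rename so readers never see a partial file

    writer = threading.Thread(target=_write, daemon=True)
    writer.start()
    return writer


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = Path(d) / "step_100.ckpt"
        state = {"step": 100, "weights": [0.1, 0.2, 0.3]}
        writer = async_checkpoint(state, target)
        state["step"] = 101  # training continues and mutates the live state
        writer.join()        # wait only when the file is actually needed
        print(pickle.loads(target.read_bytes())["step"])
```

Because the snapshot is taken before the thread starts, later mutations of the live state do not leak into the checkpoint; the cost is one extra in-memory copy.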
+---
+
+## AI cluster architecture
+
+### Physical topology — DGX H100 example
+
+```
+┌──────── DGX H100 (8× GPU) ────────┐
+│  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │
+│  │GPU 0│ │GPU 1│ │GPU 2│ │GPU 3│ │
+│  └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ │
+│  ┌──┴──┐ ┌──┴──┐ ┌──┴──┐ ┌──┴──┐ │
+│  │GPU 4│ │GPU 5│ │GPU 6│ │GPU 7│ │
+│  └─────┘ └─────┘ └─────┘ └─────┘ │
+│  NVSwitch (NVLink 4.0, 900 GB/s)  │
+│  InfiniBand CX-7: 8× NDR400       │
+└────────────────────────────────────┘
+         │ 8× IB rails
+    ┌────┴───────────────┐
+    │ IB NDR400 switches │  (rail-optimized)
+    └────────────────────┘
+```
+
+### Kubernetes for AI
+
+| Component | Role |
+|-----------|------|
+| **Volcano** | Batch scheduling, gang scheduling, queue management |
+| **Kueue** | Multi-tenant admission, resource quotas, fair sharing |
+| **NVIDIA GPU Operator** | Driver, container toolkit, MIG, DCGM, monitoring |
+| **HAMi** (ex k8s-vGPU-scheduler) | GPU sharing, MIG partitioning, fractional GPU |
+| **Node Feature Discovery** | Detects GPU type and NUMA topology |
+| **Topology Manager** | NUMA-aware pod placement |
+| **DPDK / SR-IOV** | High-performance networking for GPU Direct RDMA |
+
+### Slurm for AI
+
+| Component | Role |
+|-----------|------|
+| **slurm.conf** | Partitions for GPU nodes, GRES (Generic Resource) |
+| **gres.conf** | GPU type, number of GPUs per node |
+| **srun --gres=gpu:8** | Allocates 8 GPUs per node for the job |
+| **sbatch --nodes=64 --ntasks=512** | 64 nodes, 512 ranks (8 GPUs/node) |
+| **Pyxis** | NVIDIA container plugin for Slurm (used together with Enroot) |
+
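The relationship between nodes, ranks, and GRES in the table can be made explicit with a small helper that derives `--ntasks` from the node count. A sketch only: the helper name `sbatch_args` and the script name `train.sh` are hypothetical, and it assumes one rank per GPU.

```python
def sbatch_args(nodes: int, gpus_per_node: int = 8, script: str = "train.sh") -> list[str]:
    """Build an sbatch argument list with one task (rank) per GPU."""
    if nodes < 1 or gpus_per_node < 1:
        raise ValueError("nodes and gpus_per_node must be positive")
    ntasks = nodes * gpus_per_node
    return [
        "sbatch",
        f"--nodes={nodes}",
        f"--ntasks={ntasks}",
        f"--gres=gpu:{gpus_per_node}",  # GRES is requested per node
        script,
    ]


if __name__ == "__main__":
    print(" ".join(sbatch_args(nodes=64)))
```

Deriving the task count instead of typing it by hand avoids the common mismatch where `--ntasks` and the per-node GPU count drift apart.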
+---
+
+## Cooling AI clusters
+
+### Power density comparison
+
+| Configuration | TDP per node | Racks | kW/rack | Note |
+|-------------|-------------|-------|---------|----------|
+| Standard server (2U) | 1 kW | 20 | 5–10 | Typical DC |
+| GPU server (DGX H100, 6×) | 42 kW | 6 | 45–50 | Air cooling limit |
+| GPU server (DGX B200, 6×) | 72 kW | 6 | 90–100 | Liquid cooling required |
+| GPU server (GB200 NVL72) | 120 kW | — | ~120 | Liquid cooling mandatory |
+| NVIDIA NVL72 rack | 120 kW | 1 | 120 | Fully liquid cooled |
+
+### Cooling technologies
+
+| Method | Max kW/rack | CAPEX | OPEX | Complexity |
+|--------|-------------|-------|------|-----------|
+| **Air cooling (CRAC/CRAH)** | < 15 | Low | Medium | Low |
+| **Air cooling (in-row)** | 15–30 | Medium | Medium | Low |
+| **Rear-door heat exchanger** | 30–50 | Medium | Low | Medium |
+| **Direct-to-chip liquid (cold plate)** | 50–150 | High | Low | High |
+| **Immersion (single-phase)** | 100–200 | High | Low | High |
+| **Immersion (two-phase)** | 200+ | Very high | Low | Very high |
+
+---
+
+## Inference infrastructure
+
+### Inference server comparison
+
+| Tool | Frameworks | Optimizations | Use case |
+|---------|-----------|-------------|----------|
+| **vLLM** | Megatron, HF, AWQ, GPTQ | PagedAttention, KV cache, continuous batching | LLM inference (open source) |
+| **TensorRT-LLM** | TensorRT | INT4/INT8/FP8, inflight batching, attention optimizations | Production (NVIDIA) |
+| **Triton Inference Server** | All (TensorRT, vLLM, PyTorch) | Model ensemble, model caching, concurrent execution | Enterprise, multi-model |
+| **SageMaker** | Managed | Auto-scaling, model parallelism | AWS managed |
+| **OpenAI API / TGI** | HF Transformers | Continuous batching, flash attention | Hosting |
+
+### Inference optimizations
+
+| Technique | Latency improvement | Throughput improvement | Memory reduction |
+|----------|-----------------|---------------------|------------------|
+| **FP8/INT8 quantization** | — | 2× | 2× |
+| **INT4 quantization** | — | 4× | 4× |
+| **Flash Attention 2/3** | 2–4× | — | 50 % (KV cache) |
+| **PagedAttention** | — | 2–5× | 95 % (KV cache fragmentation) |
+| **Continuous batching** | — | 10–20× | — |
+| **Speculative decoding** | 2–3× | — | — |
+| **Multi-LoRA / S-LoRA** | — | 8–16× | — |
+
+---
+
+## Distributed training techniques
+
+| Technique | Description | Frameworks |
+|----------|-------|------------|
+| **Data Parallelism (DDP/FSDP)** | Each GPU holds a copy of the model and processes a different batch (FSDP additionally shards the model state) | PyTorch DDP, FSDP |
+| **Tensor Parallelism (TP)** | Each layer's weight matrices are split across GPUs (typically intra-node) | Megatron-LM, DeepSpeed |
+| **Pipeline Parallelism (PP)** | Consecutive layers are split into stages across nodes | Megatron-LM, DeepSpeed |
+| **Sequence Parallelism (SP)** | The sequence is split across GPUs | Megatron-LM |
+| **Expert Parallelism (EP)** | Different expert subnetworks live on different GPUs | Mixture-of-Experts (MoE) |
+| **3D Parallelism** | Combination of TP + PP + DP | Megatron-LM, NeMo |
+| **ZeRO (1/2/3)** | Sharding of optimizer state / gradients / parameters | DeepSpeed |
+| **NCCL / RCCL** | GPU collective communication library | NVIDIA/AMD |
+
+---
+
+## Operating systems for AI
+
+### Distribution comparison
+
+| OS | GPU driver | CUDA | Container toolkit | IB/RoCE | Lustre client | Production support |
+|----|-----------|------|-------------------|---------|--------------|-------------------|
+| **Ubuntu 22.04 LTS** | NVIDIA 525+ | 12.x | nvidia-container-toolkit | MLNX_OFED, rdma-core | Yes (lustre-client) | NVIDIA DGX standard |
+| **Ubuntu 24.04 LTS** | NVIDIA 550+ | 12.5+ | nvidia-container-toolkit | MLNX_OFED, rdma-core | Yes | Latest GPU support |
+| **RHEL 9 / Rocky 9** | NVIDIA 525+ | 12.x | nvidia-container-toolkit | MLNX_OFED | Yes (EL repo) | Red Hat, enterprise |
+| **DGX OS** (Ubuntu-based) | NVIDIA custom | 12.x | Pre-installed | Pre-configured | Yes | The only OS supported on NVIDIA DGX |
+| **SLES 15 SP5** | NVIDIA 525+ | 12.x | nvidia-container-toolkit | MLNX_OFED | Yes | HPC, some Lustre clusters |
+| **Debian 12** | NVIDIA 525+ | 12.x | nvidia-container-toolkit | rdma-core | Yes (backports) | Community, research |
+| **Flatcar / Bottlerocket** | Container-host | — | nvidia-container-toolkit | Limited | No | K8s-only, minimal footprint |
+
+### Constraints and limits
+
+#### GPU drivers and CUDA
+
+| Constraint | Detail |
+|----------|--------|
+| **Driver-CUDA compatibility** | The NVIDIA driver must be at least the minimum version the CUDA toolkit requires (driver ≥ CUDA requirement). E.g. CUDA 12.5 requires driver ≥ 550 |
+| **Kernel version** | The NVIDIA driver is not compatible with every kernel. A new kernel (6.8+) may need a DKMS build or may get support only after a delay |
+| **Secure Boot** | The NVIDIA driver needs a signed module (MOK, shim) or Secure Boot disabled, a common problem in enterprise environments |
+| **Open vs proprietary driver** | NVIDIA `nvidia-open` (since R515) is the open source kernel module. GPU support: data center GPUs (H100+) → OK; older (pre-Turing) GPUs → proprietary module required |
+| **nvidia-persistenced** | Needed to keep the GPU initialized; without it the GPU is torn down after an idle timeout (persistence mode can also be enabled with `nvidia-smi -pm 1`) |
+| **GPU reset** | After a training job crashes, the GPU can hang. Use `nvidia-smi --gpu-reset` or reboot the node; sometimes a power cycle is needed |
+| **Multi-instance GPU (MIG)** | Requires a supporting driver, MIG mode enabled on the GPU, and a GPU reset. The mode cannot be changed while workloads are running. Supported only on select data center GPUs (e.g. A100, H100, B200) |
+
+#### Network (InfiniBand / RoCE)
+
+| Constraint | Detail |
+|----------|--------|
+| **MLNX_OFED vs rdma-core** | MLNX_OFED (NVIDIA): full feature support, but it ships its own kernel modules, which must match the kernel version. `rdma-core` (open): more limited feature support, but no out-of-tree modules |
+| **Kernel compatibility** | MLNX_OFED supports only specific kernel versions (major.minor). A kernel upgrade means rebuilding MLNX_OFED |
+| **NCCL** | The NCCL version must be compatible with CUDA and the IB firmware. Use `nccl-tests` for validation |
+| **SHARP** | In-network reduction requires a specific combination of MLNX_OFED and IB switch firmware |
+| **GPU Direct RDMA** | Requires the `nvidia-peermem` module + MLNX_OFED. Does not work with every GPU and IB adapter |
+| **RoCE and PFC/ECN** | RoCE requires a lossless fabric (PFC, ECN, DCQCN). Both switches and hosts must be configured, and the tuning is complex |
+
+#### Storage
+
+| Constraint | Detail |
+|----------|--------|
+| **Lustre client** | The client version must match the server. A server upgrade means upgrading all clients. Compatible only with RHEL/Debian derivatives |
+| **POSIX locking** | NFS and Lustre differ in POSIX locking behavior. Distributed training relies on flock, which causes problems on mixed filesystems |
+| **Filesystem cache** | The page cache can mask an I/O bottleneck. Training jobs often need `O_DIRECT` or `sync` I/O |
+| **Local NVMe vs parallel FS** | Staging the dataset on local NVMe removes the network dependency, but needs space and a pre-fetch pipeline |
+
+#### Container runtime
+
+| Constraint | Detail |
+|----------|--------|
+| **Docker + GPU** | `nvidia-container-toolkit` (formerly nvidia-docker2). The runtime must be installed and configured in `/etc/docker/daemon.json` |
+| **Podman + GPU** | Requires `nvidia-container-toolkit` + a Podman hook. Less tested than Docker |
+| **containerd + GPU** | The standard for K8s. Requires CDI (Container Device Interface) or `nvidia-container-runtime` |
+| **Enroot + Pyxis** | NVIDIA container stack for Slurm (Enroot = daemonless container runtime, Pyxis = Slurm plugin) |
+| **User namespace mapping** | GPU access from containers needs device cgroup rules, and rootless setups can fail (exceptions are needed for /dev/dri and /dev/nvidia*) |
+
+#### Kernel parameters
+
+```text
+# Recommended sysctl settings for AI workloads
+net.core.rmem_max = 134217728       # large enough for NCCL
+net.core.wmem_max = 134217728
+net.ipv4.tcp_rmem = 4096 87380 134217728
+net.ipv4.tcp_wmem = 4096 65536 134217728
+net.core.netdev_budget = 600        # for high packet rates
+vm.max_map_count = 1048576          # PyTorch DataLoader workers
+kernel.numa_balancing = 0           # disable automatic NUMA balancing (it disturbs locality)
+kernel.sched_min_granularity_ns = 10000000   # only on kernels < 5.13; later kernels moved this to debugfs
+
+# Kernel boot parameters (kernel command line, not sysctl)
+# Disabling security mitigations for performance: only on dedicated AI clusters
+mitigations=off
+transparent_hugepage=never          # or madvise; THP can cause latency spikes
+intel_idle.max_cstate=1             # reduces C-state transition latency
+```
+
+#### Firmware and HW
+
+| Constraint | Detail |
+|----------|--------|
+| **GPU firmware (VBIOS)** | NVIDIA data center GPUs (H100, B200) receive VBIOS updates via NVFlash. Without the update, partitioning support or newer CUDA features may be missing |
+| **InfiniBand firmware** | IB switch and HCA firmware must be compatible. Mixing an old switch with a new HCA → degraded performance |
+| **NVSwitch firmware** | On DGX systems, NVSwitch firmware can be updated only through NVIDIA DGX tools |
+| **Power capping (nvidia-smi)** | `nvidia-smi -pl <power>` limits board power for power budget management. The impact on training throughput must be tested |
+| **GPU clock locking** | `nvidia-smi -ac <memClock,graphicsClock>` locks application clocks for stable benchmarks. Apply only after `nvidia-persistenced` is running |
+| **PCIe Gen** | A GPU in a PCIe Gen4 slot (instead of Gen5) → bottleneck for CPU↔GPU data transfer. Matters for FSDP sharding |
+
+### Recommended OS per use case
+
+| Use case | OS | Rationale |
+|----------|-----|-------|
+| **DGX cluster (production)** | DGX OS / Ubuntu 22.04 LTS | NVIDIA standard, best driver support |
+| **Enterprise K8s (OpenShift)** | RHEL 9 / RHCOS | Red Hat support, compatible with GPU Operator |
+| **Vanilla K8s (on-prem)** | Ubuntu 22.04 LTS + Flatcar (workers) | Broadest community support, Flatcar for a minimal footprint |
+| **Slurm cluster (HPC/AI)** | Rocky Linux 9 / Ubuntu 22.04 LTS | EL ecosystem (Lustre, OFED) or Ubuntu (community) |
+| **Research / rapid prototyping** | Ubuntu 24.04 LTS | Latest CUDA, PyTorch, driver support |
+| **Edge inference** | NVIDIA JetPack / Ubuntu (ARM) | Embedded GPU (Jetson Orin, AGX) |
+
+---
+
+## AI-ready data center checklist
+
+| Area | Requirement |
+|--------|-----------|
+| **Power** | 30–120 kW/rack, HVDC (400 V DC), UPS sized for GPU power spikes |
+| **Cooling** | Liquid cooling ready (direct-to-chip), rear-door for 30+ kW |
+| **Network** | InfiniBand (NDR/XDR) or RoCEv2, rail-optimized fat-tree |
+| **Storage** | Parallel FS (Lustre/Weka), checkpoint bandwidth > 100 GB/s |
+| **GPU density** | Max GPUs/rack, minimal NVSwitch hops |
+| **Physical** | Floor load capacity 1,500+ kg/m², 52U–60U racks |
+| **Security** | Tenant isolation, network segmentation, data encryption |
+| **Monitoring** | DCGM, NCCL health checks, thermals, power capping |
+
+---
+
+## Model and throughput limits
+
+### Model size per GPU
+
+The largest model that fits on a single GPU depends on HBM capacity and precision:
+
+| GPU | HBM | FP32 | FP16/BF16 | INT8 | INT4 |
+|-----|-----|------|-----------|------|------|
+| **H100 80GB** | 80 GB | ~20B | ~40B | ~80B | ~160B |
+| **H200 141GB** | 141 GB | ~35B | ~70B | ~140B | ~280B |
+| **B200 192GB** | 192 GB | ~48B | ~96B | ~192B | ~384B |
+| **MI300X 192GB** | 192 GB | ~48B | ~96B | ~192B | ~384B |
+| **A100 80GB** | 80 GB | ~20B | ~40B | ~80B | ~160B |
+| **GB200 (192+480)** | 192 GB GPU + 480 GB Grace | — | ~96B + CPU offload | — | — |
+
+*Approximate values: 1B parameters ≈ 2 GB FP16 ≈ 4 GB FP32 ≈ 1 GB INT8 ≈ 0.5 GB INT4. In practice, subtract ~10–15 % of HBM for activations, KV cache, and optimizer states.*
+
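The rule of thumb in the note (bytes per parameter by precision) translates directly into a capacity estimate. A minimal sketch, assuming decimal gigabytes and weights only; the names `BYTES_PER_PARAM` and `max_params_billion` are illustrative:

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def max_params_billion(hbm_gb: float, precision: str, reserve_fraction: float = 0.0) -> float:
    """Largest model (in billions of parameters) whose weights fit in HBM.

    `reserve_fraction` is the share of HBM kept free for activations,
    KV cache, and similar runtime buffers.
    """
    if not 0.0 <= reserve_fraction < 1.0:
        raise ValueError("reserve_fraction must be in [0, 1)")
    usable_gb = hbm_gb * (1.0 - reserve_fraction)
    return usable_gb / BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for name, hbm in (("H100", 80), ("H200", 141), ("B200", 192)):
        raw = max_params_billion(hbm, "FP16")
        safe = max_params_billion(hbm, "FP16", reserve_fraction=0.15)
        print(f"{name}: ~{raw:.0f}B raw, ~{safe:.0f}B with 15 % reserved")
```

The reserved variant is the more realistic planning number, since weights never get the whole HBM to themselves.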
+### Inference memory breakdown
+
+| Component | Llama 3 70B (FP16) | Llama 3 8B (FP16) |
+|------------|-------------------|-------------------|
+| Model weights | 140 GB | 16 GB |
+| KV cache (4K context, batch 1) | ~2 GB | ~0.2 GB |
+| KV cache (128K context, batch 1) | ~60 GB | ~6.5 GB |
+| Activations (peak) | ~5 GB | ~1 GB |
+| **Total, 4K ctx** | ~147 GB | ~17 GB |
+| **Total, 128K ctx** | ~205 GB | ~23 GB |
+
+**Conclusion:** Llama 3 70B in FP16 does not fit on a single H100 (80 GB). Options: INT8 (~70 GB of weights → 2× H100 once KV cache and activations are added), INT4 (~35 GB of weights → 1× H100 or H200), or tensor parallelism.
+
+### Context length vs memory
+
+| Context | KV cache 70B (FP16) | KV cache 8B (FP16) | Note |
+|---------|-------------------|-------------------|----------|
+| 4K | ~2.2 GB | ~0.25 GB | Typical chat |
+| 32K | ~18 GB | ~2 GB | Documents |
+| 128K | ~72 GB | ~8 GB | Long-context (Claude, Gemini) |
+| 1M | ~560 GB | ~64 GB | Experimental (Gemini 1.5 Pro) |
+
+KV cache size grows **linearly with context length**, and also linearly with batch size, layer count, and the number of KV heads. For long-context serving it is the critical memory consumer.
+
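A common way to estimate KV cache size is to multiply out its dimensions: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below assumes FP16 values and the publicly documented Llama 3 70B shape (80 layers, 8 KV heads via grouped-query attention, head dimension 128). The result depends strongly on that assumed head configuration, so it will not necessarily reproduce the rounded figures in the table.

```python
def kv_cache_gb(
    context_tokens: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # FP16/BF16
    batch: int = 1,
) -> float:
    """Estimate KV cache size in decimal GB."""
    # Factor 2: one key vector and one value vector per head per token.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens * batch / 1e9


if __name__ == "__main__":
    # Assumed Llama 3 70B shape: 80 layers, 8 KV heads, head_dim 128.
    for ctx in (4_096, 32_768, 131_072):
        size = kv_cache_gb(ctx, n_layers=80, n_kv_heads=8, head_dim=128)
        print(f"{ctx:>7} tokens: {size:.2f} GB")
```

Every factor enters the product once, so doubling the context, the batch, or the number of KV heads doubles the cache.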
+### Inference throughput
+
+| Model | GPU | Precision | Batch size | Tokens/s | QPS (1K output) |
+|-------|-----|-----------|-----------|----------|-----------------|
+| Llama 3 8B | H100 | FP16 | 1 | ~800 | ~0.8 |
+| Llama 3 8B | H100 | FP16 | 128 | ~4,500 | ~4.5 |
+| Llama 3 8B | H100 | INT4 | 128 | ~8,000 | ~8 |
+| Llama 3 70B | 4× H100 | FP16 | 1 | ~180 | ~0.18 |
+| Llama 3 70B | 4× H100 | INT4 | 64 | ~1,200 | ~1.2 |
+| Llama 3 70B | 8× H100 | FP16 (TP=8) | 128 | ~2,500 | ~2.5 |
+| DeepSeek-R1 671B | 8× H200 | FP8 (MoE) | 64 | ~500 | ~0.5 |
+| GPT-4 class (est.) | — | — | — | ~100–300 | ~0.1–0.3 |
+
+**Notes:**
+- QPS (queries per second) depends on output length; here one query is assumed to produce 1K tokens, so QPS ≈ tokens/s ÷ 1,000
+- A larger batch size raises throughput but also raises TTFT (time to first token)
+- Tensor Parallelism (TP) scales, but communication overhead grows roughly linearly with the TP degree
+
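Two quantities are easy to confuse when reading throughput tables: aggregate queries per second and the decode speed each individual request sees. A small sketch, assuming aggregate tokens/s and a fixed output length per query; both function names are made up for illustration:

```python
def queries_per_second(aggregate_tokens_per_s: float, output_tokens_per_query: int = 1000) -> float:
    """Sustained QPS when every query generates a fixed number of tokens."""
    if output_tokens_per_query <= 0:
        raise ValueError("output_tokens_per_query must be positive")
    return aggregate_tokens_per_s / output_tokens_per_query


def per_request_tokens_per_s(aggregate_tokens_per_s: float, batch_size: int) -> float:
    """Decode speed seen by one request when the batch shares the GPU."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return aggregate_tokens_per_s / batch_size


if __name__ == "__main__":
    # Example row: ~4,500 tokens/s aggregate at batch size 128.
    print(f"QPS: {queries_per_second(4500):.1f}")
    print(f"per-request tok/s: {per_request_tokens_per_s(4500, 128):.1f}")
```

Batching raises the first number while lowering the second, which is the throughput-versus-latency trade-off in one line each.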
+### Training limits
+
+#### Scaling efficiency
+
+| GPU count | Model | Efficiency | Reason |
+|-----------|-------|-----------|-------|
+| 8 (1 node) | Llama 3 8B | ~95 % | NVLink intra-node |
+| 64 (8 nodes) | Llama 3 8B | ~85 % | IB inter-node |
+| 512 (64 nodes) | Llama 3 70B | ~75 % | Communication overhead |
+| 4,096 (512 nodes) | Llama 3 70B | ~60 % | Pipeline bubble, network |
+| 16,384 (2,048 nodes) | Llama 3 405B | ~45 % | Synchronous SGD overhead |
+
+**Note:** Efficiency = (measured throughput) / (throughput under ideal linear scaling). In this table it drops roughly logarithmically with GPU count.
+
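The efficiency definition can be written out as a one-line calculation. The sketch below uses made-up throughput numbers purely to show the arithmetic; `scaling_efficiency` is an illustrative name, not a library function.

```python
def scaling_efficiency(measured_throughput: float, single_gpu_throughput: float, n_gpus: int) -> float:
    """Measured throughput divided by ideal linear scaling from one GPU."""
    if n_gpus < 1 or single_gpu_throughput <= 0:
        raise ValueError("n_gpus must be >= 1 and single_gpu_throughput > 0")
    ideal = single_gpu_throughput * n_gpus
    return measured_throughput / ideal


if __name__ == "__main__":
    # Hypothetical: 1,000 samples/s on one GPU, 54,400 samples/s on 64 GPUs.
    eff = scaling_efficiency(54_400, 1_000, 64)
    print(f"{eff:.0%}")
```

Measuring the single-GPU baseline on the same model and batch size matters; otherwise the ratio mixes scaling losses with configuration differences.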
+#### Training memory breakdown
+
+| Component | Llama 3 70B (BF16) | Llama 3 8B (BF16) |
+|------------|-------------------|-------------------|
+| Model weights | 140 GB | 16 GB |
+| Optimizer states (Adam) | 280 GB | 32 GB |
+| Gradients | 140 GB | 16 GB |
+| Activations (peak) | ~30 GB | ~4 GB |
+| **Total (DDP)** | ~590 GB | ~68 GB |
+| **Total (FSDP shard=8)** | ~74 GB | ~8.5 GB |
+
+**Conclusion:** FSDP (Fully Sharded Data Parallel) is required for training models > 10B. With Adam, weights + gradients + optimizer states take roughly 4× the memory of the inference weights alone (140 GB → 560 GB for 70B).
+
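The totals in the breakdown follow from a per-parameter byte count. This sketch mirrors the table's assumptions (BF16 weights and gradients at 2 bytes each, Adam's two moments at 2 bytes each) and its simplification of dividing the whole total, activations included, by the shard count. Mixed-precision setups that keep FP32 master weights and FP32 moments need considerably more. The function name is illustrative.

```python
def training_memory_gb(
    params_billion: float,
    bytes_weights: float = 2.0,    # BF16 weights
    bytes_grads: float = 2.0,      # BF16 gradients
    bytes_optimizer: float = 4.0,  # Adam: two moments at 2 bytes each (table's assumption)
    activations_gb: float = 0.0,
    shards: int = 1,
) -> float:
    """Per-GPU training memory in GB under even sharding of the total."""
    if shards < 1:
        raise ValueError("shards must be >= 1")
    state_gb = params_billion * (bytes_weights + bytes_grads + bytes_optimizer)
    # Simplification copied from the table: activations are divided too.
    return (state_gb + activations_gb) / shards


if __name__ == "__main__":
    print(training_memory_gb(70, activations_gb=30))            # unsharded (DDP)
    print(training_memory_gb(70, activations_gb=30, shards=8))  # FSDP across 8 GPUs
```

Swapping `bytes_optimizer` to 12 (FP32 master weights plus two FP32 moments) shows how quickly the unsharded footprint grows under a stricter mixed-precision recipe.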
+#### Time to train
+
+| Model | GPU count | GPU type | Precision | Time | Cost (on-prem estimate) |
+|-------|-----------|---------|-----------|------|---------------------|
+| Llama 3 8B | 64 | H100 | BF16 | ~3 days | ~$5,000 |
+| Llama 3 70B | 512 | H100 | BF16 | ~14 days | ~$100,000 |
+| Llama 3 405B | 16,384 | H100 | BF16 | ~60 days | ~$14 M |
+| DeepSeek-R1 671B (MoE) | 2,048 | H800 | BF16 | ~30 days | ~$6 M |
+| GPT-4 (est.) | 25,000 | A100/H100 | Mixed | ~90–100 days | ~$100 M |
+
+### Power and thermal limits
+
+| Configuration | TDP limit | Throughput loss | Reason |
+|-------------|-----------|------------------|-------|
+| H100 SXM | 700 W (default) | 0 % | Nominal |
+| H100 SXM | 600 W (-15 %) | ~5–8 % | Power capping |
+| H100 SXM | 500 W (-30 %) | ~15–25 % | Significant throttling |
+| H100 SXM | 400 W (-43 %) | ~30–50 % | Emergency only |
+| DGX H100 (8×) | 5.6 kW (max) | 0 % | Liquid cooling required |
+| DGX H100 (8×) | 4.5 kW (air) | ~10–15 % | Rear-door heat exchanger |
+
+The GPU throttles when it exceeds its power limit or temperature threshold (85 °C+). Power capping reduces clock frequency roughly linearly, but the resulting throughput loss is nonlinear.
+
+### API and operational limits
+
+| Limit | Description | Typical value |
+|-------|-------|-----------------|
+| **Rate limit** | Max requests per minute/hour | 100–10,000 RPM (by tier) |
+| **Tokens per minute (TPM)** | Max tokens per minute | 1M–300M (by model) |
+| **Context window** | Max input tokens | 4K–2M (by model) |
+| **Max output tokens** | Max generated tokens | 4K–32K (by model) |
+| **Concurrent requests** | Number of parallel requests | 10–10,000 (by backend) |
+| **Batch window** | Time spent collecting a batch | 0–20 s (vLLM, TGI) |
+| **TTFT timeout** | Max latency to the first token | 30–120 s |
+| **Idle timeout** | GPU idle → scale to 0 | 5–15 min (cloud) |
+
+### Limits per deployment model
+
+| Aspect | Self-hosted HW | Managed cloud (SageMaker, Vertex) | API (OpenAI, Anthropic) |
+|-------|--------------|----------------------------------|------------------------|
+| **Model size** | Limited by HBM (max 192 GB/GPU) | Unlimited (cluster scaling) | Unlimited |
+| **Queries** | Limited by GPU count | Auto-scaling | Rate limit (by tier) |
+| **Latency** | < 10 ms (same node) | 10–100 ms (network hop) | 100 ms – 10 s |
+| **Customization** | Full (fine-tuning, quantization) | Managed (SageMaker, Bedrock) | Prompt engineering only |
+| **Data privacy** | Yes (on-prem) | Contractual (region, encryption) | Limited |
+| **Cost per 1M tokens** | ~$0.10–0.50 (FP16 inference) | ~$0.20–1.00 | ~$0.15–15.00 |
+| **Max context** | 128K+ (depends on GPU count) | 128K+ | 32K–2M |
+| **Cold start** | 0 (always-on) | 30 s – 5 min | 0 (shared infra) |
+
+---
+
+## GPU prices and price/performance (2026)
+
+> Prices are indicative: NVIDIA does not publish an official price list for data center GPUs. Cloud prices are taken from public providers (Q2 2026). Hardware purchase prices vary by volume, reseller, and region.
+
+### Purchase price (buy)
+
+| GPU | Price/GPU | Price of 8× GPU baseboard | $/PFLOPS (FP16) | Note |
+|-----|---------|----------------------|----------------|----------|
+| **H100 SXM** | $27,000–40,000 | ~$200,000 | $25,000 | Scarcity in 2023–2024, now stabilizing |
+| **H200 SXM** | $35,000–50,000 | ~$280,000 | ~$35,000 | H100 upgrade, HBM3e |
+| **B200** | ~$60,000–70,000 | ~$500,000+ | ~$31,000 | Blackwell, FP4 support |
+| **B100** | ~$30,000 | ~$240,000 | ~$20,000 | Cheaper than B200, similar FP8 performance |
+| **GB200** (Grace+Blackwell) | ~$70,000–100,000 | ~$2,000,000 (rack) | — | CPU+GPU unified, high-density |
+| **A100 80GB** | ~$10,000–15,000 | ~$120,000 | ~$19,200 | Previous generation, still relevant |
+| **MI300X** | ~$12,000–18,000 | ~$100,000 | ~$9,600 | AMD, 192 GB HBM3 |
+| **Gaudi 3** | ~$15,625 | ~$125,000 | **$8,515** | Intel, best $/PFLOPS |
+| **L40S** | ~$8,000–10,000 | — | — | Inference, enterprise |
+
+### Cloud prices (on-demand $/GPU/hr)
+
+| GPU | Cheapest | Mid-range (CoreWeave, Lambda) | Hyperscaler (AWS, GCP, Azure) |
+|-----|--------------|-------------------------------|-------------------------------|
+| **H100 SXM** | $1.38 (Thunder) | $2.89–3.29 | $4.15–6.88 |
+| **H100 PCIe** | $2.01 (Spheron) | $2.50 | — |
+| **H200 SXM** | $3.89 (Spheron) | $4.54 | $5.00+ |
+| **B200** | **$3.39** (Spheron) | $6.02 | $14.24 (AWS) |
+| **B200** (spot) | **$2.12** (Spheron, spot) | — | — |
+| **GB200** | $3.50 (Runcrate) | $5.85 (Oracle) | $6.95 (GCP) |
+| **MI300X** | **$1.50** (TensorWave) | $1.85 (Vultr) | $7.86 (Azure) |
+| **A100 80GB** | $1.07 (Spheron) | $1.50–2.00 | $3.00+ |
+| **Gaudi 3** | ~$1.50–2.50 | — | — |
+| **L40S** | $0.91 (Spheron) | $1.50–2.00 | — |
+
+### Inference cost ($/M tokens)
+
+| GPU | Provider | $/hr | Est. tok/s | $/M tok |
+|-----|----------|------|-----------|--------|
+| **B200** | Spheron | $3.39 | ~4,000 | **$0.24** |
+| **B200** (spot) | Spheron | $2.12 | ~4,000 | **$0.15** |
+| **H100 PCIe** | Spheron | $2.01 | ~1,200 | $0.47 |
+| **A100 80GB** | Spheron | $1.07 | ~520 | $0.57 |
+| **H100 SXM** | AWS | $6.88 | ~1,200 | $1.59 |
+| **H200 SXM** | Spheron | $4.54 | ~1,800 | $0.70 |
+| **L40S** | Spheron | $0.91 | ~450 | $0.56 |
+
+*Values are for Llama 3 70B (INT8, batch=1, 1K-token output). Real values vary with batch size, context, and quantization.*
+
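The $/M tok column is the hourly price divided by the number of tokens generated in an hour. A minimal sketch of that conversion, using the hourly prices and estimated throughputs from two rows of the table as inputs; the function name is illustrative:

```python
def usd_per_million_tokens(usd_per_hour: float, tokens_per_second: float) -> float:
    """Convert an hourly GPU price and sustained throughput into $/M tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    rows = [
        ("H100 SXM (AWS)", 6.88, 1200),
        ("A100 80GB (Spheron)", 1.07, 520),
    ]
    for name, price, tok_s in rows:
        print(f"{name}: ${usd_per_million_tokens(price, tok_s):.2f}/M tok")
```

Because throughput sits in the denominator, the estimate is only as good as the tok/s figure; batching changes it by an order of magnitude.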
+### Price per GB of HBM
+
+| GPU | HBM | Cloud price/hr | $/GB/hr | Suitability for memory-bound workloads |
+|-----|-----|-------------|--------|-----------------------------------|
+| **MI300X** | 192 GB | $1.50 | **$0.0078** | ✅ Best |
+| **B200** | 192 GB | $3.39 | $0.0177 | ✅ Good |
+| **H200** | 141 GB | $3.89 | $0.0276 | ⚠️ |
+| **H100 SXM** | 80 GB | $1.38 | $0.0173 | ⚠️ Only up to 70B models |
+| **GB200** | 384 GB | $3.50 | $0.0091 | ✅✅ (2× MI300X capacity) |
+
+### Price/performance by scenario
+
+| Scenario | Winner | Rationale |
+|--------|-------|-------|
+| **Absolute performance** (price is no object) | **GB200 DGX NVL72** | 72× GPU, 18 PFLOPS FP8, 384 GB HBM/GPU |
+| **Cloud inference** — best $/token | **B200 spot** | $0.15/M tok; about 3.3× the H100 throughput at a lower cost per token |
+| **Cloud inference** — on-demand | **B200** | $0.24/M tok (= $3.39/hr ÷ 4,000 tok/s) |
+| **Cloud inference** — budget | **A100 / L40S** | $0.56–0.57/M tok |
+| **Training** — price/performance when buying | **Gaudi 3** | $8,515/PFLOPS, 2.5–3× better than H100 |
+| **Training** — cloud | **H100 SXM** | $1.38/hr, CUDA ecosystem, NCCL |
+| **Memory-bound** — long context, 70B+ | **MI300X / GB200** | 192–384 GB, $0.0078–0.0091/GB |
+| **Ecosystem + safe choice** | **H100/H200** | CUDA, broadest SW support, NVIDIA tools |
+| **Spot / preemptible** — lowest price | **A100 / H100** | $1.07–1.38/hr, 50–90 % discount versus on-demand |
+
+### 2026 trends
+
+- **H100** — price fell 64 % from the $8/hr peak to $1.38–2.89/hr, then rebounded by 40 % on the inference boom
+- **B200** — the new high end, $3.39/hr in the cloud → ~$0.15/M tok on spot, the benchmark for inference
+- **MI300X** — supply is growing (TensorWave, Vultr, CoreWeave, Oracle, Azure), prices from $1.50/hr
+- **Gaudi 3** — best $/PFLOPS when buying, but a narrow ecosystem and limited cloud availability
+- **The market has bifurcated** — older generations (H100, A100) are commoditizing, newer ones (B200, GB200) hold a premium
+
+## Related
+
+- [GPU.md](GPU.md) — GPU architecture, NVIDIA/AMD, vGPU, MIG
+- [NETWORKING.md](NETWORKING.md) — InfiniBand, RoCE, network topology
+- [STORAGE.md](STORAGE.md) — parallel filesystem, object store
+- [DATACENTERS.md](DATACENTERS.md) — DC layout, power, cooling
+- [CLOUD.md](CLOUD.md) — cloud AI services (SageMaker, Vertex AI)
+
+## Sources
+
+Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
+
+*Last revision: 2026-06-18*
--- a/BIG-DATA.en.md
+++ b/BIG-DATA.en.md
@@ -0,0 +1,232 @@
+# 🗄️ Big Data — ecosystem, architecture, tools
+
+## Overview
+
+The Big Data ecosystem in 2026: "Hadoop is dead, and yet it's everywhere." HDFS has shrunk, MapReduce is effectively gone, the Cloudera/Hortonworks era is over. But YARN lives on, the Hive Metastore has changed clothes into Iceberg/Delta, and the lakehouse pattern (cheap object storage + table format + distributed engine) is the inheritance Hadoop left behind.
+
+The modern Big Data stack has 8 layers:
+
+1. **Storage** — HDFS, S3, GCS, ABFS, MinIO
+2. **Table format** — Apache Iceberg, Delta Lake, Apache Hudi, Apache Paimon
+3. **Catalog** — Hive Metastore, Unity Catalog, Polaris, Nessie, AWS Glue
+4. **Batch processing** — Apache Spark, Trino-on-Spark, Dremio
+5. **Stream processing** — Apache Flink, Spark Structured Streaming, Kafka Streams
+6. **Distributed SQL** — Trino, Presto, StarRocks, ClickHouse
+7. **Transformation** — dbt, SQLMesh
+8. **Orchestration** — Apache Airflow 3.0, Dagster, Prefect, Kestra
+
+---
+
+## Storage
+
+### HDFS (Hadoop Distributed File System)
+
+| Feature | Detail |
+|---------|--------|
+| **Architecture** | Master/worker: NameNode (metadata) + DataNode (data) |
+| **Replication** | Default 3×, configurable (rack-aware) |
+| **Block size** | Default 128 MB (range 64 MB – 256 MB) |
+| **Limits** | NameNode memory ~ 1 GB / 1 million blocks; ~1000 DataNodes per cluster |
+| **Use case** | On-prem clusters, sequential read/write, large files |
+| **Status 2026** | Declining — most projects migrate to object storage (S3, GCS, MinIO) |
+
+HDFS remains relevant for on-prem environments where object storage is unavailable, or for specific use cases (YARN clusters, Spark shuffle). For new projects, object storage is recommended.
+
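The NameNode limit in the table (about 1 GB of heap per million blocks) gives a quick lower bound for sizing. The sketch below assumes every file fills whole 128 MB blocks and uses binary units for the TB-to-MB conversion; many small files would push the real block count, and therefore the heap, well above this estimate. Function names are illustrative.

```python
import math


def namenode_heap_gb(data_tb: float, block_mb: int = 128, gb_per_million_blocks: float = 1.0) -> float:
    """Lower-bound NameNode heap estimate for a given logical data size."""
    if block_mb <= 0:
        raise ValueError("block_mb must be positive")
    blocks = math.ceil(data_tb * 1024 * 1024 / block_mb)  # TB -> MB -> blocks
    return blocks / 1_000_000 * gb_per_million_blocks


def raw_capacity_tb(data_tb: float, replication: int = 3) -> float:
    """Raw disk capacity needed once replication is applied."""
    if replication < 1:
        raise ValueError("replication must be >= 1")
    return data_tb * replication


if __name__ == "__main__":
    for size_tb in (100, 1024):
        heap = namenode_heap_gb(size_tb)
        raw = raw_capacity_tb(size_tb)
        print(f"{size_tb} TB logical: ~{heap:.2f} GB heap, {raw:.0f} TB raw at 3× replication")
```

Replication multiplies disk usage but not the block count the NameNode tracks per file, so heap and raw capacity scale independently.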
+### Object storage as Data Lake
+
+| Platform | Service | Use case |
+|----------|--------|----------|
+| **AWS** | S3 | Primary data lake, Iceberg/Delta on S3 |
+| **Azure** | ADLS Gen2 / Blob | Data lake for Azure ecosystem |
+| **GCP** | GCS | Data lake for GCP (Dataproc, BigQuery) |
+| **On-prem** | MinIO | S3-compatible object storage on own HW |
+
+### HDFS capacity planning
+
+| Data size | Configuration |
+|-----------|-------------|
+| **< 100 TB** | 3–5 DataNodes, 10 GbE, replication 3× |
+| **100 TB – 1 PB** | 5–20 DataNodes, 25/100 GbE, rack-aware, NameNode HA |
+| **1 PB+** | 20+ DataNodes, 100 GbE, Federation (multiple NameNodes) |
+
+---
+
+## Open Table Formats
+
+Table formats bring ACID transactions, schema evolution, and time travel to data lake object storage.
+
+| Format | Organization | Engine compatibility | Streaming | Catalog |
+|--------|-------------|---------------------|-----------|---------|
+| **Apache Iceberg** | Apache Software Foundation | Spark, Flink, Trino, Dremio, Athena, Snowflake | Flink sink, snapshot-based | REST catalog, Polaris, Glue, Hive |
+| **Delta Lake** | Linux Foundation (Databricks) | Spark (native), Trino, Flink (limited), Athena | Spark Streaming, DLT | Unity Catalog (open-sourced in 2024), Hive |
+| **Apache Hudi** | Apache Software Foundation | Spark, Flink, Trino (connector) | Built-in CDC, incremental | Hive, Glue (limited) |
+| **Apache Paimon** | Apache Software Foundation | Flink (native), Spark | LSM-tree, changelog mode | Hive, REST |
+
+**Recommendation 2026:**
+- **Iceberg** — broadest multi-engine support, vendor-neutral, open catalog (Polaris)
+- **Delta Lake** — best for Spark/Databricks ecosystem, UniForm for cross-format reads
+- **Hudi** — losing momentum, only if already in production
+- **Paimon** — emerging, Flink-native, LSM architecture
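+
+To make "time travel" concrete, here is a toy in-memory model of the snapshot idea that these formats share. It is not the API of Iceberg, Delta Lake, Hudi, or Paimon; the class and method names (`ToyTable`, `commit`, `read`) are hypothetical. Each commit publishes a new immutable snapshot listing the data files that make up the table, and a reader chooses which snapshot to see.
+
+```python
+from dataclasses import dataclass, field
+
+
+@dataclass
+class ToyTable:
+    # Snapshot 0 is the empty table; each snapshot is an immutable tuple of file names.
+    snapshots: list = field(default_factory=lambda: [()])
+
+    def commit(self, add=(), remove=()):
+        """Publish a new snapshot derived from the current one; return its id."""
+        current = self.snapshots[-1]
+        new = tuple(f for f in current if f not in remove) + tuple(add)
+        self.snapshots.append(new)  # one append stands in for the atomic metadata swap
+        return len(self.snapshots) - 1
+
+    def read(self, snapshot_id=None):
+        """Return the latest snapshot, or an older one (time travel)."""
+        return self.snapshots[-1 if snapshot_id is None else snapshot_id]
+
+
+table = ToyTable()
+s1 = table.commit(add=["a.parquet", "b.parquet"])
+s2 = table.commit(add=["c.parquet"], remove=["a.parquet"])
+print(table.read())    # ('b.parquet', 'c.parquet')
+print(table.read(s1))  # ('a.parquet', 'b.parquet')
+```
+
+Old snapshots stay readable because data files are never modified in place. Real formats add manifests, statistics, and conflict detection on top of this idea.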
+
+---
+
+## Processing Engines
+
+### Apache Spark
+
+The dominant batch-processing engine, also used as a unified engine (batch + streaming + SQL + ML).
+
+| Feature | Detail |
+|---------|--------|
+| **Version 2026** | Spark 4.x (4.1.0), native Kubernetes support, Structured Streaming, Delta Lake integration |
+| **API** | Scala, Java, Python (PySpark), SQL, R (SparkR) |
+| **Batch** | DataFrame/Dataset, RDD, SQL queries; up to 10–100× faster than MapReduce for in-memory workloads |
+| **Streaming** | Structured Streaming (micro-batch), latency ~100 ms – 5 s |
+| **SQL** | Spark SQL, ANSI SQL, Hive compatible |
+| **ML** | MLlib, SparkML, MLflow integration |
+| **Scheduler** | YARN, Kubernetes (production-ready since Spark 3.x), standalone |
+| **Fault tolerance** | RDD lineage, checkpointing |
+
+**When to use Spark:**
+- Batch ETL/ELT pipelines
+- Unified engine for batch + streaming (team preference)
+- Machine learning pipelines (MLlib, SparkML)
+- SQL analytics on large datasets
+
+### Apache Flink
+
+Low-latency engine for true streaming (per-event processing).
+
+| Feature | Detail |
+|---------|--------|
+| **Version 2026** | Flink 2.x (streaming-first, batch as bounded stream) |
+| **API** | DataStream API, Table/SQL API, ProcessFunction (low-level) |
+| **Latency** | < 100 ms (true streaming, Chandy-Lamport checkpointing) |
+| **State management** | Managed state (ValueState, ListState, MapState), RocksDB backend |
+| **Event time** | Native, watermarks, out-of-order handling |
+| **Batch** | Batch as bounded stream (same runtime) |
+| **Deployment** | YARN, Kubernetes, standalone |
+| **Economics** | Higher memory requirements (managed state), requires careful tuning |
+
+**When to use Flink:**
+- Fraud detection, real-time bidding, IoT (< 100 ms latency)
+- Complex stateful stream processing
+- CDC pipelines
+- Event-driven architectures
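+
+Event time, watermarks, and out-of-order handling from the table above can be illustrated without a cluster. The following is a toy single-process sketch, not Flink's API; the function name and parameters are hypothetical. It assumes the watermark is the highest event time seen so far minus an allowed out-of-orderness, that a tumbling window fires once the watermark passes its end, and that events arriving for an already fired window are dropped.
+
+```python
+from collections import defaultdict
+
+
+def tumbling_counts(events, window=10, max_out_of_order=5):
+    """Count events per event-time window; emit a window when the watermark passes its end.
+
+    events: integer event timestamps (seconds) in arrival order.
+    Returns a list of (window_start, count) in emission order.
+    """
+    open_windows = defaultdict(int)
+    emitted = []
+    watermark = float("-inf")
+    for ts in events:
+        start = ts - ts % window
+        if start + window <= watermark:
+            continue  # too late: this window has already fired
+        open_windows[start] += 1
+        watermark = max(watermark, ts - max_out_of_order)
+        for w in sorted(open_windows):  # sorted() copies the keys, so pop() is safe
+            if w + window <= watermark:
+                emitted.append((w, open_windows.pop(w)))
+    return emitted
+
+
+print(tumbling_counts([1, 4, 12, 9, 16, 3, 27]))  # [(0, 3), (10, 2)]
+```
+
+The event with timestamp 9 arrives after 12 but is still counted in window 0, because the watermark has not yet reached 10. The event with timestamp 3 arrives after that window has fired and is dropped. The window starting at 20 never fires here because the sketch has no end-of-stream flush.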
+
+### Trino (ex PrestoSQL)
+
+Distributed SQL query engine — federated queries across various sources.
+
+| Feature | Detail |
+|---------|--------|
+| **Architecture** | Coordinator + Workers (no storage of its own; no external resource manager such as YARN required) |
+| **Connectors** | Iceberg, Delta, Hive, HDFS, S3, GCS, ADLS, PostgreSQL, MySQL, Kafka, Elasticsearch |
+| **Use case** | Interactive SQL, federated queries, lakehouse queries |
+| **Version 2026** | Trino 470+, Iceberg native, Delta Lake connector |
+
+---
+
+## Spark vs Flink vs Trino comparison
+
+| Criteria | Spark | Flink | Trino |
+|----------|-------|-------|-------|
+| **Primary use case** | Batch + unifying | True streaming | Interactive SQL |
+| **Streaming latency** | 100 ms – 5 s (micro-batch) | < 100 ms (true streaming) | N/A |
+| **Throughput** | High (batch-optimized) | High (pipeline-optimized) | Medium (ad-hoc) |
+| **State management** | State store (executor-local, checkpointed to external storage) | Managed state (embedded) | N/A |
+| **SQL support** | Spark SQL | Flink SQL | ANSI SQL (broadest) |
+| **ML/AI** | MLlib, SparkML | — | — |
+| **Kubernetes** | Native (production) | Native (production) | Native (production) |
+| **Learning curve** | Medium | High | Low |
+| **Operational complexity** | Medium | High | Medium |
+
+---
+
+## Orchestration
+
+| Tool | Version 2026 | Use case |
+|------|-------------|----------|
+| **Apache Airflow** | 3.0+ (TaskFlow API, dynamic tasks, deferrable operators) | Universal orchestration, largest ecosystem |
+| **Dagster** | 1.x (asset-oriented, software-defined assets) | Data pipelines, observability, asset lineage |
+| **Prefect** | 3.x (native async, workers, blocks) | Python-native, serverless workers |
+| **Kestra** | 1.x (YAML-native, declarative) | Event-driven orchestration |
+| **Apache NiFi** | 2.x (flow-based, visual) | Data ingestion, CDC, streaming |
+
+---
+
+## Lakehouse architecture
+
+Lakehouse combines data lake flexibility (object storage) with data warehouse performance and governance.
+
+```
+┌──────────────────────────────────────────────────────┐
+│                    Query Engines                      │
+│   Trino    Spark SQL    Flink SQL    Dremio    Athena  │
+└─────────────────────────┬────────────────────────────┘
+                          │
+┌─────────────────────────▼────────────────────────────┐
+│                  Table Format Layer                    │
+│         Apache Iceberg / Delta Lake / Hudi            │
+│       (ACID, time travel, schema evolution)           │
+└─────────────────────────┬────────────────────────────┘
+                          │
+┌─────────────────────────▼────────────────────────────┐
+│                    Storage Layer                       │
+│           S3 / GCS / ADLS / MinIO / HDFS              │
+│               (Parquet / ORC / Avro)                  │
+└──────────────────────────────────────────────────────┘
+```
+
+For Iceberg details see [DATABASES.en.md — Apache Iceberg Lakehouse](DATABASES.en.md#apache-iceberg-lakehouse).
+
+---
+
+## Big Data Infrastructure
+
+### Cluster sizing
+
+| Component | Spark (batch) | Flink (streaming) | Trino (SQL) |
+|-----------|--------------|-------------------|-------------|
+| **CPU** | 16–64 cores/node | 16–32 cores/node | 8–32 cores/node |
+| **RAM** | 64–256 GB/node | 64–256 GB/node (incl. managed state) | 64–256 GB/node |
+| **Storage** | HDFS / object storage | Object storage (checkpoints) | None (stateless) |
+| **Network** | 25–100 GbE (shuffle-heavy) | 25–100 GbE (checkpointing) | 25–100 GbE |
+| **Disk** | NVMe (scratch, shuffle) | NVMe (RocksDB state backend) | — |
+| **Cluster size** | 5–200+ nodes | 3–100+ nodes | 5–50 nodes |
+
+### Network considerations
+
+- **Spark shuffle** — heavy network traffic between nodes; recommend 25–100 GbE, ideally no oversubscription
+- **Flink checkpointing** — periodic state writes to object storage; requires stable latency
+- **HDFS rack awareness** — optimizes replication across racks
+- **Data locality** — HDFS: local disk reads; object storage: network-bound
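+
+The effect of oversubscription on shuffle can be estimated with simple arithmetic. The sketch below is a lower bound only, under stated assumptions: data is spread evenly across nodes, every node exchanges data with every other node, and disk, CPU, and protocol overhead are ignored. The function name and parameters are hypothetical.
+
+```python
+def shuffle_seconds(shuffle_tb, nodes, nic_gbps, oversubscription=1.0):
+    """Lower bound on all-to-all shuffle time in seconds."""
+    per_node_bytes = shuffle_tb * 1e12 / nodes   # each node holds an equal share
+    remote_fraction = (nodes - 1) / nodes        # the remainder stays on the node
+    # Usable bytes per second per node after dividing by the oversubscription ratio.
+    effective_bytes_per_s = nic_gbps * 1e9 / 8 / oversubscription
+    return per_node_bytes * remote_fraction / effective_bytes_per_s
+
+
+for ratio in (1.0, 3.0):
+    seconds = shuffle_seconds(shuffle_tb=10, nodes=20, nic_gbps=25, oversubscription=ratio)
+    print(f"oversubscription {ratio}:1 -> at least {seconds:.0f} s")
+```
+
+With 10 TB of shuffle data on 20 nodes at 25 GbE, a 3:1 oversubscribed fabric triples the bound from about 152 s to about 456 s, which is why a non-blocking fabric is preferred for shuffle-heavy jobs.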
+
+### Kubernetes vs YARN
+
+| Criteria | YARN | Kubernetes |
+|----------|------|-----------|
+| **Resource isolation** | Cgroups (YARN containers) | Cgroups + namespaces (pods) |
+| **Ecosystem fit** | Hadoop-native (HDFS, Hive, Spark) | Cloud-native, Spark, Flink, Trino |
+| **Operational complexity** | Lower (single cluster manager) | Higher (requires K8s cluster) |
+| **Multi-tenant isolation** | YARN queues (Capacity/Fair Scheduler) | Namespaces, ResourceQuotas, LimitRanges |
+| **Stateful workloads** | Limited | StatefulSets, PVC, Operators |
+| **2026 trend** | Legacy (declining) | Standard for new projects |
+
+---
+
+## Cloud deployment
+
+| Cloud | Batch processing | Streaming | SQL | Managed K8s |
+|-------|-----------------|-----------|-----|-------------|
+| **AWS** | EMR (Spark, Hive, Flink) | Kinesis, MSK (Kafka), EMR Flink | Athena (Trino), Redshift | EKS |
+| **Azure** | HDInsight (Spark, Hive), Synapse | Event Hubs, HDInsight Flink | Synapse SQL, Azure Data Explorer | AKS |
+| **GCP** | Dataproc (Spark, Flink, Hive, Trino) | Pub/Sub, Dataflow (Beam), Dataproc Flink | BigQuery | GKE |
+
+---
+
+## Sources
+
+Links, books and standards: [sources/infrastructure/sources.en.md](sources/infrastructure/sources.en.md)
+
+*Last revision: 2026-06-18*
--- a/BIG-DATA.md
+++ b/BIG-DATA.md
@@ -0,0 +1,232 @@
+# 🗄️ Big Data — ekosystém, architektura, nástroje
+
+## Přehled
+
+Big Data ekosystém v roce 2026: "Hadoop je mrtvý, a přitom je všude." HDFS se zmenšil, MapReduce je fakticky mrtvý, Cloudera/Hortonworks éra skončila. Ale YARN žije, Hive Metastore se převlékl do Iceberg/Delta a lakehouse pattern (levné object storage + tabulkový formát + distribuovaný engine) je dědictví, které Hadoop zanechal.
+
+Moderní Big Data stack má 8 vrstev:
+
+1. **Storage** — HDFS, S3, GCS, ABFS, MinIO
+2. **Tabulkový formát** — Apache Iceberg, Delta Lake, Apache Hudi, Apache Paimon
+3. **Catalog** — Hive Metastore, Unity Catalog, Polaris, Nessie, AWS Glue
+4. **Dávkové zpracování** — Apache Spark, Trino-on-Spark, Dremio
+5. **Streamové zpracování** — Apache Flink, Spark Structured Streaming, Kafka Streams
+6. **Distribuované SQL** — Trino, Presto, StarRocks, ClickHouse
+7. **Transformace** — dbt, SQLMesh
+8. **Orchestrace** — Apache Airflow 3.0, Dagster, Prefect, Kestra
+
+---
+
+## Úložiště (Storage)
+
+### HDFS (Hadoop Distributed File System)
+
+| Vlastnost | Detail |
+|-----------|--------|
+| **Architektura** | Master/worker: NameNode (metadata) + DataNode (data) |
+| **Replikace** | Výchozí 3×, konfigurovatelná (rack-aware) |
+| **Block size** | Výchozí 128 MB (lze 64 MB – 256 MB) |
+| **Limity** | NameNode memory ~ 1 GB / 1 milion bloků; ~1000 DataNode v clusteru |
+| **Use case** | On-prem clustery, sekvenční čtení/zápis, velké soubory |
+| **Stav 2026** | Klesající podíl — většina migruje na object storage (S3, GCS, MinIO) |
+
+HDFS je stále relevantní pro on-prem prostředí, kde object storage není dostupná, nebo pro specifické use case (YARN cluster, Spark shuffle). Pro nové projekty se doporučuje object storage.
+
+### Object storage jako Data Lake
+
+| Platforma | Služba | Use case |
+|-----------|--------|----------|
+| **AWS** | S3 | Hlavní data lake, Iceberg/Delta na S3 |
+| **Azure** | ADLS Gen2 / Blob | Data lake pro Azure ekosystém |
+| **GCP** | GCS | Data lake pro GCP (Dataproc, BigQuery) |
+| **On-prem** | MinIO | S3-kompatibilní object storage na vlastním HW |
+
+### Kapacitní plánování HDFS
+
+| Velikost dat | Konfigurace |
+|-------------|------------|
+| **< 100 TB** | 3–5 DataNode, 10 GbE, replication 3× |
+| **100 TB – 1 PB** | 5–20 DataNode, 25/100 GbE, rack-aware, NameNode HA |
+| **1 PB+** | 20+ DataNode, 100 GbE, Federation (více NameNode) |
+
+---
+
+## Tabulkové formáty (Open Table Formats)
+
+Tabulkové formáty přináší ACID transakce, schema evolution a time travel do data lake objektového úložiště.
+
+| Formát | Organizace | Engine kompatibilita | Streaming | Katalog |
+|--------|-----------|---------------------|-----------|---------|
+| **Apache Iceberg** | Apache Foundation | Spark, Flink, Trino, Dremio, Athena, Snowflake | Flink sink, snapshot-based | REST catalog, Polaris, Glue, Hive |
+| **Delta Lake** | Linux Foundation (Databricks) | Spark (native), Trino, Flink (limited), Athena | Spark Streaming, DLT | Unity Catalog (proprietary), Hive |
+| **Apache Hudi** | Apache Foundation | Spark, Flink, Trino (connector) | Built-in CDC, incremental | Hive, Glue (limited) |
+| **Apache Paimon** | Apache Foundation | Flink (native), Spark | LSM-tree, changelog mode | Hive, REST |
+
+**Doporučení 2026:**
+- **Iceberg** — nejširší multi-engine podpora, vendor-neutral, otevřený katalog (Polaris)
+- **Delta Lake** — nejlepší pro Spark/Databricks ekosystém, UniForm pro cross-format čtení
+- **Hudi** — ztrácí momentum, jen pokud již v produkci
+- **Paimon** — emerging, Flink-native, LSM architektura
+
+---
+
+## Zpracování (Processing Engines)
+
+### Apache Spark
+
+Dominantní engine pro dávkové zpracování a unifying engine (batch + streaming + SQL + ML).
+
+| Vlastnost | Detail |
+|-----------|--------|
+| **Verze 2026** | Spark 4.x (4.1.0), native Kubernetes support, Structured Streaming, Delta Lake integrace |
+| **API** | Scala, Java, Python (PySpark), SQL, R (SparkR) |
+| **Batch** | DataFrame/Dataset, RDD, SQL queries — 10–100× rychlejší než MapReduce |
+| **Streaming** | Structured Streaming (micro-batch), latence ~100 ms – 5 s |
+| **SQL** | Spark SQL, ANSI SQL, kompatibilita s Hive |
+| **ML** | MLlib, SparkML, integrace s MLflow |
+| **Scheduler** | YARN, Kubernetes (production-ready od Spark 3.x), standalone |
+| **Fault tolerance** | RDD lineage, checkpointing |
+
+**Kdy použít Spark:**
+- Dávkové ETL/ELT pipelines
+- Jednotný engine pro batch + streaming (team preference)
+- Machine learning pipelines (MLlib, SparkML)
+- SQL analytika na velkých datech
+
+### Apache Flink
+
+Nejvýkonnější engine pro true streaming (per-event zpracování).
+
+| Vlastnost | Detail |
+|-----------|--------|
+| **Verze 2026** | Flink 2.x (streaming-first, batch jako speciální případ streamu) |
+| **API** | DataStream API, Table/SQL API, ProcessFunction (low-level) |
+| **Latence** | < 100 ms (true streaming, Chandy-Lamport checkpointing) |
+| **State management** | Managed state (ValueState, ListState, MapState), RocksDB backend |
+| **Event time** | Nativní, watermarky, out-of-order handling |
+| **Batch** | Batch jako bounded stream (stejný runtime) |
+| **Deployment** | YARN, Kubernetes, standalone |
+| **Ekonomika** | Vyšší paměťové nároky (managed state), nutnost pečlivého tuningu |
+
+**Kdy použít Flink:**
+- Fraud detection, real-time bidding, IoT (< 100 ms latence)
+- Komplexní stateful stream processing
+- CDC pipelines
+- Event-driven architektury
+
+### Trino (ex PrestoSQL)
+
+Distribuovaný SQL query engine — federované dotazy napříč různými zdroji.
+
+| Vlastnost | Detail |
+|-----------|--------|
+| **Architektura** | Coordinator + Worker (bez storage, bez scheduleru) |
+| **Konektory** | Iceberg, Delta, Hive, HDFS, S3, GCS, ADLS, PostgreSQL, MySQL, Kafka, Elasticsearch |
+| **Use case** | Interactive SQL, federované dotazy, lakehouse queries |
+| **Verze 2026** | Trino 470+, Iceberg native, Delta Lake connector |
+
+---
+
+## Srovnání Spark vs Flink vs Trino
+
+| Kritérium | Spark | Flink | Trino |
+|-----------|-------|-------|-------|
+| **Primární use case** | Batch + unifying | True streaming | Interactive SQL |
+| **Latence streaming** | 100 ms – 5 s (micro-batch) | < 100 ms (true streaming) | N/A |
+| **Throughput** | Vysoký (batch optimalizace) | Vysoký (pipeline optimalizace) | Střední (ad-hoc) |
+| **State management** | State store (external) | Managed state (embedded) | N/A |
+| **SQL support** | Spark SQL | Flink SQL | ANSI SQL (nejširší) |
+| **ML/AI** | MLlib, SparkML | — | — |
+| **Kubernetes** | Native (production) | Native (production) | Native (production) |
+| **Křivka učení** | Střední | Vysoká | Nízká |
+| **Provozní náročnost** | Střední | Vysoká | Střední |
+
+---
+
+## Orchestrace
+
+| Nástroj | Verze 2026 | Use case |
+|---------|-----------|----------|
+| **Apache Airflow** | 3.0+ (taskflow API, dynamic tasks, deferrable operators) | Univerzální orchestrace, největší ekosystém |
+| **Dagster** | 1.x (asset-oriented, software-defined assets) | Data pipelines, observabilita, asset lineage |
+| **Prefect** | 3.x (native async, workers, blocks) | Python-native, serverless workers |
+| **Kestra** | 1.x (YAML-native, declarative) | Event-driven orchestration |
+| **Apache NiFi** | 2.x (flow-based, visual) | Data ingestion, CDC, streaming |
+
+---
+
+## Lakehouse architektura
+
+Lakehouse kombinuje flexibilitu data lake (object storage) s výkonem a governance data warehouse.
+
+```
+┌──────────────────────────────────────────────────────┐
+│                    Query Engines                      │
+│   Trino    Spark SQL    Flink SQL    Dremio    Athena  │
+└─────────────────────────┬────────────────────────────┘
+                          │
+┌─────────────────────────▼────────────────────────────┐
+│                  Table Format Layer                    │
+│         Apache Iceberg / Delta Lake / Hudi            │
+│       (ACID, time travel, schema evolution)           │
+└─────────────────────────┬────────────────────────────┘
+                          │
+┌─────────────────────────▼────────────────────────────┐
+│                    Storage Layer                       │
+│           S3 / GCS / ADLS / MinIO / HDFS              │
+│               (Parquet / ORC / Avro)                  │
+└──────────────────────────────────────────────────────┘
+```
+
+Detailněji Iceberg viz [DATABASES.md — Apache Iceberg Lakehouse](DATABASES.md#apache-iceberg-lakehouse).
+
+---
+
+## Infrastruktura pro Big Data
+
+### Cluster sizing
+
+| Komponenta | Spark (batch) | Flink (streaming) | Trino (SQL) |
+|------------|--------------|-------------------|-------------|
+| **CPU** | 16–64 cores/node | 16–32 cores/node | 8–32 cores/node |
+| **RAM** | 64–256 GB/node | 64–256 GB/node (včetně managed state) | 64–256 GB/node |
+| **Storage** | HDFS / object storage | Object storage (checkpointy) | Žádná (stateless) |
+| **Network** | 25–100 GbE (shuffle-heavy) | 25–100 GbE (checkpointing) | 25–100 GbE |
+| **Disk** | NVMe (scratch, shuffle) | NVMe (RocksDB state backend) | — |
+| **Cluster velikost** | 5–200+ nodes | 3–100+ nodes | 5–50 nodes |
+
+### Network considerations
+
+- **Spark shuffle** — heavy network traffic mezi uzly; doporučeno 25–100 GbE, ideálně bez oversubscription
+- **Flink checkpointing** — periodický zápis stavu na object storage; vyžaduje stabilní latenci
+- **HDFS rack awareness** — optimalizuje replikaci napříč racky
+- **Data locality** — HDFS: čtení z lokálního disku; object storage: network-bound
+
+### Kubernetes vs YARN
+
+| Kritérium | YARN | Kubernetes |
+|-----------|------|-----------|
+| **Resource isolation** | Cgroups (YARN containers) | Cgroups + namespaces (pods) |
+| **Ecosystem fit** | Hadoop-native (HDFS, Hive, Spark) | Cloud-native, Spark, Flink, Trino |
+| **Operational complexity** | Nižší (jeden cluster manager) | Vyšší (vyžaduje K8s cluster) |
+| **Multi-tenant isolation** | YARN queues (Capacity/Fair Scheduler) | Namespaces, ResourceQuotas, LimitRanges |
+| **Stateful workloads** | Omezená | StatefulSets, PVC, Operators |
+| **2026 trend** | Legacy (klesající) | Standard pro nové projekty |
+
+---
+
+## Nasazení v cloudu
+
+| Cloud | Dávkové zpracování | Streaming | SQL | Managed K8s |
+|-------|-------------------|-----------|-----|-------------|
+| **AWS** | EMR (Spark, Hive, Flink) | Kinesis, MSK (Kafka), EMR Flink | Athena (Trino), Redshift | EKS |
+| **Azure** | HDInsight (Spark, Hive), Synapse | Event Hubs, HDInsight Flink | Synapse SQL, Azure Data Explorer | AKS |
+| **GCP** | Dataproc (Spark, Flink, Hive, Trino) | Pub/Sub, Dataflow (Beam), Dataproc Flink | BigQuery | GKE |
+
+---
+
+## Zdroje
+
+Odkazy, knihy a standardy: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
+
+*Poslední revize: 2026-06-18*
--- a/CASSANDRA.en.md
+++ b/CASSANDRA.en.md
@@ -123,7 +123,7 @@ ScyllaDB is advantageous when:

 ## Sources

-References, books, and standards: [sources/databases/sources.md](sources/databases/sources.md)
+References, books, and standards: [sources/databases/sources.en.md](sources/databases/sources.en.md)

 ### Recommended reading

--- a/CICD.en.md
+++ b/CICD.en.md
@@ -637,7 +637,7 @@ New tools: Harness (AI-native CD), GitLab 19.0 (agentic MR workflows, secrets ma

 ## Resources

-Links, books and standards: [sources/cicd/sources.md](sources/cicd/sources.md)
+Links, books and standards: [sources/cicd/sources.en.md](sources/cicd/sources.en.md)

 ### Recommended Reading

--- a/CLOUD.en.md
+++ b/CLOUD.en.md
@@ -144,7 +144,7 @@ Analogues: Azure Well-Architected Framework, GCP Architecture Framework
 | **Storage optimized** | I4i, im4gn | 1:4 + NVMe | Transactional DB, data warehousing, Kafka | i4i.large ~$0.138/h |
 | **GPU / ML** | P5, g5, trn1 | GPU attach | AI training (P5), inference (g5), ML (trn1) | g5.xlarge ~$1.006/h |

-See [GPU.md](GPU.md) for GPU model and configuration details.
+See [GPU.en.md](GPU.en.md) for GPU model and configuration details.

 ### Storage

@@ -287,7 +287,7 @@ Automated checks of architectural characteristics — analogous to tests for arc

 ## Hybrid Cloud Connectivity

-See also: [NETWORKING.md](NETWORKING.md) — network architecture (VPN, BGP, VPC design).
+See also: [NETWORKING.en.md](NETWORKING.en.md) — network architecture (VPN, BGP, VPC design).

 - **Site-to-Site VPN** — IPSec tunnel over the internet
 - **Direct Connect / ExpressRoute / Dedicated Interconnect** — private physical connection
@@ -480,7 +480,7 @@ OpenStack is the dominant open-source platform for building private clouds (IaaS

 ## Resources

-Links, books and standards: [sources/cloud/sources.md](sources/cloud/sources.md)
+Links, books and standards: [sources/cloud/sources.en.md](sources/cloud/sources.en.md)
 - **Cost tagging** — assign tags for chargeback/showback (Environment, Team, Cost Center, Application)
 - **Automated compliance** — AWS Config, Azure Policy, GCP Org Policies for guardrails
 - **Multi-account strategy** — AWS Control Tower, Azure Landing Zones, GCP Resource Hierarchy
--- a/CONNECTIVITY.en.md
+++ b/CONNECTIVITY.en.md
@@ -259,7 +259,7 @@ HPE ProLiant Gen11 (DL360/DL380) supports:

 ## Sources

-Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
+Links, books, and standards: [sources/infrastructure/sources.en.md](sources/infrastructure/sources.en.md)

 ### Recommended literature

--- a/DATABASE-ENGINES.en.md
+++ b/DATABASE-ENGINES.en.md
@@ -90,7 +90,7 @@ Each transaction sees a snapshot of data as of the start time. Old row versions

 ## Resources

-Links, books and standards: [sources/databases/sources.md](sources/databases/sources.md)
+Links, books and standards: [sources/databases/sources.en.md](sources/databases/sources.en.md)

 ### Recommended Reading

--- a/DATABASES.en.md
+++ b/DATABASES.en.md
@@ -6,20 +6,20 @@

 | DB | License | Use Case | Details |
 |----|---------|----------|--------|
-| **PostgreSQL** | Open source | Universal, geospatial, analytics, AI | [POSTGRESQL.md](POSTGRESQL.md) |
-| **MySQL / MariaDB** | Open source | Web, LAMP stack, e-commerce | [MYSQL.md](MYSQL.md) |
+| **PostgreSQL** | Open source | Universal, geospatial, analytics, AI | [POSTGRESQL.en.md](POSTGRESQL.en.md) |
+| **MySQL / MariaDB** | Open source | Web, LAMP stack, e-commerce | [MYSQL.en.md](MYSQL.en.md) |
 | **Microsoft SQL Server** | Proprietary | Enterprise .NET, Windows ecosystem | — |
-| **Oracle DB** | Proprietary | Enterprise, finance, mainframe, RAC cluster | [ORACLE.md](ORACLE.md) |
+| **Oracle DB** | Proprietary | Enterprise, finance, mainframe, RAC cluster | [ORACLE.en.md](ORACLE.en.md) |
 | **Amazon Aurora** | Managed | MySQL/PostgreSQL compatible, cloud-native | — |

 ### NoSQL

 | Type | DB | Use Case | Details |
 |-----|----|----------|--------|
-| **Document** | MongoDB, Couchbase | JSON data, flexible schema | [MONGODB.md](MONGODB.md) |
-| **Key-Value / Cache** | Redis, Memcached, DynamoDB | Cache, session store, real-time | [REDIS.md](REDIS.md) |
-| **Wide-column** | Cassandra, ScyllaDB | Time-series, IoT, big data | [CASSANDRA.md](CASSANDRA.md) |
-| **Vector** | Pinecone, Qdrant, Milvus, pgvector | Embeddings, RAG, semantic search | [VEKTOROVE-DB.md](VEKTOROVE-DB.md) |
+| **Document** | MongoDB, Couchbase | JSON data, flexible schema | [MONGODB.en.md](MONGODB.en.md) |
+| **Key-Value / Cache** | Redis, Memcached, DynamoDB | Cache, session store, real-time | [REDIS.en.md](REDIS.en.md) |
+| **Wide-column** | Cassandra, ScyllaDB | Time-series, IoT, big data | [CASSANDRA.en.md](CASSANDRA.en.md) |
+| **Vector** | Pinecone, Qdrant, Milvus, pgvector | Embeddings, RAG, semantic search | [VECTOR-DBS.en.md](VECTOR-DBS.en.md) |
 | **Graph** | Neo4j, Dgraph | Relationships, recommendations, social graphs | — |

 ### Storage Engines
@@ -258,6 +258,8 @@ Table metadata (.metadata.json)
 | **Hidden partitioning** | Automatic partition filters (user does not need to specify) |
 | **Multi-engine** | Spark, Flink, Trino, Dremio, Snowflake over the same data |

+For a broader overview of the Big Data ecosystem (HDFS, Spark, Flink, Trino, Delta Lake, Hudi) see [BIG-DATA.en.md](BIG-DATA.en.md).
+
 ### When to Use Iceberg

 - Multi-tool access to the same governed data
@@ -305,7 +307,7 @@ Table metadata (.metadata.json)

 ## Resources

-Links, books and standards: [sources/databases/sources.md](sources/databases/sources.md)
+Links, books and standards: [sources/databases/sources.en.md](sources/databases/sources.en.md)

 ### Recommended Reading

--- a/DATABASES.md
+++ b/DATABASES.md
@@ -258,6 +258,8 @@ Table metadata (.metadata.json)
 | **Hidden partitioning** | Automatické partition filtry (uživatel nemusí uvádět) |
 | **Multi-engine** | Spark, Flink, Trino, Dremio, Snowflake nad stejnými daty |

+Detailnější přehled Big Data ekosystému (HDFS, Spark, Flink, Trino, Delta Lake, Hudi) viz [BIG-DATA.md](BIG-DATA.md).
+
 ### Kdy použít Iceberg

 - Multi-tool přístup ke stejným governed datům
--- a/DATACENTERS.en.md
+++ b/DATACENTERS.en.md
@@ -658,6 +658,281 @@ flowchart TD
    CLIM -->|"Cold (SE, NO)"| FC3["Free cooling 7000+ h/year<br/>Air-side economizer<br/>PUE < 1.2"]
 ```

+## Secondary data center topologies
+
+When planning a second DC, the choice of topology depends mainly on distance, RPO/RTO targets, and budget.
+
+### Distance classification
+
+| Category | Distance | Latency (round-trip) | Use case |
+|-----------|-----------|---------------------|----------|
+| **Metro (Campus)** | 1–20 km | < 1 ms | Synchronous replication, stretched cluster |
+| **Metro** | 20–100 km | 1–5 ms | Metro cluster, mostly sync replication |
+| **Regional** | 100–500 km | 5–20 ms | Asynchronous replication, warm standby |
+| **Continental** | 500–3000 km | 20–100 ms | Asynchronous replication, cold standby |
+| **Global** | 3000+ km | > 100 ms | Async only, no real-time dependencies |
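+
+The latency column can be sanity-checked against the physical lower bound. Light travels through optical fiber at roughly 5 µs per km one-way (a well-known approximation for a refractive index of about 1.47). The sketch below computes that floor; the function name and the `route_factor` and `equipment_ms` parameters are hypothetical knobs for fiber detours and switching/transponder delay.
+
+```python
+FIBER_US_PER_KM = 5.0  # approximate one-way propagation delay in fiber
+
+
+def fiber_rtt_ms(distance_km, route_factor=1.0, equipment_ms=0.0):
+    """Round-trip time in ms: propagation over the fiber route plus fixed equipment delay."""
+    one_way_us = distance_km * route_factor * FIBER_US_PER_KM
+    return 2 * one_way_us / 1000 + equipment_ms
+
+
+for km in (20, 100, 500, 3000):
+    print(f"{km:>5} km: >= {fiber_rtt_ms(km):.1f} ms RTT")
+```
+
+The results (0.2, 1.0, 5.0, and 30.0 ms) match the lower ends of the ranges in the table. Real links land higher because fiber routes are longer than the straight-line distance and every device on the path adds delay.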
+
+### Topologies by operational mode
+
+#### Active-Active (Hot-Hot)
+
+```
+DC-A (Primary)                 DC-B (Active)
+┌────────────────────┐        ┌────────────────────┐
+│  App Active        │        │  App Active        │
+│  DB Active         │◄─sync─►│  DB Active         │
+│  Users → LB → A    │        │  Users → LB → B    │
+└────────────────────┘        └────────────────────┘
+           │                         │
+           └──── Global Load Balancer ────┘
+```
+
+| Parameter | Value |
+|----------|---------|
+| **RTO** | 0 to a few seconds (automatic failover, traffic is redirected) |
+| **RPO** | 0 (sync replication, commit is confirmed only after write to both DCs) |
+| **Max distance** | < 100 km (latency < 5 ms RTT for sync DB replication) |
+| **Operating costs** | 2× (both DCs fully active, both fully equipped) |
+| **Advantages** | Zero downtime, instant switchover, full utilization of both DCs |
+| **Disadvantages** | Requires synchronous replication → distance limit, complex networking, split-brain risk |
+
+**Split-brain solutions**: STONITH (Shoot The Other Node In The Head), watchdog, quorum (3rd node in 3rd location / cloud), fencing, SCSI-3 persistent reservation.
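+
+The quorum option is the easiest to show in code. A minimal sketch, assuming one vote per site and a strict-majority rule; the function name is hypothetical and real cluster managers add fencing on top of the vote count:
+
+```python
+def has_quorum(reachable_votes, total_votes):
+    """A partition may stay active only if it holds a strict majority of all votes."""
+    return reachable_votes > total_votes // 2
+
+
+# Two DCs plus a witness in a third location: 3 votes in total.
+# The inter-DC link fails and the witness can still reach DC-A only.
+print(has_quorum(2, 3))  # True:  DC-A + witness keep running
+print(has_quorum(1, 3))  # False: DC-B must stop accepting writes
+
+# Two DCs without a witness: neither side can win, so both must stop.
+print(has_quorum(1, 2))  # False
+```
+
+This is why the third vote sits in a third location or in the cloud: with only two sites, a link failure leaves no side with a majority.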
+
+**Use case**: Financial services, telco, payment gateways, where even a minute of downtime costs millions.
+
+#### Active-Passive (Hot-Warm, MetroCluster)
+
+```
+DC-A (Primary)                 DC-B (Standby)
+┌────────────────────┐        ┌────────────────────┐
+│  App Active        │        │  App Standby       │
+│  DB Primary        │──sync──►│  DB Standby        │
+│  Users → LB → A    │        │  ~~~ (waiting) ~~~ │
+│  DNS: A-record     │        │  DNS: health check │
+└────────────────────┘        └────────────────────┘
+```
+
+| Parameter | Value |
+|----------|---------|
+| **RTO** | tens of seconds–minutes (DNS failover + App startup) |
+| **RPO** | 0 (sync) or seconds (async) |
+| **Max distance** | sync < 100 km, async unlimited |
+| **Operating costs** | 1.5–1.8× (second DC has reduced or idle compute) |
+| **MetroCluster** | Specific implementation: FC SAN over DWDM, sync mirror, automatic failover |
+
+**MetroCluster** (NetApp, Dell EMC, HPE):
+- Storage-based cluster with synchronous mirroring between DCs
+- Automatic failover on entire DC failure
+- Requires dedicated DWDM or dark fiber interconnection
+- Typical distance: up to 50 km (for latency < 1 ms RTT)
+- Use case: enterprise storage, primary+secondary DC in metropolitan area
+
+#### Hot-Cold (Warm Standby → Cold)
+
+```
+DC-A (Primary)                 DC-B (Cold Standby)
+┌────────────────────┐        ┌────────────────────┐
+│  App Active        │        │  ~~~ powered off ~~~│
+│  DB Active         │──async─►│  Backup storage    │
+│  Users → A         │        │  ~~~ no compute ~~~│
+└────────────────────┘        └────────────────────┘
+```
+
+| Parameter | Value |
+|----------|---------|
+| **RTO** | hours–days (purchase/rent HW, restore from backup) |
+| **RPO** | hours (last backup) |
+| **Max distance** | unlimited |
+| **Operating costs** | 1.1–1.3× (only storage and facility, compute only at failover) |
+| **Typical use case** | Low-cost DR, compliance, last resort |
+
+#### Pilot Light
+
+```
+DC-A (Primary)                 DC-B (Pilot Light)
+┌────────────────────┐        ┌────────────────────┐
+│  App Active        │        │  ~~~ off ~~~       │
+│  DB Active         │──async─►│  DB replica (mini) │
+│  All services      │        │  Core services only│
+│                    │        │  (DNS, LDAP, mon)  │
+└────────────────────┘        └────────────────────┘
+                              On DR: spin-up compute
+                              from IaC, rest from backup
+```
+
+- DC-B runs with minimum compute (only core services and DB replica)
+- Application layer is spun up from IaC (Terraform, Ansible) only during DR
+- Compromise between cost and RTO
+
+### Comparison table
+
+| Topology | RTO | RPO | Cost (× primary) | Max distance | Failover |
+|-----------|-----|-----|-------------------|-------------|----------|
+| **Active-Active** | 0–s | 0 | 2.0× | < 100 km | Auto (traffic) |
+| **MetroCluster** | s–min | 0 | 1.8–2.0× | < 50 km | Auto (storage) |
+| **Active-Passive (sync)** | min | 0 | 1.5–1.8× | < 100 km | Semi-auto |
+| **Active-Passive (async)** | min–h | s–min | 1.3–1.5× | unlimited | Semi-auto |
+| **Pilot Light** | h | min–h | 1.2–1.4× | unlimited | Manual |
+| **Warm Standby** | min–h | s–min | 1.5–1.8× | unlimited | Semi-auto |
+| **Cold Standby** | days | h | 1.1–1.3× | unlimited | Manual |
+
+### Stretched Cluster
+
+```
+┌──── Site A (50 km) ────┐    ┌──── Site B ──────────┐
+│  ┌──────────────────┐   │    │  ┌──────────────────┐ │
+│  │  ESXi / Hyper-V  │   │    │  │  ESXi / Hyper-V  │ │
+│  │  VM               │   │    │  │  VM (complement) │ │
+│  └────────┬─────────┘   │    │  └────────┬─────────┘ │
+│           │             │    │           │            │
+│  ┌────────▼─────────┐  │    │  ┌────────▼─────────┐  │
+│  │  Storage (SAN)   │──┼────┼──│  Storage (SAN)   │  │
+│  │  MetroCluster    │  │    │  │  MetroCluster    │  │
+│  └──────────────────┘  │    │  └──────────────────┘  │
+└────────────────────────┘    └────────────────────────┘
+                │
+          ┌─────▼──────┐
+          │  vCenter / │
+          │  Cluster   │
+          │  (single)  │
+          └────────────┘
+```
+
+- One cluster stretched across two sites (single management domain)
+- VMs can live-migrate between sites (vMotion over distance)
+- Storage synchronously mirrored (MetroCluster, VPLEX, vSAN stretched cluster)
+- **Requirements**: dark fiber / DWDM, low latency (< 5 ms), high link reliability
+- **Risks**: split-brain when the inter-site link fails, dependency on the inter-site network
+- **Use case**: enterprise with own dark fiber between two DCs in a metropolitan area
+
+### Decision tree
+
+```mermaid
+flowchart TD
+    Start(["Secondary DC"]) --> RPO{"Required RPO?"}
+    RPO -->|"0 (no data loss)"| SYNC{"Sync replication possible?"}
+    SYNC -->|"Yes, < 100 km"| ACT{"Want zero downtime?"}
+    ACT -->|"Yes"| AA["Active-Active<br/>RTO=0, RPO=0, 2× cost"]
+    ACT -->|"No"| AP["Active-Passive<br/>RTO=min, RPO=0, 1.5×"]
+    SYNC -->|"No, > 100 km"| ASYNC["Active-Passive (async)<br/>RTO=min, RPO=s, 1.3×"]
+
+    RPO -->|"minutes–hours"| WARM{"Want fast failover?"}
+    WARM -->|"Yes"| PILOT["Pilot Light<br/>RTO=h, RPO=min, 1.2×"]
+    WARM -->|"No"| COLD["Cold Standby<br/>RTO=days, RPO=h, 1.1×"]
+
+    Start --> DIST{"Distance between DCs"}
+    DIST -->|"< 50 km, own fiber"| MC["MetroCluster / Stretched Cluster<br/>Single management, sync storage"]
+    DIST -->|"50–300 km"| REG["Regional DR<br/>Active-Passive, async replication"]
+    DIST -->|"> 300 km"| GLOBAL["Global DR<br/>Cold standby, backup & restore"]
+```
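
The two branches of the decision tree can be written down as plain functions. This is a minimal sketch, not part of the documented change: the function names and parameters are hypothetical, and the case "under 50 km without own fiber", which the tree does not cover, is mapped to Regional DR as an assumption.

```python
def choose_topology(rpo_zero: bool, distance_km: float,
                    zero_downtime: bool = False,
                    fast_failover: bool = False) -> str:
    """Walk the RPO branch of the decision tree and return the leaf label."""
    if rpo_zero:
        # The tree treats sync replication as feasible only below ~100 km.
        if distance_km < 100:
            return "Active-Active" if zero_downtime else "Active-Passive (sync)"
        # Beyond that the tree falls back to async, so RPO becomes seconds.
        return "Active-Passive (async)"
    # RPO of minutes to hours: the remaining question is failover speed.
    return "Pilot Light" if fast_failover else "Cold Standby"


def distance_class(distance_km: float, own_fiber: bool = False) -> str:
    """Walk the distance branch of the decision tree."""
    if distance_km < 50 and own_fiber:
        return "MetroCluster / Stretched Cluster"
    # Assumption: short distance without own fiber is handled like Regional DR.
    if distance_km <= 300:
        return "Regional DR"
    return "Global DR"


print(choose_topology(rpo_zero=True, distance_km=40, zero_downtime=True))
print(distance_class(40, own_fiber=True))
```

Keeping the two branches as separate functions mirrors the diagram, where RPO and distance are evaluated independently from the same start node.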
+
+### Physical infrastructure for DC interconnection
+
+| Technology | Bandwidth | Max distance | Latency | Use case |
+|------------|-----------|-------------|---------|----------|
+| **Dark fiber** | 100 GbE–800 GbE | 10–80 km (single-mode) | < 0.1 ms | MetroCluster, stretched cluster |
+| **DWDM** | 400 GbE–1.6 TbE (per lambda) | 80–120 km (without amplifier) | < 0.5 ms | Metro, metro cluster |
+| **CWDM** | 10–25 GbE (per channel) | 10–40 km | < 0.3 ms | Campus, smaller metro |
+| **MPLS L2VPN** | 10–100 GbE | unlimited | 1–10 ms | Regional DR, async replication |
+| **Internet IPsec** | 1–10 GbE | unlimited | 5–50 ms | Cold standby, backup |
+
+### Impact of individual technologies on DC topology selection
+
+Choosing a secondary DC topology is not purely an infrastructure decision — each layer (DB, hypervisor, orchestration, messaging) brings its own constraints.
+
+#### Databases
+
+| DB technology | Sync replication | Max distance | Auto-failover | Split-brain handling | Note |
+|---------------|---------------|-------------|---------------|-------------------|----------|
+| **PostgreSQL** | Synchronous commit (`synchronous_standby_names`) | < 100 km (latency < 10 ms) | Patroni / repmgr + etcd | Quorum (etcd, 3+ nodes) | Streaming replication; needs `wal_keep_size` (`wal_keep_segments` before v13) or replication slots |
+| **MySQL** | Group Replication (multi-primary, single-primary) | < 100 km | MySQL InnoDB Cluster + MySQL Router | Paxos (Group Replication, 3+ nodes) | Semi-sync as compromise |
+| **Oracle** | Data Guard (SYNC/FASTSYNC/ASYNC), RAC extended | sync < 100 km, async unlimited | Data Guard Broker / FSFO (Fast-Start Failover) | Observer (3rd node) | Far Sync for remote DCs |
+| **MSSQL** | Always On Availability Groups (SYNCHRONOUS_COMMIT) | < 100 km | Always On + cluster quorum | File share witness / cloud witness | Multi-site cluster support |
+| **MongoDB** | Majority write concern + journaling | < 100 km | Replica set auto-election | Arbiter (voting member) | Priority-based failover |
+| **Cassandra** | N/A (multi-master, eventual consistency) | unlimited | Yes (peer-to-peer) | None (multi-master, gossip protocol) | Snitch-aware topology, NetworkTopologyStrategy |
+| **Redis** | Redis Sentinel / Redis Cluster (async) | unlimited (async) | Sentinel / Cluster failover | Quorum (Sentinel, majority) | PSYNC replication, replication lag |
+
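+Key limitation for **sync replication**: every commit waits for acknowledgement from the remote DC, so inter-DC latency should stay below about 5 ms RTT. Light in fiber covers roughly 200 km per millisecond, so 100 km adds ~1 ms RTT, which is fine. At 1000 km (~10 ms RTT) a session that commits serially is capped near 100 commits/s, and sync replication can cut transaction throughput by 80+ %.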
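
The latency arithmetic behind this limit can be checked with a short sketch. It assumes a propagation speed of about 200,000 km/s in fiber (roughly 5 µs per km one way) and ignores equipment and protocol delay. The function names and the 0.5 ms local commit time are illustrative assumptions, not measurements.

```python
FIBER_KM_PER_MS = 200.0  # assumption: ~200,000 km/s propagation speed in fiber


def rtt_ms(distance_km: float) -> float:
    """Round-trip propagation delay over fiber, ignoring equipment delay."""
    return 2 * distance_km / FIBER_KM_PER_MS


def serial_commits_per_s(distance_km: float, local_commit_ms: float = 0.5) -> float:
    """Upper bound on commits/s for ONE session that waits for each sync ack."""
    return 1000.0 / (local_commit_ms + rtt_ms(distance_km))


for km in (10, 100, 1000):
    print(f"{km:>5} km  RTT {rtt_ms(km):5.2f} ms  "
          f"<= {serial_commits_per_s(km):6.0f} commits/s per session")
```

Under these assumptions a single session drops from about 667 commits/s at 100 km to about 95 commits/s at 1000 km. Concurrent sessions overlap their waits, so aggregate throughput usually degrades less than per-session throughput.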
+
+Suitable for **Active-Active**:
+- **Cassandra / ScyllaDB** — native multi-DC, eventual consistency, no split-brain
+- **MySQL Group Replication (multi-primary)** — 3+ DC for quorum
+- **CockroachDB / TiDB** — native multi-region, ACID across DCs
+- **Redis Enterprise** — Active-Active (CRDT-based)
+
+Suitable for **Active-Passive**:
+- **PostgreSQL + Patroni** — auto-failover, etcd quorum
+- **Oracle Data Guard** — FSFO, far sync for remote DCs
+- **MSSQL Always On** — cloud witness
+- **MongoDB Replica Set** — arbiter in a 3rd location
+
+#### Hypervisors
+
+| Hypervisor | Cluster technology | Stretched cluster | Max distance | Split-brain |
+|-----------|-------------------|-------------------|-------------|-------------|
+| **VMware vSphere** | vSAN Stretched Cluster, Metro vCenter, Site Recovery Manager | Yes (vSAN Stretched Cluster, Metro Cluster) | < 50 km (vSAN Stretched Cluster), < 10 ms RTT | Fencing (STONITH), witness host |
+| **Hyper-V** | Storage Replica + Failover Cluster | Yes (Cluster Sets) | < 50 km (sync), unlimited (async) | File share witness / cloud witness |
+| **Proxmox VE** | Proxmox HA + Ceph | Limited (Ceph stretch cluster) | < 50 km (Ceph sync) | Ceph monitor quorum (3+ DC) |
+| **XCP-ng / XenServer** | Xen Orchestra HA + SR (Storage Repository) replication | Limited | depends on storage replication | — |
+| **Nutanix AHV** | Metro Availability (sync), Async DR | Yes (Metro) | < 100 km (sync), unlimited (async) | Witness VM (cloud / 3rd site) |
+| **KVM / oVirt** | oVirt HA + GlusterFS / NFS | Limited | depends on storage replication | — |
+
+**vSAN Stretched Cluster specific requirements:**
+- Dedicated vSAN network (25 GbE min., < 5 ms RTT)
+- Witness host in a 3rd location (or cloud witness)
+- Every VM covered by a storage policy with cross-site mirroring (FTT=1)
+- Storage policy: `site-A + site-B + witness`
+
+#### Kubernetes and container platforms
+
+| Platform | Multi-cluster DR | Replication | Max distance | Failover |
+|-----------|-----------------|-----------|-------------|----------|
+| **Vanilla K8s** | KubeFed, Cluster API, Velero + Restic | Velero (backup/restore), Rook (Ceph) | unlimited | Manual (Velero restore) |
+| **OpenShift** | ACM (Advanced Cluster Management), Velero | OADP (OpenShift API for Data Protection) | unlimited | ACM failover (subscription) |
+| **Rancher** | Rancher Multi-Cluster App, Velero | Longhorn (sync/async DR), Velero | unlimited | Semi-auto |
+| **Google GKE** | Multi-cluster Services, Backup for GKE | Config Sync, Backup for GKE | unlimited | Manual |
+| **Azure AKS** | Azure Arc + Velero + Azure Traffic Manager | AKS backup (Velero), Azure Site Recovery | unlimited | Manual (Velero) |
+| **AWS EKS** | EKS multi-cluster, Velero + S3 cross-region | Velero (S3), Rook (EBS snapshots) | unlimited | Manual |
+
+**Key K8s DR principles:**
+- **Applications must be stateless** (or state externalized to DB/storage)
+- **Velero** — backup/restore entire cluster (PV, resources, helm releases)
+- **Rook/Ceph** — cross-region mirroring of RBD volumes
+- **KubeFed / ACM** — subscription-based deployment to multiple clusters
+- **Ingress/Gateway API** — traffic routing between clusters
+- **External DNS** — DNS failover on cluster outage
+
+#### Messaging / streaming
+
+| Platform | Replication | Topology | DR support | Max distance |
+|-----------|-----------|-----------|------------|-------------|
+| **Apache Kafka** | MirrorMaker 2, Confluent Cluster Linking, KRaft quorum | Active-Passive (MM2), Active-Active (Cluster Linking) | MM2: async, Cluster Linking: async | unlimited |
+| **RabbitMQ** | Classic Queue Mirroring, Quorum Queues | Active-Passive (Warm Standby) | Federation / Shovel (async) | unlimited |
+| **Red Hat AMQ** | (Artemis) Cluster + HA | Active-Passive (shared store / replication) | Live-backup pair | < 100 km (sync) |
+| **NATS** | NATS JetStream (cluster + cross-account) | Active-Active (Leaf nodes, cross-account) | Super-cluster, failover | unlimited |
+| **Apache Pulsar** | BookKeeper (bookie rack-aware), geo-replication | Active-Active (geo-replication) | Built-in (cluster-level) | unlimited (async) |
+| **AWS SQS/SNS** | Managed, AWS region pairs | Active-Active (multi-region) | Built-in (AWS managed) | unlimited |
+| **Azure Service Bus** | Managed, paired region | Active-Passive (paired region) | Built-in (geo-recovery) | unlimited |
+| **Oracle Service Bus (OSB)** | Oracle WebLogic Cluster + JDBC store + AQ | Active-Passive (WebLogic Cluster + Data Guard) | OSB/WLS cluster + Oracle RAC/Data Guard sync | < 100 km (Data Guard sync), unlimited (async) |
+
+**Messaging DR recommendations:**
+- **Kafka**: use Cluster Linking for Active-Active, or MirrorMaker 2 for Active-Passive; replicate only critical topics
+- **RabbitMQ**: Quorum Queues + Federation upstream for DR; avoid Classic Queue Mirroring (deprecated)
+- **Pulsar**: native geo-replication, bookie rack-aware for stretched cluster; easiest DR among messaging platforms
+- **OSB**: WebLogic cluster + Oracle RAC/Data Guard; DR depends on DB layer, not on OSB itself
+
+### Per-layer limitations summary table
+
+| Layer | Limiting factor for secondary DC | Max distance for sync | Impact on topology selection |
+|--------|-----------------------------------|----------------------|--------------------------|
+| **Storage** | Sync mirror latency, DWDM cost | < 50 km (MetroCluster) | Stretched cluster only in metro |
+| **Databases** | Commit wait for sync replication | < 100 km (5 ms RTT) | Active-Active only with multi-master DB |
+| **Hypervisor** | Stretched cluster quorum + fencing | < 50 km (vSAN, 5 ms) | MetroCluster / stretched cluster |
+| **Kubernetes** | Velero restore time, Rook mirror latency | unlimited (async) | Active-Passive, cold standby |
+| **Messaging** | Replication lag, offset management | unlimited (async) | Active-Active (Kafka, Pulsar, NATS) or Active-Passive |
+| **Network** | Dark fiber/DWDM cost, latency | < 100 km (metro fiber) | Limits sync replication options |
+| **Application** | Stateful/stateless, connection draining | depends on architecture | Stateless app → any topology |
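
Because every layer in a synchronous design has to stay within its own limit, the usable distance is the smallest limit among the layers involved. The sketch below illustrates that reading of the table; the dictionary holds the upper bounds from the table rows that state a number, and the function name is hypothetical.

```python
# Upper bounds for synchronous operation per layer, in km (from the table above).
SYNC_LIMIT_KM = {
    "storage": 50,
    "databases": 100,
    "hypervisor": 50,
    "network": 100,
}


def binding_layers(layers_in_sync):
    """Return the effective distance limit and the layers that impose it."""
    layers = list(layers_in_sync)
    limit = min(SYNC_LIMIT_KM[name] for name in layers)
    return limit, sorted(name for name in layers if SYNC_LIMIT_KM[name] == limit)


# DB-level sync replication only: the 100 km bound applies.
print(binding_layers(["databases", "network"]))
# Stretched cluster with sync storage: storage and hypervisor cap it at 50 km.
print(binding_layers(SYNC_LIMIT_KM))
```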
+
 ## Disk monitoring — S.M.A.R.T.

 Self-Monitoring, Analysis and Reporting Technology — predictive monitoring of HDD/SSD.
@@ -675,7 +950,7 @@ Tools: `smartmontools` (smartctl, smartd), Prometheus exporter (`node_exporter`)

 ## Sources

-Links, books and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
+Links, books and standards: [sources/infrastructure/sources.en.md](sources/infrastructure/sources.en.md)

 ### Recommended literature

@@ -735,7 +1010,7 @@ Best practices: separate auth and recursive resolvers, DNSSEC, split-horizon (in

 ### Monitoring and observability

-See [MONITORING.md](MONITORING.md). Before running first workloads, DC must have:
+See [MONITORING.en.md](MONITORING.en.md). Before running first workloads, DC must have:
 - Metric collection (Prometheus, Zabbix)
 - Centralized logs (Loki, ELK)
 - Alerting (Alertmanager, PagerDuty)
--- a/DATACENTERS.md
+++ b/DATACENTERS.md
@@ -658,6 +658,281 @@ flowchart TD
    CLIM -->|"Cold (SE, NO)"| FC3["Free cooling 7000+ h/year<br/>Air-side economizer<br/>PUE < 1.2"]
 ```

+## Secondary data center topologies
+
+When planning a second DC, the key choice is the topology, driven by distance, RPO/RTO, and budget.
+
+### Distance classification
+
+| Category | Distance | Latency (round-trip) | Use case |
+|-----------|-----------|---------------------|----------|
+| **Metro (Campus)** | 1–20 km | < 1 ms | Synchronous replication, stretched cluster |
+| **Metro** | 20–100 km | 1–5 ms | Metro cluster, mostly sync replication |
+| **Regional** | 100–500 km | 5–20 ms | Asynchronous replication, warm standby |
+| **Continent** | 500–3000 km | 20–100 ms | Asynchronous replication, cold standby |
+| **Global** | 3000+ km | > 100 ms | Async only, no real-time dependencies |
+
+### Topologies by operating mode
+
+#### Active-Active (Hot-Hot)
+
+```
+DC-A (Primary)                 DC-B (Active)
+┌────────────────────┐        ┌────────────────────┐
+│  App Active        │        │  App Active        │
+│  DB Active         │◄─sync─►│  DB Active         │
+│  Users → LB → A    │        │  Users → LB → B    │
+└────────────────────┘        └────────────────────┘
+           │                         │
+           └──── Global Load Balancer ────┘
+```
+
+| Parameter | Value |
+|----------|---------|
+| **RTO** | 0–seconds (automatic failover, traffic is redirected) |
+| **RPO** | 0 (sync replication, a commit is acknowledged only after it is written in both DCs) |
+| **Max distance** | < 100 km (latency < 5 ms RTT for sync DB replication) |
+| **Operating cost** | 2× (both DCs fully active and fully equipped) |
+| **Pros** | Zero downtime, instant switchover, full utilization of both DCs |
+| **Cons** | Requires synchronous replication → distance limit, complex networking, split-brain risk |
+
+**Split-brain handling**: STONITH (Shoot The Other Node In The Head), watchdog, quorum (3rd node in a 3rd location / cloud), fencing, SCSI-3 persistent reservation.
+
+**Use case**: Financial services, telco, payment gateways, where even a minute of downtime costs millions.
+
+#### Active-Passive (Hot-Warm, MetroCluster)
+
+```
+DC-A (Primary)                 DC-B (Standby)
+┌────────────────────┐        ┌────────────────────┐
+│  App Active        │        │  App Standby       │
+│  DB Primary        │──sync─►│  DB Standby        │
+│  Users → LB → A    │        │  ~~~ (waiting) ~~~ │
+│  DNS: A-record     │        │  DNS: health check │
+└────────────────────┘        └────────────────────┘
+```
+
+| Parameter | Value |
+|----------|---------|
+| **RTO** | tens of seconds–minutes (DNS failover + app startup) |
+| **RPO** | 0 (sync) or seconds (async) |
+| **Max distance** | sync < 100 km, async unlimited |
+| **Operating cost** | 1.5–1.8× (second DC has reduced or idle compute) |
+| **MetroCluster** | Specific implementation: FC SAN over DWDM, sync mirror, automatic failover |
+
+**MetroCluster** (NetApp, Dell EMC, HPE):
+- Storage-based cluster with synchronous mirroring between DCs
+- Automatic failover when an entire DC fails
+- Requires a dedicated DWDM or dark fiber interconnect
+- Typical distance: up to 50 km (for latency < 1 ms RTT)
+- Use case: enterprise storage, primary + secondary DC in a metropolitan area
+
+#### Hot-Cold (Warm Standby → Cold)
+
+```
+DC-A (Primary)                 DC-B (Cold Standby)
+┌────────────────────┐        ┌────────────────────┐
+│  App Active        │        │ ~~~ powered off ~~~│
+│  DB Active         │─async─►│  Backup storage    │
+│  Users → A         │        │  ~~~ no compute ~~~│
+└────────────────────┘        └────────────────────┘
+```
+
+| Parameter | Value |
+|----------|---------|
+| **RTO** | hours–days (buying/renting HW, restore from backup) |
+| **RPO** | hours (last backup) |
+| **Max distance** | unlimited |
+| **Operating cost** | 1.1–1.3× (storage and facility only, compute only at failover) |
+| **Typical use case** | Low-cost DR, compliance, last resort |
+
+#### Pilot Light
+
+```
+DC-A (Primary)                 DC-B (Pilot Light)
+┌────────────────────┐        ┌────────────────────┐
+│  App Active        │        │  ~~~ off ~~~       │
+│  DB Active         │─async─►│  DB replica (mini) │
+│  All services      │        │  Core services only│
+│                    │        │  (DNS, LDAP, mon)  │
+└────────────────────┘        └────────────────────┘
+                              On DR: spin up compute
+                              from IaC, rest from backup
+```
+
+- DC-B runs with minimal compute (only core services and a DB replica)
+- The application layer is spun up from IaC (Terraform, Ansible) only during DR
+- A compromise between cost and RTO
+
+### Comparison table
+
+| Topology | RTO | RPO | Cost (× primary) | Max distance | Failover |
+|-----------|-----|-----|-------------------|-------------|----------|
+| **Active-Active** | 0–s | 0 | 2.0× | < 100 km | Auto (traffic) |
+| **MetroCluster** | s–min | 0 | 1.8–2.0× | < 50 km | Auto (storage) |
+| **Active-Passive (sync)** | min | 0 | 1.5–1.8× | < 100 km | Semi-auto |
+| **Active-Passive (async)** | min–h | s–min | 1.3–1.5× | unlimited | Semi-auto |
+| **Pilot Light** | h | min–h | 1.2–1.4× | unlimited | Manual |
+| **Warm Standby** | min–h | s–min | 1.5–1.8× | unlimited | Semi-auto |
+| **Cold Standby** | days | h | 1.1–1.3× | unlimited | Manual |
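
One way to read the comparison table is as a filter: keep the topologies whose worst-case RTO and RPO meet the requirement, then take the cheapest. The sketch below is illustrative only. It encodes the qualitative values on an ordinal scale, uses the upper end of each RTO/RPO range and the lower end of each cost range, and the function name is hypothetical.

```python
from typing import Optional

# Ordinal scale for the qualitative RTO/RPO values in the table.
SCALE = {"0": 0, "s": 1, "min": 2, "h": 3, "days": 4}

# (topology, worst-case RTO, worst-case RPO, lower cost bound as multiple of primary)
TOPOLOGIES = [
    ("Active-Active", "s", "0", 2.0),
    ("MetroCluster", "min", "0", 1.8),
    ("Active-Passive (sync)", "min", "0", 1.5),
    ("Active-Passive (async)", "h", "min", 1.3),
    ("Pilot Light", "h", "h", 1.2),
    ("Warm Standby", "h", "min", 1.5),
    ("Cold Standby", "days", "h", 1.1),
]


def cheapest_topology(max_rto: str, max_rpo: str) -> Optional[str]:
    """Cheapest topology whose worst-case RTO and RPO fit the requirement."""
    candidates = [
        (cost, name)
        for name, rto, rpo, cost in TOPOLOGIES
        if SCALE[rto] <= SCALE[max_rto] and SCALE[rpo] <= SCALE[max_rpo]
    ]
    return min(candidates)[1] if candidates else None


print(cheapest_topology("min", "0"))
print(cheapest_topology("h", "min"))
```

Distance is deliberately left out here; a candidate still has to fit the "Max distance" column before it is viable.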
+
+### Stretched Cluster
+
+```
+┌──── Site A (50 km) ────┐    ┌──────── Site B ────────┐
+│  ┌──────────────────┐  │    │  ┌──────────────────┐  │
+│  │  ESXi / Hyper-V  │  │    │  │  ESXi / Hyper-V  │  │
+│  │  VM              │  │    │  │  VM (complement) │  │
+│  └────────┬─────────┘  │    │  └────────┬─────────┘  │
+│           │            │    │           │            │
+│  ┌────────▼─────────┐  │    │  ┌────────▼─────────┐  │
+│  │  Storage (SAN)   │──┼────┼──│  Storage (SAN)   │  │
+│  │  MetroCluster    │  │    │  │  MetroCluster    │  │
+│  └──────────────────┘  │    │  └──────────────────┘  │
+└────────────────────────┘    └────────────────────────┘
+                │
+          ┌─────▼──────┐
+          │  vCenter / │
+          │  Cluster   │
+          │  (single)  │
+          └────────────┘
+```
+
+- One cluster stretched across two sites (single management domain)
+- VMs can live-migrate between sites (long-distance vMotion)
+- Storage synchronously mirrored (MetroCluster, VPLEX, vSAN Stretched Cluster)
+- **Requirements**: dark fiber / DWDM, low latency (< 5 ms), high link reliability
+- **Risks**: split-brain when the inter-site link fails, dependency on the network
+- **Use case**: enterprise with own dark fiber between two DCs in a metropolitan area
+
+### Decision tree
+
+```mermaid
+flowchart TD
+    Start(["Secondary DC"]) --> RPO{"Required RPO?"}
+    RPO -->|"0 (no data loss)"| SYNC{"Sync replication possible?"}
+    SYNC -->|"Yes, < 100 km"| ACT{"Want zero downtime?"}
+    ACT -->|"Yes"| AA["Active-Active<br/>RTO=0, RPO=0, 2× cost"]
+    ACT -->|"No"| AP["Active-Passive<br/>RTO=min, RPO=0, 1.5×"]
+    SYNC -->|"No, > 100 km"| ASYNC["Active-Passive (async)<br/>RTO=min, RPO=s, 1.3×"]
+
+    RPO -->|"minutes–hours"| WARM{"Want fast failover?"}
+    WARM -->|"Yes"| PILOT["Pilot Light<br/>RTO=h, RPO=min, 1.2×"]
+    WARM -->|"No"| COLD["Cold Standby<br/>RTO=days, RPO=h, 1.1×"]
+
+    Start --> DIST{"Distance between DCs"}
+    DIST -->|"< 50 km, own fiber"| MC["MetroCluster / Stretched Cluster<br/>Single management, sync storage"]
+    DIST -->|"50–300 km"| REG["Regional DR<br/>Active-Passive, async replication"]
+    DIST -->|"> 300 km"| GLOBAL["Global DR<br/>Cold standby, backup & restore"]
+```
+
+### Physical infrastructure for DC interconnection
+
+| Technology | Bandwidth | Max distance | Latency | Use case |
+|------------|-----------|-------------|---------|----------|
+| **Dark fiber** | 100 GbE–800 GbE | 10–80 km (single-mode) | < 0.1 ms | MetroCluster, stretched cluster |
+| **DWDM** | 400 GbE–1.6 TbE (per lambda) | 80–120 km (without amplifier) | < 0.5 ms | Metro, metro cluster |
+| **CWDM** | 10–25 GbE (per channel) | 10–40 km | < 0.3 ms | Campus, smaller metro |
+| **MPLS L2VPN** | 10–100 GbE | unlimited | 1–10 ms | Regional DR, async replication |
+| **Internet IPsec** | 1–10 GbE | unlimited | 5–50 ms | Cold standby, backup |
+
+### Impact of individual technologies on DC topology selection
+
+Choosing a secondary DC topology is not purely an infrastructure decision: each layer (DB, hypervisor, orchestration, messaging) brings its own constraints.
+
+#### Databases
+
+| DB technology | Sync replication | Max distance | Auto-failover | Split-brain handling | Note |
+|---------------|---------------|-------------|---------------|-------------------|----------|
+| **PostgreSQL** | Synchronous commit (`synchronous_standby_names`) | < 100 km (latency < 10 ms) | Patroni / repmgr + etcd | Quorum (etcd, 3+ nodes) | Streaming replication; needs `wal_keep_size` (`wal_keep_segments` before v13) or replication slots |
+| **MySQL** | Group Replication (multi-primary, single-primary) | < 100 km | MySQL InnoDB Cluster + MySQL Router | Paxos (Group Replication, 3+ nodes) | Semi-sync as compromise |
+| **Oracle** | Data Guard (SYNC/FASTSYNC/ASYNC), RAC extended | sync < 100 km, async unlimited | Data Guard Broker / FSFO (Fast-Start Failover) | Observer (3rd node) | Far Sync for remote DCs |
+| **MSSQL** | Always On Availability Groups (SYNCHRONOUS_COMMIT) | < 100 km | Always On + cluster quorum | File share witness / cloud witness | Multi-site cluster support |
+| **MongoDB** | Majority write concern + journaling | < 100 km | Replica set auto-election | Arbiter (voting member) | Priority-based failover |
+| **Cassandra** | N/A (multi-master, eventual consistency) | unlimited | Yes (peer-to-peer) | None (multi-master, gossip protocol) | Snitch-aware topology, NetworkTopologyStrategy |
+| **Redis** | Redis Sentinel / Redis Cluster (async) | unlimited (async) | Sentinel / Cluster failover | Quorum (Sentinel, majority) | PSYNC replication, replication lag |
+
+Key limitation for **sync replication**: every commit waits for acknowledgement from the remote DC, so inter-DC latency should stay below about 5 ms RTT. Light in fiber covers roughly 200 km per millisecond, so 100 km adds ~1 ms RTT, which is fine. At 1000 km (~10 ms RTT) a session that commits serially is capped near 100 commits/s, and sync replication can cut transaction throughput by 80+ %.
+
+Suitable for **Active-Active**:
+- **Cassandra / ScyllaDB** — native multi-DC, eventual consistency, no split-brain
+- **MySQL Group Replication (multi-primary)** — 3+ DCs for quorum
+- **CockroachDB / TiDB** — native multi-region, ACID across DCs
+- **Redis Enterprise** — Active-Active (CRDT-based)
+
+Suitable for **Active-Passive**:
+- **PostgreSQL + Patroni** — auto-failover, etcd quorum
+- **Oracle Data Guard** — FSFO, far sync for remote DCs
+- **MSSQL Always On** — cloud witness
+- **MongoDB Replica Set** — arbiter in a 3rd location
+
+#### Hypervisors
+
+| Hypervisor | Cluster technology | Stretched cluster | Max distance | Split-brain |
+|-----------|-------------------|-------------------|-------------|-------------|
+| **VMware vSphere** | vSAN Stretched Cluster, Metro vCenter, Site Recovery Manager | Yes (vSAN Stretched Cluster, Metro Cluster) | < 50 km (vSAN Stretched Cluster), < 10 ms RTT | Fencing (STONITH), witness host |
+| **Hyper-V** | Storage Replica + Failover Cluster | Yes (Cluster Sets) | < 50 km (sync), unlimited (async) | File share witness / cloud witness |
+| **Proxmox VE** | Proxmox HA + Ceph | Limited (Ceph stretch cluster) | < 50 km (Ceph sync) | Ceph monitor quorum (3+ DC) |
+| **XCP-ng / XenServer** | Xen Orchestra HA + SR (Storage Repository) replication | Limited | depends on storage replication | — |
+| **Nutanix AHV** | Metro Availability (sync), Async DR | Yes (Metro) | < 100 km (sync), unlimited (async) | Witness VM (cloud / 3rd site) |
+| **KVM / oVirt** | oVirt HA + GlusterFS / NFS | Limited | depends on storage replication | — |
+
+**vSAN Stretched Cluster specific requirements:**
+- Dedicated vSAN network (25 GbE min., < 5 ms RTT)
+- Witness host in a 3rd location (or cloud witness)
+- Every VM covered by a storage policy with cross-site mirroring (FTT=1)
+- Storage policy: `site-A + site-B + witness`
+
+#### Kubernetes and container platforms
+
+| Platform | Multi-cluster DR | Replication | Max distance | Failover |
+|-----------|-----------------|-----------|-------------|----------|
+| **Vanilla K8s** | KubeFed, Cluster API, Velero + Restic | Velero (backup/restore), Rook (Ceph) | unlimited | Manual (Velero restore) |
+| **OpenShift** | ACM (Advanced Cluster Management), Velero | OADP (OpenShift API for Data Protection) | unlimited | ACM failover (subscription) |
+| **Rancher** | Rancher Multi-Cluster App, Velero | Longhorn (sync/async DR), Velero | unlimited | Semi-auto |
+| **Google GKE** | Multi-cluster Services, Backup for GKE | Config Sync, Backup for GKE | unlimited | Manual |
+| **Azure AKS** | Azure Arc + Velero + Azure Traffic Manager | AKS backup (Velero), Azure Site Recovery | unlimited | Manual (Velero) |
+| **AWS EKS** | EKS multi-cluster, Velero + S3 cross-region | Velero (S3), Rook (EBS snapshots) | unlimited | Manual |
+
+**Key K8s DR principles:**
+- **Applications must be stateless** (or state externalized to DB/storage)
+- **Velero** — backup/restore entire cluster (PV, resources, helm releases)
+- **Rook/Ceph** — cross-region mirroring of RBD volumes
+- **KubeFed / ACM** — subscription-based deployment to multiple clusters
+- **Ingress/Gateway API** — traffic routing between clusters
+- **External DNS** — DNS failover on cluster outage
+
+#### Messaging / streaming
+
+| Platform | Replication | Topology | DR support | Max distance |
+|-----------|-----------|-----------|------------|-------------|
+| **Apache Kafka** | MirrorMaker 2, Confluent Cluster Linking, KRaft quorum | Active-Passive (MM2), Active-Active (Cluster Linking) | MM2: async, Cluster Linking: async | unlimited |
+| **RabbitMQ** | Classic Queue Mirroring, Quorum Queues | Active-Passive (Warm Standby) | Federation / Shovel (async) | unlimited |
+| **Red Hat AMQ** | (Artemis) Cluster + HA | Active-Passive (shared store / replication) | Live-backup pair | < 100 km (sync) |
+| **NATS** | NATS JetStream (cluster + cross-account) | Active-Active (Leaf nodes, cross-account) | Super-cluster, failover | unlimited |
+| **Apache Pulsar** | BookKeeper (bookie rack-aware), geo-replication | Active-Active (geo-replication) | Built-in (cluster-level) | unlimited (async) |
+| **AWS SQS/SNS** | Managed, AWS region pairs | Active-Active (multi-region) | Built-in (AWS managed) | unlimited |
+| **Azure Service Bus** | Managed, paired region | Active-Passive (paired region) | Built-in (geo-recovery) | unlimited |
+| **Oracle Service Bus (OSB)** | Oracle WebLogic Cluster + JDBC store + AQ | Active-Passive (WebLogic Cluster + Data Guard) | OSB/WLS cluster + Oracle RAC/Data Guard sync | < 100 km (Data Guard sync), unlimited (async) |
+
+**Messaging DR recommendations:**
+- **Kafka**: use Cluster Linking for Active-Active, or MirrorMaker 2 for Active-Passive; replicate only critical topics
+- **RabbitMQ**: Quorum Queues + Federation upstream for DR; avoid Classic Queue Mirroring (deprecated)
+- **Pulsar**: native geo-replication, bookie rack-aware for stretched cluster; easiest DR among messaging platforms
+- **OSB**: WebLogic cluster + Oracle RAC/Data Guard; DR depends on DB layer, not on OSB itself
+
+### Per-layer limitations summary table
+
+| Layer | Limiting factor for secondary DC | Max distance for sync | Impact on topology selection |
+|--------|-----------------------------------|----------------------|--------------------------|
+| **Storage** | Sync mirror latency, DWDM cost | < 50 km (MetroCluster) | Stretched cluster only in metro |
+| **Databases** | Commit wait for sync replication | < 100 km (5 ms RTT) | Active-Active only with multi-master DB |
+| **Hypervisor** | Stretched cluster quorum + fencing | < 50 km (vSAN, 5 ms) | MetroCluster / stretched cluster |
+| **Kubernetes** | Velero restore time, Rook mirror latency | unlimited (async) | Active-Passive, cold standby |
+| **Messaging** | Replication lag, offset management | unlimited (async) | Active-Active (Kafka, Pulsar, NATS) or Active-Passive |
+| **Network** | Dark fiber/DWDM cost, latency | < 100 km (metro fiber) | Limits sync replication options |
+| **Application** | Stateful/stateless, connection draining | depends on architecture | Stateless app → any topology |
+
 ## Disk monitoring — S.M.A.R.T.
 
 Self-Monitoring, Analysis and Reporting Technology — predictive monitoring of HDD/SSD.
@@ -785,4 +1060,4 @@ OpenStack přináší do DC softwarovou abstrakční vrstvu, která umožňuje m
 - Academic / HPC clusters (Ironic, Cyborg, Manila)
 - Government / regulated environments (on-prem, audit trail)
 
-*Last revision: 2026-06-03*
+*Last revision: 2026-06-12*
--- a/DC-MIGRATION.en.md
+++ b/DC-MIGRATION.en.md
@@ -0,0 +1,246 @@
+# 🏗️ Data Center Migration
+
+## Migration strategies
+
+| Strategy | RTO | RPO | Risk | Cost | Duration | Description |
+|-----------|-----|-----|--------|---------|-------------|-------|
+| **Cold / Big Bang** | hours–days | days | High | Low | days | Shut everything down, move, power up |
+| **Phased / Wave** | minutes (per wave) | minutes | Medium | Medium | weeks–months | Workloads moved in waves |
+| **Rolling** | 0 (live) | 0 | Low | High | months | Live migration per VM/service |
+| **Parallel Run** | 0 | 0 | Low | Very high | months | Both DCs operational, gradual cutover |
+| **Pilot Light** | hours | minutes | Medium | Low | weeks | Critical services in new DC, rest migrates |
+| **Lift & Shift** | hours | minutes | Medium | Low | weeks | VMs/servers moved without configuration changes |
+| **Re-platform** | hours | minutes | Low | Medium | months | Optimization during migration (OS upgrade, resize) |
+| **Re-architect** | 0 | 0 | Low | High | months–years | Application redesigned for new platform |
+
+---
+
+## Decision tree
+
+```mermaid
+flowchart TD
+    Start(["DC Migration"]) --> APP{"Application<br/>stateful?"}
+    APP -->|"Yes"| DOWNTIME{"Tolerates<br/>downtime?"}
+    APP -->|"No"| ROLLING["Rolling / Parallel Run"]
+
+    DOWNTIME -->|"Yes, hours+"| COLD["Cold / Big Bang<br/>Simplest, cheapest<br/>Risk: all at once"]
+    DOWNTIME -->|"Yes, minutes"| PHASED["Phased / Wave<br/>By application / business unit"]
+    DOWNTIME -->|"No (zero downtime)"| SYNC{"Sync replication<br/>possible?"}
+
+    SYNC -->|"Yes, < 100 km"| ROLLING
+    SYNC -->|"No"| PARALLEL["Parallel Run<br/>Both DCs active, gradual cutover"]
+
+    ROLLING --> ROLL_HA{"VMware,<br/>Hyper-V?"}
+    ROLL_HA -->|"Yes"| VMOTION["vMotion / Storage vMotion / Hyper-V Live Migration<br/>0 downtime"]
+    ROLL_HA -->|"No"| ROLL_REPL["Storage + DB replication<br/>Gradual workload migration"]
+```
+
+---
+
+## Migration phases
+
+### 1. Discovery and assessment
+
+| Task | Tools | Output |
+|------|----------|--------|
+| HW and SW inventory | RVTools, NetBox, CMDB | Server, VM, and service list |
+| Dependency mapping | ServiceNow, AppDynamics, manual | Application dependency graph |
+| Traffic analysis | NetFlow, sFlow, vRNI | Bandwidth, latency, peak usage |
+| Performance baseline | Prometheus, Zabbix, vRealize | CPU/RAM/disk/network per workload |
+| License audit | Flexera, SAM | Licenses, support, compliance |
+
+**Output:** workload list with RTO/RPO, dependencies, and criticality.
+
+### 2. Planning
+
+- **Wave plan** — divide workloads into migration waves (10–50 VMs per wave)
+- **Dependency ordering** — DNS, NTP, LDAP, PKI first
+- **Cutover window** — time window for switching (typically a weekend)
+- **Rollback plan** — conditions and procedure for reversal
+- **Test plan** — what to test post-migration and how
+- **Communication plan** — who is informed, when, and how
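
Wave planning with dependency ordering can be sketched as a topological sort: a workload only enters a wave once everything it depends on has been placed in an earlier one. The example below uses Python's standard `graphlib` module. The dependency map, the `billing-app` workload, and the `plan_waves` helper are hypothetical illustrations, not taken from a real inventory.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: workload -> workloads it depends on.
DEPENDS_ON = {
    "dns": set(),
    "ntp": set(),
    "ldap": {"dns", "ntp"},
    "pki": {"ldap"},
    "monitoring": {"dns", "ntp"},
    "billing-app": {"ldap", "pki", "monitoring"},
}


def plan_waves(depends_on, max_per_wave=50):
    """Group workloads into waves so that dependencies always migrate first."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        # Split large dependency levels to respect the per-wave size limit.
        for start in range(0, len(ready), max_per_wave):
            waves.append(ready[start:start + max_per_wave])
        sorter.done(*ready)
    return waves


for number, wave in enumerate(plan_waves(DEPENDS_ON), start=1):
    print(f"Wave {number}: {', '.join(wave)}")
```

A circular dependency surfaces as an exception during planning, which is usually a sign that two workloads have to move together in the same cutover window.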
+
+### 3. New DC preparation
+
+- **Infrastructure** — DNS, NTP, DHCP, LDAP/AD, PKI, monitoring (see [DATACENTERS.en.md](DATACENTERS.en.md) — deployment order)
+- **Network** — BGP peering, VXLAN/VLAN, firewall rules, load balancers
+- **Storage** — SAN zoning, NAS exports, Ceph cluster
+- **Virtualization** — vCenter, Hyper-V cluster, Proxmox
+
+### 4. Replication and synchronization
+
+| Layer | Method | Tools |
+|--------|--------|----------|
+| **Storage (block)** | SAN sync/async mirror, LUN replication | NetApp SnapMirror, Dell EMC RecoverPoint, Pure ActiveCluster |
+| **Storage (file)** | DFS-R, rsync, robocopy | Windows DFS, Rsync |
+| **Storage (object)** | Cross-region replication | MinIO replication, S3 CRR |
+| **Databases** | Log shipping, CDC, streaming replication | PostgreSQL Patroni, Oracle Data Guard, MSSQL AlwaysOn, MySQL Group Replication |
+| **VM** | Storage vMotion, replication | VMware vSphere Replication, Hyper-V Replica, Zerto |
+| **Kubernetes** | Velero + Restic, Rook Ceph mirror | Velero, Rook |
+
+### 5. Workload migration
+
+#### Wave migration (recommended for medium/large DCs)
+
+```mermaid
+gantt
+    title Wave migration
+    dateFormat  YYYY-MM-DD
+    section Wave 1 - Core
+    DNS, NTP, LDAP    :done, w1a, 2026-07-01, 3d
+    Monitoring + logging :done, w1b, after w1a, 2d
+    section Wave 2 - Network
+    Load balancers     :active, w2a, 2026-07-06, 2d
+    Firewalls          :active, w2b, 2026-07-08, 2d
+    section Wave 3 - Storage
+    NAS migration      :w3a, 2026-07-10, 5d
+    SAN replication    :w3b, 2026-07-10, 3d
+    section Wave 4 - Dev/Test
+    Dev VMs            :w4a, 2026-07-15, 5d
+    section Wave 5 - Prod tier 3
+    Internal apps      :w5a, 2026-07-22, 5d
+    section Wave 6 - Prod tier 2
+    Business apps      :w6a, 2026-07-29, 5d
+    section Wave 7 - Prod tier 1
+    Critical apps      :w7a, 2026-08-05, 5d
+```
+
+#### Typical single wave procedure:
+
+1. **Day -7**: Start data replication (initial seed)
+2. **Day -1**: Incremental sync, final test
+3. **Day 0 (cutover)**:
+   - Stop application in source DC
+   - Final sync (last delta)
+   - Start application in target DC
+   - Switch DNS/traffic
+   - Smoke test
+4. **Day +1**: Monitoring (performance, errors, lag)
+5. **Day +7**: End of rollback window (success confirmed)
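
The day offsets in this procedure translate directly into calendar dates once a cutover date is fixed. A minimal sketch, assuming the offsets above; the `wave_schedule` helper and the sample date are illustrative.

```python
from datetime import date, timedelta

# Day offsets relative to cutover, following the procedure above.
STEPS = [
    (-7, "Start data replication (initial seed)"),
    (-1, "Incremental sync, final test"),
    (0, "Cutover: stop app, final sync, start app, switch DNS/traffic, smoke test"),
    (1, "Monitoring (performance, errors, lag)"),
    (7, "End of rollback window"),
]


def wave_schedule(cutover: date):
    """Return (calendar date, step) pairs for one migration wave."""
    return [(cutover + timedelta(days=offset), step) for offset, step in STEPS]


for day, step in wave_schedule(date(2026, 7, 4)):
    print(day.isoformat(), step)
```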
+
+### 6. Network strategies
+
+#### IP re-addressing
+
+| Approach | Description | Pros | Cons |
+|---------|-------|--------|----------|
+| **Keep IP** | Same IPs, BGP anycast or stretch VLAN | No application config changes | Stretched VLAN/L2 limitations |
+| **Change IP** | New IP range, DNS/BGP routing change | Clean architecture | Config changes, DNS TTL |
+| **NAT translation** | NAT between old and new IP space | No application changes | Latency, troubleshooting complexity |
+
+**Keep IP** is only possible with:
+- L2 stretch between DCs (VXLAN, OTV), limited by distance and latency
+- BGP anycast for VIPs (load balancers)
+- Applications that tolerate ARP cache changes
+
+#### DNS cutover
+
+```
+1. Lower TTL to 60–300 s (one week ahead)
+2. At cutover, change A/AAAA records to new IPs
+3. Wait for propagation (per TTL)
+4. Monitor traffic
+```
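
The timing behind steps 1 and 3 can be sketched as follows. This is a minimal illustration, not a tool: the function names and the example TTL values are assumptions. It shows that a resolver may keep serving a record for as long as the TTL that was in force when it cached that record, so the lowered TTL only helps once the previous TTL has fully expired.

```python
from datetime import datetime, timedelta


def earliest_safe_cutover(ttl_lowered_at: datetime, old_ttl_s: int) -> datetime:
    """Earliest time at which every cache should already hold the low-TTL record.

    Caches filled just before the TTL change may keep the old record
    for up to old_ttl_s seconds.
    """
    return ttl_lowered_at + timedelta(seconds=old_ttl_s)


def worst_case_propagation(cutover_at: datetime, new_ttl_s: int) -> datetime:
    """Latest time a TTL-respecting resolver may still return the old address."""
    return cutover_at + timedelta(seconds=new_ttl_s)


# Hypothetical values: TTL lowered from 24 h to 300 s one week before cutover.
lowered = datetime(2026, 6, 24, 9, 0)
cutover = datetime(2026, 7, 1, 9, 0)
assert cutover >= earliest_safe_cutover(lowered, old_ttl_s=86_400)
print(worst_case_propagation(cutover, new_ttl_s=300))  # 2026-07-01 09:05:00
```

Resolvers that do not respect TTLs can hold the old address longer, which is one reason step 4 keeps watching traffic after the calculated window.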
+
+#### Traffic steering
+
+| Technique | Use case |
+|----------|----------|
+| **BGP** | Shift traffic with AS-path prepending or local preference |
+| **DNS** | Lower TTL, change A/AAAA records |
+| **Load balancer** | Change pool members, adjust health checks |
+| **GSLB** | Global Server Load Balancing (F5 GTM, NSX ALB) |
+| **Cloud DNS** | Amazon Route 53, Azure Traffic Manager, Google Cloud DNS |
+
+### 7. Database migration
+
+See individual DB files for details. Summary table:
+
+| DB | Method | RPO | RTO | Note |
+|----|--------|-----|-----|----------|
+| **PostgreSQL** | Streaming replication + Patroni switchover | 0 (sync) / a few MB of WAL (async) | minutes | Patroni auto-failover |
+| **MySQL** | Group Replication / async replication | 0 (Group Replication) / seconds (async) | minutes | InnoDB Cluster |
+| **Oracle** | Data Guard switchover | 0 (sync) | minutes | Far Sync for remote DCs |
+| **MSSQL** | Always On AG failover | 0 (sync) | minutes | Cloud witness |
+| **MongoDB** | Replica set election | seconds | < 1 min | Priority-based failover |
+| **Cassandra** | Multi-DC replication | eventual | 0 | Masterless; every DC accepts writes |
+
+### 8. Testing
+
+| Phase | What to test | Method |
+|------|-------------|--------|
+| **Pre-migration** | Application in new DC (isolated) | Dry run on replicated data |
+| **Cutover** | Functionality, availability, latency | Smoke test, synthetic transactions |
+| **Post-migration** | Performance, integration, monitoring | A/B comparison with baseline, canary traffic |
+| **Rollback** | Return to old DC | Tested rollback plan |
+
+### 9. Rollback plan
+
+Each wave must have a defined rollback:
+
+| Condition | Action |
+|----------|------|
+| Application fails to start in new DC | DNS switch back, stop replication |
+| Performance more than 20 % worse than baseline | Rollback, root cause analysis |
+| Integration failure (API timeout, DB connection) | Rollback, dependency check |
+| Security incident | Rollback, forensic analysis |
+
+Rollback must be tested **before** the real cutover.
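
The conditions in the table can be evaluated mechanically during the cutover window. The sketch below is one hedged way to do that; `WaveHealth`, its fields, and the use of latency as the performance metric are assumptions made for illustration, and only the 20 % threshold comes from the table.

```python
from dataclasses import dataclass


@dataclass
class WaveHealth:
    app_started: bool
    baseline_latency_ms: float  # measured in the source DC before cutover
    current_latency_ms: float   # measured in the target DC after cutover
    integration_ok: bool
    security_incident: bool


def rollback_reasons(h: WaveHealth, max_degradation: float = 0.20) -> list[str]:
    """Return the rollback conditions from the table that currently hold."""
    reasons = []
    if not h.app_started:
        reasons.append("application failed to start")
    if h.current_latency_ms > h.baseline_latency_ms * (1 + max_degradation):
        reasons.append("performance worse than baseline")
    if not h.integration_ok:
        reasons.append("integration failure")
    if h.security_incident:
        reasons.append("security incident")
    return reasons


# Hypothetical measurement: latency rose from 100 ms to 125 ms (25 % worse).
health = WaveHealth(True, baseline_latency_ms=100.0, current_latency_ms=125.0,
                    integration_ok=True, security_incident=False)
print(rollback_reasons(health))  # ['performance worse than baseline']
```

An empty list means none of the listed conditions is met; any non-empty result is a trigger for the rollback procedure of that wave.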
+
+---
+
+## Special cases
+
+### Mainframe migration
+
+- **IBM z/OS** — GDPS (Geographically Dispersed Parallel Sysplex)
+- HyperSwap for storage mirroring
+- Cross-system coupling facility (XCF)
+- Often the last migrated component
+
+### COTS applications (Oracle EBS, SAP)
+
+- Require vendor-specific migration procedures
+- Oracle EBS: AutoConfig, cloning (Rapid Clone)
+- SAP: System Copy (Homogeneous / Heterogeneous), SWPM, SUM
+- Licenses may need to be renegotiated when the hardware changes
+
+### Cloud migration (On-prem → Cloud)
+
+See [CLOUD.en.md](CLOUD.en.md) — migration strategies (6 Rs):
+
+| Strategy | Description |
+|-----------|-------|
+| **Re-host (Lift & Shift)** | VM → Cloud VM (AWS MGN, Azure Migrate) |
+| **Re-platform** | OS upgrade, managed DB (RDS, Cloud SQL) |
+| **Re-architect** | Application rewritten as cloud-native |
+| **Retire** | Decommission unnecessary applications |
+| **Retain** | Application stays on-prem (review later) |
+| **Repurchase** | SaaS replacement |
+
+---
+
+## Recommended approach per DC size
+
+| DC Size | VM Count | Recommended strategy | Duration | Team |
+|-------------|----------|---------------------|-------------|-----|
+| **Small** | < 50 | Big Bang (weekend) | 2–4 days | 3–5 people |
+| **Medium** | 50–500 | Phased (5–10 waves) | 2–8 weeks | 5–10 people |
+| **Large** | 500–5000 | Phased + Rolling | 3–12 months | 10–30 people |
+| **Enterprise** | 5000+ | Parallel Run / Rolling | 12–36 months | 30+ people |
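
As a quick lookup, the sizing table can be written as a function. This is only a sketch: the function name is hypothetical, and because the table's ranges share their boundary values (500 and 5000), the code assumes a boundary count belongs to the larger class.

```python
def recommend_strategy(vm_count: int) -> str:
    """Map a VM count to the recommended strategy from the sizing table."""
    if vm_count < 0:
        raise ValueError("vm_count must be non-negative")
    if vm_count < 50:
        return "Big Bang"
    if vm_count < 500:
        return "Phased"
    if vm_count < 5000:
        return "Phased + Rolling"
    return "Parallel Run / Rolling"


for n in (30, 300, 3000, 30000):
    print(n, recommend_strategy(n))
```

The VM count is a starting point only; downtime tolerance and replication options still decide between the strategies.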
+
+---
+
+## Related
+
+- [DATACENTERS.en.md](DATACENTERS.en.md) — DC topologies, secondary DC, deployment order
+- [CLOUD.en.md](CLOUD.en.md) — cloud migration strategies (6 Rs)
+- [DR.en.md](DR.en.md) — disaster recovery, RTO/RPO
+- [NETWORKING.en.md](NETWORKING.en.md) — BGP, DNS, VXLAN, traffic steering
+- [STORAGE.en.md](STORAGE.en.md) — storage replication
+
+## Sources
+
+Links, books, and standards: [sources/infrastructure/sources.en.md](sources/infrastructure/sources.en.md)
+
+*Last revision: 2026-06-12*
--- a/DC-MIGRATION.md
+++ b/DC-MIGRATION.md
@@ -0,0 +1,246 @@
+# 🏗️ Migrace datových center
+
+## Strategie migrace
+
+| Strategie | RTO | RPO | Riziko | Náklady | Doba trvání | Popis |
+|-----------|-----|-----|--------|---------|-------------|-------|
+| **Cold / Big Bang** | hodiny–dny | dny | Vysoké | Nízké | dny | Vše najednou vypnout, přesunout, zapnout |
+| **Phased / Wave** | minuty (per wave) | minuty | Střední | Střední | týdny–měsíce | Workloady po vlnách |
+| **Rolling** | 0 (live) | 0 | Nízké | Vysoké | měsíce | Live migration per VM/služba |
+| **Parallel Run** | 0 | 0 | Nízké | Velmi vysoké | měsíce | Oba DC v provozu, postupný přechod |
+| **Pilot Light** | hodiny | minuty | Střední | Nízké | týdny | Kritické služby v novém DC, ostatní se přesouvají |
+| **Lift & Shift** | hodiny | minuty | Střední | Nízké | týdny | VM/servery přesunuty bez změny konfigurace |
+| **Re-platform** | hodiny | minuty | Nízké | Střední | měsíce | Optimalizace během migrace (OS upgrade, resize) |
+| **Re-architect** | 0 | 0 | Nízké | Vysoké | měsíce–roky | Aplikace přepracována pro novou platformu |
+
+---
+
+## Rozhodovací strom
+
+```mermaid
+flowchart TD
+    Start(["Migrace DC"]) --> APP{"Aplikace\nstateful?"}
+    APP -->|"Ano"| DOWNTIME{"Toleruje\nvýpadek?"}
+    APP -->|"Ne"| ROLLING["Rolling / Parallel Run"]
+
+    DOWNTIME -->|"Ano, hodiny+"| COLD["Cold / Big Bang\nNejjednodušší, nejlevnější\nRiziko: vše najednou"]
+    DOWNTIME -->|"Ano, minuty"| PHASED["Phased / Wave\nPo aplikacích / byznys jednotkách"]
+    DOWNTIME -->|"Ne (zero downtime)"| SYNC{"Sync replikace\nmožná?"}
+
+    SYNC -->|"Ano, < 100 km"| ROLLING
+    SYNC -->|"Ne"| PARALLEL["Parallel Run\nOba DC aktivní, postupný cutover"]
+
+    ROLLING --> ROLL_HA{"VMware,\nHyper-V?"}
+    ROLL_HA -->|"Ano"| VMOTION["vMotion / Storage vMotion\nLive migration, 0 downtime"]
+    ROLL_HA -->|"Ne"| ROLL_REPL["Storage + DB replikace\nPostupný přesun workloadů"]
+```
+
+---
+
+## Fáze migrace
+
+### 1. Discovery a assessment
+
+| Úkol | Nástroje | Výstup |
+|------|----------|--------|
+| Inventarizace HW a SW | RVTools, NetBox, CMDB | Seznam všech serverů, VM, služeb |
+| Dependency mapping | ServiceNow, AppDynamics, manual | Aplikační dependency graf |
+| Traffic analysis | NetFlow, sFlow, vRNI | Bandwidth, latency, peak usage |
+| Výkonnostní baseline | Prometheus, Zabbix, vRealize | CPU/RAM/disk/network per workload |
+| Licenční audit | Flexera, SAM | Licence, support, compliance |
+
+**Výstupem je:** seznam workloadů s RTO/RPO, závislostmi a kritičností. Bez toho nelze naplánovat migraci.
+
+### 2. Plánování
+
+- **Wave plán** — rozdělení workloadů do migračních vln (10–50 VM na vlnu)
+- **Závislostní řazení** — DNS, NTP, LDAP, PKI musí být první
+- **Cutover okno** — časové okno pro přepnutí (typicky víkend)
+- **Rollback plán** — podmínky a postup pro vrácení
+- **Testovací plán** — co a jak testovat po migraci
+- **Komunikační plán** — kdo, kdy, jak je informován
+
+### 3. Příprava nového DC
+
+- **Infrastruktura** — DNS, NTP, DHCP, LDAP/AD, PKI, monitoring (viz [DATACENTERS.md](DATACENTERS.md) — deployment order)
+- **Network** — BGP peering, VXLAN/VLAN, firewall pravidla, load balancery
+- **Storage** — SAN zoning, NAS exports, Ceph cluster
+- **Virtualizace** — vCenter, Hyper-V cluster, Proxmox
+
+### 4. Replikace a synchronizace
+
+| Vrstva | Metoda | Nástroje |
+|--------|--------|----------|
+| **Storage (block)** | SAN sync/async mirror, LUN replication | NetApp SnapMirror, Dell EMC RecoverPoint, Pure ActiveCluster |
+| **Storage (file)** | DFS-R, rsync, robocopy | Windows DFS, Rsync |
+| **Storage (object)** | Cross-region replication | MinIO replication, S3 CRR |
+| **Databáze** | Log shipping, CDC, streaming replication | PostgreSQL Patroni, Oracle Data Guard, MSSQL AlwaysOn, MySQL Group Replication |
+| **VM** | Storage vMotion, replication | VMware vSphere Replication, Hyper-V Replica, Zerto |
+| **Kubernetes** | Velero + Restic, Rook Ceph mirror | Velero, Rook |
+
+### 5. Migrace workloadů
+
+#### Wave migrace (doporučeno pro střední/větší DC)
+
+```mermaid
+gantt
+    title Wave migrace
+    dateFormat  YYYY-MM-DD
+    section Wave 1 - Core
+    DNS, NTP, LDAP    :done, w1a, 2026-07-01, 3d
+    Monitoring + logging :done, w1b, after w1a, 2d
+    section Wave 2 - Network
+    Load balancers     :active, w2a, 2026-07-06, 2d
+    Firewalls          :active, w2b, 2026-07-08, 2d
+    section Wave 3 - Storage
+    NAS migrace        :w3a, 2026-07-10, 5d
+    SAN replication    :w3b, 2026-07-10, 3d
+    section Wave 4 - Dev/Test
+    Dev VMs            :w4a, 2026-07-15, 5d
+    section Wave 5 - Prod tier 3
+    Internal apps      :w5a, 2026-07-22, 5d
+    section Wave 6 - Prod tier 2
+    Business apps      :w6a, 2026-07-29, 5d
+    section Wave 7 - Prod tier 1
+    Critical apps      :w7a, 2026-08-05, 5d
+```
+
+#### Typický postup jedné vlny:
+
+1. **Den -7**: Sync replikace dat (initial seed)
+2. **Den -1**: Incremental sync, final test
+3. **Den 0 (cutover)**:
+   - Zastavení aplikace ve zdrojovém DC
+   - Final sync (poslední delta)
+   - Start aplikace v cílovém DC
+   - DNS/Traffic switch
+   - Smoke test
+4. **Den +1**: Monitorování (výkon, chyby, lag)
+5. **Den +7**: Rollback window end (potvrzení úspěchu)
+
+### 6. Síťové strategie
+
+#### IP re-addressing
+
+| Přístup | Popis | Výhody | Nevýhody |
+|---------|-------|--------|----------|
+| **Keep IP** | Stejné IP, BGP anycast nebo stretch VLAN | Není třeba měnit konfiguraci aplikací | Stretched VLAN/L2 omezení |
+| **Change IP** | Nový IP rozsah, DNS/BGP routing změna | Čistá architektura | Změny konfigurací, DNS TTL |
+| **NAT překlad** | NAT mezi starým a novým IP spacem | Bez změny aplikací | Latence, komplexita troubleshooting |
+
+**Keep IP** je možný jen:
+- L2 stretch mezi DC (VXLAN, OTV) — omezeno vzdáleností
+- BGP anycast pro VIP (load balancery)
+- Aplikace tolerující ARP cache změny
+
+#### DNS cutover
+
+```
+1. Snížit TTL na 60–300 s (týden předem)
+2. Při cutoveru změnit A/AAAA záznamy na nové IP
+3. Počkat na propagaci (dle TTL)
+4. Monitorovat traffic
+```
+
+#### Traffic steering
+
+| Technika | Use case |
+|----------|----------|
+| **BGP** | Změna AS path / local pref pro přesměrování trafficu |
+| **DNS** | Snížení TTL, change A records |
+| **Load balancer** | Změna pool members, health check |
+| **GSLB** | Global Server Load Balancing (F5 GTM, NSX ALB) |
+| **Cloud DNS** | AWS Route53, Azure Traffic Manager, Google Cloud DNS |
+
+### 7. Databázová migrace
+
+Viz detail v jednotlivých DB souborech. Tabulka shrnutí:
+
+| DB | Metoda | RPO | RTO | Poznámka |
+|----|--------|-----|-----|----------|
+| **PostgreSQL** | Streaming replication + Patroni switchover | 0 (sync) / ~MB (async) | min | Patroni auto-failover |
+| **MySQL** | Group Replication / async replication | 0 (sync) / sekundy | min | InnoDB Cluster |
+| **Oracle** | Data Guard switchover | 0 (sync) | min | Far sync pro vzdálené DC |
+| **MSSQL** | AlwaysOn AG failover | 0 (sync) | min | Cloud witness |
+| **MongoDB** | Replica set election | sekundy | < 1 min | Priority-based failover |
+| **Cassandra** | Multi-DC replication | eventual | 0 | Nativní multi-master |
+
+### 8. Testování
+
+| Fáze | Co testovat | Metoda |
+|------|-------------|--------|
+| **Pre-migrace** | Aplikace v novém DC (izolovaně) | Dry run na replikovaných datech |
+| **Cutover** | Funkčnost, dostupnost, latence | Smoke test, synthetic transactions |
+| **Post-migrace** | Výkon, integrace, monitoring | A/B comparison s baseline, canary traffic |
+| **Rollback** | Návrat ke starému DC | Testovaný rollback plán |
+
+### 9. Rollback plán
+
+Každá vlna musí mít definovaný rollback:
+
+| Podmínka | Akce |
+|----------|------|
+| Aplikace nestartuje v novém DC | Přepnutí DNS zpět, zastavení replikace |
+| Výkon horší než baseline (o > 20 %) | Rollback, analýza příčiny |
+| Integrační selhání (API timeout, DB connection) | Rollback, dependency check |
+| Bezpečnostní incident | Rollback, forenzní analýza |
+
+Rollback by měl být otestován **před** reálným cutoverem.
+
+---
+
+## Speciální případy
+
+### Mainframe migrace
+
+- **IBM z/OS** — GDPS (Geographically Dispersed Parallel Sysplex)
+- HyperSwap pro storage mirroring
+- Cross-system coupling facility (XCF)
+- Často poslední migrovaná komponenta
+
+### COTS aplikace (Oracle EBS, SAP)
+
+- Vyžadují specifické migrační postupy výrobce
+- Oracle EBS: Autoconfig, cloning (ADXLC)
+- SAP: System Copy (Homogeneous / Heterogeneous), SWPM, SUM
+- Licenční re-licensing při změně HW
+
+### Cloud migrace (On-prem → Cloud)
+
+Viz [CLOUD.md](CLOUD.md) — migrační strategie (6 Rs):
+
+| Strategie | Popis |
+|-----------|-------|
+| **Re-host (Lift & Shift)** | VM → Cloud VM (AWS MGN, Azure Migrate) |
+| **Re-platform** | OS upgrade, managed DB (RDS, Cloud SQL) |
+| **Re-architect** | Aplikace přepsána na cloud-native |
+| **Retire** | Zastavení nepotřebných aplikací |
+| **Retain** | Aplikace zůstává on-prem (revize později) |
+| **Repurchase** | SaaS náhrada |
+
+---
+
+## Doporučený postup per velikost DC
+
+| Velikost DC | Počet VM | Doporučená strategie | Doba trvání | Tým |
+|-------------|----------|---------------------|-------------|-----|
+| **Small** | < 50 | Big Bang (víkend) | 2–4 dny | 3–5 lidí |
+| **Medium** | 50–500 | Phased (5–10 wave) | 2–8 týdnů | 5–10 lidí |
+| **Large** | 500–5000 | Phased + Rolling | 3–12 měsíců | 10–30 lidí |
+| **Enterprise** | 5000+ | Parallel Run / Rolling | 12–36 měsíců | 30+ lidí |
+
+---
+
+## Související
+
+- [DATACENTERS.md](DATACENTERS.md) — DC topologie, sekundární DC, deployment order
+- [CLOUD.md](CLOUD.md) — cloud migrační strategie (6 Rs)
+- [DR.md](DR.md) — disaster recovery, RTO/RPO
+- [NETWORKING.md](NETWORKING.md) — BGP, DNS, VXLAN, traffic steering
+- [STORAGE.md](STORAGE.md) — storage replikace
+
+## Zdroje
+
+Odkazy, knihy a standardy: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
+
+*Poslední revize: 2026-06-12*
--- a/DR.en.md
+++ b/DR.en.md
@@ -0,0 +1,336 @@
+# 🔄 Disaster Recovery and Business Continuity
+
+## Terminology
+
+| Abbreviation | Meaning | Description |
+|---------|--------|-------|
+| **RTO** | Recovery Time Objective | Maximum time from outage to service recovery |
+| **RPO** | Recovery Point Objective | Maximum acceptable data loss, expressed as time back to the last recoverable point (backup or replica) |
+| **MTD** | Maximum Tolerable Downtime | Total outage duration an organization can survive |
+| **WRT** | Work Recovery Time | Time needed for full operations recovery after IT restoration |
+| **MTBF** | Mean Time Between Failures | Average operating time between two consecutive failures |
+| **MTTR** | Mean Time To Repair | Average time to restore service after a failure |
+| **SLA** | Service Level Agreement | Contractual availability commitment |
+| **SLO** | Service Level Objective | Internal availability target |
+| **SLI** | Service Level Indicator | Measured availability value |
+
+### Relationship between RTO, RPO, MTD, WRT
+
+```
+Last backup ──── RPO ────► Outage ──── RTO ────► Service running ──── WRT ────► Full operations
+                  │                     │                              │
+                  ▼                     ▼                              ▼
+              Lost data       Time without service           Time to full capacity
+
+       MTD = RTO + WRT (max. time the business tolerates)
+```
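
Read the other way round, the relationship gives the RTO budget once the business has fixed the MTD. A minimal sketch, with a hypothetical function name and made-up figures:

```python
from datetime import timedelta


def max_rto(mtd: timedelta, wrt: timedelta) -> timedelta:
    """Largest RTO that still satisfies MTD = RTO + WRT."""
    if wrt > mtd:
        raise ValueError("WRT alone already exceeds MTD")
    return mtd - wrt


# Hypothetical figures: the business tolerates 8 h, work recovery takes 2 h.
print(max_rto(timedelta(hours=8), timedelta(hours=2)))  # 6:00:00
```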
+
+---
+
+## Uptime calculation
+
+### Nines table
+
+| Level | Uptime | Downtime / year | Downtime / month | Downtime / week |
+|--------|--------|---------------|------------------|------------------|
+| 90 % (one nine) | 0.9 | 36.5 days | 72 h | 16.8 h |
+| 99 % (two nines) | 0.99 | 3.65 days | 7.2 h | 1.68 h |
+| 99.5 % | 0.995 | 1.83 days | 3.6 h | 50.4 min |
+| 99.9 % (three nines) | 0.999 | 8.76 h | 43.2 min | 10.1 min |
+| 99.95 % | 0.9995 | 4.38 h | 21.6 min | 5.04 min |
+| 99.99 % (four nines) | 0.9999 | 52.6 min | 4.32 min | 1.01 min |
+| 99.995 % | 0.99995 | 26.3 min | 2.16 min | 30.2 s |
+| 99.999 % (five nines) | 0.99999 | 5.26 min | 25.9 s | 6.05 s |
+| 99.9999 % (six nines) | 0.999999 | 31.6 s | 2.59 s | 0.605 s |
+
+### Calculation
+
+```
+Availability = (Total time - Downtime) / Total time × 100 %
+
+Example:
+  Year = 365 × 24 × 60 = 525,600 minutes
+  Target: 99.9 % → allowed downtime = 525,600 × (1 - 0.999) = 525.6 minutes = 8.76 h
+
+Combined availability (chain of dependencies):
+  A_web  = 99.9 % (3 nines)
+  A_api  = 99.99 % (4 nines)
+  A_db   = 99.999 % (5 nines)
+
+  A_total = 0.999 × 0.9999 × 0.99999 = 0.99889 ≈ 99.89 % (less than 3 nines!)
+
+Parallel availability (redundancy):
+  A_total = 1 - (1 - A_1) × (1 - A_2) × ... × (1 - A_n)
+
+  Example: 2 servers with 99% availability
+  A_total = 1 - (1-0.99) × (1-0.99) = 1 - 0.01 × 0.01 = 0.9999 (99.99 %)
+```
+
+### Calculator
+
+```python
+import math
+from typing import Iterable
+
+MINUTES_PER_DAY = 24 * 60
+
+
+def uptime_percent_to_downtime(pct: float, period_days: float = 365) -> float:
+    """Return the allowed downtime in minutes for an uptime percentage (e.g. 99.9)."""
+    total_minutes = period_days * MINUTES_PER_DAY
+    return total_minutes * (1 - pct / 100)
+
+
+def downtime_to_uptime_percent(downtime_minutes: float, period_days: float = 365) -> float:
+    """Return the uptime percentage for a given downtime in minutes."""
+    total_minutes = period_days * MINUTES_PER_DAY
+    return (1 - downtime_minutes / total_minutes) * 100
+
+
+def combined_availability(availabilities: Iterable[float]) -> float:
+    """Availability of components in series; inputs are fractions (0.999), not percent."""
+    return math.prod(availabilities)
+
+
+def redundant_availability(availabilities: Iterable[float]) -> float:
+    """Availability of redundant (parallel) components; inputs are fractions."""
+    # The system is down only when every component is down at the same time.
+    return 1 - math.prod(1 - a for a in availabilities)
+```
+
+### Calculation fallacies
+
+- **Combined availability is not a sum** — adding another dependency always reduces total availability
+- **Redundancy is not free** — adding a standby component requires failure detection + failover (MTTR does not improve automatically)
+- **SLA is not a guarantee** — providers often calculate SLA as a monthly average, not per-incident
+- **Measurement is key** — without SLI, SLO cannot be verified; "unmeasured availability does not exist"
+- **Planned maintenance** — sometimes counted as uptime, sometimes not (depends on SLA definition)
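
The point about redundancy can be made concrete with a simple coverage model. This is an illustrative assumption rather than part of the formulas above: `failover_coverage` is the probability that a primary failure is detected and the switch to the standby succeeds, and the time spent on the failover itself is ignored.

```python
def active_standby_availability(a_primary: float, a_standby: float,
                                failover_coverage: float = 1.0) -> float:
    """Availability of an active/standby pair under a simple coverage model.

    With failover_coverage = 1.0 this equals the parallel formula
    1 - (1 - a_primary) * (1 - a_standby).
    """
    for name, value in (("a_primary", a_primary), ("a_standby", a_standby),
                        ("failover_coverage", failover_coverage)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    # The standby only helps when the primary is down AND failover succeeds.
    return a_primary + (1 - a_primary) * failover_coverage * a_standby


print(f"{active_standby_availability(0.99, 0.99):.4f}")        # 0.9999
print(f"{active_standby_availability(0.99, 0.99, 0.90):.4f}")  # 0.9989
```

In this model, a failover that works nine times out of ten costs the pair roughly one of the two nines that redundancy was supposed to add.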
+
+---
+
+## DR scenarios
+
+### Classification
+
+| Category | Scenario | Typical RTO | Typical RPO | Frequency |
+|-----------|--------|-------------|-------------|-----------|
+| **Site** | Entire DC / region outage | hours | minutes | Low |
+| **Infrastructure** | HW failure (storage, switch, server) | minutes–hours | seconds | Medium |
+| **Software** | OS, application, DB failure | minutes | seconds | High |
+| **Data** | Data corruption, deletion, cryptolocker | hours | backup point | Low–medium |
+| **Human** | Wrong deployment, config change | minutes–hours | seconds | Medium |
+| **Security** | Attack, breach, ransomware | days | before attack | Low |
+| **Network** | Connectivity outage, DDoS | minutes–hours | N/A | Medium |
+| **Cloud provider** | Regional outage (AWS, Azure, GCP) | hours | minutes | Very low |
+
+### Scenario details
+
+#### Site / Region failure
+
+| Aspect | Description |
+|--------|-------|
+| **Cause** | Blackout, fire, flood, earthquake, cloud provider outage |
+| **Prevention** | Multi-AZ architecture, multi-region deployment, active-active |
+| **Mitigation** | Automatic DNS failover (Route53, Azure Traffic Manager), replica in DR region |
+| **Testing** | Game day: shut down primary region, verify automatic failover |
+
+#### Data corruption / human error
+
+| Aspect | Description |
+|--------|-------|
+| **Cause** | Wrong SQL command (DELETE without WHERE), accidentally deleted bucket, bad migration |
+| **Prevention** | RBAC, MFA for destructive operations, change management, SQL peer review |
+| **Mitigation** | Point-in-time recovery (PITR), transaction log replay, immutable backups |
+| **Testing** | Restore backup to isolated environment, verify data integrity |
+
+#### Ransomware / cyber attack
+
+| Aspect | Description |
+|--------|-------|
+| **Cause** | Attack on production systems, data encryption, exfiltration |
+| **Prevention** | Immutable backups (object lock), air-gapped backups, network segmentation |
+| **Mitigation** | Restore from clean backup, rebuild infrastructure from IaC |
+| **Testing** | Regular restore in isolated network, verify backup is not infected |
+
+---
+
+## Prevention — strategies
+
+### Backup strategies
+
+| Approach | Description | Use case |
+|---------|-------|----------|
+| **3-2-1 rule** | 3 copies, 2 different media, 1 off-site | Universal |
+| **3-2-1-0** | + 0 errors after restore (testing) | Enterprise, compliance |
+| **GFS (Grandfather-Father-Son)** | Daily, weekly, monthly rotation | Long-term archive |
+| **Incremental forever** | Full backup 1×, then only changes | Large data volumes |
+| **Reverse incremental** | Newest restore point is always a full backup; older points are kept as reverse increments | Fast recovery |
+
+### Backup methods
+
+| Method | RPO | RTO | Storage | Suitable for |
+|--------|-----|-----|----------|------------|
+| **Full backup** | Last full | Full restore time | Large | Small data, weekly |
+| **Incremental** | Last incremental | Full + all incrementals | Small | Large data, daily |
+| **Differential** | Last diff | Full + last diff | Medium | Compromise |
+| **Snapshot** | Snapshot point-in-time | seconds | Copy-on-write | VM, storage array |
+| **Continuous (CDC)** | < 1 s | Seconds | Log stream | DB (binlog, WAL) |
+| **PITR** | Any point in time | Depends on volume | Full + WAL | RDS, PostgreSQL, SQL Server |
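
The RTO column for incremental and differential backups follows from how many backup sets a restore has to read. The sketch below illustrates this; `Backup` and `restore_chain` are hypothetical names, and it assumes a schedule that uses either incrementals or differentials after each full backup, not a mix of both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    day: int
    kind: str  # "full", "incremental" or "differential"


def restore_chain(backups: list[Backup], target_day: int) -> list[Backup]:
    """Backups needed to restore the state as of target_day, oldest first."""
    usable = sorted((b for b in backups if b.day <= target_day), key=lambda b: b.day)
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup on or before target_day")
    base = fulls[-1]
    later = [b for b in usable if b.day > base.day]
    incrementals = [b for b in later if b.kind == "incremental"]
    differentials = [b for b in later if b.kind == "differential"]
    if incrementals:
        # Each incremental holds only the changes since the previous backup.
        return [base, *incrementals]
    # A differential holds all changes since the full, so the newest one suffices.
    return [base, *differentials[-1:]]


weekly_incr = [Backup(0, "full")] + [Backup(d, "incremental") for d in range(1, 7)]
weekly_diff = [Backup(0, "full")] + [Backup(d, "differential") for d in range(1, 7)]
print(len(restore_chain(weekly_incr, 5)))  # 6
print(len(restore_chain(weekly_diff, 5)))  # 2
```

The incremental schedule stores less per day but needs six sets to restore day 5, while the differential schedule needs two.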
+
+### Backup immutability
+
+Key protection against ransomware:
+
+| Technique | Description |
+|----------|-------|
+| **Object Lock (WORM)** | Backup cannot be deleted or overwritten for a defined retention period (S3 Object Lock, Azure immutable Blob Storage) |
+| **Air gap** | Backup is physically separated from the production network (offline disk, tape, cloud without VPN) |
+| **Isolated backup network** | Backup traffic goes through a dedicated network without access from production VLAN |
+| **Out-of-band access** | Backup management console is not accessible from the production network |
+
+---
+
+## DR architectures
+
+### Multi-AZ (Single region)
+
+```
+Region ┌────────────────────────────────────┐
+       │  AZ-1              AZ-2            │
+       │  ┌──────────┐     ┌──────────┐     │
+       │  │  App     │     │  App     │     │
+       │  └─────┬────┘     └─────┬────┘     │
+       │        │                │          │
+       │  ┌─────▼────────────────▼─────┐    │
+       │  │  Load Balancer (cross-AZ)  │    │
+       │  └─────────────┬──────────────┘    │
+       │                │                   │
+       │  ┌─────────────▼──────────────┐    │
+       │  │  DB Primary (AZ-1)         │    │
+       │  │  DB Standby (AZ-2)         │    │
+       │  │  Synchronous replication   │    │
+       │  └────────────────────────────┘    │
+       └────────────────────────────────────┘
+```
+
+- RTO: minutes (automatic failover)
+- RPO: 0 (sync replication)
+- Protection: against AZ failure, not region failure
+
+### Multi-Region
+
+```
+Region A (Primary)                    Region B (DR)
+┌─────────────────────┐              ┌─────────────────────┐
+│  ┌───────────────┐  │              │  ┌───────────────┐  │
+│  │  App + DB     │  │              │  │  App + DB     │  │
+│  │  Active       │──┼──Async───────┼─►│  Standby      │  │
+│  └───────────────┘  │  replication │  └───────────────┘  │
+│         │           │              │         │           │
+│  ┌──────▼───────┐   │              │  ┌──────▼───────┐   │
+│  │  DNS / GSLB  │   │              │  │  DNS / GSLB  │   │
+│  └──────┬───────┘   │              │  └──────┬───────┘   │
+└─────────┼───────────┘              └─────────┼───────────┘
+          │                                    │
+          └──────────── Traffic Manager ───────┘
+```
+
+| Variant | RTO | RPO | Cost | Failover |
+|----------|-----|-----|---------|----------|
+| **Active-Passive** | minutes–hours | seconds | Medium | Manual / auto |
+| **Active-Active** | seconds | < 1 s | High | Automatic (DNS) |
+| **Pilot Light** | tens of minutes | minutes | Low | Manual scaling |
+| **Warm Standby** | minutes | seconds | Medium–high | Automatic (scaled-down copy scales up) |
+| **Backup & Restore** | hours | 24 h | Low | Manual |
+
+### On-prem → Cloud DR (Hybrid)
+
+```
+On-prem DC                              Cloud (DR)
+┌─────────────────────┐              ┌─────────────────────┐
+│  ┌───────────────┐  │              │  ┌───────────────┐  │
+│  │  Application  │  │              │  │  VM / App     │  │
+│  │  + DB         │  │              │  │  + DB replica │  │
+│  └───────┬───────┘  │              │  └───────┬───────┘  │
+│          │          │              │          │          │
+│  ┌───────▼───────┐  │  site-to-site│  ┌───────▼───────┐  │
+│  │  Backup proxy │──┼────VPN───────┼─►│  Backup store │  │
+│  └───────────────┘  │              │  └───────────────┘  │
+│                     │              │                     │
+│  ┌───────────────┐  │              │  ┌───────────────┐  │
+│  │  Tape / NAS   │  │              │  │  Veeam / Zerto│  │
+│  └───────────────┘  │              │  └───────────────┘  │
+└─────────────────────┘              └─────────────────────┘
+```
+
+- **RTO**: tens of minutes (depends on VM startup)
+- **RPO**: minutes–hours (depends on replication tool)
+- **Tools**: Veeam, Zerto, Azure Site Recovery, AWS MGN, Commvault
+- **Use case**: enterprise with on-prem DC that needs DR without a second DC
+
+---
+
+## DR testing
+
+### Test types
+
+| Type | Description | Frequency | Risk |
+|-----|-------|-----------|--------|
+| **Tabletop exercise** | Manual scenario walkthrough, no impact on production | Monthly | None |
+| **Walkthrough** | Runbook verification, ensure everyone knows what to do | Quarterly | None |
+| **Component test** | Test of a single component (e.g., restore one DB) | Monthly | Low |
+| **Integrated test** | Test of the entire stack in isolated environment | Quarterly | Low |
+| **Full failover test** | Production failover to DR site | Annually | High |
+| **Chaos experiment** | Targeted fault injection into production | Continuous | Medium |
+
+### Runbook structure
+
+Each DR scenario should have a runbook:
+
+```yaml
+scenario: "Region A failure"
+triggers:
+  - "CloudWatch alarm: Region A health check 5× timeout"
+  - "PagerDuty incident P0"
+decision_tree: |
+  1. Verify: is Region A really unavailable? (check from 3 different locations)
+  2. Decide: is RTO at risk? If < 30 % RTO remaining → failover
+  3. Failover: run playbook `dr-failover-region-b`
+  4. Verification: smoke tests in Region B
+  5. Communication: status page + stakeholders
+rollback: |
+  1. After Region A recovery → replicate changes from B back to A
+  2. Repoint DNS to A
+  3. Verify data consistency
+  4. Shut down Region B (or keep as hot standby)
+contacts:
+  primary: "on-call@example.com"
+  escalation: "infra-lead@example.com"
+  management: "vp-engineering@example.com"
+```
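
Step 2 of the decision tree is the part most easily automated. A minimal sketch, assuming the outage start time is known and using a hypothetical function name; only the 30 % threshold comes from the runbook:

```python
from datetime import datetime, timedelta


def should_fail_over(outage_start: datetime, now: datetime, rto: timedelta,
                     min_remaining: float = 0.30) -> bool:
    """True once less than min_remaining of the RTO budget is left."""
    elapsed = now - outage_start
    remaining_fraction = 1 - elapsed / rto  # timedelta / timedelta gives a float
    return remaining_fraction < min_remaining


# Hypothetical outage with a one-hour RTO.
start = datetime(2026, 7, 1, 12, 0)
rto = timedelta(hours=1)
print(should_fail_over(start, start + timedelta(minutes=30), rto))  # False
print(should_fail_over(start, start + timedelta(minutes=45), rto))  # True
```

The threshold presumably needs to leave enough of the budget for the failover playbook itself to complete, so it is worth checking against measured failover times from DR tests.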
+
+---
+
+## Best practices
+
+- **Test recovery, not backup** — a backup without tested recovery is not a backup
+- **Automate DR** — Terraform / Ansible for DR environment spin-up, DNS failover
+- **Document runbooks** — every scenario, contact, decision tree
+- **Expect failure** — design for failure, don't expect everything to work
+- **Don't underestimate WRT** — service recovery does not mean full operations (data warming, cache, connections)
+- **Align RTO/RPO with business** — technical capabilities must match business requirements
+- **Monitor SLI** — without data, SLO cannot be verified
+- **DR is not just IT** — communication, PR, legal, compliance
+
+---
+
+## Related
+
+- [CLOUD.en.md](CLOUD.en.md) — cloud DR strategy, AWS/Azure/GCP specific
+- [DATACENTERS.en.md](DATACENTERS.en.md) — DC redundancy, Tier classification
+- [MONITORING.en.md](MONITORING.en.md) — alerting, SLI/SLO/SLA
+- [CICD.en.md](CICD.en.md) — deployment strategy, rollback
+- [STORAGE.en.md](STORAGE.en.md) — backup storage, replication
+
+## Sources
+
+Links, books, and standards: [sources/infrastructure/sources.en.md](sources/infrastructure/sources.en.md)
+
+*Last revised: 2026-06-11*
--- a/DR.md
+++ b/DR.md
@@ -0,0 +1,336 @@
+# 🔄 Disaster Recovery a Business Continuity
+
+## Terminologie
+
+| Zkratka | Význam | Popis |
+|---------|--------|-------|
+| **RTO** | Recovery Time Objective | Maximální doba od výpadku do obnovení služby |
+| **RPO** | Recovery Point Objective | Maximální přípustná ztráta dat (čas od poslední zálohy) |
+| **MTD** | Maximum Tolerable Downtime | Celková doba výpadku, kterou organizace přežije |
+| **WRT** | Work Recovery Time | Čas potřebný k plnému obnovení provozu po obnovení IT |
+| **MTBF** | Mean Time Between Failures | Střední doba mezi poruchami |
+| **MTTR** | Mean Time To Repair | Střední doba opravy |
+| **SLA** | Service Level Agreement | Smluvní závazek dostupnosti |
+| **SLO** | Service Level Objective | Interní cíl dostupnosti |
+| **SLI** | Service Level Indicator | Naměřená hodnota dostupnosti |
+
+### Vztah RTO, RPO, MTD, WRT
+
+```
+Poslední záloha ──── RPO ────► Výpadek ──── RTO ────► Služba běží ──── WRT ────► Plný provoz
+                      │                      │                          │
+                      ▼                      ▼                          ▼
+                Ztracená data         Čas bez služby          Čas do plného výkonu
+
+       MTD = RTO + WRT (max. doba, kterou firma toleruje)
+```
+
+---
+
+## Výpočet uptimu
+
+### Tabulka devítek
+
+| Úroveň | Uptime | Downtime / rok | Downtime / měsíc | Downtime / týden |
+|--------|--------|---------------|------------------|------------------|
+| 90 % (jedna devítka) | 0.9 | 36,5 dne | 72 h | 16,8 h |
+| 99 % (dvě devítky) | 0.99 | 3,65 dne | 7,2 h | 1,68 h |
+| 99,5 % | 0.995 | 1,83 dne | 3,6 h | 50,4 min |
+| 99,9 % (tři devítky) | 0.999 | 8,76 h | 43,2 min | 10,1 min |
+| 99,95 % | 0.9995 | 4,38 h | 21,6 min | 5,04 min |
+| 99,99 % (čtyři devítky) | 0.9999 | 52,6 min | 4,32 min | 1,01 min |
+| 99,995 % | 0.99995 | 26,3 min | 2,16 min | 30,2 s |
+| 99,999 % (pět devítek) | 0.99999 | 5,26 min | 25,9 s | 6,05 s |
+| 99,9999 % (šest devítek) | 0.999999 | 31,6 s | 2,59 s | 0,605 s |
+
+### Výpočet
+
+```
+Dostupnost = (Celkový čas - Downtime) / Celkový čas × 100 %
+
+Příklad:
+  Rok = 365 × 24 × 60 = 525 600 minut
+  Cíl: 99,9 % → povolený downtime = 525 600 × (1 - 0,999) = 525,6 minut = 8,76 h
+
+Složená dostupnost (řetězec závislostí):
+  A_web = 99,9 % (3 devítky)
+  A_api  = 99,99 % (4 devítky)
+  A_db   = 99,999 % (5 devítek)
+
+  A_celkem = 0,999 × 0,9999 × 0,99999 = 0,99889 ≈ 99,89 % (méně než 3 devítky!)
+
+Paralelní dostupnost (redundance):
+  A_celkem = 1 - (1 - A_1) × (1 - A_2) × ... × (1 - A_n)
+
+  Příklad: 2 servery s 99% dostupností
+  A_celkem = 1 - (1-0,99) × (1-0,99) = 1 - 0,01 × 0,01 = 0,9999 (99,99 %)
+```
+
+### Calculator
+
+```python
+def uptime_percent_to_downtime(pct, period_days=365):
+    """Convert an uptime percentage (e.g. 99.9) to allowed downtime in minutes."""
+    total_minutes = period_days * 24 * 60
+    allowed_downtime = total_minutes * (1 - pct / 100)
+    return allowed_downtime  # minutes
+
+def downtime_to_uptime_percent(downtime_minutes, period_days=365):
+    """Convert downtime in minutes to an uptime percentage for the period."""
+    total_minutes = period_days * 24 * 60
+    return (1 - downtime_minutes / total_minutes) * 100
+
+def combined_availability(availabilities):
+    """Composite availability of components in series (fractions, e.g. 0.999)."""
+    result = 1.0
+    for a in availabilities:
+        result *= a  # every dependency multiplies the availability down
+    return result
+
+def redundant_availability(availabilities):
+    """Parallel availability of redundant components (fractions, e.g. 0.99)."""
+    result = 1.0
+    for a in availabilities:
+        result *= (1 - a)  # probability that all replicas are down at once
+    return 1 - result
+```
+
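The parallel formula can also be read in reverse: how many redundant replicas are needed to reach a target? A rough sketch, assuming replica failures are independent (rarely fully true in practice) and that availabilities are given as fractions; `replicas_needed` and its `eps` tolerance are illustrative names, not part of the original calculator:

```python
def replicas_needed(component_availability, target_availability, eps=1e-12):
    """Smallest n with 1 - (1 - a)^n >= target, assuming independent replicas."""
    if not (0 < component_availability < 1 and 0 < target_availability < 1):
        raise ValueError("availabilities must be fractions strictly between 0 and 1")
    single_unavailability = 1 - component_availability
    target_unavailability = 1 - target_availability
    n = 1
    unavailability = single_unavailability
    # Add replicas until the chance that all of them are down at once is small enough.
    # eps absorbs floating-point noise when the result lands exactly on the target.
    while unavailability > target_unavailability + eps:
        n += 1
        unavailability *= single_unavailability
    return n


print(replicas_needed(0.99, 0.9999))   # the two-server example above
print(replicas_needed(0.99, 0.99999))
```

The result is a lower bound: shared dependencies (power, network, the failover mechanism itself) correlate failures and reduce the real gain.
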
+### Calculation pitfalls
+
+- **Composite availability is not a sum** — every added dependency lowers the overall availability
+- **Redundancy is not free** — a standby component needs failure detection and failover (MTTR does not improve automatically)
+- **An SLA is not a guarantee** — providers often calculate the SLA as a monthly average, not per incident
+- **Measurement is key** — without SLIs an SLO cannot be verified; "availability that is not measured does not exist"
+- **Planned maintenance** — sometimes counted as uptime, sometimes not (depends on the SLA definition)
+
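The effect of the monthly window and of planned maintenance can be made concrete. A small sketch, assuming a 30-day month and a hypothetical list of outages; the `Outage` type and the sample durations are invented for illustration:

```python
from dataclasses import dataclass


@dataclass
class Outage:
    minutes: float
    planned: bool = False  # True for an announced maintenance window


def monthly_availability(outages, days_in_month=30, count_planned=True):
    """Availability in percent over one month, optionally ignoring planned downtime."""
    total_minutes = days_in_month * 24 * 60
    down = sum(o.minutes for o in outages if count_planned or not o.planned)
    return (1 - down / total_minutes) * 100


# Hypothetical month: one 40-minute incident plus a 120-minute maintenance window.
outages = [Outage(40), Outage(120, planned=True)]
print(round(monthly_availability(outages), 3))                       # planned downtime counted
print(round(monthly_availability(outages, count_planned=False), 3))  # planned downtime excluded
```

With these figures the same month misses a 99.9% target when maintenance is counted and meets it when maintenance is excluded, which is why the SLA definition matters.
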
+---
+
+## DR scenarios
+
+### Classification
+
+| Category | Scenario | Typical RTO | Typical RPO | Frequency |
+|----------|----------|-------------|-------------|-----------|
+| **Site** | Outage of an entire DC / region | hours | minutes | Low |
+| **Infrastructure** | HW failure (storage, switch, server) | minutes–hours | seconds | Medium |
+| **Software** | OS, application, or DB failure | minutes | seconds | High |
+| **Data** | Data corruption, deletion, cryptolocker | hours | time of the last backup | Low–medium |
+| **Human** | Faulty deployment, config change | minutes–hours | seconds | Medium |
+| **Security** | Attack, breach, ransomware | days | point before the attack | Low |
+| **Network** | Connectivity outage, DDoS | minutes–hours | N/A | Medium |
+| **Cloud provider** | Regional outage (AWS, Azure, GCP) | hours | minutes | Very low |
+
+### Scenario details
+
+#### Site / Region failure
+
+| Aspect | Description |
+|--------|-------------|
+| **Cause** | Blackout, fire, flood, earthquake, cloud provider outage |
+| **Prevention** | Multi-AZ architecture, multi-region deployment, active-active |
+| **Mitigation** | Automatic DNS failover (Route 53, Azure Traffic Manager), replica in the DR region |
+| **Testing** | Game day: shut down the primary region, verify automatic failover |
+
+#### Data corruption / human error
+
+| Aspect | Description |
+|--------|-------------|
+| **Cause** | Faulty SQL statement (DELETE without WHERE), accidentally deleted bucket, faulty migration |
+| **Prevention** | RBAC, MFA for destructive operations, change management, peer review of SQL |
+| **Mitigation** | Point-in-time recovery (PITR), transaction log replay, immutable backups |
+| **Testing** | Restore a backup into an isolated environment, verify data integrity |
+
+#### Ransomware / cyber attack
+
+| Aspect | Description |
+|--------|-------------|
+| **Cause** | Attack on production systems, data encryption, exfiltration |
+| **Prevention** | Immutable backups (object lock), air-gapped backups, network segmentation |
+| **Mitigation** | Restore from a clean backup, rebuild infrastructure from IaC |
+| **Testing** | Regular restore in an isolated network, verify that the backup is not infected |
+
+---
+
+## Prevention — strategies
+
+### Backup strategies
+
+| Approach | Description | Use case |
+|----------|-------------|----------|
+| **3-2-1 rule** | 3 copies, 2 different media, 1 off-site | Universal |
+| **3-2-1-0** | + 0 errors after restore (tested) | Enterprise, compliance |
+| **GFS (Grandfather-Father-Son)** | Daily, weekly, monthly rotation | Long-term archive |
+| **Incremental forever** | One full backup, then changes only | Large data volumes |
+| **Reverse incremental** | Full + incrementals; the full backup always holds the latest state | Fast restore |
+
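GFS rotation comes down to assigning each backup to a retention tier. A minimal sketch under assumed conventions (monthly backup on the 1st, weekly backup on Sunday); the table above only says daily, weekly and monthly, so the concrete days and the `gfs_tier` name are assumptions:

```python
from datetime import date


def gfs_tier(day: date) -> str:
    """Classify a backup date into a GFS tier under the assumed schedule."""
    if day.day == 1:
        return "grandfather"  # monthly backup: first day of the month (assumption)
    if day.weekday() == 6:
        return "father"       # weekly backup: Sunday (assumption)
    return "son"              # every other day is a daily backup


for d in (date(2026, 6, 1), date(2026, 6, 7), date(2026, 6, 9)):
    print(d.isoformat(), gfs_tier(d))
```

Each tier would then get its own retention period, for example days for sons, weeks for fathers and months or years for grandfathers.
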
+### Backup methods
+
+| Method | RPO | RTO | Storage | Suitable for |
+|--------|-----|-----|---------|--------------|
+| **Full backup** | Last full | Time to restore the full | Large | Small data sets, weekly |
+| **Incremental** | Last increment | Full + all increments | Small | Large data sets, daily |
+| **Differential** | Last differential | Full + last differential | Medium | Compromise |
+| **Snapshot** | Time of the snapshot | seconds | Copy-on-write | VMs, storage arrays |
+| **Continuous (CDC)** | < 1 s | seconds | Log stream | DBs (binlog, WAL) |
+| **PITR** | Any point in time | Depends on volume | Full + WAL | RDS, PostgreSQL, SQL Server |
+
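The RTO column for incremental and differential backups follows from the restore chain each method needs. A small sketch, assuming backups are given as a chronological list of `(label, kind)` tuples; the data layout and the `restore_chain` name are illustrative:

```python
def restore_chain(backups, method):
    """Return the labels of the backups needed to restore the latest state.

    `backups` is a chronological list of (label, kind) tuples, where kind is
    "full", "incremental" or "differential". Raises ValueError if no full
    backup is present.
    """
    # Everything before the most recent full backup is irrelevant for the restore.
    last_full = max(i for i, (_, kind) in enumerate(backups) if kind == "full")
    chain = [backups[last_full]]
    later = backups[last_full + 1:]
    if method == "incremental":
        # Every increment since the full backup must be replayed in order.
        chain += [b for b in later if b[1] == "incremental"]
    elif method == "differential":
        # Only the newest differential is needed on top of the full backup.
        chain += [b for b in later if b[1] == "differential"][-1:]
    return [label for label, _ in chain]


week_inc = [("Sun", "full"), ("Mon", "incremental"), ("Tue", "incremental"), ("Wed", "incremental")]
week_diff = [("Sun", "full"), ("Mon", "differential"), ("Tue", "differential"), ("Wed", "differential")]
print(restore_chain(week_inc, "incremental"))
print(restore_chain(week_diff, "differential"))
```

The incremental chain grows with every day since the last full backup, while the differential chain always has at most two elements.
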
+### Backup immutability
+
+Key protection against ransomware:
+
+| Technique | Description |
+|-----------|-------------|
+| **Object Lock (WORM)** | The backup cannot be deleted or overwritten for the duration of a defined retention period (S3 Object Lock, Azure Blob immutable storage) |
+| **Air gap** | The backup is physically separated from the production network (offline disk, tape, cloud without VPN) |
+| **Isolated backup network** | Backup traffic uses a dedicated network with no access from the production VLAN |
+| **Out-of-band access** | The backup management console is not reachable from the production network |
+
+---
+
+## DR architectures
+
+### Multi-AZ (Single region)
+
+```
+Region ┌────────────────────────────────────┐
+       │  AZ-1              AZ-2            │
+       │  ┌──────────┐     ┌──────────┐     │
+       │  │  App     │     │  App     │     │
+       │  └─────┬────┘     └─────┬────┘     │
+       │        │                │          │
+       │  ┌─────▼────────────────▼─────┐    │
+       │  │  Load Balancer (cross-AZ)  │    │
+       │  └─────────────┬──────────────┘    │
+       │                │                   │
+       │  ┌─────────────▼──────────────┐    │
+       │  │  DB Primary (AZ-1)         │    │
+       │  │  DB Standby (AZ-2)         │    │
+       │  │  Synchronous replication   │    │
+       │  └────────────────────────────┘    │
+       └────────────────────────────────────┘
+```
+
+- RTO: minutes (automatic failover)
+- RPO: 0 (sync replication)
+- Protection: against an AZ failure, not a region failure
+
+### Multi-Region
+
+```
+Region A (Primary)                    Region B (DR)
+┌─────────────────────┐              ┌─────────────────────┐
+│  ┌───────────────┐  │              │  ┌───────────────┐  │
+│  │  App + DB     │  │              │  │  App + DB     │  │
+│  │  Active       │──┼──Async───────┼─►│  Standby      │  │
+│  └───────────────┘  │  replication │  └───────────────┘  │
+│         │           │              │         │           │
+│  ┌──────▼────────┐  │              │  ┌──────▼────────┐  │
+│  │  DNS / GSLB   │  │              │  │  DNS / GSLB   │  │
+│  └──────┬────────┘  │              │  └──────┬────────┘  │
+└─────────┼───────────┘              └─────────┼───────────┘
+          │                                    │
+          └──────────── Traffic Manager ───────┘
+```
+
+| Variant | RTO | RPO | Cost | Failover |
+|---------|-----|-----|------|----------|
+| **Active-Passive** | minutes–hours | seconds | Medium | Manual / automatic |
+| **Active-Active** | seconds | < 1 s | High | Automatic (DNS) |
+| **Pilot Light** | tens of minutes | minutes | Low | Manual scale-up |
+| **Warm Standby** | minutes | seconds | High | Automatic (scaled-down copy) |
+| **Backup & Restore** | hours | up to 24 h | Low | Manual |
+
+### On-prem → Cloud DR (Hybrid)
+
+```
+On-prem DC                              Cloud (DR)
+┌─────────────────────┐              ┌─────────────────────┐
+│  ┌───────────────┐  │              │  ┌───────────────┐  │
+│  │  Application  │  │              │  │  VM / App     │  │
+│  │  + DB         │  │              │  │  + DB replica │  │
+│  └───────┬───────┘  │              │  └───────┬───────┘  │
+│          │          │              │          │          │
+│  ┌───────▼───────┐  │  site-to-site│  ┌───────▼───────┐  │
+│  │  Backup proxy │──┼────VPN───────┼─►│  Backup store │  │
+│  └───────────────┘  │              │  └───────────────┘  │
+│                     │              │                     │
+│  ┌───────────────┐  │              │  ┌───────────────┐  │
+│  │  Tape / NAS   │  │              │  │  Veeam / Zerto│  │
+│  └───────────────┘  │              │  └───────────────┘  │
+└─────────────────────┘              └─────────────────────┘
+```
+
+- **RTO**: tens of minutes (depends on VM startup time)
+- **RPO**: minutes–hours (depends on the replication tool)
+- **Tools**: Veeam, Zerto, Azure Site Recovery, AWS MGN, Commvault
+- **Use case**: enterprises with an on-prem DC that need DR without a second DC
+
+---
+
+## DR testing
+
+### Test types
+
+| Type | Description | Frequency | Risk |
+|------|-------------|-----------|------|
+| **Tabletop exercise** | Manual walk-through of a scenario, no impact on production | Monthly | None |
+| **Walkthrough** | Runbook verification, checking that everyone knows what to do | Quarterly | None |
+| **Component test** | Test of a single component (e.g. restoring one DB) | Monthly | Low |
+| **Integrated test** | Test of the whole stack in an isolated environment | Quarterly | Low |
+| **Full failover test** | Production failover to the DR site | Annually | High |
+| **Chaos experiment** | Deliberate fault injection in production | Continuously | Medium |
+
+### Runbook structure
+
+Every DR scenario should have a runbook:
+
+```yaml
+scenario: "Region A failure"
+triggers:
+  - "CloudWatch alarm: Region A health check 5× timeout"
+  - "PagerDuty incident P0"
+decision_tree: |
+  1. Verify: is Region A really unavailable? (check from 3 different locations)
+  2. Decide: is the RTO at risk? If less than 30% of the RTO remains → fail over
+  3. Failover: run playbook `dr-failover-region-b`
+  4. Verification: smoke tests in Region B
+  5. Communication: status page + stakeholders
+rollback: |
+  1. Once Region A is restored → replicate changes from B back to A
+  2. Repoint DNS to A
+  3. Verify data consistency
+  4. Shut down Region B (or keep it as a hot standby)
+contacts:
+  primary: "on-call@example.com"
+  escalation: "infra-lead@example.com"
+  management: "vp-engineering@example.com"
+```
+
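The first two steps of the decision tree can be expressed as simple predicates. A rough sketch, assuming probe results arrive as booleans (one per probing location) and time is tracked in minutes; both function names are hypothetical and the 30% threshold is taken from the runbook above:

```python
def region_confirmed_down(probe_ok):
    """True if at least 3 locations were probed and none of them got an answer."""
    return len(probe_ok) >= 3 and not any(probe_ok)


def should_fail_over(elapsed_min, rto_min, threshold=0.30):
    """True once less than `threshold` of the RTO budget remains."""
    remaining_fraction = (rto_min - elapsed_min) / rto_min
    return remaining_fraction < threshold


# Hypothetical incident: RTO of 60 minutes, three probes all failing.
probes = [False, False, False]
for elapsed in (30, 45):
    decision = region_confirmed_down(probes) and should_fail_over(elapsed, rto_min=60)
    print(f"{elapsed} min elapsed -> fail over: {decision}")
```

Keeping the two checks separate mirrors the runbook: the outage is confirmed first, and only then is the remaining RTO budget used to decide.
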
+---
+
+## Best practices
+
+- **Test the restore, not the backup** — a backup without a tested restore is not a backup
+- **Automate DR** — Terraform / Ansible for spinning up the DR environment, DNS failover
+- **Document runbooks** — every scenario, contact, and decision tree
+- **Expect failure** — design for failure; do not assume everything will keep running
+- **Do not underestimate WRT** — restoring the service does not mean full operation (data warming, caches, connections)
+- **Align RTO/RPO with the business** — technical capabilities must match business requirements
+- **Monitor SLIs** — without data an SLO cannot be verified
+- **DR is not just IT** — communication, PR, legal, regulation
+
+---
+
+## Related
+
+- [CLOUD.md](CLOUD.md) — cloud DR strategies, AWS/Azure/GCP specifics
+- [DATACENTERS.md](DATACENTERS.md) — DC redundancy, Tier classification
+- [MONITORING.md](MONITORING.md) — alerting, SLI/SLO/SLA
+- [CICD.md](CICD.md) — deployment strategies, rollback
+- [STORAGE.md](STORAGE.md) — backup storage, replication
+
+## Sources
+
+Links, books and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
+
+*Last revision: 2026-06-11*
--- a/GPU.en.md
+++ b/GPU.en.md
@@ -112,6 +112,12 @@ NVLink topologie (GPU direct)   PCIe topologie (CPU mediated)
 - **Denoising**: AI-accelerated denoising on GPU
 - **Farm rendering**: Deadline, Qube! (job scheduler)

+## GPU pricing
+
+For detailed pricing comparisons (purchase price, cloud on-demand, $/M-token inference cost, $/GB HBM, price trends 2024→2026), see:
+
+- [AI-INFRASTRUCTURE.en.md — GPU pricing and price/performance](AI-INFRASTRUCTURE.en.md#gpu-pricing-and-priceperformance)
+
 ## GPU server form factors

 | Form factor | GPU count | Power | Cooling | Example |
@@ -144,6 +150,6 @@ Cyborg is an OpenStack service for managing accelerators (GPU, FPGA, DPU, NPU).

 ## Sources

-Links, books and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
+Links, books and standards: [sources/infrastructure/sources.en.md](sources/infrastructure/sources.en.md)

 *Last revision: 2026-06-03*
--- a/GPU.md
+++ b/GPU.md
@@ -112,6 +112,12 @@ NVLink topologie (GPU direct)   PCIe topologie (CPU mediated)
 - **Denoising**: AI-accelerated denoising on GPU
 - **Farm rendering**: Deadline, Qube! (job scheduler)

+## GPU pricing
+
+For detailed pricing comparisons (purchase price, cloud on-demand, $/M-token inference cost, $/GB HBM, price trends 2024→2026), see:
+
+- [AI-INFRASTRUCTURE.md — GPU pricing and price/performance](AI-INFRASTRUCTURE.md#ceny-gpu-a-poměr-cenavýkon)
+
 ## GPU server form factors

 | Form factor | GPU count | Power | Cooling | Example |
--- a/HARDWARE.en.md
+++ b/HARDWARE.en.md
@@ -4,9 +4,9 @@ This file has been split into separate areas:

 | Area | File |
 |--------|--------|
-| 🔧 Server hardware — components and architecture | [SERVER-HW.md](SERVER-HW.md) |
-| 🎮 GPU — architecture, models, virtualization | [GPU.md](GPU.md) |
-| ⚙️ Server configuration — best practices by workload | [SERVER-CONFIG.md](SERVER-CONFIG.md) |
-| 📦 Provisioning — boot, installation, server management | [PROVISIONING.md](PROVISIONING.md) |
+| 🔧 Server hardware — components and architecture | [SERVER-HW.en.md](SERVER-HW.en.md) |
+| 🎮 GPU — architecture, models, virtualization | [GPU.en.md](GPU.en.md) |
+| ⚙️ Server configuration — best practices by workload | [SERVER-CONFIG.en.md](SERVER-CONFIG.en.md) |
+| 📦 Provisioning — boot, installation, server management | [PROVISIONING.en.md](PROVISIONING.en.md) |

 *Last revision: 2026-06-03*
--- a/HYPERVISORS.en.md
+++ b/HYPERVISORS.en.md
@@ -24,7 +24,7 @@
 - **VM — Virtual Machine** — full virtualization, own kernel
 - **Container** — shared host kernel, lighter (Docker, LXC)
 - **Paravirtualization** — guest OS knows it runs in a VM (better I/O performance)
-- **NUMA** — Non-Uniform Memory Access, CPU/memory allocation optimization (see [SERVER-HW.md](SERVER-HW.md#numa))
+- **NUMA** — Non-Uniform Memory Access, CPU/memory allocation optimization (see [SERVER-HW.en.md](SERVER-HW.en.md#numa))
 - **Overcommit** — allocating more vCPU/RAM than physically available (ratio management)
 - **Live Migration** — moving a running VM between hosts (vSphere vMotion, Hyper-V Live Migration)
 - **HA (High Availability)** — VM restart on another host upon failure
@@ -86,20 +86,22 @@ According to Foundry/CIO.com survey (2025): **56%** of organizations plan to red

 #### Target Platforms — Comparison

-| Criterion | Proxmox VE | Nutanix AHV | Microsoft Hyper-V | Red Hat OpenShift Virtualization |
-|-----------|-----------|-------------|-------------------|----------------------------------|
-| **Hypervisor** | KVM + LXC | KVM (fork) | Hyper-V | KVM (KubeVirt) |
-| **License** | Open source (free), support ~€500/host/year | Per node subscription (30–60% savings vs VCF) | Windows Server license (Standard/Datacenter) | OpenShift subscription (core-based) |
-| **Live Migration** | Live Migration (Proxmox 8+) | AHV Live Migration | Live Migration (SMB/RDMA) | KubeVirt (VMI live migration) |
-| **HA** | Proxmox HA (watchdog, fencing) | Built-in HA (Prism) | Hyper-V HA (WS Failover Cluster) | OpenShift HA (self-healing) |
-| **Storage** | ZFS, Ceph, LVM | AOS (hybrid/SSD, erasure coding) | S2D, CSV, ReFS | OCS, Ceph, LSO |
-| **Backup** | Proxmox Backup Server (free) | Native snapshot + DR | Windows Server Backup / Veeam | OpenShift APIs + OADP |
-| **Price (3 years, 3 hosts)** | $0 + support $1,500 | ~$45,000–60,000 | $0 (Hyper-V Server free) or Windows Server license | ~$90,000+ (OpenShift) |
-| **Price (3 years, 10 hosts)** | $0 + support $5,000 | ~$150,000–200,000 | Windows Server Datacenter for unlimited VMs | ~$300,000+ (OpenShift) |
-| **Migration difficulty** | Medium (VMDK → QCOW2, VirtIO drivers) | Low (Nutanix Move tool) | Medium (V2V converter, SCVMM) | High (Kubernetes learning curve) |
-| **Linux support** | Excellent (native KVM) | Excellent (KVM-based) | Good (LIS drivers) | Excellent (KVM + OpenShift) |
-| **Windows support** | Good (VirtIO drivers) | Excellent (ALAS drivers, svpd) | Excellent (native) | Good (KubeVirt + VirtIO) |
-| **GPU passthrough** | VFIO (excellent) | GPU passthrough | DDA (Direct Device Assignment) | VFIO + GPU Operator |
+| Criterion | Proxmox VE | Nutanix AHV | Microsoft Hyper-V | Red Hat OpenShift Virtualization | **Sangfor aSV (HCI)** |
+|-----------|-----------|-------------|-------------------|----------------------------------|----------------------|
+| **Hypervisor** | KVM + LXC | KVM (fork) | Hyper-V | KVM (KubeVirt) | **KVM (aSV)** |
+| **License** | Open source (free), support ~€500/host/year | Per node subscription (30–60% savings vs VCF) | Windows Server license (Standard/Datacenter) | OpenShift subscription (core-based) | **Per node (Enterprise Pro), all-inclusive** |
+| **Live Migration** | Live Migration (Proxmox 8+) | AHV Live Migration | Live Migration (SMB/RDMA) | KubeVirt (VMI live migration) | **Yes** |
+| **HA** | Proxmox HA (watchdog, fencing) | Built-in HA (Prism) | Hyper-V HA (WS Failover Cluster) | OpenShift HA (self-healing) | **Built-in HA** |
+| **Storage** | ZFS, Ceph, LVM | AOS (hybrid/SSD, erasure coding) | S2D, CSV, ReFS | OCS, Ceph, LSO | **aSAN (distributed SDS, locality-aware)** |
+| **Backup** | Proxmox Backup Server (free) | Native snapshot + DR | Windows Server Backup / Veeam | OpenShift APIs + OADP | **Built-in backup + CDP** |
+| **Price (3 years, 3 hosts)** | $0 + support $1,500 | ~$45,000–60,000 | $0 (Hyper-V Server free) or Windows Server license | ~$90,000+ (OpenShift) | **~$15,000–25,000** |
+| **Price (3 years, 10 hosts)** | $0 + support $5,000 | ~$150,000–200,000 | Windows Server Datacenter for unlimited VMs | ~$300,000+ (OpenShift) | **~$50,000–80,000** |
+| **Migration difficulty** | Medium (VMDK → QCOW2, VirtIO drivers) | Low (Nutanix Move tool) | Medium (V2V converter, SCVMM) | High (Kubernetes learning curve) | **Low (VMware import tool)** |
+| **Linux support** | Excellent (native KVM) | Excellent (KVM-based) | Good (LIS drivers) | Excellent (KVM + OpenShift) | **Excellent (KVM-based)** |
+| **Windows support** | Good (VirtIO drivers) | Excellent (ALAS drivers, svpd) | Excellent (native) | Good (KubeVirt + VirtIO) | **Good (VirtIO drivers)** |
+| **GPU passthrough** | VFIO (excellent) | GPU passthrough | DDA (Direct Device Assignment) | VFIO + GPU Operator | **vGPU support (standard)** |
+| **Integrated security** | — | — | — | — | **Yes (NGFW, IPS, WAF, EDR — aSEC)** |
+| **Min. cluster (3 copies)** | 3 (Ceph) | 3 | 2–3 | 3 | **3** |

 #### Migration Tools

@@ -112,8 +114,47 @@ According to Foundry/CIO.com survey (2025): **56%** of organizations plan to red
 | **virt-v2v** | VMware ESXi, Xen, Hyper-V | KVM (libvirt) | Open source CLI tool, disk + driver conversion (virtio), suitable for bulk migration |
 | **Windows Admin Center VM Conversion Extension** | VMware ESXi | Hyper-V | Microsoft WAC extension, free, GUI-based, bulk migration |
 | **Platform9 vJailbreak** | VMware ESXi | OpenStack / KVM | In-place migration (no swing gear), open source |
+| **Sangfor VMware Import Tool** | VMware ESXi | Sangfor aSV (HCI) | VMware import tool, disk + driver conversion, can retain network config |

-#### TCO Comparison — Example: 3 hosts (2× 20C CPU), 50 VMs
+#### Cross-Hypervisor Migration Matrix
+
+Comprehensive overview of all source→target pairs with methods, tools, limitations, and complexity.
+
+| Source → Target | Method | Tools | Complexity | Limitations |
+|-------------|--------|----------|-----------|---------|
+| **VMware → Proxmox** | Disk conversion VMDK→QCOW2, driver reinstall | Proxmox Import Wizard, Veeam, StarWind, virt-v2v | Medium | VirtIO drivers required, UEFI not supported in Import Wizard (< 8.1), snapshots must be removed |
+| **VMware → Hyper-V** | Disk conversion VMDK→VHDX, driver reinstall | StarWind, WAC Converter, SCVMM, Microsoft MTC | Medium | Integration Services required, network config differences (VMXNET3 → Hyper-V Synthetic) |
+| **VMware → KVM/XCP-ng** | Disk conversion VMDK→raw/QCOW2, driver swap | virt-v2v, StarWind | Medium | VirtIO drivers, UEFI support (OVMF), host passthrough compatibility |
+| **VMware → Nutanix AHV** | Automated migration via Move appliance | Nutanix Move, Veeam | Low | AHV is also KVM — minimal issues, retain IP/MAC, UEFI support |
+| **VMware → Sangfor aSV** | Import via VMware Import Tool, disk + driver conversion | Sangfor VMware Import Tool | Low | Built-in tool, retain network config, UEFI support |
+| **VMware → OpenStack** | In-place or swing | Platform9 vJailbreak, virt-v2v + Glance | High | Network redesign (Neutron), storage (Cinder), image format (Glance) required |
+| **Hyper-V → VMware** | Disk conversion VHDX→VMDK, driver reinstall | StarWind, virt-v2v, VMware vCenter Converter (standalone) | Medium | VMware Tools required, network driver change (VMXNET3), UEFI/secure boot issues |
+| **Hyper-V → Proxmox** | Disk conversion VHDX→QCOW2, driver swap | StarWind, virt-v2v, qemu-img | Medium–High | VirtIO drivers, integration services → guest agent, secure boot issues |
+| **Hyper-V → KVM/XCP-ng** | Disk conversion VHDX→raw/QCOW2 | virt-v2v, qemu-img | Medium | VirtIO drivers, Linux generic drivers usually work |
+| **Hyper-V → Nutanix AHV** | Automated migration | Nutanix Move | Low–Medium | Similar to VMware→Nutanix, UEFI support, retain IP |
+| **Proxmox → VMware** | Export OVF/OVA, qemu-img convert | qemu-img (QCOW2→VMDK), ovftool, manual OVF export | High | VMware Tools required, storage format differences, no live migration, downtime required |
+| **Proxmox → Hyper-V** | qemu-img convert, driver reinstall | qemu-img, manual VHDX conversion | High | Hyper-V Integration Services required, no automated tool, edge case |
+| **Proxmox → KVM/XCP-ng** | Direct QCOW2 (same format), XML edit | libvirt, virsh dumpxml/define | Medium | libvirt XML/QEMU args differences (storage pool, network), validation required |
+| **Proxmox → Nutanix AHV** | qemu-img + manual import | qemu-img, Nutanix Image Service CLI | High | No dedicated migration tool; conversion + manual VM reconfiguration required |
+| **XCP-ng → VMware** | Disk conversion VHD→VMDK | qemu-img, StarWind, virt-v2v | High | VMware Tools required, paravirtualization differences (Xen PV vs VMware) |
+| **XCP-ng → Proxmox** | Disk conversion or direct VHD | qemu-img, manual import | Medium | Disk conversion, VHD format not native in Proxmox |
+| **XCP-ng → Hyper-V** | Disk conversion VHD→VHDX (direct) | StarWind, qemu-img | Medium | VHD/VHDX compatible, Integration Services required |
+| **Nutanix AHV → VMware** | Export + conversion | qemu-img, Nutanix Export, VMware vCenter Converter | High | VMware Tools, AHV is KVM → usually easier than Hyper-V→VMware |
+| **Nutanix AHV → Proxmox** | qemu-img + manual import | qemu-img, Nutanix self-service restore | Medium | AFS disks → QCOW2, metadata must be reconstructed |
+| **Nutanix AHV → Hyper-V** | qemu-img + manual | qemu-img, StarWind | High | Edge case, no dedicated migration tool |
+| **OpenStack → (any)** | Glance export + qemu-img | glance image-download, qemu-img, ovftool | Medium–High | Image format (raw/QCOW2), metadata (flavor, security groups) must be recreated |
+| **Sangfor aSV → (any)** | qemu-img conversion + manual | qemu-img, manual OVF/OVA export | Medium–High | KVM-based → conversion to QCOW2/VMDK/VHDX via qemu-img, metadata must be recreated |
+| **(any) → Sangfor aSV** | aSV API import + VMware Import Tool | Sangfor VMware Import Tool (for VMware), manual qemu-img import for others | Medium | KVM-based → standard formats supported, import tool for VMware only |
+
+**Keys to a successful migration:**
+
+- **Drivers** — each platform requires its own paravirtual drivers (VMware Tools, VirtIO, Hyper-V Integration Services, Xen Tools). Always swap after migration.
+- **UEFI / Secure Boot** — not all combinations support UEFI (Proxmox Import Wizard < 8.1 does not). Test UEFI VMs before migration.
+- **Snapshots** — snapshots must be removed (merged) before migration. Most tools only migrate flat disks.
+- **Network** — MAC addresses, IP addresses, VLAN tagging — verify after migration. Some tools (Nutanix Move, VMware Converter) can retain MAC.
+- **Storage format** — VMDK ↔ VHDX ↔ QCOW2 ↔ raw are inter-convertible via `qemu-img`, but metadata differs (snapshots, backing files).
+- **Live migration** — no live migration exists between different hypervisors. Downtime is always required (minutes to hours depending on VM size).
+- **Migration temperature** — the "colder" the VM (fewer changes), the easier the migration. Real-time database applications require a separate DB migration plan.
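
The disk conversions in the matrix mostly come down to one `qemu-img convert` call per disk. A small sketch that only builds the command line, assuming `qemu-img` is installed on the conversion host; the helper name and the file paths are placeholders, and the mapping uses qemu-img's own format names (legacy VHD is called `vpc` there):

```python
# Map the disk formats named in the matrix to qemu-img format identifiers.
QEMU_FORMATS = {
    "vmdk": "vmdk",
    "vhdx": "vhdx",
    "vhd": "vpc",  # qemu-img uses "vpc" for the legacy VHD format
    "qcow2": "qcow2",
    "raw": "raw",
}


def qemu_img_convert_cmd(src, src_fmt, dst, dst_fmt):
    """Build the argv list for converting one flat (snapshot-free) disk image."""
    return [
        "qemu-img", "convert", "-p",    # -p prints progress
        "-f", QEMU_FORMATS[src_fmt],    # source format
        "-O", QEMU_FORMATS[dst_fmt],    # output format
        src, dst,
    ]


# Placeholder paths; pass the list to subprocess.run(cmd, check=True) to execute it.
cmd = qemu_img_convert_cmd("vm1.vmdk", "vmdk", "vm1.qcow2", "qcow2")
print(" ".join(cmd))
```

The conversion covers only the disk contents; drivers, firmware settings and network configuration still have to be handled as described in the list above.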

+#### TCO Comparison — Example: 3 hosts (2× 20C CPU), 50 VMs
+
 | Platform | Year 1 | 3 Years Total | Note |
 |-----------|--------|---------------|----------|
@@ -123,6 +164,7 @@ According to Foundry/CIO.com survey (2025): **56%** of organizations plan to red
 | **Nutanix AHV** (average) | ~$18,000 | ~$54,000 | Per node subscription, estimate |
 | **Hyper-V** (Windows Server Datacenter) | $12,400 | $37,200 | One-time license per core, without SA |
 | **Hyper-V** (Azure Stack HCI) | ~$14,400 | ~$43,200 | ~$10/core/month, 120 cores |
+| **Sangfor HCI** (Enterprise Pro) | ~$5,000–8,000 | ~$15,000–25,000 | Per node, all-inclusive, 3 nodes |

 **Real-world example from Spiceworks (2026)**: A user reports VMware Essentials+ increasing from $1,900/year to $14,000/year (VVF) — a 7.4× increase.

@@ -142,8 +184,9 @@ According to Foundry/CIO.com survey (2025): **56%** of organizations plan to red
 3. Select target platform (1-2 candidates)
   ├─ Proxmox: lowest TCO, Linux-heavy shops
   ├─ Nutanix: enterprise HCI, low migration difficulty
-   ├─ Hyper-V: Windows-centric, Azure hybrid
-   └─ OpenShift: Kubernetes-first, platform engineering
+   ├─ Hyper-V: Windows-centric, Azure hybrid
+   ├─ Sangfor: HCI all-in-one, security-first, VMware exit (SMB/mid-market)
+   └─ OpenShift: Kubernetes-first, platform engineering

 4. Plan migration phases
   ├─ Wave 1: non-critical (dev/test, 1-2 months)
@@ -269,9 +312,71 @@ Hardware ──> QEMU (I/O emulation) + KVM (kernel module, virtualization)
 - Load KVM modules: `kvm`, `kvm_intel`/`kvm_amd`, `vfio-pci`
 - Optimize storage: raw/LVM (avoid qcow2 for performance workloads)

+## Sangfor aSV (HCI)
+
+[Sangfor](https://www.sangfor.com) is a Chinese vendor. aSV is its KVM-based hypervisor and part of the Sangfor HCI stack (aSV + aSAN + aNet + aSEC), distributed through partners in EMEA.
+
+### Stack architecture
+
+| Component | Role |
+|-----------|------|
+| **aSV** | Hypervisor (KVM-based) |
+| **aSAN** | Distributed SDS (locality-aware, data tiering, dedup, compression) |
+| **aNet** | Network virtualization (distributed switches and routers, WYDIWYG visual editor) |
+| **aSEC** | Security (NGFW, IPS, WAF, EDR, east-west segmentation) |
+| **Sangfor Cloud Platform** | Management orchestrator, unified dashboard |
+
+### Key features
+
+| Feature | Detail |
+|-----------|--------|
+| **Hypervisor** | KVM (aSV) — custom fork with HCI extensions |
+| **License** | Enterprise Pro — per node, all-inclusive (compute + storage + network + security) |
+| **Min. cluster** | 3 nodes (3 data copies) |
+| **Live Migration** | Yes |
+| **HA** | Built-in HA |
+| **Storage** | aSAN — locality-aware, data tiering (SSD + HDD), dedup, compression, erasure coding |
+| **Backup** | Built-in backup + CDP — no 3rd party needed |
+| **Security** | Integrated NGFW, IPS, WAF, EDR — no external appliances |
+| **VDI** | aDesk — integrated VDI solution |
+| **Kubernetes** | SKE (Sangfor Kubernetes Engine) |
+| **Migration** | Sangfor VMware Import Tool (from vCenter), qemu-img for others |
+| **vGPU** | Standard support (no extra license) |
+
+### Comparison with VMware
+
+| Feature | Sangfor | VMware |
+|---------|---------|--------|
+| **License** | Per node, all-inclusive | Multi-tier (vSphere + vSAN + NSX + Aria) |
+| **vGPU** | Included (standard) | Enterprise Plus only |
+| **Backup + CDP** | Built-in | 3rd party or extra license |
+| **Security (NGFW, IPS, WAF)** | Built-in (aSEC) | NSX + 3rd party |
+| **Network management** | WYDIWYG visual editor | NSX Manager (more complex) |
+| **Min. cluster (3 copies)** | 3 nodes | 5 nodes (vSAN) |
+| **Data locality** | Yes | No |
+| **SSD life prediction** | Yes | No |
+
+### Use case
+
+- **VMware exit** — VMware replacement for SMB and mid-market
+- **Greenfield HCI** — new DCs, branch offices, remote sites
+- **VDI** — aDesk integrated with HCI
+- **Security-first** — organizations requiring integrated security
+- **Asia-Pacific / EMEA** — strongest in Asia, expanding to Europe
+
+### Risks and limitations
+
+| Risk | Detail |
+|--------|--------|
+| **Geopolitical** | Chinese vendor — possible regulatory restrictions (GDPR, EU, NATO, government) |
+| **Ecosystem** | Smaller community than VMware/Proxmox, less documentation and ISV certifications |
+| **Support** | Primary support from Asia, local partner critical |
+| **Vendor lock-in** | Closed ecosystem (aSV + aSAN + aNet + aSEC), harder to mix with 3rd party |
+| **References in CZ/EU** | Very limited — pilot required before production |
+
 ## Storage in Hypervisors

-See also: [STORAGE.md](STORAGE.md) — detailed overview of storage protocols and configurations.
+See also: [STORAGE.en.md](STORAGE.en.md) — detailed overview of storage protocols and configurations.

 | Type | Description | Protocols |
 |-----|-------|-----------|
@@ -443,7 +548,7 @@ For telco, large private clouds, MANO/NFVI environments.

 ## Resources

-Links, books and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
+Links, books and standards: [sources/infrastructure/sources.en.md](sources/infrastructure/sources.en.md)

 ### Recommended Reading

--- a/HYPERVISORS.md
+++ b/HYPERVISORS.md
@@ -86,20 +86,22 @@ Dle Foundry/CIO.com průzkumu (2025): **56 %** organizací plánuje snížit vyu

 #### Target platforms — comparison

-| Criterion | Proxmox VE | Nutanix AHV | Microsoft Hyper-V | Red Hat OpenShift Virtualization |
-|-----------|-----------|-------------|-------------------|----------------------------------|
-| **Hypervisor** | KVM + LXC | KVM (fork) | Hyper-V | KVM (KubeVirt) |
-| **License** | Open source (free), support ~€500/host/year | Per node subscription (30–60% savings vs VCF) | Windows Server license (Standard/Datacenter) | OpenShift subscription (core-based) |
-| **Live Migration** | Live Migration (Proxmox 8+) | AHV Live Migration | Live Migration (SMB/RDMA) | KubeVirt (VMI live migration) |
-| **HA** | Proxmox HA (watchdog, fencing) | Built-in HA (Prism) | Hyper-V HA (WS Failover Cluster) | OpenShift HA (self-healing) |
-| **Storage** | ZFS, Ceph, LVM | AOS (hybrid/SSD, erasure coding) | S2D, CSV, ReFS | OCS, Ceph, LSO |
-| **Backup** | Proxmox Backup Server (free) | Native snapshot + DR | Windows Server Backup / Veeam | OpenShift APIs + OADP |
-| **Price (3 years, 3 hosts)** | $0 + support $1,500 | ~$45,000–60,000 | $0 (Hyper-V Server free) or Windows Server license | ~$90,000+ (OpenShift) |
-| **Price (3 years, 10 hosts)** | $0 + support $5,000 | ~$150,000–200,000 | Windows Server Datacenter for unlimited VMs | ~$300,000+ (OpenShift) |
-| **Migration difficulty** | Medium (VMDK → QCOW2, VirtIO drivers) | Low (Nutanix Move tool) | Medium (V2V converter, SCVMM) | High (Kubernetes learning curve) |
-| **Linux support** | Excellent (native KVM) | Excellent (KVM-based) | Good (LIS drivers) | Excellent (KVM + OpenShift) |
-| **Windows support** | Good (VirtIO drivers) | Excellent (ALAS drivers, svpd) | Excellent (native) | Good (KubeVirt + VirtIO) |
-| **GPU passthrough** | VFIO (excellent) | GPU passthrough | DDA (Direct Device Assignment) | VFIO + GPU Operator |
+| Criterion | Proxmox VE | Nutanix AHV | Microsoft Hyper-V | Red Hat OpenShift Virtualization | **Sangfor aSV (HCI)** |
+|-----------|-----------|-------------|-------------------|----------------------------------|----------------------|
+| **Hypervisor** | KVM + LXC | KVM (fork) | Hyper-V | KVM (KubeVirt) | **KVM (aSV)** |
+| **License** | Open source (free), support ~€500/host/year | Per node subscription (30–60% savings vs VCF) | Windows Server license (Standard/Datacenter) | OpenShift subscription (core-based) | **Per node (Enterprise Pro), all-inclusive** |
+| **Live Migration** | Live Migration (Proxmox 8+) | AHV Live Migration | Live Migration (SMB/RDMA) | KubeVirt (VMI live migration) | **Yes** |
+| **HA** | Proxmox HA (watchdog, fencing) | Built-in HA (Prism) | Hyper-V HA (WS Failover Cluster) | OpenShift HA (self-healing) | **Built-in HA** |
+| **Storage** | ZFS, Ceph, LVM | AOS (hybrid/SSD, erasure coding) | S2D, CSV, ReFS | OCS, Ceph, LSO | **aSAN (distribuovaný SDS, locality-aware)** |
+| **Backup** | Proxmox Backup Server (free) | Native snapshot + DR | Windows Server Backup / Veeam | OpenShift APIs + OADP | **Built-in backup + CDP (Continuous Data Protection)** |
+| **Cena (3 roky, 3 hosty)** | $0 + support $1 500 | ~$45 000–60 000 | $0 (Hyper-V Server zdarma) nebo Windows Server lic. | ~$90 000+ (OpenShift) | **~$15 000–25 000** |
+| **Cena (3 roky, 10 hostů)** | $0 + support $5 000 | ~$150 000–200 000 | Windows Server Datacenter pro neomezené VM | ~$300 000+ (OpenShift) | **~$50 000–80 000** |
+| **Náročnost migrace** | Střední (VMDK → QCOW2, VirtIO drivery) | Nízká (Nutanix Move tool) | Střední (V2V converter, SCVMM) | Vysoká (Kubernetes learning curve) | **Nízká (nástroje pro VMware import)** |
+| **Linux podpora** | Výborná (nativní KVM) | Výborná (KVM-based) | Dobrá (LIS drivers) | Výborná (KVM + OpenShift) | **Výborná (KVM-based)** |
+| **Windows podpora** | Dobrá (VirtIO drivers) | Výborná (ALAS drivers, svpd) | Výborná (nativní) | Dobrá (KubeVirt + VirtIO) | **Dobrá (VirtIO drivers)** |
+| **GPU passthrough** | VFIO (výborná) | GPU passthrough | DDA (Direct Device Assignment) | VFIO + GPU Operator | **vGPU support (standard)** |
+| **Integrovaná bezpečnost** | — | — | — | — | **Ano (NGFW, IPS, WAF, EDR — aSEC)** |
+| **Min. cluster (3 kopie)** | 3 (Ceph) | 3 | 2–3 | 3 | **3** |

 #### Migration tools

@@ -112,6 +114,47 @@ Dle Foundry/CIO.com průzkumu (2025): **56 %** organizací plánuje snížit vyu
 | **virt-v2v** | VMware ESXi, Xen, Hyper-V | KVM (libvirt) | Open-source CLI tool; converts disks and installs virtio drivers; suitable for bulk migration |
 | **Windows Admin Center VM Conversion Extension** | VMware ESXi | Hyper-V | Microsoft WAC extension, free, GUI-based, bulk migration |
 | **Platform9 vJailbreak** | VMware ESXi | OpenStack / KVM | In-place migration (no swing gear), open source |
+| **Sangfor VMware Import Tool** | VMware ESXi | Sangfor aSV (HCI) | Imports VMs from vCenter, converts disks and drivers, can retain network configuration |
+
+#### Cross-hypervisor migration matrix
+
+An overview of source → target pairs with methods, tools, limitations, and difficulty.
+
+| Source → Target | Method | Tools | Difficulty | Limitations |
+|-------------|--------|----------|-----------|---------|
+| **VMware → Proxmox** | Disk conversion VMDK→QCOW2, driver reinstallation | Proxmox Import Wizard, Veeam, StarWind, virt-v2v | Medium | VirtIO drivers required, UEFI not supported in Import Wizard (< 8.1), snapshots must be removed |
+| **VMware → Hyper-V** | Disk conversion VMDK→VHDX, driver reinstallation | StarWind, WAC Converter, SCVMM, Microsoft MTC | Medium | Integration Services required, differences in network configuration (VMXNET3 → Hyper-V Synthetic) |
+| **VMware → KVM/XCP-ng** | Disk conversion VMDK→raw/QCOW2, driver swap | virt-v2v, StarWind | Medium | VirtIO drivers, UEFI support (OVMF), host passthrough must be compatible |
+| **VMware → Nutanix AHV** | Automated migration via the Move appliance | Nutanix Move, Veeam | Low | AHV is KVM-based, so issues are minimal; retains IP/MAC, supports UEFI |
+| **VMware → Sangfor aSV** | Import via VMware Import Tool, disk and driver conversion | Sangfor VMware Import Tool | Low | Built-in tool, retains network config, supports UEFI |
+| **VMware → OpenStack** | In-place or swing | Platform9 vJailbreak, virt-v2v + Glance | High | Requires redesign of networking (Neutron), storage (Cinder), and image format (Glance) |
+| **Hyper-V → VMware** | Disk conversion VHDX→VMDK, driver reinstallation | StarWind, virt-v2v, VMware vCenter Converter (standalone) | Medium | VMware Tools required, network driver change (VMXNET3), UEFI/Secure Boot issues |
+| **Hyper-V → Proxmox** | Disk conversion VHDX→QCOW2, driver swap | StarWind, virt-v2v, qemu-img | Medium–High | VirtIO drivers, Integration Services → guest agent, Secure Boot issues |
+| **Hyper-V → KVM/XCP-ng** | Disk conversion VHDX→raw/QCOW2 | virt-v2v, qemu-img | Medium | VirtIO drivers; generic Linux drivers usually work |
+| **Hyper-V → Nutanix AHV** | Automated migration | Nutanix Move | Low–Medium | Similar to VMware→Nutanix, supports UEFI, retains IP |
+| **Proxmox → VMware** | OVF/OVA export, qemu-img convert | qemu-img (QCOW2→VMDK), ovftool, manual OVF export | High | VMware Tools required, differences in storage formats, no live migration, downtime required |
+| **Proxmox → Hyper-V** | qemu-img convert, driver reinstallation | qemu-img, manual VHDX conversion | High | Hyper-V Integration Services required, no automated tool, edge case |
+| **Proxmox → KVM/XCP-ng** | Direct QCOW2 (same format), XML adjustment | libvirt, virsh dumpxml/define | Medium | Differences in libvirt XML/QEMU args (storage pool, network), validation required |
+| **Proxmox → Nutanix AHV** | qemu-img + manual import | qemu-img, Nutanix Image Service CLI | High | No ready-made tool; conversion plus manual VM reconfiguration required |
+| **XCP-ng → VMware** | Disk conversion VHD→VMDK | qemu-img, StarWind, virt-v2v | High | VMware Tools required, differences in paravirtualization (Xen PV vs VMware) |
+| **XCP-ng → Proxmox** | Disk conversion or direct VHD | qemu-img, manual import | Medium | Disk conversion; VHD is not a native format in Proxmox |
+| **XCP-ng → Hyper-V** | Disk conversion VHD→VHDX (direct) | StarWind, qemu-img | Medium | VHD/VHDX compatible, Integration Services required |
+| **Nutanix AHV → VMware** | Export + conversion | qemu-img, Nutanix Export, VMware vCenter Converter | High | VMware Tools required; AHV is KVM, so usually simpler than Hyper-V→VMware |
+| **Nutanix AHV → Proxmox** | qemu-img + manual import | qemu-img, Nutanix self-service restore | Medium | Disks from AFS → QCOW2, metadata must be reconstructed |
+| **Nutanix AHV → Hyper-V** | qemu-img + manual | qemu-img, StarWind | High | Edge case, no ready-made tool |
+| **OpenStack → (any)** | Glance export + qemu-img | glance image-download, qemu-img, ovftool | Medium–High | Image format (raw/QCOW2); metadata (flavor, security groups) must be recreated |
+| **Sangfor aSV → (any)** | qemu-img conversion + manual | qemu-img, manual OVF/OVA export | Medium–High | KVM-based, so conversion to QCOW2/VMDK/VHDX goes through qemu-img; metadata must be recreated |
+| **(any) → Sangfor aSV** | aSV API import + VMware Import Tool | Sangfor VMware Import Tool (for VMware), manual qemu-img import for others | Medium | KVM-based, so standard formats are supported; the import tool is VMware-only |
+
+**Keys to a successful migration:**
+
+- **Drivers**: each platform requires its own paravirtual drivers (VMware Tools, VirtIO, Hyper-V Integration Services, Xen Tools). Always replace them after migration.
+- **UEFI / Secure Boot**: not all combinations support UEFI (Proxmox Import Wizard < 8.1 does not). Test UEFI VMs before migrating them.
+- **Snapshots**: snapshots must be removed (consolidated) before migration. Most tools migrate only flat disks.
+- **Network**: check MAC addresses, IP addresses, and VLAN tagging after migration. Some tools (Nutanix Move, VMware Converter) can retain the MAC address.
+- **Storage format**: VMDK ↔ VHDX ↔ QCOW2 ↔ raw can be converted into one another with `qemu-img`, but they differ in metadata (snapshots, backing files).
+- **Live migration**: there is no live migration between different hypervisors. Downtime is always required (minutes to hours, depending on VM size).
+- **Migration temperature**: the "colder" the VM (fewer changes), the easier the migration. Applications with a real-time database need a separate DB migration plan.

 #### TCO comparison, example: 3 hosts (2× 20C CPU), 50 VMs

@@ -123,6 +166,7 @@ Dle Foundry/CIO.com průzkumu (2025): **56 %** organizací plánuje snížit vyu
 | **Nutanix AHV** (average) | ~$18,000 | ~$54,000 | Per-node subscription, estimate |
 | **Hyper-V** (Windows Server Datacenter) | $12,400 | $37,200 | One-time per-core license, without SA |
 | **Hyper-V** (Azure Stack HCI) | ~$7,200 | ~$21,600 | ~$10/core/month, 120 cores |
+| **Sangfor HCI** (Enterprise Pro) | ~$5,000–8,000 | ~$15,000–25,000 | Per node, all-inclusive, 3 nodes |

 **Real-world example from Spiceworks (2026)**: a user reports VMware Essentials+ rising from $1,900/year to $14,000/year (VVF), a 7.4× increase.

@@ -142,8 +186,9 @@ Dle Foundry/CIO.com průzkumu (2025): **56 %** organizací plánuje snížit vyu
 3. Choose the target platform (1-2 candidates)
   ├─ Proxmox: lowest TCO, Linux-heavy shops
   ├─ Nutanix: enterprise HCI, low migration effort
-   ├─ Hyper-V: Windows-centric, Azure hybrid
-   └─ OpenShift: Kubernetes-first, platform engineering
+  ├─ Hyper-V: Windows-centric, Azure hybrid
+  ├─ Sangfor: HCI all-in-one, security-first, VMware exit (SMB/mid-market)
+  └─ OpenShift: Kubernetes-first, platform engineering

 4. Plan the migration phases
   ├─ Wave 1: non-critical (dev/test, 1-2 months)
@@ -269,6 +314,72 @@ Hardware ──> QEMU (emulace I/O) + KVM (kernel module, virtualization)
 - Load the KVM modules: `kvm`, `kvm_intel`/`kvm_amd`, `vfio-pci`
 - Optimize storage: raw/LVM (avoid qcow2 for performance-sensitive workloads)

+## Sangfor aSV (HCI)
+
+[Chinese vendor](https://www.sangfor.com). aSV is a KVM-based hypervisor and part of the Sangfor HCI stack (aSV + aSAN + aNet + aSEC). In the Czech Republic it is distributed through partners.
+
+### Stack architecture
+
+| Component | Role |
+|-----------|------|
+| **aSV** | Hypervisor (KVM-based) |
+| **aSAN** | Distributed SDS (locality-aware, data tiering, dedup, compression) |
+| **aNet** | Network virtualization (distributed switches and routers, WYDIWYG visual editor) |
+| **aSEC** | Security (NGFW, IPS, WAF, EDR, east-west segmentation) |
+| **Sangfor Cloud Platform** | Management orchestrator, unified dashboard |
+
+### Key features
+
+| Feature | Detail |
+|-----------|--------|
+| **Hypervisor** | KVM (aSV), a custom fork with HCI extensions |
+| **License** | Enterprise Pro: per node, all-inclusive (compute + storage + network + security) |
+| **Min. cluster** | 3 nodes (3 data copies) |
+| **Live Migration** | Yes |
+| **HA** | Built-in HA |
+| **Storage** | aSAN: locality-aware (data locality), data tiering (SSD + HDD), dedup, compression, erasure coding |
+| **Backup** | Built-in backup + CDP (Continuous Data Protection), no third-party product needed |
+| **Security** | Integrated NGFW, IPS, WAF, EDR, no external appliances |
+| **VDI** | aDesk, an integrated VDI solution |
+| **Kubernetes** | SKE (Sangfor Kubernetes Engine) |
+| **Migration** | Sangfor VMware Import Tool (from vCenter), qemu-img for others |
+| **vGPU** | Standard support (no extra license) |
+
+### Comparison with VMware
+
+| Feature | Sangfor | VMware |
+|---------|---------|--------|
+| **License** | Per node, all-inclusive | Multi-tier (vSphere + vSAN + NSX + Aria) |
+| **vGPU** | Included (standard) | Enterprise Plus only |
+| **Backup + CDP** | Built-in | Third party or extra license |
+| **Security (NGFW, IPS, WAF)** | Built-in (aSEC) | NSX + 3rd party (Palo Alto, Check Point) |
+| **Network management** | WYDIWYG visual editor | NSX Manager (more complex) |
+| **Min. cluster (3 copies)** | 3 nodes | 5 nodes (vSAN) |
+| **Data locality** | Yes | No |
+| **SSD life prediction** | Yes | No |
+
+### Use cases
+
+- **VMware exit**: replacement for VMware in SMB and mid-market
+- **Greenfield HCI**: new DCs, branch offices, remote sites
+- **VDI**: aDesk integrated with HCI
+- **Security-first**: organizations that require integrated security (NGFW, IPS, WAF)
+- **Asia-Pacific / EMEA**: strongest in Asia, expanding into Europe
+
+### Risks and limitations
+
+| Risk | Detail |
+|--------|--------|
+| **Geopolitical** | Chinese vendor; possible regulatory restrictions (GDPR, EU, NATO, government) |
+| **Ecosystem** | Smaller community than VMware/Proxmox, less documentation and fewer ISV certifications |
+| **Support** | Support primarily from Asia; a local partner is critical |
+| **Vendor lock-in** | Closed ecosystem (aSV + aSAN + aNet + aSEC), harder to mix with third-party products |
+| **References in the Czech Republic** | Very limited; a pilot is needed before production |
+
+### Migrating to/from Sangfor
+
+See the cross-hypervisor migration matrix above. A dedicated import tool exists for VMware → Sangfor; for other hypervisors, use standard `qemu-img` conversion.
+
 ## Storage in hypervisors

 See also: [STORAGE.md](STORAGE.md) for a detailed overview of storage protocols and configurations.
--- a/INFRASTRUCTURE.en.md
+++ b/INFRASTRUCTURE.en.md
@@ -4,9 +4,9 @@ This file has been split into separate areas:

 | Area | File |
 |--------|--------|
-| 🖥️ Hypervisors and virtualization | [HYPERVISORS.md](HYPERVISORS.md) |
-| 🏭 Data centers | [DATACENTERS.md](DATACENTERS.md) |
-| 💾 Storage | [STORAGE.md](STORAGE.md) |
-| 🔧 Hardware and servers | [HARDWARE.md](HARDWARE.md) |
+| 🖥️ Hypervisors and virtualization | [HYPERVISORS.en.md](HYPERVISORS.en.md) |
+| 🏭 Data centers | [DATACENTERS.en.md](DATACENTERS.en.md) |
+| 💾 Storage | [STORAGE.en.md](STORAGE.en.md) |
+| 🔧 Hardware and servers | [HARDWARE.en.md](HARDWARE.en.md) |

 *Last revision: 2026-06-03*
--- a/KUBERNETES.en.md
+++ b/KUBERNETES.en.md
@@ -0,0 +1,299 @@
+# ☸ Kubernetes — architecture, platforms, Cluster API
+
+## Overview
+
+Kubernetes (K8s) is an open-source container orchestrator and the de facto standard for deploying, scaling, and managing containerized applications. It is built on declarative configuration and control loops (reconciliation).
+
+## Kubernetes deployment methods
+
+| Method | Description | Control plane | Best for |
+|--------|-------------|--------------|----------|
+| **kubeadm** | Official K8s cluster bootstrap tool | Self-managed (stacked/external etcd) | On-prem, lab, learning |
+| **K3s** | Lightweight K8s (Rancher), single binary, embedded etcd/SQLite | Self-managed | Edge, IoT, low-resource, HA with embedded etcd |
+| **RKE2** | Rancher Kubernetes Engine 2, CIS-hardened, FIPS-ready | Self-managed | Enterprise on-prem, air-gapped, regulatory |
+| **OpenShift** | Red Hat enterprise K8s + operator lifecycle + SDN + routing | Self-managed (RHCOS) | Enterprise, multicluster, platform engineering |
+| **Vanilla K8s (CAPI)** | Cluster API — declarative provisioning and lifecycle management | Self-managed (CAPI managed) | Fleet management, GitOps, multi-provider |
+| **EKS** (AWS) | Managed K8s | AWS managed | AWS cloud-native, least ops |
+| **AKS** (Azure) | Managed K8s | Azure managed | Azure cloud-native |
+| **GKE** (GCP) | Managed K8s, Standard and Autopilot modes | GCP managed | GCP cloud-native |
+| **SKE** (Sangfor) | Managed K8s on Sangfor HCI | Vendor managed | Sangfor HCI ecosystem |
+
+---
+
+## Cluster API (CAPI)
+
+### What is Cluster API
+
+Cluster API is a Kubernetes sub-project (SIG Cluster-Lifecycle) that brings declarative APIs for provisioning, upgrading, and operating Kubernetes clusters. Instead of Terraform scripts or manual `kubeadm`, you define clusters as Kubernetes Custom Resources — `Cluster`, `Machine`, `MachineDeployment`, etc.
+
+Core principle: **A Kubernetes cluster that manages Kubernetes clusters.**
+
+### Architecture
+
+```
+┌─────────────────────────────────────────┐
+│           Management Cluster            │
+│                                         │
+│  ┌──────────────────────────────────┐   │
+│  │        CAPI Controllers          │   │
+│  │  ┌──────┐ ┌──────┐ ┌─────────┐  │   │
+│  │  │ Infra│ │Bootstrap│ │Control  │  │   │
+│  │  │ Prov │ │ Prov   │ │Plane Pr │  │   │
+│  │  └──────┘ └──────┘ └─────────┘  │   │
+│  └──────────────────────────────────┘   │
+│                                         │
+│  CR: Cluster, Machine, MachineDeployment│
+│  ...                                    │
+└────────────────┬────────────────────────┘
+                 │ CAPI controller
+                 │ creates / manages
+        ┌────────┴────────┐
+        ▼                 ▼
+┌───────────────┐  ┌───────────────┐
+│ Workload      │  │ Workload      │
+│ Cluster (dev) │  │ Cluster (prod)│
+│ ┌───┐ ┌───┐   │  │ ┌───┐ ┌───┐   │
+│ │ CP│ │ W │   │  │ │ CP│ │ W │   │
+│ └───┘ └───┘   │  │ └───┘ └───┘   │
+└───────────────┘  └───────────────┘
+```
+
+- **Management cluster**: a Kubernetes cluster running the CAPI controllers. It can be a small, dedicated admin cluster.
+- **Workload (managed) cluster**: a Kubernetes cluster managed by CAPI. Each one is represented by a `Cluster` custom resource (an instance of the CRD, not a CRD itself) in the management cluster.
+- **Machine**: an abstraction of a compute unit (VM, bare metal) that becomes a K8s node.
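+
+To make the relationship between these objects concrete, here is a minimal sketch of a `Cluster` resource as it could appear in the management cluster. The names (`dev`, `dev-control-plane`), the `clusters` namespace, the pod CIDR, and the choice of AWS as the infrastructure are illustrative assumptions, not values from a real deployment.
+
+```yaml
+apiVersion: cluster.x-k8s.io/v1beta1
+kind: Cluster
+metadata:
+  name: dev                  # hypothetical workload cluster name
+  namespace: clusters        # assumed namespace in the management cluster
+spec:
+  clusterNetwork:
+    pods:
+      cidrBlocks: ["192.168.0.0/16"]   # assumed pod CIDR
+  controlPlaneRef:           # reconciled by the control plane provider
+    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
+    kind: KubeadmControlPlane
+    name: dev-control-plane
+  infrastructureRef:         # reconciled by the infrastructure provider
+    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
+    kind: AWSCluster
+    name: dev
+```
+
+The two references are what tie the generic `Cluster` object to provider-specific resources, so swapping the provider mostly means changing the referenced kinds.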
+
+### Key CRDs (Custom Resource Definitions)
+
+| CRD | API group | Purpose |
+|-----|-----------|---------|
+| **Cluster** | `cluster.x-k8s.io` | Cluster representation (infra ref, control plane ref, networking) |
+| **Machine** | `cluster.x-k8s.io` | Individual node (VM/BM instance) |
+| **MachineDeployment** | `cluster.x-k8s.io` | Declarative scaling and rolling update of workers |
+| **MachineSet** | `cluster.x-k8s.io` | Replica set for Machines (lower-level) |
+| **MachineHealthCheck** | `cluster.x-k8s.io` | Auto-remediation (replace unhealthy nodes) |
+| **ClusterClass** | `cluster.x-k8s.io` | Cluster template for reuse |
+| **KubeadmControlPlane** | `controlplane.cluster.x-k8s.io` | Kubeadm-managed control plane (stacked/external etcd) |
+| **KubeadmConfig / KubeadmConfigTemplate** | `bootstrap.cluster.x-k8s.io` | Bootstrap configuration (kubeadm init/join) |
+
+### Provider model
+
+CAPI uses a three-layer provider model:
+
+#### 1. Infrastructure Provider
+Creates and manages infrastructure (VM, networks, LB, storage).
+
+| Provider | Platform | Status |
+|----------|----------|--------|
+| **AWS (CAPA)** | AWS EC2, VPC, ELB, EKS | Stable, SIG-sponsored |
+| **Azure (CAPZ)** | Azure VM, VNet, LB, AKS | Stable, SIG-sponsored |
+| **GCP (CAPG)** | GCP Compute, VPC, GKE | Beta |
+| **vSphere (CAPV)** | VMware vSphere | Stable |
+| **OpenStack (CAPO)** | OpenStack compute/network | Stable |
+| **Metal3** | Bare metal (Ironic) | Stable |
+| **Docker (CAPD)** | Docker containers (development) | Development/testing only |
+| **Akamai (Linode)** | Linode | Community |
+| **Azure Stack HCI** | Azure Stack HCI | Community |
+| **cloudscale** | cloudscale.ch | Community |
+| **Exoscale** | Exoscale | Community |
+| **IBM Cloud** | IBM Cloud | Community |
+| **Equinix Metal** | Equinix (ex Packet) | Community |
+| **Hetzner** | Hetzner Cloud | Community |
+| **OpenNebula** | OpenNebula | Community |
+
+#### 2. Bootstrap Provider
+Handles K8s initialization on a node (kubeadm init/join, TLS certs, tokens).
+
+| Provider | Description |
+|----------|-------------|
+| **Kubeadm** (built-in) | Standard kubeadm init/join, supports stacked/external etcd |
+| **EKS** | Bootstrap for EKS managed control plane (AWS) |
+| **K3s** | Lightweight K8s bootstrap (edge, IoT) |
+| **RKE2** | Rancher K8s bootstrap, CIS-hardened |
+| **Talos** | API-driven bootstrap (Sidero Labs), immutable OS |
+| **k0smotron** | K0s-based bootstrap + hosted control plane |
+| **MicroK8s** | Canonical MicroK8s bootstrap |
+| **Canonical Kubernetes** | Canonical K8s (snap-based) |
+
+#### 3. Control Plane Provider
+Manages control plane nodes.
+
+| Provider | Description |
+|----------|-------------|
+| **KubeadmControlPlane** (built-in) | Kubeadm-managed CP, stacked/external etcd |
+| **EKS** | AWS EKS managed control plane |
+| **Kamaji** | Hosted control plane (CP runs as deployment in management cluster) |
+| **K3s** | K3s control plane (edge-optimized) |
+| **RKE2** | RKE2 control plane |
+| **Talos** | Talos control plane, API-based management |
+| **k0smotron** | Hosted control plane (k0s-based) |
+| **Nested** | Nested control plane (CAPN): control plane components run as pods in the management cluster |
+
+### ClusterClass and Managed Topologies
+
+ClusterClass (part of the `v1beta1` API since CAPI v1.0) defines a reusable **cluster template**:
+
+```yaml
+apiVersion: cluster.x-k8s.io/v1beta1
+kind: ClusterClass
+metadata:
+  name: standard-aws-cluster
+spec:
+  controlPlane:
+    ref:
+      apiVersion: controlplane.cluster.x-k8s.io/v1beta1
+      kind: KubeadmControlPlaneTemplate
+      name: aws-cp-tmpl
+    machineInfrastructure:
+      ref:
+        kind: AWSMachineTemplate
+        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
+        name: aws-cp-machine-tmpl
+  workers:
+    machineDeployments:
+    - class: default-worker
+      template:
+        bootstrap:
+          ref:
+            apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
+            kind: KubeadmConfigTemplate
+            name: aws-worker-bootstrap-tmpl
+        infrastructure:
+          ref:
+            apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
+            kind: AWSMachineTemplate
+            name: aws-worker-machine-tmpl
+  variables:
+    - name: instanceType
+      required: true
+      schema:
+        openAPIV3Schema:
+          type: string
+          enum: ["t3.large", "m5.large", "m5.xlarge"]
+```
+
+Then create a cluster with variable overrides:
+
+```yaml
+apiVersion: cluster.x-k8s.io/v1beta1
+kind: Cluster
+metadata:
+  name: dev-team-alpha
+  namespace: clusters
+spec:
+  topology:
+    class: standard-aws-cluster
+    version: v1.30.2
+    controlPlane:
+      replicas: 1
+    workers:
+      machineDeployments:
+      - class: default-worker
+        name: md-0
+        replicas: 2
+    variables:
+      - name: instanceType
+        value: "m5.xlarge"
+```
+
+### Cluster lifecycle with CAPI
+
+| Phase | Action | CAPI mechanism |
+|-------|--------|----------------|
+| **Create** | `kubectl apply -f cluster.yaml` | Controller creates infra (VM, network), runs kubeadm init/join bootstrap |
+| **Scale** | Update `replicas` in MachineDeployment | Controller creates/removes Machine → VM → node join/drain |
+| **Upgrade** | Change `version` in KubeadmControlPlane / MachineDeployment | Rolling update: new CP node → upgrade → old drain & delete. Workers: MachineDeployment rolling update |
+| **Health check** | MachineHealthCheck | If node unhealthy > timeout, controller creates replacement Machine |
+| **Delete** | `kubectl delete cluster` | Controller drains, deletes VMs, cleans up infrastructure |
+| **Template update** | Change AWSMachineTemplate / KubeadmConfigTemplate | New Machines use the new template; existing Machines only affected via rolling update |
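+
+As a sketch of the Scale and Upgrade rows above, the following `MachineDeployment` shows the two fields that drive them. The cluster and deployment names reuse the `prod-us-east` example used elsewhere in this document; the template names (`prod-us-east-workers-bootstrap`, `prod-us-east-workers-machine`) and the replica count are hypothetical.
+
+```yaml
+apiVersion: cluster.x-k8s.io/v1beta1
+kind: MachineDeployment
+metadata:
+  name: prod-us-east-workers
+  namespace: clusters
+spec:
+  clusterName: prod-us-east
+  replicas: 3                # Scale: change this value and re-apply
+  selector:
+    matchLabels:
+      cluster.x-k8s.io/cluster-name: prod-us-east
+  template:
+    metadata:
+      labels:
+        cluster.x-k8s.io/cluster-name: prod-us-east
+    spec:
+      clusterName: prod-us-east
+      version: v1.30.2       # Upgrade: bump this to trigger a rolling update
+      bootstrap:
+        configRef:
+          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
+          kind: KubeadmConfigTemplate
+          name: prod-us-east-workers-bootstrap   # hypothetical template name
+      infrastructureRef:
+        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
+        kind: AWSMachineTemplate
+        name: prod-us-east-workers-machine       # hypothetical template name
+```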
+
+### Auto-remediation (MachineHealthCheck)
+
+```yaml
+apiVersion: cluster.x-k8s.io/v1beta1
+kind: MachineHealthCheck
+metadata:
+  name: prod-mhc
+  namespace: clusters
+spec:
+  clusterName: prod-us-east
+  selector:
+    matchLabels:
+      cluster.x-k8s.io/deployment-name: prod-us-east-workers
+  unhealthyConditions:
+  - type: Ready
+    status: "False"
+    timeout: 5m
+  - type: Ready
+    status: Unknown
+    timeout: 5m
+  maxUnhealthy: "40%"
+  nodeStartupTimeout: 10m
+```
+
+### CAPI + GitOps
+
+CAPI integrates naturally with GitOps:
+
+- **ArgoCD** — Cluster and MachineDeployment manifests in Git repo, ArgoCD applies them to the management cluster
+- **Flux** — `Kustomization` + `OCIRepository` for CAPI objects
+- **Crossplane** — can be combined: Crossplane provisions cloud resources (VPC, subnets), CAPI manages K8s clusters on top
+
+Pattern: a dedicated "fleet management" cluster running CAPI + ArgoCD. All workload clusters are defined as YAML in Git.
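+
+A minimal sketch of this pattern with Argo CD could look like the `Application` below, which syncs a directory of CAPI manifests into the management cluster. The repository URL, the `clusters` path, and the application name are placeholders; pruning is disabled here as a cautious assumption, so that removing a file from Git does not by itself delete a workload cluster.
+
+```yaml
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: fleet-clusters       # hypothetical application name
+  namespace: argocd
+spec:
+  project: default
+  source:
+    repoURL: https://git.example.com/platform/fleet.git   # placeholder repo
+    targetRevision: main
+    path: clusters           # directory holding Cluster/MachineDeployment YAML
+  destination:
+    server: https://kubernetes.default.svc   # the management cluster itself
+    namespace: clusters
+  syncPolicy:
+    automated:
+      prune: false           # do not delete clusters when manifests disappear
+      selfHeal: true         # revert manual drift on CAPI objects
+```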
+
+### CAPI for on-prem
+
+| Provider | Use case | Note |
+|----------|----------|------|
+| **Metal3** (Ironic) | Bare metal provisioning (PXE, IPMI, Redfish) | Automatically provisions BM servers as K8s nodes |
+| **CAPV (vSphere)** | VMware VMs as K8s nodes | Most common enterprise on-prem |
+| **CAPO (OpenStack)** | OpenStack VMs as K8s nodes | OpenStack-native |
+| **Nutanix (CAPX)** | Nutanix AHV/Prism | Community provider |
+
+### CAPI for edge
+
+| Provider | Use case | Note |
+|----------|----------|------|
+| **K3s bootstrap + control plane** | Lightweight K8s on edge devices | Single binary, SQLite/embedded etcd |
+| **RKE2 bootstrap + control plane** | Enterprise edge, air-gapped | CIS-hardened, FIPS |
+| **Talos** | Immutable OS, API-driven | Minimal footprint, no SSH |
+| **k0smotron** | Hosted control plane for edge clusters | CP runs in management cluster, worker on edge |
+
+### CAPI vs alternatives
+
+| Tool | Approach | CAPI advantage | CAPI disadvantage |
+|------|----------|----------------|-------------------|
+| **Terraform/Pulumi** | Imperative/declarative IaC | CAPI is K8s-native — same tool for apps and clusters; GitOps ready | Terraform has broader non-K8s resource support |
+| **kubeadm** | Manual or scripted | CAPI automates full lifecycle including upgrades and remediation | Higher complexity, requires management cluster |
+| **Rancher** | Web UI + API for K8s cluster management | CAPI is open-source, vendor-neutral | Rancher has GUI, monitoring, app catalog |
+| **OpenShift Hive/ACM** | Red Hat Advanced Cluster Management | CAPI is standard (SIG) — wider provider ecosystem | ACM has governance, policy, compliance |
+
+### Limitations and maturity
+
+- **Management cluster is a SPOF**: it needs its own HA and backup (etcd snapshots, certificates)
+- **CAPI is not a cluster autoscaler**: it manages cluster lifecycle but does not add or remove nodes in response to load on its own (run Cluster Autoscaler with its Cluster API provider for that; pod-level scaling is handled by HPA/VPA inside the workload cluster)
+- **Provider maturity varies**: AWS/Azure/vSphere/OpenStack are stable, GCP is beta, and some community providers are alpha
+- **etcd backup is not built-in**: it must be handled externally (Velero, etcd snapshot)
+- **CAPI does not handle applications**: it covers only K8s cluster lifecycle (monitoring, logging, and ingress are user-managed)
+- **Learning curve**: requires understanding the management cluster, the provider model, and the CRDs
+- **CAPI v1.13+ (2026)**: stable release, v1beta1 API is GA, ClusterClass stable, EKS/AKS/GKE managed control plane support
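+
+For the autoscaling limitation above, Cluster Autoscaler's Cluster API provider discovers scalable node groups through annotations on `MachineDeployment` objects. The fragment below is a sketch only: the deployment name and the size bounds are assumptions, and the rest of the spec is omitted.
+
+```yaml
+apiVersion: cluster.x-k8s.io/v1beta1
+kind: MachineDeployment
+metadata:
+  name: prod-us-east-workers
+  namespace: clusters
+  annotations:
+    # Bounds within which Cluster Autoscaler may change spec.replicas
+    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "2"
+    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "10"
+# spec omitted; once the autoscaler owns the replica count,
+# avoid pinning spec.replicas in Git to prevent GitOps drift fights
+```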
+
+### Recommended production CAPI stack
+
+| Component | Recommendation |
+|-----------|---------------|
+| **Management cluster** | K3s (small footprint) or kubeadm (3 nodes HA) |
+| **Infra provider** | CAPA (AWS) / CAPV (vSphere) / CAPO (OpenStack) — based on platform |
+| **Bootstrap/CP provider** | Kubeadm or RKE2 |
+| **GitOps** | ArgoCD or Flux |
+| **Backup** | Velero + restic/Ceph |
+| **Cluster autoscaler** | Cluster Autoscaler (via CAPI integration) |
+| **Network** | Cilium (CNI, installed into each workload cluster) |
+| **Secrets** | External Secrets Operator / Sealed Secrets |
+| **Monitoring** | Prometheus + Grafana (kube-prometheus-stack) |
+| **Ingress** | ingress-nginx / Kong / Traefik |
+
+## Sources
+
+Links, books and standards: [sources/infrastructure/sources.en.md](sources/infrastructure/sources.en.md)
+
+*Last revision: 2026-06-18*
--- a/KUBERNETES.md
+++ b/KUBERNETES.md
@@ -0,0 +1,299 @@
+# ☸ Kubernetes: architecture, platforms, Cluster API
+
+## Overview
+
+Kubernetes (K8s) is an open-source container orchestrator and the de facto standard for deploying, scaling, and managing containerized applications. It is built on declarative configuration and control loops (reconciliation).
+
+## Kubernetes deployment methods
+
+| Method | Description | Control plane | Best for |
+|--------|-------|---------------------|------------|
+| **kubeadm** | Official K8s cluster bootstrap tool | Self-managed (stacked/external etcd) | On-prem, lab, learning |
+| **K3s** | Lightweight K8s (Rancher), single binary, embedded etcd/SQLite | Self-managed | Edge, IoT, low-resource, HA with embedded etcd |
+| **RKE2** | Rancher Kubernetes Engine 2, CIS-hardened, FIPS-ready | Self-managed | Enterprise on-prem, air-gapped, regulatory |
+| **OpenShift** | Red Hat enterprise K8s + operator lifecycle + SDN + routing | Self-managed (RHCOS) | Enterprise, multicluster, platform engineering |
+| **Vanilla K8s (CAPI)** | Cluster API: declarative provisioning and lifecycle management | Self-managed (CAPI managed) | Fleet management, GitOps, multi-provider |
+| **EKS** (AWS) | Managed K8s | AWS managed | AWS cloud-native, least ops |
+| **AKS** (Azure) | Managed K8s | Azure managed | Azure cloud-native |
+| **GKE** (GCP) | Managed K8s, Standard and Autopilot modes | GCP managed | GCP cloud-native |
+| **SKE** (Sangfor) | Managed K8s on Sangfor HCI | Vendor managed | Sangfor HCI ecosystem |
+
+---
+
+## Cluster API (CAPI)
+
+### What is Cluster API
+
+Cluster API is a Kubernetes sub-project (SIG Cluster Lifecycle) that provides declarative APIs for provisioning, upgrading, and operating Kubernetes clusters. Instead of Terraform scripts or manual `kubeadm`, you define a cluster as Kubernetes custom resources: `Cluster`, `Machine`, `MachineDeployment`, and so on.
+
+Core principle: **A Kubernetes cluster that manages Kubernetes clusters.**
+
+### Architecture
+
+```
+┌─────────────────────────────────────────┐
+│           Management Cluster            │
+│                                         │
+│  ┌──────────────────────────────────┐   │
+│  │        CAPI Controllers          │   │
+│  │  ┌──────┐ ┌──────┐ ┌─────────┐  │   │
+│  │  │ Infra│ │Bootstrap│ │Control  │  │   │
+│  │  │ Prov │ │ Prov   │ │Plane Pr │  │   │
+│  │  └──────┘ └──────┘ └─────────┘  │   │
+│  └──────────────────────────────────┘   │
+│                                         │
+│  CR: Cluster, Machine, MachineDeployment│
+│  ...                                    │
+└────────────────┬────────────────────────┘
+                 │ CAPI controller
+                 │ creates / manages
+        ┌────────┴────────┐
+        ▼                 ▼
+┌───────────────┐  ┌───────────────┐
+│ Workload      │  │ Workload      │
+│ Cluster (dev) │  │ Cluster (prod)│
+│ ┌───┐ ┌───┐   │  │ ┌───┐ ┌───┐   │
+│ │ CP│ │ W │   │  │ │ CP│ │ W │   │
+│ └───┘ └───┘   │  │ └───┘ └───┘   │
+└───────────────┘  └───────────────┘
+```
+
+- **Management cluster**: a Kubernetes cluster running the CAPI controllers. It can be a dedicated "admin" cluster (often very small).
+- **Workload (managed) cluster**: a Kubernetes cluster managed by CAPI. Each one is represented by a `Cluster` custom resource in the management cluster.
+- **Machine**: an abstraction of a compute unit (VM, bare metal) that becomes a K8s node.
+
+### Key CRDs (Custom Resource Definitions)
+
+| CRD | API group | Purpose |
+|-----|------------|------|
+| **Cluster** | `cluster.x-k8s.io` | Cluster representation (infra ref, control plane ref, networking) |
+| **Machine** | `cluster.x-k8s.io` | Individual node (VM/BM instance) |
+| **MachineDeployment** | `cluster.x-k8s.io` | Declarative scaling and rolling update of workers |
+| **MachineSet** | `cluster.x-k8s.io` | Replica set for Machines (lower-level) |
+| **MachineHealthCheck** | `cluster.x-k8s.io` | Auto-remediation (automatic replacement of unhealthy nodes) |
+| **ClusterClass** | `cluster.x-k8s.io` | Template for creating clusters |
+| **KubeadmControlPlane** | `controlplane.cluster.x-k8s.io` | Kubeadm-managed control plane (stacked/external etcd) |
+| **KubeadmConfig / KubeadmConfigTemplate** | `bootstrap.cluster.x-k8s.io` | Bootstrap configuration (kubeadm init/join) |
+
+### Provider model
+
+CAPI uses a three-layer provider model:
+
+#### 1. Infrastructure Provider
+Creates and manages infrastructure (VMs, networks, LB, storage).
+
+| Provider | Platform | Status |
+|----------|-----------|--------|
+| **AWS (CAPA)** | AWS EC2, VPC, ELB, EKS | Stable, SIG-sponsored |
+| **Azure (CAPZ)** | Azure VM, VNet, LB, AKS | Stable, SIG-sponsored |
+| **GCP (CAPG)** | GCP Compute, VPC, GKE | Beta |
+| **vSphere (CAPV)** | VMware vSphere | Stable |
+| **OpenStack (CAPO)** | OpenStack compute/network | Stable |
+| **Metal3** | Bare metal (Ironic) | Stable |
+| **Docker (CAPD)** | Docker containers (development) | Development/testing only |
+| **Akamai (Linode)** | Linode | Community |
+| **Azure Stack HCI** | Azure Stack HCI | Community |
+| **cloudscale** | cloudscale.ch | Community |
+| **Exoscale** | Exoscale | Community |
+| **IBM Cloud** | IBM Cloud | Community |
+| **Equinix Metal** | Equinix (ex Packet) | Community |
+| **Hetzner** | Hetzner Cloud | Community |
+| **OpenNebula** | OpenNebula | Community |
+
+#### 2. Bootstrap Provider
+Handles K8s initialization on a node (kubeadm init/join, TLS certs, tokens).
+
+| Provider | Description |
+|----------|-------|
+| **Kubeadm** (built-in) | Standard kubeadm init/join, supports stacked/external etcd |
+| **EKS** | Bootstrap pro EKS managed control plane (AWS) |
+| **K3s** | Lightweight K8s bootstrap (edge, IoT) |
+| **RKE2** | Rancher K8s bootstrap, CIS-hardened |
+| **Talos** | API-driven bootstrap (Sidero Labs), immutable OS |
+| **k0smotron** | K0s-based bootstrap + hosted control plane |
+| **MicroK8s** | Canonical MicroK8s bootstrap |
+| **Canonical Kubernetes** | Canonical K8s (snap-based) |
+
+#### 3. Control Plane Provider
+Manages the control plane nodes.
+
+| Provider | Description |
+|----------|-------|
+| **KubeadmControlPlane** (built-in) | Kubeadm-managed control plane, stacked or external etcd |
+| **EKS** | AWS EKS managed control plane |
+| **Kamaji** | Hosted control plane (the control plane runs as a Deployment in the management cluster) |
+| **K3s** | K3s control plane (edge-optimized) |
+| **RKE2** | RKE2 control plane |
+| **Talos** | Talos control plane, API-based management |
+| **k0smotron** | Hosted control plane (k0s-based) |
+| **Nested** | Nested control plane (control plane components run as pods in the management cluster) |
+
+### ClusterClass and Managed Topologies
+
+ClusterClass (available since CAPI v1.0 with the v1beta1 API; depending on the release it may still require the `ClusterTopology` feature gate) lets you define a **cluster template**:
+
+```yaml
+apiVersion: cluster.x-k8s.io/v1beta1
+kind: ClusterClass
+metadata:
+  name: standard-aws-cluster
+spec:
+  controlPlane:
+    ref:
+      apiVersion: controlplane.cluster.x-k8s.io/v1beta1
+      kind: KubeadmControlPlaneTemplate
+      name: aws-cp-tmpl
+    machineInfrastructure:
+      ref:
+        kind: AWSMachineTemplate
+        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
+        name: aws-cp-machine-tmpl
+  workers:
+    machineDeployments:
+    - class: default-worker
+      template:
+        bootstrap:
+          ref:
+            apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
+            kind: KubeadmConfigTemplate
+            name: aws-worker-bootstrap-tmpl
+        infrastructure:
+          ref:
+            apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
+            kind: AWSMachineTemplate
+            name: aws-worker-machine-tmpl
+  variables:
+    - name: instanceType
+      required: true
+      schema:
+        openAPIV3Schema:
+          type: string
+          enum: ["t3.large", "m5.large", "m5.xlarge"]
+```
+
+A cluster can then be created from the class, supplying values for its variables:
+
+```yaml
+apiVersion: cluster.x-k8s.io/v1beta1
+kind: Cluster
+metadata:
+  name: dev-team-alpha
+  namespace: clusters
+spec:
+  topology:
+    class: standard-aws-cluster
+    version: v1.30.2
+    controlPlane:
+      replicas: 1
+    workers:
+      machineDeployments:
+      - class: default-worker
+        name: md-0
+        replicas: 2
+    variables:
+      - name: instanceType
+        value: "m5.xlarge"
+```
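+
+The `instanceType` variable declared in the ClusterClass has no effect until a patch copies its value into a template. The fragment below is a minimal sketch of such a patch, assuming the worker `AWSMachineTemplate` exposes the instance type at `/spec/template/spec/instanceType`; the patch name `workerInstanceType` is made up for illustration. It would sit under `spec` of the `standard-aws-cluster` ClusterClass, next to `variables`:
+
+```yaml
+# Fragment of ClusterClass.spec (not a complete manifest)
+patches:
+  - name: workerInstanceType          # hypothetical patch name
+    definitions:
+      - selector:
+          apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
+          kind: AWSMachineTemplate
+          matchResources:
+            machineDeploymentClass:
+              names:
+                - default-worker
+        jsonPatches:
+          # Overwrite the instance type in the worker template with the variable value
+          - op: replace
+            path: /spec/template/spec/instanceType
+            valueFrom:
+              variable: instanceType
+```
+
+With a patch like this, the `m5.xlarge` value set in the Cluster topology would reach the worker machines only; the control plane template would need its own selector (`matchResources.controlPlane: true`).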
+
+### Cluster lifecycle with CAPI
+
+| Phase | Action | CAPI mechanism |
+|------|------|------------------|
+| **Create** | `kubectl apply -f cluster.yaml` | Controllers create the infrastructure (VMs, network) and bootstrap the nodes with kubeadm init/join |
+| **Scale** | Change `replicas` in the MachineDeployment | The controller creates or removes Machines: new VMs join as nodes, removed nodes are drained first |
+| **Upgrade** | Change `version` in KubeadmControlPlane / MachineDeployment | Control plane: rolling update, where a new node joins at the new version before an old one is drained and deleted. Workers: MachineDeployment rolling update |
+| **Health check** | MachineHealthCheck | If a node stays unhealthy longer than the timeout, the controller creates a replacement Machine |
+| **Delete** | `kubectl delete cluster` | Controllers drain the nodes, delete the VMs, and clean up the infrastructure |
+| **Template update** | Change the AWSMachineTemplate / KubeadmConfigTemplate (typically by creating a new template and pointing the reference at it) | New Machines are created from the new template; existing Machines change only through a rolling update |
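+
+The Scale and Upgrade rows come down to two fields on the MachineDeployment. The fragment below is a sketch only: it reuses the `prod-us-east-workers` name from the MachineHealthCheck example, the replica count and surge settings are assumed values, and required fields such as `selector`, `bootstrap.configRef`, and `infrastructureRef` are left out for brevity:
+
+```yaml
+# Fragment of a MachineDeployment (not a complete manifest)
+apiVersion: cluster.x-k8s.io/v1beta1
+kind: MachineDeployment
+metadata:
+  name: prod-us-east-workers
+  namespace: clusters
+spec:
+  clusterName: prod-us-east
+  # Scale: raise or lower this number
+  replicas: 5
+  strategy:
+    type: RollingUpdate
+    rollingUpdate:
+      # Add one new Machine before removing an old one
+      maxSurge: 1
+      maxUnavailable: 0
+  template:
+    spec:
+      # Upgrade: changing the version triggers a rolling replacement of workers
+      version: v1.30.2
+```
+
+Setting `maxUnavailable: 0` keeps worker capacity constant during an upgrade at the cost of temporarily running one extra VM.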
+
+### Auto-remediation (MachineHealthCheck)
+
+```yaml
+apiVersion: cluster.x-k8s.io/v1beta1
+kind: MachineHealthCheck
+metadata:
+  name: prod-mhc
+  namespace: clusters
+spec:
+  clusterName: prod-us-east
+  selector:
+    matchLabels:
+      cluster.x-k8s.io/deployment-name: prod-us-east-workers
+  unhealthyConditions:
+  - type: Ready
+    status: "False"
+    timeout: 5m
+  - type: Ready
+    status: Unknown
+    timeout: 5m
+  maxUnhealthy: "40%"
+  nodeStartupTimeout: 10m
+```
+
+### CAPI + GitOps
+
+CAPI integrates naturally with GitOps:
+
+- **ArgoCD**: Cluster and MachineDeployment manifests live in a Git repository, and ArgoCD applies them to the management cluster
+- **Flux**: `Kustomization` + `OCIRepository` for CAPI objects
+- **Crossplane**: can be combined, with Crossplane provisioning the cloud resources (VPC, subnets) and CAPI building the Kubernetes clusters on top of them
+
+Pattern: a dedicated "fleet management" cluster runs CAPI and ArgoCD, and every workload cluster is defined as YAML in Git.
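+
+The fleet-management pattern can be sketched as a single ArgoCD `Application` that syncs a directory of Cluster manifests into the management cluster. The repository URL, path, and application name below are hypothetical placeholders:
+
+```yaml
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: workload-clusters             # hypothetical name
+  namespace: argocd
+spec:
+  project: default
+  source:
+    repoURL: https://git.example.com/platform/fleet.git   # hypothetical repository
+    targetRevision: main
+    path: clusters                    # directory holding the Cluster manifests
+  destination:
+    # The management cluster that ArgoCD itself runs in
+    server: https://kubernetes.default.svc
+    namespace: clusters
+  syncPolicy:
+    automated:
+      # Do not delete a workload cluster just because its manifest left Git
+      prune: false
+      selfHeal: true
+```
+
+Leaving `prune` disabled is a deliberately cautious choice here: deleting a Cluster object tears down the whole workload cluster, so removals are better done as an explicit step.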
+
+### CAPI for on-prem
+
+| Provider | Use case | Notes |
+|----------|----------|----------|
+| **Metal3** (Ironic) | Bare metal provisioning (PXE, IPMI, Redfish) | Automatically provisions bare-metal servers as K8s nodes |
+| **CAPV (vSphere)** | VMware VMs as K8s nodes | Covers most enterprise on-prem environments |
+| **CAPO (OpenStack)** | OpenStack VMs as K8s nodes | OpenStack-native |
+| **Nutanix (CAPX)** | Nutanix AHV/Prism | Community provider |
+
+### CAPI for edge
+
+| Provider | Use case | Notes |
+|----------|----------|----------|
+| **K3s bootstrap + control plane** | Lightweight K8s on edge devices | Single binary, SQLite/embedded etcd |
+| **RKE2 bootstrap + control plane** | Enterprise edge, air-gapped | CIS-hardened, FIPS |
+| **Talos** | Immutable OS, API-driven | Minimal footprint, no SSH |
+| **k0smotron** | Hosted control plane for edge clusters | Control plane runs in the management cluster, workers at the edge |
+
+### CAPI vs. alternatives
+
+| Tool | Approach | CAPI advantage | CAPI disadvantage |
+|---------|---------|-------------|---------------|
+| **Terraform/Pulumi** | Imperative/declarative IaC | CAPI is K8s-native: the same tooling manages apps and clusters, and it is GitOps-ready | Terraform supports a wider range of non-K8s resources |
+| **kubeadm** | Manual or scripted | CAPI automates the whole lifecycle, including upgrades and remediation | Higher complexity, requires a management cluster |
+| **Rancher** | Web UI + API for managing K8s clusters | CAPI is open source and vendor-neutral | Rancher ships a GUI, monitoring, and an app catalog |
+| **OpenShift Hive/ACM** | Red Hat Advanced Cluster Management | CAPI is a Kubernetes SIG standard with a broader provider ecosystem | ACM adds governance, policy, and compliance |
+
+### Limitations and maturity
+
+- **The management cluster is a SPOF**: it needs its own HA and backups (etcd backups, certificates)
+- **CAPI is not a cluster autoscaler**: it manages cluster lifecycle and does not add or remove nodes in response to load (Cluster Autoscaler is deployed separately for that)
+- **Provider maturity varies**: AWS/Azure/vSphere/OpenStack are listed as stable in the provider table, GCP as beta, and some community providers are alpha
+- **etcd backup is not built in**: it has to be handled externally (etcd snapshots, Velero)
+- **CAPI does not manage applications**: it covers only the lifecycle of K8s clusters (monitoring, logging, and ingress remain the user's responsibility)
+- **Learning curve**: requires a management cluster and an understanding of the provider model and CRDs
+- **CAPI v1.13+ (2026)**: stable release line; the v1beta1 API is treated as GA despite its name, ClusterClass is stable, and managed control planes (EKS/AKS/GKE) are supported
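+
+For the autoscaling gap noted in the list, Cluster Autoscaler's Cluster API provider discovers node groups through annotations on the MachineDeployment. A minimal sketch, reusing the `prod-us-east-workers` name from the MachineHealthCheck example and assuming bounds of 2 to 10 nodes:
+
+```yaml
+# Fragment: only the metadata relevant to autoscaling is shown
+apiVersion: cluster.x-k8s.io/v1beta1
+kind: MachineDeployment
+metadata:
+  name: prod-us-east-workers
+  namespace: clusters
+  annotations:
+    # Assumed bounds; the autoscaler keeps spec.replicas inside this range
+    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "2"
+    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "10"
+```
+
+Once the autoscaler owns the replica count, `replicas` should not also be pinned in Git, otherwise GitOps reconciliation and the autoscaler will keep overwriting each other.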
+
+### Recommended stack for CAPI in production
+
+| Component | Recommendation |
+|------------|------------|
+| **Management cluster** | K3s (small footprint) or kubeadm (3-node HA) |
+| **Infra provider** | CAPA (AWS) / CAPV (vSphere) / CAPO (OpenStack), depending on the platform |
+| **Bootstrap/CP provider** | Kubeadm or RKE2 |
+| **GitOps** | ArgoCD or Flux |
+| **Backup** | Velero + restic/Ceph |
+| **Cluster autoscaler** | Cluster Autoscaler (using its Cluster API provider) |
+| **Network** | Cilium (works with CAPI-provisioned clusters) |
+| **Secrets** | External Secrets Operator / Sealed Secrets |
+| **Monitoring** | Prometheus + Grafana (kube-prometheus-stack) |
+| **Ingress** | ingress-nginx / Kong / Traefik |
+
+## Sources
+
+Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
+
+*Last revision: 2026-06-18*
--- a/MESSAGING.en.md
+++ b/MESSAGING.en.md
@@ -0,0 +1,275 @@
+# 📨 Messaging and streaming platforms
+
+## Platform overview
+
+| Platform | Type | Language | Protocol | Persistence | Use case |
+|-----------|-----|-------|----------|-------------|----------|
+| **Apache Kafka** | Distributed event store | Java/Scala | Binary (TCP) | Disk (log) | Event streaming, data pipeline, log aggregation |
+| **RabbitMQ** | Message broker | Erlang | AMQP 0-9-1, MQTT, STOMP | Disk / RAM | Application messaging, task queue, RPC |
+| **Apache Pulsar** | Distributed messaging + streaming | Java | Binary (TCP) + REST | Disk (segmented log) | Streaming + queue in one, multi-tenant |
+| **NATS** | Lightweight messaging | Go | NATS protocol (TCP) | Memory / JetStream (disk) | Microservices, IoT, edge, low-latency |
+| **AWS SQS** | Managed queue | — | HTTPS | Managed | Decoupling services, serverless |
+| **AWS SNS** | Managed pub/sub | — | HTTPS, SQS, Lambda, email | Managed | Push notifications, fanout |
+| **Azure Service Bus** | Managed messaging | — | AMQP, HTTPS | Managed | Enterprise messaging, sessions, transactions |
+| **Google Pub/Sub** | Managed streaming | — | gRPC, REST | Managed | Event-driven, data pipeline |
+| **Red Hat AMQ 7** (Artemis) | Message broker | Java | AMQP, MQTT, STOMP, OpenWire | Disk | Enterprise, JMS, high-availability |
+| **Oracle Service Bus (OSB)** | Enterprise ESB | Java | HTTP/S, JMS, SOAP, REST, MQ, FTP, AQ | Managed (WebLogic) | Enterprise integration, SOA, protocol mediation, routing |
+
+---
+
+## Platform details
+
+### Apache Kafka
+
+**Architecture:**
+
+```
+Producer ──► Topic ──► Partition ──► Consumer Group
+                │
+                ├── Partition 0 ──► leader on Broker 1, followers on Brokers 2 and 3
+                ├── Partition 1 ──► leader on Broker 2, followers on Brokers 1 and 3
+                └── Partition 2 ──► leader on Broker 3, followers on Brokers 1 and 2
+```
+
+| Concept | Description |
+|---------|-------|
+| **Topic** | Logical message category |
+| **Partition** | Append-only log, ordered sequence of messages |
+| **Broker** | Server in Kafka cluster |
+| **Producer** | Publishes messages to topic |
+| **Consumer** | Reads messages from partition (within consumer group) |
+| **Consumer Group** | Group of consumers sharing topic reading |
+| **Offset** | Position in partition (tracked by consumer) |
+| **KRaft** | Raft-based controller quorum that replaces ZooKeeper (production-ready since Kafka 3.3, mandatory from Kafka 4.0) |
+
+**Replication and HA:**
+
+| Parameter | Value |
+|----------|---------|
+| Replication factor | 2–3 (typically 3 for production) |
+| ISR (In-Sync Replicas) | Set of replicas that are fully caught up with the leader |
+| Min ISR | Minimum ISR for acknowledging writes (acks=all) |
+| acks=0 | Fire-and-forget (fastest, possible data loss) |
+| acks=1 | Write acknowledged by leader (compromise) |
+| acks=all | Write acknowledged by all ISR (safest) |
+| Leader failover | Automatic election of new leader from ISR |
+
+**Important configuration:**
+
+```properties
+# Production (broker defaults)
+# The replication factor is set per topic; this default applies to auto-created topics
+default.replication.factor=3
+min.insync.replicas=2
+
+# Retention
+# 7 days
+log.retention.hours=168
+# -1 = unlimited (or set a size limit)
+log.retention.bytes=-1
+# 1 GB per segment
+log.segment.bytes=1073741824
+
+# Performance
+# Default partition count for new topics; adjust per need (scale-out)
+num.partitions=3
+# One of: snappy, gzip, lz4, zstd
+compression.type=snappy
+```
+
+**Partitioning strategies:**
+
+| Strategy | Key | Advantage | Disadvantage |
+|----------|------|--------|----------|
+| Round-robin | null | Even distribution | Per-key ordering lost |
+| Key-based | user_id, order_id | Same key → same partition | Uneven distribution (hot keys) |
+| Custom partitioner | Custom logic | Per use-case optimization | More complex maintenance |
+
+### RabbitMQ
+
+**Architecture:**
+
+```
+Producer ──► Exchange ──► Binding ──► Queue ──► Consumer
+                  │
+      ┌───────────┼───────────┐
+      ▼           ▼           ▼
+  Direct      Topic      Fanout
+  Exchange   Exchange   Exchange
+```
+
+| Concept | Description |
+|---------|-------|
+| **Exchange** | Receives messages from producer, routes to queue |
+| **Binding** | Exchange → queue link with routing key |
+| **Queue** | FIFO message queue (consumed by consumer) |
+| **Virtual Host (vhost)** | Tenant isolation within a single cluster |
+| **Publisher Confirm** | Broker acknowledges message receipt |
+| **Consumer Ack** | Consumer acknowledges message processing |
+
+**Exchange types:**
+
+| Type | Routing | Use case |
+|-----|---------|----------|
+| **Direct** | routing_key = binding_key | Task queue, point-to-point |
+| **Topic** | routing_key match binding pattern (wildcard `*`, `#`) | Pub/sub with filtering |
+| **Fanout** | All bound queues | Broadcast, event notification |
+| **Headers** | AMQP headers match | Complex routing (not routing key dependent) |
+
+**Queue types:**
+
+```properties
+# Classic queue (not replicated; classic mirrored queues were removed in RabbitMQ 4.0)
+x-queue-type: classic
+
+# Quorum Queue (recommended for production)
+x-queue-type: quorum
+x-quorum-initial-group-size: 3
+x-dead-letter-exchange: dlx
+
+# Stream Queue (for large backlogs)
+x-queue-type: stream
+x-max-length-bytes: 1073741824
+```
+
+**HA and clustering:**
+
+| Mode | Description | Use case |
+|-------|-------|----------|
+| **Quorum Queues** | Raft-based replication (3–5 node), auto failover | Production, HA messaging |
+| **Federation** | Async message forwarding between independent RabbitMQ clusters | Multi-region, DR |
+| **Shovel** | Point-to-point message forwarding (Federation at queue level) | Migration, specific routing |
+| **Warm Standby (DR)** | Secondary cluster kept ready and promoted on failover | Warm DR |
+
+### Apache Pulsar
+
+**Unique architecture (compute/storage separation):**
+
+```
+┌──────────────┐    ┌──────────────┐    ┌──────────────┐
+│  Producer    │    │  Consumer    │    │  Consumer    │
+└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
+       │                   │                   │
+┌──────▼───────────────────▼───────────────────▼──────┐
+│               Broker (stateless)                    │
+│         Subscription: Exclusive / Shared / Failover │
+└──────────────────────┬──────────────────────────────┘
+                       │
+┌──────────────────────▼──────────────────────────────┐
+│           BookKeeper (stateful storage)              │
+│  ├── Bookie 1  ├── Bookie 2  ├── Bookie 3  ├── ... │
+│  └── Ledger (append-only, segmented log)            │
+└─────────────────────────────────────────────────────┘
+```
+
+| Concept | Description |
+|---------|-------|
+| **Topic** | Logical category (partitioned or non-partitioned) |
+| **Subscription** | Delivery mode (Exclusive, Shared, Failover, Key_Shared) |
+| **Ledger** | Storage unit in BookKeeper (append-only) |
+| **Bookie** | Storage node (BookKeeper) |
+| **Managed Ledger** | Segmented log with cache and retention |
+
+**Advantages over Kafka:**
+- Compute/storage separation — independent scaling
+- Geo-replication built-in (native)
+- Multi-tenant (namespaces, isolation)
+- TTL, retry, dead letter topic (built-in)
+- At-least-once and effectively-once delivery
+
+### NATS
+
+| Feature | Description |
+|---------|-------|
+| **Core NATS** | Pub/sub, request-reply, < 1 ms latency |
+| **JetStream** | Persistence, exactly-once, key-value store, object store |
+| **Leaf nodes** | Hierarchical cluster connection |
+| **Super-cluster** | Multi-region clustering (global) |
+
+**Use case:** IoT, edge computing, microservices communication, low-latency messaging.
+
+### Oracle Service Bus (OSB)
+
+Part of Oracle SOA Suite, runs on WebLogic Server. Enterprise service bus for integration in Oracle-heavy environments.
+
+| Concept | Description |
+|---------|-------|
+| **Proxy Service** | Inbound endpoint (HTTP, JMS, MQ, SOAP, REST) |
+| **Business Service** | Target backend service |
+| **Pipeline** | Message processing — routing, transformation, validation |
+| **Split-Join** | Parallel/sequential orchestration of multiple services |
+| **Reporting** | Message tracking, SLA monitoring |
+
+**Key features:**
+- **Protocol mediation** — translation between SOAP/REST/JMS/MQ/FTP
+- **Message transformation** — XSLT, XQuery, MFL (non-XML)
+- **Throttling, SLA, alerting** — built-in
+- **Oracle AQ (Advanced Queuing)** — integration with Oracle DB queues
+- **XPath, XQuery, XSLT 2.0/3.0** — native support
+- **Error handling** — fault policies, error queues, retry
+
+**Use case:** Enterprise SOA, Oracle DB → Kafka bridging, legacy mainframe wrapping, B2B integration.
+
+**Alternatives:** IBM Integration Bus (IIB, now IBM App Connect Enterprise), MuleSoft Anypoint, WSO2 EI, Apache Camel / ServiceMix.
+
+---
+
+## Platform comparison
+
+### Performance and scaling
+
+| Platform | Max throughput | Latency (P99) | Messages/s (1 broker) | Scaling |
+|-----------|--------------|---------------|-------------------------|-----------|
+| **Kafka** | > 1 GB/s | 2–10 ms | ~1,000,000 | Partitions (horizontal) |
+| **Pulsar** | > 1 GB/s | 5–15 ms | ~1,000,000 | Brokers + Bookies |
+| **RabbitMQ** | ~100 MB/s | < 1 ms (RAM) | ~100,000 | Clustering (node) |
+| **NATS** | > 10 GB/s | < 0.5 ms | ~10,000,000 | Clustering + Leaf nodes |
+| **OSB** | < 1 GB/s | 10–100 ms | ~10,000 | Vertical (WebLogic cluster) |
+
+### Delivery guarantees
+
+| Platform | At most once | At least once | Exactly once | Ordering |
+|-----------|-------------|---------------|-------------|----------|
+| **Kafka** | Yes | Yes (acks=all + min.insync.replicas) | Yes (idempotent + transactional) | Per partition |
+| **Pulsar** | Yes | Yes | Yes (dedup + transactional) | Per partition |
+| **RabbitMQ** | Yes | Yes (Publisher Confirm + Consumer Ack) | Limited | Per queue |
+| **NATS** | Yes | Yes (JetStream) | Limited | Per subject |
+| **OSB** | Yes | Yes (XA transactions, exactly-once delivery) | Yes (XA + WS-AT) | Per pipeline |
+
+### When to use what
+
+| Use case | Recommended platform | Reasoning |
+|----------|---------------------|------------|
+| **Event sourcing / audit log** | Kafka, Pulsar | Append-only log, high throughput, replay |
+| **CDC (Change Data Capture)** | Kafka (Kafka Connect + Debezium) | Connector ecosystem |
+| **Task queue (job processing)** | RabbitMQ, SQS | Dead letter, retry, priority, scheduling |
+| **API messaging / microservices** | NATS, RabbitMQ | Low latency, simplicity |
+| **Data pipeline (ETL)** | Kafka (ksqlDB, Kafka Streams) | Stream processing in platform |
+| **IoT / Edge** | NATS, MQTT (RabbitMQ) | Lightweight, leaf nodes |
+| **Enterprise SOA / EAI** | OSB, IBM IIB, MuleSoft | Protocol mediation, XA, B2B, legacy wrapping |
+| **Multi-tenant cloud** | Pulsar | Native multi-tenant, geo-replication |
+| **Serverless / event-driven** | SQS/SNS, Pub/Sub | Managed, auto-scaling |
+
+---
+
+## DR and high availability
+
+See [DATACENTERS.en.md](DATACENTERS.en.md) — section "Impact of individual technologies on DC topology selection" for detailed DR mapping per platform.
+
+### Best practices
+
+- **Don't lose messages in the queue**: use explicit acknowledgements instead of auto-ack
+- **Dead letter queue**: give every main queue a DLQ for undeliverable messages
+- **Monitor lag**: consumer lag is a key metric (Kafka: `records-lag-max` in the `kafka.consumer:type=consumer-fetch-manager-metrics` MBean)
+- **Idempotent consumer**: the same message may be delivered twice
+- **Retry with backoff**: exponential backoff on processing failure
+- **Schema registry**: avoid deserialization errors (Avro, Protobuf, JSON Schema)
+- **Encryption**: TLS in transit, encryption at rest (Apache Kafka has no built-in at-rest encryption, so use disk/volume encryption or client-side encryption)
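+
+Consumer lag monitoring can be turned into an alert. The rule below is a sketch for Prometheus, assuming lag is exported as `kafka_consumergroup_lag` (the metric name used by the community kafka_exporter; other exporters use different names). The alert name, the 10,000-message threshold, and the 10-minute window are placeholders to tune per topic:
+
+```yaml
+groups:
+  - name: kafka-consumer-lag
+    rules:
+      - alert: KafkaConsumerLagHigh        # hypothetical alert name
+        # Sum lag over partitions for each consumer group and topic
+        expr: sum by (consumergroup, topic) (kafka_consumergroup_lag) > 10000
+        # Fire only if the lag stays above the threshold for 10 minutes
+        for: 10m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Consumer group {{ $labels.consumergroup }} is lagging on {{ $labels.topic }}"
+```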
+
+---
+
+## Related
+
+- [DATACENTERS.en.md](DATACENTERS.en.md) — DR topology, per-platform mapping
+- [CLOUD.en.md](CLOUD.en.md) — managed messaging (SQS, SNS, Service Bus, Pub/Sub)
+
+## Sources
+
+Links, books, and standards: [sources/infrastructure/sources.en.md](sources/infrastructure/sources.en.md)
+
+*Last revision: 2026-06-12*
--- a/MESSAGING.md
+++ b/MESSAGING.md
@@ -0,0 +1,275 @@
+# 📨 Messaging and streaming platforms
+
+## Platform overview
+
+| Platform | Type | Language | Protocol | Persistence | Use case |
+|-----------|-----|-------|----------|-------------|----------|
+| **Apache Kafka** | Distributed event store | Java/Scala | Binary (TCP) | Disk (log) | Event streaming, data pipeline, log aggregation |
+| **RabbitMQ** | Message broker | Erlang | AMQP 0-9-1, MQTT, STOMP | Disk / RAM | Application messaging, task queue, RPC |
+| **Apache Pulsar** | Distributed messaging + streaming | Java | Binary (TCP) + REST | Disk (segmented log) | Streaming + queue in one, multi-tenant |
+| **NATS** | Lightweight messaging | Go | NATS protocol (TCP) | Memory / JetStream (disk) | Microservices, IoT, edge, low-latency |
+| **AWS SQS** | Managed queue | — | HTTPS | Managed | Decoupling services, serverless |
+| **AWS SNS** | Managed pub/sub | — | HTTPS, SQS, Lambda, email | Managed | Push notifications, fanout |
+| **Azure Service Bus** | Managed messaging | — | AMQP, HTTPS | Managed | Enterprise messaging, sessions, transactions |
+| **Google Pub/Sub** | Managed streaming | — | gRPC, REST | Managed | Event-driven, data pipeline |
+| **Red Hat AMQ 7** (Artemis) | Message broker | Java | AMQP, MQTT, STOMP, OpenWire | Disk | Enterprise, JMS, high-availability |
+| **Oracle Service Bus (OSB)** | Enterprise ESB | Java | HTTP/S, JMS, SOAP, REST, MQ, FTP, AQ | Managed (WebLogic) | Enterprise integration, SOA, protocol mediation, routing |
+
+---
+
+## Platform details
+
+### Apache Kafka
+
+**Architecture:**
+
+```
+Producer ──► Topic ──► Partition ──► Consumer Group
+                │
+                ├── Partition 0 ──► leader on Broker 1, followers on Brokers 2 and 3
+                ├── Partition 1 ──► leader on Broker 2, followers on Brokers 1 and 3
+                └── Partition 2 ──► leader on Broker 3, followers on Brokers 1 and 2
+```
+
+| Concept | Description |
+|---------|-------|
+| **Topic** | Logical message category |
+| **Partition** | Append-only log, ordered sequence of messages |
+| **Broker** | Server in Kafka cluster |
+| **Producer** | Publishes messages to topic |
+| **Consumer** | Reads messages from partition (within consumer group) |
+| **Consumer Group** | Group of consumers sharing topic reading |
+| **Offset** | Position in partition (tracked by consumer) |
+| **KRaft** | Raft-based controller quorum that replaces ZooKeeper (production-ready since Kafka 3.3, mandatory from Kafka 4.0) |
+
+**Replication and HA:**
+
+| Parameter | Value |
+|----------|---------|
+| Replication factor | 2–3 (typically 3 for production) |
+| ISR (In-Sync Replicas) | Set of replicas that are fully caught up with the leader |
+| Min ISR | Minimum ISR for acknowledging writes (acks=all) |
+| acks=0 | Fire-and-forget (fastest, possible data loss) |
+| acks=1 | Write acknowledged by leader (compromise) |
+| acks=all | Write acknowledged by all ISR (safest) |
+| Leader failover | Automatic election of new leader from ISR |
+
+**Important configuration:**
+
+```properties
+# Production (broker defaults)
+# The replication factor is set per topic; this default applies to auto-created topics
+default.replication.factor=3
+min.insync.replicas=2
+
+# Retention
+# 7 days
+log.retention.hours=168
+# -1 = unlimited (or set a size limit)
+log.retention.bytes=-1
+# 1 GB per segment
+log.segment.bytes=1073741824
+
+# Performance
+# Default partition count for new topics; adjust per need (scale-out)
+num.partitions=3
+# One of: snappy, gzip, lz4, zstd
+compression.type=snappy
+```
+
+**Partitioning strategies:**
+
+| Strategy | Key | Advantage | Disadvantage |
+|----------|------|--------|----------|
+| Round-robin | null | Even distribution | Per-key ordering lost |
+| Key-based | user_id, order_id | Same key → same partition | Uneven distribution (hot keys) |
+| Custom partitioner | Custom logic | Per use-case optimization | More complex maintenance |
+
+### RabbitMQ
+
+**Architecture:**
+
+```
+Producer ──► Exchange ──► Binding ──► Queue ──► Consumer
+                  │
+      ┌───────────┼───────────┐
+      ▼           ▼           ▼
+  Direct      Topic      Fanout
+  Exchange   Exchange   Exchange
+```
+
+| Concept | Description |
+|---------|-------|
+| **Exchange** | Receives messages from producer, routes to queue |
+| **Binding** | Exchange → queue link with routing key |
+| **Queue** | FIFO message queue (consumed by consumer) |
+| **Virtual Host (vhost)** | Tenant isolation within a single cluster |
+| **Publisher Confirm** | Broker acknowledges message receipt |
+| **Consumer Ack** | Consumer acknowledges message processing |
+
+**Exchange types:**
+
+| Type | Routing | Use case |
+|-----|---------|----------|
+| **Direct** | routing_key = binding_key | Task queue, point-to-point |
+| **Topic** | routing_key match binding pattern (wildcard `*`, `#`) | Pub/sub with filtering |
+| **Fanout** | All bound queues | Broadcast, event notification |
+| **Headers** | AMQP headers match | Complex routing (not routing key dependent) |
+
+**Queue types:**
+
+```properties
+# Classic queue (not replicated; classic mirrored queues were removed in RabbitMQ 4.0)
+x-queue-type: classic
+
+# Quorum Queue (recommended for production)
+x-queue-type: quorum
+x-quorum-initial-group-size: 3
+x-dead-letter-exchange: dlx
+
+# Stream Queue (for large backlogs)
+x-queue-type: stream
+x-max-length-bytes: 1073741824
+```
+
+**HA and clustering:**
+
+| Mode | Description | Use case |
+|-------|-------|----------|
+| **Quorum Queues** | Raft-based replication (3–5 node), auto failover | Production, HA messaging |
+| **Federation** | Async message forwarding between independent RabbitMQ clusters | Multi-region, DR |
+| **Shovel** | Point-to-point message forwarding (Federation at queue level) | Migration, specific routing |
+| **Warm Standby (DR)** | Secondary cluster kept ready and promoted on failover | Warm DR |
+
+### Apache Pulsar
+
+**Unique architecture (compute/storage separation):**
+
+```
+┌──────────────┐    ┌──────────────┐    ┌──────────────┐
+│  Producer    │    │  Consumer    │    │  Consumer    │
+└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
+       │                   │                   │
+┌──────▼───────────────────▼───────────────────▼──────┐
+│               Broker (stateless)                    │
+│         Subscription: Exclusive / Shared / Failover │
+└──────────────────────┬──────────────────────────────┘
+                       │
+┌──────────────────────▼──────────────────────────────┐
+│           BookKeeper (stateful storage)              │
+│  ├── Bookie 1  ├── Bookie 2  ├── Bookie 3  ├── ... │
+│  └── Ledger (append-only, segmented log)            │
+└─────────────────────────────────────────────────────┘
+```
+
+| Concept | Description |
+|---------|-------|
+| **Topic** | Logical category (partitioned or non-partitioned) |
+| **Subscription** | Delivery mode (Exclusive, Shared, Failover, Key_Shared) |
+| **Ledger** | Storage unit in BookKeeper (append-only) |
+| **Bookie** | Storage node (BookKeeper) |
+| **Managed Ledger** | Segmented log with cache and retention |
+
+**Advantages over Kafka:**
+- Compute/storage separation: independent scaling
+- Geo-replication built-in (native)
+- Multi-tenant (namespaces, isolation)
+- TTL, retry, dead letter topic (built-in)
+- At-least-once and effectively-once delivery
+
+### NATS
+
+| Feature | Description |
+|---------|-------|
+| **Core NATS** | Pub/sub, request-reply, < 1 ms latency |
+| **JetStream** | Persistence, exactly-once, key-value store, object store |
+| **Leaf nodes** | Hierarchical cluster connection |
+| **Super-cluster** | Multi-region clustering (global) |
+
+**Use case:** IoT, edge computing, microservices communication, low-latency messaging.
+
+### Oracle Service Bus (OSB)
+
+Part of Oracle SOA Suite, runs on WebLogic Server. Enterprise service bus for integration in Oracle-heavy environments.
+
+| Concept | Description |
+|---------|-------|
+| **Proxy Service** | Inbound endpoint (HTTP, JMS, MQ, SOAP, REST) |
+| **Business Service** | Target backend service |
+| **Pipeline** | Message processing: routing, transformation, validation |
+| **Split-Join** | Parallel/sequential orchestration of multiple services |
+| **Reporting** | Message tracking, SLA monitoring |
+
+**Key features:**
+- **Protocol mediation**: translation between SOAP/REST/JMS/MQ/FTP
+- **Message transformation**: XSLT, XQuery, MFL (non-XML)
+- **Throttling, SLA, alerting**: built-in
+- **Oracle AQ (Advanced Queuing)**: integration with Oracle DB queues
+- **XPath, XQuery, XSLT 2.0/3.0**: native support
+- **Error handling**: fault policies, error queues, retry
+
+**Use case:** Enterprise SOA, Oracle DB → Kafka bridging, legacy mainframe wrapping, B2B integration.
+
+**Alternatives:** IBM Integration Bus (IIB, now IBM App Connect Enterprise), MuleSoft Anypoint, WSO2 EI, Apache Camel / ServiceMix.
+
+---
+
+## Platform comparison
+
+### Performance and scaling
+
+| Platform | Max throughput | Latency (P99) | Messages/s (1 broker) | Scaling |
+|-----------|--------------|---------------|-------------------------|-----------|
+| **Kafka** | > 1 GB/s | 2–10 ms | ~1,000,000 | Partitions (horizontal) |
+| **Pulsar** | > 1 GB/s | 5–15 ms | ~1,000,000 | Brokers + Bookies |
+| **RabbitMQ** | ~100 MB/s | < 1 ms (RAM) | ~100,000 | Clustering (node) |
+| **NATS** | > 10 GB/s | < 0.5 ms | ~10,000,000 | Clustering + Leaf nodes |
+| **OSB** | < 1 GB/s | 10–100 ms | ~10,000 | Vertical (WebLogic cluster) |
+
+### Delivery guarantees
+
+| Platform | At most once | At least once | Exactly once | Ordering |
+|-----------|-------------|---------------|-------------|----------|
+| **Kafka** | Yes | Yes (acks=all + min.insync.replicas) | Yes (idempotent + transactional) | Per partition |
+| **Pulsar** | Yes | Yes | Yes (dedup + transactional) | Per partition |
+| **RabbitMQ** | Yes | Yes (Publisher Confirm + Consumer Ack) | Limited | Per queue |
+| **NATS** | Yes | Yes (JetStream) | Limited | Per subject |
+| **OSB** | Yes | Yes (XA transactions, exactly-once delivery) | Yes (XA + WS-AT) | Per pipeline |
+
+### When to use what
+
+| Use case | Recommended platform | Reasoning |
+|----------|---------------------|------------|
+| **Event sourcing / audit log** | Kafka, Pulsar | Append-only log, high throughput, replay |
+| **CDC (Change Data Capture)** | Kafka (Kafka Connect + Debezium) | Connector ecosystem |
+| **Task queue (job processing)** | RabbitMQ, SQS | Dead letter, retry, priority, scheduling |
+| **API messaging / microservices** | NATS, RabbitMQ | Low latency, simplicity |
+| **Data pipeline (ETL)** | Kafka (ksqlDB, Kafka Streams) | Stream processing in platform |
+| **IoT / Edge** | NATS, MQTT (RabbitMQ) | Lightweight, leaf nodes |
+| **Enterprise SOA / EAI** | OSB, IBM IIB, MuleSoft | Protocol mediation, XA, B2B, legacy wrapping |
+| **Multi-tenant cloud** | Pulsar | Native multi-tenant, geo-replication |
+| **Serverless / event-driven** | SQS/SNS, Pub/Sub | Managed, auto-scaling |
+
+---
+
+## DR and high availability
+
+See [DATACENTERS.md](DATACENTERS.md), section "Vliv jednotlivých technologií na výběr DC topologie" (impact of individual technologies on DC topology selection), for the detailed DR mapping per platform.
+
+### Best practices
+
+- **Don't lose messages in the queue**: use explicit acknowledgements instead of auto-ack
+- **Dead letter queue**: give every main queue a DLQ for undeliverable messages
+- **Monitor lag**: consumer lag is a key metric (Kafka: `records-lag-max` in the `kafka.consumer:type=consumer-fetch-manager-metrics` MBean)
+- **Idempotent consumer**: the same message may be delivered twice
+- **Retry with backoff**: exponential backoff on processing failure
+- **Schema registry**: avoid deserialization errors (Avro, Protobuf, JSON Schema)
+- **Encryption**: TLS in transit, encryption at rest (Apache Kafka has no built-in at-rest encryption, so use disk/volume encryption or client-side encryption)
+
+---
+
+## Related
+
+- [DATACENTERS.md](DATACENTERS.md): DR topology, per-platform mapping
+- [CLOUD.md](CLOUD.md): managed messaging (SQS, SNS, Service Bus, Pub/Sub)
+
+## Sources
+
+Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
+
+*Last revision: 2026-06-12*
--- a/MONGODB.en.md
+++ b/MONGODB.en.md
@@ -111,6 +111,6 @@ MongoDB changed its license in 2018 from GNU AGPL v3 to **SSPL** (Server Side Pu

 ## Sources

-References, books, and standards: [sources/databases/sources.md](sources/databases/sources.md)
+References, books, and standards: [sources/databases/sources.en.md](sources/databases/sources.en.md)

 *Last revision: 2026-06-03*
--- a/MONITORING.en.md
+++ b/MONITORING.en.md
@@ -497,6 +497,6 @@ OpenStack provides several services for telemetry and monitoring:

 ## Sources

-Links, books and standards: [sources/monitoring/sources.md](sources/monitoring/sources.md)
+Links, books and standards: [sources/monitoring/sources.en.md](sources/monitoring/sources.en.md)

 *Last revision: 2026-06-03*
--- a/MYSQL.en.md
+++ b/MYSQL.en.md
@@ -131,7 +131,7 @@ ProxySQL is an advanced proxy for MySQL with sophisticated routing:

 ## Sources

-References, books, and standards: [sources/databases/sources.md](sources/databases/sources.md)
+References, books, and standards: [sources/databases/sources.en.md](sources/databases/sources.en.md)

 ### Recommended reading

--- a/NETWORKING.en.md
+++ b/NETWORKING.en.md
@@ -302,7 +302,7 @@ Anycast detail:

 ## Cloud Networking Resilience (2026)

-See also: [CLOUD.md](CLOUD.md) — cloud architecture, multi-AZ, hybrid cloud connectivity.
+See also: [CLOUD.en.md](CLOUD.en.md) — cloud architecture, multi-AZ, hybrid cloud connectivity.

 ### Cell-based Architectures

@@ -577,7 +577,7 @@ In a private DC, Zero Trust is deployed via:

 ## Resources

-Links, books and standards: [sources/networking/sources.md](sources/networking/sources.md)
+Links, books and standards: [sources/networking/sources.en.md](sources/networking/sources.en.md)
 - **MTU alignment** — consistent MTU across the entire path, check ICMP blocking for PMTUD
 - **IP planning** — RFC 1918 (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), avoid overlaps for peering

--- a/ORACLE.en.md
+++ b/ORACLE.en.md
@@ -195,7 +195,7 @@ Tip: For RAC, consider smaller CPUs (e.g., 64C instead of 96C) — license cost

 ## Sources

-References, books, and standards: [sources/databases/sources.md](sources/databases/sources.md)
+References, books, and standards: [sources/databases/sources.en.md](sources/databases/sources.en.md)

 ### Recommended reading

--- a/OS.en.md
+++ b/OS.en.md
@@ -0,0 +1,337 @@
+# Operating Systems
+
+> Overview of Linux distributions and Microsoft Windows for server, container, and AI/GPU workloads, including support lifecycle, EOL dates, and comparison.
+
+---
+
+## Distribution overview
+
+| Distribution | Family | Package manager | Init | Security | Reference platform |
+|-------------|--------|----------------|------|----------|-------------------|
+| **Ubuntu LTS** | Debian | apt (deb) | systemd | AppArmor | NVIDIA DGX, widest AI/GPU support |
+| **Debian** | Debian | apt (deb) | systemd | AppArmor | General-purpose server, stability |
+| **RHEL** | Red Hat | dnf (rpm) | systemd | SELinux | Enterprise standard, SAP, Oracle DB |
+| **Rocky Linux** | Red Hat | dnf (rpm) | systemd | SELinux | RHEL binary compatible (free) |
+| **AlmaLinux** | Red Hat | dnf (rpm) | systemd | SELinux | RHEL binary compatible (free) |
+| **SLES** | SUSE | zypper (rpm) | systemd | AppArmor | HPC, SAP, mainframe |
+| **OpenSUSE Leap** | SUSE | zypper (rpm) | systemd | AppArmor | Desktop, development |
+| **OpenSUSE Tumbleweed** | SUSE | zypper (rpm) | systemd | AppArmor | Rolling release, bleeding edge |
+| **Fedora** | Red Hat | dnf (rpm) | systemd | SELinux | Desktop, technology preview |
+| **Arch Linux** | Independent | pacman | systemd | — | Rolling, power users |
+| **Alpine Linux** | Independent | apk | OpenRC | — | Container image, embedded |
+| **Flatcar Container Linux** | Independent | — (image-based) | systemd | — | K8s worker node, minimal footprint |
+| **Bottlerocket** | Independent | — (image-based) | systemd | — | AWS K8s, minimal footprint |
+
+---
+
+## Support lifecycle and EOL dates
+
+> **Standard:** base support (bug fixes, security). **LTS/ELS:** extended support (security only).
+> ESM = Ubuntu Extended Security Maintenance, EUS = RHEL Extended Update Support, LTSS = SUSE Long Term Service Pack Support.
+
+### Ubuntu LTS
+
+| Version | Release | Standard support | ESM / Ubuntu Pro | Note |
+|---------|---------|-----------------|------------------|------|
+| **20.04 LTS** (Focal) | 2020-04 | End 2025-04 | End 2030-04 | Last release with Python 2 |
+| **22.04 LTS** (Jammy) | 2022-04 | End 2027-04 | End 2032-04 | NVIDIA DGX standard |
+| **24.04 LTS** (Noble) | 2024-04 | End 2029-04 | End 2034-04 | Latest GPU/CUDA support |
+| **26.04 LTS** (planned) | 2026-04 | End 2031-04 | End 2036-04 | — |
+
+### RHEL
+
+| Version | Release | Full support | Maintenance support | Extended life cycle |
+|---------|---------|-------------|-------------------|-------------------|
+| **7** | 2014-06 | End 2019-08 | End 2024-06 | End 2028-06 (ELS) |
+| **8** | 2019-05 | End 2024-05 | End 2029-05 | End 2034-06 (ELS) |
+| **9** | 2022-05 | End 2027-05 | End 2032-05 | End 2037-06 (ELS) |
+| **10** | 2025-05 | End 2030-05 | End 2035-05 | — |
+
+### Rocky Linux / AlmaLinux
+
+| Version | Release | Support until | RHEL compatible | Note |
+|---------|---------|-------------|-----------------|------|
+| **8** | 2021-06 | 2029-05 | Yes (since RHEL 8.4) | Alma/Rocky |
+| **9** | 2022-07 | 2032-05 | Yes (since RHEL 9.0) | Alma/Rocky |
+
+### Debian
+
+| Version | Release | Full support | LTS support | ELTS (paid) |
+|---------|---------|-------------|-------------|-------------|
+| **11** (Bullseye) | 2021-08 | 2024-08 | End 2026-08 | End 2028-08 |
+| **12** (Bookworm) | 2023-06 | 2026-06 | End 2028-06 | End 2030-06 |
+| **13** (Trixie) | 2025 (expected) | ~3 years post-release | ~5 years post-release | — |
+
+### SLES
+
+| Version | Release | General support | LTSS | Note |
+|---------|---------|---------------|------|------|
+| **15 SP3** | 2021-06 | End 2024-12 | End 2027-12 | — |
+| **15 SP4** | 2022-06 | End 2025-12 | End 2028-12 | — |
+| **15 SP5** | 2023-06 | End 2026-12 | End 2029-12 | Current SP |
+| **15 SP6** | 2024-10 | End 2027-12 | End 2030-12 | — |
+
+### Fedora
+
+| Version | Release | EOL | Note |
+|---------|---------|-----|------|
+| **38** | 2023-04 | 2024-05 | — |
+| **39** | 2023-11 | 2024-12 | — |
+| **40** | 2024-04 | 2025-05 | — |
+| **41** | 2024-11 | 2025-12 | — |
+
+Fedora ships a new release roughly every 6 months, and each release reaches EOL about 13 months after it ships. Fedora is the upstream for RHEL.
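The 13-month rule can be checked against the table with simple month arithmetic. This is a small sketch; `add_months` and `FEDORA_RELEASES` are illustrative names, and the release months are copied from the table above.

```python
def add_months(year_month: str, months: int) -> str:
    """Add a number of months to a 'YYYY-MM' string."""
    year, month = map(int, year_month.split("-"))
    total = year * 12 + (month - 1) + months  # months since year 0
    return f"{total // 12:04d}-{total % 12 + 1:02d}"


# Release months as listed in the Fedora table.
FEDORA_RELEASES = {"38": "2023-04", "39": "2023-11", "40": "2024-04", "41": "2024-11"}

if __name__ == "__main__":
    for version, released in FEDORA_RELEASES.items():
        print(f"Fedora {version}: released {released}, EOL ~{add_months(released, 13)}")
```

The computed months match the EOL column for all four releases. Actual EOL days are set by the Fedora project and can shift with the schedule of later releases.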
+
+### Alpine Linux
+
+| Version | Release | EOL |
+|---------|---------|-----|
+| **3.18** | 2023-05 | 2025-05 |
+| **3.19** | 2023-12 | 2025-12 |
+| **3.20** | 2024-05 | 2026-05 |
+| **3.21** | 2024-12 | 2026-12 |
+
+---
+
+## Kernel version per distribution
+
+| Distribution | Kernel (default) | Kernel (HWE/enhanced) | Note |
+|------------|-----------------|----------------------|------|
+| Ubuntu 22.04 LTS | 5.15 (GA) | 6.5+ (HWE) | HWE from 22.04.2 |
+| Ubuntu 24.04 LTS | 6.8 | — | — |
+| RHEL 8 | 4.18 | — | Backported features |
+| RHEL 9 | 5.14 | — | Backported features |
+| RHEL 10 | 6.12 | — | — |
+| Rocky/Alma 8 | 4.18 | — | Same as RHEL 8 |
+| Rocky/Alma 9 | 5.14 | — | Same as RHEL 9 |
+| Debian 11 | 5.10 | 6.1 (backports) | — |
+| Debian 12 | 6.1 | — | — |
+| SLES 15 SP5 | 5.14 | — | — |
+| SLES 15 SP6 | 6.4 | — | — |
+| Fedora 40 | 6.8+ | — | Rolling upstream |
+| Alpine 3.20 | 6.6 | — | — |
+
+---
+
+## Use case comparison
+
+| Use case | Recommended distribution | Rationale |
+|----------|------------------------|-----------|
+| **AI/GPU cluster (DGX)** | Ubuntu 22.04 LTS / DGX OS | NVIDIA standard, CUDA, MLNX_OFED |
+| **Enterprise K8s (OpenShift)** | RHEL 9 / RHCOS | Red Hat support, GPU Operator |
+| **Vanilla K8s (on-prem)** | Ubuntu 22.04 LTS + Flatcar (workers) | Community support, minimal worker image |
+| **HPC cluster (Slurm)** | Rocky Linux 9 / Ubuntu 22.04 | EL ecosystem + Lustre, or Ubuntu |
+| **Traditional enterprise DB (Oracle, SAP)** | RHEL 9 / SLES 15 | Vendor certification |
+| **Container host** | Ubuntu 22.04 / Alpine | Broad image compatibility / min size |
+| **Development / desktop** | Fedora / Ubuntu 24.04 / OpenSUSE Tumbleweed | Latest packages, HW support |
+| **Embedded / IoT** | Debian / Alpine / Yocto | Minimal footprint, stability |
+| **Edge inference** | Ubuntu (ARM) / NVIDIA JetPack | Jetson, GPU support |
+| **Mainframe (IBM z/Arch)** | SLES 15 / RHEL 9 | IBM certification |
+
+---
+
+## Package management comparison
+
+| Feature | apt (Debian/Ubuntu) | dnf (RHEL/Rocky/Alma/Fedora) | zypper (SUSE) | pacman (Arch) | apk (Alpine) |
+|---------|--------------------|------------------------------|---------------|---------------|-------------|
+| **Package format** | .deb | .rpm | .rpm | .pkg.tar.zst | .apk |
+| **Repo management** | /etc/apt/sources.list | /etc/yum.repos.d/ | /etc/zypp/repos.d/ | /etc/pacman.conf | /etc/apk/repositories |
+| **Lock file** | — (apt-mark hold) | — (exclude) | — (lock) | — (IgnorePkg) | — |
+| **Transactional update** | No | Yes (dnf history) | Yes (zypper history) | No | No |
+| **Rollback** | No (manual) | Yes (dnf history rollback) | Yes (snapper + zypper) | No | No |
+| **Delta updates** | No (pdiffs for package indexes only) | Yes (deltarpm) | Yes (zsync) | No | No |
+| **Version (as of 2025)** | apt 2.7+ | dnf 4.18+ | zypper 1.14+ | pacman 6.1+ | apk 2.14+ |
+
+---
+
+## Security model comparison
+
+| Feature | SELinux (RHEL derivatives) | AppArmor (Ubuntu/Debian/SUSE) |
+|---------|--------------------------|-------------------------------|
+| **Type** | Mandatory Access Control (MAC) | Mandatory Access Control (MAC) |
+| **Labeling** | Context-based (user:role:type) | Path-based (profile per executable) |
+| **Configuration** | Policy (modules, booleans) | Profiles (text, in /etc/apparmor.d/) |
+| **Modes** | Enforcing / Permissive / Disabled | Enforce / Complain / Disabled |
+| **Learning curve** | Steep (complex policies) | Moderate (simpler profiles) |
+| **Default in** | RHEL, Rocky, Alma, Fedora | Ubuntu, Debian, SLES, OpenSUSE |
+| **Use case** | Enterprise multi-tenant, regulated | General-purpose server, app containment |
+| **Container integration** | SELinux labels on container | AppArmor profile on container |
+
+Additional layers:
+- **seccomp** — syscall filtering (default in containerd, Docker)
+- **Capabilities** — Linux capabilities (drop all except required)
+- **cgroups v2** — resource isolation (CPU, memory, IO, PID)
+- **User namespaces** — rootless containers (Podman, Docker rootless)
+
+---
+
+## Recommended migration path for EOL distributions
+
+| From | To | Recommended approach |
+|------|-----|---------------------|
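+| Ubuntu 20.04 (EOL 2025) | Ubuntu 22.04 or 24.04 | `do-release-upgrade` (one LTS at a time: 20.04 → 22.04 → 24.04) or fresh install |
+| RHEL 7 (EOL 2024) | RHEL 8 or 9 | `leapp` upgrade (7 → 8, then 8 → 9), or fresh install |
+| Rocky/Alma 8 | Rocky/Alma 9 | ELevate (`leapp`-based, AlmaLinux project) or fresh install; in-place `dnf --releasever` upgrades across major versions are unsupported |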
+| Debian 11 (EOL LTS 2026) | Debian 12 | `apt full-upgrade` + new sources.list |
+| SLES 15 SP4 (EOL 2025) | SLES 15 SP6 | `zypper migration` |
+| Fedora 40 (EOL 2025) | Fedora 42+ | `dnf system-upgrade` |
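In-place tools such as `do-release-upgrade` and `leapp` move one LTS or major release at a time, so a long jump is really a chain of hops. A minimal sketch of that planning step follows; `upgrade_hops` and the two release lists are illustrative and cover only the releases named in the table.

```python
from typing import List, Tuple

# Ordered release chains (assumption: only the releases named in the table above).
UBUNTU_LTS = ["20.04", "22.04", "24.04"]
RHEL_MAJOR = ["7", "8", "9"]


def upgrade_hops(chain: List[str], current: str, target: str) -> List[Tuple[str, str]]:
    """Return the (from, to) hops needed to get from current to target."""
    start, end = chain.index(current), chain.index(target)
    if end < start:
        raise ValueError("downgrades are not supported")
    # Pair each release with its successor between start and end.
    return list(zip(chain[start:end], chain[start + 1:end + 1]))


if __name__ == "__main__":
    for src, dst in upgrade_hops(UBUNTU_LTS, "20.04", "24.04"):
        print(f"Ubuntu {src} -> {dst}")
    for src, dst in upgrade_hops(RHEL_MAJOR, "7", "9"):
        print(f"RHEL {src} -> {dst}")
```

Each hop is a separate maintenance window with its own backup and validation, which is often the argument for a fresh install when more than one hop is needed.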
+
+---
+
+## Microsoft Windows
+
+### Windows Server — editions
+
+| Edition | Price (approx) | Core limits | VM rights | Use case |
+|---------|---------------|-------------|-----------|----------|
+| **Datacenter** | ~$6,155 (2025) | Unlimited | Unlimited Windows VMs per host | Virtualization, SDDC, S2D, HCI |
+| **Standard** | ~$1,069 (2025) | 2 CPU, unlimited cores | 2 Windows VMs + Hyper-V host | General server, AD, file server |
+| **Essentials** | ~$501 (2025) | 1 CPU, max 10 cores | — | Small business (≤25 users) |
+| **Azure Edition** | Pay-as-you-go | Per Azure VM | Per Azure | Azure-only, hotpatching |
+
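+Licensing: Windows Server Standard and Datacenter are licensed **per core**, with a minimum of 16 core licenses per server and 8 per physical processor; per-VM licensing requires at least 8 core licenses per VM.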
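The per-core minimums translate into simple arithmetic. The sketch below assumes Microsoft's published rules of at least 8 core licenses per physical processor and 16 per server, and that Standard edition covers 2 Windows VMs per full licensing of the host (more VMs require licensing all cores again). The function names are illustrative, and real quotes depend on the current licensing terms.

```python
import math

MIN_CORES_PER_PROCESSOR = 8   # assumption: Microsoft per-processor minimum
MIN_CORES_PER_SERVER = 16     # minimum stated in the text above


def core_licenses(sockets: int, cores_per_socket: int) -> int:
    """Core licenses needed to fully license one physical host."""
    per_processor = max(cores_per_socket, MIN_CORES_PER_PROCESSOR)
    return max(sockets * per_processor, MIN_CORES_PER_SERVER)


def standard_core_licenses(sockets: int, cores_per_socket: int, windows_vms: int) -> int:
    """Standard edition: each full licensing of the host covers 2 Windows VMs."""
    stacks = max(1, math.ceil(windows_vms / 2))
    return stacks * core_licenses(sockets, cores_per_socket)


if __name__ == "__main__":
    print(core_licenses(1, 4))                # small host still pays the 16-core minimum
    print(core_licenses(2, 12))               # 2 x 12 cores
    print(standard_core_licenses(2, 12, 5))   # 5 VMs need 3 full licensings
```

Comparing `standard_core_licenses` at a given VM count with the Datacenter price for `core_licenses` is a quick way to find the break-even point between the two editions.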
+
+### Windows Server — support lifecycle
+
+> **Mainstream:** regular updates (bug fixes, security, features). **Extended:** security updates only (free).
+> **ESU:** Extended Security Updates (paid tier, ~$45–300/core/year).
+
+| Version | Release | Mainstream support | Extended support | ESU | Note |
+|---------|---------|------------------|-----------------|-----|------|
+| **2012 R2** | 2013-11 | 2018-10 | 2023-10 | End 2026-10 (year 3) | ESU paid, final year |
+| **2016** | 2016-10 | 2022-01 | 2027-01 | — | First with Windows containers and Nano Server |
+| **2019** | 2018-11 | 2024-01 | 2029-01 | — | Last with Nano Server (1803 only) |
+| **2022** | 2021-09 | 2026-10 | 2031-10 | — | Current, TPM 2.0, Credential Guard |
+| **2025** | 2024-11 | 2029-10 | 2034-10 | — | Hotpatching, PowerShell 7, SMB over QUIC |
+
+### Windows Server — version vs edition feature grid
+
+| Version | Hyper-V | Storage Spaces Direct | Software-defined networking | Containers | GPU DDA / vGPU | WSL2 |
+|---------|---------|---------------------|---------------------------|------------|---------------|------|
+| 2016 Standard | Yes | No (DC only) | No (DC only) | Windows only | Yes | No |
+| 2016 Datacenter | Yes | Yes | Yes | Windows | Yes | No |
+| 2019 Standard | Yes | No | No | Windows | Yes | No |
+| 2019 Datacenter | Yes | Yes | Yes | Windows | Yes | No |
+| 2022 Standard | Yes | No | No | Windows + Linux | Yes | No |
+| 2022 Datacenter | Yes | Yes | Yes | Windows + Linux (2022.2+) | Yes | No |
+| 2025 Datacenter | Yes | Yes | Yes | Windows + Linux | Yes | Yes |
+
+### Windows Desktop — support lifecycle
+
+> **E = Enterprise, Pro = Professional, Home = Consumer**
+> LTSC = Long Term Servicing Channel (stable, no feature updates)
+
+| Version | Release | EOL (Home/Pro) | EOL (Enterprise) | LTSC EOL | Note |
+|---------|---------|---------------|-----------------|----------|------|
+| **10 21H2** | 2021-11 | — | 2024-06 | — | — |
+| **10 22H2** | 2022-10 | 2025-10 | 2025-10 | — | Final Windows 10 |
+| **10 LTSC 2021** | 2021-11 | — | — | 2032-01 | IoT Enterprise LTSC |
+| **11 22H2** | 2022-09 | 2024-10 | 2025-10 | — | — |
+| **11 23H2** | 2023-10 | 2025-11 | 2026-11 | — | — |
+| **11 24H2** | 2024-10 | 2026-10 | 2027-10 | — | First with Recall, Copilot+ |
+| **11 LTSC 2024** | 2024-10 | — | — | 2029-10 | Enterprise LTSC |
+
+Windows 10 support **ended 2025-10-14**.
+
+### Windows vs Linux — comparison
+
+| Feature | Windows Server | RHEL / Ubuntu |
+|---------|---------------|---------------|
+| **License (server)** | $500–6,000 (per core) + CAL | $0–800 (per node subscription) |
+| **License (desktop)** | $100–200 (OEM/retail) | Free |
+| **Support cost** | Included in license (SA/ESU) | $200–1,300/node/year (RHEL) |
+| **Package management** | MSI, AppX, winget, NuGet | APT, DNF, Zypper |
+| **Package count** | ~10,000 (chocolatey) | ~60,000+ (Ubuntu repo) |
+| **Desktop GUI** | Windows Shell (mandatory) | Optional (GNOME, KDE, XFCE…) |
+| **Server GUI** | Optional (Server Core or Desktop Experience) | CLI-only (standard) |
+| **Kernel** | NT hybrid kernel (kernel-mode Win32) | Monolithic Linux kernel |
+| **Device support** | OEM driver model (WHQL) | Open source + vendor drivers |
+| **Container types** | Windows + Linux (WSL2) | Linux (Docker, Podman, containerd) |
+| **Container registry** | Docker Hub, ACR, Nexus | Docker Hub, Quay, GHCR, Nexus… |
+| **Container image size** | ~4–8 GB (Windows Server Core) | ~100 MB – 1 GB (Alpine/Ubuntu) |
+| **GPU passthrough** | DDA (Discrete Device Assignment) | GPU Direct, VFIO, SR-IOV |
+| **AI/ML support** | WSL2 (CUDA), Azure ML | Native CUDA, ROCm, oneAPI |
+| **CUDA support** | Yes (via WSL2 or Docker) | Native (nvidia-container-toolkit) |
+| **Orchestration** | AD / GPO / SCCM / WAC | Ansible, Puppet, Salt, Foreman |
+| **RBAC/AAA** | Active Directory (+ Kerberos) | LDAP, FreeIPA, SSSD, AD |
+| **Remote management** | RDP, WinRM, PowerShell Remoting | SSH, Cockpit, Webmin |
+| **Filesystem** | NTFS, ReFS, CSVFS | ext4, XFS, Btrfs, ZFS |
+| **Max file system size** | 256 TB (NTFS), 1.2 YB (ReFS) | 1 EB (XFS), 16 EB (ZFS) |
+| **Hypervisor** | Hyper-V (Type 1) | KVM (in-kernel, usually classed as Type 1), Xen |
+| **Dynamic memory** | Hyper-V Dynamic Memory | KSM, virtio-balloon (KVM) |
+| **Live migration** | Hyper-V Live Migration | KVM Live Migration, vMotion |
+
+### Windows specific features
+
+| Feature | Description | Linux alternative |
+|---------|------------|-------------------|
+| **Active Directory** | Identity, auth, GPO, DNS, DHCP | FreeIPA, Samba AD DC, 389-ds, SSSD |
+| **Group Policy** | Central desktop/server configuration | Ansible (agentless), Puppet, Salt (agent-based) |
+| **Hyper-V + S2D** | Hyper-converged storage and virtualization (HCI) | Proxmox Ceph / oVirt + Gluster |
+| **Failover Clustering** | Cluster-aware apps (SQL, File Server) | Pacemaker + Corosync + DRBD |
+| **IIS** | Web server, ASP.NET host | Nginx, Apache (.NET host possible) |
+| **PowerShell** | Scripting, Desired State Configuration | Bash, Python, Ansible |
+| **Windows Admin Center** | GUI management | Cockpit, Webmin |
+| **BitLocker** | Full disk encryption | LUKS + cryptsetup |
+| **Windows Defender** | Antivirus + EDR | ClamAV, Wazuh, Osquery |
+| **SQL Server** | Relational database | PostgreSQL, MySQL, MariaDB |
+
+### Recommended OS per use case (including Windows)
+
+| Use case | OS | Rationale |
+|----------|-----|-------|
+| **Active Directory / GPO / hybrid ID** | Windows Server 2022/2025 | AD is Windows-only |
+| **SQL Server (failover cluster)** | Windows Server Datacenter + SQL EE | Always On FCI, ReFS |
+| **Exchange / SharePoint** | Windows Server 2022 | Windows-only |
+| **Enterprise desktop management** | Windows 11 Enterprise + Intune/SCCM | GPO, AD, enterprise MDM |
+| **.NET / ASP.NET apps** | Windows Server / Linux (.NET Core) | .NET 6+ runs on Linux |
+| **HCI (Microsoft stack)** | Windows Server Datacenter + S2D + Hyper-V | Azure Stack HCI |
+| **Virtualization (mixed workload)** | Windows Server Datacenter (Hyper-V) | Linux + Windows VMs under one hypervisor |
+| **AI/GPU inference** | Linux (Ubuntu) + CUDA | NVIDIA optimal; WSL2 alternative |
+| **Container orchestration (Windows nodes)** | Windows Server 2022/2025 + containerd | Windows Pods in AKS on-prem |
+| **Tier 2 apps / web / API** | Ubuntu or RHEL (Linux) | Lower TCO, smaller footprint |
+
+### Windows Server migration paths
+
+| From | To | Recommended approach |
+|------|-----|---------------------|
+| Windows Server 2012 R2 (EOL 2023) | Windows Server 2022/2025 | In-place upgrade or fresh + migration |
+| Windows Server 2016 (EOL 2027) | Windows Server 2022/2025 | In-place upgrade or fresh |
+| Windows Server 2019 | Windows Server 2022/2025 | In-place upgrade (`Setup.exe /auto upgrade`) |
+| Windows Server 2022 | Windows Server 2025 | In-place upgrade or fresh |
+| Windows Server → Cloud | Azure VM / Azure Stack HCI | Azure Migrate, Storage Migration Service |
+| Windows Server → Linux | Ubuntu / RHEL (re-platform) | Migrate app to .NET Core or alternative |
+
+### Windows — API and operational limits
+
+| Limit | Windows Server | Windows Desktop |
+|-------|---------------|----------------|
+| **Max RAM** | 24 TB (2025 Datacenter) | 2 TB (Pro/Enterprise), 128 GB (Home) |
+| **Max CPU sockets** | 64 (Datacenter), 2 (Standard) | 2 |
+| **Max CPU cores** | Unlimited | 128 (Pro), 64 (Home) |
+| **Max file size (NTFS)** | 256 TB | 256 TB |
+| **Max file size (ReFS)** | 18.4 EB (2025) | — |
+| **Max volume size (NTFS)** | 256 TB | 256 TB |
+| **Max volume size (ReFS)** | 1.2 YB (theoretical) | — |
+| **Max dedup volume** | 64 TB (Data Deduplication) | — |
+| **Max cluster nodes** | 64 (Failover Cluster) | — |
+| **Max VM per host** | Unlimited (Datacenter) | — |
+| **VM memory per VM** | 12 TB (2022+) | — |
+| **VM vCPU per VM** | 240 (2022+) | — |
+| **Concurrent RDP** | 2 (admin), 200+ (RDS CAL) | 1 (Home), more (RDP host) |
+| **PowerShell Remoting** | Unlimited (WinRM) | Yes (WinRM) |
+
+---
+
+## Related
+
+- [AI-INFRASTRUCTURE.en.md](AI-INFRASTRUCTURE.en.md) — OS for AI workloads, GPU drivers, kernel parameters
+- [KUBERNETES.en.md](KUBERNETES.en.md) — container runtime, orchestration
+- [HYPERVISORS.en.md](HYPERVISORS.en.md) — hypervisors, VM host OS
+- [DATACENTERS.en.md](DATACENTERS.en.md) — DC layout, HW platforms
+
+## Sources
+
+Links, books, and standards: [sources/infrastructure/sources.en.md](sources/infrastructure/sources.en.md)
+
+*Last revision: 2026-06-18*
--- a/OS.md
+++ b/OS.md
@@ -0,0 +1,333 @@
+# Operační systémy
+
+> Přehled Linux distribucí a Microsoft Windows pro serverové, containerové a AI/GPU workloady, včetně support lifecycle, EOL dat a srovnání.
+
+---
+
+## Přehled distribucí
+
+| Distribuce | Rodina | Package manager | Init | Security | Reference platforma |
+|-----------|--------|----------------|------|----------|-------------------|
+| **Ubuntu LTS** | Debian | apt (deb) | systemd | AppArmor | NVIDIA DGX, nejširší AI/GPU support |
+| **Debian** | Debian | apt (deb) | systemd | AppArmor | Univerzální server, stabilita |
+| **RHEL** | Red Hat | dnf (rpm) | systemd | SELinux | Enterprise standard, SAP, Oracle DB |
+| **Rocky Linux** | Red Hat | dnf (rpm) | systemd | SELinux | RHEL binary compatible (free) |
+| **AlmaLinux** | Red Hat | dnf (rpm) | systemd | SELinux | RHEL binary compatible (free) |
+| **SLES** | SUSE | zypper (rpm) | systemd | AppArmor | HPC, SAP, mainframe |
+| **OpenSUSE Leap** | SUSE | zypper (rpm) | systemd | AppArmor | Desktop, vývoj |
+| **OpenSUSE Tumbleweed** | SUSE | zypper (rpm) | systemd | AppArmor | Rolling release, bleeding edge |
+| **Fedora** | Red Hat | dnf (rpm) | systemd | SELinux | Desktop, technologický preview |
+| **Arch Linux** | Independent | pacman | systemd | — | Rolling, power users |
+| **Alpine Linux** | Independent | apk | OpenRC | — | Container image, embedded |
+| **Flatcar Container Linux** | Independent | — (image-based) | systemd | — | K8s worker node, minimal footprint |
+| **Bottlerocket** | Independent | — (image-based) | systemd | — | AWS K8s, minimal footprint |
+
+---
+
+## Support lifecycle a EOL data
+
+> **Standard:** základní podpora (bug fixy, security). **LTS/ELS:** prodloužená podpora (jen security).
+> ESM = Ubuntu Extended Security Maintenance, EUS = RHEL Extended Update Support, LTSS = SUSE Long Term Service Pack Support.
+
+### Ubuntu LTS
+
+| Verze | Release | Standard support | ESM / Ubuntu Pro | Poznámka |
+|-------|---------|-----------------|------------------|----------|
+| **20.04 LTS** (Focal) | 2020-04 | Konec 2025-04 | Konec 2030-04 | Poslední verze s Python 2 |
+| **22.04 LTS** (Jammy) | 2022-04 | Konec 2027-04 | Konec 2032-04 | NVIDIA DGX standard |
+| **24.04 LTS** (Noble) | 2024-04 | Konec 2029-04 | Konec 2034-04 | Nejnovější GPU/CUDA support |
+| **26.04 LTS** (plán) | 2026-04 | Konec 2031-04 | Konec 2036-04 | — |
+
+### RHEL
+
+| Verze | Release | Full support | Maintenance support | Extended life cycle |
+|-------|---------|-------------|-------------------|-------------------|
+| **7** | 2014-06 | Konec 2019-08 | Konec 2024-06 | Konec 2028-06 (ELS) |
+| **8** | 2019-05 | Konec 2024-05 | Konec 2029-05 | Konec 2034-06 (ELS) |
+| **9** | 2022-05 | Konec 2027-05 | Konec 2032-05 | Konec 2037-06 (ELS) |
+| **10** | 2025-05 | Konec 2030-05 | Konec 2035-05 | — |
+
+### Rocky Linux / AlmaLinux
+
+| Verze | Release | Support do | Kompatibilní s RHEL | Poznámka |
+|-------|---------|-----------|-------------------|----------|
+| **8** | 2021-06 | 2029-05 | Ano (od RHEL 8.4) | Alma/Rocky |
+| **9** | 2022-07 | 2032-05 | Ano (od RHEL 9.0) | Alma/Rocky |
+
+### Debian
+
+| Verze | Release | Full support | LTS support | ELTS (paid) |
+|-------|---------|-------------|-------------|-------------|
+| **11** (Bullseye) | 2021-08 | 2024-08 | Konec 2026-08 | Konec 2028-08 |
+| **12** (Bookworm) | 2023-06 | 2026-06 | Konec 2028-06 | Konec 2030-06 |
+| **13** (Trixie) | 2025 (oček.) | ~3 roky po release | ~5 let po release | — |
+
+### SLES
+
+| Verze | Release | General support | LTSS | Poznámka |
+|-------|---------|---------------|------|----------|
+| **15 SP3** | 2021-06 | Konec 2024-12 | Konec 2027-12 | — |
+| **15 SP4** | 2022-06 | Konec 2025-12 | Konec 2028-12 | — |
+| **15 SP5** | 2023-06 | Konec 2026-12 | Konec 2029-12 | Aktuální SP |
+| **15 SP6** | 2024-10 | Konec 2027-12 | Konec 2030-12 | — |
+
+### Fedora
+
+| Verze | Release | EOL | Poznámka |
+|-------|---------|-----|----------|
+| **38** | 2023-04 | 2024-05 | — |
+| **39** | 2023-11 | 2024-12 | — |
+| **40** | 2024-04 | 2025-05 | — |
+| **41** | 2024-11 | 2025-12 | — |
+
+Fedora vydává novou verzi každých ~6 měsíců, EOL ~13 měsíců po release. Slouží jako upstream pro RHEL.
+
+### Alpine Linux
+
+| Verze | Release | EOL |
+|-------|---------|-----|
+| **3.18** | 2023-05 | 2025-05 |
+| **3.19** | 2023-12 | 2025-12 |
+| **3.20** | 2024-05 | 2026-05 |
+| **3.21** | 2024-12 | 2026-12 |
+
+---
+
+## Kernel verze per distribuce
+
+| Distribuce | Kernel (default) | Kernel (HWE/enhanced) | Poznámka |
+|-----------|-----------------|----------------------|----------|
+| Ubuntu 22.04 LTS | 5.15 (GA) | 6.5+ (HWE) | HWE od 22.04.2 |
+| Ubuntu 24.04 LTS | 6.8 | — | — |
+| RHEL 8 | 4.18 | — | Backportované featury |
+| RHEL 9 | 5.14 | — | Backportované featury |
+| RHEL 10 | 6.12 | — | — |
+| Rocky/Alma 8 | 4.18 | — | Stejný jako RHEL 8 |
+| Rocky/Alma 9 | 5.14 | — | Stejný jako RHEL 9 |
+| Debian 11 | 5.10 | 6.1 (backports) | — |
+| Debian 12 | 6.1 | — | — |
+| SLES 15 SP5 | 5.14 | — | — |
+| SLES 15 SP6 | 6.4 | — | — |
+| Fedora 40 | 6.8+ | — | Rolling upstream |
+| Alpine 3.20 | 6.6 | — | — |
+
+---
+
+## Srovnání dle use case
+
+| Use case | Doporučená distribuce | Zdůvodnění |
+|----------|---------------------|-------|
+| **AI/GPU cluster (DGX)** | Ubuntu 22.04 LTS / DGX OS | NVIDIA standard, CUDA, MLNX_OFED |
+| **Enterprise K8s (OpenShift)** | RHEL 9 / RHCOS | Red Hat support, GPU Operator |
+| **Vanilla K8s (on-prem)** | Ubuntu 22.04 LTS + Flatcar (workers) | Community support, minimal worker image |
+| **HPC cluster (Slurm)** | Rocky Linux 9 / Ubuntu 22.04 | EL ekosystém + Lustre, nebo Ubuntu |
+| **Traditional enterprise DB (Oracle, SAP)** | RHEL 9 / SLES 15 | Vendor certifikace |
+| **Container host** | Ubuntu 22.04 / Alpine | Široká image kompatibilita / min size |
+| **Vývoj / desktop** | Fedora / Ubuntu 24.04 / OpenSUSE Tumbleweed | Aktuální balíčky, HW support |
+| **Embedded / IoT** | Debian / Alpine / Yocto | Minimal footprint, stabilita |
+| **Edge inference** | Ubuntu (ARM) / NVIDIA JetPack | Jetson, GPU support |
+| **Mainframe (IBM z/Arch)** | SLES 15 / RHEL 9 | IBM certifikace |
+
+---
+
+## Package management srovnání
+
+| Vlastnost | apt (Debian/Ubuntu) | dnf (RHEL/Rocky/Alma/Fedora) | zypper (SUSE) | pacman (Arch) | apk (Alpine) |
+|-----------|--------------------|------------------------------|---------------|---------------|-------------|
+| **Formát balíčků** | .deb | .rpm | .rpm | .pkg.tar.zst | .apk |
+| **Repo management** | /etc/apt/sources.list | /etc/yum.repos.d/ | /etc/zypp/repos.d/ | /etc/pacman.conf | /etc/apk/repositories |
+| **Lock file** | — (apt-mark hold) | — (exclude) | — (lock) | — (IgnorePkg) | — |
+| **Transactional update** | Ne | Ano (dnf history) | Ano (zypper history) | Ne | Ne |
+| **Rollback** | Ne (manual) | Ano (dnf history rollback) | Ano (snapper + zypper) | Ne | Ne |
+| **Delta updates** | Ne (pdiffs jen pro indexy balíčků) | Ano (deltarpm) | Ano (zsync) | Ne | Ne |
+| **Verze (k 2025)** | apt 2.7+ | dnf 4.18+ | zypper 1.14+ | pacman 6.1+ | apk 2.14+ |
+
+---
+
+## Security model porovnání
+
+| Vlastnost | SELinux (RHEL deriváty) | AppArmor (Ubuntu/Debian/SUSE) |
+|-----------|----------------------|------------------------------|
+| **Typ** | Mandatory Access Control (MAC) | Mandatory Access Control (MAC) |
+| **Labelování** | Kontextové (user:role:type) | Path-based (profil k executable) |
+| **Konfigurace** | Policy (moduly, booleany) | Profily (textové, v /etc/apparmor.d/) |
+| **Režimy** | Enforcing / Permissive / Disabled | Enforce / Complain / Disabled |
+| **Křivka učení** | Strmá (politiky komplexní) | Mírná (profily jednodušší) |
+| **Default v** | RHEL, Rocky, Alma, Fedora | Ubuntu, Debian, SLES, OpenSUSE |
+| **Use case** | Enterprise multi-tenant, regulované prostředí | Univerzální server, containment aplikací |
+| **Container integrace** | SELinux labels na kontejner | AppArmor profile na kontejner |
+
+Další vrstvy:
+- **seccomp** — syscall filtering (default v containerd, Docker)
+- **Capabilities** — Linux capabilities (drop vše kromě nutných)
+- **cgroups v2** — resource isolation (CPU, memory, IO, PID)
+- **User namespaces** — rootless kontejnery (Podman, Docker rootless)
+
+---
+
+## Doporučená migrační cesta pro EOL distribuce
+
+| Ze staré verze | Na | Doporučený postup |
+|----------------|-----|-------------------|
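+| Ubuntu 20.04 (EOL 2025) | Ubuntu 22.04 nebo 24.04 | `do-release-upgrade` (vždy o jedno LTS: 20.04 → 22.04 → 24.04) nebo fresh install |
+| RHEL 7 (EOL 2024) | RHEL 8 nebo 9 | `leapp` upgrade (7 → 8, poté 8 → 9), nebo fresh install |
+| Rocky/Alma 8 | Rocky/Alma 9 | ELevate (postavený na `leapp`, projekt AlmaLinux) nebo fresh install; in-place upgrade přes `dnf --releasever` mezi major verzemi není podporován |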
+| Debian 11 (EOL LTS 2026) | Debian 12 | `apt full-upgrade` + nové sources.list |
+| SLES 15 SP4 (EOL 2025) | SLES 15 SP6 | `zypper migration` |
+| Fedora 40 (EOL 2025) | Fedora 42+ | `dnf system-upgrade` |
+
+---
+
+## Microsoft Windows
+
+### Windows Server — edice
+
+| Edice | Cena (approx) | Core limity | VM rights | Use case |
+|-------|--------------|-------------|-----------|----------|
+| **Datacenter** | ~$6 155 (2025) | Neomezen | Neomezené Windows VM na hostiteli | Virtualizace, SDDC, S2D, HCI |
+| **Standard** | ~$1 069 (2025) | 2 CPU, neomezen jader | 2 Windows VM + Hyper-V host | Běžný server, AD, file server |
+| **Essentials** | ~$501 (2025) | 1 CPU, max 10 jader | — | Malé firmy (do 25 uživatelů) |
+| **Azure Edition** | Pay-as-you-go | Dle Azure VM | Dle Azure | Azure-only, hotpatching |
+
+Licencování: Windows Server Standard a Datacenter se licencují **per core**: minimálně 16 core licencí na server a 8 na fyzický procesor; při licencování per VM minimálně 8 core licencí na VM.
+
+### Windows Server — support lifecycle
+
+> **Mainstream:** běžné aktualizace (bug fixy, security, feature). **Extended:** jen security aktualizace (zdarma).
+> **ESU:** Extended Security Updates (placená vrstva navíc, cca $45–300/core/rok).
+
+| Verze | Release | Mainstream support | Extended support | ESU | Poznámka |
+|-------|---------|------------------|-----------------|-----|----------|
+| **2012 R2** | 2013-11 | 2018-10 | 2023-10 | Konec 2026-10 (3. rok) | ESU placená, poslední rok |
+| **2016** | 2016-10 | 2022-01 | 2027-01 | — | První s Windows kontejnery a Nano Serverem |
+| **2019** | 2018-11 | 2024-01 | 2029-01 | — | Poslední s Nano Server (jen 1803) |
+| **2022** | 2021-09 | 2026-10 | 2031-10 | — | Aktuální, TPM 2.0, Credential Guard |
+| **2025** | 2024-11 | 2029-10 | 2034-10 | — | Hotpatching, PowerShell 7, SMB over QUIC |
+
+### Windows Server — verze vs edice grid
+
+| Verze | Hyper-V | Storage Spaces Direct | Software-defined networking | Containers | GPU DDA / vGPU | WSL2 |
+|-------|---------|---------------------|---------------------------|------------|---------------|------|
+| 2016 Standard | Ano | Ne (jen Datacenter) | Ne (jen Datacenter) | Jen Windows | Ano | Ne |
+| 2016 Datacenter | Ano | Ano | Ano | Windows | Ano | Ne |
+| 2019 Standard | Ano | Ne | Ne | Windows | Ano | Ne |
+| 2019 Datacenter | Ano | Ano | Ano | Windows | Ano | Ne |
+| 2022 Standard | Ano | Ne | Ne | Windows + Linux | Ano | Ne |
+| 2022 Datacenter | Ano | Ano | Ano | Windows + Linux (2022.2+) | Ano | Ne |
+| 2025 Datacenter | Ano | Ano | Ano | Windows + Linux | Ano | Ano |
+
+### Windows Desktop — support lifecycle
+
+> **E = Enterprise, Pro = Professional, Home = Consumer**
+> LTSC = Long Term Servicing Channel (stabilní, bez feature updatů)
+
+| Verze | Release | EOL (Home/Pro) | EOL (Enterprise) | LTSC EOL | Poznámka |
+|-------|---------|---------------|-----------------|----------|----------|
+| **10 21H2** | 2021-11 | — | 2024-06 | — | — |
+| **10 22H2** | 2022-10 | 2025-10 | 2025-10 | — | Poslední Windows 10 |
+| **10 LTSC 2021** | 2021-11 | — | — | 2032-01 | IoT Enterprise LTSC |
+| **11 22H2** | 2022-09 | 2024-10 | 2025-10 | — | — |
+| **11 23H2** | 2023-10 | 2025-11 | 2026-11 | — | — |
+| **11 24H2** | 2024-10 | 2026-10 | 2027-10 | — | První s Recall, Copilot+ |
+| **11 LTSC 2024** | 2024-10 | — | — | 2029-10 | Enterprise LTSC |
+
+Podpora Windows 10 **skončila 2025-10-14**.
+
+### Windows vs Linux — comparison
+
+| Feature | Windows Server | RHEL / Ubuntu |
+|-----------|---------------|---------------|
+| **License (server)** | $500–6,000 (per core) + CAL | $0–800 (per node subscription) |
+| **License (desktop)** | $100–200 (OEM/retail) | Free |
+| **Support cost** | Included in the license (SA/ESU) | $200–1,300/node/year (RHEL) |
+| **Package management** | MSI, AppX, winget, NuGet | APT, DNF, Zypper |
+| **Package count** | ~10,000 (Chocolatey) | ~60,000+ (Ubuntu repo) |
+| **Desktop GUI** | Windows Shell (mandatory) | Optional (GNOME, KDE, XFCE…) |
+| **Server GUI** | Desktop Experience (optional; Server Core is the default install) | CLI-only (standard) |
+| **Kernel** | NT hybrid kernel (kernel-mode Win32) | Monolithic Linux kernel |
+| **Device support** | OEM driver model (WHQL) | Open source + vendor drivers |
+| **Container types** | Windows + Linux (WSL2) | Linux (Docker, Podman, containerd) |
+| **Container registry** | Docker Hub, ACR, Nexus | Docker Hub, Quay, GHCR, Nexus… |
+| **Container image size** | ~4–8 GB (Windows Server Core) | ~100 MB – 1 GB (Alpine/Ubuntu) |
+| **GPU passthrough** | DDA (Discrete Device Assignment) | GPU Direct, VFIO, SR-IOV |
+| **AI/ML support** | WSL2 (CUDA), Azure ML | Native CUDA, ROCm, oneAPI |
+| **CUDA support** | Yes (native CUDA Toolkit, WSL2, or Docker) | Native (nvidia-container-toolkit) |
+| **Orchestration** | AD / GPO / SCCM / WAC | Ansible, Puppet, Salt, Foreman |
+| **RBAC/AAA** | Active Directory (+ Kerberos) | LDAP, FreeIPA, SSSD, AD |
+| **Remote management** | RDP, WinRM, PowerShell Remoting | SSH, Cockpit, Webmin |
+| **Filesystem** | NTFS, ReFS, CSVFS | ext4, XFS, Btrfs, ZFS |
+| **Max file system size** | 256 TB (NTFS), 1.2 YB (ReFS) | 1 EB (XFS), 16 EB (ZFS) |
+| **Hypervisor** | Hyper-V (Type 1) | KVM (in-kernel; Type 1 vs. Type 2 is debated), Xen |
+| **Dynamic memory** | Hyper-V Dynamic Memory | KSM, virtio-balloon (KVM) |
+| **Live migration** | Hyper-V Live Migration | KVM live migration (QEMU/libvirt) |
+
+### Windows-specific features
+
+| Feature | Description | Linux replacement |
+|---------|-------|------------------------|
+| **Active Directory** | Identity, auth, GPO, DNS, DHCP | FreeIPA, Samba AD DC, 389-ds, SSSD |
+| **Group Policy** | Central configuration of desktops/servers | Ansible (agentless), Puppet, Salt (agent-based) |
+| **Hyper-V + S2D** | Hyper-converged storage and virtualization (HCI) | Proxmox + Ceph / oVirt + Gluster |
+| **Failover Clustering** | Cluster-aware applications (SQL, File Server) | Pacemaker + Corosync + DRBD |
+| **IIS** | Web server, ASP.NET host | Nginx, Apache (no ASP.NET, or via a separate .NET host) |
+| **PowerShell** | Scripting, Desired State Configuration | Bash, Python, Ansible |
+| **Windows Admin Center** | GUI management | Cockpit, Webmin |
+| **BitLocker** | Full disk encryption | LUKS + cryptsetup |
+| **Windows Defender** | Antivirus + EDR | ClamAV, Wazuh, Osquery |
+| **SQL Server** | Relational DB | PostgreSQL, MySQL, MariaDB |
+
+### Recommended OS by use case (including Windows)
+
+| Use case | OS | Rationale |
+|----------|-----|-------|
+| **Active Directory / GPO / hybrid ID** | Windows Server 2022/2025 | AD runs only on Windows |
+| **SQL Server (failover cluster)** | Windows Server Datacenter + SQL EE | Always On FCI, ReFS |
+| **Exchange / SharePoint** | Windows Server 2022 | Windows only |
+| **Enterprise desktop management** | Windows 11 Enterprise + Intune/SCCM | GPO, AD, enterprise MDM |
+| **.NET / ASP.NET applications** | Windows Server / Linux (.NET Core) | .NET 6+ runs on Linux |
+| **HCI (Microsoft stack)** | Windows Server Datacenter + S2D + Hyper-V | Azure Stack HCI |
+| **Virtualization (mixed workload)** | Windows Server Datacenter (Hyper-V) | Linux and Windows VMs under one hypervisor |
+| **AI/GPU inference** | Linux (Ubuntu) + CUDA | Best NVIDIA support; WSL2 as an alternative |
+| **Container orchestration (Windows nodes)** | Windows Server 2022/2025 + containerd | Windows pods in on-prem AKS |
+| **Tier 2 applications / web / API** | Ubuntu or RHEL (Linux) | Lower TCO, smaller footprint |
+
+### Windows Server migration paths
+
+| From | To | Recommended approach |
+|---------------|-----|-------------------|
+| Windows Server 2012 R2 (EOL 2023) | Windows Server 2022/2025 | In-place upgrade (direct to 2025; to 2022 via a 2016/2019 hop) or fresh install + migration |
+| Windows Server 2016 (EOL 2027) | Windows Server 2022/2025 | In-place upgrade or fresh install |
+| Windows Server 2019 | Windows Server 2022/2025 | In-place upgrade (`Setup.exe /auto upgrade`) |
+| Windows Server 2022 | Windows Server 2025 | In-place upgrade or fresh install |
+| Windows Server → Cloud | Azure VM / Azure Stack HCI | Azure Migrate, Storage Migration Service |
+| Windows Server → Linux | Ubuntu / RHEL (re-platform) | Migrate the application to .NET Core or an alternative |
+
+### Windows — API and operational limits
+
+| Limit | Windows Server | Windows Desktop |
+|-------|---------------|----------------|
+| **Max RAM** | 4 PB with 5-level paging, 256 TB with 4-level paging (2025); 48 TB (2022) | 2 TB (Pro), 6 TB (Enterprise), 128 GB (Home) |
+| **Max CPU sockets** | 64 (Standard and Datacenter) | 2 |
+| **Max CPU cores** | Unlimited | 128 (Pro), 64 (Home) |
+| **Max file size (NTFS)** | 256 TB | 256 TB |
+| **Max file size (ReFS)** | 18.4 EB (2025) | — |
+| **Max volume size (NTFS)** | 256 TB | 256 TB |
+| **Max volume size (ReFS)** | 1.2 YB (theoretical) | — |
+| **Max dedup volume** | 64 TB (Data Deduplication) | — |
+| **Max cluster nodes** | 64 (Failover Cluster) | — |
+| **Max VM per host** | Unlimited (Datacenter) | — |
+| **VM memory per VM** | 12 TB (2016–2022), 240 TB (2025) | — |
+| **VM vCPU per VM** | 240 (2016–2022), 2,048 (2025) | — |
+| **Concurrent RDP** | 2 (admin), 200+ (RDS CAL) | 1 (Pro/Enterprise); Home cannot host RDP |
+| **PowerShell Remoting** | Unlimited (WinRM) | Yes (WinRM) |
+
+- [AI-INFRASTRUCTURE.md](AI-INFRASTRUCTURE.md) — OS for AI workloads, GPU drivers, kernel parameters
+- [KUBERNETES.md](KUBERNETES.md) — container runtime, orchestration
+- [HYPERVISORS.md](HYPERVISORS.md) — hypervisors, VM host OS
+- [DATACENTERS.md](DATACENTERS.md) — DC layout, HW platforms
+
+## Sources
+
+Links, books and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
+
+*Last revision: 2026-06-18*
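
The lifecycle tables in the file above list end-of-support months per release. A minimal sketch of how such dates could feed a support-status check follows. It assumes the first day of the listed month as the cutoff, because the tables give only year and month. The `EOL` dictionary, the `status` function, and the `warn_days` threshold are hypothetical names introduced for illustration and are not part of the knowledge base.

```python
from datetime import date

# End-of-support months copied from the lifecycle tables above.
# Assumption: the tables give only year-month, so the first day of the
# month is used as a conservative cutoff.
EOL = {
    "Windows Server 2022 (extended)": "2031-10",
    "Windows Server 2025 (extended)": "2034-10",
    "Windows 11 24H2 (Enterprise)": "2027-10",
    "Windows 10 22H2": "2025-10",
}


def parse_month(year_month: str) -> date:
    """Turn 'YYYY-MM' into the first day of that month."""
    year, month = year_month.split("-")
    return date(int(year), int(month), 1)


def days_left(product: str, today: date) -> int:
    """Days until the assumed cutoff; negative once support has ended."""
    return (parse_month(EOL[product]) - today).days


def status(product: str, today: date, warn_days: int = 365) -> str:
    """Classify a product as 'expired', 'expiring', or 'supported'."""
    left = days_left(product, today)
    if left < 0:
        return "expired"
    if left <= warn_days:
        return "expiring"
    return "supported"


if __name__ == "__main__":
    reference = date(2026, 6, 18)  # revision date of the file above
    for name in EOL:
        print(f"{name}: {status(name, reference)}")
```

A real inventory check would use the exact dates published by the vendor rather than the month-level approximation used here.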
--- a/POSTGRESQL.en.md
+++ b/POSTGRESQL.en.md
@@ -166,7 +166,7 @@ LIMIT 10;

 ## Sources

-References, books, and standards: [sources/databases/sources.md](sources/databases/sources.md)
+References, books, and standards: [sources/databases/sources.en.md](sources/databases/sources.en.md)

 ### Recommended reading

--- a/PROVISIONING.en.md
+++ b/PROVISIONING.en.md
@@ -167,7 +167,7 @@ resource "vsphere_virtual_machine" "web" {
 }
 ```

-More in [CICD.md](CICD.md#infrastructure-as-code-iac).
+More in [CICD.en.md](CICD.en.md#infrastructure-as-code-iac).

 ## Firmware management

@@ -188,7 +188,7 @@ More in [CICD.md](CICD.md#infrastructure-as-code-iac).
 | **Chef** | Ruby DSL | Pull (agent) | Compliance, infrastructure automation |
 | **SaltStack** | YAML/Python | Both (salt-minion) | High-speed config, event-driven |

-More in [CICD.md](CICD.md).
+More in [CICD.en.md](CICD.en.md).

 ## OpenStack Provisioning

@@ -223,6 +223,6 @@ OpenStack offers several methods for provisioning infrastructure:

 ## Sources

-Links, books and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
+Links, books and standards: [sources/infrastructure/sources.en.md](sources/infrastructure/sources.en.md)

 *Last revision: 2026-06-03*
--- a/README.en.md
+++ b/README.en.md
@@ -35,11 +35,11 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
  │ PCIe,BM) │ │(BIOS,    │ │  AMD)  │ │  Terraform)  │
  └──────────┘ │ NUMA)    │ └────────┘ └──────────────┘
               └──────────┘
-  ┌──────────┐ ┌──────────┐ ┌────────┐
-  │HYPERVISOR│ │ MONITOR  │ │  CICD  │
-  │(VMware,  │ │(Prom,    │ │(GitOps, │
-  │ KVM, ...)│ │ Grafana) │ │  IaC)   │
-  └──────────┘ └──────────┘ └────────┘
+  ┌──────────┐ ┌──────────┐ ┌────────┐ ┌────────────┐
+  │HYPERVISOR│ │ MONITOR  │ │  CICD  │ │   ☸ K8s    │
+  │(VMware,  │ │(Prom,    │ │(GitOps,│ │(CAPI, K3s, │
+  │ KVM, ...)│ │ Grafana) │ │  IaC)  │ │  RKE2...)  │
+  └──────────┘ └──────────┘ └────────┘ └────────────┘
 ```

 ---
@@ -52,15 +52,22 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
 | 🌐 Network architecture | [NETWORKING.md](NETWORKING.md) | DNS, BGP, VPC, Zero Trust, EVPN VXLAN, TLS | CLOUD |
 | 📊 Monitoring & observability | [MONITORING.md](MONITORING.md) | Prometheus, Grafana, OTel, logging, alerting | — |
 | 🔄 CI/CD & DevOps | [CICD.md](CICD.md) | Pipelines, GitOps, IaC (Terraform), deployment | — |
+| 💻 Operating systems | [OS.md](OS.md) | Linux distributions, Windows Server, lifecycle, EOL, kernel | KUBERNETES, HYPERVISORS, AI-INFRASTRUCTURE |
+| 🔄 Disaster Recovery | [DR.md](DR.md) | RTO, RPO, scenarios, prevention, uptime calculation | CLOUD, DATACENTERS, MONITORING |
 | 🗄️ Database architecture | [DATABASES.md](DATABASES.md) | Classification, sharding, replication, caching | POSTGRESQL, MYSQL, ORACLE, MONGODB, REDIS, CASSANDRA, VEKTOROVE-DB, DATABAZOVE-ENGINY |
+| 🗄️ Big Data | [BIG-DATA.md](BIG-DATA.md) | HDFS, Spark, Flink, Trino, Iceberg, Delta Lake, Lakehouse | DATABASES, CLOUD, MESSAGING, KUBERNETES |
 | 🖥️ Hypervisors | [HYPERVISORS.md](HYPERVISORS.md) | VMware, Hyper-V, KVM, Proxmox, migration | STORAGE, SERVER-HW |
-| 🏭 Data centers | [DATACENTERS.md](DATACENTERS.md) | Tier, power, cooling, layout, DC services | MONITORING |
+| 🏭 Data centers | [DATACENTERS.md](DATACENTERS.md) | Tier, power, cooling, layout, DC services, secondary DC topologies | MONITORING, MESSAGING |
 | 💾 Storage | [STORAGE.md](STORAGE.md) | SAN/NAS/object, RAID, SDS, Ceph, OpenStack Cinder/Swift/Manila | — |
 | 🔌 Server connectivity | [CONNECTIVITY.md](CONNECTIVITY.md) | Ethernet, FC SAN, iSCSI, NVMe-oF, SAS | — |
 | 🔧 Server hardware | [SERVER-HW.md](SERVER-HW.md) | CPU, RAM, PCIe, NUMA, BMC | CONNECTIVITY |
 | 🎮 GPU | [GPU.md](GPU.md) | NVIDIA/AMD, NVLink, MIG/vGPU, AI, Cyborg | — |
 | ⚙️ Server config | [SERVER-CONFIG.md](SERVER-CONFIG.md) | BIOS tuning, DB/hypervisor/K8s/storage best practices | — |
 | 📦 Provisioning | [PROVISIONING.md](PROVISIONING.md) | PXE, Redfish, Terraform, Ironic, OpenStack deploy | CICD |
+| ☸ Kubernetes | [KUBERNETES.md](KUBERNETES.md) | K8s architecture, deployment, Cluster API (CAPI) | CICD, CLOUD, NETWORKING |
+| 📨 Messaging & streaming | [MESSAGING.md](MESSAGING.md) | Kafka, RabbitMQ, Pulsar, NATS, managed queue/pubsub | DATACENTERS, CLOUD |
+| 🏗️ DC Migration | [DC-MIGRATION.md](DC-MIGRATION.md) | Strategies, phases, network, DB, rollback | DATACENTERS, CLOUD, DR, NETWORKING, STORAGE |
+| 🧠 AI Infrastructure | [AI-INFRASTRUCTURE.md](AI-INFRASTRUCTURE.md) | GPU, AI networking, storage, cluster, cooling, training/inference | GPU, NETWORKING, STORAGE, DATACENTERS, CLOUD |
 | 📋 Legacy index | [HARDWARE.md](HARDWARE.md) | → SERVER-HW, GPU, SERVER-CONFIG, PROVISIONING | SERVER-HW, GPU, SERVER-CONFIG, PROVISIONING |
 | 📋 Legacy infra | [INFRASTRUCTURE.md](INFRASTRUCTURE.md) | → HYPERVISORS, DATACENTERS, STORAGE, HARDWARE | HYPERVISORS, DATACENTERS, STORAGE, HARDWARE |
 | 📋 Review workflow | [REVIEW.md](REVIEW.md) | Review and content control process | — |
@@ -70,12 +77,12 @@ Bilingual: Czech (`.md`) and English (`.en.md`).

 | File | Description |
 |------|-------------|
-| [POSTGRESQL.md](POSTGRESQL.md) | PostgreSQL — architecture, replication, tuning |
-| [MYSQL.md](MYSQL.md) | MySQL & MariaDB |
-| [ORACLE.md](ORACLE.md) | Oracle Database — RAC, Data Guard, tuning |
-| [MONGODB.md](MONGODB.md) | MongoDB — document DB, sharding, replica sets |
-| [REDIS.md](REDIS.md) | Redis — cache, session store, streams |
-| [CASSANDRA.md](CASSANDRA.md) | Cassandra & ScyllaDB — wide-column, nosql |
+| [POSTGRESQL.en.md](POSTGRESQL.en.md) | PostgreSQL — architecture, replication, tuning |
+| [MYSQL.en.md](MYSQL.en.md) | MySQL & MariaDB |
+| [ORACLE.en.md](ORACLE.en.md) | Oracle Database — RAC, Data Guard, tuning |
+| [MONGODB.en.md](MONGODB.en.md) | MongoDB — document DB, sharding, replica sets |
+| [REDIS.en.md](REDIS.en.md) | Redis — cache, session store, streams |
+| [CASSANDRA.en.md](CASSANDRA.en.md) | Cassandra & ScyllaDB — wide-column, nosql |
 | [VEKTOROVE-DB.md](VEKTOROVE-DB.md) | Vector databases — Pinecone, Qdrant, Milvus, pgvector |
 | [DATABAZOVE-ENGINY.md](DATABAZOVE-ENGINY.md) | Common DB concepts — transactions, indexes, locking |

@@ -89,15 +96,22 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
 | 🌐 Network architecture | [NETWORKING.en.md](NETWORKING.en.md) | DNS, BGP, VPC, Zero Trust, EVPN VXLAN, TLS | CLOUD |
 | 📊 Monitoring & observability | [MONITORING.en.md](MONITORING.en.md) | Prometheus, Grafana, OTel, logging, alerting | — |
 | 🔄 CI/CD & DevOps | [CICD.en.md](CICD.en.md) | Pipelines, GitOps, IaC (Terraform), deployment | — |
+| 💻 Operating systems | [OS.en.md](OS.en.md) | Linux distributions, Windows Server, lifecycle, EOL, kernel | KUBERNETES, HYPERVISORS, AI-INFRASTRUCTURE |
+| 🔄 Disaster Recovery | [DR.en.md](DR.en.md) | RTO, RPO, scenarios, prevention, uptime calculation | CLOUD, DATACENTERS, MONITORING |
 | 🗄️ Database architecture | [DATABASES.en.md](DATABASES.en.md) | Classification, sharding, replication, caching | POSTGRESQL, MYSQL, ORACLE, MONGODB, REDIS, CASSANDRA, VECTOR-DBS, DATABASE-ENGINES |
+| 🗄️ Big Data | [BIG-DATA.en.md](BIG-DATA.en.md) | HDFS, Spark, Flink, Trino, Iceberg, Delta Lake, Lakehouse | DATABASES, CLOUD, MESSAGING, KUBERNETES |
 | 🖥️ Hypervisors | [HYPERVISORS.en.md](HYPERVISORS.en.md) | VMware, Hyper-V, KVM, Proxmox, migration | STORAGE, SERVER-HW |
-| 🏭 Data centers | [DATACENTERS.en.md](DATACENTERS.en.md) | Tier, power, cooling, layout, DC services | MONITORING |
+| 🏭 Data centers | [DATACENTERS.en.md](DATACENTERS.en.md) | Tier, power, cooling, layout, DC services, secondary DC topologies | MONITORING, MESSAGING |
 | 💾 Storage | [STORAGE.en.md](STORAGE.en.md) | SAN/NAS/object, RAID, SDS, Ceph | — |
 | 🔌 Server connectivity | [CONNECTIVITY.en.md](CONNECTIVITY.en.md) | Ethernet, FC SAN, iSCSI, NVMe-oF, SAS | — |
 | 🔧 Server hardware | [SERVER-HW.en.md](SERVER-HW.en.md) | CPU, RAM, PCIe, NUMA, BMC | CONNECTIVITY |
 | 🎮 GPU | [GPU.en.md](GPU.en.md) | NVIDIA/AMD, NVLink, MIG/vGPU, AI, Cyborg | — |
 | ⚙️ Server config | [SERVER-CONFIG.en.md](SERVER-CONFIG.en.md) | BIOS tuning, DB/hypervisor/K8s/storage best practices | — |
 | 📦 Provisioning | [PROVISIONING.en.md](PROVISIONING.en.md) | PXE, Redfish, Terraform, Ironic, OpenStack deploy | CICD |
+| ☸ Kubernetes | [KUBERNETES.en.md](KUBERNETES.en.md) | K8s architecture, deployment, Cluster API (CAPI) | CICD, CLOUD, NETWORKING |
+| 📨 Messaging & streaming | [MESSAGING.en.md](MESSAGING.en.md) | Kafka, RabbitMQ, Pulsar, NATS, managed queue/pubsub | DATACENTERS, CLOUD |
+| 🏗️ DC Migration | [DC-MIGRATION.en.md](DC-MIGRATION.en.md) | Strategies, phases, network, DB, rollback | DATACENTERS, CLOUD, DR, NETWORKING, STORAGE |
+| 🧠 AI Infrastructure | [AI-INFRASTRUCTURE.en.md](AI-INFRASTRUCTURE.en.md) | GPU, AI networking, storage, cluster, cooling, training/inference | GPU, NETWORKING, STORAGE, DATACENTERS, CLOUD |
 | 📋 Legacy index | [HARDWARE.en.md](HARDWARE.en.md) | → SERVER-HW, GPU, SERVER-CONFIG, PROVISIONING | SERVER-HW, GPU, SERVER-CONFIG, PROVISIONING |
 | 📋 Legacy infra | [INFRASTRUCTURE.en.md](INFRASTRUCTURE.en.md) | → HYPERVISORS, DATACENTERS, STORAGE, HARDWARE | HYPERVISORS, DATACENTERS, STORAGE, HARDWARE |
 | 📋 Review workflow | [REVIEW.en.md](REVIEW.en.md) | Review and content control process | — |
@@ -122,7 +136,7 @@ Bilingual: Czech (`.md`) and English (`.en.md`).

 | File | Description |
 |------|-------------|
-| [case-studies/proxmox-demo/README.md](case-studies/proxmox-demo/README.md) | Proxmox VE demo cluster — design (CZ) |
+| [case-studies/proxmox-demo/README.md](case-studies/proxmox-demo/README.md) | Proxmox VE demo cluster — návrh (CZ) |
 | [case-studies/proxmox-demo/README.en.md](case-studies/proxmox-demo/README.en.md) | Proxmox VE demo cluster — design (EN) |

 ---
@@ -131,21 +145,28 @@ Bilingual: Czech (`.md`) and English (`.en.md`).

 | File | References |
 |------|------------|
-| `CLOUD.md` / `CLOUD.en.md` | [`GPU.md`](GPU.md), [`NETWORKING.md`](NETWORKING.md), [`sources/cloud/sources.md`](sources/cloud/sources.md) |
-| `NETWORKING.md` / `NETWORKING.en.md` | [`CLOUD.md`](CLOUD.md), [`sources/networking/sources.md`](sources/networking/sources.md) |
-| `DATACENTERS.md` / `DATACENTERS.en.md` | [`MONITORING.md`](MONITORING.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
-| `MONITORING.md` / `MONITORING.en.md` | [`sources/monitoring/sources.md`](sources/monitoring/sources.md) |
-| `CICD.md` / `CICD.en.md` | [`sources/cicd/sources.md`](sources/cicd/sources.md) |
-| `PROVISIONING.md` / `PROVISIONING.en.md` | [`CICD.md`](CICD.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
-| `STORAGE.md` / `STORAGE.en.md` | [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
-| `GPU.md` / `GPU.en.md` | [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
-| `SERVER-HW.md` / `SERVER-HW.en.md` | [`CONNECTIVITY.md`](CONNECTIVITY.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
-| `SERVER-CONFIG.md` / `SERVER-CONFIG.en.md` | [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
-| `CONNECTIVITY.md` / `CONNECTIVITY.en.md` | [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
-| `HYPERVISORS.md` / `HYPERVISORS.en.md` | [`STORAGE.md`](STORAGE.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
-| `DATABASES.md` / `DATABASES.en.md` | [`POSTGRESQL.md`](POSTGRESQL.md), [`MYSQL.md`](MYSQL.md), [`ORACLE.md`](ORACLE.md), [`MONGODB.md`](MONGODB.md), [`REDIS.md`](REDIS.md), [`CASSANDRA.md`](CASSANDRA.md), [`VEKTOROVE-DB.md`](VEKTOROVE-DB.md), [`DATABAZOVE-ENGINY.md`](DATABAZOVE-ENGINY.md), [`sources/databases/sources.md`](sources/databases/sources.md) |
-| `HARDWARE.md` / `HARDWARE.en.md` | [`SERVER-HW.md`](SERVER-HW.md), [`GPU.md`](GPU.md), [`SERVER-CONFIG.md`](SERVER-CONFIG.md), [`PROVISIONING.md`](PROVISIONING.md) |
-| `INFRASTRUCTURE.md` / `INFRASTRUCTURE.en.md` | [`HYPERVISORS.md`](HYPERVISORS.md), [`DATACENTERS.md`](DATACENTERS.md), [`STORAGE.md`](STORAGE.md), [`HARDWARE.md`](HARDWARE.md) |
+| `CLOUD.md` / `CLOUD.en.md` | [`GPU.en.md`](GPU.en.md), [`NETWORKING.en.md`](NETWORKING.en.md), [`sources/cloud/sources.en.md`](sources/cloud/sources.en.md) |
+| `NETWORKING.md` / `NETWORKING.en.md` | [`CLOUD.en.md`](CLOUD.en.md), [`sources/networking/sources.en.md`](sources/networking/sources.en.md) |
+| `DATACENTERS.md` / `DATACENTERS.en.md` | [`MONITORING.en.md`](MONITORING.en.md), [`sources/infrastructure/sources.en.md`](sources/infrastructure/sources.en.md) |
+| `MONITORING.md` / `MONITORING.en.md` | [`sources/monitoring/sources.en.md`](sources/monitoring/sources.en.md) |
+| `CICD.md` / `CICD.en.md` | [`sources/cicd/sources.en.md`](sources/cicd/sources.en.md) |
+| `DR.md` / `DR.en.md` | [`CLOUD.en.md`](CLOUD.en.md), [`DATACENTERS.en.md`](DATACENTERS.en.md), [`MONITORING.en.md`](MONITORING.en.md), [`CICD.en.md`](CICD.en.md), [`STORAGE.en.md`](STORAGE.en.md), [`sources/infrastructure/sources.en.md`](sources/infrastructure/sources.en.md) |
+| `MESSAGING.md` / `MESSAGING.en.md` | [`DATACENTERS.en.md`](DATACENTERS.en.md), [`CLOUD.en.md`](CLOUD.en.md), [`sources/infrastructure/sources.en.md`](sources/infrastructure/sources.en.md) |
+| `DC-MIGRATION.md` / `DC-MIGRATION.en.md` | [`DATACENTERS.en.md`](DATACENTERS.en.md), [`CLOUD.en.md`](CLOUD.en.md), [`DR.en.md`](DR.en.md), [`NETWORKING.en.md`](NETWORKING.en.md), [`STORAGE.en.md`](STORAGE.en.md), [`sources/infrastructure/sources.en.md`](sources/infrastructure/sources.en.md) |
+| `AI-INFRASTRUCTURE.md` / `AI-INFRASTRUCTURE.en.md` | [`GPU.en.md`](GPU.en.md), [`NETWORKING.en.md`](NETWORKING.en.md), [`STORAGE.en.md`](STORAGE.en.md), [`DATACENTERS.en.md`](DATACENTERS.en.md), [`CLOUD.en.md`](CLOUD.en.md), [`sources/infrastructure/sources.en.md`](sources/infrastructure/sources.en.md) |
+| `PROVISIONING.md` / `PROVISIONING.en.md` | [`CICD.en.md`](CICD.en.md), [`sources/infrastructure/sources.en.md`](sources/infrastructure/sources.en.md) |
+| `STORAGE.md` / `STORAGE.en.md` | [`sources/infrastructure/sources.en.md`](sources/infrastructure/sources.en.md) |
+| `GPU.md` / `GPU.en.md` | [`sources/infrastructure/sources.en.md`](sources/infrastructure/sources.en.md) |
+| `SERVER-HW.md` / `SERVER-HW.en.md` | [`CONNECTIVITY.en.md`](CONNECTIVITY.en.md), [`sources/infrastructure/sources.en.md`](sources/infrastructure/sources.en.md) |
+| `SERVER-CONFIG.md` / `SERVER-CONFIG.en.md` | [`sources/infrastructure/sources.en.md`](sources/infrastructure/sources.en.md) |
+| `CONNECTIVITY.md` / `CONNECTIVITY.en.md` | [`sources/infrastructure/sources.en.md`](sources/infrastructure/sources.en.md) |
+| `HYPERVISORS.md` / `HYPERVISORS.en.md` | [`STORAGE.en.md`](STORAGE.en.md), [`sources/infrastructure/sources.en.md`](sources/infrastructure/sources.en.md) |
+| `DATABASES.md` / `DATABASES.en.md` | [`POSTGRESQL.en.md`](POSTGRESQL.en.md), [`MYSQL.en.md`](MYSQL.en.md), [`ORACLE.en.md`](ORACLE.en.md), [`MONGODB.en.md`](MONGODB.en.md), [`REDIS.en.md`](REDIS.en.md), [`CASSANDRA.en.md`](CASSANDRA.en.md), [`VEKTOROVE-DB.md`](VEKTOROVE-DB.md), [`DATABAZOVE-ENGINY.md`](DATABAZOVE-ENGINY.md), [`sources/databases/sources.en.md`](sources/databases/sources.en.md) |
+| `HARDWARE.md` / `HARDWARE.en.md` | [`SERVER-HW.en.md`](SERVER-HW.en.md), [`GPU.en.md`](GPU.en.md), [`SERVER-CONFIG.en.md`](SERVER-CONFIG.en.md), [`PROVISIONING.en.md`](PROVISIONING.en.md) |
+| `OS.md` / `OS.en.md` | [`AI-INFRASTRUCTURE.en.md`](AI-INFRASTRUCTURE.en.md), [`KUBERNETES.en.md`](KUBERNETES.en.md), [`HYPERVISORS.en.md`](HYPERVISORS.en.md), [`DATACENTERS.en.md`](DATACENTERS.en.md), [`sources/infrastructure/sources.en.md`](sources/infrastructure/sources.en.md) |
+| `KUBERNETES.md` / `KUBERNETES.en.md` | [`CICD.en.md`](CICD.en.md), [`CLOUD.en.md`](CLOUD.en.md), [`NETWORKING.en.md`](NETWORKING.en.md), [`sources/infrastructure/sources.en.md`](sources/infrastructure/sources.en.md) |
+| `BIG-DATA.md` / `BIG-DATA.en.md` | [`DATABASES.en.md`](DATABASES.en.md), [`CLOUD.en.md`](CLOUD.en.md), [`MESSAGING.en.md`](MESSAGING.en.md), [`KUBERNETES.en.md`](KUBERNETES.en.md), [`sources/infrastructure/sources.en.md`](sources/infrastructure/sources.en.md) |
+| `INFRASTRUCTURE.md` / `INFRASTRUCTURE.en.md` | [`HYPERVISORS.en.md`](HYPERVISORS.en.md), [`DATACENTERS.en.md`](DATACENTERS.en.md), [`STORAGE.en.md`](STORAGE.en.md), [`HARDWARE.en.md`](HARDWARE.en.md) |

 ---

@@ -187,4 +208,4 @@ Raw reference data (documentation, books, standards) by area:

 ---

-*This index is automatically maintained by the `kb-index` agent. Last updated: 2026-06-11.*
+*This index is automatically maintained by the `kb-index` agent. Last updated: 2026-06-18.*
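
Several hunks above switch links in English pages from `.md` targets to their `.en.md` counterparts. A check along the following lines could catch leftovers before review. It is only a sketch: `czech_targets` and `LINK_RE` are hypothetical names, the regular expression handles simple inline Markdown links only, and files that exist in one language only (for example `VEKTOROVE-DB.md` in the index above) would be reported and need an allow-list.

```python
import re

# Matches inline Markdown links and captures the target: [text](target)
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def czech_targets(markdown: str) -> list[str]:
    """Return relative link targets ending in .md but not in .en.md."""
    hits = []
    for target in LINK_RE.findall(markdown):
        path = target.split("#", 1)[0]  # drop an optional #anchor
        if "://" in path:
            continue  # external URL, not a repository file
        if path.endswith(".md") and not path.endswith(".en.md"):
            hits.append(path)
    return hits


if __name__ == "__main__":
    sample = "More in [CICD.md](CICD.md#infrastructure-as-code-iac)."
    print(czech_targets(sample))
```

Running the function over the text of each `*.en.md` file and failing the pipeline on a non-empty result would be one way to wire it into CI.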
--- a/README.md
+++ b/README.md
@@ -35,11 +35,11 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
  │ PCIe,BM) │ │(BIOS,    │ │  AMD)  │ │  Terraform)  │
  └──────────┘ │ NUMA)    │ └────────┘ └──────────────┘
               └──────────┘
-  ┌──────────┐ ┌──────────┐ ┌────────┐
-  │HYPERVISOR│ │ MONITOR  │ │  CICD  │
-  │(VMware,  │ │(Prom,    │ │(GitOps, │
-  │ KVM, ...)│ │ Grafana) │ │  IaC)   │
-  └──────────┘ └──────────┘ └────────┘
+  ┌──────────┐ ┌──────────┐ ┌────────┐ ┌────────────┐
+  │HYPERVISOR│ │ MONITOR  │ │  CICD  │ │   ☸ K8s    │
+  │(VMware,  │ │(Prom,    │ │(GitOps,│ │(CAPI, K3s, │
+  │ KVM, ...)│ │ Grafana) │ │  IaC)  │ │  RKE2...)  │
+  └──────────┘ └──────────┘ └────────┘ └────────────┘
 ```

 ---
@@ -52,15 +52,22 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
 | 🌐 Síťová architektura | [NETWORKING.md](NETWORKING.md) | DNS, BGP, VPC, Zero Trust, EVPN VXLAN, TLS | CLOUD |
 | 📊 Monitoring a observabilita | [MONITORING.md](MONITORING.md) | Prometheus, Grafana, OTel, logging, alerting, SLO | — |
 | 🔄 CI/CD a DevOps | [CICD.md](CICD.md) | Pipelines, GitOps, IaC (Terraform), deployment strategie | — |
+| 💻 Operační systémy | [OS.md](OS.md) | Linux distribuce, Windows Server, lifecycle, EOL, kernel | KUBERNETES, HYPERVISORS, AI-INFRASTRUCTURE |
+| 🔄 Disaster Recovery | [DR.md](DR.md) | RTO, RPO, scénáře, prevence, výpočet uptimu | CLOUD, DATACENTERS, MONITORING |
 | 🗄️ Databázová architektura | [DATABASES.md](DATABASES.md) | Klasifikace, sharding, replikace, caching | POSTGRESQL, MYSQL, ORACLE, MONGODB, REDIS, CASSANDRA, VEKTOROVE-DB, DATABAZOVE-ENGINY |
+| 🗄️ Big Data | [BIG-DATA.md](BIG-DATA.md) | HDFS, Spark, Flink, Trino, Iceberg, Delta Lake, Lakehouse | DATABASES, CLOUD, MESSAGING, KUBERNETES |
 | 🖥️ Hypervisory | [HYPERVISORS.md](HYPERVISORS.md) | VMware, Hyper-V, KVM, Proxmox, migrace | STORAGE, SERVER-HW |
-| 🏭 Datová centra | [DATACENTERS.md](DATACENTERS.md) | Tier, power, cooling, layout, DC služby | MONITORING |
+| 🏭 Datová centra | [DATACENTERS.md](DATACENTERS.md) | Tier, power, cooling, layout, DC služby, sekundární DC topologie | MONITORING, MESSAGING |
 | 💾 Storage | [STORAGE.md](STORAGE.md) | SAN/NAS/object, RAID, SDS, Ceph, OpenStack Cinder/Swift/Manila | — |
 | 🔌 Server connectivity | [CONNECTIVITY.md](CONNECTIVITY.md) | Ethernet, FC SAN, iSCSI, NVMe-oF, SAS | — |
 | 🔧 Server hardware | [SERVER-HW.md](SERVER-HW.md) | CPU, RAM, PCIe, NUMA, BMC | CONNECTIVITY |
 | 🎮 GPU | [GPU.md](GPU.md) | NVIDIA/AMD, NVLink, MIG/vGPU, AI, Cyborg | — |
 | ⚙️ Server config | [SERVER-CONFIG.md](SERVER-CONFIG.md) | BIOS tuning, DB/hypervisor/K8s/storage best practices | — |
 | 📦 Provisioning | [PROVISIONING.md](PROVISIONING.md) | PXE, Redfish, Terraform, Ironic, OpenStack deploy | CICD |
+| ☸ Kubernetes | [KUBERNETES.md](KUBERNETES.md) | K8s architektura, deployment, Cluster API (CAPI) | CICD, CLOUD, NETWORKING |
+| 📨 Messaging & streaming | [MESSAGING.md](MESSAGING.md) | Kafka, RabbitMQ, Pulsar, NATS, managed queue/pubsub | DATACENTERS, CLOUD |
+| 🏗️ Migrace DC | [DC-MIGRATION.md](DC-MIGRATION.md) | Strategie, fáze, network, DB, rollback | DATACENTERS, CLOUD, DR, NETWORKING, STORAGE |
+| 🧠 AI infrastruktura | [AI-INFRASTRUCTURE.md](AI-INFRASTRUCTURE.md) | GPU, AI networking, storage, cluster, cooling, training/inference | GPU, NETWORKING, STORAGE, DATACENTERS, CLOUD |
 | 📋 Původní rozcestník | [HARDWARE.md](HARDWARE.md) | Legacy index → SERVER-HW, GPU, SERVER-CONFIG, PROVISIONING | SERVER-HW, GPU, SERVER-CONFIG, PROVISIONING |
 | 📋 Původní infrastruktura | [INFRASTRUCTURE.md](INFRASTRUCTURE.md) | Legacy index → HYPERVISORS, DATACENTERS, STORAGE, HARDWARE | HYPERVISORS, DATACENTERS, STORAGE, HARDWARE |
 | 📋 Review workflow | [REVIEW.md](REVIEW.md) | Proces oponentury a kontroly obsahu | — |
@@ -89,15 +96,22 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
 | 🌐 Network architecture | [NETWORKING.en.md](NETWORKING.en.md) | DNS, BGP, VPC, Zero Trust, EVPN VXLAN, TLS | CLOUD |
 | 📊 Monitoring & observability | [MONITORING.en.md](MONITORING.en.md) | Prometheus, Grafana, OTel, logging, alerting | — |
 | 🔄 CI/CD & DevOps | [CICD.en.md](CICD.en.md) | Pipelines, GitOps, IaC (Terraform), deployment | — |
+| 💻 Operating systems | [OS.en.md](OS.en.md) | Linux distributions, Windows Server, lifecycle, EOL, kernel | KUBERNETES, HYPERVISORS, AI-INFRASTRUCTURE |
+| 🔄 Disaster Recovery | [DR.en.md](DR.en.md) | RTO, RPO, scenarios, prevention, uptime calculation | CLOUD, DATACENTERS, MONITORING |
 | 🗄️ Database architecture | [DATABASES.en.md](DATABASES.en.md) | Classification, sharding, replication, caching | POSTGRESQL, MYSQL, ORACLE, MONGODB, REDIS, CASSANDRA, VECTOR-DBS, DATABASE-ENGINES |
+| 🗄️ Big Data | [BIG-DATA.en.md](BIG-DATA.en.md) | HDFS, Spark, Flink, Trino, Iceberg, Delta Lake, Lakehouse | DATABASES, CLOUD, MESSAGING, KUBERNETES |
 | 🖥️ Hypervisors | [HYPERVISORS.en.md](HYPERVISORS.en.md) | VMware, Hyper-V, KVM, Proxmox, migration | STORAGE, SERVER-HW |
-| 🏭 Data centers | [DATACENTERS.en.md](DATACENTERS.en.md) | Tier, power, cooling, layout, DC services | MONITORING |
+| 🏭 Data centers | [DATACENTERS.en.md](DATACENTERS.en.md) | Tier, power, cooling, layout, DC services, secondary DC topologies | MONITORING, MESSAGING |
 | 💾 Storage | [STORAGE.en.md](STORAGE.en.md) | SAN/NAS/object, RAID, SDS, Ceph | — |
 | 🔌 Server connectivity | [CONNECTIVITY.en.md](CONNECTIVITY.en.md) | Ethernet, FC SAN, iSCSI, NVMe-oF, SAS | — |
 | 🔧 Server hardware | [SERVER-HW.en.md](SERVER-HW.en.md) | CPU, RAM, PCIe, NUMA, BMC | CONNECTIVITY |
 | 🎮 GPU | [GPU.en.md](GPU.en.md) | NVIDIA/AMD, NVLink, MIG/vGPU, AI, Cyborg | — |
 | ⚙️ Server config | [SERVER-CONFIG.en.md](SERVER-CONFIG.en.md) | BIOS tuning, DB/hypervisor/K8s/storage best practices | — |
 | 📦 Provisioning | [PROVISIONING.en.md](PROVISIONING.en.md) | PXE, Redfish, Terraform, Ironic, OpenStack deploy | CICD |
+| ☸ Kubernetes | [KUBERNETES.en.md](KUBERNETES.en.md) | K8s architecture, deployment, Cluster API (CAPI) | CICD, CLOUD, NETWORKING |
+| 📨 Messaging & streaming | [MESSAGING.en.md](MESSAGING.en.md) | Kafka, RabbitMQ, Pulsar, NATS, managed queue/pubsub | DATACENTERS, CLOUD |
+| 🏗️ DC Migration | [DC-MIGRATION.en.md](DC-MIGRATION.en.md) | Strategies, phases, network, DB, rollback | DATACENTERS, CLOUD, DR, NETWORKING, STORAGE |
+| 🧠 AI Infrastructure | [AI-INFRASTRUCTURE.en.md](AI-INFRASTRUCTURE.en.md) | GPU, AI networking, storage, cluster, cooling, training/inference | GPU, NETWORKING, STORAGE, DATACENTERS, CLOUD |
 | 📋 Legacy index | [HARDWARE.en.md](HARDWARE.en.md) | → SERVER-HW, GPU, SERVER-CONFIG, PROVISIONING | SERVER-HW, GPU, SERVER-CONFIG, PROVISIONING |
 | 📋 Legacy infra | [INFRASTRUCTURE.en.md](INFRASTRUCTURE.en.md) | → HYPERVISORS, DATACENTERS, STORAGE, HARDWARE | HYPERVISORS, DATACENTERS, STORAGE, HARDWARE |
 | 📋 Review workflow | [REVIEW.en.md](REVIEW.en.md) | Review and content control process | — |
@@ -136,6 +150,11 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
 | `DATACENTERS.md` / `DATACENTERS.en.md` | [`MONITORING.md`](MONITORING.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
 | `MONITORING.md` / `MONITORING.en.md` | [`sources/monitoring/sources.md`](sources/monitoring/sources.md) |
 | `CICD.md` / `CICD.en.md` | [`sources/cicd/sources.md`](sources/cicd/sources.md) |
+| `DR.md` / `DR.en.md` | [`CLOUD.md`](CLOUD.md), [`DATACENTERS.md`](DATACENTERS.md), [`MONITORING.md`](MONITORING.md), [`CICD.md`](CICD.md), [`STORAGE.md`](STORAGE.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
+| `MESSAGING.md` / `MESSAGING.en.md` | [`DATACENTERS.md`](DATACENTERS.md), [`CLOUD.md`](CLOUD.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
+| `DC-MIGRATION.md` / `DC-MIGRATION.en.md` | [`DATACENTERS.md`](DATACENTERS.md), [`CLOUD.md`](CLOUD.md), [`DR.md`](DR.md), [`NETWORKING.md`](NETWORKING.md), [`STORAGE.md`](STORAGE.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
+| `OS.md` / `OS.en.md` | [`AI-INFRASTRUCTURE.md`](AI-INFRASTRUCTURE.md), [`KUBERNETES.md`](KUBERNETES.md), [`HYPERVISORS.md`](HYPERVISORS.md), [`DATACENTERS.md`](DATACENTERS.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
+| `AI-INFRASTRUCTURE.md` / `AI-INFRASTRUCTURE.en.md` | [`GPU.md`](GPU.md), [`NETWORKING.md`](NETWORKING.md), [`STORAGE.md`](STORAGE.md), [`DATACENTERS.md`](DATACENTERS.md), [`CLOUD.md`](CLOUD.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
 | `PROVISIONING.md` / `PROVISIONING.en.md` | [`CICD.md`](CICD.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
 | `STORAGE.md` / `STORAGE.en.md` | [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
 | `GPU.md` / `GPU.en.md` | [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
@@ -146,6 +165,8 @@ Bilingual: Czech (`.md`) and English (`.en.md`).
 | `DATABASES.md` / `DATABASES.en.md` | [`POSTGRESQL.md`](POSTGRESQL.md), [`MYSQL.md`](MYSQL.md), [`ORACLE.md`](ORACLE.md), [`MONGODB.md`](MONGODB.md), [`REDIS.md`](REDIS.md), [`CASSANDRA.md`](CASSANDRA.md), [`VEKTOROVE-DB.md`](VEKTOROVE-DB.md), [`DATABAZOVE-ENGINY.md`](DATABAZOVE-ENGINY.md), [`sources/databases/sources.md`](sources/databases/sources.md) |
 | `HARDWARE.md` / `HARDWARE.en.md` | [`SERVER-HW.md`](SERVER-HW.md), [`GPU.md`](GPU.md), [`SERVER-CONFIG.md`](SERVER-CONFIG.md), [`PROVISIONING.md`](PROVISIONING.md) |
 | `INFRASTRUCTURE.md` / `INFRASTRUCTURE.en.md` | [`HYPERVISORS.md`](HYPERVISORS.md), [`DATACENTERS.md`](DATACENTERS.md), [`STORAGE.md`](STORAGE.md), [`HARDWARE.md`](HARDWARE.md) |
+| `KUBERNETES.md` / `KUBERNETES.en.md` | [`CICD.md`](CICD.md), [`CLOUD.md`](CLOUD.md), [`NETWORKING.md`](NETWORKING.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |
+| `BIG-DATA.md` / `BIG-DATA.en.md` | [`DATABASES.md`](DATABASES.md), [`CLOUD.md`](CLOUD.md), [`MESSAGING.md`](MESSAGING.md), [`KUBERNETES.md`](KUBERNETES.md), [`sources/infrastructure/sources.md`](sources/infrastructure/sources.md) |

 ---

@@ -187,4 +208,4 @@ Raw referenční data (dokumentace, knihy, standardy) podle oblastí:

 ---

-*Rozcestník je automaticky udržován agentem `kb-index`. Poslední aktualizace: 2026-06-11.*
+*Rozcestník je automaticky udržován agentem `kb-index`. Poslední aktualizace: 2026-06-18.*
--- a/REDIS.en.md
+++ b/REDIS.en.md
@@ -114,6 +114,6 @@ Redis underwent a major license change in 2024:

 ## Sources

-References, books, and standards: [sources/databases/sources.md](sources/databases/sources.md)
+References, books, and standards: [sources/databases/sources.en.md](sources/databases/sources.en.md)

 *Last revision: 2026-06-03*
--- a/SERVER-CONFIG.en.md
+++ b/SERVER-CONFIG.en.md
@@ -752,6 +752,6 @@ flowchart TD

 ## Sources

-Links, books and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
+Links, books and standards: [sources/infrastructure/sources.en.md](sources/infrastructure/sources.en.md)

 *Last revision: 2026-06-03*
--- a/SERVER-HW.en.md
+++ b/SERVER-HW.en.md
@@ -230,6 +230,10 @@ Conclusion: 8 DIMMs per CPU (1DPC) = highest performance
 | AI training (CPU preprocessing) | 2-4 GB/core | 128-512 GB | 8× 32-64 GB RDIMM, 1DPC |
 | HPC | 1-2 GB/core | 64-128 GB | 8× 16 GB RDIMM, 1DPC, high-speed |
 | In-memory DB (SAP HANA) | 8-32 GB/core | 1-6 TB+ | 16× 128-256 GB LRDIMM/3DS |
+| Big Data — Spark worker | 4-8 GB/core | 128-512 GB | 8-16× 32-64 GB RDIMM, 1DPC, NVMe scratch |
+| Big Data — Flink worker | 8-16 GB/core (incl. managed state) | 128-512 GB | 8-16× 32-64 GB RDIMM, 1DPC, RocksDB on NVMe |
+| Big Data — Trino worker | 4-8 GB/core | 64-256 GB | 8× 16-32 GB RDIMM, 1DPC |
+| Big Data — HDFS DataNode | 1-2 GB/core (metadata cache) | 64-256 GB | 8× 16-32 GB RDIMM, 1DPC, max storage density |

 ## PCIe

@@ -324,7 +328,7 @@ Socket 0 (NUMA node 0)              Socket 1 (NUMA node 1)

 ## Server connectivity

-Detailed chapter on network and storage connectivity: [CONNECTIVITY.md](CONNECTIVITY.md)
+Detailed chapter on network and storage connectivity: [CONNECTIVITY.en.md](CONNECTIVITY.en.md)

 ## Storage controllers

@@ -346,8 +350,51 @@ Detailed chapter on network and storage connectivity: [CONNECTIVITY.md](CONNECTI
 | **Use case** | SDS (Ceph, MinIO), ZFS | VMware VMFS, Windows, legacy |
 | **Battery/Backup** | Not needed | Write-back cache requires BBU |

+## Pricing (2026)
+
+### CPU pricing (2026)
+| CPU | Cores | TDP | 1ku price | $/core |
+|-----|-------|-----|----------|--------|
+| AMD EPYC 9965 (Turin) | 192 | 500 W | ~$11,988 | $62 |
+| AMD EPYC 9655 (Turin) | 96 | 400 W | ~$6,500 | $68 |
+| AMD EPYC 9475F (Turin) | 48 | 360 W | ~$5,000 | $104 |
+| Intel Xeon 6980P (Granite Rapids) | 128 | 500 W | ~$12,460 | $97 |
+| Intel Xeon 6980P (Granite Rapids-AP) | 128 | 500 W | $13,955 | $109 |
+| Intel Xeon 6767P (Granite Rapids) | 64 | 350 W | ~$7,000 | $109 |
+
+Sources: AMD 1ku pricing, Intel RCP, Newegg verified.
+
+### DDR5 RDIMM pricing (2026 — AI-driven price surge)
+| Capacity | Speed | Price 2025 | Price Q2 2026 | Change |
+|----------|---------|-----------|-------------|-------|
+| 32 GB (2R×8) | DDR5-5600 | ~$95 | ~$400–550 | +320–480 % |
+| 64 GB (2R×4) | DDR5-4800 | ~$180 | ~$700–900 | +290–400 % |
+| 96 GB (2R×4) | DDR5-6400 | ~$300 | ~$1,200–1,600 | +300–430 % |
+| 128 GB (2R×4) | DDR5-6400 | ~$450 | ~$1,800–2,500 | +300–455 % |
+| 256 GB (LRDIMM) | DDR5-6400 | ~$900 | ~$4,000–5,000 | +345–455 % |
+
+Trend: DDR5 RDIMM prices have risen to roughly 4–6× their mid-2025 level (about +300–480 %), driven by AI demand. Further increases are expected in H2 2026. Sources: Counterpoint, TrendForce.
+
+### NVMe SSD pricing (enterprise, 2026)
+| Capacity | Type | Price 2024 | Price Q2 2026 | Change |
+|----------|-----|-----------|-------------|-------|
+| 1.92 TB | NVMe U.3 (read-intensive) | ~$200 | ~$500–600 | +150–200 % |
+| 3.84 TB | NVMe U.3 (mixed-use) | ~$400 | ~$1,000–1,200 | +150–200 % |
+| 7.68 TB | NVMe U.3 (mixed-use) | ~$800 | ~$2,000–2,500 | +150–210 % |
+| 15.36 TB | NVMe U.3 (mixed-use) | ~$1,500 | ~$4,000–5,000 | +165–235 % |
+
+Trend: NAND flash prices have risen ~100–200 % since 2025; the enterprise SSDs listed above now cost roughly 2.5–3.3× their 2024 price. Sources: TrendForce, Xinnor.
+
+### Total server cost (example configurations)
+| Configuration | CPU | RAM | Storage | Estimated Price |
+|-------------|-----|-----|------|-----------|
+| DB server (OLTP) | 2× EPYC 9655 (96C) | 1 TB DDR5 | 6× 1.92 TB NVMe | ~$45,000–60,000 |
+| GPU server (AI) | 2× Xeon 6980P | 2 TB DDR5 | 4× 3.84 TB NVMe | ~$80,000–120,000 (w/o GPU) |
+| Hypervisor host | 2× EPYC 9475F (48C) | 512 GB DDR5 | 2× 1.92 TB NVMe + 4× 16 TB HDD | ~$25,000–35,000 |
+| Storage server (Ceph) | 1× EPYC 9655 (96C) | 256 GB DDR5 | 24× 15.36 TB NVMe | ~$110,000–140,000 |
+
 ## Sources

-Links, books and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
+Links, books and standards: [sources/infrastructure/sources.en.md](sources/infrastructure/sources.en.md)

 *Last revision: 2026-06-03*
--- a/SERVER-HW.md
+++ b/SERVER-HW.md
@@ -230,6 +230,10 @@ Závěr: 8 DIMMů na CPU (1DPC) = nejvyšší výkon
 | AI training (CPU preprocessing) | 2-4 GB/core | 128-512 GB | 8× 32-64 GB RDIMM, 1DPC |
 | HPC | 1-2 GB/core | 64-128 GB | 8× 16 GB RDIMM, 1DPC, high-speed |
 | In-memory DB (SAP HANA) | 8-32 GB/core | 1-6 TB+ | 16× 128-256 GB LRDIMM/3DS |
+| Big Data — Spark worker | 4-8 GB/core | 128-512 GB | 8-16× 32-64 GB RDIMM, 1DPC, NVMe scratch |
+| Big Data — Flink worker | 8-16 GB/core (vč. managed state) | 128-512 GB | 8-16× 32-64 GB RDIMM, 1DPC, RocksDB na NVMe |
+| Big Data — Trino worker | 4-8 GB/core | 64-256 GB | 8× 16-32 GB RDIMM, 1DPC |
+| Big Data — HDFS DataNode | 1-2 GB/core (metadata cache) | 64-256 GB | 8× 16-32 GB RDIMM, 1DPC, max storage density |

 ## PCIe

@@ -346,6 +350,49 @@ Detailní kapitola o síťové a storage konektivitě: [CONNECTIVITY.md](CONNECT
 | **Use case** | SDS (Ceph, MinIO), ZFS | VMware VMFS, Windows, legacy |
 | **Battery/Backup** | Není potřeba | Write-back cache vyžaduje BBU |

+## Ceny (2026)
+
+### CPU ceny (2026)
+| CPU | Cores | TDP | 1ku cena | $/core |
+|-----|-------|-----|----------|--------|
+| AMD EPYC 9965 (Turin) | 192 | 500 W | ~$11 988 | $62 |
+| AMD EPYC 9655 (Turin) | 96 | 400 W | ~$6 500 | $68 |
+| AMD EPYC 9475F (Turin) | 48 | 360 W | ~$5 000 | $104 |
+| Intel Xeon 6980P (Granite Rapids) | 128 | 500 W | ~$12 460 | $97 |
+| Intel Xeon 6980P (Granite Rapids-AP) | 128 | 500 W | $13 955 | $109 |
+| Intel Xeon 6767P (Granite Rapids) | 64 | 350 W | ~$7 000 | $109 |
+
+Sources: AMD 1ku pricing, Intel RCP, Newegg verified.
+
+### DDR5 RDIMM ceny (2026 — AI-driven price surge)
+| Kapacita | Rychlost | Cena 2025 | Cena Q2 2026 | Změna |
+|----------|---------|-----------|-------------|-------|
+| 32 GB (2R×8) | DDR5-5600 | ~$95 | ~$400–550 | +320–480 % |
+| 64 GB (2R×4) | DDR5-4800 | ~$180 | ~$700–900 | +290–400 % |
+| 96 GB (2R×4) | DDR5-6400 | ~$300 | ~$1 200–1 600 | +300–430 % |
+| 128 GB (2R×4) | DDR5-6400 | ~$450 | ~$1 800–2 500 | +300–455 % |
+| 256 GB (LRDIMM) | DDR5-6400 | ~$900 | ~$4 000–5 000 | +345–455 % |
+
+Trend: ceny DDR5 RDIMM vzrostly od poloviny roku 2025 zhruba na 4–6násobek (přibližně +300–480 %) kvůli poptávce tažené AI. V H2 2026 se očekává další růst. Zdroj: Counterpoint, TrendForce.
+
+### NVMe SSD ceny (enterprise, 2026)
+| Kapacita | Typ | Cena 2024 | Cena Q2 2026 | Změna |
+|----------|-----|-----------|-------------|-------|
+| 1.92 TB | NVMe U.3 (read-intensive) | ~$200 | ~$500–600 | +150–200 % |
+| 3.84 TB | NVMe U.3 (mixed-use) | ~$400 | ~$1 000–1 200 | +150–200 % |
+| 7.68 TB | NVMe U.3 (mixed-use) | ~$800 | ~$2 000–2 500 | +150–210 % |
+| 15.36 TB | NVMe U.3 (mixed-use) | ~$1 500 | ~$4 000–5 000 | +165–235 % |
+
+Trend: ceny NAND flash vzrostly od roku 2025 o ~100–200 %; enterprise SSD uvedené v tabulce stojí zhruba 2,5–3,3× více než v roce 2024. Zdroj: TrendForce, Xinnor.
+
+### Celková cena serveru (příkladové konfigurace)
+| Konfigurace | CPU | RAM | Disk | Odhad ceny |
+|-------------|-----|-----|------|-----------|
+| DB server (OLTP) | 2× EPYC 9655 (96C) | 1 TB DDR5 | 6× 1.92 TB NVMe | ~$45 000–60 000 |
+| GPU server (AI) | 2× Xeon 6980P | 2 TB DDR5 | 4× 3.84 TB NVMe | ~$80 000–120 000 (bez GPU) |
+| Hypervisor host | 2× EPYC 9475F (48C) | 512 GB DDR5 | 2× 1.92 TB NVMe + 4× 16 TB HDD | ~$25 000–35 000 |
+| Storage server (Ceph) | 1× EPYC 9655 (96C) | 256 GB DDR5 | 24× 15.36 TB NVMe | ~$110 000–140 000 |
+
 ## Zdroje

 Odkazy, knihy a standardy: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
--- a/STORAGE.en.md
+++ b/STORAGE.en.md
@@ -270,9 +270,60 @@ OpenStack offers three main storage services:

 Ceph is the most common storage backend for OpenStack: Cinder (RBD), Swift (RGW), Manila (CephFS), Glance (RBD images).

+## Big Data storage
+
+### HDFS cluster
+
+HDFS is the primary storage for the Hadoop ecosystem (on-prem). Typical configuration:
+
+| Parameter | Value | Note |
+|-----------|-------|------|
+| **Disks per DataNode** | 8–24 × HDD (14–22 TB) + 2× NVMe (metadata, cache) | Balance capacity / performance |
+| **Replication factor** | 3× | Rack-aware |
+| **Network** | 2× 25/100 GbE (data) + 1× 1 GbE (management) | Data + replication traffic |
+| **RAM** | 64–256 GB (OS cache + metadata) | HDFS cache + OS buffer cache |
+| **CPU** | 16–32 cores | HDFS overhead is low |
+| **NameNode HA** | Active + Standby + JN (JournalNode) | Quorum-based HA |
+| **Use case** | Sequential read/write, large files, Spark on YARN | |
+
+**Model cluster — 1 PB usable:**
+
+- 14× DataNode (12× 18 TB HDD, 2× 1.9 TB NVMe)
+- 2× NameNode (HA, 256 GB RAM)
+- 3× JournalNode (small VMs)
+- Replication 3× → raw ≈ 3.0 PB (14 × 12 × 18 TB)
+- Network: 25 GbE for data, 100 GbE for shuffle-heavy Spark
+
+### Object storage as Data Lake (S3/GCS/MinIO)
+
+For new projects (Spark on K8s, Iceberg/Delta, lakehouse), object storage is preferred over HDFS:
+
+| Platform | Advantages | Limits |
+|----------|-----------|--------|
+| **MinIO** (on-prem) | S3 API, erasure coding, NVMe direct, high throughput | Single tenant (per cluster) |
+| **Pure //C** (on-prem) | QLC NVMe, dedupe, S3 + NFS | Higher $/TB |
+| **AWS S3** (cloud) | Unlimited capacity, Iceberg/Delta support | Egress fees |
+| **Azure ADLS** (cloud) | Hierarchical namespace (HNS), POSIX-like ACLs | Vendor lock-in |
+| **GCP GCS** (cloud) | Uniform + fine-grained ACLs, object versioning | Region restrictions |
+
+### Comparison: HDFS vs Object Storage for Big Data
+
+| Criteria | HDFS | Object Storage (S3/MinIO) |
+|----------|------|-------------------------|
+| **Architecture** | Master/worker (NameNode is a SPOF unless HA is configured) | Distributed, no SPOF (erasure coding) |
+| **Consistency** | Strong (single writer per file) | Strong read-after-write (AWS S3 since Dec 2020, MinIO) |
+| **Throughput** | High (rack-aware, locality) | High (network-bound) |
+| **Scaling** | Horizontal (DataNode) | Horizontal (stateless) |
+| **Cost** | Low (HDD) | Medium (S3 API) |
+| **Metadata** | NameNode (1M blocks ~ 1 GB RAM) | Object-level (flat namespace) |
+| **Spark integration** | Native (locality-optimized) | S3A connector (Hadoop-compatible file system) |
+| **2026 trend** | Legacy, declining | Standard for new projects |
+
+For more information about Big Data see [BIG-DATA.en.md](BIG-DATA.en.md).
+
 ## Sources

-Links, books and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
+Links, books and standards: [sources/infrastructure/sources.en.md](sources/infrastructure/sources.en.md)

 ### Recommended reading

--- a/STORAGE.md
+++ b/STORAGE.md
@@ -270,6 +270,57 @@ OpenStack nabízí tři hlavní storage služby:

 Ceph je nejčastější storage backend pro OpenStack: Cinder (RBD), Swift (RGW), Manila (CephFS), Glance (RBD images).

+## Big Data storage
+
+### HDFS cluster
+
+HDFS je primární storage pro Hadoop ekosystém (on-prem). Typická konfigurace:
+
+| Parametr | Hodnota | Poznámka |
+|----------|---------|----------|
+| **Disk per DataNode** | 8–24 × HDD (14–22 TB) + 2× NVMe (metadata, cache) | Balance capacity / performance |
+| **Replication factor** | 3× | Rack-aware |
+| **Network** | 2× 25/100 GbE (data) + 1× 1 GbE (management) | Data + replication traffic |
+| **RAM** | 64–256 GB (OS cache + metadata) | HDFS cache + OS buffer cache |
+| **CPU** | 16–32 cores | HDFS overhead je nízký |
+| **NameNode HA** | Active + Standby + JN (JournalNode) | Quorum-based HA |
+| **Use case** | Sekvenční čtení/zápis, velké soubory, Spark on YARN | |
+
+**Modelový cluster — 1 PB usable:**
+
+- 14× DataNode (12× 18 TB HDD, 2× 1.9 TB NVMe)
+- 2× NameNode (HA, 256 GB RAM)
+- 3× JournalNode (malé VM)
+- Replication 3× → raw ≈ 3.0 PB (14 × 12 × 18 TB)
+- Network: 25 GbE pro data, 100 GbE pro shuffle-heavy Spark
+
+### Object storage jako Data Lake (S3/GCS/MinIO)
+
+Pro nové projekty (Spark on K8s, Iceberg/Delta, lakehouse) se preferuje object storage před HDFS:
+
+| Platforma | Výhody | Limity |
+|-----------|--------|--------|
+| **MinIO** (on-prem) | S3 API, erasure coding, NVMe direct, high throughput | Single tenant (per cluster) |
+| **Pure //C** (on-prem) | QLC NVMe, dedupe, S3 + NFS | Vyšší cena/TB |
+| **AWS S3** (cloud) | Neomezená kapacita, Iceberg/Delta support | Egress fees |
+| **Azure ADLS** (cloud) | Hierarchical namespace (HNS), POSIX-like ACLs | Vendor lock-in |
+| **GCP GCS** (cloud) | Uniform + fine-grained ACLs, object versioning | Region restrictions |
+
+### Srovnání: HDFS vs Object Storage pro Big Data
+
+| Kritérium | HDFS | Object Storage (S3/MinIO) |
+|-----------|------|-------------------------|
+| **Architektura** | Master/worker (NameNode je SPOF, pokud není nasazeno HA) | Distributed, no SPOF (erasure coding) |
+| **Konzistence** | Strong (jediný writer per file) | Strong read-after-write (AWS S3 od prosince 2020, MinIO) |
+| **Propustnost** | Vysoká (rack-aware, locality) | Vysoká (network-bound) |
+| **Škálování** | Horizontální (DataNode) | Horizontální (stateless) |
+| **Cena** | Nízká (HDD) | Střední (S3 API) |
+| **Metadata** | NameNode (1 mil. bloků ~ 1 GB RAM) | Object-level (flat namespace) |
+| **Spark integration** | Native (locality optimalizace) | S3A connector (Hadoop-compatible file system) |
+| **2026 trend** | Legacy, klesající | Standard pro nové projekty |
+
+Podrobnější informace o Big Data viz [BIG-DATA.md](BIG-DATA.md).
+
 ## Zdroje

 Odkazy, knihy a standardy: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
--- a/VECTOR-DBS.en.md
+++ b/VECTOR-DBS.en.md
@@ -94,7 +94,7 @@ Variants:

 ## Sources

-References, books, and standards: [sources/databases/sources.md](sources/databases/sources.md)
+References, books, and standards: [sources/databases/sources.en.md](sources/databases/sources.en.md)

 ### Recommended reading

--- a/sources/infrastructure/sources.en.md
+++ b/sources/infrastructure/sources.en.md
@@ -1,10 +1,10 @@
 # Infrastructure — Sources

 Split into separate files:
- [HYPERVISORS.md](../../HYPERVISORS.md) — hypervisors and virtualization
- [DATACENTERS.md](../../DATACENTERS.md) — data centers
- [STORAGE.md](../../STORAGE.md) — storage
- [HARDWARE.md](../../HARDWARE.md) — hardware and servers
+- [HYPERVISORS.en.md](../../HYPERVISORS.en.md) — hypervisors and virtualization
+- [DATACENTERS.en.md](../../DATACENTERS.en.md) — data centers
+- [STORAGE.en.md](../../STORAGE.en.md) — storage
+- [HARDWARE.en.md](../../HARDWARE.en.md) — hardware and servers

 ## Official documentation

@@ -112,7 +112,65 @@ Split into separate files:
 | Complete guide to modern vSphere alternatives — Spectro Cloud | https://www.spectrocloud.com/blog/vsphere-alternatives | `[done]` |
 | Broadcom VMware Acquisition: What's Next — Sayers | https://www.sayers.com/blog/after-the-deal-whats-next-for-vmware-customers | `[done]` |
 | Stanford University migration from VMware to Proxmox | https://itcommunity.stanford.edu/news/enterprise-technology-completes-successful-virtual-infrastructure-migration-vmware-proxmox | `[done]` |
-
+| | **Sangfor** | |
+| Sangfor HCI — product page | https://www.sangfor.com/cloud-and-infrastructure/products/hci-hyper-converged-infrastructure | `[done]` |
+| Sangfor aSV — hypervisor | https://www.sangfor.com/cloud-and-infrastructure/products/asv-hypervisor-server-virtualization | `[done]` |
+| Sangfor vs VMware — feature comparison | https://www.sangfor.com/blog/cloud-and-infrastructure/sangfor-hci-vs-vmware-feature-comparison | `[done]` |
+| | **AI infrastructure** | |
+| NVIDIA DGX — documentation | https://www.nvidia.com/en-us/data-center/dgx-platform/ | `[done]` |
+| InfiniBand — Mellanox/NVIDIA | https://www.nvidia.com/en-us/networking/products/infiniband/ | `[done]` |
+| Lustre parallel filesystem | https://www.lustre.org/ | `[done]` |
+| WekaFS — AI storage | https://www.weka.io/ | `[done]` |
+| vLLM — inference server | https://github.com/vllm-project/vllm | `[done]` |
+| Megatron-LM — distributed training | https://github.com/NVIDIA/Megatron-LM | `[done]` |
+| | **Kubernetes / Cluster API** | |
+| Cluster API (CAPI) — official documentation (The CAPI Book) | https://cluster-api.sigs.k8s.io/ | `[done]` |
+| Cluster API — GitHub (kubernetes-sigs/cluster-api) | https://github.com/kubernetes-sigs/cluster-api | `[done]` |
+| Cluster API — provider list | https://cluster-api.sigs.k8s.io/reference/providers.html | `[done]` |
+| Kubernetes — official documentation | https://kubernetes.io/docs/ | `[done]` |
+| K3s — lightweight Kubernetes | https://k3s.io/ | `[done]` |
+| RKE2 — Rancher Kubernetes Engine 2 | https://docs.rke2.io/ | `[done]` |
+| Talos — API-driven Kubernetes OS | https://www.talos.dev/ | `[done]` |
+| Kamaji — hosted control plane provider | https://kamaji.clastix.io/ | `[done]` |
+| Metal3 — bare metal provider for CAPI | https://metal3.io/ | `[done]` |
+| Cluster API — ClusterClass and topologies | https://kubernetes.io/blog/2021/10/08/capi-clusterclass-and-managed-topologies/ | `[done]` |
+| | **Big Data** | |
+| Apache Spark — official documentation | https://spark.apache.org/docs/latest/ | `[done]` |
+| Apache Flink — official documentation | https://flink.apache.org/ | `[done]` |
+| Trino — distributed SQL engine | https://trino.io/docs/current/ | `[done]` |
+| Apache Iceberg — table format | https://iceberg.apache.org/ | `[done]` |
+| Delta Lake — documentation | https://docs.delta.io/ | `[done]` |
+| Apache Hudi | https://hudi.apache.org/ | `[done]` |
+| Apache Paimon | https://paimon.apache.org/ | `[done]` |
+| Apache Hadoop — documentation | https://hadoop.apache.org/docs/stable/ | `[done]` |
+| Apache Airflow — documentation | https://airflow.apache.org/docs/ | `[done]` |
+| Dagster — documentation | https://docs.dagster.io/ | `[done]` |
+| Prefect — documentation | https://docs.prefect.io/ | `[done]` |
+| HDFS architecture (Apache) | https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html | `[done]` |
+| | **Operating Systems** | |
+| Ubuntu lifecycle — Ubuntu Pro + ESM | https://ubuntu.com/about/release-cycle | `[done]` |
+| RHEL lifecycle — Red Hat Enterprise Linux | https://access.redhat.com/support/policy/updates/errata | `[done]` |
+| Rocky Linux lifecycle | https://rockylinux.org/download/ | `[done]` |
+| AlmaLinux lifecycle | https://almalinux.org/ | `[done]` |
+| Debian releases / LTS | https://wiki.debian.org/LTS | `[done]` |
+| SLES lifecycle — SUSE | https://www.suse.com/lifecycle/ | `[done]` |
+| Alpine Linux releases | https://alpinelinux.org/releases/ | `[done]` |
+| Fedora lifecycle | https://docs.fedoraproject.org/en-US/releases/lifecycle/ | `[done]` |
+| SELinux — Red Hat docs | https://www.redhat.com/en/topics/linux/what-is-selinux | `[done]` |
+| AppArmor — Ubuntu wiki | https://wiki.ubuntu.com/AppArmor | `[done]` |
+| | **Windows** | |
+| Windows Server 2022 lifecycle | https://learn.microsoft.com/en-us/lifecycle/products/windows-server-2022/ | `[done]` |
+| Windows Server 2025 lifecycle | https://learn.microsoft.com/en-us/lifecycle/products/windows-server-2025/ | `[done]` |
+| Windows 11 lifecycle | https://learn.microsoft.com/en-us/lifecycle/products/windows-11-enterprise/ | `[done]` |
+| Windows 10 EOL | https://learn.microsoft.com/en-us/lifecycle/products/windows-10-enterprise/ | `[done]` |
+| Windows Server licensing (per core) | https://learn.microsoft.com/en-us/windows-server/get-started/editions-and-support | `[done]` |
+| | **GPU pricing** | |
+| NVIDIA AI GPU pricing guide (2026) | https://intuitionlabs.ai/articles/nvidia-ai-gpu-pricing-guide | `[done]` |
+| GPU cloud pricing comparison (2026) | https://www.spheron.network/blog/gpu-cloud-pricing-comparison-2026/ | `[done]` |
+| GPU pricing trends 2026 — CompuX | https://compux.net/docs/guides/gpu-pricing-trends-2026 | `[done]` |
+| AMD MI300X pricing (2026) | https://www.thundercompute.com/blog/amd-mi300x-pricing | `[done]` |
+| GPU price/performance frontier — Silicon Analysts | https://siliconanalysts.com/tools/frontier | `[done]` |
+
 ## Hardware manufacturers

 | Manufacturer | Server series | Management |
--- a/sources/infrastructure/sources.md
+++ b/sources/infrastructure/sources.md
@@ -111,8 +111,81 @@ Rozděleno do samostatných souborů:
 | VMware Migration in 2026: Proxmox, KVM, XCP-ng & Veeam — StarWind | https://starwindsoftware.com/blog/vmware-migration-to-proxmox-kvm-xcp-ng-2026 | `[done]` |
 | Complete guide to modern vSphere alternatives — Spectro Cloud | https://www.spectrocloud.com/blog/vsphere-alternatives | `[done]` |
 | Broadcom VMware Acquisition: What's Next — Sayers | https://www.sayers.com/blog/after-the-deal-whats-next-for-vmware-customers | `[done]` |
-| Stanford University migration from VMware to Proxmox | https://itcommunity.stanford.edu/news/enterprise-technology-completes-successful-virtual-infrastructure-migration-vmware-proxmox | `[done]` |
-
+| Stanford University migration from VMware to Proxmox | https://itcommunity.stanford.edu/news/enterprise-technology-completes-successful-virtual-infrastructure-migration-vmware-proxmox | `[done]` |
+| | **Messaging / streaming** | |
+| Apache Kafka docs | https://kafka.apache.org/documentation/ | `[done]` |
+| RabbitMQ docs | https://www.rabbitmq.com/documentation.html | `[done]` |
+| Apache Pulsar docs | https://pulsar.apache.org/docs/ | `[done]` |
+| NATS docs | https://docs.nats.io/ | `[done]` |
+| Designing Event-Driven Systems (Confluent) | https://www.confluent.io/designing-event-driven-systems/ | `[done]` |
+| Kafka: The Definitive Guide (2nd ed.) — Confluent | https://www.confluent.io/resources/kafka-the-definitive-guide/ | `[done]` |
+| Enterprise Integration Patterns — Hohpe & Woolf | https://www.enterpriseintegrationpatterns.com/ | `[done]` |
+| | **DC migrace** | |
+| AWS Cloud Migration — 6 Strategies for Migrating to the Cloud | https://aws.amazon.com/blogs/enterprise-strategy/6-strategies-for-migrating-applications-to-the-cloud/ | `[done]` |
+| Azure Cloud Migration — Microsoft Cloud Adoption Framework | https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/ | `[done]` |
+| Gartner 5 Rs of Cloud Migration | https://www.gartner.com/en/documents/3984835 | `[done]` |
+| VMware Site Recovery Manager — documentation | https://docs.vmware.com/en/Site-Recovery-Manager/ | `[done]` |
+| Zerto — Disaster Recovery & Migration | https://www.zerto.com/resources/ | `[done]` |
+| The Phoenix Project — IT Ops & Migration patterns | https://itrevolution.com/product/the-phoenix-project/ | `[done]` |
+| | **Sangfor** | |
+| Sangfor HCI — product page | https://www.sangfor.com/cloud-and-infrastructure/products/hci-hyper-converged-infrastructure | `[done]` |
+| Sangfor aSV — hypervisor | https://www.sangfor.com/cloud-and-infrastructure/products/asv-hypervisor-server-virtualization | `[done]` |
+| Sangfor vs VMware — feature comparison | https://www.sangfor.com/blog/cloud-and-infrastructure/sangfor-hci-vs-vmware-feature-comparison | `[done]` |
+| | **AI infrastruktura** | |
+| NVIDIA DGX — documentation | https://www.nvidia.com/en-us/data-center/dgx-platform/ | `[done]` |
+| InfiniBand — Mellanox/NVIDIA | https://www.nvidia.com/en-us/networking/products/infiniband/ | `[done]` |
+| Lustre parallel filesystem | https://www.lustre.org/ | `[done]` |
+| WekaFS — AI storage | https://www.weka.io/ | `[done]` |
+| vLLM — inference server | https://github.com/vllm-project/vllm | `[done]` |
+| Megatron-LM — distributed training | https://github.com/NVIDIA/Megatron-LM | `[done]` |
+| | **Kubernetes / Cluster API** | |
+| Cluster API (CAPI) — oficiální dokumentace (The CAPI Book) | https://cluster-api.sigs.k8s.io/ | `[done]` |
+| Cluster API — GitHub (kubernetes-sigs/cluster-api) | https://github.com/kubernetes-sigs/cluster-api | `[done]` |
+| Cluster API — seznam providerů | https://cluster-api.sigs.k8s.io/reference/providers.html | `[done]` |
+| Kubernetes — oficiální dokumentace | https://kubernetes.io/docs/ | `[done]` |
+| K3s — lightweight Kubernetes | https://k3s.io/ | `[done]` |
+| RKE2 — Rancher Kubernetes Engine 2 | https://docs.rke2.io/ | `[done]` |
+| Talos — API-driven Kubernetes OS | https://www.talos.dev/ | `[done]` |
+| Kamaji — hosted control plane provider | https://kamaji.clastix.io/ | `[done]` |
+| Metal3 — bare metal provider pro CAPI | https://metal3.io/ | `[done]` |
+| Cluster API — ClusterClass a topologies | https://kubernetes.io/blog/2021/10/08/capi-clusterclass-and-managed-topologies/ | `[done]` |
+| | **Big Data** | |
+| Apache Spark — oficiální dokumentace | https://spark.apache.org/docs/latest/ | `[done]` |
+| Apache Flink — oficiální dokumentace | https://flink.apache.org/ | `[done]` |
+| Trino — distribuovaný SQL engine | https://trino.io/docs/current/ | `[done]` |
+| Apache Iceberg — tabulkový formát | https://iceberg.apache.org/ | `[done]` |
+| Delta Lake — dokumentace | https://docs.delta.io/ | `[done]` |
+| Apache Hudi | https://hudi.apache.org/ | `[done]` |
+| Apache Paimon | https://paimon.apache.org/ | `[done]` |
+| Apache Hadoop — dokumentace | https://hadoop.apache.org/docs/stable/ | `[done]` |
+| Apache Airflow — dokumentace | https://airflow.apache.org/docs/ | `[done]` |
+| Dagster — dokumentace | https://docs.dagster.io/ | `[done]` |
+| Prefect — dokumentace | https://docs.prefect.io/ | `[done]` |
+| HDFS architektura (Apache) | https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html | `[done]` |
+| | **Operační systémy** | |
+| Ubuntu lifecycle — Ubuntu Pro + ESM | https://ubuntu.com/about/release-cycle | `[done]` |
+| RHEL lifecycle — Red Hat Enterprise Linux | https://access.redhat.com/support/policy/updates/errata | `[done]` |
+| Rocky Linux lifecycle | https://rockylinux.org/download/ | `[done]` |
+| AlmaLinux lifecycle | https://almalinux.org/ | `[done]` |
+| Debian releases / LTS | https://wiki.debian.org/LTS | `[done]` |
+| SLES lifecycle — SUSE | https://www.suse.com/lifecycle/ | `[done]` |
+| Alpine Linux releases | https://alpinelinux.org/releases/ | `[done]` |
+| Fedora lifecycle | https://docs.fedoraproject.org/en-US/releases/lifecycle/ | `[done]` |
+| SELinux — Red Hat docs | https://www.redhat.com/en/topics/linux/what-is-selinux | `[done]` |
+| AppArmor — Ubuntu wiki | https://wiki.ubuntu.com/AppArmor | `[done]` |
+| | **Windows** | |
+| Windows Server 2022 lifecycle | https://learn.microsoft.com/en-us/lifecycle/products/windows-server-2022/ | `[done]` |
+| Windows Server 2025 lifecycle | https://learn.microsoft.com/en-us/lifecycle/products/windows-server-2025/ | `[done]` |
+| Windows 11 lifecycle | https://learn.microsoft.com/en-us/lifecycle/products/windows-11-enterprise/ | `[done]` |
+| Windows 10 EOL | https://learn.microsoft.com/en-us/lifecycle/products/windows-10-enterprise/ | `[done]` |
+| Windows Server licensing (per core) | https://learn.microsoft.com/en-us/windows-server/get-started/editions-and-support | `[done]` |
+| | **GPU ceny** | |
+| NVIDIA AI GPU pricing guide (2026) | https://intuitionlabs.ai/articles/nvidia-ai-gpu-pricing-guide | `[done]` |
+| GPU cloud pricing comparison (2026) | https://www.spheron.network/blog/gpu-cloud-pricing-comparison-2026/ | `[done]` |
+| GPU pricing trends 2026 — CompuX | https://compux.net/docs/guides/gpu-pricing-trends-2026 | `[done]` |
+| AMD MI300X pricing (2026) | https://www.thundercompute.com/blog/amd-mi300x-pricing | `[done]` |
+| GPU price/performance frontier — Silicon Analysts | https://siliconanalysts.com/tools/frontier | `[done]` |
+
 ## Výrobci hardware

 | Výrobce | Serverové řady | Management |
Author	SHA1	Message	Date
Stanislav Hubacek	ef3c2f75b1	18.6.2026	2026-06-18 16:25:33 +02:00
Stanislav Hubacek	b53714113c	new files	2026-06-16 15:47:45 +02:00