knowledge-base/AI-INFRASTRUCTURE.en.md
Stanislav Hubacek ef3c2f75b1 18.6.2026
2026-06-18 16:25:33 +02:00

# 🧠 AI/ML Infrastructure
## Component overview
```mermaid
flowchart TD
subgraph Compute
GPU["GPU (H100/B200/Instinct)"]
CPU["CPU (AMD EPYC / Intel Xeon)"]
ASIC["ASIC (TPU, Trainium, Inferentia)"]
end
subgraph Network
IB["InfiniBand NDR/XDR"]
ROCE["RoCEv2"]
NVL["NVLink / NVSwitch"]
end
subgraph Storage
FS["Parallel FS (Lustre, GPFS, Weka)"]
OBJ["Object Store (S3, MinIO)"]
NVME["Local NVMe cache"]
end
subgraph Orchestration
S["Slurm"]
K["Kubernetes + Volcano/Kueue"]
end
subgraph Cooling
DLC["Direct-to-chip liquid"]
IMM["Immersion"]
AIR["Air (high-density)"]
end
Compute --> Network --> Storage
Orchestration --> Compute
Cooling --> Compute
```
---
## GPU compute
### NVIDIA
| GPU | Architecture | FP8 | FP16/BF16 | FP64 | HBM | NVLink | TDP | Rack config |
|-----|-------------|-----|-----------|------|-----|--------|-----|------|
| **H100 SXM** | Hopper | 3,958 TFLOPS | 1,979 TFLOPS | 67 TFLOPS | 80 GB HBM3 | 900 GB/s | 700 W | 8× in DGX H100 |
| **H200 SXM** | Hopper (HBM3e) | 3,958 TFLOPS | 1,979 TFLOPS | 67 TFLOPS | 141 GB HBM3e | 900 GB/s | 700 W | 8× in DGX H200 |
| **B200** | Blackwell | ~9,000 TFLOPS | ~4,500 TFLOPS | ~40 TFLOPS | 192 GB HBM3e | 1,800 GB/s | 1,000 W | 8× in DGX B200 |
| **GB200 Grace Blackwell** | Blackwell | ~18,000 TFLOPS | ~9,000 TFLOPS | — | 192 GB + 480 GB (Grace) | NVLink-C2C | 1,000 W (GPU) + 500 W (CPU) | DGX GB200 (36× GPU) |
| **L40S** | Ada Lovelace | 733 TFLOPS | 367 TFLOPS | — | 48 GB GDDR6 | N/A | 350 W | Inference, enterprise |
| **A100 SXM** | Ampere | N/A (no FP8; 1,248 TOPS INT8) | 624 TFLOPS | 19.5 TFLOPS | 80 GB HBM2e | 600 GB/s | 400 W | DGX A100 |
### AMD
| GPU | Architecture | FP8 | FP16/BF16 | FP64 | HBM | Infinity Fabric | TDP |
|-----|-------------|-----|-----------|------|-----|----------------|-----|
| **MI300X** | CDNA 3 | 2,615 TFLOPS | 1,307 TFLOPS | 81 TFLOPS | 192 GB HBM3 | 896 GB/s | 750 W |
| **MI250** | CDNA 2 | — | 383 TFLOPS | 95.7 TFLOPS | 128 GB HBM2e | 400 GB/s | 500 W |
### Intel
| GPU | Architecture | FP16/BF16 | FP32 | HBM | TDP |
|-----|-------------|-----------|------|-----|-----|
| **Gaudi 3** | Custom | 1,835 TFLOPS | — | 128 GB HBM2e | 600 W |
| **Max 1550** | Xe HPC | 600+ TFLOPS | 200 TFLOPS | 128 GB HBM2e | 600 W |
### Cloud ASIC
| ASIC | Provider | Use case | Performance |
|------|----------|----------|-------|
| **TPU v5p** | Google | Training | ~459 TFLOPS (BF16) per chip |
| **Trainium 2** | AWS | Training | ~1,000 TFLOPS (BF16) per chip |
| **Inferentia 2** | AWS | Inference | ~400 TOPS (INT8) per chip |
| **Maia 100** | Microsoft | Training + inference | Custom, 800 W TDP |
---
## AI networking
### Technology comparison
| Technology | Bandwidth per link | Latency | Topology | Use case |
|-------------|-------------------|---------|-----------|----------|
| **InfiniBand NDR200** | 200 Gb/s | < 1 µs | Fat-tree, Dragonfly+ | Training (NVIDIA) |
| **InfiniBand NDR400** | 400 Gb/s | < 1 µs | Fat-tree, Dragonfly+ | Training (NVIDIA) |
| **InfiniBand XDR** | 800 Gb/s (planned) | < 1 µs | Dragonfly+ | Next-gen training |
| **RoCEv2** (CX-7/8) | 200–400 Gb/s | 1–2 µs | Fat-tree, Spine-leaf | Training (AMD, Intel, open) |
| **NVLink 4.0** | 900 GB/s per GPU | < 0.5 µs | NVSwitch full-mesh | Intra-node GPU comm |
| **NVLink 5.0** | 1,800 GB/s per GPU | < 0.5 µs | NVSwitch full-mesh | Intra-node (Blackwell) |
| **Ethernet (400 GbE)** | 400 Gb/s | 2–5 µs | Spine-leaf | Inference, data pipeline |
### AI fabric principles
- **Rail-optimized topology** — each GPU communicates on dedicated "rails" (same GPU indices across nodes connect to the same switch)
- **Fat-tree (Clos)** — standard for InfiniBand and RoCE, non-blocking bisection bandwidth
- **Dragonfly+** — reduces hop count while maintaining bandwidth (used in largest clusters)
- **GPU Direct RDMA** — direct GPU ↔ GPU communication without CPU involvement, supports InfiniBand and RoCE
- **SHARP (Scalable Hierarchical Aggregation and Reduction Protocol)** — in-network reduction for AllReduce (InfiniBand only)
### Bandwidth sizing
```text
Rule of thumb: InfiniBand bandwidth ≥ 50 % of GPU HBM bandwidth for scalable training
Example: H100 has 3.35 TB/s HBM
→ The rule asks for min. ~1.6 TB/s of network bandwidth per GPU
→ Even 4× NDR400 IB per GPU = 4 × 50 GB/s = 200 GB/s (~6 % of HBM)
→ Reality: DGX H100 has 8× NDR400, i.e. 1× 400 Gb/s (50 GB/s) per GPU = ~1.5 % of HBM → bottleneck
```
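
The ratio behind this sizing rule can be checked with a few lines of Python. This is a minimal sketch: the function name `network_to_hbm_ratio` is hypothetical, and the inputs (400 Gb/s per NDR link, 3.35 TB/s of H100 HBM bandwidth, one or four links per GPU) are assumptions taken from the example above.

```python
def network_to_hbm_ratio(link_gbit_s: float, links_per_gpu: int, hbm_tb_s: float) -> float:
    """Return per-GPU network bandwidth as a fraction of HBM bandwidth."""
    network_gb_s = link_gbit_s / 8 * links_per_gpu  # Gb/s -> GB/s
    hbm_gb_s = hbm_tb_s * 1000                      # TB/s -> GB/s
    return network_gb_s / hbm_gb_s


# Assumed inputs: 400 Gb/s NDR links, H100 HBM at 3.35 TB/s
for links in (1, 4):
    ratio = network_to_hbm_ratio(link_gbit_s=400, links_per_gpu=links, hbm_tb_s=3.35)
    print(f"{links} link(s) per GPU: {ratio:.1%} of HBM bandwidth")
```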
---
## AI storage
### Requirements
| Dataset size | IO pattern | Recommended storage | Bandwidth |
|-------------|-----------|-------------------|-----------|
| < 10 TB | Sequential read (data loading) | Local NVMe | > 10 GB/s per node |
| 10–100 TB | Random read (checkpointing) | Parallel FS (Lustre, Weka) | > 100 GB/s cluster-wide |
| 100 TB–10 PB | Mixed (training + checkpoint) | Parallel FS + object store | > 500 GB/s |
| 10 PB+ | Multi-modal, video, LLM | Tiered (NVMe cache + parallel FS + object) | > 1 TB/s |
### Storage solution comparison
| Solution | Type | Bandwidth per node | Max capacity | Scaling | Use case |
|--------|-----|-------------------|-------------|-----------|----------|
| **Lustre** | Parallel FS (POSIX) | > 100 GB/s (cluster) | 100s PB | OST + MDS | HPC, LLM training (standard) |
| **GPFS / StorageScale** | Parallel FS (POSIX) | > 100 GB/s | 100s PB | NSD servers | HPC, AI (IBM) |
| **WekaFS** | Parallel FS (POSIX + NFS/SMB) | ~80 GB/s per 10 nodes | 10s PB | Container-native | AI/ML, NVIDIA DGX preferred |
| **VAST Data** | Universal storage (NVMe + QLC) | ~100 GB/s per cluster | 10s PB | Scale-out | AI, checkpoint, data lake |
| **Pure Storage//E** | All-flash (NVMe) | ~50 GB/s | ~30 PB | Scale-out | Enterprise AI, database |
| **MinIO / S3** | Object store | ~20 GB/s per gateway | EB | Erasure coding | Dataset repository, checkpoint |
| **NetApp AFF** | NAS + S3 | ~10 GB/s per controller | ~50 PB | HA pair | Enterprise, NFS baseline |
### Checkpointing strategies
| Strategy | RPO | Storage impact | Description |
|-----------|-----|---------------|-------|
| **Full checkpoint** | every N steps | High (stops training) | Full model + optimizer state |
| **Async checkpoint** | every N steps | Medium (non-blocking) | Copy to staging buffer, async write |
| **Distributed checkpoint** (NVIDIA NeMo) | every N steps | Low | Each rank writes its own shard |
| **In-memory checkpoint** (IBM) | on failover | Minimal (DRAM) | Replication to another node's DRAM |
| **Continuous checkpoint** (Microsoft) | every 15 min | Low (delta) | Changed shards only |
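
The storage impact of a blocking (full) checkpoint can be estimated from checkpoint size and write bandwidth. The sketch below is illustrative only: both function names are hypothetical, the 420 GB size assumes BF16 weights plus Adam states for a 70B model, and the 100 GB/s bandwidth and 15-minute interval are assumed values, not measurements.

```python
def checkpoint_write_seconds(checkpoint_gb: float, write_gb_s: float) -> float:
    """Time to write one checkpoint at a sustained write bandwidth."""
    return checkpoint_gb / write_gb_s


def blocking_overhead(write_s: float, interval_s: float) -> float:
    """Fraction of wall-clock time lost if training stops for every checkpoint."""
    return write_s / (interval_s + write_s)


# Assumed: 140 GB weights + 280 GB optimizer states, 100 GB/s storage, 15 min interval
write_s = checkpoint_write_seconds(checkpoint_gb=420, write_gb_s=100)
print(f"write time: {write_s:.1f} s")
print(f"overhead:   {blocking_overhead(write_s, interval_s=900):.2%}")
```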
---
## AI cluster architecture
### Physical topology — DGX H100 example
```
┌──────── DGX H100 (8× GPU) ────────┐
│ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │
│ │GPU 0│ │GPU 1│ │GPU 2│ │GPU 3│ │
│ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ │
│ ┌──┴──┐ ┌──┴──┐ ┌──┴──┐ ┌──┴──┐ │
│ │GPU 4│ │GPU 5│ │GPU 6│ │GPU 7│ │
│ └─────┘ └─────┘ └─────┘ └─────┘ │
│ NVSwitch (NVLink 4.0, 900 GB/s) │
│ InfiniBand CX-7: 8× NDR400 │
└────────────────────────────────────┘
│ 8× IB rails
┌────┴──────────────┐
│ IB NDR400 Switches │ (rail-optimized)
└────────────────────┘
```
### Kubernetes for AI
| Component | Role |
|-----------|------|
| **Volcano** | Batch scheduling, gang scheduling, queue management |
| **Kueue** | Multi-tenant admission, resource quotas, fair sharing |
| **NVIDIA GPU Operator** | Driver, container toolkit, MIG, DCGM, monitoring |
| **HAMi** (ex k8s-vGPU-scheduler) | GPU sharing, MIG partitioning, fractional GPU |
| **Node Feature Discovery** | GPU type detection, NUMA topology |
| **Topology Manager** | NUMA-aware pod placement |
| **DPDK / SR-IOV** | High-performance networking for GPU Direct RDMA |
### Slurm for AI
| Component | Role |
|-----------|------|
| **slurm.conf** | Partition for GPU nodes, GRES (Generic Resource) |
| **gres.conf** | GPU type, GPU count per node |
| **srun --gres=gpu:8** | Allocate 8 GPUs per node |
| **sbatch --nodes=64 --ntasks=512** | 64 nodes, 512 ranks (8 GPU/node) |
| **Pyxis** | NVIDIA SPANK plugin for Slurm that runs jobs in Enroot containers |
---
## AI cluster cooling
### Power density comparison
| Configuration | TDP per node | Racks | kW/rack | Note |
|-------------|-------------|-------|---------|----------|
| Standard server (2U) | 1 kW | 20 | 5–10 | Typical DC |
| GPU server (DGX H100, 6×) | 42 kW | 6 | 45–50 | Air cooling limit |
| GPU server (DGX B200, 6×) | 72 kW | 6 | 90–100 | Liquid cooling required |
| GPU server (GB200 NVL72) | 120 kW | — | ~120 | Liquid cooling mandatory |
| NVIDIA NVL72 rack | 120 kW | 1 | 120 | Fully liquid cooled |
### Cooling technologies
| Method | Max kW/rack | CAPEX | OPEX | Complexity |
|--------|-------------|-------|------|-----------|
| **Air cooling (CRAC/CRAH)** | < 15 | Low | Medium | Low |
| **Air cooling (in-row)** | 15–30 | Medium | Medium | Low |
| **Rear-door heat exchanger** | 30–50 | Medium | Low | Medium |
| **Direct-to-chip liquid (cold plate)** | 50–150 | High | Low | High |
| **Immersion (single-phase)** | 100–200 | High | Low | High |
| **Immersion (two-phase)** | 200+ | Very high | Low | Very high |
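
The table can be turned into a small lookup that lists the cooling methods whose range covers a given rack density. This is a sketch under assumptions: the ranges are read as 15–30, 30–50, 50–150 and 100–200 kW/rack, the lower bound is treated as inclusive and the upper bound as exclusive, and `cooling_candidates` is a hypothetical helper name. Ranges overlap, so more than one method may be returned.

```python
# (method, min kW/rack inclusive, max kW/rack exclusive), as read from the table
COOLING_RANGES_KW = [
    ("Air cooling (CRAC/CRAH)", 0, 15),
    ("Air cooling (in-row)", 15, 30),
    ("Rear-door heat exchanger", 30, 50),
    ("Direct-to-chip liquid (cold plate)", 50, 150),
    ("Immersion (single-phase)", 100, 200),
    ("Immersion (two-phase)", 200, float("inf")),
]


def cooling_candidates(kw_per_rack: float) -> list[str]:
    """Return every cooling method whose range covers the rack power density."""
    return [name for name, low, high in COOLING_RANGES_KW if low <= kw_per_rack < high]


for density in (10, 45, 120):
    print(density, "kW/rack ->", cooling_candidates(density))
```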
---
## Inference infrastructure
### Inference server comparison
| Tool | Frameworks | Optimization | Use case |
|---------|-----------|-------------|----------|
| **vLLM** | Megatron, HF, AWQ, GPTQ | PagedAttention, KV cache, continuous batching | LLM inference (open source) |
| **TensorRT-LLM** | TensorRT | INT4/INT8/FP8, inflight batching, attention optimizations | Production (NVIDIA) |
| **Triton Inference Server** | All (TensorRT, vLLM, PyTorch) | Model ensemble, model caching, concurrent execution | Enterprise, multi-model |
| **SageMaker** | Managed | Auto-scaling, model parallelism | AWS managed |
| **OpenAI API / TGI** | HF Transformers | Continuous batching, flash attention | Hosting |
### Inference optimization
| Technique | Latency improvement | Throughput improvement | Memory reduction |
|----------|-----------------|---------------------|------------------|
| **FP8/INT8 quantization** | — | 2× | 2× |
| **INT4 quantization** | — | 4× | 4× |
| **Flash Attention 2/3** | 2–4× | — | 50 % (KV cache) |
| **PagedAttention** | — | 2–5× | 95 % (KV cache fragmentation) |
| **Continuous batching** | — | 10–20× | — |
| **Speculative decoding** | 2–3× | — | — |
| **Multi-LoRA / S-LoRA** | — | 8–16× | — |
---
## Distributed training techniques
| Technique | Description | Frameworks |
|----------|-------|------------|
| **Data Parallelism (DDP/FSDP)** | Each GPU processes a different batch; DDP keeps a full model copy per GPU, FSDP shards it | PyTorch DDP, FSDP |
| **Tensor Parallelism (TP)** | Individual layers (weight matrices) split across GPUs, usually intra-node | Megatron-LM, DeepSpeed |
| **Pipeline Parallelism (PP)** | Layers split across nodes | Megatron-LM, DeepSpeed |
| **Sequence Parallelism (SP)** | Sequence split across GPUs | Megatron-LM |
| **Expert Parallelism (EP)** | Different expert subnets on different GPUs | Mixture-of-Experts (MoE) |
| **3D Parallelism** | TP + PP + DP combination | Megatron-LM, NeMo |
| **ZeRO (1/2/3)** | Optimizer/gradient/parameter sharding | DeepSpeed |
| **NCCL / RCCL** | GPU collective communication library | NVIDIA/AMD |
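
In 3D parallelism the three degrees multiply to the total GPU count, which makes the data-parallel degree a derived value. The sketch below only illustrates that arithmetic; `data_parallel_degree` is a hypothetical helper, and the 512-GPU, TP=8, PP=4 layout is an assumed example, not a recommendation.

```python
def data_parallel_degree(world_size: int, tp: int, pp: int) -> int:
    """Derive the DP degree from world_size = TP x PP x DP."""
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError(f"world size {world_size} is not divisible by TP x PP = {model_parallel}")
    return world_size // model_parallel


# Assumed layout: 64 nodes x 8 GPUs, TP inside a node, PP across 4 nodes
print("DP degree:", data_parallel_degree(world_size=512, tp=8, pp=4))
```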
---
## Operating systems for AI
### Distribution comparison
| OS | GPU driver | CUDA | Container toolkit | IB/RoCE | Lustre client | Production support |
|----|-----------|------|-------------------|---------|--------------|-------------------|
| **Ubuntu 22.04 LTS** | NVIDIA 525+ | 12.x | nvidia-container-toolkit | MLNX_OFED, rdma-core | Yes (lustre-client) | NVIDIA DGX standard |
| **Ubuntu 24.04 LTS** | NVIDIA 550+ | 12.5+ | nvidia-container-toolkit | MLNX_OFED, rdma-core | Yes | Latest GPU support |
| **RHEL 9 / Rocky 9** | NVIDIA 525+ | 12.x | nvidia-container-toolkit | MLNX_OFED | Yes (EL repo) | Red Hat, enterprise |
| **DGX OS** (Ubuntu-based) | NVIDIA custom | 12.x | Pre-installed | Pre-configured | Yes | Supported on NVIDIA DGX only |
| **SLES 15 SP5** | NVIDIA 525+ | 12.x | nvidia-container-toolkit | MLNX_OFED | Yes | HPC, some Lustre clusters |
| **Debian 12** | NVIDIA 525+ | 12.x | nvidia-container-toolkit | rdma-core | Yes (backports) | Community, research |
| **Flatcar / Bottlerocket** | Container-host | — | nvidia-container-toolkit | Limited | No | K8s-only, minimal footprint |
### Limitations and constraints
#### GPU drivers and CUDA
| Constraint | Detail |
|----------|--------|
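| **Driver-CUDA compatibility** | The installed NVIDIA driver must be at least the minimum version the CUDA toolkit requires (driver ≥ CUDA req); newer drivers remain compatible with older toolkits. E.g., the CUDA 12.4 toolkit ships with driver 550, and CUDA 12.x minor-version compatibility needs driver ≥ 525 |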
| **Kernel version** | NVIDIA driver not compatible with all kernels. New kernel (6.8+) may require DKMS build or delayed support |
| **Secure Boot** | NVIDIA driver requires signed module (MOK, shim) or disabled Secure Boot — common enterprise issue |
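| **Open vs Proprietary driver** | NVIDIA `nvidia-open` (since R515) is the open-source kernel module. GPU support: Turing and newer, including datacenter GPUs (H100+) → OK; pre-Turing GPUs → proprietary required |
| **nvidia-persistenced** | Keeps the GPUs initialized while no client is attached; without it the driver tears down GPU state when idle and the next job pays the re-initialization delay. Legacy alternative: `nvidia-smi -pm 1` |
| **GPU reset** | After a crashed training job the GPU may hang. Try `nvidia-smi --gpu-reset`, then a node reboot, and sometimes a power cycle |
| **Multi-instance GPU (MIG)** | Requires a MIG-capable driver and MIG mode enabled on the GPU, which needs a GPU reset; the mode cannot be toggled while workloads are running. Supported on A100, H100, B200 and other MIG-capable datacenter GPUs (e.g. A30, H200) |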
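
A driver-versus-minimum check like the one in the first row can be automated before jobs are scheduled on a node. The following is a minimal sketch: `parse_version` and `driver_satisfies` are hypothetical helpers, and the threshold `550` is simply the example figure used in this section; the real minimum for a given toolkit should be taken from NVIDIA's CUDA release notes.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version such as '550.54.14' into a comparable tuple."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_satisfies(installed: str, required: str) -> bool:
    """True if the installed driver is at least the required minimum."""
    return parse_version(installed) >= parse_version(required)


# Assumed example values; the required minimum must come from the release notes
for installed in ("535.129.03", "550.54.14"):
    print(installed, "ok" if driver_satisfies(installed, "550") else "too old")
```

On a live node the installed version can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed in as the `installed` string.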
#### Network (InfiniBand / RoCE)
| Constraint | Detail |
|----------|--------|
| **MLNX_OFED vs rdma-core** | MLNX_OFED (NVIDIA) — full support, but own kernel modules, kernel version compatibility needed. `rdma-core` (open) — limited support, no custom modules |
| **Kernel compatibility** | MLNX_OFED supports only specific kernel versions (major.minor). Kernel upgrade → MLNX_OFED rebuild required |
| **NCCL** | NCCL version must be compatible with CUDA and IB firmware. `nccl-tests` for validation |
| **SHARP** | In-network reduction requires specific MLNX_OFED + IB switch firmware combination |
| **GPU Direct RDMA** | Requires `nvidia-peermem` module + MLNX_OFED. Does not work with all GPU and IB card combinations |
| **RoCE PFC/ECN** | RoCE requires lossless fabric (PFC, ECN, DCQCN). Switch and host configuration — complex tuning |
#### Storage
| Constraint | Detail |
|----------|--------|
| **Lustre client** | Client version must be compatible with the server; a server upgrade usually means upgrading all clients. Client packages target specific kernels on RHEL, SLES, and Ubuntu/Debian derivatives |
| **POSIX locking** | NFS and Lustre have different POSIX locking behavior. Distributed training relies on flock → problematic with mixed FS |
| **Filesystem cache** | Page cache can mask IO bottlenecks. Training jobs often require `O_DIRECT` or sync IO |
| **Local NVMe vs parallel FS** | Dataset staging on local NVMe eliminates network dependency but requires space and pre-fetch pipeline |
#### Container runtime
| Constraint | Detail |
|----------|--------|
| **Docker + GPU** | `nvidia-container-toolkit` (formerly nvidia-docker2). Requires runtime installation and config in `/etc/docker/daemon.json` |
| **Podman + GPU** | Requires `nvidia-container-toolkit` + podman hook. Less tested than Docker |
| **containerd + GPU** | Standard for K8s. Requires `cdi` (Container Device Interface) or `nvidia-container-runtime` |
| **Enroot + Pyxis** | NVIDIA container stack for Slurm (Enroot = daemonless container runtime, Pyxis = Slurm plugin) |
| **User namespace mapping** | Container GPU access requires device cgroup; rootless may fail (exception for /dev/dri and /dev/nvidia*) |
#### Kernel parameters
```text
# AI workload recommended sysctl (e.g. a file under /etc/sysctl.d/)
net.core.rmem_max = 134217728 # 128 MiB socket buffers, sufficient for NCCL
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
net.core.netdev_budget = 600 # for high packet rate
vm.max_map_count = 1048576 # PyTorch DataLoader workers
kernel.numa_balancing = 0 # disable NUMA balancing (breaks locality)
kernel.sched_min_granularity_ns = 10000000 # sysctl on older kernels only; newer kernels expose it via debugfs
# Kernel boot parameters (kernel command line, not sysctl)
# Disable security mitigations for perf (dedicated AI clusters only)
mitigations=off
transparent_hugepage=never # or madvise; THP may cause latency spikes
intel_idle.max_cstate=1 # reduce C-state transition latency
```
#### Firmware and HW
| Constraint | Detail |
|----------|--------|
| **GPU firmware (VBIOS)** | NVIDIA datacenter GPUs (H100, B200) have VBIOS updates via NVFlash. Without update → missing partitioning support or newer CUDA features |
| **InfiniBand firmware** | IB switch and HCA firmware must be compatible. Mix old switch + new HCA → degraded perf |
| **NVSwitch firmware** | DGX systems have NVSwitch firmware updatable only via NVIDIA DGX tools |
| **Power capping (nvidia-smi)** | `nvidia-smi -pl <power>` — limit TDP for power budget management. Test impact on training throughput |
| **GPU clock locking** | `nvidia-smi -lgc <min,max>` locks the GPU clock range for stable benchmarks; `nvidia-smi -ac <mem,graphics>` sets application clocks. Apply after `nvidia-persistenced` |
| **PCIe Gen** | GPU in PCIe Gen4 slot (instead of Gen5) → bottleneck for CPU↔GPU data transfer. Important for FSDP sharding |
### Recommended OS per use case
| Use case | OS | Rationale |
|----------|-----|-------|
| **DGX cluster (production)** | DGX OS / Ubuntu 22.04 LTS | NVIDIA standard, best driver support |
| **Enterprise K8s (OpenShift)** | RHEL 9 / RHCOS | Red Hat support, GPU Operator compatible |
| **Vanilla K8s (on-prem)** | Ubuntu 22.04 LTS + Flatcar (workers) | Widest community support, Flatcar for minimal footprint |
| **Slurm cluster (HPC/AI)** | Rocky Linux 9 / Ubuntu 22.04 LTS | EL ecosystem (Lustre, OFED) or Ubuntu (community) |
| **Research / rapid prototyping** | Ubuntu 24.04 LTS | Latest CUDA, PyTorch, driver support |
| **Edge inference** | NVIDIA JetPack / Ubuntu (ARM) | Embedded GPU (Jetson Orin, AGX) |
---
## AI-ready data center — check-list
| Area | Requirement |
|--------|-----------|
| **Power** | 30–120 kW/rack, HVDC (400 V DC), UPS supporting GPU spikes |
| **Cooling** | Liquid cooling ready (direct-to-chip), rear-door for 30+ kW |
| **Network** | InfiniBand (NDR/XDR) or RoCEv2, rail-optimized fat-tree |
| **Storage** | Parallel FS (Lustre/Weka), checkpoint bandwidth > 100 GB/s |
| **GPU density** | Max GPU/rack, minimize NVSwitch hops |
| **Physical** | Floor load 1,500+ kg/m², rack 52U–60U |
| **Security** | Tenant isolation, network segmentation, data encryption |
| **Monitoring** | DCGM, NCCL health checks, thermals, power capping |
---
## Model and throughput limitations
### Model size per GPU
Maximum model size fitting on a single GPU depends on HBM capacity and precision:
| GPU | HBM | FP32 | FP16/BF16 | INT8 | INT4 |
|-----|-----|------|-----------|------|------|
| **H100 80GB** | 80 GB | ~20B | ~40B | ~80B | ~160B |
| **H200 141GB** | 141 GB | ~35B | ~70B | ~140B | ~280B |
| **B200 192GB** | 192 GB | ~48B | ~96B | ~192B | ~384B |
| **MI300X 192GB** | 192 GB | ~48B | ~96B | ~192B | ~384B |
| **A100 80GB** | 80 GB | ~20B | ~40B | ~80B | ~160B |
| **GB200 (192+480)** | 192 GB GPU + 480 GB Grace | — | ~96B + CPU offload | — | — |
*Approximate: 1B params ≈ 2 GB FP16 ≈ 4 GB FP32 ≈ 1 GB INT8 ≈ 0.5 GB INT4. Subtract ~10–15 % HBM for activations, KV cache, optimizer states.*
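
The rule of thumb in the footnote translates directly into code. The sketch below is illustrative: `max_params_billion` is a hypothetical helper, and the `headroom` argument models the 10 to 15 % of HBM the footnote suggests reserving.

```python
# GB of memory per billion parameters, from the rule of thumb above
GB_PER_BILLION_PARAMS = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def max_params_billion(hbm_gb: float, precision: str, headroom: float = 0.0) -> float:
    """Largest model (in billions of parameters) whose weights fit in HBM."""
    usable_gb = hbm_gb * (1.0 - headroom)
    return usable_gb / GB_PER_BILLION_PARAMS[precision]


for precision in GB_PER_BILLION_PARAMS:
    raw = max_params_billion(80, precision)
    safe = max_params_billion(80, precision, headroom=0.15)
    print(f"80 GB, {precision}: ~{raw:.0f}B raw, ~{safe:.0f}B with 15 % headroom")
```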
### Memory breakdown inference
| Component | Llama 3 70B (FP16) | Llama 3 8B (FP16) |
|------------|-------------------|-------------------|
| Model weights | 140 GB | 16 GB |
| KV cache (4K context, batch 1) | ~2 GB | ~0.2 GB |
| KV cache (128K context, batch 1) | ~60 GB | ~6.5 GB |
| Activations (peak) | ~5 GB | ~1 GB |
| **Total 4K ctx** | ~147 GB | ~17 GB |
| **Total 128K ctx** | ~205 GB | ~23 GB |
**Conclusion:** Llama 3 70B FP16 does not fit on a single H100 (80 GB). Options: INT8 (~70 GB of weights, too tight for one H100 once KV cache is added → 2× H100 or 1× H200), INT4 (~35 GB → 1× H100), or tensor parallelism across several GPUs.
### Context length vs memory
| Context | KV cache 70B (FP16) | KV cache 8B (FP16) | Note |
|---------|-------------------|-------------------|------|
| 4K | ~2.2 GB | ~0.25 GB | Typical chat |
| 32K | ~18 GB | ~2 GB | Documents |
| 128K | ~72 GB | ~8 GB | Long-context (Claude, Gemini) |
| 1M | ~560 GB | ~64 GB | Experimental (Gemini 1.5 Pro) |
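KV cache size grows **linearly with context length**, and also linearly with batch size, layer count, and KV head count (grouped-query attention shrinks it). It is attention compute, not the cache, that grows quadratically with context length. Critical for long-context inference.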
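
How the KV cache scales can be sketched from the model architecture. This is a minimal example, not a sizing tool: `kv_cache_bytes` is a hypothetical helper, and the configuration (80 layers, 8 KV heads, head dimension 128, 2 bytes per value) is an assumed Llama-3-70B-like setup with grouped-query attention. The tables in this section are estimates and may assume a different configuration, so the absolute numbers will not necessarily match.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_value: int = 2) -> int:
    """KV cache size: keys and values for every layer, KV head and token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * seq_len * batch


# Assumed Llama-3-70B-like configuration (FP16 cache)
for context in (4096, 32768, 131072):
    size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=context)
    print(f"{context:>7} tokens: {size / 2**30:.2f} GiB")
```

Doubling the context length or the batch size doubles the result, which is the linear behavior the paragraph above describes.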
### Throughput inference
| Model | GPU | Precision | Batch size | Tokens/s | QPS (1K output) |
|-------|-----|-----------|-----------|----------|-----------------|
| Llama 3 8B | H100 | FP16 | 1 | ~800 | ~0.8 |
| Llama 3 8B | H100 | FP16 | 128 | ~4,500 | ~4.5 |
| Llama 3 8B | H100 | INT4 | 128 | ~8,000 | ~8 |
| Llama 3 70B | 4× H100 | FP16 | 1 | ~180 | ~0.18 |
| Llama 3 70B | 4× H100 | INT4 | 64 | ~1,200 | ~1.2 |
| Llama 3 70B | 8× H100 | FP16 (TP=8) | 128 | ~2,500 | ~2.5 |
| DeepSeek-R1 671B | 8× H200 | FP8 (MoE) | 64 | ~500 | ~0.5 |
| GPT-4 class (est.) | — | — | — | ~100–300 | ~0.1–0.3 |
**Notes:**
- QPS (queries per second) = aggregate tokens/s ÷ output tokens per query (1K tokens per query here)
- Larger batch increases throughput but also increases TTFT (time to first token)
- Tensor Parallelism (TP) scales single-model capacity, but communication overhead grows with the TP degree
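
The relation between token throughput and QPS in the first note can be written as a one-line helper. A minimal sketch, assuming the tokens/s figure is the aggregate output rate of the deployment; `queries_per_second` is a hypothetical name.

```python
def queries_per_second(tokens_per_s: float, output_tokens_per_query: int = 1000) -> float:
    """Sustained QPS when every query generates the same number of output tokens."""
    return tokens_per_s / output_tokens_per_query


# Batch-1 rows from the table above
print("Llama 3 8B  on 1x H100:", queries_per_second(800))
print("Llama 3 70B on 4x H100:", queries_per_second(180))
```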
### Training limits
#### Scaling efficiency
| GPU count | Model | Efficiency | Reason |
|-----------|-------|-----------|-------|
| 8 (1 node) | Llama 3 8B | ~95 % | NVLink intra-node |
| 64 (8 nodes) | Llama 3 8B | ~85 % | IB inter-node |
| 512 (64 nodes) | Llama 3 70B | ~75 % | Communication overhead |
| 4,096 (512 nodes) | Llama 3 70B | ~60 % | Pipeline bubble, network |
| 16,384 (2,048 nodes) | Llama 3 405B | ~45 % | Synchronous SGD overhead |
**Note:** Efficiency = (measured cluster throughput) / (GPU count × single-GPU throughput). In this table it falls roughly linearly with the logarithm of the GPU count.
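
Scaling efficiency as defined in the note can be computed from two measurements. The sketch below is illustrative; `scaling_efficiency` is a hypothetical helper and the throughput numbers are made-up inputs chosen to reproduce the ~85 % row, not benchmark results.

```python
def scaling_efficiency(cluster_throughput: float, gpu_count: int,
                       single_gpu_throughput: float) -> float:
    """Measured throughput divided by ideal linear scaling of one GPU."""
    ideal = gpu_count * single_gpu_throughput
    return cluster_throughput / ideal


# Hypothetical measurements: 1,000 samples/s on one GPU, 54,400 samples/s on 64 GPUs
print(f"{scaling_efficiency(54_400, 64, 1_000):.0%}")
```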
#### Memory breakdown training
| Component | Llama 3 70B (BF16) | Llama 3 8B (BF16) |
|------------|-------------------|-------------------|
| Model weights | 140 GB | 16 GB |
| Optimizer states (Adam) | 280 GB | 32 GB |
| Gradients | 140 GB | 16 GB |
| Activations (peak) | ~30 GB | ~4 GB |
| **Total (DDP)** | ~590 GB | ~68 GB |
| **Total (FSDP shard=8)** | ~74 GB | ~8.5 GB |
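**Conclusion:** FSDP (Fully Sharded Data Parallelism) or an equivalent sharding scheme such as ZeRO-3 is required for training models > 10B. With Adam, weights + optimizer states + gradients take about 4× the memory of the inference weights alone (560 GB vs 140 GB for 70B).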
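
The model-state part of the table follows from bytes per parameter. This is a sketch under the table's own assumptions (2 bytes for BF16 weights, 2 for gradients, 4 for Adam states); `model_state_gb` is a hypothetical helper. Mixed-precision setups that keep FP32 moments and master weights need more, and activations come on top and depend on batch size and activation checkpointing.

```python
def model_state_gb(params_billion: float, shards: int = 1, weight_bytes: int = 2,
                   grad_bytes: int = 2, optimizer_bytes: int = 4) -> float:
    """Per-GPU memory for weights, gradients and optimizer states, in GB."""
    bytes_per_param = weight_bytes + grad_bytes + optimizer_bytes
    return params_billion * bytes_per_param / shards


for name, size in (("Llama 3 70B", 70), ("Llama 3 8B", 8)):
    full = model_state_gb(size)
    sharded = model_state_gb(size, shards=8)
    print(f"{name}: {full:.0f} GB unsharded, {sharded:.0f} GB per GPU with 8 shards")
```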
#### Time to train
| Model | GPU count | GPU type | Precision | Time | Cost (on-prem estimate) |
|-------|-----------|---------|-----------|------|---------------------|
| Llama 3 8B | 64 | H100 | BF16 | ~3 days | ~$5,000 |
| Llama 3 70B | 512 | H100 | BF16 | ~14 days | ~$100,000 |
| Llama 3 405B | 16,384 | H100 | BF16 | ~60 days | ~$14 M |
| DeepSeek-R1 671B (MoE) | 2,048 | H800 | BF16 | ~30 days | ~$6 M |
| GPT-4 (est.) | 25,000 | A100/H100 | Mixed | ~90–100 days | ~$100 M |
### Power and thermal limits
| Configuration | TDP limit | Throughput loss | Reason |
|-------------|-----------|------------------|--------|
| H100 SXM | 700 W (default) | 0 % | Nominal |
| H100 SXM | 600 W (-15 %) | ~5–8 % | Power capping |
| H100 SXM | 500 W (-30 %) | ~15–25 % | Significant throttling |
| H100 SXM | 400 W (-43 %) | ~30–50 % | Emergency only |
| DGX H100 (8×) | 5.6 kW (max) | 0 % | Liquid cooling required |
| DGX H100 (8×) | 4.5 kW (air) | ~10–15 % | Rear-door heat exchanger |
A GPU throttles when it exceeds its power limit or temperature threshold (85°C+). Power capping correlates linearly with frequency but non-linearly with throughput.
### API and operational limits
| Limit | Description | Typical value |
|-------|-------|-----------------|
| **Rate limit** | Max requests per minute/hour | 100–10,000 RPM (per tier) |
| **Tokens per minute (TPM)** | Max tokens per minute | 1M–300M (per model) |
| **Context window** | Max input tokens | 4K–2M (per model) |
| **Max output tokens** | Max generated tokens | 4K–32K (per model) |
| **Concurrent requests** | Parallel request count | 10–10,000 (per backend) |
| **Batch window** | Time to accumulate batch | 0–20 s (vLLM, TGI) |
| **TTFT timeout** | Max latency to first token | 30–120 s |
| **Idle timeout** | GPU idle → scale to 0 | 5–15 min (cloud) |
### Limits per deployment model
| Dimension | On-prem HW | Managed cloud (SageMaker, Vertex) | API (OpenAI, Anthropic) |
|-----------|--------------|----------------------------------|------------------------|
| **Model size** | Limited by HBM (max 192 GB/GPU) | Unlimited (cluster scaling) | Unlimited |
| **Queries** | Limited by GPU count | Auto-scaling | Rate limit (per tier) |
| **Latency** | < 10 ms (same node) | 10–100 ms (network hop) | 100 ms–10 s |
| **Customization** | Full (fine-tuning, quantization) | Managed (SageMaker, Bedrock) | Prompt engineering only |
| **Data privacy** | Yes (on-prem) | Contractual (region, encryption) | Limited |
| **Cost per 1M tokens** | ~$0.10–0.50 (FP16 inference) | ~$0.20–1.00 | ~$0.15–15.00 |
| **Max context** | 128K+ (depending on GPU count) | 128K+ | 32K–2M |
| **Cold start** | 0 (always-on) | 30 s–5 min | 0 (shared infra) |
---
## GPU pricing and price/performance (2026)
> Prices are approximate — NVIDIA does not publish official datacenter GPU price lists. Cloud prices from public providers (Q2 2026). HW purchase prices vary by volume, reseller, and region.
### Purchase price (buy)
| GPU | Price/GPU | Price 8× GPU baseboard | $/PFLOPS (FP16) | Note |
|-----|---------|----------------------|----------------|------|
| **H100 SXM** | $27,000–40,000 | ~$200,000 | $25,000 | Scarcity 2023–2024, now stabilized |
| **H200 SXM** | $35,000–50,000 | ~$280,000 | ~$35,000 | H100 upgrade, HBM3e |
| **B200** | ~$60,000–70,000 | ~$500,000+ | ~$31,000 | Blackwell, FP4 support |
| **B100** | ~$30,000 | ~$240,000 | ~$20,000 | Lower price than B200, similar FP8 perf |
| **GB200** (Grace+Blackwell) | ~$70,000–100,000 | ~$2,000,000 (rack) | — | CPU+GPU unified, high-density |
| **A100 80GB** | ~$10,000–15,000 | ~$120,000 | ~$19,200 | Previous gen, still relevant |
| **MI300X** | ~$12,000–18,000 | ~$100,000 | ~$9,600 | AMD, 192 GB HBM3 |
| **Gaudi 3** | ~$15,625 | ~$125,000 | **$8,515** | Intel, best $/PFLOPS |
| **L40S** | ~$8,000–10,000 | — | — | Inference, enterprise |
### Cloud pricing (on-demand $/GPU/hr)
| GPU | Cheapest | Mid-range (CoreWeave, Lambda) | Hyperscaler (AWS, GCP, Azure) |
|-----|----------|-----------------------------|-------------------------------|
| **H100 SXM** | $1.38 (Thunder) | $2.89–3.29 | $4.15–6.88 |
| **H100 PCIe** | $2.01 (Spheron) | $2.50 | — |
| **H200 SXM** | $3.89 (Spheron) | $4.54 | $5.00+ |
| **B200** | **$3.39** (Spheron) | $6.02 | $14.24 (AWS) |
| **B200 spot** | **$2.12** (Spheron) | — | — |
| **GB200** | $3.50 (Runcrate) | $5.85 (Oracle) | $6.95 (GCP) |
| **MI300X** | **$1.50** (TensorWave) | $1.85 (Vultr) | $7.86 (Azure) |
| **A100 80GB** | $1.07 (Spheron) | $1.50–2.00 | $3.00+ |
| **Gaudi 3** | ~$1.50–2.50 | — | — |
| **L40S** | $0.91 (Spheron) | $1.50–2.00 | — |
### Inference cost ($/M tokens)
| GPU | Provider | $/hr | Est. tok/s | $/M tok |
|-----|----------|------|-----------|--------|
| **B200** | Spheron | $3.39 | ~4,000 | **$0.24** |
| **B200 spot** | Spheron | $2.12 | ~4,000 | **$0.15** |
| **H100 PCIe** | Spheron | $2.01 | ~1,200 | $0.47 |
| **A100 80GB** | Spheron | $1.07 | ~520 | $0.57 |
| **H100 SXM** | AWS | $6.88 | ~1,200 | $1.59 |
| **H200 SXM** | Spheron | $4.54 | ~1,800 | $0.70 |
| **L40S** | Spheron | $0.91 | ~450 | $0.56 |
*Values for Llama 3 70B (INT8, batch=1, output 1K tok). Actual values vary by batch size, context, and quantization.*
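
The $/M tok column is derived from the hourly price and the token rate. The sketch below shows that arithmetic; `cost_per_million_tokens` is a hypothetical helper, and it assumes the GPU is fully utilized for the whole billed hour, which real deployments rarely achieve.

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_s: float) -> float:
    """Dollar cost of generating one million tokens at full utilization."""
    tokens_per_hour = tokens_per_s * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


# Rows from the table above: (GPU, $/hr, estimated tok/s)
for gpu, price, rate in (("H100 SXM (AWS)", 6.88, 1200), ("A100 80GB", 1.07, 520)):
    print(f"{gpu}: ${cost_per_million_tokens(price, rate):.2f} per M tokens")
```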
### Cost per GB HBM
| GPU | HBM | Price/hr cloud | $/GB/hr | Best for memory-bound workloads |
|-----|-----|-------------|--------|--------------------------------|
| **MI300X** | 192 GB | $1.50 | **$0.0078** | ✅ Best |
| **B200** | 192 GB | $3.39 | $0.0177 | ✅ Good |
| **H200** | 141 GB | $3.89 | $0.0276 | ⚠️ |
| **H100 SXM** | 80 GB | $1.38 | $0.0173 | ⚠️ Only up to 70B models |
| **GB200** | 384 GB | $3.50 | $0.0091 | ✅✅ (2× MI300X capacity) |
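
The ranking in this table can be reproduced by dividing the hourly price by HBM capacity and sorting. A minimal sketch using the table's own figures; the dictionary name `OFFERS` is hypothetical.

```python
# GPU -> (HBM in GB, cloud price in $/hr), copied from the table above
OFFERS = {
    "MI300X": (192, 1.50),
    "B200": (192, 3.39),
    "H200": (141, 3.89),
    "H100 SXM": (80, 1.38),
    "GB200": (384, 3.50),
}


def dollars_per_gb_hour(gpu: str) -> float:
    """Hourly price divided by HBM capacity."""
    hbm_gb, price_per_hour = OFFERS[gpu]
    return price_per_hour / hbm_gb


ranked = sorted(OFFERS, key=dollars_per_gb_hour)
for gpu in ranked:
    print(f"{gpu:<9} ${dollars_per_gb_hour(gpu):.4f}/GB/hr")
```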
### Price/performance by scenario
| Scenario | Winner | Rationale |
|----------|--------|-----------|
| **Absolute performance** (cost no object) | **GB200 DGX NVL72** | 72× GPU, 18 PFLOPS FP8, 384 GB HBM/GPU |
| **Cloud inference** — best $/token | **B200 spot** | $0.15/M tok; 4× H100 throughput at lower cost |
| **Cloud inference** — on-demand | **B200** | $0.24/M tok |
| **Cloud inference** — budget | **A100 / L40S** | $0.56–0.57/M tok |
| **Training** — price/perf on purchase | **Gaudi 3** | $8,515/PFLOPS, 2.5–3× better than H100 |
| **Training** — cloud | **H100 SXM** | $1.38/hr, CUDA ecosystem, NCCL |
| **Memory-bound** — long context, 70B+ | **MI300X / GB200** | 192–384 GB, $0.0078–0.0091/GB/hr |
| **Ecosystem + safe choice** | **H100/H200** | CUDA, widest SW, NVIDIA tools |
| **Spot / preemptible** — lowest cost | **A100 / H100** | $1.07–1.38/hr, 50–90% off on-demand |
### 2026 Trends
- **H100** — price dropped 64% from peak $8/hr to $1.38–2.89/hr, then 40% rebound from inference demand
- **B200** — new high-end, $3.39/hr cloud → ~$0.15/M tok on spot — new inference benchmark
- **MI300X** — supply growing (TensorWave, Vultr, CoreWeave, Oracle, Azure), from $1.50/hr
- **Gaudi 3** — best $/PFLOPS on purchase, but narrow ecosystem and limited cloud availability
- **Market bifurcation** — prior gen (H100, A100) commoditizing, new gen (B200, GB200) commanding premium
## Related documents
- [GPU.en.md](GPU.en.md) — GPU architecture, NVIDIA/AMD, vGPU, MIG
- [NETWORKING.en.md](NETWORKING.en.md) — InfiniBand, RoCE, network topology
- [STORAGE.en.md](STORAGE.en.md) — parallel filesystem, object store
- [DATACENTERS.en.md](DATACENTERS.en.md) — DC layout, power, cooling
- [CLOUD.en.md](CLOUD.en.md) — cloud AI services (SageMaker, Vertex AI)
## Sources
Links, books, and standards: [sources/infrastructure/sources.en.md](sources/infrastructure/sources.en.md)
*Last revision: 2026-06-18*