# 🧠 AI/ML Infrastructure

## Component overview

```mermaid
flowchart TD
    subgraph Compute
        GPU["GPU (H100/B200/Instinct)"]
        CPU["CPU (AMD EPYC / Intel Xeon)"]
        ASIC["ASIC (TPU, Trainium, Inferentia)"]
    end
    subgraph Network
        IB["InfiniBand NDR/XDR"]
        ROCE["RoCEv2"]
        NVL["NVLink / NVSwitch"]
    end
    subgraph Storage
        FS["Parallel FS (Lustre, GPFS, Weka)"]
        OBJ["Object Store (S3, MinIO)"]
        NVME["Local NVMe cache"]
    end
    subgraph Orchestration
        S["Slurm"]
        K["Kubernetes + Volcano/Kueue"]
    end
    subgraph Cooling
        DLC["Direct-to-chip liquid"]
        IMM["Immersion"]
        AIR["Air (high-density)"]
    end

    Compute --> Network --> Storage
    Orchestration --> Compute
    Cooling --> Compute
```

---

## GPU compute

### NVIDIA

| GPU | Architecture | FP8 | FP16/BF16 | FP64 | Memory | NVLink | TDP | Rack config |
|-----|-------------|-----|-----------|------|--------|--------|-----|-------------|
| **H100 SXM** | Hopper | 3,958 TFLOPS | 1,979 TFLOPS | 67 TFLOPS | 80 GB HBM3 | 900 GB/s | 700 W | 8× in DGX H100 |
| **H200 SXM** | Hopper (HBM3e) | 3,958 TFLOPS | 1,979 TFLOPS | 67 TFLOPS | 141 GB HBM3e | 900 GB/s | 700 W | 8× in DGX H200 |
| **B200** | Blackwell | ~9,000 TFLOPS | ~4,500 TFLOPS | ~40 TFLOPS | 192 GB HBM3e | 1,800 GB/s | 1,000 W | 8× in DGX B200 |
| **GB200 Grace Blackwell** (superchip: 1× Grace + 2× B200) | Blackwell | ~18,000 TFLOPS | ~9,000 TFLOPS | — | 384 GB HBM3e (2× 192 GB) + 480 GB LPDDR5X (Grace) | NVLink-C2C | 1,000 W (GPU) + 500 W (CPU) | DGX GB200 NVL72 (36 superchips, 72 GPUs) |
| **L40S** | Ada Lovelace | 733 TFLOPS | 367 TFLOPS | — | 48 GB GDDR6 | N/A | 350 W | Inference, enterprise |
| **A100 SXM** | Ampere | — (no FP8; INT8: 1,248 TOPS) | 624 TFLOPS | 19.5 TFLOPS | 80 GB HBM2e | 600 GB/s | 400 W | 8× in DGX A100 |

*FP8 and FP16/BF16 figures for H100, H200, B200, GB200, and A100 are peak Tensor Core values with structured sparsity; dense throughput is about half. L40S figures are dense.*

### AMD

| GPU | Architecture | FP8 | FP16/BF16 | FP64 | HBM | Infinity Fabric | TDP |
|-----|-------------|-----|-----------|------|-----|----------------|-----|
| **MI300X** | CDNA 3 | 2,615 TFLOPS | 1,307 TFLOPS | 81 TFLOPS | 192 GB HBM3 | 896 GB/s | 750 W |
| **MI250X** | CDNA 2 | — | 383 TFLOPS | 95.7 TFLOPS | 128 GB HBM2e | 400 GB/s | 500 W |

### Intel

| Accelerator | Architecture | FP16/BF16 | FP32 | HBM | TDP |
|-------------|-------------|-----------|------|-----|-----|
| **Gaudi 3** | Custom | 1,835 TFLOPS | — | 128 GB HBM2e | 600 W (PCIe) / 900 W (OAM) |
| **Max 1550** | Xe HPC | 600+ TFLOPS | 200 TFLOPS | 128 GB HBM2e | 600 W |

### Cloud ASIC

| ASIC | Provider | Use case | Performance |
|------|----------|----------|-------------|
| **TPU v5p** | Google | Training | ~459 TFLOPS (BF16) per chip |
| **Trainium 2** | AWS | Training | ~1,000 TFLOPS (BF16) per chip |
| **Inferentia 2** | AWS | Inference | ~400 TOPS (INT8) per chip |
| **Maia 100** | Microsoft | Training + inference | Custom, 800 W TDP |

---

## AI networking

### Technology comparison

| Technology | Bandwidth per link | Latency | Topology | Use case |
|-------------|-------------------|---------|-----------|----------|
| **InfiniBand NDR200** | 200 Gb/s | < 1 µs | Fat-tree, Dragonfly+ | Training (NVIDIA) |
| **InfiniBand NDR400** | 400 Gb/s | < 1 µs | Fat-tree, Dragonfly+ | Training (NVIDIA) |
| **InfiniBand XDR** | 800 Gb/s (planned) | < 1 µs | Dragonfly+ | Next-gen training |
| **RoCEv2** (CX-7/8) | 200–400 Gb/s | 1–2 µs | Fat-tree, Spine-leaf | Training (AMD, Intel, open) |
| **NVLink 4.0** | 900 GB/s per GPU | < 0.5 µs | NVSwitch full-mesh | Intra-node GPU comm |
| **NVLink 5.0** | 1,800 GB/s per GPU | < 0.5 µs | NVSwitch full-mesh | Intra-node (Blackwell) |
| **Ethernet (400 GbE)** | 400 Gb/s | 2–5 µs | Spine-leaf | Inference, data pipeline |

Note the units: network links are quoted in Gb/s (bits), NVLink and HBM in GB/s (bytes).

### AI fabric principles

- **Rail-optimized topology** — each GPU communicates on dedicated "rails" (same GPU indices across nodes connect to the same switch)
- **Fat-tree (Clos)** — standard for InfiniBand and RoCE, non-blocking bisection bandwidth
- **Dragonfly+** — reduces hop count while maintaining bandwidth (used in largest clusters)
- **GPU Direct RDMA** — the NIC reads and writes GPU memory directly, bypassing host memory and the CPU; works over InfiniBand and RoCE
- **SHARP (Scalable Hierarchical Aggregation and Reduction Protocol)** — in-network reduction for AllReduce (InfiniBand only)

### Bandwidth sizing

```text
Rule of thumb: InfiniBand bandwidth ≥ 50 % of GPU HBM bandwidth for scalable training

Example: H100 has 3.35 TB/s HBM
→ Target: ~1.7 TB/s of network bandwidth per GPU
→ DGX H100 as built: 1× NDR400 per GPU (8 per node) = 400 Gb/s = 50 GB/s ≈ 1.5 % of HBM
→ Even 8× 200 Gb/s (8 × 25 GB/s = 200 GB/s) per GPU would reach only ~6 % of HBM → bottleneck
```
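
The sizing arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch: the function name `nic_to_hbm_ratio` and its parameters are illustrative, and the inputs are the H100 and NDR figures quoted in this section.

```python
def nic_to_hbm_ratio(link_gbit_per_s: float, links_per_gpu: int, hbm_tb_per_s: float) -> float:
    """Fraction of one GPU's HBM bandwidth that its network links can deliver."""
    link_gbyte_per_s = link_gbit_per_s / 8          # Gb/s -> GB/s
    network_gbyte_per_s = link_gbyte_per_s * links_per_gpu
    return network_gbyte_per_s / (hbm_tb_per_s * 1000)  # TB/s -> GB/s


# One NDR400 link per GPU (8 per DGX H100 node), H100 HBM at 3.35 TB/s
print(f"1x NDR400 per GPU: {nic_to_hbm_ratio(400, 1, 3.35):.1%}")
# Hypothetical 8 links of 200 Gb/s per GPU
print(f"8x 200 Gb/s per GPU: {nic_to_hbm_ratio(200, 8, 3.35):.1%}")
```

Either way the network delivers a small fraction of HBM bandwidth, which is why training frameworks overlap communication with compute.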

---

## AI storage

### Requirements

| Dataset size | IO pattern | Recommended storage | Bandwidth |
|-------------|-----------|-------------------|-----------|
| < 10 TB | Sequential read (data loading) | Local NVMe | > 10 GB/s per node |
| 10–100 TB | Random read, checkpoint writes | Parallel FS (Lustre, Weka) | > 100 GB/s cluster-wide |
| 100 TB–10 PB | Mixed (training + checkpoint) | Parallel FS + object store | > 500 GB/s |
| 10 PB+ | Multi-modal, video, LLM | Tiered (NVMe cache + parallel FS + object) | > 1 TB/s |

### Storage solution comparison

| Solution | Type | Bandwidth | Max capacity | Scaling | Use case |
|--------|-----|-----------|-------------|-----------|----------|
| **Lustre** | Parallel FS (POSIX) | > 100 GB/s (cluster) | 100s PB | OST + MDS | HPC, LLM training (standard) |
| **GPFS / StorageScale** | Parallel FS (POSIX) | > 100 GB/s (cluster) | 100s PB | NSD servers | HPC, AI (IBM) |
| **WekaFS** | Parallel FS (POSIX + NFS/SMB) | ~80 GB/s per 10 nodes | 10s PB | Container-native | AI/ML, NVIDIA DGX preferred |
| **VAST Data** | Universal storage (NVMe + QLC) | ~100 GB/s per cluster | 10s PB | Scale-out | AI, checkpoint, data lake |
| **Pure Storage//E** | All-flash (NVMe) | ~50 GB/s | ~30 PB | Scale-out | Enterprise AI, database |
| **MinIO / S3** | Object store | ~20 GB/s per gateway | EB | Erasure coding | Dataset repository, checkpoint |
| **NetApp AFF** | NAS + S3 | ~10 GB/s per controller | ~50 PB | HA pair | Enterprise, NFS baseline |

### Checkpointing strategies

| Strategy | RPO | Storage impact | Description |
|-----------|-----|---------------|-------------|
| **Full checkpoint** | every N steps | High (stops training) | Full model + optimizer state |
| **Async checkpoint** | every N steps | Medium (non-blocking) | Copy to staging buffer, async write |
| **Distributed checkpoint** (NVIDIA NeMo) | every N steps | Low | Each rank writes its own shard |
| **In-memory checkpoint** (IBM) | on failover | Minimal (DRAM) | Replication to another node's DRAM |
| **Continuous checkpoint** (Microsoft) | every 1–5 min | Low (delta) | Changed shards only |
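
The storage impact of a blocking full checkpoint follows from checkpoint size and write bandwidth. The sketch below is illustrative only: the function names are made up for this example, the 420 GB checkpoint assumes a 70B model with 140 GB of BF16 weights plus 280 GB of Adam state, and the 500-step interval and 2 s step time are hypothetical.

```python
def checkpoint_write_seconds(checkpoint_gb: float, write_gb_per_s: float) -> float:
    """Time to flush one full checkpoint at a sustained write bandwidth."""
    return checkpoint_gb / write_gb_per_s


def blocking_overhead(write_seconds: float, steps_between: int, step_seconds: float) -> float:
    """Fraction of wall-clock time lost when training pauses for every checkpoint."""
    compute_seconds = steps_between * step_seconds
    return write_seconds / (compute_seconds + write_seconds)


# Assumed inputs: 420 GB checkpoint, 100 GB/s parallel FS, checkpoint every 500 steps of 2 s
write_s = checkpoint_write_seconds(420, 100)
print(f"write time: {write_s:.1f} s")
print(f"blocking overhead: {blocking_overhead(write_s, 500, 2.0):.2%}")
```

Async and distributed checkpointing reduce the pause itself, so the same formula applies with a much smaller effective `write_seconds`.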

---

## AI cluster architecture

### Physical topology — DGX H100 example

```
┌──────── DGX H100 (8× GPU) ────────┐
│  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐  │
│  │GPU 0│ │GPU 1│ │GPU 2│ │GPU 3│  │
│  └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘  │
│  ┌──┴──┐ ┌──┴──┐ ┌──┴──┐ ┌──┴──┐  │
│  │GPU 4│ │GPU 5│ │GPU 6│ │GPU 7│  │
│  └─────┘ └─────┘ └─────┘ └─────┘  │
│  NVSwitch (NVLink 4.0, 900 GB/s)  │
│  InfiniBand CX-7: 8× NDR400       │
└───────────────────────────────────┘
                  │ 8× IB rails
        ┌─────────┴──────────┐
        │ IB NDR400 Switches │ (rail-optimized)
        └────────────────────┘
```

### Kubernetes for AI

| Component | Role |
|-----------|------|
| **Volcano** | Batch scheduling, gang scheduling, queue management |
| **Kueue** | Multi-tenant admission, resource quotas, fair sharing |
| **NVIDIA GPU Operator** | Driver, container toolkit, MIG, DCGM, monitoring |
| **HAMi** (formerly k8s-vGPU-scheduler) | GPU sharing, MIG partitioning, fractional GPU |
| **Node Feature Discovery** | GPU type detection, NUMA topology |
| **Topology Manager** | NUMA-aware pod placement |
| **DPDK / SR-IOV** | High-performance networking for GPU Direct RDMA |

### Slurm for AI

| Component | Role |
|-----------|------|
| **slurm.conf** | Partition for GPU nodes, GRES (Generic Resource) |
| **gres.conf** | GPU type, GPU count per node |
| **srun --gres=gpu:8** | Allocate 8 GPUs per node |
| **sbatch --nodes=64 --ntasks=512** | 64 nodes, 512 ranks (8 GPU/node) |
| **Pyxis** | NVIDIA container plugin for Slurm (used with Enroot) |

---

## AI cluster cooling

### Power density comparison

| Configuration | TDP per node | Racks | kW/rack | Note |
|-------------|-------------|-------|---------|----------|
| Standard server (2U) | 1 kW | 20 | 5–10 | Typical DC |
| GPU server (DGX H100, 6×) | 42 kW | 6 | 45–50 | Air cooling limit |
| GPU server (DGX B200, 6×) | 72 kW | 6 | 90–100 | Liquid cooling required |
| GPU server (GB200 NVL72) | 120 kW | — | ~120 | Liquid cooling mandatory |
| NVIDIA NVL72 rack | 120 kW | 1 | 120 | Fully liquid cooled |

### Cooling technologies

| Method | Max kW/rack | CAPEX | OPEX | Complexity |
|--------|-------------|-------|------|-----------|
| **Air cooling (CRAC/CRAH)** | < 15 | Low | Medium | Low |
| **Air cooling (in-row)** | 15–30 | Medium | Medium | Low |
| **Rear-door heat exchanger** | 30–50 | Medium | Low | Medium |
| **Direct-to-chip liquid (cold plate)** | 50–150 | High | Low | High |
| **Immersion (single-phase)** | 100–200 | High | Low | High |
| **Immersion (two-phase)** | 200+ | Very high | Low | Very high |

---

## Inference infrastructure

### Inference server comparison

| Tool | Frameworks | Optimization | Use case |
|---------|-----------|-------------|----------|
| **vLLM** | HF Transformers models, AWQ/GPTQ checkpoints | PagedAttention, KV cache, continuous batching | LLM inference (open source) |
| **TensorRT-LLM** | TensorRT | INT4/INT8/FP8, in-flight batching, attention optimizations | Production (NVIDIA) |
| **Triton Inference Server** | All (TensorRT, vLLM, PyTorch) | Model ensemble, model caching, concurrent execution | Enterprise, multi-model |
| **SageMaker** | Managed | Auto-scaling, model parallelism | AWS managed |
| **TGI** (OpenAI-compatible API) | HF Transformers | Continuous batching, flash attention | Hosting |

### Inference optimization

| Technique | Latency improvement | Throughput improvement | Memory reduction |
|----------|-----------------|---------------------|------------------|
| **FP8/INT8 quantization** | — | 2× | 2× |
| **INT4 quantization** | — | 4× | 4× |
| **Flash Attention 2/3** | 2–4× | — | Attention activations O(N²) → O(N); KV cache size unchanged |
| **PagedAttention** | — | 2–5× | 95 % (KV cache fragmentation) |
| **Continuous batching** | — | 10–20× | — |
| **Speculative decoding** | 2–3× | — | — |
| **Multi-LoRA / S-LoRA** | — | 8–16× | — |

*Quantization factors are relative to FP16.*

---

## Distributed training techniques

| Technique | Description | Frameworks |
|----------|-------|------------|
| **Data Parallelism (DDP/FSDP)** | DDP: each GPU holds a full model copy and processes a different batch. FSDP additionally shards parameters, gradients, and optimizer state | PyTorch DDP, FSDP |
| **Tensor Parallelism (TP)** | Individual layers (weight matrices) split across GPUs, usually intra-node | Megatron-LM, DeepSpeed |
| **Pipeline Parallelism (PP)** | Consecutive groups of layers (stages) placed on different GPUs or nodes | Megatron-LM, DeepSpeed |
| **Sequence Parallelism (SP)** | Sequence split across GPUs | Megatron-LM |
| **Expert Parallelism (EP)** | Different MoE expert subnets on different GPUs | Megatron-LM, DeepSpeed |
| **3D Parallelism** | TP + PP + DP combination | Megatron-LM, NeMo |
| **ZeRO (1/2/3)** | Stage 1 shards optimizer state, stage 2 adds gradients, stage 3 adds parameters | DeepSpeed |
| **NCCL / RCCL** | GPU collective communication library | NVIDIA/AMD |

---

## Operating systems for AI

### Distribution comparison

| OS | GPU driver | CUDA | Container toolkit | IB/RoCE | Lustre client | Production support |
|----|-----------|------|-------------------|---------|--------------|-------------------|
| **Ubuntu 22.04 LTS** | NVIDIA 525+ | 12.x | nvidia-container-toolkit | MLNX_OFED, rdma-core | Yes (lustre-client) | NVIDIA DGX standard |
| **Ubuntu 24.04 LTS** | NVIDIA 550+ | 12.5+ | nvidia-container-toolkit | MLNX_OFED, rdma-core | Yes | Latest GPU support |
| **RHEL 9 / Rocky 9** | NVIDIA 525+ | 12.x | nvidia-container-toolkit | MLNX_OFED | Yes (EL repo) | Red Hat, enterprise |
| **DGX OS** (Ubuntu-based) | NVIDIA custom | 12.x | Pre-installed | Pre-configured | Yes | Supported on NVIDIA DGX only |
| **SLES 15 SP5** | NVIDIA 525+ | 12.x | nvidia-container-toolkit | MLNX_OFED | Yes | HPC, some Lustre clusters |
| **Debian 12** | NVIDIA 525+ | 12.x | nvidia-container-toolkit | rdma-core | Yes (backports) | Community, research |
| **Flatcar / Bottlerocket** | Container-host | — | nvidia-container-toolkit | Limited | No | K8s-only, minimal footprint |

### Limitations and constraints

#### GPU drivers and CUDA

| Constraint | Detail |
|----------|--------|
| **Driver-CUDA compatibility** | The installed driver must be at least the minimum version required by the CUDA toolkit (driver ≥ CUDA requirement). E.g., CUDA 12.4 ships with driver 550; CUDA 12.x minor-version compatibility needs driver ≥ 525 |
| **Kernel version** | NVIDIA driver not compatible with all kernels. New kernel (6.8+) may require DKMS build or delayed support |
| **Secure Boot** | NVIDIA driver requires signed module (MOK, shim) or disabled Secure Boot; a common enterprise issue |
| **Open vs Proprietary driver** | NVIDIA `nvidia-open` (since R515) is the open source kernel module. It supports Turing and newer GPUs (including H100 and later); pre-Turing GPUs (Maxwell, Pascal, Volta) require the proprietary module |
| **nvidia-persistenced** | Keeps the driver state loaded while no client is attached; without it the GPU is torn down when idle and re-initialized at the next job start (legacy alternative: `nvidia-smi -pm 1`) |
| **GPU reset** | After crashed training job, GPU may hang. `nvidia-smi --gpu-reset` or reboot node, sometimes power cycle |
| **Multi-instance GPU (MIG)** | Requires specific driver and MIG mode enabled on the GPU. Switching MIG mode needs a GPU reset, so it cannot be toggled under a running workload. Data center GPUs only (A100, A30, H100, H200, B200) |

#### Network (InfiniBand / RoCE)

| Constraint | Detail |
|----------|--------|
| **MLNX_OFED vs rdma-core** | MLNX_OFED (NVIDIA): full support, but ships its own kernel modules, so kernel version compatibility is needed. `rdma-core` (open): limited support, no custom modules |
| **Kernel compatibility** | MLNX_OFED supports only specific kernel versions (major.minor). Kernel upgrade → MLNX_OFED rebuild required |
| **NCCL** | NCCL version must be compatible with CUDA and IB firmware. `nccl-tests` for validation |
| **SHARP** | In-network reduction requires specific MLNX_OFED + IB switch firmware combination |
| **GPU Direct RDMA** | Requires `nvidia-peermem` module + MLNX_OFED. Does not work with all GPU and IB card combinations |
| **RoCE PFC/ECN** | RoCE requires lossless fabric (PFC, ECN, DCQCN). Switch and host configuration need complex tuning |

#### Storage

| Constraint | Detail |
|----------|--------|
| **Lustre client** | Client version must be compatible with the server version; a server upgrade usually means upgrading all clients. Prebuilt client packages target RHEL derivatives, SLES, and Ubuntu; other distributions build from source |
| **POSIX locking** | NFS and Lustre have different POSIX locking behavior. Code that relies on `flock` is problematic on mixed filesystems |
| **Filesystem cache** | Page cache can mask IO bottlenecks. Benchmark with `O_DIRECT` or sync IO to see real storage throughput |
| **Local NVMe vs parallel FS** | Dataset staging on local NVMe eliminates network dependency but requires space and pre-fetch pipeline |

#### Container runtime

| Constraint | Detail |
|----------|--------|
| **Docker + GPU** | `nvidia-container-toolkit` (formerly nvidia-docker2). Requires runtime installation and config in `/etc/docker/daemon.json` |
| **Podman + GPU** | Requires `nvidia-container-toolkit` + podman hook. Less tested than Docker |
| **containerd + GPU** | Standard for K8s. Requires CDI (Container Device Interface) or `nvidia-container-runtime` |
| **Enroot + Pyxis** | NVIDIA container stack for Slurm (Enroot = daemonless container runtime, Pyxis = Slurm plugin) |
| **User namespace mapping** | Container GPU access requires device cgroup; rootless may fail (exception for /dev/dri and /dev/nvidia*) |

#### Kernel parameters

```text
# Recommended sysctl settings for AI workloads (e.g. /etc/sysctl.d/)
net.core.rmem_max = 134217728               # sufficient for NCCL
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
net.core.netdev_budget = 600                # for high packet rate
vm.max_map_count = 1048576                  # PyTorch DataLoader workers
kernel.numa_balancing = 0                   # disable NUMA balancing (breaks locality)
kernel.sched_min_granularity_ns = 10000000  # sysctl on older kernels only; newer kernels expose it via debugfs

# Kernel command-line parameters (boot loader, not sysctl)
# Disabling security mitigations for perf: dedicated AI clusters only
mitigations=off
transparent_hugepage=never                  # or madvise; THP may cause latency spikes
intel_idle.max_cstate=1                     # reduce C-state transition latency
```

#### Firmware and HW

| Constraint | Detail |
|----------|--------|
| **GPU firmware (VBIOS)** | NVIDIA datacenter GPUs (H100, B200) have VBIOS updates via NVFlash. Without update → missing partitioning support or newer CUDA features |
| **InfiniBand firmware** | IB switch and HCA firmware must be compatible. Mix old switch + new HCA → degraded perf |
| **NVSwitch firmware** | DGX systems have NVSwitch firmware updatable only via NVIDIA DGX tools |
| **Power capping (nvidia-smi)** | `nvidia-smi -pl <watts>` limits board power for power budget management. Test impact on training throughput |
| **GPU clock locking** | `nvidia-smi -lgc <min,max>` locks the graphics clock range for stable benchmarks; `nvidia-smi -ac <mem,graphics>` sets application clocks. Apply with persistence enabled, otherwise settings are lost when the driver unloads |
| **PCIe Gen** | GPU in PCIe Gen4 slot (instead of Gen5) → bottleneck for CPU↔GPU data transfer. Important for FSDP sharding |

### Recommended OS per use case

| Use case | OS | Rationale |
|----------|-----|-------|
| **DGX cluster (production)** | DGX OS / Ubuntu 22.04 LTS | NVIDIA standard, best driver support |
| **Enterprise K8s (OpenShift)** | RHEL 9 / RHCOS | Red Hat support, GPU Operator compatible |
| **Vanilla K8s (on-prem)** | Ubuntu 22.04 LTS + Flatcar (workers) | Widest community support, Flatcar for minimal footprint |
| **Slurm cluster (HPC/AI)** | Rocky Linux 9 / Ubuntu 22.04 LTS | EL ecosystem (Lustre, OFED) or Ubuntu (community) |
| **Research / rapid prototyping** | Ubuntu 24.04 LTS | Latest CUDA, PyTorch, driver support |
| **Edge inference** | NVIDIA JetPack / Ubuntu (ARM) | Embedded GPU (Jetson Orin, AGX) |

---

## AI-ready data center — checklist

| Area | Requirement |
|--------|-----------|
| **Power** | 30–120 kW/rack, HVDC (400 V DC), UPS supporting GPU spikes |
| **Cooling** | Liquid cooling ready (direct-to-chip), rear-door for 30+ kW |
| **Network** | InfiniBand (NDR/XDR) or RoCEv2, rail-optimized fat-tree |
| **Storage** | Parallel FS (Lustre/Weka), checkpoint bandwidth > 100 GB/s |
| **GPU density** | Max GPU/rack, minimize NVSwitch hops |
| **Physical** | Floor load 1,500+ kg/m², rack 52U–60U |
| **Security** | Tenant isolation, network segmentation, data encryption |
| **Monitoring** | DCGM, NCCL health checks, thermals, power capping |

---

## Model and throughput limitations

### Model size per GPU

Maximum model size fitting on a single GPU depends on HBM capacity and precision:

| GPU | HBM | FP32 | FP16/BF16 | INT8 | INT4 |
|-----|-----|------|-----------|------|------|
| **H100 80GB** | 80 GB | ~20B | ~40B | ~80B | ~160B |
| **H200 141GB** | 141 GB | ~35B | ~70B | ~140B | ~280B |
| **B200 192GB** | 192 GB | ~48B | ~96B | ~192B | ~384B |
| **MI300X 192GB** | 192 GB | ~48B | ~96B | ~192B | ~384B |
| **A100 80GB** | 80 GB | ~20B | ~40B | ~80B | ~160B |
| **GB200 (192+480)** | 192 GB GPU + 480 GB Grace | — | ~96B + CPU offload | — | — |

*Approximate upper bounds for weights only: 1B params ≈ 4 GB FP32 ≈ 2 GB FP16 ≈ 1 GB INT8 ≈ 0.5 GB INT4. Subtract ~10–15 % of HBM for activations, KV cache, and runtime buffers.*
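
The bytes-per-parameter rule can be turned into a quick capacity check. This is a minimal sketch; `max_params_billion` and `reserve_fraction` are illustrative names, and the reserve models the 10–15 % headroom mentioned above.

```python
# Bytes per parameter at each precision, as stated in the rule of thumb
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def max_params_billion(hbm_gb: float, precision: str, reserve_fraction: float = 0.0) -> float:
    """Largest model (in billions of parameters) whose weights fit in HBM."""
    usable_gb = hbm_gb * (1.0 - reserve_fraction)
    return usable_gb / BYTES_PER_PARAM[precision]


for precision in BYTES_PER_PARAM:
    raw = max_params_billion(80, precision)
    with_reserve = max_params_billion(80, precision, reserve_fraction=0.15)
    print(f"80 GB, {precision}: ~{raw:.0f}B raw, ~{with_reserve:.0f}B with 15 % reserved")
```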

### Memory breakdown inference

| Component | Llama 3 70B (FP16) | Llama 3 8B (FP16) |
|------------|-------------------|-------------------|
| Model weights | 140 GB | 16 GB |
| KV cache (4K context, batch 1) | ~1.3 GB | ~0.5 GB |
| KV cache (128K context, batch 1) | ~43 GB | ~17 GB |
| Activations (peak) | ~5 GB | ~1 GB |
| **Total 4K ctx** | ~146 GB | ~17.5 GB |
| **Total 128K ctx** | ~188 GB | ~34 GB |

*KV cache figures assume grouped-query attention with 8 KV heads and head dimension 128 (80 layers for 70B, 32 layers for 8B).*

**Conclusion:** Llama 3 70B FP16 (140 GB of weights) does not fit on a single H100 (80 GB). Options: INT8 (~70 GB of weights; 2× H100 or 1× H200 once KV cache is included), INT4 (~35 GB of weights; 1× H100), or tensor parallelism across multiple GPUs.

### Context length vs memory

| Context | KV cache 70B (FP16) | KV cache 8B (FP16) | Note |
|---------|-------------------|-------------------|------|
| 4K | ~1.3 GB | ~0.5 GB | Typical chat |
| 32K | ~11 GB | ~4.3 GB | Documents |
| 128K | ~43 GB | ~17 GB | Long-context (Claude, Gemini) |
| 1M | ~344 GB | ~137 GB | Experimental (Gemini 1.5 Pro) |

KV cache grows **linearly with context length**, and also linearly with batch size, layer count, and the number of KV heads. Critical for long-context inference.
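
KV cache size can be estimated directly from the model shape. The sketch below uses the standard per-token formula (2 tensors × layers × KV heads × head dimension × bytes per value); the function name is illustrative, and the Llama 3 shapes (80 and 32 layers, 8 KV heads, head dimension 128) are assumptions taken from the published model configurations, so results may differ from rounded table values.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, tokens: int,
                   batch: int = 1, bytes_per_value: int = 2) -> int:
    """KV cache size: keys and values for every layer, KV head, and cached token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * tokens * batch


# Assumed shapes: (layers, kv_heads, head_dim)
MODELS = {"Llama 3 70B": (80, 8, 128), "Llama 3 8B": (32, 8, 128)}

for name, shape in MODELS.items():
    for tokens in (4_096, 32_768, 131_072):
        gb = kv_cache_bytes(*shape, tokens=tokens) / 1e9
        print(f"{name}, {tokens:>7} tokens: {gb:6.1f} GB")
```

Doubling the context or the batch doubles the result, which is the linear scaling described above.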

### Throughput inference

| Model | GPU | Precision | Batch size | Tokens/s (aggregate) | QPS (1K output) |
|-------|-----|-----------|-----------|----------|-----------------|
| Llama 3 8B | H100 | FP16 | 1 | ~800 | ~0.8 |
| Llama 3 8B | H100 | FP16 | 128 | ~4,500 | ~4.5 |
| Llama 3 8B | H100 | INT4 | 128 | ~8,000 | ~8 |
| Llama 3 70B | 4× H100 | FP16 | 1 | ~180 | ~0.18 |
| Llama 3 70B | 4× H100 | INT4 | 64 | ~1,200 | ~1.2 |
| Llama 3 70B | 8× H100 | FP16 (TP=8) | 128 | ~2,500 | ~2.5 |
| DeepSeek-R1 671B | 8× H200 | FP8 (MoE) | 64 | ~500 | ~0.5 |
| GPT-4 class (est.) | — | — | — | ~100–300 | ~0.1–0.3 |

**Notes:**

- QPS (queries per second) = aggregate tokens/s ÷ output length; here one query is 1K output tokens
- Larger batch increases throughput but increases TTFT (time to first token)
- Tensor Parallelism (TP) scales, but communication overhead grows with the TP degree
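
The relationship between aggregate tokens/s, batch size, and QPS is simple division. A minimal sketch with illustrative function names, using the batch-128 Llama 3 8B row (~4,500 tokens/s) as input:

```python
def queries_per_second(aggregate_tokens_per_s: float, output_tokens_per_query: int = 1000) -> float:
    """Completed queries per second when each query produces a fixed number of tokens."""
    return aggregate_tokens_per_s / output_tokens_per_query


def tokens_per_s_per_sequence(aggregate_tokens_per_s: float, batch_size: int) -> float:
    """Generation speed seen by one request inside the batch."""
    return aggregate_tokens_per_s / batch_size


print(queries_per_second(4500))              # whole-server rate for 1K-token answers
print(tokens_per_s_per_sequence(4500, 128))  # per-user streaming speed
```

The second number is what a single user experiences; it drops as batch size grows even though the server as a whole gets faster.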

### Training limits

#### Scaling efficiency

| GPU count | Model | Efficiency | Reason |
|-----------|-------|-----------|-------|
| 8 (1 node) | Llama 3 8B | ~95 % | NVLink intra-node |
| 64 (8 nodes) | Llama 3 8B | ~85 % | IB inter-node |
| 512 (64 nodes) | Llama 3 70B | ~75 % | Communication overhead |
| 4,096 (512 nodes) | Llama 3 70B | ~60 % | Pipeline bubble, network |
| 16,384 (2,048 nodes) | Llama 3 405B | ~45 % | Synchronous SGD overhead |

**Note:** Efficiency = actual throughput ÷ (GPU count × single-GPU throughput). In the table it declines roughly in proportion to the logarithm of GPU count, and faster beyond a few thousand GPUs.
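
Scaling efficiency can be computed from two measurements: cluster throughput and a single-GPU baseline. The sketch below is illustrative; the function names and the sample throughput values are hypothetical, while the 45 % figure comes from the table.

```python
def scaling_efficiency(cluster_throughput: float, single_gpu_throughput: float, gpu_count: int) -> float:
    """Measured throughput divided by ideal linear scaling of the single-GPU baseline."""
    return cluster_throughput / (single_gpu_throughput * gpu_count)


def effective_gpus(gpu_count: int, efficiency: float) -> float:
    """Number of ideally scaling GPUs that would do the same work."""
    return gpu_count * efficiency


# Hypothetical measurement: 64 GPUs deliver 54.4 units/s, one GPU delivers 1.0 unit/s
print(f"{scaling_efficiency(54.4, 1.0, 64):.0%}")
# Table value: 16,384 GPUs at ~45 % efficiency
print(f"{effective_gpus(16_384, 0.45):.0f} effective GPUs")
```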

#### Memory breakdown training

| Component | Llama 3 70B (BF16) | Llama 3 8B (BF16) |
|------------|-------------------|-------------------|
| Model weights | 140 GB | 16 GB |
| Optimizer states (Adam) | 280 GB | 32 GB |
| Gradients | 140 GB | 16 GB |
| Activations (peak) | ~30 GB | ~4 GB |
| **Total (DDP)** | ~590 GB | ~68 GB |
| **Total (FSDP shard=8)** | ~74 GB | ~8.5 GB |

**Conclusion:** FSDP (Fully Sharded Data Parallelism) is required for training models > 10B. Training with Adam needs roughly 4× the weight memory of inference (weights + gradients + two optimizer moments), before activations.
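
The table totals follow from a per-parameter byte count. This is a sketch under the table's own assumptions: 2 bytes each for BF16 weights and gradients, 4 bytes for the two Adam moments, and activations divided evenly across shards (a simplification). The function name and defaults are illustrative; mixed-precision setups that keep FP32 master weights and moments need more.

```python
def training_memory_gb(params_billion: float, activations_gb: float, shards: int = 1,
                       bytes_weights: float = 2, bytes_grads: float = 2,
                       bytes_optimizer: float = 4) -> float:
    """Per-GPU training memory in GB; shards=1 corresponds to plain DDP."""
    state_gb = params_billion * (bytes_weights + bytes_grads + bytes_optimizer)
    return (state_gb + activations_gb) / shards


print(training_memory_gb(70, 30))            # Llama 3 70B, DDP
print(training_memory_gb(70, 30, shards=8))  # Llama 3 70B, FSDP over 8 GPUs
print(training_memory_gb(8, 4, shards=8))    # Llama 3 8B, FSDP over 8 GPUs
```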

#### Time to train

| Model | GPU count | GPU type | Precision | Time | Cost (on-prem estimate) |
|-------|-----------|---------|-----------|------|---------------------|
| Llama 3 8B | 64 | H100 | BF16 | ~3 days | ~$5,000 |
| Llama 3 70B | 512 | H100 | BF16 | ~14 days | ~$100,000 |
| Llama 3 405B | 16,384 | H100 | BF16 | ~60 days | ~$14 M |
| DeepSeek-V3 671B (MoE, base model of R1) | 2,048 | H800 | FP8 mixed | ~57 days | ~$5.6 M (reported) |
| GPT-4 (est.) | 25,000 | A100/H100 | Mixed | ~90–100 days | ~$100 M |

*The GPU-hours implied by the Llama 3 8B and 70B rows (GPU count × days × 24) are far below the published full-pretraining budgets: Meta reports about 1.3M H100-hours for Llama 3 8B and 6.4M for Llama 3 70B. Read those two rows as reduced-token-budget runs.*
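
Each row of the table implies a GPU-hour budget and an hourly rate, which makes the estimates easy to sanity-check. A minimal sketch with illustrative function names, fed with three rows from the table:

```python
def gpu_hours(gpu_count: int, days: float) -> float:
    """Total GPU-hours consumed by a run."""
    return gpu_count * days * 24


def implied_usd_per_gpu_hour(cost_usd: float, gpu_count: int, days: float) -> float:
    """Hourly GPU rate implied by a total cost estimate."""
    return cost_usd / gpu_hours(gpu_count, days)


rows = [  # (model, GPU count, days, cost in USD), taken from the table
    ("Llama 3 8B", 64, 3, 5_000),
    ("Llama 3 70B", 512, 14, 100_000),
    ("Llama 3 405B", 16_384, 60, 14_000_000),
]
for model, gpus, days, cost in rows:
    hours = gpu_hours(gpus, days)
    rate = implied_usd_per_gpu_hour(cost, gpus, days)
    print(f"{model}: {hours:,.0f} GPU-hours, ${rate:.2f}/GPU-hour")
```

Comparing the implied rate with current cloud prices shows how optimistic or conservative a given cost estimate is.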

### Power and thermal limits

| Configuration | TDP limit | Throughput loss | Reason |
|-------------|-----------|------------------|--------|
| H100 SXM | 700 W (default) | 0 % | Nominal |
| H100 SXM | 600 W (-15 %) | ~5–8 % | Power capping |
| H100 SXM | 500 W (-30 %) | ~15–25 % | Significant throttling |
| H100 SXM | 400 W (-43 %) | ~30–50 % | Emergency only |
| DGX H100 (8×) | 5.6 kW (8 × 700 W, GPUs only) | 0 % | Full GPU power budget; high-density cooling required |
| DGX H100 (8×) | 4.5 kW (air) | ~10–15 % | Rear-door heat exchanger |

The GPU throttles when it exceeds its power limit or temperature threshold (85°C+). Throughput loss under a power cap is non-linear: a 15 % cap costs only ~5–8 %, while deeper caps cost disproportionately more.

### API and operational limits

| Limit | Description | Typical value |
|-------|-------|-----------------|
| **Rate limit** | Max requests per minute/hour | 100–10,000 RPM (per tier) |
| **Tokens per minute (TPM)** | Max tokens per minute | 1M–300M (per model) |
| **Context window** | Max input tokens | 4K–2M (per model) |
| **Max output tokens** | Max generated tokens | 4K–32K (per model) |
| **Concurrent requests** | Parallel request count | 10–10,000 (per backend) |
| **Batch window** | Time to accumulate batch | 0–20 s (vLLM, TGI) |
| **TTFT timeout** | Max latency to first token | 30–120 s |
| **Idle timeout** | GPU idle → scale to 0 | 5–15 min (cloud) |

### Limits per deployment model

| Dimension | On-prem HW | Managed cloud (SageMaker, Vertex) | API (OpenAI, Anthropic) |
|-----------|--------------|----------------------------------|------------------------|
| **Model size** | Limited by HBM (max 192 GB/GPU) | Unlimited (cluster scaling) | Unlimited |
| **Queries** | Limited by GPU count | Auto-scaling | Rate limit (per tier) |
| **Latency** | < 10 ms (same node) | 10–100 ms (network hop) | 100 ms – 10 s |
| **Customization** | Full (fine-tuning, quantization) | Managed (SageMaker, Bedrock) | Prompt engineering only |
| **Data privacy** | Yes (on-prem) | Contractual (region, encryption) | Limited |
| **Cost per 1M tokens** | ~$0.10–0.50 (FP16 inference) | ~$0.20–1.00 | ~$0.15–15.00 |
| **Max context** | 128K+ (depending on GPU count) | 128K+ | 32K–2M |
| **Cold start** | 0 (always-on) | 30 s – 5 min | 0 (shared infra) |

---

## GPU pricing and price/performance (2026)

> Prices are approximate — NVIDIA does not publish official datacenter GPU price lists. Cloud prices from public providers (Q2 2026). HW purchase prices vary by volume, reseller, and region.

### Purchase price (buy)

| GPU | Price/GPU | Price 8× GPU baseboard | $/PFLOPS (FP16, dense) | Note |
|-----|---------|----------------------|----------------|------|
| **H100 SXM** | $27,000–40,000 | ~$200,000 | $25,000 | Scarcity 2023–2024, now stabilized |
| **H200 SXM** | $35,000–50,000 | ~$280,000 | ~$35,000 | H100 upgrade, HBM3e |
| **B200** | ~$60,000–70,000 | ~$500,000+ | ~$31,000 | Blackwell, FP4 support |
| **B100** | ~$30,000 | ~$240,000 | ~$20,000 | Lower price than B200, similar FP8 perf |
| **GB200** (Grace+Blackwell) | ~$70,000–100,000 | ~$2,000,000 (rack) | — | CPU+GPU unified, high-density |
| **A100 80GB** | ~$10,000–15,000 | ~$120,000 | ~$48,000 | Previous gen, still relevant |
| **MI300X** | ~$12,000–18,000 | ~$100,000 | ~$9,600 | AMD, 192 GB HBM3 |
| **Gaudi 3** | ~$15,625 | ~$125,000 | **$8,515** | Intel, best $/PFLOPS |
| **L40S** | ~$8,000–10,000 | — | — | Inference, enterprise |

*$/PFLOPS = (baseboard price ÷ 8) ÷ dense FP16 PFLOPS per GPU, e.g. H100: $25,000 ÷ ~0.99 PFLOPS; A100: $15,000 ÷ 0.312 PFLOPS.*

### Cloud pricing (on-demand $/GPU/hr)

| GPU | Cheapest | Mid-range (CoreWeave, Lambda) | Hyperscaler (AWS, GCP, Azure) |
|-----|----------|-----------------------------|-------------------------------|
| **H100 SXM** | $1.38 (Thunder) | $2.89–3.29 | $4.15–6.88 |
| **H100 PCIe** | $2.01 (Spheron) | $2.50 | — |
| **H200 SXM** | $3.89 (Spheron) | $4.54 | $5.00+ |
| **B200** | **$3.39** (Spheron) | $6.02 | $14.24 (AWS) |
| **B200 spot** | **$2.12** (Spheron) | — | — |
| **GB200** | $3.50 (Runcrate) | $5.85 (Oracle) | $6.95 (GCP) |
| **MI300X** | **$1.50** (TensorWave) | $1.85 (Vultr) | $7.86 (Azure) |
| **A100 80GB** | $1.07 (Spheron) | $1.50–2.00 | $3.00+ |
| **Gaudi 3** | ~$1.50–2.50 | — | — |
| **L40S** | $0.91 (Spheron) | $1.50–2.00 | — |

### Inference cost ($/M tokens)

| GPU | Provider | $/hr | Est. tok/s | $/M tok |
|-----|----------|------|-----------|--------|
| **B200** | Spheron | $3.39 | ~4,000 | **$0.24** |
| **B200 spot** | Spheron | $2.12 | ~4,000 | **$0.15** |
| **H100 PCIe** | Spheron | $2.01 | ~1,200 | $0.47 |
| **A100 80GB** | Spheron | $1.07 | ~520 | $0.57 |
| **H100 SXM** | AWS | $6.88 | ~1,200 | $1.59 |
| **H200 SXM** | Spheron | $3.89 | ~1,800 | $0.60 |
| **L40S** | Spheron | $0.91 | ~450 | $0.56 |

*Values for Llama 3 70B (INT8, batch=1, output 1K tok). $/M tok = $/hr ÷ (tok/s × 3,600) × 1,000,000. Actual values vary by batch size, context, and quantization.*
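
The $/M tok column is derived from the hourly price and the estimated throughput. A minimal sketch of that conversion; the function name is illustrative and the inputs are rows from the table:

```python
def usd_per_million_tokens(usd_per_hour: float, tokens_per_s: float) -> float:
    """Cost of generating one million tokens on a GPU rented by the hour."""
    tokens_per_hour = tokens_per_s * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000


for gpu, price, tok_s in [
    ("B200 spot", 2.12, 4000),
    ("H100 PCIe", 2.01, 1200),
    ("A100 80GB", 1.07, 520),
    ("H100 SXM (AWS)", 6.88, 1200),
]:
    print(f"{gpu}: ${usd_per_million_tokens(price, tok_s):.2f}/M tok")
```

Because throughput sits in the denominator, a faster GPU can be cheaper per token even at a higher hourly price.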

### Cost per GB HBM

| GPU | HBM | Price/hr cloud | $/GB/hr | Best for memory-bound workloads |
|-----|-----|-------------|--------|--------------------------------|
| **MI300X** | 192 GB | $1.50 | **$0.0078** | ✅ Best |
| **B200** | 192 GB | $3.39 | $0.0177 | ✅ Good |
| **H200** | 141 GB | $3.89 | $0.0276 | ⚠️ |
| **H100 SXM** | 80 GB | $1.38 | $0.0173 | ⚠️ Only up to 70B models |
| **GB200** | 384 GB | $3.50 | $0.0091 | ✅✅ (2× MI300X capacity) |
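
The $/GB/hr column is the hourly price divided by HBM capacity, which also gives a ranking for memory-bound workloads. A minimal sketch using the table's values; the names `OFFERS` and `usd_per_gb_hour` are illustrative:

```python
# (HBM capacity in GB, cheapest cloud price in $/hr), copied from the table
OFFERS = {
    "MI300X": (192, 1.50),
    "B200": (192, 3.39),
    "H200": (141, 3.89),
    "H100 SXM": (80, 1.38),
    "GB200": (384, 3.50),
}


def usd_per_gb_hour(hbm_gb: float, usd_per_hour: float) -> float:
    """Hourly cost of one gigabyte of accelerator memory."""
    return usd_per_hour / hbm_gb


# Cheapest memory first
ranking = sorted(OFFERS, key=lambda gpu: usd_per_gb_hour(*OFFERS[gpu]))
for gpu in ranking:
    print(f"{gpu}: ${usd_per_gb_hour(*OFFERS[gpu]):.4f}/GB/hr")
```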

### Price/performance by scenario

| Scenario | Winner | Rationale |
|----------|--------|-----------|
| **Absolute performance** (cost no object) | **GB200 DGX NVL72** | 72× GPU; per superchip (2 GPUs): ~18 PFLOPS FP8, 384 GB HBM |
| **Cloud inference** — best $/token | **B200 spot** | $0.15/M tok; ~3.3× H100 throughput at a similar hourly price |
| **Cloud inference** — on-demand | **B200** | $0.24/M tok at $3.39/hr and ~4,000 tok/s |
| **Cloud inference** — budget | **A100 / L40S** | $0.56–0.57/M tok |
| **Training** — price/perf on purchase | **Gaudi 3** | $8,515/PFLOPS, 2.5–3× better than H100 |
| **Training** — cloud | **H100 SXM** | $1.38/hr, CUDA ecosystem, NCCL |
| **Memory-bound** — long context, 70B+ | **MI300X / GB200** | 192–384 GB, $0.0078–0.0091/GB/hr |
| **Ecosystem + safe choice** | **H100/H200** | CUDA, widest SW, NVIDIA tools |
| **Spot / preemptible** — lowest cost | **A100 / H100** | $1.07–1.38/hr, 50–90% off on-demand |

### 2026 Trends

- **H100** — price dropped 64% from peak $8/hr to $1.38–2.89/hr, then 40% rebound from inference demand
- **B200** — new high-end, $3.39/hr cloud → ~$0.15/M tok on spot — new inference benchmark
- **MI300X** — supply growing (TensorWave, Vultr, CoreWeave, Oracle, Azure), from $1.50/hr
- **Gaudi 3** — best $/PFLOPS on purchase, but narrow ecosystem and limited cloud availability
- **Market bifurcation** — prior gen (H100, A100) commoditizing, new gen (B200, GB200) commanding premium

## See also

- [GPU.en.md](GPU.en.md) — GPU architecture, NVIDIA/AMD, vGPU, MIG
- [NETWORKING.en.md](NETWORKING.en.md) — InfiniBand, RoCE, network topology
- [STORAGE.en.md](STORAGE.en.md) — parallel filesystem, object store
- [DATACENTERS.en.md](DATACENTERS.en.md) — DC layout, power, cooling
- [CLOUD.en.md](CLOUD.en.md) — cloud AI services (SageMaker, Vertex AI)

## Sources

Links, books, and standards: [sources/infrastructure/sources.en.md](sources/infrastructure/sources.en.md)

*Last revision: 2026-06-18*