# 🧠 AI/ML Infrastructure

## Component overview

```mermaid
flowchart TD
    subgraph Compute
        GPU["GPU (H100/B200/Instinct)"]
        CPU["CPU (AMD EPYC / Intel Xeon)"]
        ASIC["ASIC (TPU, Trainium, Inferentia)"]
    end
    subgraph Network
        IB["InfiniBand NDR/XDR"]
        ROCE["RoCEv2"]
        NVL["NVLink / NVSwitch"]
    end
    subgraph Storage
        FS["Parallel FS (Lustre, GPFS, Weka)"]
        OBJ["Object Store (S3, MinIO)"]
        NVME["Local NVMe cache"]
    end
    subgraph Orchestration
        S["Slurm"]
        K["Kubernetes + Volcano/Kueue"]
    end
    subgraph Cooling
        DLC["Direct-to-chip liquid"]
        IMM["Immersion"]
        AIR["Air (high-density)"]
    end
    Compute --> Network --> Storage
    Orchestration --> Compute
    Cooling --> Compute
```

---

## GPU compute

### NVIDIA

| GPU | Architecture | FP8 | FP16/BF16 | FP64 | HBM | NVLink | TDP | Typical deployment |
|-----|-------------|-----|-----------|------|-----|--------|-----|------|
| **H100 SXM** | Hopper | 3,958 TFLOPS | 1,979 TFLOPS | 67 TFLOPS | 80 GB HBM3 | 900 GB/s | 700 W | 6–8× in DGX H100 |
| **H200 SXM** | Hopper (HBM3e) | 3,958 TFLOPS | 1,979 TFLOPS | 67 TFLOPS | 141 GB HBM3e | 900 GB/s | 700 W | 6–8× in DGX H200 |
| **B200** | Blackwell | ~9,000 TFLOPS | ~4,500 TFLOPS | ~40 TFLOPS | 192 GB HBM3e | 1,800 GB/s | 1,000 W | 6–8× in DGX B200 |
| **GB200 Grace Blackwell** | Blackwell | ~18,000 TFLOPS | ~9,000 TFLOPS | – | 192 GB + 480 GB (Grace) | NVLink-C2C | 1,000 W (GPU) + 500 W (CPU) | DGX GB200 (36× GPU) |
| **L40S** | Ada Lovelace | 733 TFLOPS | 367 TFLOPS | – | 48 GB GDDR6 | N/A | 350 W | Inference, enterprise |
| **A100 SXM** | Ampere | 1,248 TFLOPS | 624 TFLOPS | 19.5 TFLOPS | 80 GB HBM2e | 600 GB/s | 400 W | DGX A100 |

### AMD

| GPU | Architecture | FP8 | FP16/BF16 | FP64 | HBM | Infinity Fabric | TDP |
|-----|-------------|-----|-----------|------|-----|----------------|-----|
| **MI300X** | CDNA 3 | 2,615 TFLOPS | 1,307 TFLOPS | 81 TFLOPS | 192 GB HBM3 | 896 GB/s | 750 W |
| **MI250** | CDNA 2 | – | 383 TFLOPS | 95.7 TFLOPS | 128 GB HBM2e | 400 GB/s | 500 W |

### Intel

| GPU | Architecture | FP16/BF16 | FP32 | HBM | TDP |
|-----|-------------|-----------|------|-----|-----|
| **Gaudi 3** | Custom | 1,835 TFLOPS | – | 144 GB HBM2e | 600 W |
| **Max 1550** | Xe HPC | 600+ TFLOPS | 200 TFLOPS | 128 GB HBM2e | 600 W |

### Cloud ASIC

| ASIC | Provider | Use case | Performance |
|------|----------|----------|-------|
| **TPU v5p** | Google | Training | ~4,600 TFLOPS (BF16) per pod |
| **Trainium 2** | AWS | Training | ~1,000 TFLOPS (BF16) per chip |
| **Inferentia 2** | AWS | Inference | ~400 TOPS (INT8) per chip |
| **Maia 100** | Microsoft | Training + inference | Custom, 800 W TDP |

---

## AI networking

### Technology comparison

| Technology | Bandwidth per link | Latency | Topology | Use case |
|-------------|-------------------|---------|-----------|----------|
| **InfiniBand NDR200** | 200 Gb/s | < 1 µs | Fat-tree, Dragonfly+ | Training (NVIDIA) |
| **InfiniBand NDR400** | 400 Gb/s | < 1 µs | Fat-tree, Dragonfly+ | Training (NVIDIA) |
| **InfiniBand XDR** | 800 Gb/s (planned) | < 1 µs | Dragonfly+ | Next-gen training |
| **RoCEv2** (CX-7/8) | 200–400 Gb/s | 1–2 µs | Fat-tree, Spine-leaf | Training (AMD, Intel, open) |
| **NVLink 4.0** | 900 GB/s per GPU | < 0.5 µs | NVSwitch full-mesh | Intra-node GPU comm |
| **NVLink 5.0** | 1,800 GB/s per GPU | < 0.5 µs | NVSwitch full-mesh | Intra-node (Blackwell) |
| **Ethernet (400 GbE)** | 400 Gb/s | 2–5 µs | Spine-leaf | Inference, data pipeline |

### AI fabric principles

- **Rail-optimized topology**: each GPU communicates on a dedicated "rail" (GPUs with the same index across nodes connect to the same switch)
- **Fat-tree (Clos)**: standard for InfiniBand and RoCE, non-blocking bisection bandwidth
- **Dragonfly+**: reduces hop count while maintaining bandwidth (used in the largest clusters)
- **GPU Direct RDMA**: direct GPU ↔ GPU communication without CPU involvement; works over InfiniBand and RoCE
- **SHARP (Scalable Hierarchical Aggregation and Reduction Protocol)**: in-network reduction for AllReduce (InfiniBand only)

### Bandwidth sizing

```text
Rule of thumb: InfiniBand bandwidth ≥ 50 % GPU HBM bandwidth for scalable training

Example: H100 has 3.35 TB/s HBM
  → Needs min. 1.6 TB/s bisection bandwidth per GPU
  → 8× H100 in DGX: 4× NDR400 IB per GPU = 4 × 50 GB/s = 200 GB/s = ~6 % HBM
  → Reality: 8× 200 Gb/s per node (25 GB/s per GPU) in typical config = < 1 % HBM → bottleneck
```

---

## AI storage

### Requirements

| Dataset size | IO pattern | Recommended storage | Bandwidth |
|-------------|-----------|-------------------|-----------|
| < 10 TB | Sequential read (data loading) | Local NVMe | > 10 GB/s per node |
| 10–100 TB | Random read (checkpointing) | Parallel FS (Lustre, Weka) | > 100 GB/s cluster-wide |
| 100 TB–10 PB | Mixed (training + checkpoint) | Parallel FS + object store | > 500 GB/s |
| 10 PB+ | Multi-modal, video, LLM | Tiered (NVMe cache + parallel FS + object) | > 1 TB/s |

### Storage solution comparison

| Solution | Type | Bandwidth | Max capacity | Scaling | Use case |
|--------|-----|-------------------|-------------|-----------|----------|
| **Lustre** | Parallel FS (POSIX) | > 100 GB/s (cluster) | 100s PB | OST + MDS | HPC, LLM training (standard) |
| **GPFS / StorageScale** | Parallel FS (POSIX) | > 100 GB/s | 100s PB | NSD servers | HPC, AI (IBM) |
| **WekaFS** | Parallel FS (POSIX + NFS/SMB) | ~80 GB/s per 10 nodes | 10s PB | Container-native | AI/ML, NVIDIA DGX preferred |
| **VAST Data** | Universal storage (NVMe + QLC) | ~100 GB/s per cluster | 10s PB | Scale-out | AI, checkpoint, data lake |
| **Pure Storage//E** | All-flash (NVMe) | ~50 GB/s | ~30 PB | Scale-out | Enterprise AI, database |
| **MinIO / S3** | Object store | ~20 GB/s per gateway | EB | Erasure coding | Dataset repository, checkpoint |
| **NetApp AFF** | NAS + S3 | ~10 GB/s per controller | ~50 PB | HA pair | Enterprise, NFS baseline |

### Checkpointing strategies

| Strategy | RPO | Storage impact | Description |
|-----------|-----|---------------|-------|
| **Full checkpoint** | every N steps | High (stops training) | Full model + optimizer state |
| **Async checkpoint** | every N steps | Medium (non-blocking) | Copy to staging buffer, async write |
| **Distributed checkpoint** (NVIDIA NeMo) | every N steps | Low | Each rank writes its own shard |
| **In-memory checkpoint** (IBM) | on failover | Minimal (DRAM) | Replication to another node's DRAM |
| **Continuous checkpoint** (Microsoft) | every 1–5 min | Low (delta) | Changed shards only |

---

## AI cluster architecture

### Physical topology: DGX H100 example

```
┌──────── DGX H100 (8× GPU) ────────┐
│ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐   │
│ │GPU 0│ │GPU 1│ │GPU 2│ │GPU 3│   │
│ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘   │
│ ┌──┴──┐ ┌──┴──┐ ┌──┴──┐ ┌──┴──┐   │
│ │GPU 4│ │GPU 5│ │GPU 6│ │GPU 7│   │
│ └─────┘ └─────┘ └─────┘ └─────┘   │
│ NVSwitch (NVLink 4.0, 900 GB/s)   │
│ InfiniBand CX-7: 8× NDR400        │
└───────────────────────────────────┘
             │ 8× IB rails
        ┌────┴───────────────┐
        │ IB NDR400 Switches │ (rail-optimized)
        └────────────────────┘
```

### Kubernetes for AI

| Component | Role |
|-----------|------|
| **Volcano** | Batch scheduling, gang scheduling, queue management |
| **Kueue** | Multi-tenant admission, resource quotas, fair sharing |
| **NVIDIA GPU Operator** | Driver, container toolkit, MIG, DCGM, monitoring |
| **HAMi** (formerly k8s-vGPU-scheduler) | GPU sharing, MIG partitioning, fractional GPU |
| **Node Feature Discovery** | GPU type detection, NUMA topology |
| **Topology Manager** | NUMA-aware pod placement |
| **DPDK / SR-IOV** | High-performance networking for GPU Direct RDMA |

### Slurm for AI

| Component | Role |
|-----------|------|
| **slurm.conf** | Partition for GPU nodes, GRES (Generic Resource) |
| **gres.conf** | GPU type, GPU count per node |
| **srun --gres=gpu:8** | Allocate 8 GPUs per node |
| **sbatch --nodes=64 --ntasks=512** | 64 nodes, 512 ranks (8 GPU/node) |
| **Pyxis** | NVIDIA container plugin for Slurm (used together with Enroot) |

---

## AI cluster cooling

### Power density comparison

| Configuration | TDP per node | Racks | kW/rack | Note |
|-------------|-------------|-------|---------|----------|
| Standard server (2U) | 1 kW | 20 | 5–10 | Typical DC |
| GPU server (DGX H100, 6×) | 42 kW | 6 | 45–50 | Air cooling limit |
| GPU server (DGX B200, 6×) | 72 kW | 6 | 90–100 | Liquid cooling required |
| GPU server (GB200 NVL72) | 120 kW | – | ~120 | Liquid cooling mandatory |
| NVIDIA NVL72 rack | 120 kW | 1 | 120 | Fully liquid cooled |

### Cooling technologies

| Method | Max kW/rack | CAPEX | OPEX | Complexity |
|--------|-------------|-------|------|-----------|
| **Air cooling (CRAC/CRAH)** | < 15 | Low | Medium | Low |
| **Air cooling (in-row)** | 15–30 | Medium | Medium | Low |
| **Rear-door heat exchanger** | 30–50 | Medium | Low | Medium |
| **Direct-to-chip liquid (cold plate)** | 50–150 | High | Low | High |
| **Immersion (single-phase)** | 100–200 | High | Low | High |
| **Immersion (two-phase)** | 200+ | Very high | Low | Very high |

---

## Inference infrastructure

### Inference server comparison

| Tool | Frameworks | Optimization | Use case |
|---------|-----------|-------------|----------|
| **vLLM** | Megatron, HF, AWQ, GPTQ | PagedAttention, KV cache, continuous batching | LLM inference (open source) |
| **TensorRT-LLM** | TensorRT | INT4/INT8/FP8, inflight batching, attention optimizations | Production (NVIDIA) |
| **Triton Inference Server** | All (TensorRT, vLLM, PyTorch) | Model ensemble, model caching, concurrent execution | Enterprise, multi-model |
| **SageMaker** | Managed | Auto-scaling, model parallelism | AWS managed |
| **OpenAI API / TGI** | HF Transformers | Continuous batching, flash attention | Hosting |

### Inference optimization

| Technique | Latency improvement | Throughput improvement | Memory reduction |
|----------|-----------------|---------------------|------------------|
| **FP8/INT8 quantization** | – | 2× | 2× |
| **INT4 quantization** | – | 4× | 4× |
| **Flash Attention 2/3** | 2–4× | – | 50 % (KV cache) |
| **PagedAttention** | – | 2–5× | 95 % (KV cache fragmentation) |
| **Continuous batching** | – | 10–20× | – |
| **Speculative decoding** | 2–3× | – | – |
| **Multi-LoRA / S-LoRA** | – | 8–16× | – |

---

## Distributed training techniques

| Technique | Description | Frameworks |
|----------|-------|------------|
| **Data Parallelism (DDP/FSDP)** | Each GPU holds a model copy (sharded under FSDP) and processes a different batch | PyTorch DDP, FSDP |
| **Tensor Parallelism (TP)** | Weight matrices inside each layer split across GPUs (typically intra-node) | Megatron-LM, DeepSpeed |
| **Pipeline Parallelism (PP)** | Consecutive layers grouped into stages placed on different GPUs/nodes | Megatron-LM, DeepSpeed |
| **Sequence Parallelism (SP)** | Sequence split across GPUs | Megatron-LM |
| **Expert Parallelism (EP)** | Different expert subnets on different GPUs | Mixture-of-Experts (MoE) |
| **3D Parallelism** | TP + PP + DP combination | Megatron-LM, NeMo |
| **ZeRO (1/2/3)** | Optimizer/gradient/parameter sharding | DeepSpeed |
| **NCCL / RCCL** | GPU collective communication library | NVIDIA/AMD |

---

## Operating systems for AI

### Distribution comparison

| OS | GPU driver | CUDA | Container toolkit | IB/RoCE | Lustre client | Production support |
|----|-----------|------|-------------------|---------|--------------|-------------------|
| **Ubuntu 22.04 LTS** | NVIDIA 525+ | 12.x | nvidia-container-toolkit | MLNX_OFED, rdma-core | Yes (lustre-client) | NVIDIA DGX standard |
| **Ubuntu 24.04 LTS** | NVIDIA 550+ | 12.5+ | nvidia-container-toolkit | MLNX_OFED, rdma-core | Yes | Latest GPU support |
| **RHEL 9 / Rocky 9** | NVIDIA 525+ | 12.x | nvidia-container-toolkit | MLNX_OFED | Yes (EL repo) | Red Hat, enterprise |
| **DGX OS** (Ubuntu-based) | NVIDIA custom | 12.x | Pre-installed | Pre-configured | Yes | Supported on NVIDIA DGX only |
| **SLES 15 SP5** | NVIDIA 525+ | 12.x | nvidia-container-toolkit | MLNX_OFED | Yes | HPC, some Lustre clusters |
| **Debian 12** | NVIDIA 525+ | 12.x | nvidia-container-toolkit | rdma-core | Yes (backports) | Community, research |
| **Flatcar / Bottlerocket** | Container-host | – | nvidia-container-toolkit | Limited | No | K8s-only, minimal footprint |

### Limitations and constraints

#### GPU drivers and CUDA

| Constraint | Detail |
|----------|--------|
| **Driver-CUDA compatibility** | The NVIDIA driver must be at least the minimum version required by the CUDA toolkit (driver ≥ CUDA req). E.g., CUDA 12.5 requires driver ≥ 550 |
| **Kernel version** | The NVIDIA driver is not compatible with every kernel. A new kernel (6.8+) may require a DKMS build or wait for delayed support |
| **Secure Boot** | The NVIDIA driver requires a signed module (MOK, shim) or Secure Boot disabled; a common enterprise issue |
| **Open vs Proprietary driver** | NVIDIA `nvidia-open` (since R515): open source kernel module. GPU support: DC (H100+) → OK, older GPUs → proprietary required |
| **nvidia-persistenced** | Required to keep GPUs initialized; without it GPUs may sleep after idle timeout (`nvidia-smi -pm 1`) |
| **GPU reset** | After a crashed training job, the GPU may hang. Use `nvidia-smi --gpu-reset` or reboot the node; sometimes a power cycle is needed |
| **Multi-instance GPU (MIG)** | Requires a specific driver, MIG mode enabled on the GPU, and a GPU restart. Cannot be changed at runtime. A100, H100, B200 only |

#### Network (InfiniBand / RoCE)

| Constraint | Detail |
|----------|--------|
| **MLNX_OFED vs rdma-core** | MLNX_OFED (NVIDIA): full feature support, but ships its own kernel modules and needs a matching kernel version. `rdma-core` (open): limited feature support, no custom modules |
| **Kernel compatibility** | MLNX_OFED supports only specific kernel versions (major.minor). Kernel upgrade → MLNX_OFED rebuild required |
| **NCCL** | NCCL version must be compatible with CUDA and IB firmware. Use `nccl-tests` for validation |
| **SHARP** | In-network reduction requires a specific MLNX_OFED + IB switch firmware combination |
| **GPU Direct RDMA** | Requires the `nvidia-peermem` module + MLNX_OFED. Does not work with all GPU and IB card combinations |
| **RoCE PFC/ECN** | RoCE requires a lossless fabric (PFC, ECN, DCQCN). Switch and host configuration need complex tuning |

#### Storage

| Constraint | Detail |
|----------|--------|
| **Lustre client** | Client version must match the server. Server upgrade → upgrade all clients. Compatible with RHEL/Debian derivatives only |
| **POSIX locking** | NFS and Lustre have different POSIX locking behavior. Distributed training that relies on flock → problematic with mixed FS |
| **Filesystem cache** | Page cache can mask IO bottlenecks. Training jobs often require `O_DIRECT` or sync IO |
| **Local NVMe vs parallel FS** | Dataset staging on local NVMe eliminates the network dependency but requires space and a pre-fetch pipeline |

#### Container runtime

| Constraint | Detail |
|----------|--------|
| **Docker + GPU** | `nvidia-container-toolkit` (formerly nvidia-docker2). Requires runtime installation and config in `/etc/docker/daemon.json` |
| **Podman + GPU** | Requires `nvidia-container-toolkit` + podman hook. Less tested than Docker |
| **containerd + GPU** | Standard for K8s. Requires `cdi` (Container Device Interface) or `nvidia-container-runtime` |
| **Enroot + Pyxis** | NVIDIA container stack for Slurm (Enroot = daemonless container runtime, Pyxis = Slurm plugin) |
| **User namespace mapping** | Container GPU access requires device cgroup; rootless may fail (exception for /dev/dri and /dev/nvidia*) |

#### Kernel parameters

```text
# AI workload recommended sysctl
net.core.rmem_max = 134217728          # sufficient for NCCL
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
net.core.netdev_budget = 600           # for high packet rate
vm.max_map_count = 1048576             # PyTorch DataLoader workers
kernel.numa_balancing = 0              # disable NUMA balancing (breaks locality)
kernel.sched_min_granularity_ns = 10000000  # sysctl on older kernels only (moved to debugfs in 5.13+)

# Kernel command-line parameters (set in the boot loader, not via sysctl)
# Disable security mitigations for perf (dedicated AI clusters only)
mitigations=off
transparent_hugepage=never             # or madvise; THP may cause latency spikes
intel_idle.max_cstate=1                # reduce C-state transition latency
```

#### Firmware and HW

| Constraint | Detail |
|----------|--------|
| **GPU firmware (VBIOS)** | NVIDIA datacenter GPUs (H100, B200) receive VBIOS updates via NVFlash. Without update → missing partitioning support or newer CUDA features |
| **InfiniBand firmware** | IB switch and HCA firmware must be compatible. Mixing an old switch with a new HCA → degraded perf |
| **NVSwitch firmware** | On DGX systems, NVSwitch firmware is updatable only via NVIDIA DGX tools |
| **Power capping (nvidia-smi)** | `nvidia-smi -pl`: limit TDP for power budget management. Test the impact on training throughput |
| **GPU clock locking** | `nvidia-smi -ac`: locked clock frequency for stable benchmarks. Apply after `nvidia-persistenced` |
| **PCIe Gen** | GPU in a PCIe Gen4 slot (instead of Gen5) → bottleneck for CPU↔GPU data transfer. Important for FSDP sharding |

### Recommended OS per use case

| Use case | OS | Rationale |
|----------|-----|-------|
| **DGX cluster (production)** | DGX OS / Ubuntu 22.04 LTS | NVIDIA standard, best driver support |
| **Enterprise K8s (OpenShift)** | RHEL 9 / RHCOS | Red Hat support, GPU Operator compatible |
| **Vanilla K8s (on-prem)** | Ubuntu 22.04 LTS + Flatcar (workers) | Widest community support, Flatcar for minimal footprint |
| **Slurm cluster (HPC/AI)** | Rocky Linux 9 / Ubuntu 22.04 LTS | EL ecosystem (Lustre, OFED) or Ubuntu (community) |
| **Research / rapid prototyping** | Ubuntu 24.04 LTS | Latest CUDA, PyTorch, driver support |
| **Edge inference** | NVIDIA JetPack / Ubuntu (ARM) | Embedded GPU (Jetson Orin, AGX) |

---

## AI-ready data center checklist

| Area | Requirement |
|--------|-----------|
| **Power** | 30–120 kW/rack, HVDC (400 V DC), UPS supporting GPU spikes |
| **Cooling** | Liquid cooling ready (direct-to-chip), rear-door for 30+ kW |
| **Network** | InfiniBand (NDR/XDR) or RoCEv2, rail-optimized fat-tree |
| **Storage** | Parallel FS (Lustre/Weka), checkpoint bandwidth > 100 GB/s |
| **GPU density** | Max GPU/rack, minimize NVSwitch hops |
| **Physical** | Floor load 1,500+ kg/m², rack 52U–60U |
| **Security** | Tenant isolation, network segmentation, data encryption |
| **Monitoring** | DCGM, NCCL health checks, thermals, power capping |

---

## Model and throughput limitations

### Model size per GPU

The maximum model size that fits on a single GPU depends on HBM capacity and precision:

| GPU | HBM | FP32 | FP16/BF16 | INT8 | INT4 |
|-----|-----|------|-----------|------|------|
| **H100 80GB** | 80 GB | ~10B | ~40B | ~80B | ~160B |
| **H200 141GB** | 141 GB | ~18B | ~70B | ~140B | ~280B |
| **B200 192GB** | 192 GB | ~24B | ~96B | ~192B | ~384B |
| **MI300X 192GB** | 192 GB | ~24B | ~96B | ~192B | ~384B |
| **A100 80GB** | 80 GB | ~10B | ~40B | ~80B | ~160B |
| **GB200 (192+480)** | 192 GB GPU + 480 GB Grace | – | ~96B + CPU offload | – | – |

*Approximate: 1B params ≈ 2 GB FP16 ≈ 4 GB FP32 ≈ 1 GB INT8 ≈ 0.5 GB INT4. Subtract ~10–15 % HBM for activations, KV cache, optimizer states.*

### Memory breakdown inference

| Component | Llama 3 70B (FP16) | Llama 3 8B (FP16) |
|------------|-------------------|-------------------|
| Model weights | 140 GB | 16 GB |
| KV cache (4K context, batch 1) | ~2 GB | ~0.2 GB |
| KV cache (128K context, batch 1) | ~60 GB | ~6.5 GB |
| Activations (peak) | ~5 GB | ~1 GB |
| **Total 4K ctx** | ~147 GB | ~17 GB |
| **Total 128K ctx** | ~205 GB | ~23 GB |

**Conclusion:** Llama 3 70B in FP16 (~140 GB of weights) does not fit on a single H100 (80 GB). Options: INT8 (~70 GB of weights, too tight on one 80 GB GPU once KV cache is added → 2× H100 or 1× H200), INT4 (~35 GB of weights → 1× H100), or tensor parallelism across several GPUs.

### Context length vs memory

| Context | KV cache 70B (FP16) | KV cache 8B (FP16) | Note |
|---------|-------------------|-------------------|------|
| 4K | ~2.2 GB | ~0.25 GB | Typical chat |
| 32K | ~18 GB | ~2 GB | Documents |
| 128K | ~72 GB | ~8 GB | Long-context (Claude, Gemini) |
| 1M | ~560 GB | ~64 GB | Experimental (Gemini 1.5 Pro) |

KV cache size is **linear in context length**, and also linear in batch size, layer count, and KV head count. It is the critical factor for long-context inference.

### Throughput inference

| Model | GPU | Precision | Batch size | Tokens/s | QPS (1K output) |
|-------|-----|-----------|-----------|----------|-----------------|
| Llama 3 8B | H100 | FP16 | 1 | ~800 | ~0.8 |
| Llama 3 8B | H100 | FP16 | 128 | ~4,500 | ~35 |
| Llama 3 8B | H100 | INT4 | 128 | ~8,000 | ~62 |
| Llama 3 70B | 4× H100 | FP16 | 1 | ~180 | ~0.18 |
| Llama 3 70B | 4× H100 | INT4 | 64 | ~1,200 | ~19 |
| Llama 3 70B | 8× H100 | FP16 (TP=8) | 128 | ~2,500 | ~20 |
| DeepSeek-R1 671B | 8× H200 | FP8 (MoE) | 64 | ~500 | ~8 |
| GPT-4 class (est.) | – | – | – | ~100–300 | ~1–3 |

**Notes:**

- QPS (queries per second) depends on output length (1K tokens ≈ ~1 query)
- A larger batch increases throughput but also increases TTFT (time to first token)
- Tensor Parallelism (TP) scales, but communication overhead grows linearly with the TP degree

### Training limits

#### Scaling efficiency

| GPU count | Model | Efficiency | Reason |
|-----------|-------|-----------|-------|
| 8 (1 node) | Llama 3 8B | ~95 % | NVLink intra-node |
| 64 (8 nodes) | Llama 3 8B | ~85 % | IB inter-node |
| 512 (64 nodes) | Llama 3 70B | ~75 % | Communication overhead |
| 4,096 (512 nodes) | Llama 3 70B | ~60 % | Pipeline bubble, network |
| 16,384 (2,048 nodes) | Llama 3 405B | ~45 % | Synchronous SGD overhead |

**Note:** Efficiency = (actual throughput) / (ideal linearly scaled throughput). In this table it decreases roughly logarithmically with GPU count.

#### Memory breakdown training

| Component | Llama 3 70B (BF16) | Llama 3 8B (BF16) |
|------------|-------------------|-------------------|
| Model weights | 140 GB | 16 GB |
| Optimizer states (Adam) | 280 GB | 32 GB |
| Gradients | 140 GB | 16 GB |
| Activations (peak) | ~30 GB | ~4 GB |
| **Total (DDP)** | ~590 GB | ~68 GB |
| **Total (FSDP shard=8)** | ~74 GB | ~8.5 GB |

**Conclusion:** FSDP (Fully Sharded Data Parallel) or comparable sharding is effectively required for training models > 10B. With Adam, training needs roughly 4× the weight memory of inference (weights + gradients + optimizer states), before activations.

#### Time to train

| Model | GPU count | GPU type | Precision | Time | Cost (on-prem estimate) |
|-------|-----------|---------|-----------|------|---------------------|
| Llama 3 8B | 64 | H100 | BF16 | ~3 days | ~$5,000 |
| Llama 3 70B | 512 | H100 | BF16 | ~14 days | ~$100,000 |
| Llama 3 405B | 16,384 | H100 | BF16 | ~60 days | ~$14 M |
| DeepSeek-R1 671B (MoE) | 2,048 | H800 | BF16 | ~30 days | ~$6 M |
| GPT-4 (est.) | 25,000 | A100/H100 | Mixed | ~90–100 days | ~$100 M |

### Power and thermal limits

| Configuration | TDP limit | Throughput loss | Reason |
|-------------|-----------|------------------|--------|
| H100 SXM | 700 W (default) | 0 % | Nominal |
| H100 SXM | 600 W (-15 %) | ~5–8 % | Power capping |
| H100 SXM | 500 W (-30 %) | ~15–25 % | Significant throttling |
| H100 SXM | 400 W (-43 %) | ~30–50 % | Emergency only |
| DGX H100 (8×) | 5.6 kW (max) | 0 % | Liquid cooling required |
| DGX H100 (8×) | 4.5 kW (air) | ~10–15 % | Rear-door heat exchanger |

The GPU throttles when it exceeds its power limit or temperature threshold (85 °C+). Power capping correlates linearly with frequency but non-linearly with throughput.

### API and operational limits

| Limit | Description | Typical value |
|-------|-------|-----------------|
| **Rate limit** | Max requests per minute/hour | 100–10,000 RPM (per tier) |
| **Tokens per minute (TPM)** | Max tokens per minute | 1M–300M (per model) |
| **Context window** | Max input tokens | 4K–2M (per model) |
| **Max output tokens** | Max generated tokens | 4K–32K (per model) |
| **Concurrent requests** | Parallel request count | 10–10,000 (per backend) |
| **Batch window** | Time to accumulate batch | 0–20 s (vLLM, TGI) |
| **TTFT timeout** | Max latency to first token | 30–120 s |
| **Idle timeout** | GPU idle → scale to 0 | 5–15 min (cloud) |

### Limits per deployment model

| Dimension | On-prem HW | Managed cloud (SageMaker, Vertex) | API (OpenAI, Anthropic) |
|-----------|--------------|----------------------------------|------------------------|
| **Model size** | Limited by HBM (max 192 GB/GPU) | Unlimited (cluster scaling) | Unlimited |
| **Queries** | Limited by GPU count | Auto-scaling | Rate limit (per tier) |
| **Latency** | < 10 ms (same node) | 10–100 ms (network hop) | 100 ms – 10 s |
| **Customization** | Full (fine-tuning, quantization) | Managed (SageMaker, Bedrock) | Prompt engineering only |
| **Data privacy** | Yes (on-prem) | Contractual (region, encryption) | Limited |
| **Cost per 1M tokens** | ~$0.10–0.50 (FP16 inference) | ~$0.20–1.00 | ~$0.15–15.00 |
| **Max context** | 128K+ (depending on GPU count) | 128K+ | 32K–2M |
| **Cold start** | 0 (always-on) | 30 s – 5 min | 0 (shared infra) |

---

## GPU pricing and price/performance (2026)

> Prices are approximate: NVIDIA does not publish official datacenter GPU price lists. Cloud prices are from public providers (Q2 2026). HW purchase prices vary by volume, reseller, and region.

### Purchase price (buy)

| GPU | Price/GPU | Price 8× GPU baseboard | $/PFLOPS (FP16) | Note |
|-----|---------|----------------------|----------------|------|
| **H100 SXM** | $27,000–40,000 | ~$200,000 | $25,000 | Scarcity 2023–2024, now stabilized |
| **H200 SXM** | $35,000–50,000 | ~$280,000 | ~$35,000 | H100 upgrade, HBM3e |
| **B200** | ~$60,000–70,000 | ~$500,000+ | ~$31,000 | Blackwell, FP4 support |
| **B100** | ~$30,000 | ~$240,000 | ~$20,000 | Lower price than B200, similar FP8 perf |
| **GB200** (Grace+Blackwell) | ~$70,000–100,000 | ~$2,000,000 (rack) | – | CPU+GPU unified, high-density |
| **A100 80GB** | ~$10,000–15,000 | ~$120,000 | ~$19,200 | Previous gen, still relevant |
| **MI300X** | ~$12,000–18,000 | ~$100,000 | ~$9,600 | AMD, 192 GB HBM3 |
| **Gaudi 3** | ~$15,625 | ~$125,000 | **$8,515** | Intel, best $/PFLOPS |
| **L40S** | ~$8,000–10,000 | – | – | Inference, enterprise |

### Cloud pricing (on-demand $/GPU/hr)

| GPU | Cheapest | Mid-range (CoreWeave, Lambda) | Hyperscaler (AWS, GCP, Azure) |
|-----|----------|-----------------------------|-------------------------------|
| **H100 SXM** | $1.38 (Thunder) | $2.89–3.29 | $4.15–6.88 |
| **H100 PCIe** | $2.01 (Spheron) | $2.50 | – |
| **H200 SXM** | $3.89 (Spheron) | $4.54 | $5.00+ |
| **B200** | **$3.39** (Spheron) | $6.02 | $14.24 (AWS) |
| **B200 spot** | **$2.12** (Spheron) | – | – |
| **GB200** | $3.50 (Runcrate) | $5.85 (Oracle) | $6.95 (GCP) |
| **MI300X** | **$1.50** (TensorWave) | $1.85 (Vultr) | $7.86 (Azure) |
| **A100 80GB** | $1.07 (Spheron) | $1.50–2.00 | $3.00+ |
| **Gaudi 3** | ~$1.50–2.50 | – | – |
| **L40S** | $0.91 (Spheron) | $1.50–2.00 | – |

### Inference cost ($/M tokens)

| GPU | Provider | $/hr | Est. tok/s | $/M tok |
|-----|----------|------|-----------|--------|
| **B200** | Spheron | $3.39 | ~4,000 | **$0.24** |
| **B200 spot** | Spheron | $2.12 | ~4,000 | **$0.15** |
| **H100 PCIe** | Spheron | $2.01 | ~1,200 | $0.47 |
| **A100 80GB** | Spheron | $1.07 | ~520 | $0.57 |
| **H100 SXM** | AWS | $6.88 | ~1,200 | $1.59 |
| **H200 SXM** | Spheron | $4.54 | ~1,800 | $0.70 |
| **L40S** | Spheron | $0.91 | ~450 | $0.56 |

*Values for Llama 3 70B (INT8, batch=1, output 1K tok), computed as $/hr ÷ (tok/s × 3,600) × 1,000,000. Actual values vary by batch size, context, and quantization.*

### Cost per GB HBM

| GPU | HBM | Price/hr cloud | $/GB/hr | Best for memory-bound workloads |
|-----|-----|-------------|--------|--------------------------------|
| **MI300X** | 192 GB | $1.50 | **$0.0078** | ✅ Best |
| **B200** | 192 GB | $3.39 | $0.0177 | ✅ Good |
| **H200** | 141 GB | $3.89 | $0.0276 | ⚠️ |
| **H100 SXM** | 80 GB | $1.38 | $0.0173 | ⚠️ Only up to 70B models |
| **GB200** | 384 GB | $3.50 | $0.0091 | ✅✅ (2× MI300X capacity) |

### Price/performance by scenario

| Scenario | Winner | Rationale |
|----------|--------|-----------|
| **Absolute performance** (cost no object) | **GB200 DGX NVL72** | 72× GPU, 18 PFLOPS FP8, 384 GB HBM/GPU |
| **Cloud inference**, best $/token | **B200 spot** | $0.15/M tok; 4× H100 throughput at lower cost |
| **Cloud inference**, on-demand | **B200** | $0.24/M tok |
| **Cloud inference**, budget | **A100 / L40S** | $0.56–0.57/M tok |
| **Training**, price/perf on purchase | **Gaudi 3** | $8,515/PFLOPS, 2.5–3× better than H100 |
| **Training**, cloud | **H100 SXM** | $1.38/hr, CUDA ecosystem, NCCL |
| **Memory-bound**, long context, 70B+ | **MI300X / GB200** | 192–384 GB, $0.0078–0.0091/GB |
| **Ecosystem + safe choice** | **H100/H200** | CUDA, widest SW, NVIDIA tools |
| **Spot / preemptible**, lowest cost | **A100 / H100** | $1.07–1.38/hr, 50–90% off on-demand |

### 2026 Trends

- **H100**: price dropped 64% from the peak of $8/hr to $1.38–2.89/hr, then rebounded 40% on inference demand
- **B200**: new high-end, $3.39/hr cloud → ~$0.15/M tok on spot; the new inference benchmark
- **MI300X**: supply growing (TensorWave, Vultr, CoreWeave, Oracle, Azure), from $1.50/hr
- **Gaudi 3**: best $/PFLOPS on purchase, but narrow ecosystem and limited cloud availability
- **Market bifurcation**: prior gen (H100, A100) commoditizing, new gen (B200, GB200) commanding a premium

## Related documents

- [GPU.en.md](GPU.en.md): GPU architecture, NVIDIA/AMD, vGPU, MIG
- [NETWORKING.en.md](NETWORKING.en.md): InfiniBand, RoCE, network topology
- [STORAGE.en.md](STORAGE.en.md): parallel filesystem, object store
- [DATACENTERS.en.md](DATACENTERS.en.md): DC layout, power, cooling
- [CLOUD.en.md](CLOUD.en.md): cloud AI services (SageMaker, Vertex AI)

## Sources

Links, books, and standards: [sources/infrastructure/sources.en.md](sources/infrastructure/sources.en.md)

*Last revision: 2026-06-18*
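
As a rough illustration of the arithmetic behind the "Model size per GPU" and "Inference cost ($/M tokens)" tables, the following Python sketch applies the bytes-per-parameter rule of thumb and converts an hourly GPU price into a per-token cost. The function and constant names are hypothetical and not part of any tool referenced in this document; the estimate covers weights only and assumes a steady token rate.

```python
"""Back-of-envelope sizing helpers (illustrative sketch, not a benchmark)."""

# Bytes per parameter, following the rule of thumb stated in this document:
# 1B params is about 4 GB FP32, 2 GB FP16/BF16, 1 GB INT8, 0.5 GB INT4.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB (weights only, no KV cache or activations)."""
    return params_billion * BYTES_PER_PARAM[precision.lower()]


def usd_per_million_tokens(usd_per_hour: float, tokens_per_second: float) -> float:
    """Cost of generating one million tokens at a steady token rate."""
    tokens_per_hour = tokens_per_second * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Inputs taken from the tables in this document.
    print(f"Llama 3 70B FP16 weights: {weights_gb(70, 'fp16'):.0f} GB")
    cost = usd_per_million_tokens(2.01, 1200)
    print(f"H100 PCIe at $2.01/hr and 1,200 tok/s: ${cost:.2f} per M tokens")
```

Keeping the two helpers separate makes it easy to test other price or throughput assumptions; KV cache, activations, and batching effects are deliberately left out and have to be added on top.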