knowledge-base/AI-INFRASTRUCTURE.en.md
Stanislav Hubacek ef3c2f75b1 18.6.2026
2026-06-18 16:25:33 +02:00


🧠 AI/ML Infrastructure

Component overview

flowchart TD
    subgraph Compute
        GPU["GPU (H100/B200/Instinct)"]
        CPU["CPU (AMD EPYC / Intel Xeon)"]
        ASIC["ASIC (TPU, Trainium, Inferentia)"]
    end
    subgraph Network
        IB["InfiniBand NDR/XDR"]
        ROCE["RoCEv2"]
        NVL["NVLink / NVSwitch"]
    end
    subgraph Storage
        FS["Parallel FS (Lustre, GPFS, Weka)"]
        OBJ["Object Store (S3, MinIO)"]
        NVME["Local NVMe cache"]
    end
    subgraph Orchestration
        S["Slurm"]
        K["Kubernetes + Volcano/Kueue"]
    end
    subgraph Cooling
        DLC["Direct-to-chip liquid"]
        IMM["Immersion"]
        AIR["Air (high-density)"]
    end

    Compute --> Network --> Storage
    Orchestration --> Compute
    Cooling --> Compute

GPU compute

NVIDIA

| GPU | Architecture | FP8 | FP16/BF16 | FP64 | HBM | NVLink | TDP | Rack config |
|---|---|---|---|---|---|---|---|---|
| H100 SXM | Hopper | 3,958 TFLOPS | 1,979 TFLOPS | 67 TFLOPS | 80 GB HBM3 | 900 GB/s | 700 W | 8× in DGX H100 |
| H200 SXM | Hopper (HBM3e) | 3,958 TFLOPS | 1,979 TFLOPS | 67 TFLOPS | 141 GB HBM3e | 900 GB/s | 700 W | 8× in DGX H200 |
| B200 | Blackwell | ~9,000 TFLOPS | ~4,500 TFLOPS | ~40 TFLOPS | 192 GB HBM3e | 1,800 GB/s | 1,000 W | 8× in DGX B200 |
| GB200 | Grace + Blackwell | ~18,000 TFLOPS | ~9,000 TFLOPS | | 192 GB + 480 GB (Grace) | NVLink-C2C | 1,000 W (GPU) + 500 W (CPU) | GB200 NVL72 (36 superchips, 72 GPUs) |
| L40S | Ada Lovelace | 733 TFLOPS | 367 TFLOPS | | 48 GB GDDR6 | N/A | 350 W | Inference, enterprise |
| A100 SXM | Ampere | N/A (INT8: 1,248 TOPS) | 624 TFLOPS | 19.5 TFLOPS | 80 GB HBM2e | 600 GB/s | 400 W | DGX A100 |

AMD

| GPU | Architecture | FP8 | FP16/BF16 | FP64 | HBM | Infinity Fabric | TDP |
|---|---|---|---|---|---|---|---|
| MI300X | CDNA 3 | 2,615 TFLOPS | 1,307 TFLOPS | 81 TFLOPS | 192 GB HBM3 | 896 GB/s | 750 W |
| MI250 | CDNA 2 | | 383 TFLOPS | 95.7 TFLOPS | 128 GB HBM2e | 400 GB/s | 500 W |

Intel

| GPU | Architecture | FP16/BF16 | FP32 | HBM | TDP |
|---|---|---|---|---|---|
| Gaudi 3 | Custom | 1,835 TFLOPS | | 128 GB HBM2e | 600 W |
| Max 1550 | Xe HPC | 600+ TFLOPS | 200 TFLOPS | 128 GB HBM2e | 600 W |

Cloud ASIC

| ASIC | Provider | Use case | Performance |
|---|---|---|---|
| TPU v5p | Google | Training | ~459 TFLOPS (BF16) per chip |
| Trainium 2 | AWS | Training | ~1,000 TFLOPS (BF16) per chip |
| Inferentia 2 | AWS | Inference | ~400 TOPS (INT8) per chip |
| Maia 100 | Microsoft | Training + inference | Custom, 800 W TDP |

AI networking

Technology comparison

| Technology | Bandwidth per link | Latency | Topology | Use case |
|---|---|---|---|---|
| InfiniBand NDR200 | 200 Gb/s | < 1 µs | Fat-tree, Dragonfly+ | Training (NVIDIA) |
| InfiniBand NDR400 | 400 Gb/s | < 1 µs | Fat-tree, Dragonfly+ | Training (NVIDIA) |
| InfiniBand XDR | 800 Gb/s (planned) | < 1 µs | Dragonfly+ | Next-gen training |
| RoCEv2 (CX-7/8) | 200–400 Gb/s | 1–2 µs | Fat-tree, Spine-leaf | Training (AMD, Intel, open) |
| NVLink 4.0 | 900 GB/s per GPU | < 0.5 µs | NVSwitch full-mesh | Intra-node GPU comm |
| NVLink 5.0 | 1,800 GB/s per GPU | < 0.5 µs | NVSwitch full-mesh | Intra-node (Blackwell) |
| Ethernet (400 GbE) | 400 Gb/s | 2–5 µs | Spine-leaf | Inference, data pipeline |

AI fabric principles

  • Rail-optimized topology — each GPU communicates on dedicated "rails" (same GPU indices across nodes connect to the same switch)
  • Fat-tree (Clos) — standard for InfiniBand and RoCE, non-blocking bisection bandwidth
  • Dragonfly+ — reduces hop count while maintaining bandwidth (used in largest clusters)
  • GPU Direct RDMA — direct GPU ↔ GPU communication without CPU involvement, supports InfiniBand and RoCE
  • SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) — in-network reduction for AllReduce (InfiniBand only)
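
The rail-optimized rule from the list above can be sketched in a few lines. This is a minimal illustration, assuming one NIC per GPU and one leaf switch per rail; the function name `rail_map` and the `leaf-N` switch labels are hypothetical.

```python
from collections import defaultdict


def rail_map(num_nodes: int, gpus_per_node: int = 8) -> dict[str, list[tuple[int, int]]]:
    """Group (node, gpu) endpoints by rail: GPU index i on every node uses leaf switch i."""
    rails = defaultdict(list)
    for node in range(num_nodes):
        for gpu in range(gpus_per_node):
            # Same GPU index across nodes -> same switch (one "rail")
            rails[f"leaf-{gpu}"].append((node, gpu))
    return dict(rails)


rails = rail_map(num_nodes=4)
for switch, endpoints in rails.items():
    print(switch, endpoints)
```

With this wiring, a collective between GPUs of the same index on different nodes stays on a single leaf switch.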

Bandwidth sizing

Rule of thumb: InfiniBand bandwidth per GPU ≥ 50 % of GPU HBM bandwidth for scalable training

Example: H100 has 3.35 TB/s HBM
  → Rule asks for ≈ 1.7 TB/s of network bandwidth per GPU
  → Even 4× NDR400 IB per GPU = 4 × 50 GB/s = 200 GB/s ≈ 6 % of HBM
  → Reality: DGX H100 has 8× NDR400 for 8 GPUs, i.e. 1× 400 Gb/s (50 GB/s) per GPU ≈ 1.5 % of HBM → bottleneck
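
The percentages in this example come from dividing per-GPU link bandwidth by HBM bandwidth. A minimal sketch of that arithmetic follows; the function name is hypothetical, and 400 Gb/s per NDR400 link and 3.35 TB/s HBM are taken from the text above.

```python
def network_to_hbm_ratio(links_per_gpu: int, link_gbit_s: float, hbm_gbyte_s: float) -> float:
    """Fraction of HBM bandwidth that the GPU's network links can deliver."""
    network_gbyte_s = links_per_gpu * link_gbit_s / 8  # Gb/s -> GB/s
    return network_gbyte_s / hbm_gbyte_s


H100_HBM_GBYTE_S = 3350  # 3.35 TB/s

for links in (1, 4):
    ratio = network_to_hbm_ratio(links, 400, H100_HBM_GBYTE_S)
    print(f"{links}x NDR400 per GPU: {ratio:.1%} of HBM bandwidth")
```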

AI storage

Requirements

| Dataset size | IO pattern | Recommended storage | Bandwidth |
|---|---|---|---|
| < 10 TB | Sequential read (data loading) | Local NVMe | > 10 GB/s per node |
| 10–100 TB | Random read (checkpointing) | Parallel FS (Lustre, Weka) | > 100 GB/s cluster-wide |
| 100 TB–10 PB | Mixed (training + checkpoint) | Parallel FS + object store | > 500 GB/s |
| 10 PB+ | Multi-modal, video, LLM | Tiered (NVMe cache + parallel FS + object) | > 1 TB/s |

Storage solution comparison

| Solution | Type | Bandwidth | Max capacity | Scaling | Use case |
|---|---|---|---|---|---|
| Lustre | Parallel FS (POSIX) | > 100 GB/s (cluster) | 100s PB | OST + MDS | HPC, LLM training (standard) |
| GPFS / StorageScale | Parallel FS (POSIX) | > 100 GB/s | 100s PB | NSD servers | HPC, AI (IBM) |
| WekaFS | Parallel FS (POSIX + NFS/SMB) | ~80 GB/s per 10 nodes | 10s PB | Container-native | AI/ML, NVIDIA DGX preferred |
| VAST Data | Universal storage (NVMe + QLC) | ~100 GB/s per cluster | 10s PB | Scale-out | AI, checkpoint, data lake |
| Pure Storage//E | All-flash (NVMe) | ~50 GB/s | ~30 PB | Scale-out | Enterprise AI, database |
| MinIO / S3 | Object store | ~20 GB/s per gateway | EB | Erasure coding | Dataset repository, checkpoint |
| NetApp AFF | NAS + S3 | ~10 GB/s per controller | ~50 PB | HA pair | Enterprise, NFS baseline |

Checkpointing strategies

| Strategy | RPO | Storage impact | Description |
|---|---|---|---|
| Full checkpoint | every N steps | High (stops training) | Full model + optimizer state |
| Async checkpoint | every N steps | Medium (non-blocking) | Copy to staging buffer, async write |
| Distributed checkpoint (NVIDIA NeMo) | every N steps | Low | Each rank writes its own shard |
| In-memory checkpoint (IBM) | on failover | Minimal (DRAM) | Replication to another node's DRAM |
| Continuous checkpoint (Microsoft) | every 15 min | Low (delta) | Changed shards only |
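
The async strategy (copy to a staging buffer, then write in the background) can be sketched with the standard library only. This is an illustration under stated assumptions: a plain dict stands in for model and optimizer state, and `async_checkpoint` is a hypothetical helper, not the API of any training framework.

```python
import copy
import pickle
import tempfile
import threading
from pathlib import Path


def async_checkpoint(state: dict, path: Path) -> threading.Thread:
    """Snapshot `state` in memory, then write it to `path` in a background thread."""
    snapshot = copy.deepcopy(state)  # staging buffer: training may keep mutating `state`

    def _write() -> None:
        tmp = path.with_suffix(".tmp")
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
        tmp.replace(path)  # rename after the write so readers never see a partial file

    writer = threading.Thread(target=_write)
    writer.start()
    return writer


state = {"step": 100, "weights": [0.1, 0.2, 0.3]}
ckpt = Path(tempfile.mkdtemp()) / "step100.ckpt"
writer = async_checkpoint(state, ckpt)
state["step"] = 101  # training continues while the write is in flight
writer.join()        # block only when the file is actually needed
with open(ckpt, "rb") as f:
    restored = pickle.load(f)
print(restored["step"])
```

Only the in-memory copy blocks the training loop; the slower write to storage overlaps with the following steps.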

AI cluster architecture

Physical topology — DGX H100 example

┌──────── DGX H100 (8× GPU) ────────┐
│  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐  │
│  │GPU 0│ │GPU 1│ │GPU 2│ │GPU 3│  │
│  └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘  │
│  ┌──┴──┐ ┌──┴──┐ ┌──┴──┐ ┌──┴──┐  │
│  │GPU 4│ │GPU 5│ │GPU 6│ │GPU 7│  │
│  └─────┘ └─────┘ └─────┘ └─────┘  │
│  NVSwitch (NVLink 4.0, 900 GB/s)  │
│  InfiniBand CX-7: 8× NDR400       │
└────────┬──────────────────────────┘
         │ 8× IB rails
    ┌────┴─────────────────┐
    │  IB NDR400 Switches  │  (rail-optimized)
    └──────────────────────┘

Kubernetes for AI

| Component | Role |
|---|---|
| Volcano | Batch scheduling, gang scheduling, queue management |
| Kueue | Multi-tenant admission, resource quotas, fair sharing |
| NVIDIA GPU Operator | Driver, container toolkit, MIG, DCGM, monitoring |
| HAMi (ex k8s-vGPU-scheduler) | GPU sharing, MIG partitioning, fractional GPU |
| Node Feature Discovery | GPU type detection, NUMA topology |
| Topology Manager | NUMA-aware pod placement |
| DPDK / SR-IOV | High-performance networking for GPU Direct RDMA |

Slurm for AI

| Component | Role |
|---|---|
| slurm.conf | Partition for GPU nodes, GRES (Generic Resource) |
| gres.conf | GPU type, GPU count per node |
| srun --gres=gpu:8 | Allocate 8 GPUs per node |
| sbatch --nodes=64 --ntasks=512 | 64 nodes, 512 ranks (8 GPU/node) |
| Pyxis | NVIDIA container plugin for Slurm (used with Enroot) |

AI cluster cooling

Power density comparison

| Configuration | TDP per node | Racks | kW/rack | Note |
|---|---|---|---|---|
| Standard server (2U) | 1 kW | 20 | 5–10 | Typical DC |
| GPU server (DGX H100, 6×) | 42 kW | 6 | 45–50 | Air cooling limit |
| GPU server (DGX B200, 6×) | 72 kW | 6 | 90–100 | Liquid cooling required |
| GPU server (GB200 NVL72) | 120 kW | | ~120 | Liquid cooling mandatory |
| NVIDIA NVL72 rack | 120 kW | 1 | 120 | Fully liquid cooled |

Cooling technologies

| Method | Max kW/rack | CAPEX | OPEX | Complexity |
|---|---|---|---|---|
| Air cooling (CRAC/CRAH) | < 15 | Low | Medium | Low |
| Air cooling (in-row) | 15–30 | Medium | Medium | Low |
| Rear-door heat exchanger | 30–50 | Medium | Low | Medium |
| Direct-to-chip liquid (cold plate) | 50–150 | High | Low | High |
| Immersion (single-phase) | 100–200 | High | Low | High |
| Immersion (two-phase) | 200+ | Very high | Low | Very high |

Inference infrastructure

Inference server comparison

| Tool | Frameworks | Optimization | Use case |
|---|---|---|---|
| vLLM | Megatron, HF, AWQ, GPTQ | PagedAttention, KV cache, continuous batching | LLM inference (open source) |
| TensorRT-LLM | TensorRT | INT4/INT8/FP8, inflight batching, attention optimizations | Production (NVIDIA) |
| Triton Inference Server | All (TensorRT, vLLM, PyTorch) | Model ensemble, model caching, concurrent execution | Enterprise, multi-model |
| SageMaker | Managed | Auto-scaling, model parallelism | AWS managed |
| OpenAI API / TGI | HF Transformers | Continuous batching, flash attention | Hosting |

Inference optimization

Technique Latency improvement Throughput improvement Memory reduction
FP8/INT8 quantization 2× 2×
INT4 quantization 4× 4×
Flash Attention 2/3 2–4× 50 % (KV cache)
PagedAttention 2–5× 95 % (KV cache fragmentation)
Continuous batching 10–20×
Speculative decoding 2–3×
Multi-LoRA / S-LoRA 8–16×

Distributed training techniques

| Technique | Description | Frameworks |
|---|---|---|
| Data Parallelism (DDP/FSDP) | Each GPU has model copy, different batch | PyTorch DDP, FSDP |
| Tensor Parallelism (TP) | Each layer's weight matrices split across GPUs (intra-node) | Megatron-LM, DeepSpeed |
| Pipeline Parallelism (PP) | Consecutive layers split into stages across nodes | Megatron-LM, DeepSpeed |
| Sequence Parallelism (SP) | Sequence split across GPUs | Megatron-LM |
| Expert Parallelism (EP) | Different expert subnets on different GPUs | Mixture-of-Experts (MoE) |
| 3D Parallelism | TP + PP + DP combination | Megatron-LM, NeMo |
| ZeRO (1/2/3) | Optimizer/gradient/parameter sharding | DeepSpeed |
| NCCL / RCCL | GPU collective communication library | NVIDIA/AMD |
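
In 3D parallelism the three degrees multiply to the total GPU count, so fixing TP and PP determines DP. A minimal sketch of that bookkeeping; the function name and the 512-GPU layout are hypothetical examples, not a recommended configuration.

```python
def data_parallel_degree(world_size: int, tp: int, pp: int) -> int:
    """Return the DP degree such that tp * pp * dp == world_size."""
    if world_size % (tp * pp) != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    return world_size // (tp * pp)


# Hypothetical layout: 512 GPUs, TP=8 inside each 8-GPU node, PP=4 across nodes
dp = data_parallel_degree(world_size=512, tp=8, pp=4)
print(f"TP=8 x PP=4 x DP={dp} = {8 * 4 * dp} GPUs")
```

Keeping TP within a node puts its frequent all-reduce traffic on NVLink, while PP and DP traffic crosses the InfiniBand or RoCE fabric.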

Operating systems for AI

Distribution comparison

| OS | GPU driver | CUDA | Container toolkit | IB/RoCE | Lustre client | Production support |
|---|---|---|---|---|---|---|
| Ubuntu 22.04 LTS | NVIDIA 525+ | 12.x | nvidia-container-toolkit | MLNX_OFED, rdma-core | Yes (lustre-client) | NVIDIA DGX standard |
| Ubuntu 24.04 LTS | NVIDIA 550+ | 12.5+ | nvidia-container-toolkit | MLNX_OFED, rdma-core | Yes | Latest GPU support |
| RHEL 9 / Rocky 9 | NVIDIA 525+ | 12.x | nvidia-container-toolkit | MLNX_OFED | Yes (EL repo) | Red Hat, enterprise |
| DGX OS (Ubuntu-based) | NVIDIA custom | 12.x | Pre-installed | Pre-configured | Yes | NVIDIA DGX only supported |
| SLES 15 SP5 | NVIDIA 525+ | 12.x | nvidia-container-toolkit | MLNX_OFED | Yes | HPC, some Lustre clusters |
| Debian 12 | NVIDIA 525+ | 12.x | nvidia-container-toolkit | rdma-core | Yes (backports) | Community, research |
| Flatcar / Bottlerocket | Container-host | | nvidia-container-toolkit | Limited | No | K8s-only, minimal footprint |

Limitations and constraints

GPU drivers and CUDA

| Constraint | Detail |
|---|---|
| Driver-CUDA compatibility | NVIDIA driver version must be at least the minimum the CUDA toolkit requires (driver ≥ CUDA req). E.g., CUDA 12.5 requires driver ≥ 550 |
| Kernel version | NVIDIA driver not compatible with all kernels. New kernel (6.8+) may require DKMS build or delayed support |
| Secure Boot | NVIDIA driver requires signed module (MOK, shim) or disabled Secure Boot — common enterprise issue |
| Open vs Proprietary driver | NVIDIA nvidia-open (since R515) — open source kernel module. GPU support: DC (H100+) → OK, older GPUs → proprietary required |
| nvidia-persistenced | Required to maintain GPU initialization; without it GPUs may sleep after idle timeout (nvidia-smi -pm 1) |
| GPU reset | After crashed training job, GPU may hang. nvidia-smi --gpu-reset or reboot node, sometimes power cycle |
| Multi-instance GPU (MIG) | Requires specific driver, MIG mode on GPU, GPU restart. Cannot be changed at runtime. A100, H100, B200 only |

Network (InfiniBand / RoCE)

| Constraint | Detail |
|---|---|
| MLNX_OFED vs rdma-core | MLNX_OFED (NVIDIA) — full support, but own kernel modules, kernel version compatibility needed. rdma-core (open) — limited support, no custom modules |
| Kernel compatibility | MLNX_OFED supports only specific kernel versions (major.minor). Kernel upgrade → MLNX_OFED rebuild required |
| NCCL | NCCL version must be compatible with CUDA and IB firmware. nccl-tests for validation |
| SHARP | In-network reduction requires specific MLNX_OFED + IB switch firmware combination |
| GPU Direct RDMA | Requires nvidia-peermem module + MLNX_OFED. Does not work with all GPU and IB card combinations |
| RoCE PFC/ECN | RoCE requires lossless fabric (PFC, ECN, DCQCN). Switch and host configuration — complex tuning |

Storage

| Constraint | Detail |
|---|---|
| Lustre client | Client version must match server. Server upgrade → upgrade all clients. Compatible with RHEL/Debian derivatives only |
| POSIX locking | NFS and Lustre have different POSIX locking behavior. Distributed training relies on flock → problematic with mixed FS |
| Filesystem cache | Page cache can mask IO bottlenecks. Training jobs often require O_DIRECT or sync IO |
| Local NVMe vs parallel FS | Dataset staging on local NVMe eliminates network dependency but requires space and pre-fetch pipeline |

Container runtime

| Constraint | Detail |
|---|---|
| Docker + GPU | nvidia-container-toolkit (formerly nvidia-docker2). Requires runtime installation and config in /etc/docker/daemon.json |
| Podman + GPU | Requires nvidia-container-toolkit + podman hook. Less tested than Docker |
| containerd + GPU | Standard for K8s. Requires CDI (Container Device Interface) or nvidia-container-runtime |
| Enroot + Pyxis | NVIDIA container stack for Slurm (Enroot = daemonless container runtime, Pyxis = Slurm plugin) |
| User namespace mapping | Container GPU access requires device cgroup; rootless may fail (exception for /dev/dri and /dev/nvidia*) |

Kernel parameters

# AI workload recommended sysctl
net.core.rmem_max = 134217728       # 128 MiB socket buffers, sufficient for NCCL
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
net.core.netdev_budget = 600        # for high packet rate
vm.max_map_count = 1048576          # PyTorch DataLoader workers
kernel.numa_balancing = 0           # disable NUMA balancing (breaks locality)
kernel.sched_min_granularity_ns = 10000000   # sysctl on kernels < 5.13 only; later kernels expose it via debugfs

# Kernel command-line parameters (set in the boot loader, not via sysctl)
# Disable security mitigations for perf (dedicated AI clusters only)
mitigations=off
transparent_hugepage=never          # or madvise — THP may cause latency spikes
intel_idle.max_cstate=1             # reduce C-state transition latency

Firmware and HW

| Constraint | Detail |
|---|---|
| GPU firmware (VBIOS) | NVIDIA datacenter GPUs (H100, B200) have VBIOS updates via NVFlash. Without update → missing partitioning support or newer CUDA features |
| InfiniBand firmware | IB switch and HCA firmware must be compatible. Mix old switch + new HCA → degraded perf |
| NVSwitch firmware | DGX systems have NVSwitch firmware updatable only via NVIDIA DGX tools |
| Power capping (nvidia-smi) | nvidia-smi -pl <power> — limit TDP for power budget management. Test impact on training throughput |
| GPU clock locking | nvidia-smi -ac <mem,graphics> — fixed application clocks for stable benchmarks. Apply after nvidia-persistenced |
| PCIe Gen | GPU in PCIe Gen4 slot (instead of Gen5) → bottleneck for CPU↔GPU data transfer. Important for FSDP sharding |

Recommended OS by use case

| Use case | OS | Rationale |
|---|---|---|
| DGX cluster (production) | DGX OS / Ubuntu 22.04 LTS | NVIDIA standard, best driver support |
| Enterprise K8s (OpenShift) | RHEL 9 / RHCOS | Red Hat support, GPU Operator compatible |
| Vanilla K8s (on-prem) | Ubuntu 22.04 LTS + Flatcar (workers) | Widest community support, Flatcar for minimal footprint |
| Slurm cluster (HPC/AI) | Rocky Linux 9 / Ubuntu 22.04 LTS | EL ecosystem (Lustre, OFED) or Ubuntu (community) |
| Research / rapid prototyping | Ubuntu 24.04 LTS | Latest CUDA, PyTorch, driver support |
| Edge inference | NVIDIA JetPack / Ubuntu (ARM) | Embedded GPU (Jetson Orin, AGX) |

AI-ready data center — check-list

| Area | Requirement |
|---|---|
| Power | 30–120 kW/rack, HVDC (400 V DC), UPS supporting GPU spikes |
| Cooling | Liquid cooling ready (direct-to-chip), rear-door for 30+ kW |
| Network | InfiniBand (NDR/XDR) or RoCEv2, rail-optimized fat-tree |
| Storage | Parallel FS (Lustre/Weka), checkpoint bandwidth > 100 GB/s |
| GPU density | Max GPU/rack, minimize NVSwitch hops |
| Physical | Floor load 1,500+ kg/m², rack 52U–60U |
| Security | Tenant isolation, network segmentation, data encryption |
| Monitoring | DCGM, NCCL health checks, thermals, power capping |

Model and throughput limitations

Model size per GPU

Maximum model size fitting on a single GPU depends on HBM capacity and precision:

| GPU | HBM | FP32 | FP16/BF16 | INT8 | INT4 |
|---|---|---|---|---|---|
| H100 80GB | 80 GB | ~20B | ~40B | ~80B | ~160B |
| H200 141GB | 141 GB | ~35B | ~70B | ~140B | ~280B |
| B200 192GB | 192 GB | ~48B | ~96B | ~192B | ~384B |
| MI300X 192GB | 192 GB | ~48B | ~96B | ~192B | ~384B |
| A100 80GB | 80 GB | ~20B | ~40B | ~80B | ~160B |
| GB200 (192+480) | 192 GB GPU + 480 GB Grace | | ~96B + CPU offload | | |

Approximate: 1B params ≈ 2 GB FP16 ≈ 4 GB FP32 ≈ 1 GB INT8 ≈ 0.5 GB INT4. Subtract ~10–15 % HBM for activations, KV cache, optimizer states.
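
The rule of thumb above translates directly into a small calculator. This is a sketch under the stated bytes-per-parameter figures; the function name and the `reserve` parameter (fraction of HBM held back for activations and KV cache) are illustrative assumptions.

```python
BYTES_PER_PARAM = {"FP32": 4.0, "FP16/BF16": 2.0, "INT8": 1.0, "INT4": 0.5}


def max_params_billion(hbm_gb: float, precision: str, reserve: float = 0.0) -> float:
    """Largest model (billions of parameters) whose weights fit in HBM.

    `reserve` is the fraction of HBM kept free for activations and KV cache.
    """
    usable_gb = hbm_gb * (1.0 - reserve)
    return usable_gb / BYTES_PER_PARAM[precision]


for gpu, hbm in (("H100", 80), ("H200", 141), ("B200", 192)):
    row = {p: round(max_params_billion(hbm, p)) for p in BYTES_PER_PARAM}
    print(gpu, row)

# With 15 % of HBM reserved, an 80 GB GPU holds about 68B INT8 parameters
print(max_params_billion(80, "INT8", reserve=0.15))
```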

Memory breakdown inference

| Component | Llama 3 70B (FP16) | Llama 3 8B (FP16) |
|---|---|---|
| Model weights | 140 GB | 16 GB |
| KV cache (4K context, batch 1) | ~2 GB | ~0.2 GB |
| KV cache (128K context, batch 1) | ~60 GB | ~6.5 GB |
| Activations (peak) | ~5 GB | ~1 GB |
| Total 4K ctx | ~147 GB | ~17 GB |
| Total 128K ctx | ~205 GB | ~23 GB |

Conclusion: Llama 3 70B FP16 does not fit on a single H100 (80 GB). Options: INT8 (~70 GB of weights → 2× H100 once KV cache and activations are added), INT4 (~35 GB → 1× H100), or FP16 with tensor parallelism across several GPUs.

Context length vs memory

| Context | KV cache 70B (FP16) | KV cache 8B (FP16) | Note |
|---|---|---|---|
| 4K | ~2.2 GB | ~0.25 GB | Typical chat |
| 32K | ~18 GB | ~2 GB | Documents |
| 128K | ~72 GB | ~8 GB | Long-context (Claude, Gemini) |
| 1M | ~560 GB | ~64 GB | Experimental (Gemini 1.5 Pro) |

KV cache grows linearly with context length, batch size, layer count, and number of KV heads; it is attention compute, not the cache, that is quadratic in context length. Critical for long-context inference.
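
How the cache scales can be seen from the size formula: two tensors (K and V) per layer, per KV head, per token. A minimal sketch follows; the layer count, KV-head count, and head dimension are assumed for illustration and are not taken from the table above, so the absolute numbers will not necessarily match it.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, batch: int = 1, bytes_per_value: int = 2) -> int:
    """KV cache size: one K and one V vector per layer, KV head, and token."""
    return 2 * layers * kv_heads * head_dim * context_len * batch * bytes_per_value


# Illustrative configuration (assumed values, FP16 = 2 bytes per value)
cfg = dict(layers=80, kv_heads=8, head_dim=128)

for ctx in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(context_len=ctx, **cfg) / 2**30
    print(f"{ctx:>7} tokens: {gib:.2f} GiB")
```

Doubling the context length or the batch size doubles the result, which is the linear behavior that matters for long-context serving.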

Throughput inference

| Model | GPU | Precision | Batch size | Tokens/s | QPS (1K output) |
|---|---|---|---|---|---|
| Llama 3 8B | H100 | FP16 | 1 | ~800 | ~0.8 |
| Llama 3 8B | H100 | FP16 | 128 | ~4 500 | ~35 |
| Llama 3 8B | H100 | INT4 | 128 | ~8 000 | ~62 |
| Llama 3 70B | 4× H100 | FP16 | 1 | ~180 | ~0.18 |
| Llama 3 70B | 4× H100 | INT4 | 64 | ~1 200 | ~19 |
| Llama 3 70B | 8× H100 | FP16 (TP=8) | 128 | ~2 500 | ~20 |
| DeepSeek-R1 671B | 8× H200 | FP8 (MoE) | 64 | ~500 | ~8 |
| GPT-4 class (est.) | | | | ~100–300 | ~1–3 |

Notes:

  • QPS (queries per second) depends on output length (here 1 query ≈ 1K output tokens)
  • Larger batch increases throughput but also increases TTFT (time to first token)
  • Tensor Parallelism (TP) scales throughput, but communication overhead grows with the TP degree

Training limits

Scaling efficiency

| GPU count | Model | Efficiency | Reason |
|---|---|---|---|
| 8 (1 node) | Llama 3 8B | ~95 % | NVLink intra-node |
| 64 (8 nodes) | Llama 3 8B | ~85 % | IB inter-node |
| 512 (64 nodes) | Llama 3 70B | ~75 % | Communication overhead |
| 4 096 (512 nodes) | Llama 3 70B | ~60 % | Pipeline bubble, network |
| 16 384 (2 048 nodes) | Llama 3 405B | ~45 % | Synchronous SGD overhead |

Note: Efficiency = (actual throughput) / (ideal linear speedup). Decreases logarithmically with GPU count.
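
The efficiency definition in the note can be computed from two throughput measurements. A minimal sketch; the function name and the sample throughput figures are hypothetical and only illustrate the formula.

```python
def scaling_efficiency(throughput: float, gpus: int,
                       base_throughput: float, base_gpus: int = 1) -> float:
    """Measured throughput divided by ideal linear scaling from the baseline run."""
    ideal = base_throughput * gpus / base_gpus
    return throughput / ideal


# Hypothetical measurements in samples/s: 8-GPU baseline vs 64-GPU run
eff = scaling_efficiency(throughput=5_440, gpus=64, base_throughput=800, base_gpus=8)
print(f"{eff:.0%}")
```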

Memory breakdown training

| Component | Llama 3 70B (BF16) | Llama 3 8B (BF16) |
|---|---|---|
| Model weights | 140 GB | 16 GB |
| Optimizer states (Adam) | 280 GB | 32 GB |
| Gradients | 140 GB | 16 GB |
| Activations (peak) | ~30 GB | ~4 GB |
| Total (DDP) | ~590 GB | ~68 GB |
| Total (FSDP shard=8) | ~74 GB | ~8.5 GB |

Conclusion: FSDP (Fully Sharded Data Parallel) is required for training models > 10B. With Adam, training needs roughly 4× the memory of the weights alone (weights + optimizer states + gradients), before activations.
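
The totals in the table above follow from simple accounting: weights once, Adam states twice the weights, gradients once, plus activations. The sketch below reproduces that arithmetic, including the table's simplification of dividing the whole total by the shard count; the function name is hypothetical.

```python
def training_memory_gb(params_billion: float, activations_gb: float,
                       shards: int = 1, bytes_per_param: int = 2) -> float:
    """Per-GPU training memory: weights 1x, Adam states 2x, gradients 1x, plus activations."""
    weights = params_billion * bytes_per_param
    optimizer = 2 * weights
    gradients = weights
    return (weights + optimizer + gradients + activations_gb) / shards


for name, params, act in (("70B", 70, 30), ("8B", 8, 4)):
    ddp = training_memory_gb(params, act)
    fsdp = training_memory_gb(params, act, shards=8)
    print(f"{name}: DDP {ddp:.0f} GB per GPU, FSDP (8 shards) {fsdp:.1f} GB per GPU")
```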

Time to train

| Model | GPU count | GPU type | Precision | Time | Cost (on-prem estimate) |
|---|---|---|---|---|---|
| Llama 3 8B | 64 | H100 | BF16 | ~3 days | ~$5 000 |
| Llama 3 70B | 512 | H100 | BF16 | ~14 days | ~$100 000 |
| Llama 3 405B | 16 384 | H100 | BF16 | ~60 days | ~$14 M |
| DeepSeek-R1 671B (MoE) | 2 048 | H800 | BF16 | ~30 days | ~$6 M |
| GPT-4 (est.) | 25 000 | A100/H100 | Mixed | ~90–100 days | ~$100 M |

Power and thermal limits

| Configuration | TDP limit | Throughput loss | Reason |
|---|---|---|---|
| H100 SXM | 700 W (default) | 0 % | Nominal |
| H100 SXM | 600 W (-15 %) | ~5–8 % | Power capping |
| H100 SXM | 500 W (-30 %) | ~15–25 % | Significant throttling |
| H100 SXM | 400 W (-43 %) | ~30–50 % | Emergency only |
| DGX H100 (8×) | 5.6 kW (max) | 0 % | Liquid cooling required |
| DGX H100 (8×) | 4.5 kW (air) | ~10–15 % | Rear-door heat exchanger |

GPU throttles when exceeding TDP or temperature (85°C+). Power capping correlates linearly with frequency but non-linearly with throughput.

API and operational limits

| Limit | Description | Typical value |
|---|---|---|
| Rate limit | Max requests per minute/hour | 100–10 000 RPM (per tier) |
| Tokens per minute (TPM) | Max tokens per minute | 1M–300M (per model) |
| Context window | Max input tokens | 4K–2M (per model) |
| Max output tokens | Max generated tokens | 4K–32K (per model) |
| Concurrent requests | Parallel request count | 10–10 000 (per backend) |
| Batch window | Time to accumulate batch | 0–20 s (vLLM, TGI) |
| TTFT timeout | Max latency to first token | 30–120 s |
| Idle timeout | GPU idle → scale to 0 | 5–15 min (cloud) |

Limits per deployment model

| Dimension | On-prem HW | Managed cloud (SageMaker, Vertex) | API (OpenAI, Anthropic) |
|---|---|---|---|
| Model size | Limited by HBM (max 192 GB/GPU) | Unlimited (cluster scaling) | Unlimited |
| Queries | Limited by GPU count | Auto-scaling | Rate limit (per tier) |
| Latency | < 10 ms (same node) | 10–100 ms (network hop) | 100 ms–10 s |
| Customization | Full (fine-tuning, quantization) | Managed (SageMaker, Bedrock) | Prompt engineering only |
| Data privacy | Yes (on-prem) | Contractual (region, encryption) | Limited |
| Cost per 1M tokens | ~$0.10–0.50 (FP16 inference) | ~$0.20–1.00 | ~$0.15–15.00 |
| Max context | 128K+ (depending on GPU count) | 128K+ | 32K–2M |
| Cold start | 0 (always-on) | 30 s–5 min | 0 (shared infra) |

GPU pricing and price/performance (2026)

Prices are approximate — NVIDIA does not publish official datacenter GPU price lists. Cloud prices from public providers (Q2 2026). HW purchase prices vary by volume, reseller, and region.

Purchase price (buy)

| GPU | Price/GPU | Price 8× GPU baseboard | $/PFLOPS (FP16) | Note |
|---|---|---|---|---|
| H100 SXM | $27,000–40,000 | ~$200,000 | $25,000 | Scarcity 2023–2024, now stabilized |
| H200 SXM | $35,000–50,000 | ~$280,000 | ~$35,000 | H100 upgrade, HBM3e |
| B200 | ~$60,000–70,000 | ~$500,000+ | ~$31,000 | Blackwell, FP4 support |
| B100 | ~$30,000 | ~$240,000 | ~$20,000 | Lower price than B200, similar FP8 perf |
| GB200 (Grace+Blackwell) | ~$70,000–100,000 | ~$2,000,000 (rack) | | CPU+GPU unified, high-density |
| A100 80GB | ~$10,000–15,000 | ~$120,000 | ~$19,200 | Previous gen, still relevant |
| MI300X | ~$12,000–18,000 | ~$100,000 | ~$9,600 | AMD, 192 GB HBM3 |
| Gaudi 3 | ~$15,625 | ~$125,000 | $8,515 | Intel, best $/PFLOPS |
| L40S | ~$8,000–10,000 | | | Inference, enterprise |

Cloud pricing (on-demand $/GPU/hr)

GPU Cheapest Mid-range (CoreWeave, Lambda) Hyperscaler (AWS, GCP, Azure)
H100 SXM $1.38 (Thunder) $2.89–3.29 $4.15–6.88
H100 PCIe $2.01 (Spheron) $2.50
H200 SXM $3.89 (Spheron) $4.54 $5.00+
B200 $3.39 (Spheron) $6.02 $14.24 (AWS)
B200 spot $2.12 (Spheron)
GB200 $3.50 (Runcrate) $5.85 (Oracle) $6.95 (GCP)
MI300X $1.50 (TensorWave) $1.85 (Vultr) $7.86 (Azure)
A100 80GB $1.07 (Spheron) $1.50–2.00 $3.00+
Gaudi 3 ~$1.50–2.50
L40S $0.91 (Spheron) $1.50–2.00

Inference cost ($/M tokens)

| GPU | Provider | $/hr | Est. tok/s | $/M tok |
|---|---|---|---|---|
| B200 | Spheron | $3.39 | ~4,000 | $0.24 |
| B200 spot | Spheron | $2.12 | ~4,000 | $0.15 |
| H100 PCIe | Spheron | $2.01 | ~1,200 | $0.47 |
| A100 80GB | Spheron | $1.07 | ~520 | $0.57 |
| H100 SXM | AWS | $6.88 | ~1,200 | $1.59 |
| H200 SXM | Spheron | $4.54 | ~1,800 | $0.70 |
| L40S | Spheron | $0.91 | ~450 | $0.56 |

Values for Llama 3 70B (INT8, batch=1, output 1K tok). Actual values vary by batch size, context, and quantization.
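
The $/M tok column is the hourly price divided by tokens generated per hour, scaled to one million tokens. A minimal sketch of that conversion, using three rows from the table above as inputs; the function name is hypothetical.

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Convert a GPU hourly price and sustained tok/s into $ per 1M generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


rows = [("H100 PCIe", 2.01, 1200), ("A100 80GB", 1.07, 520), ("L40S", 0.91, 450)]
for gpu, price, tok_s in rows:
    print(f"{gpu}: ${cost_per_million_tokens(price, tok_s):.2f} per M tokens")
```

Because throughput sits in the denominator, batching (higher aggregate tok/s) lowers the cost per token far more than small differences in hourly price.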

Cost per GB HBM

| GPU | HBM | Price/hr cloud | $/GB/hr | Best for memory-bound workloads |
|---|---|---|---|---|
| MI300X | 192 GB | $1.50 | $0.0078 | Best |
| B200 | 192 GB | $3.39 | $0.0177 | Good |
| H200 | 141 GB | $3.89 | $0.0276 | ⚠️ |
| H100 SXM | 80 GB | $1.38 | $0.0173 | ⚠️ Only up to 70B models |
| GB200 | 384 GB | $3.50 | $0.0091 | (2× MI300X capacity) |

Price/performance by scenario

| Scenario | Winner | Rationale |
|---|---|---|
| Absolute performance (cost no object) | GB200 DGX NVL72 | 72× GPU, 18 PFLOPS FP8, 384 GB HBM/GPU |
| Cloud inference — best $/token | B200 spot | $0.15/M tok; 4× H100 throughput at lower cost |
| Cloud inference — on-demand | B200 | $0.24/M tok |
| Cloud inference — budget | A100 / L40S | $0.56–0.57/M tok |
| Training — price/perf on purchase | Gaudi 3 | $8,515/PFLOPS, 2.5–3× better than H100 |
| Training — cloud | H100 SXM | $1.38/hr, CUDA ecosystem, NCCL |
| Memory-bound — long context, 70B+ | MI300X / GB200 | 192–384 GB, $0.0078–0.0091/GB |
| Ecosystem + safe choice | H100/H200 | CUDA, widest SW, NVIDIA tools |
| Spot / preemptible — lowest cost | A100 / H100 | $1.07–1.38/hr, 50–90% off on-demand |

Market trends

  • H100 — price dropped 64% from peak $8/hr to $1.38–2.89/hr, then 40% rebound from inference demand

  • B200 — new high-end, $3.39/hr cloud → ~$0.15/M tok on spot — new inference benchmark

  • MI300X — supply growing (TensorWave, Vultr, CoreWeave, Oracle, Azure), from $1.50/hr

  • Gaudi 3 — best $/PFLOPS on purchase, but narrow ecosystem and limited cloud availability

  • Market bifurcation — prior gen (H100, A100) commoditizing, new gen (B200, GB200) commanding premium

Related documents

  • GPU.en.md — GPU architecture, NVIDIA/AMD, vGPU, MIG

  • NETWORKING.en.md — InfiniBand, RoCE, network topology

  • STORAGE.en.md — parallel filesystem, object store

  • DATACENTERS.en.md — DC layout, power, cooling

  • CLOUD.en.md — cloud AI services (SageMaker, Vertex AI)

Sources

Links, books, and standards: sources/infrastructure/sources.en.md

Last revision: 2026-06-18