Stanislav Hubacek ef3c2f75b1 18.6.2026

2026-06-18 16:25:33 +02:00

🧠 AI/ML Infrastructure

Component overview

flowchart TD
    subgraph Compute
        GPU["GPU (H100/B200/Instinct)"]
        CPU["CPU (AMD EPYC / Intel Xeon)"]
        ASIC["ASIC (TPU, Trainium, Inferentia)"]
    end
    subgraph Network
        IB["InfiniBand NDR/XDR"]
        ROCE["RoCEv2"]
        NVL["NVLink / NVSwitch"]
    end
    subgraph Storage
        FS["Parallel FS (Lustre, GPFS, Weka)"]
        OBJ["Object Store (S3, MinIO)"]
        NVME["Local NVMe cache"]
    end
    subgraph Orchestration
        S["Slurm"]
        K["Kubernetes + Volcano/Kueue"]
    end
    subgraph Cooling
        DLC["Direct-to-chip liquid"]
        IMM["Immersion"]
        AIR["Air (high-density)"]
    end

    Compute --> Network --> Storage
    Orchestration --> Compute
    Cooling --> Compute

GPU compute

NVIDIA

GPU	Architecture	FP8 (sparse)	FP16/BF16 (sparse)	FP64 (Tensor Core)	Memory	NVLink	TDP	Typical deployment
H100 SXM	Hopper	3,958 TFLOPS	1,979 TFLOPS	67 TFLOPS	80 GB HBM3	900 GB/s	700 W	8× in DGX H100
H200 SXM	Hopper (HBM3e)	3,958 TFLOPS	1,979 TFLOPS	67 TFLOPS	141 GB HBM3e	900 GB/s	700 W	8× in DGX H200
B200	Blackwell	~9,000 TFLOPS	~4,500 TFLOPS	~40 TFLOPS	192 GB HBM3e	1,800 GB/s	1,000 W	8× in DGX B200
GB200 (Grace Blackwell superchip, 1× Grace + 2× B200)	Blackwell	~18,000 TFLOPS	~9,000 TFLOPS	—	384 GB HBM3e (2× 192 GB) + 480 GB LPDDR5X (Grace)	1,800 GB/s per GPU, NVLink-C2C to Grace	1,000 W per GPU + 500 W (CPU)	GB200 NVL72 (72× GPU, 36× Grace)
L40S	Ada Lovelace	1,466 TFLOPS	733 TFLOPS	—	48 GB GDDR6	N/A	350 W	Inference, enterprise
A100 SXM	Ampere	— (no FP8; 1,248 TOPS INT8)	624 TFLOPS	19.5 TFLOPS	80 GB HBM2e	600 GB/s	400 W	8× in DGX A100

AMD

GPU	Architecture	FP8	FP16/BF16	FP64	HBM	Infinity Fabric	TDP
MI300X	CDNA 3	2,615 TFLOPS	1,307 TFLOPS	81 TFLOPS	192 GB HBM3	896 GB/s	750 W
MI250X	CDNA 2	—	383 TFLOPS	95.7 TFLOPS	128 GB HBM2e	400 GB/s	500 W

Intel

GPU	Architecture	FP16/BF16	FP32	HBM	TDP
Gaudi 3	Custom	1,835 TFLOPS	—	128 GB HBM2e	600 W
Max 1550	Xe HPC	600+ TFLOPS	200 TFLOPS	128 GB HBM2e	600 W

Cloud ASIC

ASIC	Provider	Use case	Performance
TPU v5p	Google	Training	~459 TFLOPS (BF16) per chip
Trainium 2	AWS	Training	~1,000 TFLOPS (BF16) per chip
Inferentia 2	AWS	Inference	~400 TOPS (INT8) per chip
Maia 100	Microsoft	Training + inference	Custom, 800 W TDP

AI networking

Technology comparison

Technology	Bandwidth per link	Latency	Topology	Use case
InfiniBand NDR200	200 Gb/s	< 1 µs	Fat-tree, Dragonfly+	Training (NVIDIA)
InfiniBand NDR400	400 Gb/s	< 1 µs	Fat-tree, Dragonfly+	Training (NVIDIA)
InfiniBand XDR	800 Gb/s (planned)	< 1 µs	Dragonfly+	Next-gen training
RoCEv2 (CX-7/8)	200–400 Gb/s	1–2 µs	Fat-tree, Spine-leaf	Training (AMD, Intel, open)
NVLink 4.0	900 GB/s per GPU	< 0.5 µs	NVSwitch full-mesh	Intra-node GPU comm
NVLink 5.0	1,800 GB/s per GPU	< 0.5 µs	NVSwitch full-mesh	Intra-node (Blackwell)
Ethernet (400 GbE)	400 Gb/s	2–5 µs	Spine-leaf	Inference, data pipeline

AI fabric principles

Rail-optimized topology — each GPU communicates on dedicated "rails" (same GPU indices across nodes connect to the same switch)
Fat-tree (Clos) — standard for InfiniBand and RoCE, non-blocking bisection bandwidth
Dragonfly+ — reduces hop count while maintaining bandwidth (used in largest clusters)
GPU Direct RDMA — direct data path between GPU memory and the NIC that bypasses host memory and the CPU; works over both InfiniBand and RoCE
SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) — in-network reduction for AllReduce (InfiniBand only)
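
The rail-optimized rule above (same GPU index on every node lands on the same switch) can be sketched in a few lines. This is an illustration only: the function names and the one-leaf-switch-per-GPU-index layout are assumptions, not a vendor API.

```python
from collections import defaultdict

GPUS_PER_NODE = 8  # assumption: DGX-style node with one NIC per GPU


def rail_switch_for(node: int, gpu: int) -> int:
    """Return the rail (leaf switch) index for a GPU's NIC.

    In a rail-optimized fabric the rail depends only on the local GPU
    index, never on the node, so GPU k of every node shares one switch.
    """
    if not 0 <= gpu < GPUS_PER_NODE:
        raise ValueError(f"gpu index {gpu} out of range")
    return gpu


def build_rails(num_nodes: int) -> dict[int, list[tuple[int, int]]]:
    """Group every (node, gpu) endpoint by the rail switch it attaches to."""
    rails = defaultdict(list)
    for node in range(num_nodes):
        for gpu in range(GPUS_PER_NODE):
            rails[rail_switch_for(node, gpu)].append((node, gpu))
    return dict(rails)


rails = build_rails(num_nodes=4)
print(rails[3])  # endpoints on rail 3: GPU 3 of each node
```

Collectives that pair equal GPU indices across nodes then stay within a single leaf switch, which is the point of the layout.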

Bandwidth sizing

Rule of thumb: InfiniBand bandwidth ≥ 50 % of GPU HBM bandwidth for scalable training (a target that current fabrics do not reach)

Example: H100 SXM has 3.35 TB/s HBM bandwidth
  → 50 % target: ~1.7 TB/s of network bandwidth per GPU
  → DGX H100: 1× NDR400 HCA per GPU (8 per node) = 50 GB/s per GPU, 400 GB/s per node ≈ 1.5 % of HBM bandwidth
  → NDR200 configuration: 200 Gb/s (25 GB/s) per GPU ≈ 0.75 % of HBM bandwidth → inter-node traffic is the bottleneck
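
The ratio used in the example can be reproduced with a short calculation. This is a minimal sketch that assumes one HCA per GPU, as in the DGX H100 topology shown later in this document; the function name is illustrative.

```python
def network_to_hbm_ratio(link_gbps: float, links_per_gpu: int, hbm_tbps: float) -> float:
    """Fraction of a GPU's HBM bandwidth that its network links provide."""
    network_gb_per_s = link_gbps / 8 * links_per_gpu  # Gb/s -> GB/s
    hbm_gb_per_s = hbm_tbps * 1000                    # TB/s -> GB/s
    return network_gb_per_s / hbm_gb_per_s


H100_HBM_TBPS = 3.35  # HBM bandwidth quoted in the example above

ndr400 = network_to_hbm_ratio(400, 1, H100_HBM_TBPS)
ndr200 = network_to_hbm_ratio(200, 1, H100_HBM_TBPS)
print(f"NDR400: {ndr400:.2%} of HBM, NDR200: {ndr200:.2%} of HBM")
```

Both results sit far below the 50 % target, which is why gradient traffic is usually overlapped with compute or reduced in-network.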

AI storage

Requirements

Dataset size	IO pattern	Recommended storage	Bandwidth
< 10 TB	Sequential read (data loading)	Local NVMe	> 10 GB/s per node
10–100 TB	Random read (data loading), burst write (checkpointing)	Parallel FS (Lustre, Weka)	> 100 GB/s cluster-wide
100 TB–10 PB	Mixed (training + checkpoint)	Parallel FS + object store	> 500 GB/s
10 PB+	Multi-modal, video, LLM	Tiered (NVMe cache + parallel FS + object)	> 1 TB/s

Storage solution comparison

Solution	Type	Bandwidth	Max capacity	Scaling	Use case
Lustre	Parallel FS (POSIX)	> 100 GB/s (cluster)	100s PB	OST + MDS	HPC, LLM training (standard)
GPFS / StorageScale	Parallel FS (POSIX)	> 100 GB/s	100s PB	NSD servers	HPC, AI (IBM)
WekaFS	Parallel FS (POSIX + NFS/SMB)	~80 GB/s per 10 nodes	10s PB	Container-native	AI/ML, NVIDIA DGX preferred
VAST Data	Universal storage (NVMe + QLC)	~100 GB/s per cluster	10s PB	Scale-out	AI, checkpoint, data lake
Pure Storage//E	All-flash (NVMe)	~50 GB/s	~30 PB	Scale-out	Enterprise AI, database
MinIO / S3	Object store	~20 GB/s per gateway	EB	Erasure coding	Dataset repository, checkpoint
NetApp AFF	NAS + S3	~10 GB/s per controller	~50 PB	HA pair	Enterprise, NFS baseline

Checkpointing strategies

Strategy	RPO	Storage impact	Description
Full checkpoint	every N steps	High (stops training)	Full model + optimizer state
Async checkpoint	every N steps	Medium (non-blocking)	Copy to staging buffer, async write
Distributed checkpoint (NVIDIA NeMo)	every N steps	Low	Each rank writes its own shard
In-memory checkpoint (IBM)	on failover	Minimal (DRAM)	Replication to another node's DRAM
Continuous checkpoint (Microsoft)	every 1–5 min	Low (delta)	Changed shards only
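
To compare the blocking and non-blocking strategies above, it helps to estimate how long a full checkpoint stalls training. The sketch below is a back-of-the-envelope model; the 420 GB state size (weights plus optimizer state for a 70B model, taken from the training memory breakdown in this document), the 100 GB/s write bandwidth, and the 30-minute interval are illustrative assumptions.

```python
def full_checkpoint_overhead(state_gb: float, write_gb_per_s: float, interval_s: float):
    """Return (stall seconds, fraction of wall time lost) for a blocking checkpoint."""
    stall_s = state_gb / write_gb_per_s
    lost_fraction = stall_s / (interval_s + stall_s)
    return stall_s, lost_fraction


# Assumed inputs: 140 GB weights + 280 GB optimizer state, 100 GB/s storage,
# one checkpoint every 30 minutes of training.
stall, lost = full_checkpoint_overhead(state_gb=420, write_gb_per_s=100, interval_s=1800)
print(f"stall {stall:.1f} s per checkpoint, {lost:.2%} of wall time")
```

An async or sharded checkpoint moves most of that stall off the critical path, at the cost of staging memory.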

AI cluster architecture

Physical topology — DGX H100 example

┌───────── DGX H100 (8× GPU) ─────────┐
│   ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐   │
│   │GPU 0│ │GPU 1│ │GPU 2│ │GPU 3│   │
│   └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘   │
│   ┌──┴──┐ ┌──┴──┐ ┌──┴──┐ ┌──┴──┐   │
│   │GPU 4│ │GPU 5│ │GPU 6│ │GPU 7│   │
│   └─────┘ └─────┘ └─────┘ └─────┘   │
│   NVSwitch (NVLink 4.0, 900 GB/s)   │
│   InfiniBand CX-7: 8× NDR400        │
└──────────────────┬──────────────────┘
                   │ 8× IB rails
        ┌──────────┴─────────┐
        │ IB NDR400 Switches │  (rail-optimized)
        └────────────────────┘

Kubernetes for AI

Component	Role
Volcano	Batch scheduling, gang scheduling, queue management
Kueue	Multi-tenant admission, resource quotas, fair sharing
NVIDIA GPU Operator	Driver, container toolkit, MIG, DCGM, monitoring
HAMi (ex k8s-vGPU-scheduler)	GPU sharing, MIG partitioning, fractional GPU
Node Feature Discovery	GPU type detection, NUMA topology
Topology Manager	NUMA-aware pod placement
DPDK / SR-IOV	High-performance networking for GPU Direct RDMA

Slurm for AI

Component	Role
slurm.conf	Partition for GPU nodes, GRES (Generic Resource)
gres.conf	GPU type, GPU count per node
srun --gres=gpu:8	Allocate 8 GPUs per job
sbatch --nodes=64 --ntasks=512	64 nodes, 512 ranks (8 GPU/node)
Pyxis	NVIDIA SPANK plugin for Slurm that launches containers through Enroot
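
The relationship between the flags in the table (nodes × GPUs per node = ranks) can be captured in a small generator for the `#SBATCH` header. This is a sketch under the assumption of one rank per GPU; the function name and the default job name are hypothetical.

```python
def sbatch_header(nodes: int, gpus_per_node: int = 8, job_name: str = "train") -> str:
    """Build #SBATCH directives for a job that runs one rank per GPU."""
    ntasks = nodes * gpus_per_node  # e.g. 64 nodes x 8 GPUs = 512 ranks
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --ntasks-per-node={gpus_per_node}",
        f"#SBATCH --gres=gpu:{gpus_per_node}",
    ]
    return "\n".join(lines)


header = sbatch_header(nodes=64)
print(header)
```

Deriving `--ntasks` from the node count avoids the common mismatch where the rank count and the GRES request drift apart.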

AI cluster cooling

Power density comparison

Configuration	TDP per node	Racks	kW/rack	Note
Standard server (2U)	1 kW	20	5–10	Typical DC
GPU server (DGX H100, 6×)	42 kW	6	45–50	Air cooling limit
GPU server (DGX B200, 6×)	72 kW	6	90–100	Liquid cooling required
GPU server (GB200 NVL72)	120 kW	—	~120	Liquid cooling mandatory
NVIDIA NVL72 rack	120 kW	1	120	Fully liquid cooled

Cooling technologies

Method	Max kW/rack	CAPEX	OPEX	Complexity
Air cooling (CRAC/CRAH)	< 15	Low	Medium	Low
Air cooling (in-row)	15–30	Medium	Medium	Low
Rear-door heat exchanger	30–50	Medium	Low	Medium
Direct-to-chip liquid (cold plate)	50–150	High	Low	High
Immersion (single-phase)	100–200	High	Low	High
Immersion (two-phase)	200+	Very high	Low	Very high

Inference infrastructure

Inference server comparison

Tool	Frameworks	Optimization	Use case
vLLM	Megatron, HF, AWQ, GPTQ	PagedAttention, KV cache, continuous batching	LLM inference (open source)
TensorRT-LLM	TensorRT	INT4/INT8/FP8, inflight batching, attention optimizations	Production (NVIDIA)
Triton Inference Server	All (TensorRT, vLLM, PyTorch)	Model ensemble, model caching, concurrent execution	Enterprise, multi-model
SageMaker	Managed	Auto-scaling, model parallelism	AWS managed
OpenAI API / TGI	HF Transformers	Continuous batching, flash attention	Hosting

Inference optimization

Technique	Latency improvement	Throughput improvement	Memory reduction
FP8/INT8 quantization	—	2×	2×
INT4 quantization	—	4×	4×
Flash Attention 2/3	2–4×	—	Attention activation memory O(n²) → O(n) in sequence length (KV cache size unchanged)
PagedAttention	—	2–5×	95 % (KV cache fragmentation)
Continuous batching	—	10–20×	—
Speculative decoding	2–3×	—	—
Multi-LoRA / S-LoRA	—	8–16×	—

Distributed training techniques

Technique	Description	Frameworks
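Data Parallelism (DDP/FSDP)	Each GPU processes a different batch; DDP keeps a full model replica per GPU, FSDP shards parameters, gradients, and optimizer state	PyTorch DDP, FSDP
Tensor Parallelism (TP)	Weight matrices within each layer split across GPUs (usually intra-node over NVLink)	Megatron-LM, DeepSpeed
Pipeline Parallelism (PP)	Consecutive groups of layers (stages) placed on different GPUs or nodes	Megatron-LM, DeepSpeed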
Sequence Parallelism (SP)	Sequence split across GPUs	Megatron-LM
Expert Parallelism (EP)	Different expert subnets on different GPUs	Mixture-of-Experts (MoE)
3D Parallelism	TP + PP + DP combination	Megatron-LM, NeMo
ZeRO (1/2/3)	Optimizer/gradient/parameter sharding	DeepSpeed
NCCL / RCCL	GPU collective communication library	NVIDIA/AMD
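
For 3D parallelism, the three degrees must multiply to the total GPU count. The sketch below derives the data-parallel degree and maps a global rank to its coordinates. The rank ordering (TP varying fastest, so a TP group of 8 stays inside one 8-GPU node) is an assumption for illustration; real frameworks define their own ordering.

```python
def data_parallel_degree(world_size: int, tp: int, pp: int) -> int:
    """DP degree left over once TP and PP groups are carved out of the world size."""
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    return world_size // model_parallel


def rank_to_coords(rank: int, tp: int, pp: int) -> tuple[int, int, int]:
    """Map a global rank to (dp_rank, pp_rank, tp_rank), TP varying fastest."""
    tp_rank = rank % tp
    pp_rank = (rank // tp) % pp
    dp_rank = rank // (tp * pp)
    return dp_rank, pp_rank, tp_rank


dp = data_parallel_degree(world_size=512, tp=8, pp=4)
coords = rank_to_coords(9, tp=8, pp=4)
print(dp, coords)
```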

Operating systems for AI

Distribution comparison

OS	GPU driver	CUDA	Container toolkit	IB/RoCE	Lustre client	Production support
Ubuntu 22.04 LTS	NVIDIA 525+	12.x	nvidia-container-toolkit	MLNX_OFED, rdma-core	Yes (lustre-client)	NVIDIA DGX standard
Ubuntu 24.04 LTS	NVIDIA 550+	12.5+	nvidia-container-toolkit	MLNX_OFED, rdma-core	Yes	Latest GPU support
RHEL 9 / Rocky 9	NVIDIA 525+	12.x	nvidia-container-toolkit	MLNX_OFED	Yes (EL repo)	Red Hat, enterprise
DGX OS (Ubuntu-based)	NVIDIA custom	12.x	Pre-installed	Pre-configured	Yes	NVIDIA DGX only supported
SLES 15 SP5	NVIDIA 525+	12.x	nvidia-container-toolkit	MLNX_OFED	Yes	HPC, some Lustre clusters
Debian 12	NVIDIA 525+	12.x	nvidia-container-toolkit	rdma-core	Yes (backports)	Community, research
Flatcar / Bottlerocket	Container-host	—	nvidia-container-toolkit	Limited	No	K8s-only, minimal footprint

Limitations and constraints

GPU drivers and CUDA

Constraint	Detail
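Driver-CUDA compatibility	NVIDIA driver must be at least the minimum version required by the CUDA toolkit (newer drivers run older toolkits). E.g., on Linux CUDA 12.4 ships with driver 550, and any CUDA 12.x toolkit needs driver ≥ 525.60.13 via minor-version compatibility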
Kernel version	NVIDIA driver not compatible with all kernels. New kernel (6.8+) may require DKMS build or delayed support
Secure Boot	NVIDIA driver requires signed module (MOK, shim) or disabled Secure Boot — common enterprise issue
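Open vs Proprietary driver	NVIDIA open GPU kernel modules (`nvidia-open`, since R515) support Turing and newer GPUs, including datacenter parts (A100, H100); pre-Turing GPUs require the proprietary kernel module
nvidia-persistenced	Keeps the driver state initialized while no client is attached; without it each job start pays GPU re-initialization latency and settings such as power limits may be reset. Legacy alternative: `nvidia-smi -pm 1`
GPU reset	After a crashed training job the GPU may hang; recover with `nvidia-smi --gpu-reset`, a node reboot, or in some cases a power cycle
Multi-instance GPU (MIG)	Requires a MIG-capable driver and MIG mode enabled on the GPU; toggling MIG mode needs a GPU reset with no running workloads. Supported on A100, A30, H100, H200, and B200 class GPUs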

Network (InfiniBand / RoCE)

Constraint	Detail
MLNX_OFED vs rdma-core	MLNX_OFED (NVIDIA) — full support, but own kernel modules, kernel version compatibility needed. `rdma-core` (open) — limited support, no custom modules
Kernel compatibility	MLNX_OFED supports only specific kernel versions (major.minor). Kernel upgrade → MLNX_OFED rebuild required
NCCL	NCCL version must be compatible with CUDA and IB firmware. `nccl-tests` for validation
SHARP	In-network reduction requires specific MLNX_OFED + IB switch firmware combination
GPU Direct RDMA	Requires `nvidia-peermem` module + MLNX_OFED. Does not work with all GPU and IB card combinations
RoCE PFC/ECN	RoCE requires lossless fabric (PFC, ECN, DCQCN). Switch and host configuration — complex tuning

Storage

Constraint	Detail
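Lustre client	Client version must be interoperable with the server release, so server upgrades usually force a client upgrade across the cluster. Client packages are built per kernel and target mainly RHEL-compatible, SLES, and Ubuntu distributions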
POSIX locking	NFS and Lustre have different POSIX locking behavior. Distributed training relies on flock → problematic with mixed FS
Filesystem cache	Page cache can mask IO bottlenecks. Training jobs often require `O_DIRECT` or sync IO
Local NVMe vs parallel FS	Dataset staging on local NVMe eliminates network dependency but requires space and pre-fetch pipeline

Container runtime

Constraint	Detail
Docker + GPU	`nvidia-container-toolkit` (formerly nvidia-docker2). Requires runtime installation and config in `/etc/docker/daemon.json`
Podman + GPU	Requires `nvidia-container-toolkit` + podman hook. Less tested than Docker
containerd + GPU	Standard for K8s. Requires `cdi` (Container Device Interface) or `nvidia-container-runtime`
Enroot + Pyxis	NVIDIA container stack for Slurm (Enroot = daemonless container runtime, Pyxis = Slurm plugin)
User namespace mapping	Container GPU access requires device cgroup; rootless may fail (exception for /dev/dri and /dev/nvidia*)

Kernel parameters

# AI workload recommended sysctl
# (comments on their own lines: sysctl config files do not accept trailing comments)
# Socket buffers up to 128 MiB, sufficient for NCCL
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
# Larger NAPI budget for high packet rates
net.core.netdev_budget = 600
# PyTorch DataLoader workers
vm.max_map_count = 1048576
# Disable automatic NUMA balancing (breaks locality)
kernel.numa_balancing = 0
# Kernels before 5.13 only; newer kernels moved this tunable to debugfs
kernel.sched_min_granularity_ns = 10000000

# Kernel command-line parameters (set in the boot loader, not via sysctl)
# Disable CPU security mitigations for performance (dedicated AI clusters only)
mitigations=off
# Or madvise; THP may cause latency spikes
transparent_hugepage=never
# Reduce C-state transition latency (Intel CPUs)
intel_idle.max_cstate=1

Firmware and HW

Constraint	Detail
GPU firmware (VBIOS)	NVIDIA datacenter GPUs (H100, B200) have VBIOS updates via NVFlash. Without update → missing partitioning support or newer CUDA features
InfiniBand firmware	IB switch and HCA firmware must be compatible. Mix old switch + new HCA → degraded perf
NVSwitch firmware	DGX systems have NVSwitch firmware updatable only via NVIDIA DGX tools
Power capping (nvidia-smi)	`nvidia-smi -pl <power>` — limit TDP for power budget management. Test impact on training throughput
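GPU clock locking	`nvidia-smi -lgc <min,max>` locks GPU clocks and `nvidia-smi -ac <mem_clock,graphics_clock>` sets application clocks for stable benchmarks. Apply with persistence mode enabled (`nvidia-persistenced`), otherwise the setting is lost when the driver unloads
PCIe Gen	GPU in a PCIe Gen4 slot (instead of Gen5) halves CPU↔GPU transfer bandwidth. Matters for data loading and for CPU offload (e.g., FSDP or ZeRO offload)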

Recommended OS per use case

Use case	OS	Rationale
DGX cluster (production)	DGX OS / Ubuntu 22.04 LTS	NVIDIA standard, best driver support
Enterprise K8s (OpenShift)	RHEL 9 / RHCOS	Red Hat support, GPU Operator compatible
Vanilla K8s (on-prem)	Ubuntu 22.04 LTS + Flatcar (workers)	Widest community support, Flatcar for minimal footprint
Slurm cluster (HPC/AI)	Rocky Linux 9 / Ubuntu 22.04 LTS	EL ecosystem (Lustre, OFED) or Ubuntu (community)
Research / rapid prototyping	Ubuntu 24.04 LTS	Latest CUDA, PyTorch, driver support
Edge inference	NVIDIA JetPack / Ubuntu (ARM)	Embedded GPU (Jetson Orin, AGX)

AI-ready data center — check-list

Area	Requirement
Power	30–120 kW/rack, HVDC (400 V DC), UPS supporting GPU spikes
Cooling	Liquid cooling ready (direct-to-chip), rear-door for 30+ kW
Network	InfiniBand (NDR/XDR) or RoCEv2, rail-optimized fat-tree
Storage	Parallel FS (Lustre/Weka), checkpoint bandwidth > 100 GB/s
GPU density	Max GPU/rack, minimize NVSwitch hops
Physical	Floor load 1,500+ kg/m², rack 52U–60U
Security	Tenant isolation, network segmentation, data encryption
Monitoring	DCGM, NCCL health checks, thermals, power capping

Model and throughput limitations

Model size per GPU

Maximum model size fitting on a single GPU depends on HBM capacity and precision:

GPU	HBM	FP32	FP16/BF16	INT8	INT4
H100 80GB	80 GB	~20B	~40B	~80B	~160B
H200 141GB	141 GB	~35B	~70B	~140B	~280B
B200 192GB	192 GB	~48B	~96B	~192B	~384B
MI300X 192GB	192 GB	~48B	~96B	~192B	~384B
A100 80GB	80 GB	~20B	~40B	~80B	~160B
GB200 (192+480)	192 GB GPU + 480 GB Grace	—	~96B + CPU offload	—	—

Approximate: 1B params ≈ 4 GB FP32 ≈ 2 GB FP16/BF16 ≈ 1 GB INT8 ≈ 0.5 GB INT4. The table counts weights only and uses the full HBM capacity; reserve ~10–15 % of HBM for activations and KV cache during inference. Training additionally needs gradients and optimizer states (see the training memory breakdown).
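
The bytes-per-parameter rule above translates directly into a capacity estimate. A minimal sketch, counting weights only; the function name and the optional `reserve_fraction` parameter (HBM held back for activations and KV cache) are illustrative.

```python
BYTES_PER_PARAM = {"FP32": 4.0, "FP16/BF16": 2.0, "INT8": 1.0, "INT4": 0.5}


def max_params_billions(hbm_gb: float, precision: str, reserve_fraction: float = 0.0) -> float:
    """Largest weights-only model (in billions of parameters) that fits in HBM.

    1B parameters at 1 byte each occupy 1 GB, so GB / bytes-per-param
    gives billions of parameters directly.
    """
    usable_gb = hbm_gb * (1.0 - reserve_fraction)
    return usable_gb / BYTES_PER_PARAM[precision]


for gpu, hbm in [("H100", 80), ("H200", 141), ("B200", 192)]:
    row = {p: round(max_params_billions(hbm, p)) for p in BYTES_PER_PARAM}
    print(gpu, row)

# With 15 % of HBM reserved, an 80 GB GPU holds about 34B parameters in FP16.
print(max_params_billions(80, "FP16/BF16", reserve_fraction=0.15))
```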

Memory breakdown inference

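Component	Llama 3 70B (FP16)	Llama 3 8B (FP16)
Model weights	140 GB	16 GB
KV cache (4K context, batch 1)	~1.3 GB	~0.5 GB
KV cache (128K context, batch 1)	~43 GB	~17 GB
Activations (peak)	~5 GB	~1 GB
Total 4K ctx	~146 GB	~17.5 GB
Total 128K ctx	~188 GB	~34 GB

KV cache figures assume the published Llama 3 shapes (70B: 80 layers, 8 KV heads, head dimension 128; 8B: 32 layers, 8 KV heads, head dimension 128) with FP16 keys and values.

Conclusion: Llama 3 70B FP16 (140 GB of weights) does not fit on a single H100 (80 GB). Options: INT8 (~70 GB of weights → 2× H100 or 1× H200 to leave room for KV cache), INT4 (~35 GB → 1× H100), or FP16 with tensor parallelism across multiple GPUs.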

Context length vs memory

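Context	KV cache 70B (FP16)	KV cache 8B (FP16)	Note
4K	~1.3 GB	~0.5 GB	Typical chat
32K	~11 GB	~4.3 GB	Documents
128K	~43 GB	~17 GB	Long-context (Claude, Gemini)
1M	~344 GB	~137 GB	Experimental (Gemini 1.5 Pro)

Values are per sequence and assume Llama 3 shapes with grouped-query attention (8 KV heads, head dimension 128; 80 layers for 70B, 32 for 8B). KV cache size = 2 × layers × KV heads × head dimension × context length × batch size × bytes per value, so it grows linearly with context length, batch size, and KV head count. It is the attention compute, not the cache, that grows quadratically with context length. Critical for long-context inference.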
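
KV cache size can be estimated from the model shape. The sketch below assumes the published Llama 3 configurations (70B: 80 layers, 8 KV heads, head dimension 128; 8B: 32 layers, 8 KV heads, head dimension 128) and FP16 storage; results for other models, or rounded figures quoted elsewhere, may differ.

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                context_tokens: int, batch_size: int = 1,
                bytes_per_value: int = 2) -> float:
    """KV cache size in GB (10^9 bytes): one K and one V tensor per layer."""
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * context_tokens * batch_size * bytes_per_value)
    return total_bytes / 1e9


# Assumed model shapes (grouped-query attention with 8 KV heads).
LLAMA3_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128)
LLAMA3_8B = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for label, tokens in [("4K", 4096), ("32K", 32768), ("128K", 131072)]:
    big = kv_cache_gb(**LLAMA3_70B, context_tokens=tokens)
    small = kv_cache_gb(**LLAMA3_8B, context_tokens=tokens)
    print(f"{label}: 70B {big:.1f} GB, 8B {small:.1f} GB")
```

Doubling the context or the batch size doubles the result, which is the linear scaling that makes long-context serving memory-bound.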

Throughput inference

Model	GPU	Precision	Batch size	Tokens/s (aggregate)	Tokens/s per request	QPS (1K output)
Llama 3 8B	H100	FP16	1	~800	~800	~0.8
Llama 3 8B	H100	FP16	128	~4,500	~35	~4.5
Llama 3 8B	H100	INT4	128	~8,000	~62	~8
Llama 3 70B	4× H100	FP16	1	~180	~180	~0.18
Llama 3 70B	4× H100	INT4	64	~1,200	~19	~1.2
Llama 3 70B	8× H100	FP16 (TP=8)	128	~2,500	~20	~2.5
DeepSeek-R1 671B	8× H200	FP8 (MoE)	64	~500	~8	~0.5
GPT-4 class (est.)	—	—	—	~100–300	—	~0.1–0.3

Notes:

QPS (queries per second) = aggregate tokens/s ÷ output length; with 1K-token outputs, 1,000 tokens/s ≈ 1 query/s
Tokens/s per request = aggregate tokens/s ÷ batch size
Larger batches increase aggregate throughput but also increase TTFT (time to first token) and per-request latency
Tensor Parallelism (TP) scales single-model throughput, but all-reduce communication overhead grows with the TP degree

Training limits

Scaling efficiency

GPU count	Model	Efficiency	Reason
8 (1 node)	Llama 3 8B	~95 %	NVLink intra-node
64 (8 nodes)	Llama 3 8B	~85 %	IB inter-node
512 (64 nodes)	Llama 3 70B	~75 %	Communication overhead
4 096 (512 nodes)	Llama 3 70B	~60 %	Pipeline bubble, network
16 384 (2 048 nodes)	Llama 3 405B	~45 %	Synchronous SGD overhead

Note: Efficiency = actual throughput ÷ (GPU count × single-GPU throughput), i.e. the fraction of ideal linear speedup achieved. In this table it falls roughly logarithmically with GPU count.
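
The efficiency definition can be written out explicitly. This is a small sketch; the throughput numbers in the first call are hypothetical, while the second call reuses the 4,096-GPU row from the table.

```python
def scaling_efficiency(cluster_throughput: float, num_gpus: int,
                       single_gpu_throughput: float) -> float:
    """Measured throughput divided by ideal linear scaling."""
    return cluster_throughput / (num_gpus * single_gpu_throughput)


def effective_gpus(num_gpus: int, efficiency: float) -> float:
    """Number of ideal GPUs that deliver the same throughput."""
    return num_gpus * efficiency


# Hypothetical measurement: 8 GPUs reach 7,600 samples/s, one GPU reaches 1,000.
eff = scaling_efficiency(7600, 8, 1000)
print(f"efficiency: {eff:.0%}")

# Table row: 4,096 GPUs at ~60 % efficiency do the work of about 2,458 ideal GPUs.
print(round(effective_gpus(4096, 0.60)))
```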

Memory breakdown training

Component	Llama 3 70B (BF16)	Llama 3 8B (BF16)
Model weights	140 GB	16 GB
Optimizer states (Adam)	280 GB	32 GB
Gradients	140 GB	16 GB
Activations (peak)	~30 GB	~4 GB
Total (DDP)	~590 GB	~68 GB
Total (FSDP shard=8)	~74 GB	~8.5 GB

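Conclusion: sharding (FSDP, Fully Sharded Data Parallel, or DeepSpeed ZeRO) is required to train models above ~10B parameters on 80 GB GPUs. With Adam, training state is ~4× the inference weights (weights + gradients + two optimizer moments), before activations. The table assumes BF16 optimizer moments; FP32 moments plus FP32 master weights raise the optimizer share to ~12 bytes per parameter. The FSDP row divides the whole total by 8; activations are not sharded in practice, so treat it as a lower bound.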
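
The breakdown above follows from a per-parameter byte count. The sketch below uses the table's assumptions (2 bytes each for weights and gradients, 4 bytes for two BF16 Adam moments) but keeps activations unsharded, so its FSDP figure (100 GB) is higher than the table's ~74 GB, which divides the full total by 8. Function and parameter names are illustrative.

```python
def training_state_gb(params_billions: float, weight_bytes: float = 2,
                      grad_bytes: float = 2, optim_bytes: float = 4) -> float:
    """Weights + gradients + optimizer state in GB (1B params x 1 byte = 1 GB)."""
    return params_billions * (weight_bytes + grad_bytes + optim_bytes)


def per_gpu_gb(params_billions: float, activations_gb: float,
               shards: int = 1, **byte_sizes: float) -> float:
    """Per-GPU footprint: sharded training state plus unsharded activations."""
    return training_state_gb(params_billions, **byte_sizes) / shards + activations_gb


print(per_gpu_gb(70, activations_gb=30))            # DDP: full replica per GPU
print(per_gpu_gb(70, activations_gb=30, shards=8))  # FSDP across 8 GPUs
# FP32 master weights + two FP32 moments: 12 optimizer bytes per parameter.
print(training_state_gb(70, optim_bytes=12))
```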

Time to train

Model	GPU count	GPU type	Precision	Time	Cost (on-prem estimate)
Llama 3 8B	64	H100	BF16	~3 days	~$5 000
Llama 3 70B	512	H100	BF16	~14 days	~$100 000
Llama 3 405B	16 384	H100	BF16	~60 days	~$14 M
DeepSeek-R1 671B (MoE)	2 048	H800	BF16	~30 days	~$6 M
GPT-4 (est.)	25 000	A100/H100	Mixed	~90–100 days	~$100 M

Power and thermal limits

Configuration	TDP limit	Throughput loss	Reason
H100 SXM	700 W (default)	0 %	Nominal
H100 SXM	600 W (-15 %)	~5–8 %	Power capping
H100 SXM	500 W (-30 %)	~15–25 %	Significant throttling
H100 SXM	400 W (-43 %)	~30–50 %	Emergency only
DGX H100 (8×)	5.6 kW (max)	0 %	Nominal (GPU power only; whole system up to ~10.2 kW, air-cooled)
DGX H100 (8×)	4.5 kW (air)	~10–15 %	Rear-door heat exchanger

GPU throttles when exceeding TDP or temperature (85°C+). Power capping correlates linearly with frequency but non-linearly with throughput.

API and operational limits

Limit	Description	Typical value
Rate limit	Max requests per minute/hour	100–10 000 RPM (per tier)
Tokens per minute (TPM)	Max tokens per minute	1M–300M (per model)
Context window	Max input tokens	4K–2M (per model)
Max output tokens	Max generated tokens	4K–32K (per model)
Concurrent requests	Parallel request count	10–10 000 (per backend)
Batch window	Time to accumulate batch	0–20 s (vLLM, TGI)
TTFT timeout	Max latency to first token	30–120 s
Idle timeout	GPU idle → scale to 0	5–15 min (cloud)

Limits per deployment model

Dimension	On-prem HW	Managed cloud (SageMaker, Vertex)	API (OpenAI, Anthropic)
Model size	Limited by HBM (max 192 GB/GPU)	Unlimited (cluster scaling)	Unlimited
Queries	Limited by GPU count	Auto-scaling	Rate limit (per tier)
Latency	< 10 ms (same node)	10–100 ms (network hop)	100 ms – 10 s
Customization	Full (fine-tuning, quantization)	Managed (SageMaker, Bedrock)	Prompt engineering only
Data privacy	Yes (on-prem)	Contractual (region, encryption)	Limited
Cost per 1M tokens	~$0.10–0.50 (FP16 inference)	~$0.20–1.00	~$0.15–15.00
Max context	128K+ (depending on GPU count)	128K+	32K–2M
Cold start	0 (always-on)	30 s – 5 min	0 (shared infra)

GPU pricing and price/performance (2026)

Prices are approximate — NVIDIA does not publish official datacenter GPU price lists. Cloud prices from public providers (Q2 2026). HW purchase prices vary by volume, reseller, and region.

Purchase price (buy)

GPU	Price/GPU	Price 8× GPU baseboard	$/PFLOPS (FP16)	Note
H100 SXM	$27,000–40,000	~$200,000	$25,000	Scarcity 2023–2024, now stabilized
H200 SXM	$35,000–50,000	~$280,000	~$35,000	H100 upgrade, HBM3e
B200	~$60,000–70,000	~$500,000+	~$31,000	Blackwell, FP4 support
B100	~$30,000	~$240,000	~$20,000	Lower price than B200, similar FP8 perf
GB200 (Grace+Blackwell)	~$70,000–100,000	~$2,000,000 (rack)	—	CPU+GPU unified, high-density
A100 80GB	~$10,000–15,000	~$120,000	~$19,200	Previous gen, still relevant
MI300X	~$12,000–18,000	~$100,000	~$9,600	AMD, 192 GB HBM3
Gaudi 3	~$15,625	~$125,000	$8,515	Intel, best $/PFLOPS
L40S	~$8,000–10,000	—	—	Inference, enterprise

Cloud pricing (on-demand $/GPU/hr)

GPU	Cheapest	Mid-range (CoreWeave, Lambda)	Hyperscaler (AWS, GCP, Azure)
H100 SXM	$1.38 (Thunder)	$2.89–3.29	$4.15–6.88
H100 PCIe	$2.01 (Spheron)	$2.50	—
H200 SXM	$3.89 (Spheron)	$4.54	$5.00+
B200	$3.39 (Spheron)	$6.02	$14.24 (AWS)
B200 spot	$2.12 (Spheron)	—	—
GB200	$3.50 (Runcrate)	$5.85 (Oracle)	$6.95 (GCP)
MI300X	$1.50 (TensorWave)	$1.85 (Vultr)	$7.86 (Azure)
A100 80GB	$1.07 (Spheron)	$1.50–2.00	$3.00+
Gaudi 3	~$1.50–2.50	—	—
L40S	$0.91 (Spheron)	$1.50–2.00	—

Inference cost ($/M tokens)

GPU	Provider	$/hr	Est. tok/s	$/M tok
B200	Spheron	$3.39	~4,000	$0.24
B200 spot	Spheron	$2.12	~4,000	$0.15
H100 PCIe	Spheron	$2.01	~1,200	$0.47
A100 80GB	Spheron	$1.07	~520	$0.57
H100 SXM	AWS	$6.88	~1,200	$1.59
H200 SXM	Mid-range	$4.54	~1,800	$0.70
L40S	Spheron	$0.91	~450	$0.56

Values for Llama 3 70B (INT8, batch=1, output 1K tok). Actual values vary by batch size, context, and quantization.
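
The $/M tok column is derived from the hourly price and the estimated throughput. A minimal sketch of that arithmetic, with an illustrative function name and inputs taken from the table rows:

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Dollar cost to generate one million tokens at a sustained token rate."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


rows = [
    ("B200 on-demand", 3.39, 4000),
    ("B200 spot", 2.12, 4000),
    ("H100 PCIe", 2.01, 1200),
    ("H100 SXM (AWS)", 6.88, 1200),
]
for name, price, tok_s in rows:
    print(f"{name}: ${cost_per_million_tokens(price, tok_s):.2f}/M tok")
```

Because the throughput is an estimate at batch 1, the same formula with a batched token rate gives a much lower cost per token.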

Cost per GB HBM

GPU	HBM	Price/hr cloud	$/GB/hr	Best for memory-bound workloads
MI300X	192 GB	$1.50	$0.0078	✅ Best
B200	192 GB	$3.39	$0.0177	✅ Good
H200	141 GB	$3.89	$0.0276	⚠️
H100 SXM	80 GB	$1.38	$0.0173	⚠️ Only up to 70B models
GB200	384 GB	$3.50	$0.0091	✅✅ (2× MI300X capacity)

Price/performance by scenario

Scenario	Winner	Rationale
Absolute performance (cost no object)	GB200 DGX NVL72	72× GPU, ~18 PFLOPS FP8 and 384 GB HBM per GB200 superchip (2 GPUs)
Cloud inference — best $/token	B200 spot	$0.15/M tok; ~3.3× H100 throughput (~4,000 vs ~1,200 tok/s) at lower cost per token
Cloud inference — on-demand	B200	$0.24/M tok ($3.39/hr at ~4,000 tok/s)
Cloud inference — budget	A100 / L40S	$0.57–0.56/M tok
Training — price/perf on purchase	Gaudi 3	$8,515/PFLOPS, 2.5–3× better than H100
Training — cloud	H100 SXM	$1.38/hr, CUDA ecosystem, NCCL
Memory-bound — long context, 70B+	MI300X / GB200	192–384 GB, $0.0078–0.0091/GB
Ecosystem + safe choice	H100/H200	CUDA, widest SW, NVIDIA tools
Spot / preemptible — lowest cost	A100 / H100	$1.07–1.38/hr, 50–90% off on-demand

2026 Trends

H100 — price dropped 64% from peak $8/hr to $1.38–2.89/hr, then 40% rebound from inference demand
B200 — new high-end, $3.39/hr cloud → ~$0.15/M tok on spot — new inference benchmark
MI300X — supply growing (TensorWave, Vultr, CoreWeave, Oracle, Azure), from $1.50/hr
Gaudi 3 — best $/PFLOPS on purchase, but narrow ecosystem and limited cloud availability
Market bifurcation — prior gen (H100, A100) commoditizing, new gen (B200, GB200) commanding premium

Related documents

GPU.en.md — GPU architecture, NVIDIA/AMD, vGPU, MIG
NETWORKING.en.md — InfiniBand, RoCE, network topology
STORAGE.en.md — parallel filesystem, object store
DATACENTERS.en.md — DC layout, power, cooling
CLOUD.en.md — cloud AI services (SageMaker, Vertex AI)

Sources

Links, books, and standards: sources/infrastructure/sources.en.md

Last revision: 2026-06-18
