🧠 AI/ML Infrastructure
Component overview
GPU compute
NVIDIA
| GPU | Architecture | FP8 | FP16/BF16 | FP64 | HBM | NVLink | TDP | Rack config |
|---|---|---|---|---|---|---|---|---|
| H100 SXM | Hopper | 3,958 TFLOPS | 1,979 TFLOPS | 67 TFLOPS | 80 GB HBM3 | 900 GB/s | 700 W | 6–8× in DGX H100 |
| H200 SXM | Hopper (HBM3e) | 3,958 TFLOPS | 1,979 TFLOPS | 67 TFLOPS | 141 GB HBM3e | 900 GB/s | 700 W | 6–8× in DGX H200 |
| B200 | Blackwell | ~9,000 TFLOPS | ~4,500 TFLOPS | ~40 TFLOPS | 192 GB HBM3e | 1,800 GB/s | 1,000 W | 6–8× in DGX B200 |
| GB200 Grace Blackwell | Blackwell | ~18,000 TFLOPS | ~9,000 TFLOPS | — | 192 GB + 480 GB (Grace) | NVLink-C2C | 1,000 W (GPU) + 500 W (CPU) | DGX GB200 (36× GPU) |
| L40S | Ada Lovelace | 733 TFLOPS | 367 TFLOPS | — | 48 GB GDDR6 | N/A | 350 W | Inference, enterprise |
| A100 SXM | Ampere | — (no FP8; 1,248 TOPS INT8) | 624 TFLOPS | 19.5 TFLOPS | 80 GB HBM2e | 600 GB/s | 400 W | DGX A100 |
AMD
| GPU | Architecture | FP8 | FP16/BF16 | FP64 | HBM | Infinity Fabric | TDP |
|---|---|---|---|---|---|---|---|
| MI300X | CDNA 3 | 2,615 TFLOPS | 1,307 TFLOPS | 81 TFLOPS | 192 GB HBM3 | 896 GB/s | 750 W |
| MI250 | CDNA 2 | — | 383 TFLOPS | 95.7 TFLOPS | 128 GB HBM2e | 400 GB/s | 500 W |
Intel
| GPU | Architecture | FP16/BF16 | FP32 | HBM | TDP |
|---|---|---|---|---|---|
| Gaudi 3 | Custom | 1,835 TFLOPS | — | 128 GB HBM2e | 600 W |
| Max 1550 | Xe HPC | 600+ TFLOPS | 200 TFLOPS | 128 GB HBM2e | 600 W |
Cloud ASIC
| ASIC | Provider | Use case | Performance |
|---|---|---|---|
| TPU v5p | Google | Training | ~459 TFLOPS (BF16) per chip |
| Trainium 2 | AWS | Training | ~1,000 TFLOPS (BF16) per chip |
| Inferentia 2 | AWS | Inference | ~400 TOPS (INT8) per chip |
| Maia 100 | Microsoft | Training + inference | Custom, 800 W TDP |
AI networking
Technology comparison
| Technology | Bandwidth per link | Latency | Topology | Use case |
|---|---|---|---|---|
| InfiniBand NDR200 | 200 Gb/s | < 1 µs | Fat-tree, Dragonfly+ | Training (NVIDIA) |
| InfiniBand NDR400 | 400 Gb/s | < 1 µs | Fat-tree, Dragonfly+ | Training (NVIDIA) |
| InfiniBand XDR | 800 Gb/s (planned) | < 1 µs | Dragonfly+ | Next-gen training |
| RoCEv2 (CX-7/8) | 200–400 Gb/s | 1–2 µs | Fat-tree, Spine-leaf | Training (AMD, Intel, open) |
| NVLink 4.0 | 900 GB/s per GPU | < 0.5 µs | NVSwitch full-mesh | Intra-node GPU comm |
| NVLink 5.0 | 1,800 GB/s per GPU | < 0.5 µs | NVSwitch full-mesh | Intra-node (Blackwell) |
| Ethernet (400 GbE) | 400 Gb/s | 2–5 µs | Spine-leaf | Inference, data pipeline |
AI fabric principles
- Rail-optimized topology — each GPU communicates on dedicated "rails" (same GPU indices across nodes connect to the same switch)
- Fat-tree (Clos) — standard for InfiniBand and RoCE, non-blocking bisection bandwidth
- Dragonfly+ — reduces hop count while maintaining bandwidth (used in largest clusters)
- GPU Direct RDMA — direct GPU ↔ GPU communication without CPU involvement, supports InfiniBand and RoCE
- SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) — in-network reduction for AllReduce (InfiniBand only)
Bandwidth sizing
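As a rough sizing aid, a bandwidth-optimal ring AllReduce sends about 2(N−1)/N times the payload through each GPU's network link. The sketch below turns that into a lower-bound time per AllReduce. The function name and the example numbers are illustrative assumptions, and real runs are slower because of latency, protocol overhead, and imperfect overlap with compute.

```python
def ring_allreduce_seconds(payload_bytes: float, n_gpus: int, link_gbit_s: float) -> float:
    """Lower-bound time for one ring AllReduce over links of link_gbit_s Gb/s."""
    if n_gpus < 2:
        return 0.0  # nothing to exchange with a single GPU
    link_bytes_s = link_gbit_s * 1e9 / 8                  # Gb/s -> bytes/s
    traffic = 2 * (n_gpus - 1) / n_gpus * payload_bytes   # bytes sent per GPU
    return traffic / link_bytes_s


# Assumed example: 16 GB of BF16 gradients, 64 GPUs, one 400 Gb/s link per GPU
t = ring_allreduce_seconds(16e9, 64, 400)
print(f"{t:.2f} s per AllReduce")  # 0.63 s per AllReduce
```

If this time is a large fraction of the step time, the fabric (or the parallelism layout) is the bottleneck rather than the GPUs.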
AI storage
Requirements
| Dataset size | IO pattern | Recommended storage | Bandwidth |
|---|---|---|---|
| < 10 TB | Sequential read (data loading) | Local NVMe | > 10 GB/s per node |
| 10–100 TB | Random read (checkpointing) | Parallel FS (Lustre, Weka) | > 100 GB/s cluster-wide |
| 100 TB–10 PB | Mixed (training + checkpoint) | Parallel FS + object store | > 500 GB/s |
| 10 PB+ | Multi-modal, video, LLM | Tiered (NVMe cache + parallel FS + object) | > 1 TB/s |
Storage solution comparison
| Solution | Type | Bandwidth | Max capacity | Scaling | Use case |
|---|---|---|---|---|---|
| Lustre | Parallel FS (POSIX) | > 100 GB/s (cluster) | 100s PB | OST + MDS | HPC, LLM training (standard) |
| GPFS / StorageScale | Parallel FS (POSIX) | > 100 GB/s | 100s PB | NSD servers | HPC, AI (IBM) |
| WekaFS | Parallel FS (POSIX + NFS/SMB) | ~80 GB/s per 10 nodes | 10s PB | Container-native | AI/ML, NVIDIA DGX preferred |
| VAST Data | Universal storage (NVMe + QLC) | ~100 GB/s per cluster | 10s PB | Scale-out | AI, checkpoint, data lake |
| Pure Storage//E | All-flash (NVMe) | ~50 GB/s | ~30 PB | Scale-out | Enterprise AI, database |
| MinIO / S3 | Object store | ~20 GB/s per gateway | EB | Erasure coding | Dataset repository, checkpoint |
| NetApp AFF | NAS + S3 | ~10 GB/s per controller | ~50 PB | HA pair | Enterprise, NFS baseline |
Checkpointing strategies
| Strategy | RPO | Storage impact | Description |
|---|---|---|---|
| Full checkpoint | every N steps | High (stops training) | Full model + optimizer state |
| Async checkpoint | every N steps | Medium (non-blocking) | Copy to staging buffer, async write |
| Distributed checkpoint (NVIDIA NeMo) | every N steps | Low | Each rank writes its own shard |
| In-memory checkpoint (IBM) | on failover | Minimal (DRAM) | Replication to another node's DRAM |
| Continuous checkpoint (Microsoft) | every 1–5 min | Low (delta) | Changed shards only |
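The async strategy (copy to a staging buffer, then write in the background) can be sketched with the standard library alone. This is a minimal illustration, not any framework's checkpoint API: `async_checkpoint` is a hypothetical helper, and `pickle.dumps` stands in for the device-to-host copy a real trainer would make.

```python
import pickle
import tempfile
import threading
from pathlib import Path


def async_checkpoint(state: dict, path: Path) -> threading.Thread:
    """Snapshot state into a staging buffer, then write it in the background."""
    staged = pickle.dumps(state)  # blocking part: copy into a host-side buffer

    def _write() -> None:
        tmp = path.with_suffix(".tmp")
        tmp.write_bytes(staged)
        tmp.replace(path)  # rename so readers never see a partial file

    writer = threading.Thread(target=_write, daemon=True)
    writer.start()
    return writer


# Usage: training continues while the write is in flight
ckpt_dir = Path(tempfile.mkdtemp())
state = {"step": 100, "weights": [0.1, 0.2, 0.3]}
writer = async_checkpoint(state, ckpt_dir / "step_100.ckpt")
state["step"] = 101   # later updates do not affect the staged snapshot
writer.join()         # wait before starting the next checkpoint
```

Only the staging copy blocks the training loop, which is why the table rates the storage impact as medium rather than high.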
AI cluster architecture
Physical topology — DGX H100 example
Kubernetes for AI
| Component | Role |
|---|---|
| Volcano | Batch scheduling, gang scheduling, queue management |
| Kueue | Multi-tenant admission, resource quotas, fair sharing |
| NVIDIA GPU Operator | Driver, container toolkit, MIG, DCGM, monitoring |
| HAMi (ex k8s-vGPU-scheduler) | GPU sharing, MIG partitioning, fractional GPU |
| Node Feature Discovery | GPU type detection, NUMA topology |
| Topology Manager | NUMA-aware pod placement |
| DPDK / SR-IOV | High-performance networking for GPU Direct RDMA |
Slurm for AI
| Component | Role |
|---|---|
| slurm.conf | Partition for GPU nodes, GRES (Generic Resource) |
| gres.conf | GPU type, GPU count per node |
| srun --gres=gpu:8 | Allocate 8 GPUs per job |
| sbatch --nodes=64 --ntasks=512 | 64 nodes, 512 ranks (8 GPU/node) |
| Pyxis | NVIDIA container plugin for Slurm (used with Enroot) |
AI cluster cooling
Power density comparison
| Configuration | TDP per node | Racks | kW/rack | Note |
|---|---|---|---|---|
| Standard server (2U) | 1 kW | 20 | 5–10 | Typical DC |
| GPU server (DGX H100, 6×) | 42 kW | 6 | 45–50 | Air cooling limit |
| GPU server (DGX B200, 6×) | 72 kW | 6 | 90–100 | Liquid cooling required |
| GPU server (GB200 NVL72) | 120 kW | — | ~120 | Liquid cooling mandatory |
| NVIDIA NVL72 rack | 120 kW | 1 | 120 | Fully liquid cooled |
Cooling technologies
| Method | Max kW/rack | CAPEX | OPEX | Complexity |
|---|---|---|---|---|
| Air cooling (CRAC/CRAH) | < 15 | Low | Medium | Low |
| Air cooling (in-row) | 15–30 | Medium | Medium | Low |
| Rear-door heat exchanger | 30–50 | Medium | Low | Medium |
| Direct-to-chip liquid (cold plate) | 50–150 | High | Low | High |
| Immersion (single-phase) | 100–200 | High | Low | High |
| Immersion (two-phase) | 200+ | Very high | Low | Very high |
Inference infrastructure
Inference server comparison
| Tool | Frameworks | Optimization | Use case |
|---|---|---|---|
| vLLM | Megatron, HF, AWQ, GPTQ | PagedAttention, KV cache, continuous batching | LLM inference (open source) |
| TensorRT-LLM | TensorRT | INT4/INT8/FP8, inflight batching, attention optimizations | Production (NVIDIA) |
| Triton Inference Server | All (TensorRT, vLLM, PyTorch) | Model ensemble, model caching, concurrent execution | Enterprise, multi-model |
| SageMaker | Managed | Auto-scaling, model parallelism | AWS managed |
| OpenAI API / TGI | HF Transformers | Continuous batching, flash attention | Hosting |
Inference optimization
| Technique | Latency improvement | Throughput improvement | Memory reduction |
|---|---|---|---|
| FP8/INT8 quantization | — | 2× | 2× |
| INT4 quantization | — | 4× | 4× |
| Flash Attention 2/3 | 2–4× | — | 50 % (KV cache) |
| PagedAttention | — | 2–5× | 95 % (KV cache fragmentation) |
| Continuous batching | — | 10–20× | — |
| Speculative decoding | 2–3× | — | — |
| Multi-LoRA / S-LoRA | — | 8–16× | — |
Distributed training techniques
| Technique | Description | Frameworks |
|---|---|---|
| Data Parallelism (DDP/FSDP) | Each GPU has model copy, different batch | PyTorch DDP, FSDP |
| Tensor Parallelism (TP) | Each layer's weight matrices split across GPUs (intra-node) | Megatron-LM, DeepSpeed |
| Pipeline Parallelism (PP) | Layers split across nodes | Megatron-LM, DeepSpeed |
| Sequence Parallelism (SP) | Sequence split across GPUs | Megatron-LM |
| Expert Parallelism (EP) | Different expert subnets on different GPUs | Mixture-of-Experts (MoE) |
| 3D Parallelism | TP + PP + DP combination | Megatron-LM, NeMo |
| ZeRO (1/2/3) | Optimizer/gradient/parameter sharding | DeepSpeed |
| NCCL / RCCL | GPU collective communication library | NVIDIA/AMD |
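In 3D parallelism the three degrees multiply to the total GPU count, so fixing TP and PP determines DP. A minimal sketch of that bookkeeping follows; the helper name and the example degrees are illustrative assumptions, not values any framework prescribes.

```python
def data_parallel_degree(world_size: int, tp: int, pp: int) -> int:
    """Return the DP degree left over once TP and PP are fixed."""
    if world_size % (tp * pp) != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    return world_size // (tp * pp)


# Assumed example: 512 GPUs, TP=8 inside each node, PP=4 across nodes
dp = data_parallel_degree(512, tp=8, pp=4)
print(dp)  # 16
```

Keeping TP equal to the number of GPUs per node confines the most communication-heavy dimension to NVLink.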
Operating systems for AI
Distribution comparison
| OS | GPU driver | CUDA | Container toolkit | IB/RoCE | Lustre client | Production support |
|---|---|---|---|---|---|---|
| Ubuntu 22.04 LTS | NVIDIA 525+ | 12.x | nvidia-container-toolkit | MLNX_OFED, rdma-core | Yes (lustre-client) | NVIDIA DGX standard |
| Ubuntu 24.04 LTS | NVIDIA 550+ | 12.5+ | nvidia-container-toolkit | MLNX_OFED, rdma-core | Yes | Latest GPU support |
| RHEL 9 / Rocky 9 | NVIDIA 525+ | 12.x | nvidia-container-toolkit | MLNX_OFED | Yes (EL repo) | Red Hat, enterprise |
| DGX OS (Ubuntu-based) | NVIDIA custom | 12.x | Pre-installed | Pre-configured | Yes | Supported on NVIDIA DGX only |
| SLES 15 SP5 | NVIDIA 525+ | 12.x | nvidia-container-toolkit | MLNX_OFED | Yes | HPC, some Lustre clusters |
| Debian 12 | NVIDIA 525+ | 12.x | nvidia-container-toolkit | rdma-core | Yes (backports) | Community, research |
| Flatcar / Bottlerocket | Container-host | — | nvidia-container-toolkit | Limited | No | K8s-only, minimal footprint |
Limitations and constraints
GPU drivers and CUDA
| Constraint | Detail |
|---|---|
| Driver-CUDA compatibility | The NVIDIA driver must be at least the minimum version the CUDA toolkit requires; it does not have to match exactly. E.g., CUDA 12.5 requires driver ≥ 555 on Linux (≥ 525 when relying on CUDA 12.x minor-version compatibility) |
| Kernel version | The NVIDIA driver is not compatible with every kernel. A new kernel (6.8+) may require a DKMS build or wait for delayed support |
| Secure Boot | NVIDIA driver requires signed module (MOK, shim) or disabled Secure Boot. Common enterprise issue |
| Open vs Proprietary driver | nvidia-open (since R515) is the open-source kernel module. It supports Turing and newer GPUs, including datacenter parts (H100+); older GPUs require the proprietary module |
| nvidia-persistenced | Keeps GPUs initialized between jobs; without it the driver state is torn down when idle and each job pays re-initialization latency (legacy equivalent: nvidia-smi -pm 1) |
| GPU reset | After a crashed training job the GPU may hang. Try nvidia-smi --gpu-reset, then a node reboot, and sometimes a power cycle |
| Multi-instance GPU (MIG) | Requires a supporting driver and MIG mode enabled on the GPU. Toggling MIG mode needs a GPU reset, so it cannot be changed while jobs are running. A100, H100, B200 only |
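The driver-versus-CUDA rule is a "greater than or equal" comparison on dotted version numbers, which is easy to get wrong with string comparison. A minimal sketch follows, assuming the installed driver version has already been read (for example from `nvidia-smi`). Both helper names are hypothetical, and the minimum passed in is an example value that should be taken from the CUDA release notes for the toolkit in use.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn '550.54.15' into (550, 54, 15) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_satisfies(driver: str, minimum: str) -> bool:
    """True if the installed driver is at least the required minimum."""
    return parse_version(driver) >= parse_version(minimum)


# Example values only; look up the real minimum for your CUDA toolkit
print(driver_satisfies("550.54.15", "525.60.13"))  # True
print(driver_satisfies("470.82.01", "525.60.13"))  # False
```

A check like this fits well in a node health script that runs before a job is admitted.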
Network (InfiniBand / RoCE)
| Constraint | Detail |
|---|---|
| MLNX_OFED vs rdma-core | MLNX_OFED (NVIDIA) — full support, but own kernel modules, kernel version compatibility needed. rdma-core (open) — limited support, no custom modules |
| Kernel compatibility | MLNX_OFED supports only specific kernel versions (major.minor). Kernel upgrade → MLNX_OFED rebuild required |
| NCCL | NCCL version must be compatible with CUDA and IB firmware. nccl-tests for validation |
| SHARP | In-network reduction requires specific MLNX_OFED + IB switch firmware combination |
| GPU Direct RDMA | Requires nvidia-peermem module + MLNX_OFED. Does not work with all GPU and IB card combinations |
| RoCE PFC/ECN | RoCE requires lossless fabric (PFC, ECN, DCQCN). Switch and host configuration — complex tuning |
Storage
| Constraint | Detail |
|---|---|
| Lustre client | Client version must be compatible with the server. Server upgrade → plan to upgrade all clients. Client packages are available mainly for RHEL, SLES, and Ubuntu/Debian derivatives |
| POSIX locking | NFS and Lustre have different POSIX locking behavior. Tooling that relies on flock can misbehave when a job mixes filesystems |
| Filesystem cache | Page cache can mask IO bottlenecks. Benchmark training IO with O_DIRECT or sync IO to see the real storage throughput |
| Local NVMe vs parallel FS | Dataset staging on local NVMe eliminates network dependency but requires space and pre-fetch pipeline |
Container runtime
| Constraint | Detail |
|---|---|
| Docker + GPU | nvidia-container-toolkit (formerly nvidia-docker2). Requires runtime installation and config in /etc/docker/daemon.json |
| Podman + GPU | Requires nvidia-container-toolkit + podman hook. Less tested than Docker |
| containerd + GPU | Standard for K8s. Requires CDI (Container Device Interface) or nvidia-container-runtime |
| Enroot + Pyxis | NVIDIA container stack for Slurm (Enroot = daemonless container runtime, Pyxis = Slurm plugin) |
| User namespace mapping | Container GPU access requires device cgroup; rootless may fail (exception for /dev/dri and /dev/nvidia*) |
Kernel parameters
Firmware and HW
| Constraint | Detail |
|---|---|
| GPU firmware (VBIOS) | NVIDIA datacenter GPUs (H100, B200) have VBIOS updates via NVFlash. Without update → missing partitioning support or newer CUDA features |
| InfiniBand firmware | IB switch and HCA firmware must be compatible. Mix old switch + new HCA → degraded perf |
| NVSwitch firmware | DGX systems have NVSwitch firmware updatable only via NVIDIA DGX tools |
| Power capping (nvidia-smi) | nvidia-smi -pl <power> — limit TDP for power budget management. Test impact on training throughput |
| GPU clock locking | nvidia-smi -ac <mem_clock,graphics_clock> sets fixed application clocks for stable benchmarks (memory clock first). Apply after nvidia-persistenced is running |
| PCIe Gen | GPU in PCIe Gen4 slot (instead of Gen5) → bottleneck for CPU↔GPU data transfer. Important for FSDP sharding |
Recommended OS per use case
| Use case | OS | Rationale |
|---|---|---|
| DGX cluster (production) | DGX OS / Ubuntu 22.04 LTS | NVIDIA standard, best driver support |
| Enterprise K8s (OpenShift) | RHEL 9 / RHCOS | Red Hat support, GPU Operator compatible |
| Vanilla K8s (on-prem) | Ubuntu 22.04 LTS + Flatcar (workers) | Widest community support, Flatcar for minimal footprint |
| Slurm cluster (HPC/AI) | Rocky Linux 9 / Ubuntu 22.04 LTS | EL ecosystem (Lustre, OFED) or Ubuntu (community) |
| Research / rapid prototyping | Ubuntu 24.04 LTS | Latest CUDA, PyTorch, driver support |
| Edge inference | NVIDIA JetPack / Ubuntu (ARM) | Embedded GPU (Jetson Orin, AGX) |
AI-ready data center — check-list
| Area | Requirement |
|---|---|
| Power | 30–120 kW/rack, HVDC (400 V DC), UPS supporting GPU spikes |
| Cooling | Liquid cooling ready (direct-to-chip), rear-door for 30+ kW |
| Network | InfiniBand (NDR/XDR) or RoCEv2, rail-optimized fat-tree |
| Storage | Parallel FS (Lustre/Weka), checkpoint bandwidth > 100 GB/s |
| GPU density | Max GPU/rack, minimize NVSwitch hops |
| Physical | Floor load 1,500+ kg/m², rack 52U–60U |
| Security | Tenant isolation, network segmentation, data encryption |
| Monitoring | DCGM, NCCL health checks, thermals, power capping |
Model and throughput limitations
Model size per GPU
Maximum model size fitting on a single GPU depends on HBM capacity and precision:
| GPU | HBM | FP32 | FP16/BF16 | INT8 | INT4 |
|---|---|---|---|---|---|
| H100 80GB | 80 GB | ~20B | ~40B | ~80B | ~160B |
| H200 141GB | 141 GB | ~35B | ~70B | ~140B | ~280B |
| B200 192GB | 192 GB | ~48B | ~96B | ~192B | ~384B |
| MI300X 192GB | 192 GB | ~48B | ~96B | ~192B | ~384B |
| A100 80GB | 80 GB | ~20B | ~40B | ~80B | ~160B |
| GB200 (192+480) | 192 GB GPU + 480 GB Grace | — | ~96B + CPU offload | — | — |

Approximate: 1B params ≈ 4 GB FP32 ≈ 2 GB FP16/BF16 ≈ 1 GB INT8 ≈ 0.5 GB INT4. The table counts weights only and assumes the full HBM is available. In practice, reserve ~10–15 % of HBM for activations and KV cache at inference; training also needs room for gradients and optimizer states, which is several times the weight size.
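The rule of thumb above is plain division, sketched below so it can be rerun for other capacities. `BYTES_PER_PARAM`, the function name, and the `reserve` parameter are illustrative assumptions; the result is a weights-only upper bound.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def max_params_billion(hbm_gb: float, precision: str, reserve: float = 0.0) -> float:
    """Largest model (billions of parameters) whose weights fit in hbm_gb.

    reserve is the fraction of HBM held back for activations and KV cache.
    """
    usable_gb = hbm_gb * (1.0 - reserve)
    return usable_gb / BYTES_PER_PARAM[precision]


print(max_params_billion(80, "fp16"))   # 40.0
print(max_params_billion(80, "fp32"))   # 20.0
print(f"{max_params_billion(80, 'fp16', reserve=0.15):.0f}B with a 15 % reserve")
```

With a 15 % reserve, an 80 GB GPU holds roughly 34B parameters at FP16 rather than 40B.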
Memory breakdown inference
| Component | Llama 3 70B (FP16) | Llama 3 8B (FP16) |
|---|---|---|
| Model weights | 140 GB | 16 GB |
| KV cache (4K context, batch 1) | ~2 GB | ~0.2 GB |
| KV cache (128K context, batch 1) | ~60 GB | ~6.5 GB |
| Activations (peak) | ~5 GB | ~1 GB |
| Total 4K ctx | ~147 GB | ~17 GB |
| Total 128K ctx | ~205 GB | ~23 GB |
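Conclusion: Llama 3 70B FP16 does not fit on a single H100 (80 GB). Options: tensor parallelism across 2+ GPUs (140 GB of weights), INT8 (~70 GB of weights, which fits 1× H200 but is too tight on 1× H100 once KV cache is added), or INT4 (~35 GB of weights, which fits 1× H100).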
Context length vs memory
| Context | KV cache 70B (FP16) | KV cache 8B (FP16) | Note |
|---|---|---|---|
| 4K | ~2.2 GB | ~0.25 GB | Typical chat |
| 32K | ~18 GB | ~2 GB | Documents |
| 128K | ~72 GB | ~8 GB | Long-context (Claude, Gemini) |
| 1M | ~560 GB | ~64 GB | Experimental (Gemini 1.5 Pro) |
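KV cache grows linearly with context length, batch size, layer count, and the number of KV heads. It is attention compute, not the cache, that grows quadratically with context length. Cache size is the critical limit for long-context inference.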
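The cache size follows from storing one key and one value vector per token, per layer, per KV head. The sketch below shows that formula. The configuration (80 layers, 8 KV heads, head dimension 128) is an assumed example for a 70B-class model with grouped-query attention, so the output is illustrative and need not match the rounded figures in the table.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, batch: int = 1, bytes_per_value: int = 2) -> int:
    """KV cache size in bytes; the leading 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * context_len * batch * bytes_per_value


# Assumed configuration, FP16 cache (2 bytes per value)
cfg = dict(layers=80, kv_heads=8, head_dim=128)
small = kv_cache_bytes(context_len=4096, **cfg)
large = kv_cache_bytes(context_len=131072, **cfg)
print(f"4K:   {small / 1e9:.2f} GB")   # 4K:   1.34 GB
print(f"128K: {large / 1e9:.2f} GB")
print(large // small)                  # 32, i.e. linear in context length
```

Doubling any single factor (context, batch, layers, or KV heads) doubles the cache, which is why grouped-query attention reduces it so effectively.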
Throughput inference
| Model | GPU | Precision | Batch size | Tokens/s | QPS (1K output) |
|---|---|---|---|---|---|
| Llama 3 8B | H100 | FP16 | 1 | ~800 | ~0.8 |
| Llama 3 8B | H100 | FP16 | 128 | ~4,500 | ~4.5 |
| Llama 3 8B | H100 | INT4 | 128 | ~8,000 | ~8 |
| Llama 3 70B | 4× H100 | FP16 | 1 | ~180 | ~0.18 |
| Llama 3 70B | 4× H100 | INT4 | 64 | ~1,200 | ~1.2 |
| Llama 3 70B | 8× H100 | FP16 (TP=8) | 128 | ~2,500 | ~2.5 |
| DeepSeek-R1 671B | 8× H200 | FP8 (MoE) | 64 | ~500 | ~0.5 |
| GPT-4 class (est.) | — | — | — | ~100–300 | ~0.1–0.3 |
Notes:
- QPS (queries per second) depends on output length; the table assumes 1K output tokens per query, so QPS ≈ tokens/s ÷ 1,000
- Larger batches increase throughput but also increase TTFT (time to first token)
- Tensor Parallelism (TP) scales, but communication overhead grows with the TP degree, so TP is usually kept inside one NVLink domain
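The QPS convention in the notes (one query ≈ 1K output tokens) is a single division. A minimal sketch follows, with the function name as an illustrative assumption.

```python
def queries_per_second(tokens_per_s: float, output_tokens: int = 1000) -> float:
    """Aggregate QPS when every query generates output_tokens tokens."""
    return tokens_per_s / output_tokens


print(queries_per_second(800))        # 0.8
print(queries_per_second(180))        # 0.18
print(queries_per_second(800, 250))   # 3.2, shorter answers raise QPS
```

Because QPS scales inversely with output length, capacity plans should use the measured output-length distribution rather than a fixed 1K.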
Training limits
Scaling efficiency
| GPU count | Model | Efficiency | Reason |
|---|---|---|---|
| 8 (1 node) | Llama 3 8B | ~95 % | NVLink intra-node |
| 64 (8 nodes) | Llama 3 8B | ~85 % | IB inter-node |
| 512 (64 nodes) | Llama 3 70B | ~75 % | Communication overhead |
| 4,096 (512 nodes) | Llama 3 70B | ~60 % | Pipeline bubble, network |
| 16,384 (2,048 nodes) | Llama 3 405B | ~45 % | Synchronous SGD overhead |
Note: Efficiency = actual throughput ÷ (N × single-GPU throughput), i.e. the fraction of ideal linear speedup achieved on N GPUs. In this table it falls roughly logarithmically with GPU count.
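The efficiency definition can be applied directly to measured throughput. The sketch below uses made-up throughput numbers purely for illustration; only the formula comes from the note above.

```python
def scaling_efficiency(measured: float, single_gpu: float, n_gpus: int) -> float:
    """Fraction of ideal linear speedup achieved on n_gpus."""
    return measured / (single_gpu * n_gpus)


# Hypothetical measurement: 1,000 samples/s on 1 GPU, 54,400 samples/s on 64 GPUs
eff = scaling_efficiency(measured=54_400, single_gpu=1_000, n_gpus=64)
print(f"{eff:.0%}")  # 85%
```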
Memory breakdown training
| Component | Llama 3 70B (BF16) | Llama 3 8B (BF16) |
|---|---|---|
| Model weights | 140 GB | 16 GB |
| Optimizer states (Adam) | 280 GB | 32 GB |
| Gradients | 140 GB | 16 GB |
| Activations (peak) | ~30 GB | ~4 GB |
| Total (DDP) | ~590 GB | ~68 GB |
| Total (FSDP shard=8) | ~74 GB | ~8.5 GB |
Conclusion: FSDP (Fully Sharded Data Parallel) or equivalent sharding is required for training models > 10B. With Adam, training needs about 4× the weight memory of inference (weights + gradients + optimizer states), before activations are counted.
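The table's totals can be reproduced with a few lines of arithmetic. This sketch mirrors the table's simplifications, which are assumptions rather than measured values: BF16 weights and gradients at 2 bytes per parameter, Adam states at 4 bytes per parameter (FP32 states or master weights would need more), and everything, including activations, divided evenly by the shard count.

```python
def training_memory_gb(params_billion: float, activations_gb: float, shards: int = 1) -> float:
    """Per-GPU training memory in GB under the table's simplified accounting."""
    weights = 2.0 * params_billion      # BF16 weights
    gradients = 2.0 * params_billion    # BF16 gradients
    optimizer = 4.0 * params_billion    # two Adam moments, 2 bytes each
    return (weights + gradients + optimizer + activations_gb) / shards


print(training_memory_gb(70, 30))            # 590.0  (DDP, full replica per GPU)
print(training_memory_gb(70, 30, shards=8))  # 73.75  (FSDP over 8 GPUs)
print(training_memory_gb(8, 4, shards=8))    # 8.5
```

Only the sharded 70B figure fits under 80 GB, which is the basis for the conclusion above.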
Time to train
| Model | GPU count | GPU type | Precision | Time | Cost (on-prem estimate) |
|---|---|---|---|---|---|
| Llama 3 8B | 64 | H100 | BF16 | ~3 days | ~$5,000 |
| Llama 3 70B | 512 | H100 | BF16 | ~14 days | ~$100,000 |
| Llama 3 405B | 16,384 | H100 | BF16 | ~60 days | ~$14 M |
| DeepSeek-R1 671B (MoE) | 2,048 | H800 | BF16 | ~30 days | ~$6 M |
| GPT-4 (est.) | 25,000 | A100/H100 | Mixed | ~90–100 days | ~$100 M |
Power and thermal limits
| Configuration | TDP limit | Throughput loss | Reason |
|---|---|---|---|
| H100 SXM | 700 W (default) | 0 % | Nominal |
| H100 SXM | 600 W (-15 %) | ~5–8 % | Power capping |
| H100 SXM | 500 W (-30 %) | ~15–25 % | Significant throttling |
| H100 SXM | 400 W (-43 %) | ~30–50 % | Emergency only |
| DGX H100 (8×) | 5.6 kW (max) | 0 % | Liquid cooling required |
| DGX H100 (8×) | 4.5 kW (air) | ~10–15 % | Rear-door heat exchanger |
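A GPU throttles when it exceeds its power limit or reaches about 85 °C. Under a power cap, clock frequency falls roughly linearly with the cap, while throughput falls non-linearly: per the table, a 15 % cap costs only ~5–8 % of throughput, but a 43 % cap costs ~30–50 %.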
API and operational limits
| Limit | Description | Typical value |
|---|---|---|
| Rate limit | Max requests per minute/hour | 100–10,000 RPM (per tier) |
| Tokens per minute (TPM) | Max tokens per minute | 1M–300M (per model) |
| Context window | Max input tokens | 4K–2M (per model) |
| Max output tokens | Max generated tokens | 4K–32K (per model) |
| Concurrent requests | Parallel request count | 10–10,000 (per backend) |
| Batch window | Time to accumulate batch | 0–20 s (vLLM, TGI) |
| TTFT timeout | Max latency to first token | 30–120 s |
| Idle timeout | GPU idle → scale to 0 | 5–15 min (cloud) |
Limits per deployment model
| Dimension | On-prem HW | Managed cloud (SageMaker, Vertex) | API (OpenAI, Anthropic) |
|---|---|---|---|
| Model size | Limited by HBM (max 192 GB/GPU) | Unlimited (cluster scaling) | Unlimited |
| Queries | Limited by GPU count | Auto-scaling | Rate limit (per tier) |
| Latency | < 10 ms (same node) | 10–100 ms (network hop) | 100 ms – 10 s |
| Customization | Full (fine-tuning, quantization) | Managed (SageMaker, Bedrock) | Prompt engineering only |
| Data privacy | Yes (on-prem) | Contractual (region, encryption) | Limited |
| Cost per 1M tokens | ~$0.10–0.50 (FP16 inference) | ~$0.20–1.00 | ~$0.15–15.00 |
| Max context | 128K+ (depending on GPU count) | 128K+ | 32K–2M |
| Cold start | 0 (always-on) | 30 s – 5 min | 0 (shared infra) |
GPU pricing and price/performance (2026)
Prices are approximate — NVIDIA does not publish official datacenter GPU price lists. Cloud prices from public providers (Q2 2026). HW purchase prices vary by volume, reseller, and region.
Purchase price (buy)
| GPU | Price/GPU | Price 8× GPU baseboard | $/PFLOPS (FP16) | Note |
|---|---|---|---|---|
| H100 SXM | $27,000–40,000 | ~$200,000 | $25,000 | Scarcity 2023–2024, now stabilized |
| H200 SXM | $35,000–50,000 | ~$280,000 | ~$35,000 | H100 upgrade, HBM3e |
| B200 | ~$60,000–70,000 | ~$500,000+ | ~$31,000 | Blackwell, FP4 support |
| B100 | ~$30,000 | ~$240,000 | ~$20,000 | Lower price than B200, similar FP8 perf |
| GB200 (Grace+Blackwell) | ~$70,000–100,000 | ~$2,000,000 (rack) | — | CPU+GPU unified, high-density |
| A100 80GB | ~$10,000–15,000 | ~$120,000 | ~$19,200 | Previous gen, still relevant |
| MI300X | ~$12,000–18,000 | ~$100,000 | ~$9,600 | AMD, 192 GB HBM3 |
| Gaudi 3 | ~$15,625 | ~$125,000 | $8,515 | Intel, best $/PFLOPS |
| L40S | ~$8,000–10,000 | — | — | Inference, enterprise |
Cloud pricing (on-demand $/GPU/hr)
| GPU | Cheapest | Mid-range (CoreWeave, Lambda) | Hyperscaler (AWS, GCP, Azure) |
|---|---|---|---|
| H100 SXM | $1.38 (Thunder) | $2.89–3.29 | $4.15–6.88 |
| H100 PCIe | $2.01 (Spheron) | $2.50 | — |
| H200 SXM | $3.89 (Spheron) | $4.54 | $5.00+ |
| B200 | $3.39 (Spheron) | $6.02 | $14.24 (AWS) |
| B200 spot | $2.12 (Spheron) | — | — |
| GB200 | $3.50 (Runcrate) | $5.85 (Oracle) | $6.95 (GCP) |
| MI300X | $1.50 (TensorWave) | $1.85 (Vultr) | $7.86 (Azure) |
| A100 80GB | $1.07 (Spheron) | $1.50–2.00 | $3.00+ |
| Gaudi 3 | ~$1.50–2.50 | — | — |
| L40S | $0.91 (Spheron) | $1.50–2.00 | — |
Inference cost ($/M tokens)
| GPU | Provider | $/hr | Est. tok/s | $/M tok |
|---|---|---|---|---|
| B200 | Spheron | $3.39 | ~4,000 | $0.24 |
| B200 spot | Spheron | $2.12 | ~4,000 | $0.15 |
| H100 PCIe | Spheron | $2.01 | ~1,200 | $0.47 |
| A100 80GB | Spheron | $1.07 | ~520 | $0.57 |
| H100 SXM | AWS | $6.88 | ~1,200 | $1.59 |
| H200 SXM | Spheron | $4.54 | ~1,800 | $0.70 |
| L40S | Spheron | $0.91 | ~450 | $0.56 |
Values for Llama 3 70B (INT8, batch=1, output 1K tok). Actual values vary by batch size, context, and quantization.
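The $/M tok column is derived from the hourly price and the estimated throughput: cost per hour divided by tokens generated per hour, scaled to one million tokens. A minimal sketch of that arithmetic, with the function name as an illustrative assumption:

```python
def usd_per_million_tokens(usd_per_hour: float, tokens_per_s: float) -> float:
    """Cost of generating one million tokens at a steady tokens_per_s rate."""
    tokens_per_hour = tokens_per_s * 3600
    return usd_per_hour / tokens_per_hour * 1e6


# Rows from the table above: (label, $/hr, estimated tok/s)
for label, price, rate in [
    ("B200 spot", 2.12, 4000),
    ("H100 PCIe", 2.01, 1200),
    ("H100 SXM (AWS)", 6.88, 1200),
]:
    print(f"{label}: ${usd_per_million_tokens(price, rate):.2f}/M tok")
```

The calculation assumes the GPU is fully utilized for the whole billed hour; idle time raises the effective cost proportionally.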
Cost per GB HBM
| GPU | HBM | Price/hr cloud | $/GB/hr | Best for memory-bound workloads |
|---|---|---|---|---|
| MI300X | 192 GB | $1.50 | $0.0078 | ✅ Best |
| B200 | 192 GB | $3.39 | $0.0177 | ✅ Good |
| H200 | 141 GB | $3.89 | $0.0276 | ⚠️ |
| H100 SXM | 80 GB | $1.38 | $0.0173 | ⚠️ Only up to 70B models |
| GB200 | 384 GB | $3.50 | $0.0091 | ✅✅ (2× MI300X capacity) |
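The $/GB/hr column is the hourly price divided by HBM capacity. The sketch below recomputes it from the table's figures and ranks the GPUs from cheapest to most expensive per GB; the dictionary layout is an illustrative choice.

```python
# (HBM in GB, cheapest cloud price in $/hr), copied from the table above
offers = {
    "MI300X": (192, 1.50),
    "B200": (192, 3.39),
    "H200": (141, 3.89),
    "H100 SXM": (80, 1.38),
    "GB200": (384, 3.50),
}


def usd_per_gb_hour(gpu: str) -> float:
    """Hourly price per GB of HBM for one GPU type."""
    hbm_gb, price = offers[gpu]
    return price / hbm_gb


ranked = sorted(offers, key=usd_per_gb_hour)
for gpu in ranked:
    print(f"{gpu}: ${usd_per_gb_hour(gpu):.4f}/GB/hr")
```

Price per GB is only one axis: a model that does not fit in a single GPU's memory also pays for interconnect traffic, which this ranking ignores.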
Price/performance by scenario
| Scenario | Winner | Rationale |
|---|---|---|
| Absolute performance (cost no object) | GB200 DGX NVL72 | 72× GPU, 18 PFLOPS FP8, 384 GB HBM/GPU |
| Cloud inference — best $/token | B200 spot | $0.15/M tok; ~3.3× H100 throughput at lower cost |
| Cloud inference — on-demand | B200 | $0.24/M tok |
| Cloud inference — budget | A100 / L40S | $0.56–0.57/M tok |
| Training — price/perf on purchase | Gaudi 3 | $8,515/PFLOPS, 2.5–3× better than H100 |
| Training — cloud | H100 SXM | $1.38/hr, CUDA ecosystem, NCCL |
| Memory-bound — long context, 70B+ | MI300X / GB200 | 192–384 GB, $0.0078–0.0091/GB |
| Ecosystem + safe choice | H100/H200 | CUDA, widest SW, NVIDIA tools |
| Spot / preemptible — lowest cost | A100 / H100 | $1.07–1.38/hr, 50–90% off on-demand |
2026 Trends
- H100 — price dropped 64% from peak $8/hr to $1.38–2.89/hr, then 40% rebound from inference demand
- B200 — new high-end, $3.39/hr cloud → ~$0.15/M tok on spot — new inference benchmark
- MI300X — supply growing (TensorWave, Vultr, CoreWeave, Oracle, Azure), from $1.50/hr
- Gaudi 3 — best $/PFLOPS on purchase, but narrow ecosystem and limited cloud availability
- Market bifurcation — prior gen (H100, A100) commoditizing, new gen (B200, GB200) commanding premium
Related documents
- GPU.en.md — GPU architecture, NVIDIA/AMD, vGPU, MIG
- NETWORKING.en.md — InfiniBand, RoCE, network topology
- STORAGE.en.md — parallel filesystem, object store
- DATACENTERS.en.md — DC layout, power, cooling
- CLOUD.en.md — cloud AI services (SageMaker, Vertex AI)
Sources
Links, books, and standards: sources/infrastructure/sources.en.md
Last revision: 2026-06-18