Stanislav Hubacek 3fa11ef0f6 comiiit

2026-06-11 15:27:28 +02:00


🎮 GPU — architecture, models, virtualization

GPU models

NVIDIA

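| GPU | Architecture | VRAM | Memory type | FP16 Tensor, dense (TFLOPS) | FP8 Tensor, dense (TFLOPS) | Interconnect | TDP |
|---|---|---|---|---|---|---|---|
| A100 | Ampere (2020) | 40/80 GB | HBM2 / HBM2e | 312 | n/a | NVLink 3 (600 GB/s) | 400 W |
| H100 | Hopper (2022) | 80 GB | HBM3 | ~1000 | ~2000 | NVLink 4 (900 GB/s) | 700 W |
| H200 | Hopper (2023) | 141 GB | HBM3e | ~1000 | ~2000 | NVLink 4 (900 GB/s) | 700 W |
| B200 | Blackwell (2024) | 192 GB | HBM3e | ~2250 | ~4500 | NVLink 5 (1800 GB/s) | 1000 W |
| B100 | Blackwell (2024) | 192 GB | HBM3e | ~1800 | ~3600 | NVLink 5 (1800 GB/s) | 700 W |
| GB200 | Grace + Blackwell (2024) | 2× B200 | HBM3e | ~4500 (2 GPUs) | ~9000 (2 GPUs) | NVLink 5 | up to 2700 W |

All throughput figures are dense Tensor Core peaks; structured sparsity doubles them. H200 has the same compute as H100 and differs in memory capacity and bandwidth.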

AMD

| GPU | Architecture | VRAM | Memory type | FP16, dense (TFLOPS) | Interconnect | TDP |
|---|---|---|---|---|---|---|
| MI250X | CDNA 2 (2021) | 128 GB | HBM2e | 383 | Infinity Fabric | 500 W |
| MI300X | CDNA 3 (2023) | 192 GB | HBM3 | ~1300 | Infinity Fabric (896 GB/s) | 750 W |
| MI350X | CDNA 4 (2025) | 288 GB | HBM3e | ~2300 | Infinity Fabric | 1000 W |

GPU interconnects

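| Technology | Provider | Bandwidth | Topology | Use case |
|---|---|---|---|---|
| NVLink 4 | NVIDIA | 900 GB/s per GPU (18 × 50 GB/s, bidirectional) | GPU-GPU direct | AI training (H100, H200) |
| NVLink 5 | NVIDIA | 1800 GB/s per GPU (18 × 100 GB/s, bidirectional) | GPU-GPU direct | AI training (B200, GB200) |
| Infinity Fabric | AMD | 896 GB/s per GPU (aggregate) | GPU-GPU + CPU-GPU | AI training (MI300X, MI350) |
| NVSwitch | NVIDIA | 900 GB/s per GPU (NVLink 4) | Switched all-to-all (up to 256 GPUs with the NVLink Switch System) | DGX SuperPOD, HGX |
| InfiniBand (NDR) | NVIDIA/Mellanox | 400 Gbps per port | GPU-NIC direct, RDMA | Distributed training, HPC |
| PCIe 5.0 | Standard | ~63 GB/s per x16, per direction | CPU-GPU | Inference, rendering |
| Ethernet (RoCE v2) | Standard | 100/200/400 GbE | GPU-NIC, RDMA over Converged Ethernet | AI inference, storage |

Note the units: NVLink, Infinity Fabric and PCIe are quoted in gigabytes per second (GB/s), InfiniBand and Ethernet in gigabits per second (Gbps).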
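
Because the table mixes GB/s and Gbps, a quick normalization makes the gap between in-node and between-node links easier to see. The sketch below uses the nominal peak figures from the table; the dictionary and function names are illustrative, and the ratios are only rough because vendors quote some links bidirectionally and others per direction.

```python
# Normalize the nominal link speeds from the table to GB/s.
# These are peak marketing figures, not measured throughput.

def gbps_to_gigabytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8.0


LINKS_GB_PER_S = {
    "NVLink 5": 1800.0,
    "NVLink 4": 900.0,
    "Infinity Fabric": 896.0,
    "PCIe 5.0 x16": 63.0,
    "InfiniBand NDR port": gbps_to_gigabytes_per_s(400),
}


def relative_to(baseline: str) -> dict[str, float]:
    """Return every link speed as a multiple of the chosen baseline link."""
    base = LINKS_GB_PER_S[baseline]
    return {name: bw / base for name, bw in LINKS_GB_PER_S.items()}


if __name__ == "__main__":
    ratios = relative_to("InfiniBand NDR port")
    for name, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
        print(f"{name:22s} {LINKS_GB_PER_S[name]:7.1f} GB/s  {ratio:5.1f}x NDR")
```

A single NDR port works out to 50 GB/s, which is why multi-node designs usually give each GPU its own NIC.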

GPU direct communication

GPU 0 ──NVLink── GPU 1        GPU 0 ───PCIe─── CPU ───PCIe─── GPU 1
  │                            │
  │                            │
NVSwitch                     InfiniBand
  │                            │
  │                            │
GPU 2 ──NVLink── GPU 3        GPU 2 ───PCIe─── CPU ───PCIe─── GPU 3

NVLink topology (GPU direct)    PCIe topology (CPU mediated)

GPUDirect RDMA: GPU ↔ NIC transfers that bypass the CPU and host memory (InfiniBand, RoCE)
GPUDirect Storage: GPU ↔ NVMe transfers that bypass the CPU and host memory (part of NVIDIA Magnum IO)
NVSwitch: full bisection bandwidth between all GPUs in a node
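
To put a number on the difference between the two topologies in the diagram, the standard bandwidth-only cost model for a ring all-reduce can be applied to each link type. This is a rough sketch: the function name is illustrative, latency and protocol overhead are ignored, and the 450 GB/s figure assumes half of NVLink 4's 900 GB/s bidirectional bandwidth is usable per direction.

```python
def ring_allreduce_seconds(payload_gb: float, n_gpus: int, link_gb_per_s: float) -> float:
    """Bandwidth-only estimate for a ring all-reduce.

    Each GPU sends and receives 2 * (N - 1) / N times the payload,
    so the time is that volume divided by the per-direction link speed.
    """
    if n_gpus < 2:
        return 0.0  # nothing to reduce across a single GPU
    return 2 * (n_gpus - 1) / n_gpus * payload_gb / link_gb_per_s


if __name__ == "__main__":
    payload_gb = 10.0  # illustrative gradient size, not taken from the document
    for label, bandwidth in [("NVLink 4 (assumed 450 GB/s per direction)", 450.0),
                             ("PCIe 5.0 x16 (63 GB/s per direction)", 63.0)]:
        t = ring_allreduce_seconds(payload_gb, n_gpus=8, link_gb_per_s=bandwidth)
        print(f"{label}: {t * 1000:.1f} ms")
```

Real collectives (NCCL, RCCL) choose between ring and tree algorithms and overlap communication with compute, so measured times will differ.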

GPU virtualization

| Technology | Description | GPU support | Use case |
|---|---|---|---|
| NVIDIA vGPU (GRID) | Time slicing + dedicated profiles | Profile series: A (virtual apps), B (virtual PC / VDI), Q (workstation / pro viz), C (compute / AI) | VDI, virtualized AI |
| NVIDIA MIG | Hardware GPU partitioning | A100 (up to 7 instances), H100/H200/B200 | AI inference, multi-tenant GPU |
| AMD MxGPU | SR-IOV, hardware partitioning | AMD Instinct MI series, Radeon Pro | VDI, cloud gaming |
| Intel Server GPU (SG1) | SR-IOV, hardware partitioning | Intel SG1, Flex, Arc | VDI, media transcoding |
| GPU passthrough | Dedicated GPU assigned to a whole VM (`vfio-pci`) | All GPUs | AI training, HPC, highest performance |

MIG partition table (A100 / H100)

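| GPU | Partition profile | GPU memory | Compute slices (of 7) |
|---|---|---|---|
| A100 40 GB | 1g.5gb | 5 GB | 1 |
| A100 40 GB | 2g.10gb | 10 GB | 2 |
| A100 40 GB | 3g.20gb | 20 GB | 3 |
| A100 40 GB | 7g.40gb | 40 GB | 7 |
| A100 40 GB | 7× 1g.5gb | 7 × 5 GB | 7 (one per instance) |
| H100 80 GB | 1g.10gb | 10 GB | 1 |
| H100 80 GB | 2g.20gb | 20 GB | 2 |
| H100 80 GB | 3g.40gb | 40 GB | 3 |
| H100 80 GB | 7g.80gb | 80 GB | 7 |

The `+me` suffix on a profile name (for example `1g.10gb+me`) marks an instance that also gets the media engines.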
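
MIG profile names encode the compute slices and the memory size, so a requested mix can be sanity-checked before touching the GPU. The sketch below parses the `Ng.Mgb` naming scheme and tests a necessary condition only: the helper names are illustrative, and real placement rules are stricter than a simple sum (the driver only allows certain slice positions).

```python
import re
from typing import NamedTuple


class MigProfile(NamedTuple):
    compute_slices: int
    memory_gb: int
    media_ext: bool  # True when the name carries the "+me" suffix


_PROFILE_RE = re.compile(r"^(\d+)g\.(\d+)gb(\+me)?$")


def parse_profile(name: str) -> MigProfile:
    """Split a profile name such as '3g.20gb' into its components."""
    match = _PROFILE_RE.match(name)
    if match is None:
        raise ValueError(f"not a MIG profile name: {name!r}")
    return MigProfile(int(match.group(1)), int(match.group(2)), match.group(3) is not None)


def may_fit(profiles: list[str], total_memory_gb: int, total_slices: int = 7) -> bool:
    """Necessary (not sufficient) check: slices and memory must not be oversubscribed."""
    parsed = [parse_profile(p) for p in profiles]
    return (sum(p.compute_slices for p in parsed) <= total_slices
            and sum(p.memory_gb for p in parsed) <= total_memory_gb)


if __name__ == "__main__":
    print(may_fit(["1g.5gb"] * 7, total_memory_gb=40))          # seven small instances
    print(may_fit(["7g.40gb", "1g.5gb"], total_memory_gb=40))   # oversubscribed
```

A combination that passes this check can still be rejected by the driver; `nvidia-smi mig -lgipp` lists the placements that are actually possible.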

GPU use cases

AI Training

Models: LLM (70B-405B+), vision, multimodal
GPU: H100, B200, GB200, MI300X
Interconnect: NVLink 4/5 or Infinity Fabric (within a node), InfiniBand NDR (between nodes)
Parallelism: Distributed Data Parallel (DDP), Tensor Parallel (TP), Pipeline Parallel (PP), Fully Sharded Data Parallel (FSDP)
Framework: PyTorch (NCCL), JAX (XLA), DeepSpeed, Megatron-LM
Tips:
- GB200: one Grace CPU plus 2× B200 linked via NVLink-C2C, so 8 GPUs correspond to 4 GB200 superchips
- DGX B200 / HGX B200: standard 8-GPU building block
- InfiniBand: fat-tree topology to keep all-reduce traffic non-blocking
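
A rough sizing rule shows why models in this range need many GPUs and sharded parallelism. The sketch below assumes mixed-precision training with Adam, which is commonly budgeted at about 16 bytes per parameter (FP16 weights and gradients plus FP32 master weights and two optimizer moments). Activations, buffers and fragmentation are ignored, so the result is a lower bound, and the function names are illustrative.

```python
import math

# Assumed per-parameter footprint for mixed-precision Adam:
# 2 (FP16 weights) + 2 (FP16 grads) + 4 (FP32 master) + 4 (momentum) + 4 (variance)
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4


def training_state_gb(params_billion: float) -> float:
    """Model + optimizer state in GB (1e9 params * bytes/param / 1e9 bytes per GB)."""
    return params_billion * BYTES_PER_PARAM


def min_gpus(params_billion: float, vram_gb: float) -> int:
    """Fewest GPUs whose combined VRAM holds the fully sharded training state."""
    return math.ceil(training_state_gb(params_billion) / vram_gb)


if __name__ == "__main__":
    for size_b, gpu, vram in [(70, "80 GB GPU", 80), (405, "192 GB GPU", 192)]:
        print(f"{size_b}B: {training_state_gb(size_b):.0f} GB state, "
              f">= {min_gpus(size_b, vram)} x {gpu}")
```

Practical clusters are several times larger than this bound because activations and throughput targets dominate.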

AI Inference

Models: LLM serving, embedding, image gen
GPU: A100, H200, B200 (larger VRAM for larger models)
Techniques: MIG partition, TensorRT-LLM, vLLM, Triton Inference Server
Quantization: FP8, INT8, INT4 → lower VRAM, higher throughput
Latency: batch size optimization, dynamic batching, continuous batching
Scale: on-prem (2-32 GPU) / cloud (elastic)
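
The effect of quantization on VRAM can be estimated directly from the bit width. This sketch counts weights only; the KV cache, activations and runtime overhead come on top, so a model that "fits" here may still need a larger GPU in practice. The names are illustrative.

```python
BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "INT8": 8, "INT4": 4}


def weight_gb(params_billion: float, precision: str) -> float:
    """Weight memory in GB for a model of the given size and precision."""
    return params_billion * BITS_PER_WEIGHT[precision] / 8


def weights_fit(params_billion: float, precision: str, vram_gb: float) -> bool:
    """True when the weights alone fit in VRAM (no KV cache headroom included)."""
    return weight_gb(params_billion, precision) <= vram_gb


if __name__ == "__main__":
    for precision in BITS_PER_WEIGHT:
        size = weight_gb(70, precision)
        print(f"70B @ {precision}: {size:.0f} GB, fits in 80 GB: {weights_fit(70, precision, 80)}")
```

For a 70B model this gives 140 GB at FP16, 70 GB at FP8 or INT8, and 35 GB at INT4.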

VDI (Virtual Desktop Infrastructure)

GPU: NVIDIA A16 (64 GB board, e.g. 16 users at 4 GB each), A10 (24 GB, e.g. 4 users at 6 GB each); user density depends on the vGPU profile size
Technology: vGPU (GRID), AMD MxGPU
Protocols: VMware Blast, Citrix HDX, Microsoft RDP, PCoIP (HP Anyware, formerly Teradici)
Use case: CAD (CATIA, SolidWorks), Office, engineering, healthcare (PACS)
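
User density follows from the framebuffer size of the chosen vGPU profile. The sketch below assumes that every physical GPU on a board is split into equal-size profiles and that framebuffer is the only limit; the A16 is modeled as four 16 GB GPUs on one board and the A10 as a single 24 GB GPU. The function name is illustrative.

```python
def users_per_board(gpus_per_board: int, vram_per_gpu_gb: int, profile_gb: int) -> int:
    """Users a board can host when each user gets one profile of profile_gb."""
    if profile_gb <= 0:
        raise ValueError("profile size must be positive")
    return gpus_per_board * (vram_per_gpu_gb // profile_gb)


if __name__ == "__main__":
    print("A16, 4 GB profile:", users_per_board(4, 16, 4))
    print("A16, 1 GB profile:", users_per_board(4, 16, 1))
    print("A10, 6 GB profile:", users_per_board(1, 24, 6))
```

Encoder sessions, CPU and licensing can cap density below the framebuffer limit.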

Rendering and VFX

GPU: NVIDIA RTX 6000 Ada, RTX A6000, AMD Radeon Pro W7900
Rendering: Blender (Cycles/OptiX), V-Ray, Octane Render, Redshift
Denoising: AI-accelerated denoising on GPU
Farm rendering: Deadline, Qube! (job scheduler)

GPU server form factors

| Form factor | GPU count | Power | Cooling | Example |
|---|---|---|---|---|
| 1U | 1-2 | 700-1400 W | Air (high-RPM) | Dell XR4510c |
| 2U | 4-8 | 3-6 kW | Air / Liquid | Dell R760xa, HPE DL380a |
| 4U | 8-10 | 5-8 kW | Air / Liquid | Supermicro SYS-421GE |
| 8U / Chassis | 8-16 | 10-20 kW | Air / Liquid (CDU) | NVIDIA DGX H100, NVIDIA HGX, Supermicro SYS-821GE |
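
The power column translates directly into rack density. The sketch below is simple budgeting arithmetic; the 40 kW rack budget and the 10 kW, 8-GPU server are hypothetical inputs chosen for illustration, not recommendations.

```python
def servers_per_rack(rack_budget_kw: float, server_kw: float) -> int:
    """Whole servers that fit within the rack power budget."""
    if server_kw <= 0:
        raise ValueError("server power must be positive")
    return int(rack_budget_kw // server_kw)


def gpus_per_rack(rack_budget_kw: float, server_kw: float, gpus_per_server: int) -> int:
    """GPUs per rack when power is the only limiting factor."""
    return servers_per_rack(rack_budget_kw, server_kw) * gpus_per_server


if __name__ == "__main__":
    # Hypothetical 40 kW rack filled with 10 kW, 8-GPU systems.
    print(servers_per_rack(40, 10), "servers,", gpus_per_rack(40, 10, 8), "GPUs")
```

Rack units, cooling capacity and network ports can be the tighter limit in practice.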

OpenStack Cyborg (GPU lifecycle management)

Cyborg is an OpenStack service for managing accelerators (GPU, FPGA, DPU, NPU).

Key capabilities

Discovery — automatic GPU detection on compute nodes (NVIDIA, AMD, Intel)
Inventory — tracking available accelerators in the cluster
Lifecycle — attach/detach GPU to VM, firmware update, reset
Scheduling — Placement API for GPU-aware scheduling (Nova)
Cyborg API — REST API for accelerator management

Integration

| Component | Role |
|---|---|
| Nova | VM scheduling with GPU requirements (extra_specs: `accel:device_profile`) |
| Placement | Resource provider for GPU (inventory, traits) |
| Neutron | SR-IOV VF passthrough for GPU networking |
| Ironic | Bare metal + GPU provisioning |
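
The Nova integration hinges on a device profile that a flavor references through `accel:device_profile`. The sketch below only builds the two data structures involved and sends nothing to a cloud. The request-body shape, the `PGPU` resource class, the trait name and the profile name `gpu-dp` are assumptions for illustration; check them against the Cyborg API reference for your release.

```python
import json


def device_profile_request(name: str, resource_class: str, count: int = 1,
                           traits: tuple[str, ...] = ()) -> list[dict]:
    """Build an assumed request body for creating one Cyborg device profile."""
    group = {f"resources:{resource_class}": str(count)}
    for trait in traits:
        group[f"trait:{trait}"] = "required"
    return [{"name": name, "groups": [group]}]


def flavor_extra_specs(profile_name: str) -> dict[str, str]:
    """Extra spec that ties a Nova flavor to the device profile."""
    return {"accel:device_profile": profile_name}


if __name__ == "__main__":
    # "gpu-dp" and the trait are hypothetical names.
    body = device_profile_request("gpu-dp", "PGPU", traits=("CUSTOM_GPU_EXAMPLE",))
    print(json.dumps(body, indent=2))
    print(flavor_extra_specs("gpu-dp"))
```

A VM booted from a flavor with this extra spec asks Nova to request the matching accelerator from Cyborg during scheduling.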

Sources

Links, books and standards: sources/infrastructure/sources.md

Last revision: 2026-06-03
