🎮 GPU — architecture, models, virtualization

GPU models

NVIDIA

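| GPU | Architecture | VRAM | HBM | FP16 (TFLOPS, dense) | FP8 (TFLOPS, dense) | Interconnect | TDP |
|---|---|---|---|---|---|---|---|
| A100 | Ampere (2020) | 40/80 GB | HBM2 / HBM2e | 312 | n/a | NVLink 3 (600 GB/s) | 400 W |
| H100 | Hopper (2022) | 80 GB | HBM3 | ~1000 | ~2000 | NVLink 4 (900 GB/s) | 700 W |
| H200 | Hopper (2023) | 141 GB | HBM3e | ~1000 | ~2000 | NVLink 4 (900 GB/s) | 700 W |
| B200 | Blackwell (2024) | 192 GB | HBM3e | ~2250 | ~4500 | NVLink 5 (1800 GB/s) | 1000 W |
| B100 | Blackwell (2024) | 192 GB | HBM3e | ~1800 | ~3600 | NVLink 5 (1800 GB/s) | 700 W |
| GB200 | Blackwell (2024) | | HBM3e | 4500 (2× B200) | 9000 (2× B200) | NVLink 5 | 2700 W |

Throughput figures are peak tensor-core values without sparsity; structured sparsity roughly doubles them. The H200 uses the same Hopper compute as the H100, so the two differ in memory capacity and bandwidth, not in peak TFLOPS.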

AMD

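| GPU | Architecture | VRAM | HBM | FP16 (TFLOPS, dense) | Interconnect | TDP |
|---|---|---|---|---|---|---|
| MI250X | CDNA 2 (2021) | 128 GB | HBM2e | 383 | Infinity Fabric | 500 W |
| MI300X | CDNA 3 (2023) | 192 GB | HBM3 | ~1300 (~2600 sparse) | Infinity Fabric (896 GB/s) | 750 W |
| MI350X | CDNA 4 (2025) | 288 GB | HBM3e | ~2300 (~4600 sparse) | Infinity Fabric | 1000 W |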

GPU interconnects

| Technology | Provider | Bandwidth | Topology | Use case |
|---|---|---|---|---|
| NVLink 4 | NVIDIA | 900 GB/s (18 × 50 GB/s) | GPU-GPU direct | AI training (H100, H200) |
| NVLink 5 | NVIDIA | 1800 GB/s (18 × 100 GB/s) | GPU-GPU direct | AI training (B200, GB200) |
| Infinity Fabric | AMD | 896 GB/s | GPU-GPU + CPU-GPU | AI training (MI300X, MI350) |
| NVSwitch | NVIDIA | 900 GB/s per GPU (NVLink) | Full-mesh (256 GPU) | DGX SuperPOD, HGX |
| InfiniBand (NDR) | NVIDIA/Mellanox | 400 Gbps per port | GPU-NIC direct, RDMA | Distributed training, HPC |
| PCIe 5.0 | Standard | 63 GB/s per x16 | CPU-GPU | Inference, rendering |
| Ethernet (RoCE v2) | Standard | 100/200/400 GbE | GPU-NIC, RDMA over Converged Ethernet | AI inference, storage |
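
The table mixes units: NVLink and Infinity Fabric are quoted in GB/s per GPU, while InfiniBand and Ethernet are quoted in Gbps per port. The sketch below normalizes a few rows to GB/s and estimates the best-case time to move a payload. The function names and the 10 GB payload are illustrative assumptions, and the results are upper bounds because sustained throughput is lower than the peak figures.

```python
# Peak figures copied from the interconnect table; real-world throughput is lower.

def gbps_to_gb_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8.0


def nvlink_aggregate(links: int, gb_per_s_per_link: float) -> float:
    """Aggregate NVLink bandwidth of one GPU across all of its links."""
    return links * gb_per_s_per_link


PEAK_GB_PER_S = {
    "NVLink 4": nvlink_aggregate(18, 50),
    "NVLink 5": nvlink_aggregate(18, 100),
    "Infinity Fabric": 896.0,
    "InfiniBand NDR (1 port)": gbps_to_gb_per_s(400),
    "PCIe 5.0 x16": 63.0,
}


def transfer_ms(payload_gb: float, link: str) -> float:
    """Best-case time in milliseconds to move payload_gb over the named link."""
    return payload_gb / PEAK_GB_PER_S[link] * 1000.0


if __name__ == "__main__":
    # Illustrative payload of 10 GB (an assumption, not a figure from the table).
    for link, bandwidth in PEAK_GB_PER_S.items():
        print(f"{link:24s} {bandwidth:7.1f} GB/s  10 GB in {transfer_ms(10, link):6.1f} ms")
```

A single NDR port works out to 50 GB/s, which is why node-to-node traffic is usually the bottleneck compared with NVLink inside the node.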

GPU direct communication

GPU 0 ──NVLink── GPU 1        GPU 0 ───PCIe─── CPU ───PCIe─── GPU 1
  │                            │
  │                            │
NVSwitch                     InfiniBand
  │                            │
  │                            │
GPU 2 ──NVLink── GPU 3        GPU 2 ───PCIe─── CPU ───PCIe─── GPU 3

NVLink topology (GPU direct)    PCIe topology (CPU mediated)

  • GPUDirect RDMA: GPU ↔ NIC transfers that bypass the CPU and host memory (InfiniBand, RoCE)
  • GPUDirect Storage: GPU ↔ NVMe transfers that bypass the CPU bounce buffer (part of NVIDIA Magnum IO)
  • NVSwitch: full bisection bandwidth between all GPUs in a node

GPU virtualization

| Technology | Description | GPU support | Use case |
|---|---|---|---|
| NVIDIA vGPU (GRID) | Time slicing + dedicated profiles | Profile series: A (virtual apps), B (VDI), Q (pro viz), C (compute/AI) | VDI, virtualized AI |
| NVIDIA MIG | Hardware GPU partitioning | A100 (7 inst.), H100/H200/B200 | AI inference, multi-tenant GPU |
| AMD MxGPU | SR-IOV, hardware partitioning | AMD Instinct MI, Radeon Pro | VDI, cloud gaming |
| Intel SG (SG1) | SR-IOV, hardware partitioning | Intel SG1, Flex, Arc | VDI, media transcoding |
| GPU passthrough | Dedicated GPU to whole VM (vfio-pci) | All GPUs | AI training, HPC, highest performance |

MIG partition table (A100 / H100)

| GPU | Partition profile | GPU memory | Compute slices |
|---|---|---|---|
| A100 40 GB | 1g.5gb | 5 GB | 1 |
| A100 40 GB | 2g.10gb | 10 GB | 2 |
| A100 40 GB | 3g.20gb | 20 GB | 3 |
| A100 40 GB | 7g.40gb | 40 GB | 7 |
| A100 40 GB | Full (7× 1g.5gb) | 7 × 5 GB | 7 instances |
| H100 80 GB | 1g.10gb | 10 GB | 1 |
| H100 80 GB | 2g.20gb | 20 GB | 2 |
| H100 80 GB | 3g.40gb | 40 GB | 3 |
| H100 80 GB | 7g.80gb | 80 GB | 7 |
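
A MIG profile name encodes its own budget: `Ng.Mgb` means N compute slices and M GB of memory. The sketch below parses such names and checks whether a proposed mix fits a GPU with 7 compute slices and a given memory size. It is a simplified feasibility check under stated assumptions: the helper names are hypothetical, and real MIG placement rules are stricter than the two totals checked here, so the driver may still reject a layout that passes.

```python
import re

# Matches names such as "1g.5gb", "3g.40gb" or "1g.10gb+me".
PROFILE_RE = re.compile(r"^(\d+)g\.(\d+)gb(\+me)?$")


def parse_profile(name: str) -> tuple[int, int]:
    """Return (compute_slices, memory_gb) encoded in a MIG profile name."""
    match = PROFILE_RE.match(name)
    if match is None:
        raise ValueError(f"not a MIG profile name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def fits(profiles: list[str], total_memory_gb: int, total_slices: int = 7) -> bool:
    """Check only the summed slice and memory budgets (a necessary condition)."""
    parsed = [parse_profile(p) for p in profiles]
    used_slices = sum(slices for slices, _ in parsed)
    used_memory = sum(memory for _, memory in parsed)
    return used_slices <= total_slices and used_memory <= total_memory_gb


if __name__ == "__main__":
    print(fits(["1g.5gb"] * 7, total_memory_gb=40))            # seven small instances
    print(fits(["7g.40gb", "1g.5gb"], total_memory_gb=40))     # exceeds both budgets
    print(fits(["3g.40gb", "2g.20gb", "1g.10gb", "1g.10gb"], total_memory_gb=80))
```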

GPU use cases

AI Training

  • Models: LLM (70B-405B+), vision, multimodal
  • GPU: H100, B200, GB200, MI300X
  • Interconnect: NVLink 4/5 or Infinity Fabric (within node), InfiniBand NDR (between nodes)
  • Parallelism: Data Parallel (DDP), Tensor Parallel (TP), Pipeline Parallel (PP), Fully Sharded Data Parallel (FSDP)
  • Framework: PyTorch (NCCL), JAX (XLA), DeepSpeed, Megatron-LM
  • Tips:
    • GB200: one Grace CPU plus 2× B200, linked via NVLink-C2C; 8 GPUs correspond to 4 GB200 superchips
    • DGX B200 / HGX B200: standard building block
    • InfiniBand: fat tree topology for all-reduce optimization
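
To see why the interconnect matters for all-reduce, a bandwidth-only model of ring all-reduce is useful: each of N GPUs sends and receives about 2(N-1)/N times the payload. The sketch below applies that model. It is an approximation under stated assumptions: latency and overlap with compute are ignored, the bandwidth values are illustrative per-GPU rates, and the communication library may pick a different algorithm in practice.

```python
def ring_allreduce_seconds(payload_bytes: float, n_gpus: int,
                           bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-only estimate of one ring all-reduce over n_gpus ranks."""
    if n_gpus < 2:
        return 0.0  # nothing to exchange with a single rank
    # Reduce-scatter plus all-gather: each rank moves 2 * (N - 1) / N of the payload.
    traffic = 2 * (n_gpus - 1) / n_gpus * payload_bytes
    return traffic / bandwidth_bytes_per_s


if __name__ == "__main__":
    # Assumed example: gradients of a 70B-parameter model in 2-byte precision.
    gradient_bytes = 70e9 * 2
    for label, bandwidth in [("450 GB/s per GPU (assumed NVLink-class rate)", 450e9),
                             ("50 GB/s per GPU (one 400 Gbps port)", 50e9)]:
        seconds = ring_allreduce_seconds(gradient_bytes, 8, bandwidth)
        print(f"{label}: {seconds:.2f} s")
```

The ratio between the two results equals the bandwidth ratio, which is the motivation for keeping tensor-parallel traffic inside the node and sending only data-parallel or pipeline traffic over InfiniBand.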

AI Inference

  • Models: LLM serving, embedding, image gen
  • GPU: A100, H200, B200 (larger VRAM for larger models)
  • Techniques: MIG partition, TensorRT-LLM, vLLM, Triton Inference Server
  • Quantization: FP8, INT8, INT4 → lower VRAM, higher throughput
  • Latency: batch size optimization, dynamic batching, continuous batching
  • Scale: on-prem (2-32 GPU) / cloud (elastic)
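
The effect of quantization on VRAM can be estimated from bytes per parameter. The sketch below computes the weight footprint for a model size and precision, then the number of GPUs needed for a given VRAM size. The 1.2 overhead factor for KV cache and activations is an assumed placeholder, not a measured value, and the function names are illustrative.

```python
import math

# Storage cost per parameter for the precisions listed above.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Weights only: 1e9 parameters at 1 byte each is 1 GB."""
    return params_billion * BYTES_PER_PARAM[precision]


def gpus_needed(params_billion: float, precision: str, vram_gb: float,
                overhead: float = 1.2) -> int:
    """GPUs required to hold the weights plus an assumed runtime overhead."""
    required_gb = weight_memory_gb(params_billion, precision) * overhead
    return math.ceil(required_gb / vram_gb)


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        print(f"70B @ {precision}: {weight_memory_gb(70, precision):.0f} GB weights, "
              f"{gpus_needed(70, precision, vram_gb=141)} GPU(s) with 141 GB each")
```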

VDI (Virtual Desktop Infrastructure)

  • GPU: NVIDIA A16 (4 GPUs per board, about 16 users per GPU), A10 (about 4 users per GPU); actual density depends on the vGPU profile size
  • Technology: NVIDIA vGPU (GRID), AMD MxGPU
  • Protocols: VMware Blast, Citrix HDX, Microsoft RDP, PCoIP (Teradici, now part of HP)
  • Use case: CAD (CATIA, SolidWorks), Office, engineering, healthcare (PACS)
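
User density per GPU follows from frame buffer size divided by the vGPU profile size. The sketch below reproduces the figures above under assumed profile sizes: 1 GB per user on a 16 GB A16 GPU and 6 GB per user on a 24 GB A10. The profile choices and function names are assumptions for illustration; licensing and encoder limits can lower the practical number.

```python
def users_per_gpu(vram_gb: int, profile_gb: int) -> int:
    """How many fixed-size vGPU profiles fit into one physical GPU."""
    if profile_gb <= 0:
        raise ValueError("profile size must be positive")
    return vram_gb // profile_gb


def gpus_for_users(users: int, vram_gb: int, profile_gb: int) -> int:
    """Physical GPUs needed to host the given number of users (rounded up)."""
    per_gpu = users_per_gpu(vram_gb, profile_gb)
    if per_gpu == 0:
        raise ValueError("profile does not fit into this GPU")
    return -(-users // per_gpu)  # ceiling division on integers


if __name__ == "__main__":
    print("A16 GPU, 1 GB profile:", users_per_gpu(16, 1), "users")
    print("A10 GPU, 6 GB profile:", users_per_gpu(24, 6), "users")
    print("GPUs for 100 office users at 1 GB on 16 GB GPUs:", gpus_for_users(100, 16, 1))
```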

Rendering and VFX

  • GPU: NVIDIA RTX 6000 Ada, RTX A6000, AMD Radeon Pro W7900
  • Rendering: Blender (Cycles/OptiX), V-Ray, Octane Render, Redshift
  • Denoising: AI-accelerated denoising on GPU
  • Farm rendering: Deadline, Qube! (job scheduler)

GPU server form factors

| Form factor | GPU count | Power | Cooling | Example |
|---|---|---|---|---|
| 1U | 1-2 | 700-1400 W | Air (high-RPM) | Dell XR4510c |
| 2U | 4-8 | 3-6 kW | Air / Liquid | Dell R760xa, HPE DL380a |
| 4U | 8-10 | 5-8 kW | Air / Liquid | Supermicro SYS-421GE |
| 8U / Chassis | 8-16 | 10-20 kW | Air / Liquid (CDU) | NVIDIA DGX H100, HGX-based systems, Supermicro SYS-821GE |

OpenStack Cyborg (GPU lifecycle management)

Cyborg is an OpenStack service for managing accelerators (GPU, FPGA, DPU, NPU).

Key capabilities

  • Discovery — automatic GPU detection on compute nodes (NVIDIA, AMD, Intel)
  • Inventory — tracking available accelerators in the cluster
  • Lifecycle — attach/detach GPU to VM, firmware update, reset
  • Scheduling — Placement API for GPU-aware scheduling (Nova)
  • Cyborg API — REST API for accelerator management

Integration

| Component | Role |
|---|---|
| Nova | VM scheduling with GPU requirements (extra_specs: accel:device_profile) |
| Placement | Resource provider for GPU (inventory, traits) |
| Neutron | SR-IOV VF passthrough for GPU networking |
| Ironic | Bare metal + GPU provisioning |
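
The Nova integration hinges on a device profile that a flavor references through `accel:device_profile`. The sketch below builds the two payloads as plain dictionaries to show how they relate. The profile name `gpu-dp`, the resource class `PGPU`, and the trait `CUSTOM_GPU_NVIDIA` are hypothetical placeholders: the real values depend on what the Cyborg driver reports to Placement in a given deployment, and the exact request shape should be checked against the Cyborg API reference.

```python
import json


def device_profile(name: str, resource_class: str, trait: str, count: int = 1) -> dict:
    """Build a device profile body with one request group (assumed layout)."""
    group = {
        f"resources:{resource_class}": str(count),  # how many accelerators to request
        f"trait:{trait}": "required",               # which kind of accelerator
    }
    return {"name": name, "groups": [group]}


def flavor_extra_specs(profile_name: str) -> dict:
    """Extra spec that ties a Nova flavor to the device profile."""
    return {"accel:device_profile": profile_name}


if __name__ == "__main__":
    # All three names below are hypothetical examples.
    profile = device_profile("gpu-dp", "PGPU", "CUSTOM_GPU_NVIDIA")
    print(json.dumps(profile, indent=2))
    print(json.dumps(flavor_extra_specs(profile["name"]), indent=2))
```

Keeping the hardware request in the device profile means flavors stay generic: changing the accelerator type only requires pointing the flavor at a different profile name.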

Sources

Links, books and standards: sources/infrastructure/sources.md

Last revision: 2026-06-03