Stanislav Hubacek 3fa11ef0f6 comiiit

2026-06-11 15:27:28 +02:00

🎮 GPU: architecture, models, virtualization

GPU models

NVIDIA

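GPU	Architecture	VRAM	Memory type	FP16 dense (TFLOPS)	FP8 dense (TFLOPS)	Interconnect	TDP
A100	Ampere (2020)	40/80 GB	HBM2 / HBM2e	312	n/a	NVLink 3 (600 GB/s)	400 W
H100	Hopper (2022)	80 GB	HBM3	~1000	~2000	NVLink 4 (900 GB/s)	700 W
H200	Hopper (2023)	141 GB	HBM3e	~1000	~2000	NVLink 4 (900 GB/s)	700 W
B200	Blackwell (2024)	192 GB	HBM3e	2250	~4500	NVLink 5 (1800 GB/s)	1000 W
B100	Blackwell (2024)	192 GB	HBM3e	~1800	~3600	NVLink 5 (1800 GB/s)	700 W
GB200	Blackwell (2024)	2× 192 GB	HBM3e	4500 (dual)	9000 (dual)	NVLink 5 (1800 GB/s per GPU)	2700 W

TFLOPS values are Tensor Core peaks without sparsity; structured sparsity doubles them. H200 uses the same Hopper compute die as H100, so only memory capacity and bandwidth differ.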

AMD

GPU	Architecture	VRAM	Memory type	FP16 dense (TFLOPS)	Interconnect	TDP
MI250X	CDNA 2 (2021)	128 GB	HBM2e	383	Infinity Fabric	500 W
MI300X	CDNA 3 (2023)	192 GB	HBM3	~1300 (~2600 with sparsity)	Infinity Fabric (896 GB/s)	750 W
MI350X	CDNA 4 (2025)	288 GB	HBM3e	~2300	Infinity Fabric	1000 W

GPU interconnects

Technology	Vendor	Bandwidth	Topology	Use case
NVLink 4	NVIDIA	900 GB/s (18× 50 GB/s)	GPU-GPU direct	AI training (H100, H200)
NVLink 5	NVIDIA	1800 GB/s (18× 100 GB/s)	GPU-GPU direct	AI training (B200, GB200)
Infinity Fabric	AMD	896 GB/s	GPU-GPU + CPU-GPU	AI training (MI300X, MI350)
NVSwitch	NVIDIA	900 GB/s per GPU (NVLink)	Full mesh (up to 256 GPUs)	DGX SuperPOD, HGX
InfiniBand (NDR)	NVIDIA/Mellanox	400 Gbps per port	GPU-NIC direct, RDMA	Distributed training, HPC
PCIe 5.0	Standard	63 GB/s per x16 (per direction)	CPU-GPU	Inference, rendering
Ethernet (RoCE v2)	Standard	100/200/400 GbE	GPU-NIC, RDMA over Converged Ethernet	AI inference, storage
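
The peak figures above translate into rough transfer times. The following is a minimal sketch, assuming the peak bandwidths from the table and ignoring latency and protocol overhead; the dictionary keys and the function name `transfer_seconds` are illustrative, not part of any library.

```python
# Back-of-the-envelope transfer times for the interconnects listed above.
# Bandwidths are peak values from the table; sustained throughput is lower.

GB = 1e9  # decimal gigabyte, as used in vendor bandwidth specs

# Peak bandwidth in bytes per second (illustrative keys).
INTERCONNECT_BYTES_PER_S = {
    "NVLink 4": 18 * 50 * GB,      # 18 links x 50 GB/s = 900 GB/s
    "NVLink 5": 18 * 100 * GB,     # 18 links x 100 GB/s = 1800 GB/s
    "Infinity Fabric": 896 * GB,
    "InfiniBand NDR": 400e9 / 8,   # 400 Gbit/s per port = 50 GB/s
    "PCIe 5.0 x16": 63 * GB,
}


def transfer_seconds(payload_bytes: float, link: str) -> float:
    """Lower bound on the time to move payload_bytes over the given link."""
    return payload_bytes / INTERCONNECT_BYTES_PER_S[link]


if __name__ == "__main__":
    payload = 140 * GB  # e.g. the FP16 weights of a 70B-parameter model
    for name in INTERCONNECT_BYTES_PER_S:
        print(f"{name:16s} {transfer_seconds(payload, name):7.3f} s")
```

The gap between an NVLink transfer and a single InfiniBand port is the reason tensor parallelism is usually kept inside a node.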

GPU direct communication

GPU 0 ──NVLink── GPU 1        GPU 0 ───PCIe─── CPU ───PCIe─── GPU 1
  │                            │
  │                            │
NVSwitch                     InfiniBand
  │                            │
  │                            │
GPU 2 ──NVLink── GPU 3        GPU 2 ───PCIe─── CPU ───PCIe─── GPU 3

NVLink topology (GPU direct)   PCIe topology (CPU mediated)

GPUDirect RDMA: GPU ↔ NIC without CPU involvement (InfiniBand, RoCE)
GPUDirect Storage: GPU ↔ NVMe without CPU involvement (NVIDIA Magnum IO)
NVSwitch: full bisection bandwidth between all GPUs in a node

GPU virtualization

Technology	Description	GPU support	Use case
NVIDIA vGPU (GRID)	Time slicing + dedicated profiles	Profile series: A (virtual apps), B (virtual PC, VDI), Q (workstation, pro viz), C (compute, AI)	VDI, virtualized AI
NVIDIA MIG	Hardware GPU partitioning	A100, H100, H200, B200 (up to 7 instances each)	AI inference, multi-tenant GPU
AMD MxGPU	SR-IOV, hardware partitioning	AMD Instinct MI series, Radeon Pro	VDI, cloud gaming
Intel Server GPU (SG1)	SR-IOV, hardware partitioning	Intel SG1, Flex, Arc	VDI, media transcoding
GPU passthrough	Whole GPU dedicated to one VM (vfio-pci)	All GPUs	AI training, HPC, highest performance
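
For the passthrough row, a quick way to see which driver each GPU is bound to on a Linux host is to read sysfs. This is a minimal sketch, assuming the standard `/sys/bus/pci/devices` layout (a `class` file and a `driver` symlink per device); the function name `pci_gpus_by_driver` is hypothetical.

```python
from pathlib import Path

# PCI class codes: 0x0300xx = VGA controller, 0x0302xx = 3D controller.
GPU_CLASS_PREFIXES = ("0x0300", "0x0302")


def pci_gpus_by_driver(sysfs_root: str = "/sys/bus/pci/devices") -> dict:
    """Map PCI address -> bound driver name (or None) for GPU-class devices."""
    result = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return result
    for dev in sorted(root.iterdir()):
        class_file = dev / "class"
        if not class_file.is_file():
            continue
        pci_class = class_file.read_text().strip()
        if not pci_class.startswith(GPU_CLASS_PREFIXES):
            continue
        driver_link = dev / "driver"
        # The driver entry is a symlink to the driver directory, e.g. .../vfio-pci.
        driver = driver_link.resolve().name if driver_link.exists() else None
        result[dev.name] = driver
    return result


if __name__ == "__main__":
    for address, driver in pci_gpus_by_driver().items():
        ready = "ready for passthrough" if driver == "vfio-pci" else "not bound to vfio-pci"
        print(f"{address}: driver={driver} ({ready})")
```

A GPU reported with `vfio-pci` is detached from the host driver and can be handed to a VM as a whole device.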

MIG partition table (A100 / H100)

GPU	Partition profile	GPU memory	Compute slices
A100 40 GB	1g.5gb	5 GB	1
A100 40 GB	2g.10gb	10 GB	2
A100 40 GB	3g.20gb	20 GB	3
A100 40 GB	7g.40gb	40 GB	7
A100 40 GB	Full (7× 1g)	7 × 5 GB	7 instances
H100 80 GB	1g.10gb (also 1g.10gb+me)	10 GB	1
H100 80 GB	2g.20gb	20 GB	2
H100 80 GB	3g.40gb	40 GB	3
H100 80 GB	7g.80gb	80 GB	7
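
Profile names follow the pattern `<compute slices>g.<memory>gb`, which makes a simple capacity check possible. The sketch below is a simplified model: it only sums compute slices and memory, while real MIG placement has additional position constraints that are not modeled here. The function names and the defaults (7 slices, 40 GB) are assumptions chosen for illustration.

```python
import re


def parse_mig_profile(name: str) -> tuple:
    """Split a profile such as '3g.20gb' into (compute_slices, memory_gb)."""
    match = re.fullmatch(r"(\d+)g\.(\d+)gb(?:\+me)?", name)
    if match is None:
        raise ValueError(f"not a MIG profile name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def fits_on_gpu(profiles, total_slices: int = 7, total_memory_gb: int = 40) -> bool:
    """Coarse check: do the requested profiles fit by slice count and memory?

    Simplified on purpose; the driver enforces stricter placement rules.
    """
    slices = 0
    memory = 0
    for name in profiles:
        s, m = parse_mig_profile(name)
        slices += s
        memory += m
    return slices <= total_slices and memory <= total_memory_gb


if __name__ == "__main__":
    print(fits_on_gpu(["1g.5gb"] * 7))                     # seven small instances
    print(fits_on_gpu(["3g.20gb", "2g.10gb", "1g.5gb"]))   # mixed layout
    print(fits_on_gpu(["3g.20gb", "3g.20gb", "2g.10gb"]))  # 8 slices requested
```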

GPU use cases

AI Training

Models: LLM (70B-405B+), vision, multimodal
GPU: H100, B200, GB200, MI300X
Interconnect: NVLink 4/5 or Infinity Fabric (within a node), InfiniBand NDR (between nodes)
Parallelism: Data Parallel (DDP), Tensor Parallel (TP), Pipeline Parallel (PP), Fully Sharded (FSDP)
Framework: PyTorch (NCCL), JAX (XLA), DeepSpeed, Megatron-LM
Tips:
- GB200: one Grace CPU plus 2× B200 linked over NVLink, so 8 GPUs correspond to 4 GB200 superchips
- DGX B200 / HGX B200: standard building block
- InfiniBand: fat-tree topology to optimize all-reduce
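
To see why the interconnect matters for data-parallel training, the cost of a gradient all-reduce can be estimated. The sketch below uses the textbook ring all-reduce model, in which each GPU sends and receives 2(N-1)/N times the payload; it ignores latency and overlap with compute, and NCCL may choose a different algorithm in practice. The function name and the example numbers are illustrative assumptions.

```python
def ring_allreduce_seconds(payload_bytes: float, num_gpus: int, bytes_per_s: float) -> float:
    """Bandwidth-only estimate for one ring all-reduce.

    payload_bytes: size of the tensor being reduced (e.g. all gradients)
    num_gpus:      number of participants in the ring
    bytes_per_s:   per-GPU link bandwidth in one direction
    """
    if num_gpus < 2:
        return 0.0  # nothing to exchange
    # Reduce-scatter plus all-gather: each phase moves (N-1)/N of the payload.
    traffic_per_gpu = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic_per_gpu / bytes_per_s


if __name__ == "__main__":
    gradients = 7e9 * 2  # assumed example: 7B parameters in FP16
    ndr_port = 400e9 / 8  # one InfiniBand NDR port, 50 GB/s
    print(f"{ring_allreduce_seconds(gradients, 8, ndr_port):.2f} s per all-reduce")
```

Because the traffic factor approaches 2 as N grows, the per-GPU link bandwidth, not the GPU count, dominates this estimate.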

AI Inference

Models: LLM serving, embedding, image generation
GPU: A100, H200, B200 (larger VRAM for larger models)
Techniques: MIG partitioning, TensorRT-LLM, vLLM, Triton Inference Server
Quantization: FP8, INT8, INT4 → lower VRAM use, higher throughput
Latency: batch size tuning, dynamic batching, continuous batching
Scale: on-prem (2-32 GPUs) / cloud (elastic)
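
The effect of quantization on VRAM is simple arithmetic: weight memory is parameter count times bytes per parameter. The sketch below assumes a flat 20 % allowance for KV cache and activations, which is an illustrative placeholder rather than a measured value; the names are hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Memory for the model weights alone, in decimal GB."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def fits_in_vram(num_params: float, precision: str, vram_gb: float,
                 overhead_fraction: float = 0.2) -> bool:
    """Rough fit check; overhead_fraction is an assumed allowance, not a measurement."""
    return weight_memory_gb(num_params, precision) * (1 + overhead_fraction) <= vram_gb


if __name__ == "__main__":
    params = 70e9  # a 70B-parameter model
    for precision in BYTES_PER_PARAM:
        gb = weight_memory_gb(params, precision)
        print(f"{precision:5s} {gb:6.1f} GB  fits in 141 GB: {fits_in_vram(params, precision, 141)}")
```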

VDI (Virtual Desktop Infrastructure)

GPU: NVIDIA A16 (about 16 users per GPU), A10 (about 4 users per GPU); actual density depends on the vGPU profile
Technology: NVIDIA vGPU (GRID), AMD MxGPU
Protocols: VMware Blast, Citrix HDX, Microsoft RDP, PC-over-IP (HP Teradici)
Use case: CAD (CATIA, SolidWorks), Office, engineering, healthcare (PACS)

Rendering and VFX

GPU: NVIDIA RTX 6000 Ada, RTX A6000, AMD Radeon Pro W7900
Rendering: Blender (Cycles/OptiX), V-Ray, Octane Render, Redshift
Denoising: AI-accelerated denoising on the GPU
Farm rendering: Deadline, Qube! (job schedulers)

GPU server form factors

Form factor	GPU count	Power	Cooling	Example
1U	1-2	700-1400 W	Air (high-RPM)	Dell XR4510c
2U	4-8	3-6 kW	Air / Liquid	Dell R760xa, HPE DL380a
4U	8-10	5-8 kW	Air / Liquid	Supermicro SYS-421GE
8U / Chassis	8-16	10-20 kW	Air / Liquid (CDU)	NVIDIA DGX H100, Supermicro SYS-821GE (HGX)
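
The power column drives rack planning more than the height does. A minimal sketch of that sizing arithmetic follows, assuming a fixed power budget per rack and a standard 42U rack; the example numbers and the function name are illustrative, not vendor data.

```python
def servers_per_rack(rack_power_kw: float, server_power_kw: float,
                     rack_units: int = 42, server_units: int = 8) -> int:
    """How many servers fit, limited by whichever runs out first: power or space."""
    by_power = int(rack_power_kw // server_power_kw)
    by_space = rack_units // server_units
    return min(by_power, by_space)


if __name__ == "__main__":
    # Assumed example: 40 kW rack, 8U servers drawing 10 kW with 8 GPUs each.
    count = servers_per_rack(rack_power_kw=40, server_power_kw=10)
    print(f"{count} servers, {count * 8} GPUs per rack")
```

With dense 8U systems the power budget is usually exhausted before the rack space is.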

OpenStack Cyborg (GPU lifecycle management)

Cyborg is the OpenStack service for managing accelerators (GPU, FPGA, DPU, NPU).

Key capabilities

Discovery: automatic detection of GPUs on compute nodes (NVIDIA, AMD, Intel)
Inventory: tracking of the accelerators available in the cluster
Lifecycle: attaching and detaching GPUs to VMs, firmware update, reset
Scheduling: Placement API for GPU-aware scheduling (Nova)
Cyborg API: REST API for accelerator management

Integration

Component	Role
Nova	VM scheduling with GPU requirements (extra_specs: `accel:device_profile`)
Placement	Resource provider for GPUs (inventory, traits)
Neutron	SR-IOV VF passthrough for GPU networking
Ironic	Bare metal + GPU provisioning
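
The Nova integration hinges on a device profile name referenced from the flavor. The sketch below only builds the two data structures involved and does not call any API. The `accel:device_profile` key comes from the table above; the request body layout, the `PGPU` resource class, and the trait name are assumptions that should be checked against the Cyborg API reference for the release in use.

```python
import json


def device_profile_body(name: str, gpu_count: int = 1, trait=None) -> list:
    """Assumed shape of a Cyborg device profile request body (verify before use)."""
    group = {"resources:PGPU": str(gpu_count)}  # resource class is an assumption
    if trait is not None:
        group[f"trait:{trait}"] = "required"
    return [{"name": name, "groups": [group]}]


def flavor_extra_specs(device_profile_name: str) -> dict:
    """Flavor extra spec that tells Nova which device profile to request."""
    return {"accel:device_profile": device_profile_name}


if __name__ == "__main__":
    # "gpu-1x" and "CUSTOM_GPU_EXAMPLE" are hypothetical names.
    print(json.dumps(device_profile_body("gpu-1x", trait="CUSTOM_GPU_EXAMPLE"), indent=2))
    print(json.dumps(flavor_extra_specs("gpu-1x")))
```

Keeping the hardware request in the device profile means flavors only carry a name, so the accelerator definition can change without touching every flavor.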

Sources

Links, books, and standards: sources/infrastructure/sources.md

Last revised: 2026-06-03
