🎮 GPU: architecture, models, virtualization

GPU models

NVIDIA

| GPU | Architecture | VRAM | Memory type | FP16 (TFLOPS, dense) | FP8 (TFLOPS, dense) | Interconnect | TDP |
|---|---|---|---|---|---|---|---|
| A100 | Ampere (2020) | 40/80 GB | HBM2 / HBM2e | 312 | n/a | NVLink 3 (600 GB/s) | 400 W |
| H100 | Hopper (2022) | 80 GB | HBM3 | ~1000 | ~2000 | NVLink 4 (900 GB/s) | 700 W |
| H200 | Hopper (2023) | 141 GB | HBM3e | ~1000 (same compute as H100) | ~2000 | NVLink 4 (900 GB/s) | 700 W |
| B200 | Blackwell (2024) | 192 GB | HBM3e | 2250 | ~4500 | NVLink 5 (1800 GB/s) | 1000 W |
| B100 | Blackwell (2024) | 192 GB | HBM3e | ~1800 | ~3600 | NVLink 5 | 700 W |
| GB200 | Blackwell (2024) | 2× 192 GB | HBM3e | 4500 (dual) | 9000 (dual) | NVLink 5 | 2700 W |

AMD

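| GPU | Architecture | VRAM | Memory type | FP16 (TFLOPS) | Interconnect | TDP |
|---|---|---|---|---|---|---|
| MI250X | CDNA 2 (2021) | 128 GB | HBM2e | 383 | Infinity Fabric | 500 W |
| MI300X | CDNA 3 (2023) | 192 GB | HBM3 | ~1300 dense (~2600 with sparsity) | Infinity Fabric (896 GB/s) | 750 W |
| MI350 | CDNA 4 (2025) | 288 GB | HBM3e | ~3500 | Infinity Fabric | 750 W |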

GPU interconnects

| Technology | Vendor | Bandwidth | Topology | Use case |
|---|---|---|---|---|
| NVLink 4 | NVIDIA | 900 GB/s (18 × 50 GB/s) | GPU-GPU direct | AI training (H100, H200) |
| NVLink 5 | NVIDIA | 1800 GB/s (18 × 100 GB/s) | GPU-GPU direct | AI training (B200, GB200) |
| Infinity Fabric | AMD | 896 GB/s | GPU-GPU + CPU-GPU | AI training (MI300X, MI350) |
| NVSwitch | NVIDIA | 900 GB/s per GPU (NVLink) | Full-mesh (up to 256 GPUs) | DGX SuperPOD, HGX |
| InfiniBand (NDR) | NVIDIA/Mellanox | 400 Gbps per port | GPU-NIC direct, RDMA | Distributed training, HPC |
| PCIe 5.0 | Standard | 63 GB/s per x16 (per direction) | CPU-GPU | Inference, rendering |
| Ethernet (RoCE v2) | Standard | 100/200/400 GbE | GPU-NIC, RDMA over Converged Ethernet | AI inference, storage |
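
The table mixes GB/s (NVLink, Infinity Fabric, PCIe) with Gbps (InfiniBand, Ethernet), so comparing rows requires a unit conversion first. The following is a minimal sketch that normalises the peak figures from the table to GB/s and estimates an ideal transfer time. It assumes 8 bits per byte with no protocol overhead, and it ignores that the NVLink figures are aggregate bidirectional while the PCIe figure is per direction. The function names and the 140 GB payload are illustrative and not taken from the source.

```python
# Peak link bandwidths from the interconnect table, normalised to GB/s.
# Assumption: 8 bits per byte, no protocol or encoding overhead.

def gbps_to_gbytes(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8.0

LINKS_GB_S = {
    "NVLink 5": 1800.0,
    "NVLink 4": 900.0,
    "Infinity Fabric": 896.0,
    "PCIe 5.0 x16": 63.0,
    "InfiniBand NDR": gbps_to_gbytes(400),  # 400 Gbps per port
}

def transfer_seconds(payload_gb: float, link: str) -> float:
    """Ideal time to move payload_gb gigabytes over one link at peak rate."""
    return payload_gb / LINKS_GB_S[link]

if __name__ == "__main__":
    payload_gb = 140.0  # hypothetical payload size in GB
    fastest_first = sorted(LINKS_GB_S, key=LINKS_GB_S.get, reverse=True)
    for name in fastest_first:
        rate = LINKS_GB_S[name]
        print(f"{name:16s} {rate:7.1f} GB/s  {transfer_seconds(payload_gb, name):6.2f} s")
```

Real throughput is lower than these peak values, so the output is best read as a ranking of the links rather than a prediction.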

GPU direct communication

GPU 0 ──NVLink── GPU 1        GPU 0 ───PCIe─── CPU ───PCIe─── GPU 1
  │                            │
  │                            │
NVSwitch                     InfiniBand
  │                            │
  │                            │
GPU 2 ──NVLink── GPU 3        GPU 2 ───PCIe─── CPU ───PCIe─── GPU 3

NVLink topology (GPU direct)   PCIe topology (CPU mediated)
  • GPUDirect RDMA: GPU ↔ NIC without CPU involvement (InfiniBand, RoCE)
  • GPUDirect Storage: GPU ↔ NVMe without CPU involvement (NVIDIA Magnum IO)
  • NVSwitch: full bisection bandwidth between all GPUs in a node

GPU virtualization

| Technology | Description | GPU support | Use case |
|---|---|---|---|
| NVIDIA vGPU (GRID) | Time slicing + dedicated profiles | A-series (virtual apps), B-series (VDI), Q-series (pro viz), C-series (AI/compute) | VDI, virtualized AI |
| NVIDIA MIG | Hardware partitioning of the GPU | A100 (7 instances), H100/H200/B200 | AI inference, multi-tenant GPU |
| AMD MxGPU | SR-IOV, hardware partitioning | AMD Instinct MI, Radeon Pro | VDI, cloud gaming |
| Intel Server GPU (SG1) | SR-IOV, hardware partitioning | Intel SG1, Flex, Arc | VDI, media transcoding |
| GPU passthrough | Whole GPU dedicated to a single VM (vfio-pci) | All GPUs | AI training, HPC, highest performance |

MIG partition table (A100 / H100)

| GPU | MIG profile | GPU memory | Compute slices |
|---|---|---|---|
| A100 40 GB | 1g.5gb | 5 GB | 1 |
| A100 40 GB | 2g.10gb | 10 GB | 2 |
| A100 40 GB | 3g.20gb | 20 GB | 3 |
| A100 40 GB | 7g.40gb | 40 GB | 7 |
| A100 40 GB | Full split (7 × 1g.5gb) | 7 × 5 GB | 7 instances |
| H100 80 GB | 1g.10gb (or 1g.10gb+me with media engines) | 10 GB | 1 |
| H100 80 GB | 2g.20gb | 20 GB | 2 |
| H100 80 GB | 3g.40gb | 40 GB | 3 |
| H100 80 GB | 7g.80gb | 80 GB | 7 |
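
MIG profile names such as `3g.20gb` encode the number of compute slices and the memory size, with an optional `+me` suffix for media engines. The following is a minimal sketch that parses these names and runs a coarse feasibility check on a set of instances. The helper names (`parse_mig_profile`, `fits_on_gpu`) are hypothetical and not part of any NVIDIA tool. The check only sums slices and memory, which is necessary but not sufficient: real MIG placement has additional rules that this sketch does not model.

```python
import re
from typing import Iterable, NamedTuple

class MigProfile(NamedTuple):
    compute_slices: int
    memory_gb: int
    media_extensions: bool

# Matches names like "1g.5gb", "3g.20gb" or "1g.6gb+me".
_PROFILE_RE = re.compile(r"^(\d+)g\.(\d+)gb(\+me)?$")

def parse_mig_profile(name: str) -> MigProfile:
    """Split a MIG profile name into compute slices, memory and the +me flag."""
    match = _PROFILE_RE.match(name.strip().lower())
    if match is None:
        raise ValueError(f"not a MIG profile name: {name!r}")
    return MigProfile(int(match.group(1)), int(match.group(2)), match.group(3) is not None)

def fits_on_gpu(profiles: Iterable[str], total_memory_gb: int, max_slices: int = 7) -> bool:
    """Coarse check: summed slices and memory must not exceed the GPU totals."""
    parsed = [parse_mig_profile(p) for p in profiles]
    slices = sum(p.compute_slices for p in parsed)
    memory = sum(p.memory_gb for p in parsed)
    return slices <= max_slices and memory <= total_memory_gb

if __name__ == "__main__":
    # 40 GB total corresponds to the full-GPU profile 7g.40gb.
    print(parse_mig_profile("3g.20gb"))
    print(fits_on_gpu(["3g.20gb", "3g.20gb"], total_memory_gb=40))
    print(fits_on_gpu(["3g.20gb", "3g.20gb", "1g.5gb"], total_memory_gb=40))
```

The third call returns `False` because the requested memory (45 GB) exceeds the 40 GB total, even though the compute slices would add up to exactly 7.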

GPU use cases

AI Training

  • Models: LLM (70B-405B+), vision, multimodal
  • GPU: H100, B200, GB200, MI300X
  • Interconnect: NVLink 4/5 or Infinity Fabric (within a node), InfiniBand NDR (between nodes)
  • Parallelism: Data Parallel (DDP), Tensor Parallel (TP), Pipeline Parallel (PP), Fully Sharded Data Parallel (FSDP)
  • Framework: PyTorch (NCCL), JAX (XLA), DeepSpeed, Megatron-LM
  • Tips:
    • GB200: 2× B200 (plus one Grace CPU) connected via NVLink, so 8 GPUs → 4 GB200
    • DGX B200 / HGX B200: the standard building block
    • InfiniBand: fat-tree topology to optimize all-reduce
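
The parallelism strategies listed above combine multiplicatively: the total GPU count equals the data-parallel degree times the tensor-parallel degree times the pipeline-parallel degree. The following is a minimal sketch of that arithmetic. It assumes tensor parallelism is kept inside one node so its traffic stays on NVLink or Infinity Fabric, which is a common convention rather than something the source prescribes. `plan_layout` and its parameters are hypothetical names, not an API of PyTorch, DeepSpeed, or Megatron-LM.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ParallelLayout:
    tensor_parallel: int
    pipeline_parallel: int
    data_parallel: int

def plan_layout(world_size: int, gpus_per_node: int,
                tensor_parallel: int, pipeline_parallel: int) -> ParallelLayout:
    """Derive the data-parallel degree from a fixed TP and PP choice.

    Invariant: world_size == data_parallel * tensor_parallel * pipeline_parallel.
    Assumption: a tensor-parallel group must fit inside a single node.
    """
    if tensor_parallel > gpus_per_node or gpus_per_node % tensor_parallel != 0:
        raise ValueError("tensor-parallel group must divide the GPUs of one node")
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be a multiple of TP * PP")
    return ParallelLayout(tensor_parallel, pipeline_parallel, world_size // model_parallel)

if __name__ == "__main__":
    # Hypothetical cluster: 8 nodes with 8 GPUs each.
    layout = plan_layout(world_size=64, gpus_per_node=8,
                         tensor_parallel=8, pipeline_parallel=2)
    print(layout)
```

With 64 GPUs, TP = 8 and PP = 2, the sketch yields a data-parallel degree of 4, meaning four model replicas that synchronise gradients over the inter-node fabric.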

AI Inference

  • Models: LLM serving, embedding, image generation
  • GPU: A100, H200, B200 (larger VRAM for larger models)
  • Techniques: MIG partitioning, TensorRT-LLM, vLLM, Triton Inference Server
  • Quantization: FP8, INT8, INT4 → lower VRAM usage, higher throughput
  • Latency: batch size tuning, dynamic batching, continuous batching
  • Scale: on-prem (2-32 GPU) / cloud (elastic)
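
The effect of quantization on VRAM can be estimated from the bytes stored per parameter. The following is a minimal sketch that covers model weights only. It assumes 2 bytes for FP16, 1 byte for FP8 and INT8, and 0.5 bytes for INT4, and it ignores the KV cache, activations, and quantization metadata such as scales. The function name and the 70B example are illustrative.

```python
# Bytes needed to store one parameter at each precision (weights only).
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes).

    params_billion * 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    simplifies to params_billion * bytes per parameter.
    """
    return params_billion * BYTES_PER_PARAM[precision]

def fits_in_vram(params_billion: float, precision: str, vram_gb: float) -> bool:
    """True if the weights alone fit; real deployments need extra headroom."""
    return weight_memory_gb(params_billion, precision) <= vram_gb

if __name__ == "__main__":
    for precision in ("FP16", "FP8", "INT4"):
        print(precision, weight_memory_gb(70, precision), "GB")
    print(fits_in_vram(70, "FP16", 80))  # 140 GB of weights vs 80 GB of VRAM
    print(fits_in_vram(70, "FP8", 80))   # 70 GB of weights vs 80 GB of VRAM
```

For a 70B-parameter model this gives 140 GB at FP16, 70 GB at FP8, and 35 GB at INT4, which is why quantization can decide whether a model fits on a single GPU.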

VDI (Virtual Desktop Infrastructure)

  • GPU: NVIDIA A16 (1 GPU = 16 users), A10 (1 GPU = 4 users)
  • Technologies: vGPU (GRID), AMD MxGPU
  • Protocols: VMware Blast, Citrix HDX, Microsoft RDP, PC-over-IP (HP Teradici)
  • Use case: CAD (CATIA, SolidWorks), Office, engineering, healthcare (PACS)
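
The users-per-GPU ratios above depend on the framebuffer size assigned to each user. The following is a minimal sketch of that arithmetic, assuming each physical GPU is split into equal framebuffer slices. The board sizes used are the A16 (4 GPUs with 16 GB each) and the A10 (1 GPU with 24 GB). The 4 GB and 6 GB profile sizes are assumptions chosen to reproduce the ratios in the list, and `users_per_board` is a hypothetical helper.

```python
def users_per_board(gpus_per_board: int, gb_per_gpu: int, profile_gb: int) -> int:
    """Upper bound on concurrent vGPU users for one board.

    Each physical GPU is carved into equal framebuffer slices of profile_gb;
    a slice cannot span two physical GPUs.
    """
    if profile_gb <= 0:
        raise ValueError("profile_gb must be positive")
    return gpus_per_board * (gb_per_gpu // profile_gb)

if __name__ == "__main__":
    # Assumed profile sizes: 4 GB per user on the A16, 6 GB per user on the A10.
    print(users_per_board(gpus_per_board=4, gb_per_gpu=16, profile_gb=4))  # A16
    print(users_per_board(gpus_per_board=1, gb_per_gpu=24, profile_gb=6))  # A10
```

Smaller profiles raise the density (for example, 1 GB profiles on the A16 would allow 64 users), at the cost of less framebuffer per desktop.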

Rendering and VFX

  • GPU: NVIDIA RTX 6000 Ada, RTX A6000, AMD Radeon Pro W7900
  • Rendering: Blender (Cycles/OptiX), V-Ray, Octane Render, Redshift
  • Denoising: AI-accelerated denoising on the GPU
  • Farm rendering: Deadline, Qube! (job scheduler)

GPU server form factors

| Form factor | GPU count | Power | Cooling | Example |
|---|---|---|---|---|
| 1U | 1-2 | 700-1400 W | Air (high-RPM) | Dell XR4510c |
| 2U | 4-8 | 3-6 kW | Air / Liquid | Dell R760xa, HPE DL380a |
| 4U | 8-10 | 5-8 kW | Air / Liquid | Supermicro SYS-421GE |
| 8U / Chassis | 8-16 | 10-20 kW | Air / Liquid (CDU) | NVIDIA DGX H100 (HGX), Supermicro SYS-821GE |

Sources

Links, books and standards: sources/infrastructure/sources.md

Last revised: 2026-06-03