knowledge-base/CONNECTIVITY.en.md
Stanislav Hubacek 3fa11ef0f6 comiiit
2026-06-11 15:27:28 +02:00

# 🔌 Server connectivity — network and storage connectivity
## Ethernet — network connectivity
### Speeds and formats
| Speed | Designation | Form factor | Cabling | Standard year | Use case |
|----------|----------|-------------|---------|---------------|----------|
| **1 GbE** | 1000BASE-T | RJ45 (copper) | Cat5e/Cat6 | 1999 | Management, legacy |
| **10 GbE** | 10GBASE-T / SFP+ | RJ45 / SFP+ | Cat6 (55m) / Cat6A (100m) / DAC / SR/LR | 2006 | Common server, storage |
| **25 GbE** | 25GBASE-R | SFP28 | Cat8 (30m) / DAC (5m) / SR/LR (100m/10km) | 2016 | Standard for servers (2020+) |
| **40 GbE** | 40GBASE-R | QSFP+ | DAC (7m) / SR (150m) / LR (10km) | 2010 | Legacy, spine |
| **50 GbE** | 50GBASE-R | SFP56 | DAC / SR / LR | 2018 | Emerging server |
| **100 GbE** | 100GBASE-R | QSFP28 | DAC (3m) / SR4 (100m) / LR4 (10km) / PSM4 (500m) | 2015 | Spine, storage, AI |
| **200 GbE** | 200GBASE-R | QSFP56 | DAC / SR4 / DR4 | 2017 | AI/ML, HPC |
| **400 GbE** | 400GBASE-R | QSFP-DD / OSFP | DAC (2.5m) / SR8 (100m) / DR4 (500m) / FR4 (2km) | 2017 | AI training, hyperscale |
| **800 GbE** | 800GBASE-R | QSFP-DD800 / OSFP | DAC (2m) / SR8 (100m) / DR8 (500m) | 2024 | Next-gen AI/ML |
**Recommendations for servers (2026)**:
- **Standard**: 2× 25 GbE (management + data) or 2× 100 GbE for demanding workloads
- **AI/ML training**: 8× 400 GbE (InfiniBand preferred for GPU communication)
- **Storage**: 2× 25/100 GbE (iSCSI/NFS) or dedicated FC (16/32 Gbps)
### NIC form factor
| Form factor | PCIe lanes | Speed | Use case |
|------------|-----------|----------|----------|
| **OCP 3.0** | x8/x16 | 25/100/200 GbE | Modern servers (Dell, HPE), small form factor |
| **PCIe HHHL** | x8 | 25/50 GbE | Standard 1U/2U servers |
| **PCIe FHHL** | x16 | 100/200/400 GbE | GPU servers, high-density |
| **Mezzanine** | x8 | 10/25 GbE | Blade servers (HPE Synergy, Dell MX) |
| **LOM (LAN on Motherboard)** | — | 1/10/25 GbE | Integrated, basic connectivity |
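
The lane counts in the table follow from simple arithmetic: the slot must carry at least the aggregate line rate of the NIC ports. The sketch below is a rough check, assuming PCIe Gen3 to Gen5 with 128b/130b encoding and ignoring TLP/DLLP protocol overhead, so real throughput is somewhat lower. The function names `pcie_mbps` and `nic_fits` are illustrative only.

```shell
#!/bin/sh
# Usable PCIe bandwidth in Mb/s for Gen3-Gen5 (128b/130b line encoding).
# Protocol overhead (TLP/DLLP) is ignored, so treat the result as an upper bound.
pcie_mbps() {
    gen=$1; lanes=$2
    case "$gen" in
        3) rate=8000 ;;    # 8 GT/s per lane
        4) rate=16000 ;;   # 16 GT/s per lane
        5) rate=32000 ;;   # 32 GT/s per lane
        *) echo "unsupported PCIe generation: $gen" >&2; return 1 ;;
    esac
    echo $(( rate * lanes * 128 / 130 ))
}

# Succeeds when the aggregate Ethernet line rate fits into the slot.
nic_fits() {
    gen=$1; lanes=$2; ports=$3; port_gbps=$4
    slot=$(pcie_mbps "$gen" "$lanes") || return 2
    [ $(( ports * port_gbps * 1000 )) -le "$slot" ]
}

nic_fits 3 8 2 25   && echo "2x 25 GbE fits PCIe 3.0 x8"
nic_fits 3 8 1 100  || echo "1x 100 GbE exceeds PCIe 3.0 x8"
nic_fits 5 16 1 400 && echo "1x 400 GbE fits PCIe 5.0 x16"
```

Under these assumptions a PCIe 3.0 x8 slot tops out near 63 Gb/s, which is why 100 GbE and faster adapters appear only in the x16 rows.
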
### NIC features
| Feature | Description | Benefit |
|---------|-------|---------|
| **TSO/GRO** | TCP Segmentation Offload / Generic Receive Offload | Reduced CPU load for TCP |
| **LRO/LSO** | Large Receive Offload / Large Send Offload | LRO is the hardware predecessor of GRO; LSO is the Windows term for TSO |
| **RSS** | Receive Side Scaling | Distribution of incoming packets across multiple CPU cores |
| **RPS/RFS** | Receive Packet Steering / Receive Flow Steering | Software RSS, cache affinity |
| **XDP** | eXpress Data Path | BPF-based packet processing (DDoS, load balancer) |
| **RDMA (RoCE v2)** | RDMA over Converged Ethernet | GPU direct communication, storage (NVMe-oF) |
| **iWARP** | RDMA over TCP | RDMA without special switch (higher latency) |
| **DPDK** | Data Plane Development Kit | Kernel-bypass packet processing in userspace (VNF, vSwitch) |
| **VXLAN/NVGRE offload** | HW offload for tunneling | Overlay networking (VMware NSX, OpenStack) |
| **SR-IOV** | Single Root I/O Virtualization | Direct NIC access for VMs (VF), low latency |
| **Flow Bifurcation** | Split NIC traffic between kernel and DPDK | Concurrent management and high-speed data path |
| **PTP (IEEE 1588)** | Precision Time Protocol | Financial services, 5G, telco |
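
On Linux, several of the features above can be inspected and toggled with `ethtool` and sysfs. The following is a minimal sketch, not a tuning recipe: the interface name `eth0`, the queue count, and the VF count are placeholders, and whether a given option is accepted depends on the NIC driver. By default the script only prints the commands; setting `RUN=1` (an illustrative variable) executes them and requires root.

```shell
#!/bin/sh
# Print each command (dry run) unless RUN=1 is set; real execution needs root.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

tune_nic() {
    iface=$1; queues=$2; vfs=$3
    run ethtool -k "$iface"                     # show current offload settings
    run ethtool -K "$iface" tso on gro on       # enable TSO and GRO
    run ethtool -L "$iface" combined "$queues"  # RSS: number of RX/TX queue pairs
    run ethtool -x "$iface"                     # show the RSS indirection table
    # SR-IOV: virtual functions are created through sysfs
    vf_path="/sys/class/net/$iface/device/sriov_numvfs"
    if [ "${RUN:-0}" = "1" ]; then
        echo "$vfs" > "$vf_path"
    else
        echo "echo $vfs > $vf_path"
    fi
}

tune_nic eth0 8 4
```

Matching the queue count to the number of cores that should handle network interrupts is a common starting point, but the right value depends on the workload.
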
### NIC selection per workload
| Workload | Recommended NIC | Rationale |
|----------|---------------|------------|
| **Web / API servers** | 2× 25 GbE SFP28, OCP | Low cost, sufficient bandwidth |
| **Virtualization (VMware)** | 2× 25 GbE (SR-IOV, VXLAN offload) | SR-IOV for VMs, VXLAN for NSX |
| **Database (OLTP)** | 2× 25/100 GbE (RSS, low latency) | Low latency, RSS for CPU scaling |
| **Storage (NFS/iSCSI)** | 2× 25/100 GbE (RoCE v2) | RDMA for iSER, NFS over RDMA, and NVMe-oF; low latency |
| **Storage (FC SAN)** | 2× 32 Gb FC HBA | SAN for VMware VMFS, block storage |
| **AI/ML training** | 8× 400 GbE + InfiniBand NDR | GPU communication, data ingestion |
| **AI/ML inference** | 4× 100 GbE (RoCE v2) | Model serving, GPU direct |
| **HPC** | InfiniBand NDR 400 Gbps | MPI communication, low latency |
| **Telco / Edge** | 2× 25 GbE (DPDK, PTP) | VNF, 5G UPF, low latency |
---
## Storage connectivity
### Fibre Channel (FC) SAN
| Generation | Speed | Designation | Form factor | Reach (SMF) | Use case |
|----------|----------|----------|-------------|-------------|----------|
| **Gen 5** | 16 Gbps | 16GFC | SFP+ | 10 km | Legacy SAN |
| **Gen 6** | 32 Gbps | 32GFC | SFP28 | 10 km | Current standard |
| **Gen 7** | 64 Gbps | 64GFC | SFP56 | 10 km | Emerging, high-performance |
| **Gen 8** | 128 Gbps | 128GFC | SFP112 (single lane) | 10 km | Emerging (first production deployments) |
**HBA (Host Bus Adapter)**:
| Manufacturer | Model | Speed | PCIe | Ports | Features |
|---------|-------|----------|------|-------|----------|
| **Broadcom / Emulex** | LPe35000 | 32GFC | PCIe 3.0 x8 | 1-2 | FC-NVMe, T10-PI, SR-IOV |
| **Broadcom / Emulex** | LPe36000 | 64GFC | PCIe 4.0 x16 | 1-2 | FC-NVMe |
| **Marvell / QLogic** | QLE2770 | 32GFC | PCIe 3.0 x8 | 1-2 | FC-NVMe, T10-PI |
| **Marvell / QLogic** | QLE2870 | 64GFC | PCIe 4.0 x8 | 1-2 | FC-NVMe |
**FC SAN topology**:
```
Server ──HBA1── FC Switch A (fabric A) ──── Storage Array (FC port 1)
  │
  └────HBA2── FC Switch B (fabric B, backup) ──── Storage Array (FC port 2)

ISL (Inter-Switch Link): joins switches inside one fabric; fabrics A and B stay separate
```
**Zoning** (FC):
```
Zone A: Server1_HBA1 + Storage_Port1 (production)
Zone B: Server1_HBA2 + Storage_Port2 (backup fabric)
Zone C: Backup_Server + Storage_Target (backup)
```
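
Zones like the ones above are usually created on the switch CLI. As a hedged illustration, the script below generates single-initiator zoning commands in Brocade Fabric OS style (`zonecreate`, `cfgcreate`, `cfgsave`, `cfgenable`); other vendors use different syntax. The zone names, configuration name, and WWPNs are made-up placeholders, and `gen_zoning` is a hypothetical helper.

```shell
#!/bin/sh
# Emit Brocade FOS-style zoning commands for single-initiator zones.
# Input lines on stdin: "<zone_name> <initiator_wwpn> <target_wwpn>"
# All names and WWPNs used here are placeholders.
gen_zoning() {
    cfg=$1
    zones=""
    while read -r zone init tgt; do
        [ -n "$zone" ] || continue
        printf 'zonecreate "%s", "%s; %s"\n' "$zone" "$init" "$tgt"
        zones="${zones:+$zones; }$zone"
    done
    printf 'cfgcreate "%s", "%s"\n' "$cfg" "$zones"
    printf 'cfgsave\ncfgenable "%s"\n' "$cfg"
}

gen_zoning fabric_a_cfg <<'EOF'
zone_a_server1 10:00:00:00:c9:00:00:01 50:00:00:00:00:00:00:01
zone_c_backup  10:00:00:00:c9:00:00:03 50:00:00:00:00:00:00:03
EOF
```

Zone B belongs to the second fabric, so under this layout it would be generated separately and applied on the fabric B switch.
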
### iSCSI
| Property | iSCSI | Note |
|-----------|-------|----------|
| **Transport** | TCP/IP (port 3260) | Over standard Ethernet |
| **Speed** | 1/10/25/100 GbE | Same as Ethernet |
| **Initiator** | SW (OS) or HW (TOE) | SW initiator free, ~5-10 % CPU load |
| **Multipathing** | MPIO (Multipath I/O) or MC/S (Multiple Connections per Session) | Up to 8 paths, active/active or active/passive |
| **CHAP** | Authentication | Mutual CHAP recommended |
| **Jumbo frames** | Recommended MTU 9000 | Reduced CPU overhead, higher throughput |
| **Use case** | Small and medium SAN, backup, DR | Cheaper than FC, lower performance |
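
The jumbo frame recommendation can be checked with a back-of-the-envelope calculation. The sketch below assumes plain TCP/IPv4 without options or VLAN tags and ignores iSCSI PDU headers, so the numbers are approximate. It compares the payload share of wire bandwidth and the frame rate needed to fill a 10 Gb/s link at MTU 1500 and MTU 9000.

```shell
#!/bin/sh
# Per-frame cost on the wire: 14 B Ethernet header + 4 B FCS + 8 B preamble
# + 12 B inter-frame gap = 38 B outside the MTU; 20 B IP + 20 B TCP inside it.
WIRE_OVERHEAD=38
TCPIP_HEADERS=40

# TCP payload share of wire bandwidth, in hundredths of a percent.
payload_efficiency() {
    mtu=$1
    echo $(( (mtu - TCPIP_HEADERS) * 10000 / (mtu + WIRE_OVERHEAD) ))
}

# Full-size frames per second needed to fill a 10 Gb/s link (1.25e9 B/s).
frames_per_sec_10g() {
    mtu=$1
    echo $(( 1250000000 / (mtu + WIRE_OVERHEAD) ))
}

for size in 1500 9000; do
    echo "MTU $size: efficiency $(payload_efficiency "$size") (x0.01 %), $(frames_per_sec_10g "$size") frames/s"
done
```

The throughput gain is a few percent, while the frame rate drops by roughly a factor of six. Since per-packet work dominates CPU cost, that is where most of the reduced overhead comes from. Jumbo frames only help when every hop on the path is configured for the larger MTU.
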
**iSCSI configuration**:
```
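# Software initiator (Linux, open-iscsi): discover targets on the portal
iscsiadm -m discovery -t sendtargets -p 10.0.0.100:3260
# Log in to the discovered target
iscsiadm -m node -T iqn.2024-05.storage:array01 -p 10.0.0.100:3260 --login
# Verify the session
iscsiadm -m session
# Multipath (dm-multipath; mpathconf is the RHEL-family helper)
mpathconf --enable --with_multipathd y
# /etc/multipath.conf: aliases, failback, rr_min_io
# Verify that all paths are visible
multipath -ll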
```
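
The `/etc/multipath.conf` settings mentioned in the comment (aliases, failback, rr_min_io) might look like the output of the sketch below. This is an illustrative starting point, not a vendor-validated configuration: the WWID and alias are placeholders, `gen_multipath_conf` is a hypothetical helper, and `rr_min_io_rq` is used on the assumption of a request-based multipath kernel (older kernels use `rr_min_io`). Storage vendors usually publish their own recommended values.

```shell
#!/bin/sh
# Print a minimal multipath.conf to stdout; WWID and alias are placeholders.
gen_multipath_conf() {
    wwid=$1; alias_name=$2
    cat <<EOF
defaults {
    user_friendly_names yes
    # active/active: spread I/O across all paths
    path_grouping_policy multibus
    # return to the preferred path group as soon as it recovers
    failback immediate
    # number of I/O requests sent down a path before switching
    rr_min_io_rq 1
}
multipaths {
    multipath {
        wwid $wwid
        alias $alias_name
    }
}
EOF
}

gen_multipath_conf 3600a0b80001234560000000000000001 array01_lun0
```

Redirecting the output to a scratch file and reviewing it before copying it to `/etc/multipath.conf` keeps the running configuration untouched until the change is intended.
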
### NVMe-oF (NVMe over Fabrics)
| Transport | Protocol | Latency | CPU overhead | Use case |
|-----------|----------|---------|-------------|----------|
| **NVMe over FC** | FC-NVMe (FC Gen 6/7) | <10 µs | Low | Enterprise SAN, VMware |
| **NVMe over RDMA (RoCE v2)** | RDMA (RoCE) | <5 µs | Very low | AI/ML, HPC, K8s (CSI) |
| **NVMe over TCP** | TCP | ~50 µs | Moderate (10-20 % CPU) | Standard Ethernet, no RDMA |
| **NVMe over InfiniBand** | IB RC/UC | <3 µs | Lowest | HPC, AI training |
**NVMe-oF comparison**:
| Property | FC-NVMe | NVMe/RoCE | NVMe/TCP | NVMe/IB |
|-----------|---------|-----------|----------|---------|
| **Latency (target)** | ~8 µs | ~4 µs | ~50 µs | ~3 µs |
| **Bandwidth** | 64 Gbps | 100/200 GbE | 25/100 GbE | NDR 400 Gbps |
| **Requires special HW** | FC HBA + switch | RoCE NIC + DCB switch | Standard NIC | IB HCA + switch |
| **Ecosystem** | Broadcom, Marvell | NVIDIA, Broadcom | OS built-in | NVIDIA Mellanox |
| **Use case** | VMware, enterprise SAN | AI/ML, K8s, HPC | SMB, K8s, cost-effective | HPC, large AI |
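
For the TCP and RDMA transports, a Linux host typically attaches to a target with `nvme-cli`. The following sketch prints the usual sequence (load the transport module, discover, connect, list). The address, the NQN, and the use of port 4420 are assumptions for illustration, and `RUN=1` is a hypothetical switch that executes the commands instead of printing them.

```shell
#!/bin/sh
# Print each command (dry run) unless RUN=1 is set; real execution needs root.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

nvmeof_connect() {
    transport=$1; addr=$2; nqn=$3
    case "$transport" in
        tcp)  module=nvme-tcp ;;    # NVMe/TCP on any standard NIC
        rdma) module=nvme-rdma ;;   # NVMe/RoCE or NVMe/IB
        *) echo "unsupported transport: $transport" >&2; return 1 ;;
    esac
    run modprobe "$module"
    run nvme discover -t "$transport" -a "$addr" -s 4420
    run nvme connect -t "$transport" -a "$addr" -s 4420 -n "$nqn"
    run nvme list
}

# Placeholder target address and NQN
nvmeof_connect tcp 10.0.0.100 nqn.2024-05.storage:array01
```

The host-side commands are the same for both transports apart from the `-t` value. The extra work for RoCE sits in the network (RoCE-capable NICs and DCB configuration on the switches), as the comparison table notes.
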
### SAS (Serial Attached SCSI)
| Generation | Speed | Cabling | Reach | Use case |
|----------|----------|---------|-------|----------|
| **SAS 3** | 12 Gbps | SAS cable (SFF-8644) | 6-10 m | Legacy storage, DAS |
| **SAS 4** | 22.5 Gbps | SAS cable (SFF-8644) | 6-10 m | Current standard |
| **SAS 5** | 45 Gbps | SAS cable (SFF-8644) | 6-10 m | Emerging |
**SAS topology**: Server → SAS HBA → SAS expander → SAS disk (point-to-point, not shared like FC)
---
## Server connectivity — decision matrix
| Workload | Primary | Secondary | Management |
|----------|----------|-----------|------------|
| **Web / API** | 2× 25 GbE (LACP) | — | 1× 1 GbE BMC |
| **Database** | 2× 25/100 GbE (RSS) | 2× 32 Gb FC (SAN) | 1× 1 GbE BMC |
| **Virtualization** | 4× 25 GbE (SR-IOV) | 2× 32 Gb FC (VMFS) | 1× 1 GbE BMC |
| **Kubernetes** | 2× 25/100 GbE | — | 1× 1 GbE BMC |
| **Storage node** | 2× 100 GbE (RoCE) | 2× 25 GbE (management) | 1× 1 GbE BMC |
| **AI training** | 8× 400 GbE + IB NDR | 4× 100 GbE (storage) | 1× 1 GbE BMC |
| **AI inference** | 4× 100 GbE (RoCE) | 2× 25 GbE (management) | 1× 1 GbE BMC |
| **HPC** | InfiniBand NDR | 2× 100 GbE (storage) | 1× 1 GbE BMC |
---
## Server NIC placement (PCIe slot optimization)
```
2U Server (GPU/AI):
┌─────────────────────────────────────────────────┐
│ PCIe 0: GPU (x16) — NVLink / InfiniBand (x16) │
│ PCIe 1: GPU (x16) — NIC 100 GbE (x16) │
│ PCIe 2: GPU (x16) │
│ PCIe 3: GPU (x16) │
│ PCIe 4: GPU (x16) │
│ PCIe 5: GPU (x16) — NIC 100 GbE (x16) │
│ PCIe 6: Storage HBA / NIC (x8) │
│ PCIe 7: Management / OCP (x8) │
└─────────────────────────────────────────────────┘
1U Standard:
┌─────────────────────────────────┐
│ OCP: 2× 25 GbE (management) │
│ PCIe 0: NIC 25 GbE (x8) │
│ PCIe 1: Storage HBA / FC (x8) │
│ PCIe 2: GPU (x16, optional) │
│ PCIe 3: NVMe (x4, M.2) │
└─────────────────────────────────┘
```
### NVIDIA Mellanox ConnectX NICs
NVIDIA, which acquired Mellanox in 2020, builds the ConnectX adapter family, widely deployed in AI/HPC and cloud data centers.
| Model | PCIe | Max speed | Form factor | Key features |
|-------|------|-------------|-------------|------------------|
| **ConnectX-5** | PCIe 3.0 x16 | 100 GbE (dual) | HHHL | RoCE, NVMe-oF target offload, MPI offload |
| **ConnectX-6 Dx** | PCIe 4.0 x16 | 200 GbE (1-port) / 100 GbE (2-port) | HHHL, OCP 3.0 | ASAP² vSwitch offload, IPsec/TLS inline crypto, AES-XTS, 215 Mpps DPDK |
| **ConnectX-6 Lx** | PCIe 4.0 x8 | 25 GbE (dual) | HHHL, OCP 3.0 | RoCE, Secure Boot, low-power |
| **ConnectX-7** | PCIe 5.0 x16 | 400 GbE (1-port) / 200 GbE (2-port) | HHHL | NDR InfiniBand + 400GbE, GPUDirect, SHARP |
| **ConnectX-8** | PCIe 6.0 x16 | 800 GbE (1-port) / 400 GbE (2-port) | HHHL | XDR InfiniBand, sub-500ns latency, in-network computing, multi-host |
**Platforms**: Spectrum-X Ethernet (end-to-end AI networking), Quantum InfiniBand, BlueField DPU.
### Broadcom Emulex FC HBA
| Model | Speed | PCIe | Ports | Features |
|-------|----------|------|-------|----------|
| **LPe35000** (Gen 7) | 32 GFC | PCIe 3.0 x8 | 1-2 | NVMe-FC, T10-PI (DIF), SR-IOV, Silicon Root of Trust |
| **LPe35002** (Gen 7) | 32 GFC | PCIe 3.0 x8 | 2 | NVMe-FC, Secure Boot, digitally signed firmware |
| **LPe36000** (Gen 7) | 64 GFC | PCIe 4.0 x16 | 1-2 | First 64GFC HBA on the market, 10M IOPS, 3× lower latency than Gen 6 |
**Key features**: NVMe over FC support, T10 DIF (Data Integrity Field), 10 million hours MTBF, and NIST SP 800-193 (platform firmware resiliency) compliance. Gen 7 delivers up to 10M IOPS and 3× lower latency than Gen 6.
### NVMe-oF specification
NVMe over Fabrics (NVMe-oF) extends the NVMe protocol from local PCIe to network transports. The first specification (1.0) was released in June 2016; NVMe-oF is now part of the NVMe 2.3 specification family (August 2025). Supported transports:
| Transport | Specification | Use case |
|-----------|------------|----------|
| **NVMe over PCIe** | NVMe Base | Local NVMe SSD |
| **NVMe over RDMA** (RoCE, InfiniBand, iWARP) | NVMe Transport | AI/ML, HPC, lowest latency <5 µs |
| **NVMe over TCP** | NVMe Transport | Standard Ethernet, no RDMA, latency ~50 µs |
| **NVMe over FC** (FC-NVMe) | INCITS T11 | Enterprise SAN, FC fabric |
The NVMe 2.x specification family also includes the Zoned Namespaces (ZNS) command set (introduced with NVMe 2.0), the Computational Programs command set, and the Subsystem Local Memory (SLM) command set. NVMe-MI defines the management interface.
### Dell PowerEdge R760 — NIC placement
The Dell PowerEdge R760 supports:
- **OCP 3.0** adapters (up to 2×) — 1/10/25/100 GbE
- **PCIe Gen5** slots — 8× slots (6× FHHL + 2× LP)
- **LOM** — 2× 1 GbE Broadcom 5720 on motherboard
- Maximum NIC speed: 100 GbE (QSFP56)
- Supported types: RJ45, SFP+, SFP28, QSFP28, QSFP56
Recommended configurations:
- Standard: OCP 3.0 2× 25 GbE + PCIe storage HBA
- AI/ML: PCIe 100 GbE (riser config 1, slot 1-2) + GPU in other slots
### HPE Gen11 NIC options
HPE ProLiant Gen11 (DL360/DL380) supports:
- **OCP 3.0** slots (up to 2) — 10/25/100/200 GbE (Broadcom, Intel, NVIDIA Mellanox)
- **PCIe Gen5** adapters — 8× slots (DL380) / 3× slots (DL360)
- **iLO 6** dedicated management port (1 GbE)
- Supported NICs: Broadcom BCM57412 (10GbE), BCM57504 (25GbE), NVIDIA ConnectX-6 Dx (100GbE)
## Sources
Links, books, and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
### Recommended literature
| Book | Authors | ISBN | Description |
|-------|--------|------|-------|
| AI Data Center Network Design and Technologies (1st ed., 2026) | Subramaniam, Styszynski, Tambakuwala | 978-0-13-543628-8 | First vendor-agnostic guide to network design for AI training and inference. Covers high-radix fabric, lossless Ethernet/IP, UEC technologies, cooling and power for AI clusters. Authors from HPE Juniper Networking. |
*Last revision: 2026-06-03*