knowledge-base/SERVER-CONFIG.en.md
Stanislav Hubacek ef3c2f75b1 18.6.2026
2026-06-18 16:25:33 +02:00

# ⚙️ Server configuration — best practices by workload
## General BIOS/UEFI settings
| Setting | Recommendation | Rationale |
|-----------|-----------|------------|
| **Boot mode** | UEFI | Secure Boot, GPT, larger disks |
| **Power profile** | Performance / OS Control | Max performance, C-States disabled |
| **Hyper-Threading** | Enabled | +30-50 % throughput for multi-thread |
| **Virtualization** | Enabled (VT-x/AMD-V) | Required for hypervisor, containers |
| **SR-IOV** | Enabled | GPU, NIC passthrough |
| **NUMA** | Enabled | NUMA-aware scheduling |
| **ACPI** | Enabled | Power management, OS-level |
| **Secure Boot** | Enabled | Secure boot chain |
| **TPM** | Enabled | Measured boot, key storage |
---
## 1. Database servers
### CPU Selection
| DB type | CPU preference | Rationale |
|--------|---------------|------------|
| **OLTP** (PostgreSQL, MySQL) | High clock, moderate cores | Low latency per transaction, limited parallelism |
| **OLAP** (ClickHouse, Snowflake) | Many cores, AVX-512 | Columnstore, high parallelism |
| **In-memory** (Redis, Memcached) | High clock, low cache latency | Single-threaded (Redis), RAM bandwidth |
| **Document** (MongoDB) | Balance (clock × cores) | Mixed workload |
| **Distributed** (Cassandra, Scylla) | Many cores, high cache | Shard-per-core (Scylla), compaction |
| **Oracle OLTP** | High clock, moderate cores, core-factor aware | CPU license cost (core factor 0.5 for AMD EPYC and Intel Xeon) |
| **Oracle OLAP / DW** | Many cores, large SGA, in-memory option | Parallel query, Exadata Smart Scan, compression |
### Oracle CPU licensing — core factor
Oracle licenses per core with a correction factor depending on the processor. Factor 0.5 means 2 cores = 1 Oracle license.
| Processor | Core factor | 64 physical cores → Oracle licenses |
|----------|-------------|--------------------------------------|
| AMD EPYC (all series) | 0.5 | 32 |
| Intel Xeon (Scalable) | 0.5 | 32 |
| IBM POWER | 1.0 | 64 |
| ARM (Ampere Altra) | 0.5 | 32 |
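**Impact on CPU selection**: At the same license count, a 0.5-factor CPU (EPYC, Xeon) can run twice as many cores as a 1.0-factor CPU (POWER), so you get more compute for the same license price. Between EPYC and Xeon the factor is identical, so license cost depends only on core count and favors parts with high per-core performance.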
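
The license arithmetic in the table above can be sketched in a few lines. This is only an illustration: `CORE_FACTORS` and `oracle_licenses` are hypothetical names, the factors are copied from the table rather than from Oracle's current Processor Core Factor Table, and rounding fractional results up is an assumption to check against your contract.

```python
import math

# Core factors copied from the table above (assumption: verify against
# Oracle's current Processor Core Factor Table before relying on them).
CORE_FACTORS = {
    "AMD EPYC": 0.5,
    "Intel Xeon": 0.5,
    "IBM POWER": 1.0,
    "ARM (Ampere Altra)": 0.5,
}


def oracle_licenses(physical_cores: int, processor: str) -> int:
    """Return processor licenses for one server: cores x core factor, rounded up."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return math.ceil(physical_cores * CORE_FACTORS[processor])


if __name__ == "__main__":
    for cpu in CORE_FACTORS:
        print(f"{cpu}: 64 cores -> {oracle_licenses(64, cpu)} licenses")
```
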
### Configuration by company size and storage type
#### Variant A: Small company — local NVMe RAID
| Component | Recommendation | Note |
|-----------|-----------|----------|
| **CPU** | 1× EPYC 9124/9224 or Intel Xeon 4410Y (8-16C) | 1 socket, high clock |
| **RAM** | 64-256 GB (8-16 GB/core) | DDR5-4800, 1DPC |
| **OS disk** | 2× SATA/SAS SSD, RAID 1 (240-480 GB) | For OS + binaries |
| **Data disk** | 4-6× NVMe (U.2/E3.S), RAID 10 | Local data, no sharing |
| **WAL disk** | 2× NVMe RAID 1 (400-800 GB) | PostgreSQL only |
| **Network** | 2× 25 GbE (LACP) | Application traffic + management |
| **Form factor** | 1U or 2U | Single node, no cluster |
| **Storage backend** | Local RAID controller (PERC/Broadcom) | HW RAID 10 or SW RAID (mdadm) |
| **HA** | Application manages failover (patroni, repmgr, orchestrator) | Standby node on failure |
**Use case**: Startup, branch office, dev/test, < 500 users, single database server, low availability requirements.
#### Variant B: Medium company — local NVMe + asynchronous replication
| Component | Recommendation | Note |
|-----------|-----------|----------|
| **CPU** | 1-2× EPYC 9334/9374F or Intel Xeon 5418Y (16-24C) | 1-2 socket, balanced |
| **RAM** | 128-512 GB (8-16 GB/core) | DDR5-4800/5600, 1DPC |
| **OS disk** | 2× NVMe RAID 1 (2× 480 GB) | OS + binaries |
| **Data disk** | 6-8× NVMe, RAID 10 | Local NVMe, 3-6 TB usable |
| **WAL disk** | 2× NVMe RAID 1 (2× 800 GB) | Separate from data |
| **Network** | 2× 25 GbE (app) + 2× 25 GbE (replication) | Application and replication networks separated |
| **Form factor** | 2U | Primary + replica node |
| **Storage backend** | SW RAID (mdadm) or HW RAID (PERC H965) | Write-back cache with BBU |
| **HA** | Patroni / repmgr / MySQL InnoDB Cluster | Asynchronous replication to 1-2 standby |
**Use case**: E-commerce, medium SaaS, 500-5000 users, RPO < 1 min, RTO < 5 min.
#### Variant C: Large company — FC SAN (enterprise)
| Component | Recommendation | Note |
|-----------|-----------|----------|
| **CPU** | 2× EPYC 9654/9965 or Xeon 8592+/6980P (48-128C) | 2 socket, max cores, large cache |
| **RAM** | 512 GB - 2 TB (8-16 GB/core) | DDR5, 2DPC (speed penalty), 12 channels (EPYC) |
| **OS disk** | 2× SATA SSD RAID 1 (2× 480 GB) | OS only, data on SAN |
| **Data + WAL** | LUNs from FC SAN | Hitachi VSP / Dell PowerMax / Pure //X |
| **HBA** | 2× dual-port FC HBA (32/64 Gb) | Multipath (active-active), FC-NVMe |
| **Network** | 2× 25/100 GbE (app) + 2× 32/64 Gb FC (storage) | App and storage networks separated |
| **Form factor** | 2U | 2-8 node cluster (RAC, AlwaysOn AG) |
| **Storage backend** | FC SAN — LUN per database | Thin provisioning, RAID on SAN, snapshots |
| **HA** | Oracle RAC / SQL Server AOAG / PostgreSQL Patroni | Synchronous replication, FC multipath |
**SAN advantages**: Centralized management, snapshots, cloning, disaster recovery (SRDF/Metro), separate storage network, higher availability.
**Disadvantages**: Higher latency compared to local NVMe (~50-200 µs over SAN vs ~10 µs local NVMe), higher CAPEX, vendor lock-in.
#### Variant D: Large company — Ceph / SDS backend
| Component | Recommendation | Note |
|-----------|-----------|----------|
| **CPU** | 2× EPYC 9334/9654 (16-32C) | Fewer cores than SAN variant — part of CPU goes to Ceph client |
| **RAM** | 256-512 GB | Less RAM — Ceph client cache is not as effective as local buffer |
| **OS disk** | 2× SATA SSD RAID 1 (2× 480 GB) | OS |
| **Network** | 2× 25/100 GbE (app) + 2× 25/100 GbE (Ceph public) | App and Ceph traffic over Ethernet |
| **HBA** | Storage HBA in IT/HBA mode (no RAID) | For Ceph OSD node, not DB node |
| **Form factor** | 2U | DB node + separate Ceph OSD node |
| **Storage backend** | RBD (RADOS Block Device) over Ceph | 3× replication or erasure coding |
| **HA** | Application + Ceph inherent HA | Ceph self-healing, auto-rebalance |
**Ceph advantages**: No vendor lock-in, horizontal scaling, unified platform for block/file/object, lower CAPEX.
**Disadvantages**: Higher latency and CPU overhead (Ceph client → network → OSD), variable performance, more complex troubleshooting.
#### Variant E: Cloud — RDS / CloudSQL / Azure SQL
| Component | Recommendation | Note |
|-----------|-----------|----------|
| **Compute** | AWS RDS (db.r7g/r8g), Azure SQL (GP/BC/Hyperscale) | Managed service, no OS access |
| **Storage** | EBS gp3 / io2, Azure Premium SSD v2, Cloud SQL SSD | Automatic scaling, PITR, multi-AZ |
| **Network** | Security Group, Private Link, VPC peering | No HBA, no SAN — everything over Ethernet |
| **HA** | Multi-AZ (synchronous), read replicas | Managed failover, RTO < 60 s |
| **Backup** | Automated, PITR (7-35 days) | No management required |
**Use case**: No on-prem hardware, elastic scaling, pay-per-use, lower operational overhead.
**Disadvantages**: Higher long-term costs, data residency, network latency, limited customization.
### Variant comparison
| Aspect | Local NVMe (small) | Local NVMe (medium) | FC SAN | Ceph | Cloud |
|--------|---------------------|----------------------|--------|------|-------|
| **Latency** | ~10 µs | ~10 µs | ~50-200 µs | ~100-500 µs | ~100-1000 µs |
| **Scaling** | Vertical | Vertical | Horizontal | Horizontal | Elastic |
| **CAPEX** | Low | Medium | High | Medium | None (OPEX) |
| **Operational overhead** | Low | Low | High (SAN admin) | Medium | None |
| **HA** | Application | Patroni/Cluster | RAC/AOAG | Ceph HA | Managed |
| **RPO** | 1-5 min | < 1 min | < 10 s | < 30 s | < 60 s |
| **RTO** | 5-15 min | < 5 min | < 2 min | < 5 min | < 60 s |
| **Number of servers** | 1-2 | 2-4 | 4-16 | 6-20+ | 0 (managed) |
| **Company** | Startup/SME | SME/Enterprise | Enterprise | Enterprise | Any |
### PostgreSQL parameter matrix by storage type
| Parameter | Local NVMe | FC SAN | Ceph RBD |
|----------|-----------|--------|----------|
| `random_page_cost` | 1.1 | 1.5-2.0 | 2.0-3.0 |
| `effective_io_concurrency` | 300 | 100-200 | 50-100 |
| `synchronous_commit` | off (faster commits; the last few commits can be lost on a crash) | on | off (same caveat as local NVMe) |
| `full_page_writes` | on | on | on (even over Ceph) |
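
As a rough sketch, the matrix above could be turned into a `postgresql.conf` fragment per storage backend. The names `PG_STORAGE_PROFILES` and `render_pg_conf` are hypothetical, ranges from the table are collapsed to their midpoint purely for illustration, and `synchronous_commit` is left out on purpose because it is a durability decision that should be made per database.

```python
# Planner and I/O settings per storage backend, taken from the matrix above.
# Ranges are collapsed to their midpoint (an illustrative assumption).
PG_STORAGE_PROFILES = {
    "local_nvme": {"random_page_cost": 1.1, "effective_io_concurrency": 300},
    "fc_san": {"random_page_cost": 1.75, "effective_io_concurrency": 150},
    "ceph_rbd": {"random_page_cost": 2.5, "effective_io_concurrency": 75},
}


def render_pg_conf(storage: str) -> str:
    """Render a postgresql.conf fragment for the given storage profile."""
    profile = PG_STORAGE_PROFILES[storage]
    lines = [f"# storage profile: {storage}"]
    lines += [f"{name} = {value}" for name, value in profile.items()]
    # full_page_writes stays on for every backend in the matrix.
    lines.append("full_page_writes = on")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_pg_conf("fc_san"))
```
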
### Storage layout by backend type
**Local NVMe (small/medium):**
```
Mount point  FS    RAID        Disk          Purpose
/            ext4  1 (mirror)  2× SATA SSD   OS
/data        xfs   10          4-8× NVMe     Data
/wal         xfs   1 (mirror)  2× NVMe       WAL (PG)
```
**FC SAN (enterprise):**
```
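Mount point / device  FS    Backing storage         Purpose
/                     ext4  local RAID 1 (2× SSD)   OS
/dev/sdb              xfs   FC LUN 1 (500 GB)       WAL (PG)
/dev/sdc              xfs   FC LUN 2 (2 TB)         Data
/dev/sdd              xfs   FC LUN 3 (2 TB)         Indexes (separate)
# With multipath, address the LUNs via /dev/mapper/mpath* instead of raw /dev/sdX paths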
```
**Ceph RBD:**
```
Mount point / device  FS    Backing storage         Purpose
/                     ext4  local RAID 1 (2× SSD)   OS
/dev/rbd0             xfs   rbd datastore-01        Data + WAL (Ceph RBD)
```
### Kernel tuning by variant
**Local NVMe:**
```
vm.dirty_ratio = 30
vm.dirty_background_ratio = 5
```
**FC SAN:**
```
# SAN storage: higher latency, so keep the dirty-page backlog smaller and start flushing earlier
vm.dirty_ratio = 20
vm.dirty_background_ratio = 3
# Expire dirty pages after 30 s (3000 is also the kernel default)
vm.dirty_expire_centisecs = 3000
```
**Ceph RBD:**
```
# Ceph RBD — network storage, optimize for RBD cache
vm.dirty_ratio = 15
vm.dirty_background_ratio = 2
# RBD cache settings (ceph.conf, librbd clients only; the kernel rbd driver uses the page cache instead)
# rbd cache = true (client-side)
# rbd cache size = 256-512 MB
```
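
To get a feel for what these percentages mean in absolute terms, the sketch below converts them to GiB. The helper name `dirty_thresholds_gib` is hypothetical, and the input is the memory the kernel counts as available (free plus reclaimable pages), which is an assumption you have to estimate yourself; it is lower than installed RAM.

```python
def dirty_thresholds_gib(available_gib, dirty_ratio, dirty_background_ratio):
    """Return (background, blocking) dirty-page thresholds in GiB.

    background: background writeback starts above this amount of dirty data.
    blocking:   writing processes are throttled above this amount.
    """
    if not 0 <= dirty_background_ratio < dirty_ratio <= 100:
        raise ValueError("expected 0 <= background < dirty_ratio <= 100")
    background = available_gib * dirty_background_ratio / 100
    blocking = available_gib * dirty_ratio / 100
    return background, blocking


if __name__ == "__main__":
    # The three variants above, each assuming 256 GiB of available memory.
    for name, ratio, bg in [("local NVMe", 30, 5), ("FC SAN", 20, 3), ("Ceph RBD", 15, 2)]:
        low, high = dirty_thresholds_gib(256, ratio, bg)
        print(f"{name}: background at {low:.1f} GiB, blocking at {high:.1f} GiB")
```
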
### Database-specific tuning
| Parameter | PostgreSQL | MySQL | Oracle | MongoDB |
|----------|-----------|-------|--------|---------|
| **Cache** | `shared_buffers` 25 % RAM | `innodb_buffer_pool_size` 70-80 % RAM | `SGA_TARGET` 60-80 % RAM | `WiredTiger cache` 50-80 % RAM |
| **OS cache** | `effective_cache_size` 75 % RAM | OS cache + InnoDB | OS cache (double buffering risk with large SGA) | OS cache |
| **Write buffer** | `wal_buffers` 64-256 MB | `innodb_log_file_size` 1-4 GB | Redo log (2-4 groups, 200 MB-4 GB) | WiredTiger log |
| **Connections** | `max_connections` 50-500 | `max_connections` 100-500 | `processes` 200-2000 | maxIncomingConnections |
| **I/O** | `effective_io_concurrency` 200 | `innodb_io_capacity` 2000 | `db_file_multiblock_read_count` 128 | WiredTiger eviction |
| **Huge pages** | `huge_pages = try` | `large-pages = ON` | `use_large_pages = only` (mandatory) | `transparent_hugepage=never` (kernel boot parameter) |
| **Parallel query** | `max_parallel_workers` 4-8 | `innodb_parallel_read_threads` 4 | `parallel_degree_policy = auto` — up to 64 | — |
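
The cache percentages in the first row of the table can be applied to a concrete host as follows. This is a sketch only: `cache_sizing_gib` is a hypothetical helper, the dictionary keys are illustrative labels rather than exact parameter syntax, and the percentages are taken unchanged from the table.

```python
def cache_sizing_gib(ram_gib):
    """Apply the cache percentages from the tuning table to a host with ram_gib of RAM.

    Single values are fixed rules of thumb; tuples are (low, high) ranges.
    """
    def pct(percent):
        return round(ram_gib * percent / 100, 1)

    return {
        "postgresql": {"shared_buffers": pct(25), "effective_cache_size": pct(75)},
        "mysql": {"innodb_buffer_pool_size": (pct(70), pct(80))},
        "oracle": {"sga_target": (pct(60), pct(80))},
        "mongodb": {"wiredtiger_cache": (pct(50), pct(80))},
    }


if __name__ == "__main__":
    for engine, params in cache_sizing_gib(256).items():
        print(engine, params)
```
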
### Connectivity by variant
| Variant | App network | Storage network | Replication | Management |
|----------|---------|-------------|-----------|------------|
| **Local (small)** | 2× 25 GbE LACP | — | 2× 25 GbE (same) | iDRAC/iLO |
| **Local (medium)** | 2× 25 GbE LACP | — | 2× 25 GbE dedicated | iDRAC/iLO |
| **FC SAN** | 2× 25/100 GbE | 2× 32/64 Gb FC (multipath) | FC replication | iDRAC/iLO + SAN mgmt |
| **Ceph** | 2× 25/100 GbE | 2× 25/100 GbE (public net) | 2× 25/100 GbE (cluster net) | iDRAC/iLO + Ceph mgmt |
| **Cloud** | Elastic IP / Private Link | — | — | AWS Console / API |
| **Oracle Standalone** | 2× 25 GbE LACP | ASM (2× 25 GbE or FC 32G) | Data Guard 2× 25 GbE | iLO + ASM mgmt |
| **Oracle RAC** | 2-4× 25/100 GbE | 2× 64 Gb FC (multipath) | Cache Fusion interconnect | iLO + SAN mgmt |
| **Oracle Exadata** | 4-8× 100 GbE RoCE | NVMe over Fabric | RDMA interconnect | Exadata CLI + OEDA |
### Oracle-specific configuration
#### Oracle ASM — diskgroup layout
Oracle ASM (Automatic Storage Management) replaces traditional filesystem + volume manager:
| Diskgroup | Redundancy | Disks | Purpose |
|-----------|-----------|-------|-------|
| **DATA** | Normal (2× mirror) | 4-12× FC LUN/NVMe | Data files, temp files, control files |
| **FRA** (Fast Recovery Area) | Normal (2× mirror) | 2-6× FC LUN/NVMe | Archive logs, backup, flashback logs |
| **REDO** | High (3× mirror) | 2-4× FC LUN/NVMe | Online redo log groups (I/O critical) |
| **SPFILE** | Normal | 2× small LUN | Server parameter file |
**ASM striping**: Coarse (1 MB) for regular data, Fine (128 KB) for redo logs (lower write latency).
#### Variant O1: Standalone Oracle (small/medium, single instance)
| Parameter | Small (< 500 users) | Medium (500-2000 users) |
|----------|---------------------|------------------------|
| **CPU** | 1-2× EPYC 9124-9224 / Xeon 4410Y (8-16C) | 2× EPYC 9334-9374F / Xeon 5418Y (16-24C) |
| **RAM (SGA + PGA)** | 64-128 GB (SGA 70 %, PGA 30 %) | 128-512 GB (SGA 60-80 %, PGA 20-40 %) |
| **Huge pages** | Yes (vm.nr_hugepages) — mandatory for SGA | Yes |
| **OS disk** | 2× SATA SSD RAID 1 (240 GB) | 2× NVMe RAID 1 (480 GB) |
| **DATA + FRA** | 4-6× NVMe, ASM normal redundancy | 6-8× NVMe or FC LUN, ASM normal |
| **REDO** | 2-4× NVMe (separate from DATA), ASM high | 4× FC LUN (separate), ASM high |
| **Archive log** | Local FRA | FC LUN (FRA diskgroup) |
| **Network (app)** | 2× 25 GbE LACP | 2-4× 25/100 GbE LACP |
| **Network (storage)** | — (local NVMe) | 2× FC 32G multipath |
| **Network (Data Guard)** | — | 2× 25 GbE dedicated |
| **DB version** | Oracle SE2 (max 16 threads) | Oracle EE (unlimited) |
**Use case**: Dev/test, small production DBs, branch offices. SE2 license = max 16 CPU threads, limited parallel execution.
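
For the huge pages row above, `vm.nr_hugepages` can be derived from the planned SGA size. The sketch assumes the default 2 MiB huge page size on x86-64; `nr_hugepages` and its `headroom_pages` parameter are hypothetical names, and real deployments usually add a little headroom on top.

```python
import math


def nr_hugepages(sga_gib, hugepage_kib=2048, headroom_pages=0):
    """Return the vm.nr_hugepages value needed to back an SGA of sga_gib GiB."""
    if sga_gib <= 0:
        raise ValueError("sga_gib must be positive")
    sga_kib = sga_gib * 1024 * 1024
    return math.ceil(sga_kib / hugepage_kib) + headroom_pages


if __name__ == "__main__":
    # Small variant: 128 GB RAM with 70 % SGA, backed by 2 MiB pages.
    sga = 128 * 0.70
    print(f"SGA {sga:.1f} GiB -> vm.nr_hugepages = {nr_hugepages(sga)}")
```
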
#### Variant O2: Oracle Data Guard (medium/large, HA + DR)
Primary + standby in active-passive mode, Active Data Guard possible for reporting.
| Parameter | Recommendation |
|----------|-----------|
| **CPU** | 2× EPYC 9654-9965 / Xeon 8592+ (32-64C) |
| **RAM** | 256-1024 GB (SGA 60-80 %, PGA 20-40 %) |
| **Huge pages** | Yes (50-80 % RAM allocated for SGA) |
| **OS disk** | 2× NVMe RAID 1 (480 GB) |
| **Storage** | FC SAN LUN (DATA + FRA + REDO separate) or NVMe + ASM |
| **HBA** | 2× dual-port FC 32/64 Gb (multipath active-active) |
| **App network** | 2-4× 25/100 GbE LACP |
| **Storage network** | 2× FC 32/64 Gb multipath |
| **Data Guard network** | 2× 25/100 GbE dedicated (sync or async) |
| **Data Guard mode** | Maximum Availability (sync, fallback to async) — RPO = 0 |
| **Topology** | 1 primary + 1-2 standby (physical), far sync for geo-DR |
| **Active Data Guard** | Standby open for read (reporting, backup) — requires ADG license |
**Data Guard latency**:
```text
Synchronous (Maximum Availability):
Primary COMMIT → LGWR flush REDO → sync over network → Standby LGWR → ACK → ~1-5 ms
RPO = 0, impact on write latency
Asynchronous (Maximum Performance):
Primary COMMIT → LGWR flush REDO → async to standby buffer → ~0.1-1 ms
RPO = a few seconds, negligible write impact
```
**Network requirements for Data Guard sync**:
- RTT < 2 ms for synchronous mode (recommended < 1 ms)
- Min. 10 GbE, recommended 25 GbE (throughput = REDO rate × 2)
- REDO rate: OLTP ~50-500 MB/s, batch ~500-2000 MB/s
- At REDO rate 500 MB/s (4 Gb/s) on 25 GbE → ~16 % link utilization, or ~32 % with the ×2 headroom
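
The link sizing in the list above is a unit conversion: a redo rate in MB/s times 8 gives Mb/s, which is then compared with the link speed. A minimal sketch, where `link_utilization` is a hypothetical helper, MB means 10^6 bytes, and protocol overhead is ignored:

```python
def link_utilization(redo_mb_per_s, link_gbit, headroom=1.0):
    """Return the fraction of a link consumed by a given redo rate.

    headroom multiplies the redo rate, e.g. 2.0 for the "REDO rate x 2" rule.
    """
    if link_gbit <= 0:
        raise ValueError("link_gbit must be positive")
    gbit_needed = redo_mb_per_s * 8 / 1000 * headroom
    return gbit_needed / link_gbit


if __name__ == "__main__":
    print(f"500 MB/s on 25 GbE: {link_utilization(500, 25):.0%}")
    print(f"500 MB/s on 25 GbE with 2x headroom: {link_utilization(500, 25, 2.0):.0%}")
    print(f"2000 MB/s on 10 GbE: {link_utilization(2000, 10):.0%}")
```

A result above 100 %, as in the last call, means the link cannot carry that redo rate at all, which is why batch-heavy systems need more than the 10 GbE minimum.
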
#### Variant O3: Oracle RAC (large, enterprise)
Multi-instance cluster with shared storage and Cache Fusion.
| Parameter | Recommendation |
|----------|-----------|
| **Number of nodes** | 2-4 (typical), max 64 (RAC cluster) |
| **CPU per node** | 2× EPYC 9654-9965 / Xeon 8592+ (32-64C) |
| **RAM per node** | 512-2048 GB (SGA 60-80 %, PGA 20-40 %) |
| **Huge pages** | Yes (1 GB pages if RAM > 512 GB) |
| **Storage** | FC SAN — shared LUNs (ASM normal/high redundancy) |
| **HBA** | 2× dual-port FC 64 Gb (multipath, active-active) |
| **App network** | 2-4× 25/100 GbE LACP (VIP, SCAN listener) |
| **Storage network** | 2-4× FC 64 Gb (multipath per node) |
| **Cache Fusion interconnect** | 2× 100 GbE (RoCE v2 or InfiniBand) — dedicated |
| **RAC interconnect latency** | < 5 µs (recommended), max < 10 µs |
| **ASM** | Normal redundancy (2-way mirror) |
| **Oracle Clusterware** | Voting disk (3× 1 GB LUN), OCR (3× 500 MB LUN) |
| **Service** | OLTP_service, REPORT_service, BATCH_service |
**Cache Fusion — critical interconnect**:
```
Node A (DB instance) ←──→ Node B (DB instance)
│ │
└──────── ASM ───────────┘
FC SAN (shared storage)
Cache Fusion traffic: dirty block transfer between instances
→ Latency < 5 µs, otherwise RAC scaling degrades
→ Capacity: 2× 100 GbE, dedicated switch or InfiniBand HDR100
→ Recommended MTU: 9000 (jumbo frames)
```
**RAC sizing by transaction count**:
| TPS | Nodes | CPU per node | RAM per node | Interconnect |
|-----|------|-------------|-------------|-------------|
| < 10 000 | 2 | 16-24C | 256 GB | 2× 25 GbE |
| 10 000 - 50 000 | 2-4 | 32-48C | 512 GB | 2× 100 GbE RoCE |
| 50 000 - 200 000 | 4-8 | 48-64C | 1024 GB | 2× 100 GbE RoCE / InfiniBand |
| > 200 000 | 8+ | 64-128C | 2048 GB | InfiniBand HDR100/HDR200 |
**RAC sizing — license cost calculation**:
```text
Example: 4-node RAC, each node 2× EPYC 9654 (96C) = 192 cores per node
Core factor 0.5 → 96 Oracle licenses per node
4 × 96 = 384 Oracle EE licenses
At ~$47.5k/license → ~$18.2M (licenses only, without 22 % annual support)
```
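
The same calculation can be parameterized to compare node counts or to include support over several years. This is a sketch under the figures quoted above (0.5 core factor, ~$47.5k list price, 22 % annual support); `rac_license_cost` is a hypothetical helper, and it covers Enterprise Edition licenses only, not options or negotiated discounts.

```python
import math


def rac_license_cost(nodes, sockets_per_node, cores_per_socket, core_factor=0.5,
                     list_price_usd=47_500, support_rate=0.22, support_years=0):
    """Return core count, license count and cost for a cluster."""
    cores = nodes * sockets_per_node * cores_per_socket
    licenses = math.ceil(cores * core_factor)
    license_cost = licenses * list_price_usd
    # Support is charged per year as a fraction of the license cost.
    support_cost = round(license_cost * support_rate * support_years)
    return {
        "cores": cores,
        "licenses": licenses,
        "license_cost_usd": license_cost,
        "support_cost_usd": support_cost,
        "total_usd": license_cost + support_cost,
    }


if __name__ == "__main__":
    print(rac_license_cost(4, 2, 96))                   # licenses only
    print(rac_license_cost(4, 2, 96, support_years=3))  # plus three years of support
```
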
#### Variant O4: Oracle Exadata (hyperscale)
Engineered system — optimal for hybrid workload (OLTP + DW).
| Parameter | X9M / X10M | Use case |
|----------|-----------|----------|
| **Database servers** | 2-8× (Xeon on X9M, AMD EPYC on X10M; 1.5-6 TB RAM, NVMe) | Compute |
| **Storage servers** | 3-18× (NVMe + HDD, Smart Scan) | Predicate offloading |
| **Smart Scan** | Filtering at storage layer | Less data over network, higher throughput |
| **RoCE interconnect** | 100 GbE (RDMA) | Low latency, high bandwidth |
| **In-Memory Column Store** | Optional license | Real-time analytics without ETL |
| **HCC (Hybrid Columnar Compression)** | Compression in storage servers | Up to 10-15× compression for DW |
| **Rack power** | ~15-30 kW (full rack) | Higher density |
**When to choose Exadata over standalone RAC**:
- OLTP > 50 000 TPS
- Consolidation needed (multiple DBs on one cluster)
- Smart Scan significantly accelerates reporting on production data
- HCC for storage savings on DW workloads
---
## 2. Hypervisor host (ESXi / KVM / Hyper-V)
### Configuration by size and storage type
#### Variant A: Small company — local storage (2-3 hosts)
| Component | Recommendation | Note |
|-----------|-----------|----------|
| **CPU** | 1× EPYC 9224/9254 or Xeon 4410Y/5418Y (12-24C) | 1 socket, enough cores for VM density |
| **RAM** | 128-256 GB (4-8 GB/core) | DDR5, 1DPC |
| **OS disk** | 2× SATA SSD RAID 1 (2× 240-480 GB) | ESXi / Proxmox / Hyper-V boot |
| **VM storage** | 4-6× SATA/SAS SSD, RAID 5/6 or 10 | Local RAID, 4-12 TB usable |
| **Network** | 2-4× 10/25 GbE (LACP) | Shared for everything (management + VM + storage) |
| **Hypervisor** | VMware vSphere Standard / Proxmox VE / Hyper-V | Basic license, no enterprise features |
| **Storage backend** | Local RAID controller (PERC H755, Broadcom 9560) | HW RAID with cache, write-back |
| **HA** | VMware HA / Proxmox HA | Restart VM on another host on failure |
| **Backup** | Veeam B&R Free / PBS (Proxmox Backup Server) | Local or USB disk |
**Use case**: Small office, branch office, dev/test, < 10 VMs, low budget, simple management.
**Limitations**: No vMotion without shared storage, outage during host failure (HA restart, not seamless).
#### Variant B: Medium company — vSAN / Ceph (3-6 hosts)
| Component | Recommendation | Note |
|-----------|-----------|----------|
| **CPU** | 1-2× EPYC 9334/9654 or Xeon 5418Y/8592+ (16-32C) | 1-2 socket |
| **RAM** | 256-512 GB (4-8 GB/core) | DDR5, 2DPC (minimal penalty) |
| **OS disk** | 2× SATA SSD RAID 1 or 2× M.2 (BOSS-S1 SATA / BOSS-N1 NVMe) | Separate from VM storage |
| **Cache tier** | 1-2× NVMe (vSAN caching / Ceph WAL+DB) | For write performance |
| **Capacity tier** | 4-8× SATA/SAS SSD or HDD (vSAN capacity / Ceph OSD) | HDD for capacity, SSD for performance |
| **Network** | 4× 25/100 GbE — 2× VM + mgmt, 2× storage (vSAN/Ceph) | Separate storage network, RDMA (RoCE v2) |
| **Hypervisor** | VMware vSAN / Proxmox Ceph / StarWind HCI | HCI license (vSAN ~$2.5k/Core) |
| **Storage backend** | vSAN OSA/ESA or Ceph (RADOS) | Distributed storage, auto-rebalance |
| **HA** | vSphere HA + vSAN / Proxmox HA + Ceph | vMotion, DRS, automated failover |
| **Failover** | N+1 (one host as reserve) | vSAN RAID 1 needs min. 3 hosts; RAID 5 needs min. 4 on OSA (min. 3 on ESA) |
**Pure Ceph variant (Proxmox / OpenStack)**:
```
Proxmox node (3-6×):
├── CPU: 1× EPYC 9224-9334 (12-24C)
├── RAM: 128-256 GB
├── OS: 2× SATA SSD RAID 1
├── Ceph OSD: 4-8× NVMe/SATA SSD (RAW, HBA mode)
├── Network: 2× 25 GbE (public) + 2× 25 GbE (cluster)
└── Storage: Ceph 3× replication, CRUSH host failure domain
```
**VMware vSAN variant (4-6 hosts)**:
```
vSAN node (4-6×):
├── CPU: 1-2× EPYC/Xeon (16-32C)
├── RAM: 256-512 GB
├── OS: 2× M.2 (BOSS-S1 SATA or BOSS-N1 NVMe) or SD card (deprecated)
├── vSAN cache: 1-2× NVMe (write buffer, OSA only; ESA has no cache tier)
├── vSAN capacity: 4-8× NVMe (vSAN ESA) or SATA/SAS SSD or HDD (vSAN OSA)
├── Network: 2× 25/100 GbE (VM) + 2× 25 GbE (vSAN)
└── Storage: vSAN ESA (all-NVMe) or OSA (hybrid or all-flash)
```
**Use case**: SME, enterprise division, 10-100 VMs, need for vMotion, DRS, HA, simple storage management.
#### Variant C: Large company — FC SAN (6+ hosts)
| Component | Recommendation | Note |
|-----------|-----------|----------|
| **CPU** | 2× EPYC 9654/9965 or Xeon 8592+/6980P (32-64C) | 2 socket, max VM density |
| **RAM** | 512 GB - 2 TB (4-8 GB/core) | DDR5, 2DPC |
| **OS disk** | 2× SATA SSD RAID 1 or SD card (vSphere) | Boot, image storage |
| **VM storage** | LUNs from FC SAN — VMFS / NFS datastores | Hitachi, Dell, Pure, HPE storage |
| **HBA** | 2× dual-port FC HBA 32/64 Gb | Multipath, FC-NVMe |
| **Network** | 4-8× 25/100 GbE — split by traffic type | Management, VM, vMotion, FT separated |
| **Hypervisor** | VMware vSphere Enterprise+ / Hyper-V DC | Enterprise license, DRS, HA, FT |
| **Storage backend** | FC SAN: VMFS 6 datastores, VVols | Thin provisioning, storage DRS, array snapshots |
| **HA** | vSphere HA + DRS + vCenter | vMotion, DRS, FT, SRM for DR |
| **Failover** | N+1 or admission control (CPU/RAM reserve) | Reserved capacity for HA failover |
**Use case**: Enterprise, 100+ VMs, mix of DB and applications, centralized storage management, enterprise SLA.
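
The N+1 failover rule from the table translates into capacity planning as follows. A minimal sketch: `usable_cluster_ram_gib` and `max_vms` are hypothetical helpers, only RAM is considered, and memory overcommit and hypervisor overhead are ignored.

```python
def usable_cluster_ram_gib(hosts, ram_per_host_gib, reserve_hosts=1):
    """Return RAM available to VMs when reserve_hosts are kept free for failover."""
    if hosts <= reserve_hosts:
        raise ValueError("cluster needs more hosts than it reserves")
    return (hosts - reserve_hosts) * ram_per_host_gib


def max_vms(hosts, ram_per_host_gib, vm_ram_gib, reserve_hosts=1):
    """Return how many VMs of vm_ram_gib fit without touching the failover reserve."""
    usable = usable_cluster_ram_gib(hosts, ram_per_host_gib, reserve_hosts)
    return usable // vm_ram_gib


if __name__ == "__main__":
    # Six hosts with 512 GB each, N+1, VMs sized at 16 GB.
    print(usable_cluster_ram_gib(6, 512), "GiB usable")
    print(max_vms(6, 512, 16), "VMs")
```
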
#### Variant D: Hyperscale — Ceph / SDS (20+ hosts)
| Component | Recommendation | Note |
|-----------|-----------|----------|
| **CPU** | 2× EPYC 9654/9965 (64-128C) | 2 socket, compute optimal |
| **RAM** | 512 GB - 1 TB (2-4 GB/core) | Low overcommit ratio for consistency |
| **OS disk** | 2× M.2 NVMe RAID 1 (BOSS) | Boot |
| **Network** | 4-8× 100 GbE (compute + storage) | Separate OVN/OVS for SDN, VXLAN tunneling |
| **Hypervisor** | OpenStack (Nova) / OpenShift (KubeVirt) | Open source, API-driven, multi-tenant |
| **Storage backend** | Ceph (RADOS, RBD, RGW, CephFS) | Unified storage, erasure coding (8+3) |
| **Orchestration** | OpenStack / Kubernetes | Infrastructure-as-Code, autoscaling |
| **HA** | OpenStack HA / Kubernetes HA | Self-healing, auto-rebalance |
**Use case**: Cloud provider, hyperscale, 500+ VMs, multi-tenant, maximum automation.
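
The capacity difference between 3× replication and 8+3 erasure coding can be estimated with the sketch below. The function names are hypothetical, and the optional `fill_ratio` is an assumption for how full you are willing to run the cluster (Ceph warns at 85 % by default); it is not part of the recommendations above.

```python
def replicated_usable_tb(raw_tb, replicas=3, fill_ratio=1.0):
    """Usable capacity of a replicated pool: raw capacity divided by replica count."""
    return raw_tb / replicas * fill_ratio


def erasure_coded_usable_tb(raw_tb, k=8, m=3, fill_ratio=1.0):
    """Usable capacity of an erasure-coded pool with k data and m coding chunks."""
    return raw_tb * k / (k + m) * fill_ratio


if __name__ == "__main__":
    raw = 1000  # TB of raw OSD capacity
    print(f"3x replication: {replicated_usable_tb(raw, fill_ratio=0.85):.0f} TB")
    print(f"EC 8+3:         {erasure_coded_usable_tb(raw, fill_ratio=0.85):.0f} TB")
```

Erasure coding yields roughly twice the usable capacity of 3× replication here, at the cost of more CPU and slower recovery.
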
### Hypervisor variant comparison
| Aspect | Local (small) | vSAN/Ceph (medium) | FC SAN (large) | Ceph hyperscale |
|--------|---------------|---------------------|----------------|-----------------|
| **Storage** | Local RAID | vSAN / Ceph (HCI) | FC SAN (centralized) | Ceph (distributed) |
| **Number of hosts** | 2-3 | 3-6 | 6-50+ | 20+ |
| **VM latency** | ~10 µs (local) | ~100-500 µs | ~200 µs (SAN) | ~500-2000 µs |
| **CAPEX/host** | Low | Medium | High | Medium |
| **CAPEX storage** | Low | None (part of hosts) | High (SAN array) | None (part of hosts) |
| **Management** | Simple (per host) | vCenter / Proxmox | vCenter + SAN mgmt | OpenStack / K8s |
| **vMotion** | No (no shared storage) | Yes (vSAN / Ceph RBD) | Yes (FC LUN) | Yes (Ceph RBD) |
| **DRS** | No | Yes (vSphere) | Yes (vSphere) | OpenStack scheduler |
| **Scaling** | Vertical | Horizontal (add host) | Horizontal (host + SAN) | Horizontal |
### Network design by variant
#### Small (local storage)
| Traffic | VLAN | Speed | Teaming | Note |
|---------|------|----------|---------|----------|
| Management | Mgmt | 1 GbE | Active/Passive | Dedicated port (iLO/iDRAC) |
| VM + Storage | All | 2-4× 10/25 GbE | LACP | Shared, VLAN tagging |
```
┌──────────────────────────────────────────┐
│                   Host                   │
│ ┌───────┐ ┌────────────────────────────┐ │
│ │ iLO   │ │ NIC1    NIC2               │ │
│ │ 1 GbE │ │ [LACP] 25 GbE              │ │
│ └───────┘ └──────────┬─────────────────┘ │
└──────────────────────┼───────────────────┘
                 ┌─────┴─────┐
                 │  Switch   │
                 └───────────┘
```
#### Medium (vSAN / Ceph)
| Traffic | VLAN | Speed | Teaming | Note |
|---------|------|----------|---------|----------|
| Management | Mgmt | 1 GbE | Active/Passive | Dedicated iLO/iDRAC |
| VM | VM | 2× 25/100 GbE | LACP | VM traffic, migration |
| Storage | vSAN/Ceph | 2× 25/100 GbE | LACP or RDMA | Separate, Jumbo frames (MTU 9000) |
```
┌─────────────────────────────────────────────┐
│                    Host                     │
│ ┌───────┐ ┌────────────┐ ┌────────────────┐ │
│ │ iLO   │ │ NIC1  NIC2 │ │ NIC3  NIC4     │ │
│ │ 1 GbE │ │ VM traffic │ │ Storage (vSAN) │ │
│ └───────┘ └────────────┘ └────────────────┘ │
└─────────────────────────────────────────────┘
```
#### Large (FC SAN)
| Traffic | VLAN | Speed | Teaming | Note |
|---------|------|----------|---------|----------|
| Management | Mgmt | 1 GbE | Active/Passive | Dedicated |
| VM | VM | 2-4× 25/100 GbE | LACP | VM traffic |
| vMotion | vMotion | 2× 25 GbE | Dedicated | Multi-NIC vMotion |
| FT | FT | 2× 10/25 GbE | Dedicated | Low latency |
| Storage | — | 2× 32/64 Gb FC | Multipath | FC SAN |
```
┌───────────────────────────────────────────────┐
│                     Host                      │
│ ┌───────┐ ┌───────────────┐ ┌──────┐ ┌──────┐ │
│ │ iLO   │ │ NIC1-4        │ │ HBA1 │ │ HBA2 │ │
│ │ 1 GbE │ │ VM+vMotion+FT │ │ 32Gb │ │ 32Gb │ │
│ └───────┘ └───────┬───────┘ └──┬───┘ └──┬───┘ │
└───────────────────┼────────────┼────────┼─────┘
                    │            │        │
             ┌──────┴─────┐   ┌──┴────────┴──┐
             │  Ethernet  │   │  FC Switch   │
             │   Switch   │   │  (Brocade/   │
             │            │   │   Cisco)     │
             └────────────┘   └──────────────┘
```
### BIOS for hypervisor — all variants
| Setting | Value | Rationale |
|-----------|---------|------------|
| Hyper-Threading | Enabled | Higher VM density |
| Virtualization Technology | Enabled | VT-x/AMD-V |
| VT-d / IOMMU | Enabled | Passthrough, SR-IOV |
| Power Management | Performance / OS | Minimize VM exit latency |
| C-States | Disabled | Lower VM exit latency (important for real-time VMs) |
| NUMA | Enabled | NUMA-aware VM placement |
| SR-IOV | Enabled | NIC/GPU virtualization |
| Adjacent Sector Prefetch | Enabled (Intel) | Better sequential reads |
| DCU Streamer / IP Prefetcher | Enabled | HW prefetch for VM workload |
| Patrol Scrub | Disabled (vSAN/Ceph) | Can cause latency spikes with SDS |
### Hypervisor selection by variant
| Criterion | VMware vSphere | Proxmox VE | Hyper-V | OpenStack |
|-----------|---------------|------------|---------|-----------|
| **Size** | SME - Enterprise | SME | SME - Enterprise | Hyperscale |
| **Storage** | vSAN, SAN, NFS | Ceph, ZFS, NFS | Storage Spaces, SAN | Ceph, manila |
| **License** | ~$1-5k/core | Free (support ~$500/host) | Part of Windows Server | Open source |
| **Familiarity** | Highest | Medium | Windows admin | Low |
| **Automation** | Terraform, Ansible, PowerCLI | Ansible, Terraform, PBS | PowerShell, SCVMM | Terraform, Heat, Ansible |
| **Ecosystem** | Broadest (Veeam, Zerto, SRM) | Growing (PBS, remote migration) | Windows ecosystem | Open source (Kolla, TripleO) |
---
## 3. Kubernetes node
### Node profiles
| Role | CPU | RAM | Storage | Network | Use case |
|------|-----|-----|---------|---------|----------|
| **General purpose** | 16-32 cores | 64-128 GB | 1× NVMe OS + 1× NVMe local | 2× 25 GbE | Web, API, microservices |
| **Memory optimized** | 32-64 cores | 256-512 GB | 1× NVMe OS + 2× NVMe local | 2× 25 GbE | In-memory cache, DB |
| **Compute optimized** | 64-128 cores | 128-256 GB | 1× NVMe OS | not specified | Batch, CI/CD |
| **GPU node** | 32-64 cores | 512-1024 GB | 1× NVMe OS + 4-8× NVMe local | 4× 100 GbE | AI/ML training, inference |
| **Storage node** | 16-32 cores | 64-128 GB | 4-12× NVMe/SATA (Ceph/Longhorn) | 4× 25 GbE | SDS, persistent volumes |
### Kernel tuning
```
# /etc/sysctl.d/99-kubernetes.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
net.ipv4.conf.all.forwarding = 1
# Connection tracking (for NodePort, Service)
net.netfilter.nf_conntrack_max = 2097152
net.netfilter.nf_conntrack_tcp_timeout_established = 86400
# File watchers (for kubelet, containerd)
fs.inotify.max_user_instances = 8192
fs.inotify.max_user_watches = 524288
# Memory management
vm.swappiness = 0
# Allow overcommit (CRI-O, containerd)
vm.overcommit_memory = 1
vm.panic_on_oom = 0
kernel.panic = 10
kernel.panic_on_oops = 1
```
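
A fragment like the one above can be checked before rollout. sysctl.d files treat only whole lines starting with `#` or `;` as comments, so a trailing comment on a setting line ends up inside the value. The sketch below parses such a file, flags those values, and reports drift against a desired set; all three function names are hypothetical, and it works on text you pass in rather than reading `/proc/sys`.

```python
def parse_sysctl_conf(text):
    """Parse sysctl.d-style text into {key: value}, skipping blank and comment lines."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def values_with_inline_comment(settings):
    """Return keys whose value still contains '#', i.e. a trailing comment read as value."""
    return sorted(key for key, value in settings.items() if "#" in value)


def drift(desired, actual):
    """Return {key: (desired, actual)} for every key that is missing or different."""
    return {k: (v, actual.get(k)) for k, v in desired.items() if actual.get(k) != v}


if __name__ == "__main__":
    sample = "net.ipv4.ip_forward = 1\nvm.swappiness = 0  # never swap\n"
    parsed = parse_sysctl_conf(sample)
    print(values_with_inline_comment(parsed))
    print(drift({"net.ipv4.ip_forward": "1", "kernel.panic": "10"}, parsed))
```
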
### Container storage
| Type | Recommendation | Note |
|-----|-----------|----------|
| **OS disk** | RAID 1 (2× NVMe) | Ext4/XFS, 100-200 GB |
| **Container runtime image** | RAID 1 (2× NVMe) | /var/lib/containerd, 200-500 GB |
| **Local PV** | Single NVMe | Raw device, no RAID |
| **Rook/Ceph OSD** | Raw NVMe/SATA | HBA/IT mode, no RAID |
| **Longhorn** | Raw NVMe/SATA | Ext4/XFS per volume |
---
## 4. Storage server (Ceph / MinIO / NAS)
### Ceph OSD node
| Component | Recommendation | Note |
|-----------|-----------|----------|
| **CPU** | 1-2 cores per OSD | Up to 12 OSD per node (24 cores) |
| **RAM** | 4-8 GB per OSD + OS | BlueStore cache, 16-64 GB min |
| **Network** | 2× 25/100 GbE | Public + Cluster network |
| **Storage** | 10-12× NVMe/SATA SSD OSD | HBA/IT mode, no RAID |
| **OS disk** | 2× SATA SSD RAID 1 | OS, Ceph MON/MGR |
**BIOS for Ceph:**
- SATA/NVMe: AHCI/NVMe mode (not RAID)
- C-States: Disabled (lower OSD latency)
- NUMA: Enabled
- Power: Performance
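
The per-OSD rules of thumb from the table (1-2 cores and 4-8 GB RAM per OSD, plus the OS) can be applied to a planned node as below. `ceph_osd_node_sizing` is a hypothetical helper and the 16 GB reserved for the OS and MON/MGR daemons is an assumption, not a figure from the table.

```python
def ceph_osd_node_sizing(osd_count, os_ram_gib=16):
    """Return (low, high) core and RAM requirements for a node with osd_count OSDs."""
    if osd_count < 1:
        raise ValueError("osd_count must be at least 1")
    return {
        # 1-2 cores per OSD
        "cores": (osd_count * 1, osd_count * 2),
        # 4-8 GB per OSD, plus a fixed reserve for the OS
        "ram_gib": (osd_count * 4 + os_ram_gib, osd_count * 8 + os_ram_gib),
    }


if __name__ == "__main__":
    print(ceph_osd_node_sizing(12))
```
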
### MinIO node
| Component | Recommendation |
|-----------|-----------|
| **CPU** | 8-16 cores (32+ for erasure coding) |
| **RAM** | 32-64 GB + 1 GB per 1 TB storage |
| **Storage** | 4-16× NVMe (direct, no RAID) |
| **Network** | 2× 25/100 GbE |
| **OS** | Ubuntu / RHEL, XFS (for data) |
### NAS (TrueNAS / FreeNAS)
- **ZFS**: RAID-Z1/Z2/Z3, compression (lz4, zstd), dedup
- **ARC cache**: 1 GB per 1 TB storage (max 64 GB)
- **L2ARC**: NVMe cache (optional, read-heavy)
- **SLOG**: NVDIMM / Optane (sync write, ZIL)
- **Network**: 2-4× 10/25 GbE LACP
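
The two memory rules of thumb in this section (MinIO: 32-64 GB plus 1 GB per TB; ZFS ARC: 1 GB per TB, capped at 64 GB) can be written down as a quick sanity check. Both function names are hypothetical, and the rules are taken as stated above rather than from vendor sizing guides.

```python
def minio_ram_gib(storage_tb, base_gib=(32, 64)):
    """Return the (low, high) RAM estimate for a MinIO node: base plus 1 GB per TB."""
    return (base_gib[0] + storage_tb, base_gib[1] + storage_tb)


def zfs_arc_gib(storage_tb, cap_gib=64):
    """Return the ARC size estimate: 1 GB per TB of storage, capped at cap_gib."""
    return min(storage_tb, cap_gib)


if __name__ == "__main__":
    print("MinIO, 100 TB:", minio_ram_gib(100), "GB RAM")
    print("ZFS, 40 TB:", zfs_arc_gib(40), "GB ARC")
    print("ZFS, 200 TB:", zfs_arc_gib(200), "GB ARC")
```
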
---
## 5. Web / API servers
| Parameter | Recommendation |
|----------|-----------|
| **CPU** | High clock, 8-32 cores |
| **RAM** | 32-128 GB |
| **Storage** | 2× NVMe RAID 1 (OS + app) |
| **OS** | Ubuntu / RHEL, optimized kernel |
| **Network** | 2× 10/25 GbE (bonding) |
**Kernel tuning:**
```
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.core.netdev_max_backlog = 65535
```
---
## Quick decision tree — server selection by workload, size and storage
```mermaid
flowchart TD
W["What workload?"] --> DB["Database"]
W --> HV["Virtualization"]
W --> K8s["Kubernetes"]
W --> AI["AI/ML"]
W --> ST["Storage server"]
W --> WEB["Web / API"]
DB --> DBS{"Company size"}
DBS -->|"< 500"| DB1["1× EPYC 8-16C, 64-256 GB<br/>NVMe RAID10, 2× 25GbE"]
DBS -->|"500-5000"| DB2{"Storage"}
DB2 -->|"Local"| DB2L["1-2× EPYC 16-24C, 128-512 GB<br/>NVMe RAID10, 4× 25GbE"]
DB2 -->|"Ceph"| DB2C["2× EPYC 16-32C, 256-512 GB<br/>RBD, 4× 25/100GbE"]
DBS -->|"Enterprise"| DB3{"Storage"}
DB3 -->|"FC SAN"| DB3F["2× EPYC 48-128C, 512-2048 GB<br/>SAN LUN + 2× FC 32/64G"]
DB3 -->|"Ceph"| DB3C["2× EPYC 32-64C, 256-512 GB<br/>RBD, 4× 100GbE"]
DBS -->|"Cloud"| DBC["RDS/Azure SQL/CloudSQL<br/>Managed, Multi-AZ"]
DB --> ORACLE{"Oracle architecture?"}
ORACLE -->|"Standalone"| ORA1["1-2× EPYC 8-24C<br/>64-512 GB, ASM local/FC<br/>2× 25GbE + FC 32G"]
ORACLE -->|"Data Guard"| ORA2["2× EPYC 32-64C<br/>256-1024 GB, FC SAN<br/>2× 25/100GbE + 2× FC 64G<br/>2× 25GbE (DG sync)"]
ORACLE -->|"RAC 2-4 nodes"| ORA3["Per node: 2× EPYC 32-64C<br/>512-2048 GB, FC SAN<br/>2× 100GbE (app)<br/>2× FC 64G (storage)<br/>2× 100GbE RoCE (interconnect)"]
ORACLE -->|"Exadata"| ORA4["Engineered system<br/>2-8 DB servers + 3-18 storage<br/>RoCE 100GbE, Smart Scan<br/>15-30 kW/rack"]
HV --> HVS{"Number of hosts"}
HVS -->|"2-3"| HV1["1× EPYC 12-24C, 128-256 GB<br/>RAID5/6 SSD, 2-4× 10/25GbE"]
HVS -->|"3-6"| HV2{"HCI"}
HV2 -->|"vSAN"| HV2V["1-2× EPYC 16-32C, 256-512 GB<br/>NVMe cache + SSD, 4× 25GbE"]
HV2 -->|"Ceph"| HV2C["1× EPYC 12-24C, 128-256 GB<br/>4-8× HBA NVMe/SSD, 4× 25GbE"]
HVS -->|"6+"| HV3["2× EPYC 32-64C, 512-2048 GB<br/>FC SAN 32/64G, 4-8× 25/100GbE"]
HVS -->|"20+"| HV4["2× EPYC 64-128C, 512-1024 GB<br/>OpenStack + Ceph, 4-8× 100GbE"]
K8s --> K8T{"Node type"}
K8T -->|"General"| K8G["16-32C, 64-128 GB<br/>2× NVMe, 2× 25GbE"]
K8T -->|"Memory"| K8M["32-64C, 256-512 GB<br/>3× NVMe, 2× 25GbE"]
K8T -->|"GPU"| K8U["32-64C, 512-1024 GB<br/>6-10× NVMe, H100/B200, 4× 100GbE"]
K8T -->|"Storage"| K8S["16-32C, 64-128 GB<br/>6-14× HBA NVMe, 4× 25GbE"]
AI --> AIT{"Purpose"}
AIT -->|"Training"| AITR["GPU H100/B200, NVLink<br/>InfiniBand 400Gb/s, liquid cooling"]
AIT -->|"Inference"| AIIR["A100/H200, MIG<br/>PCIe 5.0, 2× 100GbE"]
ST --> STT{"Type"}
STT -->|"Ceph OSD"| STC["EPYC (PCIe lanes)<br/>4-8 GB/OSD, HBA, 2× 25/100GbE"]
STT -->|"MinIO"| STM["EPYC 8-16C, 32-64 GB<br/>4-16× NVMe direct, 2× 25/100GbE"]
STT -->|"NAS (ZFS)"| STN["EPYC 16-32C, 64-128 GB<br/>RAID-Z, SLOG NVMe, 2-4× 10/25GbE"]
WEB --> WEBE["EPYC high clock, 8-32C<br/>32-128 GB, 2× NVMe RAID1, 2× 10/25GbE"]
```
### Connectivity summary by platform
| Platform | App / VM network | Storage network | Replication / Cluster | Management |
|-----------|-------------|-------------|---------------------|------------|
| **DB local (small)** | 2× 25 GbE LACP | — | 2× 25 GbE (shared) | 1× 1 GbE (iLO) |
| **DB local (medium)** | 2× 25/100 GbE LACP | — | 2× 25 GbE dedicated | 1× 1 GbE (iLO) |
| **DB FC SAN** | 2× 25/100 GbE LACP | 2× 32/64 Gb FC multipath | FC replication | 1× 1 GbE (iLO) + SAN mgmt |
| **DB Ceph** | 2× 25/100 GbE | 2× 25/100 GbE (Ceph public) | 2× 25/100 GbE (Ceph cluster) | 1× 1 GbE (iLO) |
| **Hypervisor local** | 2-4× 10/25 GbE LACP | — (local) | — | 1× 1 GbE (iLO) |
| **Hypervisor vSAN** | 2× 25/100 GbE LACP | 2× 25/100 GbE (vSAN) | vSAN traffic | 1× 1 GbE (iLO) |
| **Hypervisor FC SAN** | 2-4× 25/100 GbE LACP | 2× 32/64 Gb FC multipath | 2× 25 GbE (vMotion) | 1× 1 GbE (iLO) |
| **Hypervisor Ceph** | 2× 25/100 GbE LACP | 2× 25/100 GbE (Ceph) | 2× 25 GbE (migration) | 1× 1 GbE (iLO) |
| **Kubernetes** | 2× 25/100 GbE | 2× 25/100 GbE (Ceph/Longhorn) | 2× 25/100 GbE (K8s cluster) | 1× 1 GbE (BMC) |
| **Web/API** | 2× 10/25 GbE LACP | — | — | 1× 1 GbE (BMC) |
| **Oracle Standalone** | 2× 25 GbE LACP | 2× FC 32G or NVMe local | Data Guard 2× 25 GbE | 1× 1 GbE (iLO) + ASM mgmt |
| **Oracle Data Guard** | 2× 25/100 GbE LACP | 2× FC 64G multipath | 2× 25 GbE (DG sync) | 1× 1 GbE (iLO) + SAN mgmt |
| **Oracle RAC** | 2× 100 GbE LACP (VIP/SCAN) | 2× FC 64G multipath | 2× 100 GbE RoCE (Cache Fusion) | 1× 1 GbE (iLO) + Clusterware |
| **Oracle Exadata** | 4-8× 100 GbE RoCE | NVMe over Fabric | RDMA interconnect | Exadata CLI + OEDA |
## Sources
Links, books and standards: [sources/infrastructure/sources.en.md](sources/infrastructure/sources.en.md)
*Last revision: 2026-06-03*