Stanislav Hubacek 3fa11ef0f6 comiiit

2026-06-11 15:27:28 +02:00

⚙️ Server configuration — best practices by workload

General BIOS/UEFI settings

Setting	Recommendation	Rationale
Boot mode	UEFI	Secure Boot, GPT, larger disks
Power profile	Performance / OS Control	Max performance, C-States disabled
Hyper-Threading	Enabled	+30-50 % throughput for multi-threaded workloads
Virtualization	Enabled (VT-x/AMD-V)	Required for hypervisor, containers
SR-IOV	Enabled	GPU, NIC passthrough
NUMA	Enabled	NUMA-aware scheduling
ACPI	Enabled	Power management, OS-level
Secure Boot	Enabled	Secure boot chain
TPM	Enabled	Measured boot, key storage

1. Database servers

CPU Selection

DB type	CPU preference	Rationale
OLTP (PostgreSQL, MySQL)	High clock, moderate cores	Low latency per transaction, limited parallelism
OLAP (ClickHouse, Snowflake)	Many cores, AVX-512	Columnstore, high parallelism
In-memory (Redis, Memcached)	High clock, low cache latency	Single-threaded (Redis), RAM bandwidth
Document (MongoDB)	Balance (clock × cores)	Mixed workload
Distributed (Cassandra, Scylla)	Many cores, high cache	Shard-per-core (Scylla), compaction
Oracle OLTP	High clock, moderate cores, core-factor aware	CPU license cost (core factor 0.5 for AMD EPYC and Intel Xeon)
Oracle OLAP / DW	Many cores, large SGA, in-memory option	Parallel query, Exadata Smart Scan, compression

Oracle CPU licensing — core factor

Oracle licenses the database per physical core, multiplied by a core factor that depends on the processor type. A factor of 0.5 means that 2 physical cores require 1 Oracle processor license.

Processor	Core factor	64 physical cores → Oracle licenses
AMD EPYC (all series)	0.5	32
Intel Xeon (Scalable)	0.5	32
IBM POWER	1.0	64
ARM (Ampere Altra)	0.25	16

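Impact on CPU selection: EPYC and Xeon share the same 0.5 factor, so the license cost depends on the number of physical cores, not on the vendor. For a fixed license budget, choose the CPU with the highest per-core performance within the core count you are prepared to license.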
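The core-factor arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not a licensing tool. The function name `oracle_licenses` is hypothetical, and rounding fractional results up to the next whole license is an assumption that should be checked against the current Oracle Processor Core Factor Table.

```python
import math


def oracle_licenses(physical_cores: int, core_factor: float) -> int:
    """Processor licenses = physical cores x core factor, rounded up (assumed)."""
    if physical_cores < 0 or core_factor <= 0:
        raise ValueError("cores must be >= 0 and core factor must be > 0")
    return math.ceil(physical_cores * core_factor)


# 64 physical cores, as in the table above
print("x86 (EPYC / Xeon):", oracle_licenses(64, 0.5))
print("IBM POWER:", oracle_licenses(64, 1.0))
```

The same function can be used to compare SKUs: fix the number of licenses you can afford, then look for the CPU that delivers the most work per core within that count.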

Configuration by company size and storage type

Variant A: Small company — local NVMe RAID

Component	Recommendation	Note
CPU	1× EPYC 9124/9224 or Intel Xeon 4410Y (8-16C)	1 socket, high clock
RAM	64-256 GB (8-16 GB/core)	DDR5-4800, 1DPC
OS disk	2× SATA/SAS SSD, RAID 1 (240-480 GB)	For OS + binaries
Data disk	4-6× NVMe (U.2/E3.S), RAID 10	Local data, no sharing
WAL disk	2× NVMe RAID 1 (400-800 GB)	PostgreSQL only
Network	2× 25 GbE (LACP)	Application traffic + management
Form factor	1U or 2U	Single node, no cluster
Storage backend	Local RAID controller (PERC/Broadcom)	HW RAID 10 or SW RAID (mdadm)
HA	Application manages failover (patroni, repmgr, orchestrator)	Standby node on failure

Use case: Startup, branch office, dev/test, < 500 users, single database server, low availability requirements.

Variant B: Medium company — local NVMe + asynchronous replication

Component	Recommendation	Note
CPU	1-2× EPYC 9334/9374F or Intel Xeon 5418Y (16-24C)	1-2 socket, balanced
RAM	128-512 GB (8-16 GB/core)	DDR5-4800/5600, 1DPC
OS disk	2× NVMe RAID 1 (2× 480 GB)	OS + binaries
Data disk	6-8× NVMe, RAID 10	Local NVMe, 3-6 TB usable
WAL disk	2× NVMe RAID 1 (2× 800 GB)	Separate from data
Network	2× 25 GbE (app) + 2× 25 GbE (replication)	Application and replication networks separated
Form factor	2U	Primary + replica node
Storage backend	SW RAID (mdadm) or HW RAID (PERC H965)	Write-back cache with BBU
HA	Patroni / repmgr / MySQL InnoDB Cluster	Asynchronous replication to 1-2 standby

Use case: E-commerce, medium SaaS, 500-5000 users, RPO < 1 min, RTO < 5 min.

Variant C: Large company — FC SAN (enterprise)

Component	Recommendation	Note
CPU	2× EPYC 9654/9965 or Xeon 8592+/6980P (48-128C)	2 socket, max cores, large cache
RAM	512 GB - 2 TB (8-16 GB/core)	DDR5, 2DPC (speed penalty), 12 channels (EPYC)
OS disk	2× SATA SSD RAID 1 (2× 480 GB)	OS only, data on SAN
Data + WAL	LUNs from FC SAN	Hitachi VSP / Dell PowerMax / Pure //X
HBA	2× dual-port FC HBA (32/64 Gb)	Multipath (active-active), FC-NVMe
Network	2× 25/100 GbE (app) + 2× 32/64 Gb FC (storage)	App and storage networks separated
Form factor	2U	2-8 node cluster (RAC, AlwaysOn AG)
Storage backend	FC SAN — LUN per database	Thin provisioning, RAID on SAN, snapshots
HA	Oracle RAC / SQL Server AOAG / PostgreSQL Patroni	Synchronous replication, FC multipath

SAN advantages: Centralized management, snapshots, cloning, disaster recovery (SRDF/Metro), separate storage network, higher availability. Disadvantages: Higher latency compared to local NVMe (~50-200 µs over SAN vs ~10 µs local NVMe), higher CAPEX, vendor lock-in.

Variant D: Large company — Ceph / SDS backend

Component	Recommendation	Note
CPU	2× EPYC 9334/9654 (16-32C)	Fewer cores than SAN variant — part of CPU goes to Ceph client
RAM	256-512 GB	Less RAM — Ceph client cache is not as effective as local buffer
OS disk	2× SATA SSD RAID 1 (2× 480 GB)	OS
Network	2× 25/100 GbE (app) + 2× 25/100 GbE (Ceph public)	App and Ceph traffic over Ethernet
HBA	Storage HBA in IT/HBA mode (no RAID)	For Ceph OSD node, not DB node
Form factor	2U	DB node + separate Ceph OSD node
Storage backend	RBD (RADOS Block Device) over Ceph	3× replication or erasure coding
HA	Application + Ceph inherent HA	Ceph self-healing, auto-rebalance

Ceph advantages: No vendor lock-in, horizontal scaling, unified platform for block/file/object, lower CAPEX. Disadvantages: Higher latency and CPU overhead (Ceph client → network → OSD), variable performance, more complex troubleshooting.

Variant E: Cloud — RDS / CloudSQL / Azure SQL

Component	Recommendation	Note
Compute	AWS RDS (db.r7g/r8g), Azure SQL (GP/BC/Hyperscale)	Managed service, no OS access
Storage	EBS gp3 / io2, Azure Premium SSD v2, Cloud SQL SSD	Automatic scaling, PITR, multi-AZ
Network	Security Group, Private Link, VPC peering	No HBA, no SAN — everything over Ethernet
HA	Multi-AZ (synchronous), read replicas	Managed failover, RTO < 60 s
Backup	Automated, PITR (7-35 days)	No management required

Use case: No on-prem hardware, elastic scaling, pay-per-use, lower operational overhead. Disadvantages: Higher long-term costs, data residency, network latency, limited customization.

Variant comparison

Aspect	Local NVMe (small)	Local NVMe (medium)	FC SAN	Ceph	Cloud
Latency	~10 µs	~10 µs	~50-200 µs	~100-500 µs	~100-1000 µs
Scaling	Vertical	Vertical	Horizontal	Horizontal	Elastic
CAPEX	Low	Medium	High	Medium	None (OPEX)
Operational overhead	Low	Low	High (SAN admin)	Medium	None
HA	Application	Patroni/Cluster	RAC/AOAG	Ceph HA	Managed
RPO	1-5 min	< 1 min	< 10 s	< 30 s	< 60 s
RTO	5-15 min	< 5 min	< 2 min	< 5 min	< 60 s
Number of servers	1-2	2-4	4-16	6-20+	0 (managed)
Company	Startup/SME	SME/Enterprise	Enterprise	Enterprise	Any

PostgreSQL parameter matrix by storage type

Parameter	Local NVMe	FC SAN	Ceph RBD
`random_page_cost`	1.1	1.5-2.0	2.0-3.0
`effective_io_concurrency`	300	100-200	50-100
`synchronous_commit`	off (NVMe cache)	on (SAN cache)	off (Ceph cache)
`full_page_writes`	on	on	on (even over Ceph)
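The matrix above can be turned into a `postgresql.conf` fragment per storage backend. The sketch below is illustrative only: the dictionary and function names are hypothetical, and where the matrix gives a range the lower bound is used as a conservative starting point (an assumption, not a recommendation from the table).

```python
# Values copied from the parameter matrix; for ranges the lower bound
# is used as a starting point (assumption).
PG_STORAGE_PROFILES = {
    "local_nvme": {
        "random_page_cost": 1.1,
        "effective_io_concurrency": 300,
        "synchronous_commit": "off",
        "full_page_writes": "on",
    },
    "fc_san": {
        "random_page_cost": 1.5,
        "effective_io_concurrency": 100,
        "synchronous_commit": "on",
        "full_page_writes": "on",
    },
    "ceph_rbd": {
        "random_page_cost": 2.0,
        "effective_io_concurrency": 50,
        "synchronous_commit": "off",
        "full_page_writes": "on",
    },
}


def render_pg_conf(storage: str) -> str:
    """Return 'name = value' lines for the chosen storage backend."""
    profile = PG_STORAGE_PROFILES[storage]
    return "\n".join(f"{name} = {value}" for name, value in profile.items())


print(render_pg_conf("fc_san"))
```

Keep in mind that `synchronous_commit = off` trades durability of the most recent transactions for lower commit latency: a crash can lose the last commits, although it does not corrupt the database.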

Storage layout by backend type

Local NVMe (small/medium):

Mount point    FS       RAID       Disk            Purpose
/               ext4    1 (mirror) 2× SATA SSD     OS
/data           xfs     10          4-8× NVMe       Data
/wal            xfs     1 (mirror) 2× NVMe          WAL (PG)

FC SAN (enterprise):

Mount point    FS       Device                   Purpose
/               ext4    local RAID 1 (2× SSD)     OS
/dev/sdb        xfs     FC LUN 1 (500 GB)         WAL (PG)
/dev/sdc        xfs     FC LUN 2 (2 TB)           Data
/dev/sdd        xfs     FC LUN 3 (2 TB)           Indexes (separate)

Ceph RBD:

Mount point    FS       Ceph device               Purpose
/               ext4    local RAID 1 (2× SSD)     OS
/dev/rbd0       xfs     rbd datastore-01          Data + WAL (Ceph RBD)

Kernel tuning by variant

Local NVMe:

vm.dirty_ratio = 30
vm.dirty_background_ratio = 5

FC SAN:

# SAN storage: higher latency, less aggressive flush
vm.dirty_ratio = 20
vm.dirty_background_ratio = 3
# Dirty pages expire after 30 s (3000 centiseconds); the SAN cache absorbs the writes
vm.dirty_expire_centisecs = 3000

Ceph RBD:

# Ceph RBD — network storage, optimize for RBD cache
vm.dirty_ratio = 15
vm.dirty_background_ratio = 2
# RBD cache settings
# rbd cache = true (client-side)
# rbd cache size = 256-512 MB

Database-specific tuning

Parameter	PostgreSQL	MySQL	Oracle	MongoDB
Cache	`shared_buffers` 25 % RAM	`innodb_buffer_pool_size` 70-80 % RAM	`SGA_TARGET` 60-80 % RAM	WiredTiger cache (`storage.wiredTiger.engineConfig.cacheSizeGB`) 50-80 % RAM
OS cache	`effective_cache_size` 75 % RAM	OS cache + InnoDB	OS cache (double buffering risk with large SGA)	OS cache
Write buffer	`wal_buffers` 64-256 MB	`innodb_log_file_size` 1-4 GB	Redo log (2-4 groups, 200 MB-4 GB)	WiredTiger log
Connections	`max_connections` 50-500	`max_connections` 100-500	`processes` 200-2000	`net.maxIncomingConnections`
I/O	`effective_io_concurrency` 200	`innodb_io_capacity` 2000	`db_file_multiblock_read_count` 128	WiredTiger eviction
Huge pages	`huge_pages = try`	`large-pages = ON`	`use_large_pages = only` (mandatory)	`transparent_hugepage=never` (kernel parameter)
Parallel query	`max_parallel_workers` 4-8	`innodb_parallel_read_threads` 4	`parallel_degree_policy = auto` — up to 64	—
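The percentage rules in the cache rows translate into concrete sizes once the host RAM is known. A minimal sketch, assuming a dedicated database host; the function name and the returned key names are hypothetical, and integer division is used so the results are whole gigabytes.

```python
def db_cache_sizes_gb(ram_gb: int) -> dict:
    """Apply the RAM percentages from the tuning table to a host of ram_gb."""

    def pct(percent: int) -> int:
        return ram_gb * percent // 100

    return {
        "postgresql.shared_buffers": pct(25),
        "postgresql.effective_cache_size": pct(75),
        # Ranges are returned as (low, high) tuples
        "mysql.innodb_buffer_pool_size": (pct(70), pct(80)),
        "oracle.SGA_TARGET": (pct(60), pct(80)),
    }


for name, size in db_cache_sizes_gb(256).items():
    print(f"{name}: {size} GB")
```

On a host that also runs other services, the percentages should be applied to the memory actually available to the database, not to the installed RAM.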

Connectivity by variant

Variant	App network	Storage network	Replication	Management
Local (small)	2× 25 GbE LACP	—	2× 25 GbE (same)	iDRAC/iLO
Local (medium)	2× 25 GbE LACP	—	2× 25 GbE dedicated	iDRAC/iLO
FC SAN	2× 25/100 GbE	2× 32/64 Gb FC (multipath)	FC replication	iDRAC/iLO + SAN mgmt
Ceph	2× 25/100 GbE	2× 25/100 GbE (public net)	2× 25/100 GbE (cluster net)	iDRAC/iLO + Ceph mgmt
Cloud	Elastic IP / Private Link	—	—	AWS Console / API
Oracle Standalone	2× 25 GbE LACP	ASM (2× 25 GbE or FC 32G)	Data Guard 2× 25 GbE	iLO + ASM mgmt
Oracle RAC	2-4× 25/100 GbE	2× 64 Gb FC (multipath)	Cache Fusion interconnect	iLO + SAN mgmt
Oracle Exadata	4-8× 100 GbE RoCE	NVMe over Fabric	RDMA interconnect	Exadata CLI + OEDA

Oracle-specific configuration

Oracle ASM — diskgroup layout

Oracle ASM (Automatic Storage Management) replaces traditional filesystem + volume manager:

Diskgroup	Redundancy	Disks	Purpose
DATA	Normal (2× mirror)	4-12× FC LUN/NVMe	Data files, temp files, control files
FRA (Fast Recovery Area)	Normal (2× mirror)	2-6× FC LUN/NVMe	Archive logs, backup, flashback logs
REDO	High (3× mirror)	2-4× FC LUN/NVMe	Online redo log groups (I/O critical)
SPFILE	Normal	2× small LUN	Server parameter file

ASM striping: Coarse (1 MB) for regular data, Fine (128 KB) for redo logs (lower write latency).

Variant O1: Standalone Oracle (small/medium, single instance)

Parameter	Small (< 500 users)	Medium (500-2000 users)
CPU	1-2× EPYC 9124-9224 / Xeon 4410Y (8-16C)	2× EPYC 9334-9374F / Xeon 5418Y (16-24C)
RAM (SGA + PGA)	64-128 GB (SGA 70 %, PGA 30 %)	128-512 GB (SGA 60-80 %, PGA 20-40 %)
Huge pages	Yes (vm.nr_hugepages) — mandatory for SGA	Yes
OS disk	2× SATA SSD RAID 1 (240 GB)	2× NVMe RAID 1 (480 GB)
DATA + FRA	4-6× NVMe, ASM normal redundancy	6-8× NVMe or FC LUN, ASM normal
REDO	2-4× NVMe (separate from DATA), ASM high	4× FC LUN (separate), ASM high
Archive log	Local FRA	FC LUN (FRA diskgroup)
Network (app)	2× 25 GbE LACP	2-4× 25/100 GbE LACP
Network (storage)	— (local NVMe)	2× FC 32G multipath
Network (Data Guard)	—	2× 25 GbE dedicated
DB version	Oracle SE2 (max 16 threads)	Oracle EE (unlimited)

Use case: Dev/test, small production DBs, branch offices. SE2 is limited to 2-socket servers and 16 CPU threads per database, and it does not include parallel execution.
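The `vm.nr_hugepages` value follows from the SGA size. A minimal sketch, assuming the default 2 MiB huge page size on x86-64 Linux and treating the RAM figures as GiB; the function name and the `headroom_pages` parameter are hypothetical.

```python
import math

HUGEPAGE_BYTES = 2 * 1024**2  # default huge page size on x86-64 Linux


def nr_hugepages(sga_gib: float, headroom_pages: int = 0) -> int:
    """Number of 2 MiB pages needed to back an SGA of sga_gib GiB."""
    sga_bytes = sga_gib * 1024**3
    return math.ceil(sga_bytes / HUGEPAGE_BYTES) + headroom_pages


# Small variant: 128 GB RAM with 70 % of it assigned to the SGA
sga_gib = 128 * 0.70
print(f"vm.nr_hugepages = {nr_hugepages(sga_gib)}")
```

With `use_large_pages = only` the instance refuses to start if the pool is too small, so a few spare pages via `headroom_pages` may be worth the memory.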

Variant O2: Oracle Data Guard (medium/large, HA + DR)

Primary + standby in active-passive mode, Active Data Guard possible for reporting.

Parameter	Recommendation
CPU	2× EPYC 9654-9965 / Xeon 8592+ (32-64C)
RAM	256-1024 GB (SGA 60-80 %, PGA 20-40 %)
Huge pages	Yes (50-80 % RAM allocated for SGA)
OS disk	2× NVMe RAID 1 (480 GB)
Storage	FC SAN LUN (DATA + FRA + REDO separate) or NVMe + ASM
HBA	2× dual-port FC 32/64 Gb (multipath active-active)
App network	2-4× 25/100 GbE LACP
Storage network	2× FC 32/64 Gb multipath
Data Guard network	2× 25/100 GbE dedicated (sync or async)
Data Guard mode	Maximum Availability (sync, fallback to async) — RPO = 0
Topology	1 primary + 1-2 standby (physical), far sync for geo-DR
Active Data Guard	Standby open for read (reporting, backup) — requires ADG license

Data Guard latency:

Synchronous (Maximum Availability):
  Primary COMMIT → LGWR flush REDO → sync over network → Standby RFS (standby redo log) → ACK → ~1-5 ms
  RPO = 0, impact on write latency

Asynchronous (Maximum Performance):
  Primary COMMIT → LGWR flush REDO → async to standby buffer → ~0.1-1 ms
  RPO = a few seconds, negligible write impact

Network requirements for Data Guard sync:

RTT < 2 ms for synchronous mode (recommended < 1 ms)
Min. 10 GbE, recommended 25 GbE (throughput = REDO rate × 2)
REDO rate: OLTP ~50-500 MB/s, batch ~500-2000 MB/s
At REDO rate 500 MB/s (4 Gb/s) and 25 GbE → ~16 % link utilization, ~32 % with the ×2 headroom
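The link sizing rule above is simple arithmetic and can be checked for any redo rate. A minimal sketch using decimal units (1 Gb/s = 1000 Mb/s) and ignoring protocol overhead, which is an assumption; the function name is hypothetical.

```python
def link_utilization(redo_mb_s: float, link_gbit_s: float, headroom: float = 1.0) -> float:
    """Fraction of the link consumed by redo transport.

    redo_mb_s   redo generation rate in MB/s
    link_gbit_s nominal link speed in Gb/s
    headroom    multiplier on the redo rate (the text suggests 2)
    """
    required_gbit_s = redo_mb_s * 8 / 1000 * headroom
    return required_gbit_s / link_gbit_s


for link in (10, 25):
    raw = link_utilization(500, link)
    sized = link_utilization(500, link, headroom=2.0)
    print(f"{link} GbE: {raw:.0%} raw, {sized:.0%} with 2x headroom")
```

A result close to or above 100 % with headroom applied indicates that the link is too small for synchronous transport at that redo rate.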

Variant O3: Oracle RAC (large, enterprise)

Multi-instance cluster with shared storage and Cache Fusion.

Parameter	Recommendation
Number of nodes	2-4 (typical), max 64 (RAC cluster)
CPU per node	2× EPYC 9654-9965 / Xeon 8592+ (32-64C)
RAM per node	512-2048 GB (SGA 60-80 %, PGA 20-40 %)
Huge pages	Yes (1 GB pages if RAM > 512 GB)
Storage	FC SAN — shared LUNs (ASM normal/high redundancy)
HBA	2× dual-port FC 64 Gb (multipath, active-active)
App network	2-4× 25/100 GbE LACP (VIP, SCAN listener)
Storage network	2-4× FC 64 Gb (multipath per node)
Cache Fusion interconnect	2× 100 GbE (RoCE v2 or InfiniBand) — dedicated
RAC interconnect latency	< 5 µs (recommended), max < 10 µs
ASM	Normal redundancy (2-way mirror)
Oracle Clusterware	Voting disk (3× 1 GB LUN), OCR (3× 500 MB LUN)
Service	OLTP_service, REPORT_service, BATCH_service

Cache Fusion — critical interconnect:

Node A (DB instance) ←──→ Node B (DB instance)
       │                        │
       └──────── ASM ───────────┘
              │
        FC SAN (shared storage)

Cache Fusion traffic: dirty block transfer between instances
  → Latency < 5 µs, otherwise RAC scaling degrades
  → Capacity: 2× 100 GbE, dedicated switch or InfiniBand HDR100
  → Recommended MTU: 9000 (jumbo frames)

RAC sizing by transaction count:

TPS	Nodes	CPU per node	RAM per node	Interconnect
< 10 000	2	16-24C	256 GB	2× 25 GbE
10 000 - 50 000	2-4	32-48C	512 GB	2× 100 GbE RoCE
50 000 - 200 000	4-8	48-64C	1024 GB	2× 100 GbE RoCE / InfiniBand
> 200 000	8+	64-128C	2048 GB	InfiniBand HDR100/HDR200

RAC sizing — license cost calculation:

Example: 4-node RAC, each node 2× EPYC 9654 (96C) = 192 cores per node
  Core factor 0.5 → 96 Oracle licenses per node
  4 × 96 = 384 Oracle EE licenses
  At ~$47.5k/license → ~$18.2M (licenses only, without 22 % annual support)
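The calculation above can be reproduced for other cluster shapes. This is a sketch under stated assumptions: the class and function names are hypothetical, the list price and the 22 % support rate are the figures from the example, and integer arithmetic is used to avoid rounding drift.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RacCluster:
    nodes: int
    sockets_per_node: int
    cores_per_socket: int
    core_factor_pct: int = 50  # core factor 0.5, kept as an integer percentage

    @property
    def licenses(self) -> int:
        cores = self.nodes * self.sockets_per_node * self.cores_per_socket
        # Ceiling division: a fractional license is rounded up (assumed)
        return -(-(cores * self.core_factor_pct) // 100)


def license_cost(cluster: RacCluster, list_price: int = 47_500, support_pct: int = 22):
    """Return (one-time license cost, annual support cost) in USD."""
    total = cluster.licenses * list_price
    return total, total * support_pct // 100


cluster = RacCluster(nodes=4, sockets_per_node=2, cores_per_socket=96)
licenses_cost, support_per_year = license_cost(cluster)
print(f"{cluster.licenses} licenses, ${licenses_cost:,} + ${support_per_year:,}/year support")
```

Re-running the sketch with a lower core count per socket shows how quickly the license bill shrinks compared with the hardware savings.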

Variant O4: Oracle Exadata (hyperscale)

Engineered system — optimal for hybrid workload (OLTP + DW).

Parameter	X9M / X10M	Use case
Database servers	2-8× (Xeon, 1.5-6 TB RAM, NVMe)	Compute
Storage servers	3-18× (NVMe + HDD, Smart Scan)	Predicate offloading
Smart Scan	Filtering at storage layer	Less data over network, higher throughput
RoCE interconnect	100 GbE (RDMA)	Low latency, high bandwidth
In-Memory Column Store	Optional license	Real-time analytics without ETL
HCC (Hybrid Columnar Compression)	Compression in storage servers	Up to 10-15× compression for DW
Rack power	~15-30 kW (full rack)	Higher density

When to choose Exadata over standalone RAC:

OLTP > 50 000 TPS
Consolidation needed (multiple DBs on one cluster)
Smart Scan significantly accelerates reporting on production data
HCC for storage savings on DW workloads

2. Hypervisor host (ESXi / KVM / Hyper-V)

Configuration by size and storage type

Variant A: Small company — local storage (2-3 hosts)

Component	Recommendation	Note
CPU	1× EPYC 9224/9254 or Xeon 4410Y/5418Y (12-24C)	1 socket, enough cores for VM density
RAM	128-256 GB (4-8 GB/core)	DDR5, 1DPC
OS disk	2× SATA SSD RAID 1 (2× 240-480 GB)	ESXi / Proxmox / Hyper-V boot
VM storage	4-6× SATA/SAS SSD, RAID 5/6 or 10	Local RAID, 4-12 TB usable
Network	2-4× 10/25 GbE (LACP)	Shared for everything (management + VM + storage)
Hypervisor	VMware vSphere Standard / Proxmox VE / Hyper-V	Basic license, no enterprise features
Storage backend	Local RAID controller (PERC H755, Broadcom 9560)	HW RAID with cache, write-back
HA	VMware HA / Proxmox HA	Restart VM on another host on failure
Backup	Veeam B&R Free / PBS (Proxmox Backup Server)	Local or USB disk

Use case: Small office, branch office, dev/test, < 10 VMs, low budget, simple management. Limitations: No vMotion without shared storage, outage during host failure (HA restart, not seamless).

Variant B: Medium company — vSAN / Ceph (3-6 hosts)

Component	Recommendation	Note
CPU	1-2× EPYC 9334/9654 or Xeon 5418Y/8592+ (16-32C)	1-2 socket
RAM	256-512 GB (4-8 GB/core)	DDR5, 2DPC (minimal penalty)
OS disk	2× SATA SSD RAID 1 or 2× M.2 (BOSS-S1 SATA / BOSS-N1 NVMe)	Separate from VM storage
Cache tier	1-2× NVMe (vSAN caching / Ceph WAL+DB)	For write performance
Capacity tier	4-8× SATA/SAS SSD or HDD (vSAN capacity / Ceph OSD)	HDD for capacity, SSD for performance
Network	4× 25/100 GbE — 2× VM + mgmt, 2× storage (vSAN/Ceph)	Separate storage network, RDMA (RoCE v2)
Hypervisor	VMware vSAN / Proxmox Ceph / StarWind HCI	HCI license (vSAN ~$2.5k/Core)
Storage backend	vSAN OSA/ESA or Ceph (RADOS)	Distributed storage, auto-rebalance
HA	vSphere HA + vSAN / Proxmox HA + Ceph	vMotion, DRS, automated failover
Failover	N+1 (one host as reserve)	vSAN needs min. 3 hosts; 4 recommended so data can be rebuilt after a host failure

Pure Ceph variant (Proxmox / OpenStack):

Proxmox node (3-6×):
├── CPU: 1× EPYC 9224-9334 (12-24C)
├── RAM: 128-256 GB
├── OS: 2× SATA SSD RAID 1
├── Ceph OSD: 4-8× NVMe/SATA SSD (RAW, HBA mode)
├── Network: 2× 25 GbE (public) + 2× 25 GbE (cluster)
└── Storage: Ceph 3× replication, CRUSH host failure domain

VMware vSAN variant (4-6 hosts):

vSAN node (4-6×):
├── CPU: 1-2× EPYC/Xeon (16-32C)
├── RAM: 256-512 GB
├── OS: 2× M.2 (BOSS-S1 SATA / BOSS-N1 NVMe) or SD card (deprecated)
├── vSAN cache: 1-2× NVMe (write buffer, OSA only)
├── vSAN capacity: 4-8× NVMe (vSAN ESA) or SATA/SAS SSD / HDD (vSAN OSA)
├── Network: 2× 25/100 GbE (VM) + 2× 25 GbE (vSAN)
└── Storage: vSAN ESA (all-NVMe) or OSA (hybrid)

Use case: SME, enterprise division, 10-100 VMs, need for vMotion, DRS, HA, simple storage management.

Variant C: Large company — FC SAN (6+ hosts)

Component	Recommendation	Note
CPU	2× EPYC 9654/9965 or Xeon 8592+/6980P (32-64C)	2 socket, max VM density
RAM	512 GB - 2 TB (4-8 GB/core)	DDR5, 2DPC
OS disk	2× SATA SSD RAID 1 or SD card (vSphere)	Boot, image storage
VM storage	LUNs from FC SAN — VMFS / NFS datastores	Hitachi, Dell, Pure, HPE storage
HBA	2× dual-port FC HBA 32/64 Gb	Multipath, FC-NVMe
Network	4-8× 25/100 GbE — split by traffic type	Management, VM, vMotion, FT separated
Hypervisor	VMware vSphere Enterprise+ / Hyper-V DC	Enterprise license, DRS, HA, FT
Storage backend	FC SAN with VMFS 6 datastores, VVols	Thin provisioning, storage DRS, array snapshots
HA	vSphere HA + DRS + vCenter	vMotion, DRS, FT, SRM for DR
Failover	N+1 or admission control (CPU/RAM reserve)	Reserved capacity for HA failover

Use case: Enterprise, 100+ VMs, mix of DB and applications, centralized storage management, enterprise SLA.
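The N+1 rule and percentage-based admission control describe the same reserve. A minimal sketch, assuming identically sized hosts; both function names are hypothetical.

```python
import math


def ha_reserve_percent(hosts: int, host_failures_to_tolerate: int = 1) -> int:
    """CPU/RAM percentage to reserve so the given number of hosts can fail."""
    if not 0 < host_failures_to_tolerate < hosts:
        raise ValueError("failures to tolerate must be between 1 and hosts - 1")
    return math.ceil(100 * host_failures_to_tolerate / hosts)


def usable_ram_gb(hosts: int, ram_per_host_gb: int, host_failures_to_tolerate: int = 1) -> int:
    """RAM that can be committed to VMs while keeping the failover reserve."""
    return (hosts - host_failures_to_tolerate) * ram_per_host_gb


for n in (4, 6, 12):
    print(f"{n} hosts: reserve {ha_reserve_percent(n)} %, "
          f"usable RAM {usable_ram_gb(n, 512)} GB")
```

The reserve shrinks as the cluster grows, which is one reason larger clusters are cheaper per VM than several small ones.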

Variant D: Hyperscale — Ceph / SDS (20+ hosts)

Component	Recommendation	Note
CPU	2× EPYC 9654/9965 (64-128C)	2 socket, compute optimal
RAM	512 GB - 1 TB (2-4 GB/core)	Low overcommit ratio for consistency
OS disk	2× M.2 NVMe RAID 1 (BOSS)	Boot
Network	4-8× 100 GbE (compute + storage)	Separate OVN/OVS for SDN, VXLAN tunneling
Hypervisor	OpenStack (Nova) / OpenShift (KubeVirt)	Open source, API-driven, multi-tenant
Storage backend	Ceph (RADOS, RBD, RGW, CephFS)	Unified storage, erasure coding (8+3)
Orchestration	OpenStack / Kubernetes	Infrastructure-as-Code, autoscaling
HA	OpenStack HA / Kubernetes HA	Self-healing, auto-rebalance

Use case: Cloud provider, hyperscale, 500+ VMs, multi-tenant, maximum automation.
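The choice between replication and erasure coding mainly affects usable capacity. A minimal sketch of the raw-to-usable ratio for 3× replication and for the 8+3 erasure-coding profile mentioned above; the function names are hypothetical, and the result ignores free-space reserves and metadata overhead (an assumption).

```python
def replicated_usable_tb(raw_tb: float, replicas: int = 3) -> float:
    """Usable capacity of a replicated pool."""
    return raw_tb / replicas


def erasure_coded_usable_tb(raw_tb: float, k: int, m: int) -> float:
    """Usable capacity of an erasure-coded pool with k data and m coding chunks."""
    return raw_tb * k / (k + m)


raw = 330.0  # TB of raw OSD capacity, illustrative value
print(f"3x replication: {replicated_usable_tb(raw):.0f} TB usable")
print(f"EC 8+3:         {erasure_coded_usable_tb(raw, 8, 3):.0f} TB usable")
```

An 8+3 profile needs at least 11 failure domains (hosts, with a host-level CRUSH rule), which is why it only fits the larger clusters.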

Hypervisor variant comparison

Aspect	Local (small)	vSAN/Ceph (medium)	FC SAN (large)	Ceph hyperscale
Storage	Local RAID	vSAN / Ceph (HCI)	FC SAN (centralized)	Ceph (distributed)
Number of hosts	2-3	3-6	6-50+	20+
VM latency	~10 µs (local)	~100-500 µs	~200 µs (SAN)	~500-2000 µs
CAPEX/host	Low	Medium	High	Medium
CAPEX storage	Low	None (part of hosts)	High (SAN array)	None (part of hosts)
Management	Simple (per host)	vCenter / Proxmox	vCenter + SAN mgmt	OpenStack / K8s
vMotion	No (no shared storage)	Yes (vSAN / Ceph RBD)	Yes (FC LUN)	Yes (Ceph RBD)
DRS	No	Yes (vSphere)	Yes (vSphere)	OpenStack scheduler
Scaling	Vertical	Horizontal (add host)	Horizontal (host + SAN)	Horizontal

Network design by variant

Small (local storage)

Traffic	VLAN	Speed	Teaming	Note
Management	Mgmt	1 GbE	Active/Passive	Dedicated port (iLO/iDRAC)
VM + Storage	All	2-4× 10/25 GbE	LACP	Shared, VLAN tagging

┌──────────────────────────────────────────┐
│  Host                                   │
│  ┌──────┐ ┌─────────────────────────────┐│
│  │ iLO  │ │   NIC1   NIC2               ││
│  │ 1 GbE │ │  [LACP] 25 GbE             ││
│  └──────┘ └──────────┬──────────────────┘│
└──────────────────────┼───────────────────┘
                       │
                 ┌─────┴─────┐
                 │  Switch   │
                 └───────────┘

Medium (vSAN / Ceph)

Traffic	VLAN	Speed	Teaming	Note
Management	Mgmt	1 GbE	Active/Passive	Dedicated iLO/iDRAC
VM	VM	2× 25/100 GbE	LACP	VM traffic, migration
Storage	vSAN/Ceph	2× 25/100 GbE	LACP or RDMA	Separate, Jumbo frames (MTU 9000)

┌──────────────────────────────────────────┐
│  Host                                   │
│  ┌──────┐ ┌──────────┐ ┌───────────────┐│
│  │ iLO  │ │ NIC1 NIC2│ │ NIC3 NIC4     ││
│  │ 1 GbE │ │ VM traffic│ │ Storage (vSAN)││
│  └──────┘ └──────────┘ └───────────────┘│
└──────────────────────────────────────────┘

Large (FC SAN)

Traffic	VLAN	Speed	Teaming	Note
Management	Mgmt	1 GbE	Active/Passive	Dedicated
VM	VM	2-4× 25/100 GbE	LACP	VM traffic
vMotion	vMotion	2× 25 GbE	Dedicated	Multi-NIC vMotion
FT	FT	2× 10/25 GbE	Dedicated	Low latency
Storage	—	2× 32/64 Gb FC	Multipath	FC SAN

┌──────────────────────────────────────────────┐
│  Host                                       │
│  ┌──────┐ ┌────────────┐ ┌────┐ ┌─────────┐│
│  │ iLO  │ │ NIC1-4      │ │HBA1│ │ HBA2    ││
│  │ 1 GbE │ │ VM+vMotion+FT│ │32Gb│ │ 32Gb    ││
│  └──────┘ └────────────┘ └─┬──┘ └──┬──────┘│
└────────────────────────────┼───────┼───────┘
                             │       │
                     ┌───────┴───┐ ┌─┴────────┐
                     │ Ethernet  │ │ FC Switch │
                     │ Switch    │ │ (Brocade/ │
                     │           │ │  Cisco)   │
                     └───────────┘ └──────────┘

BIOS for hypervisor — all variants

Setting	Value	Rationale
Hyper-Threading	Enabled	Higher VM density
Virtualization Technology	Enabled	VT-x/AMD-V
VT-d / IOMMU	Enabled	Passthrough, SR-IOV
Power Management	Performance / OS	Minimize VM exit latency
C-States	Disabled	Lower VM exit latency (important for real-time VMs)
NUMA	Enabled	NUMA-aware VM placement
SR-IOV	Enabled	NIC/GPU virtualization
Adjacent Cache Line Prefetch	Enabled (Intel)	Better sequential reads
DCU Streamer / IP Prefetcher	Enabled	HW prefetch for VM workload
Patrol Scrub	Disabled (vSAN/Ceph)	Can cause latency spikes with SDS

Hypervisor selection by variant

Criterion	VMware vSphere	Proxmox VE	Hyper-V	OpenStack
Size	SME - Enterprise	SME	SME - Enterprise	Hyperscale
Storage	vSAN, SAN, NFS	Ceph, ZFS, NFS	Storage Spaces, SAN	Ceph, Manila
License	~$1-5k/core	Free (support ~$500/host)	Part of Windows Server	Open source
Familiarity	Highest	Medium	Windows admin	Low
Automation	Terraform, Ansible, PowerCLI	Ansible, Terraform, PBS	PowerShell, SCVMM	Terraform, Heat, Ansible
Ecosystem	Broadest (Veeam, Zerto, SRM)	Growing (PBS, remote migration)	Windows ecosystem	Open source (Kolla, TripleO)

3. Kubernetes node

Node profiles

Role	CPU	RAM	Storage	Typical workload
General purpose	16-32 cores	64-128 GB	1× NVMe OS + 1×NVMe local	Web, API, microservices
Memory optimized	32-64 cores	256-512 GB	1× NVMe OS + 2×NVMe local	In-memory cache, DB
Compute optimized	64-128 cores	128-256 GB	1× NVMe OS	Batch, CI/CD
GPU node	32-64 cores	512-1024 GB	1× NVMe OS + 4-8×NVMe local	AI/ML training, inference
Storage node	16-32 cores	64-128 GB	4-12× NVMe/SATA (Ceph/Longhorn)	SDS, persistent volumes

Kernel tuning

# /etc/sysctl.d/99-kubernetes.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
net.ipv4.conf.all.forwarding = 1

# Connection tracking (for NodePort, Service)
net.netfilter.nf_conntrack_max = 2097152
net.netfilter.nf_conntrack_tcp_timeout_established = 86400

# File watchers (for kubelet, containerd)
fs.inotify.max_user_instances = 8192
fs.inotify.max_user_watches = 524288

# Memory management
vm.swappiness = 0
# Allow overcommit (CRI-O, containerd)
vm.overcommit_memory = 1
vm.panic_on_oom = 0
kernel.panic = 10
kernel.panic_on_oops = 1
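After applying the file it is worth confirming that the running kernel actually holds these values. The sketch below parses a sysctl configuration and compares it with `/proc/sys`. It is an illustration with hypothetical function names. It assumes that keys map to paths by replacing dots with slashes, which does not hold for keys containing interface names with dots.

```python
from pathlib import Path


def parse_sysctl_conf(text: str) -> dict[str, str]:
    """Parse 'key = value' lines; lines starting with # or ; are comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def proc_path(key: str, root: str = "/proc/sys") -> Path:
    """Map a sysctl key such as net.ipv4.ip_forward to its /proc/sys path."""
    return Path(root) / key.replace(".", "/")


def check(settings: dict[str, str], root: str = "/proc/sys") -> dict:
    """Return {key: (expected, actual)} for mismatches; actual is None if absent."""
    mismatches = {}
    for key, expected in settings.items():
        path = proc_path(key, root)
        actual = path.read_text().split() if path.exists() else None
        if actual != expected.split():
            mismatches[key] = (expected, None if actual is None else " ".join(actual))
    return mismatches


if __name__ == "__main__":
    conf = Path("/etc/sysctl.d/99-kubernetes.conf")
    if conf.exists():
        for key, (want, got) in check(parse_sysctl_conf(conf.read_text())).items():
            print(f"{key}: expected {want}, found {got}")
```

A missing `net.bridge.*` entry usually means that the `br_netfilter` kernel module is not loaded.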

Container storage

Type	Recommendation	Note
OS disk	RAID 1 (2× NVMe)	Ext4/XFS, 100-200 GB
Container runtime image	RAID 1 (2× NVMe)	/var/lib/containerd, 200-500 GB
Local PV	Single NVMe	Raw device, no RAID
Rook/Ceph OSD	Raw NVMe/SATA	HBA/IT mode, no RAID
Longhorn	Raw NVMe/SATA	Ext4/XFS per volume

4. Storage server (Ceph / MinIO / NAS)

Ceph OSD node

Component	Recommendation	Note
CPU	1-2 cores per OSD	Up to 12 OSD per node (24 cores)
RAM	4-8 GB per OSD + OS	BlueStore cache, 16-64 GB min
Network	2× 25/100 GbE	Public + Cluster network
Storage	10-12× NVMe/SATA SSD OSD	HBA/IT mode, no RAID
OS disk	2× SATA SSD RAID 1	OS, Ceph MON/MGR
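The per-OSD rules in the table give a quick node sizing. A minimal sketch; the function name is hypothetical and the 16 GB set aside for the operating system and MON/MGR daemons is an assumed value, not a figure from the table.

```python
def ceph_osd_node(osds: int, os_ram_gb: int = 16) -> dict:
    """Return (min, max) CPU cores and RAM in GB for a node with the given OSD count.

    Uses 1-2 cores and 4-8 GB RAM per OSD; os_ram_gb is an assumed
    allowance for the OS and co-located daemons.
    """
    if osds < 1:
        raise ValueError("a node needs at least one OSD")
    return {
        "cores": (osds * 1, osds * 2),
        "ram_gb": (osds * 4 + os_ram_gb, osds * 8 + os_ram_gb),
    }


for count in (4, 8, 12):
    sizing = ceph_osd_node(count)
    print(f"{count} OSDs: {sizing['cores']} cores, {sizing['ram_gb']} GB RAM")
```

NVMe OSDs tend to sit at the upper end of both ranges, HDD-backed OSDs at the lower end.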

BIOS for Ceph:

SATA/NVMe: AHCI/NVMe mode (not RAID)
C-States: Disabled (lower OSD latency)
NUMA: Enabled
Power: Performance

MinIO node

Component	Recommendation
CPU	8-16 cores (32+ for erasure coding)
RAM	32-64 GB + 1 GB per 1 TB storage
Storage	4-16× NVMe (direct, no RAID)
Network	2× 25/100 GbE
OS	Ubuntu / RHEL, XFS (for data)

NAS (TrueNAS, formerly FreeNAS)

ZFS: RAID-Z1/Z2/Z3, compression (lz4, zstd), dedup
ARC cache: 1 GB per 1 TB storage (max 64 GB)
L2ARC: NVMe cache (optional, read-heavy)
SLOG: NVDIMM / Optane (sync write, ZIL)
Network: 2-4× 10/25 GbE LACP

5. Web / API servers

Parameter	Recommendation
CPU	High clock, 8-32 cores
RAM	32-128 GB
Storage	2× NVMe RAID 1 (OS + app)
OS	Ubuntu / RHEL, optimized kernel
Network	2× 10/25 GbE (bonding)

Kernel tuning:

net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.core.netdev_max_backlog = 65535

Quick decision tree — server selection by workload, size and storage

flowchart TD
    W["What workload?"] --> DB["Database"]
    W --> HV["Virtualization"]
    W --> K8s["Kubernetes"]
    W --> AI["AI/ML"]
    W --> ST["Storage server"]
    W --> WEB["Web / API"]

    DB --> DBS{"Company size"}
    DBS -->|"< 500"| DB1["1× EPYC 8-16C, 64-256 GB<br/>NVMe RAID10, 2× 25GbE"]
    DBS -->|"500-5000"| DB2{"Storage"}
    DB2 -->|"Local"| DB2L["1-2× EPYC 16-24C, 128-512 GB<br/>NVMe RAID10, 4× 25GbE"]
    DB2 -->|"Ceph"| DB2C["2× EPYC 16-32C, 256-512 GB<br/>RBD, 4× 25/100GbE"]
    DBS -->|"Enterprise"| DB3{"Storage"}
    DB3 -->|"FC SAN"| DB3F["2× EPYC 48-128C, 512-2048 GB<br/>SAN LUN + 2× FC 32/64G"]
    DB3 -->|"Ceph"| DB3C["2× EPYC 32-64C, 256-512 GB<br/>RBD, 4× 100GbE"]
    DBS -->|"Cloud"| DBC["RDS/Azure SQL/CloudSQL<br/>Managed, Multi-AZ"]

    DB --> ORACLE{"Oracle architecture?"}
    ORACLE -->|"Standalone"| ORA1["1-2× EPYC 8-24C<br/>64-512 GB, ASM local/FC<br/>2× 25GbE + FC 32G"]
    ORACLE -->|"Data Guard"| ORA2["2× EPYC 32-64C<br/>256-1024 GB, FC SAN<br/>2× 25/100GbE + 2× FC 64G<br/>2× 25GbE (DG sync)"]
    ORACLE -->|"RAC 2-4 nodes"| ORA3["Per node: 2× EPYC 32-64C<br/>512-2048 GB, FC SAN<br/>2× 100GbE (app)<br/>2× FC 64G (storage)<br/>2× 100GbE RoCE (interconnect)"]
    ORACLE -->|"Exadata"| ORA4["Engineered system<br/>2-8 DB servers + 3-18 storage<br/>RoCE 100GbE, Smart Scan<br/>15-30 kW/rack"]

    HV --> HVS{"Number of hosts"}
    HVS -->|"2-3"| HV1["1× EPYC 12-24C, 128-256 GB<br/>RAID5/6 SSD, 2-4× 10/25GbE"]
    HVS -->|"3-6"| HV2{"HCI"}
    HV2 -->|"vSAN"| HV2V["1-2× EPYC 16-32C, 256-512 GB<br/>NVMe cache + SSD, 4× 25GbE"]
    HV2 -->|"Ceph"| HV2C["1× EPYC 12-24C, 128-256 GB<br/>4-8× HBA NVMe/SSD, 4× 25GbE"]
    HVS -->|"6+"| HV3["2× EPYC 32-64C, 512-2048 GB<br/>FC SAN 32/64G, 4-8× 25/100GbE"]
    HVS -->|"20+"| HV4["2× EPYC 64-128C, 512-1024 GB<br/>OpenStack + Ceph, 4-8× 100GbE"]

    K8s --> K8T{"Node type"}
    K8T -->|"General"| K8G["16-32C, 64-128 GB<br/>2× NVMe, 2× 25GbE"]
    K8T -->|"Memory"| K8M["32-64C, 256-512 GB<br/>3× NVMe, 2× 25GbE"]
    K8T -->|"GPU"| K8U["32-64C, 512-1024 GB<br/>6-10× NVMe, H100/B200, 4× 100GbE"]
    K8T -->|"Storage"| K8S["16-32C, 64-128 GB<br/>6-14× HBA NVMe, 4× 25GbE"]

    AI --> AIT{"Purpose"}
    AIT -->|"Training"| AITR["GPU H100/B200, NVLink<br/>InfiniBand 400Gb/s, liquid cooling"]
    AIT -->|"Inference"| AIIR["A100/H200, MIG<br/>PCIe 5.0, 2× 100GbE"]

    ST --> STT{"Type"}
    STT -->|"Ceph OSD"| STC["EPYC (PCIe lanes)<br/>4-8 GB/OSD, HBA, 2× 25/100GbE"]
    STT -->|"MinIO"| STM["EPYC 8-16C, 32-64 GB<br/>4-16× NVMe direct, 2× 25/100GbE"]
    STT -->|"NAS (ZFS)"| STN["EPYC 16-32C, 64-128 GB<br/>RAID-Z, SLOG NVMe, 2-4× 10/25GbE"]

    WEB --> WEBE["EPYC high clock, 8-32C<br/>32-128 GB, 2× NVMe RAID1, 2× 10/25GbE"]

Connectivity summary by platform

Platform	App / VM network	Storage network	Replication / Cluster	Management
DB local (small)	2× 25 GbE LACP	—	2× 25 GbE (shared)	1× 1 GbE (iLO)
DB local (medium)	2× 25/100 GbE LACP	—	2× 25 GbE dedicated	1× 1 GbE (iLO)
DB FC SAN	2× 25/100 GbE LACP	2× 32/64 Gb FC multipath	FC replication	1× 1 GbE (iLO) + SAN mgmt
DB Ceph	2× 25/100 GbE	2× 25/100 GbE (Ceph public)	2× 25/100 GbE (Ceph cluster)	1× 1 GbE (iLO)
Hypervisor local	2-4× 10/25 GbE LACP	— (local)	—	1× 1 GbE (iLO)
Hypervisor vSAN	2× 25/100 GbE LACP	2× 25/100 GbE (vSAN)	vSAN traffic	1× 1 GbE (iLO)
Hypervisor FC SAN	2-4× 25/100 GbE LACP	2× 32/64 Gb FC multipath	2× 25 GbE (vMotion)	1× 1 GbE (iLO)
Hypervisor Ceph	2× 25/100 GbE LACP	2× 25/100 GbE (Ceph)	2× 25 GbE (migration)	1× 1 GbE (iLO)
Kubernetes	2× 25/100 GbE	2× 25/100 GbE (Ceph/Longhorn)	2× 25/100 GbE (K8s cluster)	1× 1 GbE (BMC)
Web/API	2× 10/25 GbE LACP	—	—	1× 1 GbE (BMC)
Oracle Standalone	2× 25 GbE LACP	2× FC 32G or NVMe local	Data Guard 2× 25 GbE	1× 1 GbE (iLO) + ASM mgmt
Oracle Data Guard	2× 25/100 GbE LACP	2× FC 64G multipath	2× 25 GbE (DG sync)	1× 1 GbE (iLO) + SAN mgmt
Oracle RAC	2× 100 GbE LACP (VIP/SCAN)	2× FC 64G multipath	2× 100 GbE RoCE (Cache Fusion)	1× 1 GbE (iLO) + Clusterware
Oracle Exadata	4-8× 100 GbE RoCE	NVMe over Fabric	RDMA interconnect	Exadata CLI + OEDA

Sources

Links, books and standards: sources/infrastructure/sources.md

Last revision: 2026-06-03
