knowledge-base/STORAGE.en.md
Stanislav Hubacek ef3c2f75b1 18.6.2026
2026-06-18 16:25:33 +02:00

# 💾 Storage infrastructure
## Storage types
| Type | Description | Latency | Use case |
|-----|-------|---------|----------|
| **DAS** (Direct Attached) | Disks directly in server | <0.1 ms | OS, cache, local data |
| **SAN** (Storage Area Network) | Block devices over network | <1 ms | Databases, VM datastores |
| **NAS** (Network Attached Storage) | File access (NFS, SMB) | 1-3 ms | Shared files, home dirs |
| **Object storage** | REST API, flat namespace | 10-100 ms | Backups, media, big data |
## Protocols
| Protocol | Type | Speed | Note |
|----------|-----|----------|----------|
| **Fibre Channel** | SAN | 8/16/32/64 Gbps | Low latency, dedicated network |
| **iSCSI** | SAN (IP) | 1/10/25 GbE | Cheaper, over ethernet |
| **NVMe-oF** | SAN (NVMe) | 25/50/100 GbE | Lowest latency, emerging |
| **NFS** | NAS | 1/10/25 GbE | Universal, simple |
| **SMB/CIFS** | NAS | 1/10/25 GbE | Windows native |
| **S3 API** | Object | — | Standard for object storage |
## RAID
| RAID | Min. disks | Capacity | Protection | Read speed | Write speed | Use case |
|------|-----------|----------|---------|---------------|----------------|----------|
| **0** | 2 | 100 % | None | N × (striping) | N × | Temp data, cache (risky) |
| **1** | 2 | 50 % | 1 disk | N × (mirror) | 1 × | OS disk, critical data |
| **5** | 3 | 67-94 % | 1 disk | N-1 × | N-1 × (parity write penalty) | Universal file/VM storage |
| **6** | 4 | 50-88 % | 2 disks | N-2 × | N-2 × (double parity) | Large capacities, important data |
| **10** | 4 | 50 % | 1/mirror | N × | N/2 × | Databases, VM, high-performance |
| **50** | 6 | 67-94 % | 1/stripe | N-1 × | N-1 × | Large capacity + performance |
| **60** | 8 | 50-88 % | 2/stripe | N-2 × | N-2 × | Enterprise |
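
The capacity column follows directly from how many disks each level gives up to redundancy. A minimal sketch of that arithmetic, assuming equal-size disks, two-way mirrors for RAID 1/10, and a hypothetical `groups` parameter for the number of RAID 5/6 sub-arrays inside RAID 50/60:

```python
MIN_DISKS = {"0": 2, "1": 2, "5": 3, "6": 4, "10": 4, "50": 6, "60": 8}


def raid_usable_fraction(level: str, disks: int, groups: int = 2) -> float:
    """Return usable capacity as a fraction of raw capacity.

    Assumptions (not from any vendor tool): all disks have the same size,
    RAID 1/10 use two-way mirrors, and RAID 50/60 are built from `groups`
    equally sized RAID 5/6 sub-arrays.
    """
    if level not in MIN_DISKS:
        raise ValueError(f"unsupported RAID level: {level}")
    if disks < MIN_DISKS[level]:
        raise ValueError(f"RAID {level} needs at least {MIN_DISKS[level]} disks")

    if level == "0":
        return 1.0                      # pure striping, no redundancy
    if level in ("1", "10"):
        return 0.5                      # every block is stored twice
    if level == "5":
        return (disks - 1) / disks      # one disk's worth of parity
    if level == "6":
        return (disks - 2) / disks      # two disks' worth of parity

    # RAID 50/60: parity is paid once per sub-array.
    if disks % groups != 0:
        raise ValueError("disks must divide evenly into groups")
    parity_per_group = 1 if level == "50" else 2
    return (disks - groups * parity_per_group) / disks


if __name__ == "__main__":
    for level, n in [("5", 3), ("5", 16), ("6", 4), ("6", 16), ("50", 6), ("60", 8)]:
        print(f"RAID {level}, {n} disks: {raid_usable_fraction(level, n):.0%}")
```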
### Stripe size
- Small stripe (16-64 KB) — better IOPS, worse throughput (databases, OLTP)
- Large stripe (128-1024 KB) — better throughput, worse IOPS (video, media, backup)
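- Write hole on RAID 5/6: if power is lost after data is written but before the matching parity is updated, data and parity in that stripe become inconsistent, which can corrupt a later rebuild (prevention: non-volatile write cache, battery-backed RAID controller, or a write journal)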
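
How the stripe unit size decides which disks a request lands on can be sketched for plain RAID 0. The function names and the layout (chunks laid out round-robin across disks) are illustrative assumptions, not a specific controller's implementation:

```python
def locate_raid0(offset: int, chunk_size: int, disks: int) -> tuple[int, int]:
    """Map a logical byte offset to (disk index, byte offset on that disk)."""
    chunk_index = offset // chunk_size      # which chunk of the logical volume
    disk = chunk_index % disks              # chunks rotate across the disks
    row = chunk_index // disks              # stripe row on each disk
    return disk, row * chunk_size + offset % chunk_size


def disks_touched(offset: int, length: int, chunk_size: int, disks: int) -> int:
    """Count how many distinct disks a single request spans."""
    first_chunk = offset // chunk_size
    last_chunk = (offset + length - 1) // chunk_size
    return min(disks, last_chunk - first_chunk + 1)


if __name__ == "__main__":
    KIB = 1024
    # An aligned 8 KiB read with 64 KiB chunks stays on one disk.
    print(disks_touched(0, 8 * KIB, 64 * KIB, disks=4))
    # A 1 MiB read with 128 KiB chunks spans all four disks.
    print(disks_touched(0, 1024 * KIB, 128 * KIB, disks=4))
```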
## Software-Defined Storage (SDS)
| Tool | Type | Use case |
|---------|-----|----------|
| **Ceph** | Object/Block/File (RADOS) | Universal SDS, OpenStack, Kubernetes |
| **MinIO** | Object (S3 API) | High-performance S3, AI/ML data lake |
| **GlusterFS** | Distributed File | Shared filesystem, POSIX |
| **Longhorn** | Block (Kubernetes) | K8s PVC, microservices |
| **Linstor** | Block (DRBD + LVM) | Linux SDS, Kubernetes |
| **VMware vSAN** | Block (HCI) | VMware ecosystem |
| **StarWind** | Block (HCI) | Hyper-V / VMware |
### Ceph
**Architecture**:
```
RADOS (Reliable Autonomic Distributed Object Store)
├── Monitors (MON) — cluster map, quorum (3/5)
├── Managers (MGR) — dashboard, balancer, orchestrator
├── OSDs (Object Storage Daemons) — data + replication
└── MDS (Metadata Server) — CephFS only
```
**CRUSH map** (Controlled Replication Under Scalable Hashing):
- Algorithm for calculating data placement (no central index)
- Layers: Root → Datacenter → Rack → Host → OSD
- Failure domain: replication across racks / hosts
- Example (replicated rule under root `default` with `host` as the failure domain): `ceph osd crush rule create-replicated replicated_rule default host`
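
The idea of computing placement instead of looking it up can be illustrated with a toy example. This is not the CRUSH algorithm: it swaps in rendezvous (highest-random-weight) hashing, and the cluster layout and function names are hypothetical. It only shows two properties listed above: any client can derive the same placement from the cluster map alone, and replicas land in distinct failure domains (hosts).

```python
import hashlib


def _score(object_name: str, node: str) -> int:
    """Deterministic pseudo-random score for an (object, node) pair."""
    digest = hashlib.sha256(f"{object_name}:{node}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(object_name: str, osds_by_host: dict[str, list[str]], replicas: int = 3) -> list[str]:
    """Pick one OSD on each of `replicas` distinct hosts, with no central index."""
    if replicas > len(osds_by_host):
        raise ValueError("not enough hosts for the requested failure domain")
    # Rank hosts by score and keep the top `replicas` (the failure domains).
    hosts = sorted(osds_by_host, key=lambda h: _score(object_name, h), reverse=True)
    # Within each chosen host, pick the highest-scoring OSD.
    return [
        max(osds_by_host[h], key=lambda osd: _score(object_name, osd))
        for h in hosts[:replicas]
    ]


if __name__ == "__main__":
    cluster = {
        "host-a": ["osd.0", "osd.1"],
        "host-b": ["osd.2", "osd.3"],
        "host-c": ["osd.4", "osd.5"],
        "host-d": ["osd.6", "osd.7"],
    }
    print(place("rbd_data.example", cluster))
```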
**Access interfaces**:
| Interface | Type | Use case |
|----------|-----|----------|
| **RBD** (RADOS Block Device) | Block | VM images, Kubernetes PVC (csi-rbd) |
| **RGW** (RADOS Gateway) | Object (S3/Swift API) | S3-compatible storage, backup |
| **CephFS** | File (POSIX) | Shared filesystem, home dirs |
| **NFS-Ganesha** | File (NFS) | NFS export over CephFS |
**Erasure coding**:
- K+M (data + parity chunks), e.g. 8+3 (8 data, 3 parity)
- More space-efficient than 3× replication (1.375× vs 3×)
- Higher CPU overhead, lower IOPS
- Recommended for cold data (RGW) instead of replication
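
The 1.375× figure is simply (K + M) / K. A small sketch of the comparison with replication (function names are illustrative):

```python
def ec_overhead(k: int, m: int) -> float:
    """Raw bytes stored per usable byte for a K+M erasure-coded pool."""
    if k < 1 or m < 0:
        raise ValueError("k must be >= 1 and m >= 0")
    return (k + m) / k


def raw_needed_tb(usable_tb: float, overhead: float) -> float:
    """Raw capacity required to hold `usable_tb` of data."""
    return usable_tb * overhead


if __name__ == "__main__":
    usable = 100.0  # TB of user data
    print("EC 8+3:        ", raw_needed_tb(usable, ec_overhead(8, 3)), "TB raw")
    print("3x replication:", raw_needed_tb(usable, 3.0), "TB raw")
    # Both layouts survive the loss of more than one chunk/copy:
    # EC 8+3 tolerates 3 lost chunks, 3x replication tolerates 2 lost copies.
```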
## Enterprise storage vendors
### Hitachi VSP (Virtual Storage Platform)
| Model | Architecture | Max capacity | IOPS / Latency | Protocols | Use case |
|-------|-------------|--------------|----------------|-----------|----------|
| **VSP 5200/5600** | Active-active, scale-up/out, 2-12 controllers | 69.3 PB raw, 287 PBe | 33M IOPS, 39 µs | FC-NVMe 32Gb, FC 16/32Gb, FICON 16Gb, iSCSI 10Gb | Mission-critical, mainframe, enterprise consolidation |
| **VSP E590/E790/E1090** | Symmetric active-active, up to 65 nodes/130 controllers | 10.62 PB raw (E1090) | 8.4M IOPS, <41 µs | FC 32Gb, iSCSI 25Gb, FC-NVMe 32Gb | Midrange enterprise, hybrid workloads |
**Key features**: SVOS common across entire portfolio, AI-driven data reduction 4:1 guarantee, Global-Active Device metro clustering, 8 nines availability (HW), 100% data availability guarantee.
---
### Huawei OceanStor Dorado
| Model | Architecture | Max capacity | IOPS / Latency | Protocols | Use case |
|-------|-------------|--------------|----------------|-----------|----------|
| **Dorado 8000/18000 V6** | SmartMatrix full-mesh, up to 32 controllers | 32 TB cache, 6400 SSD | 40M IOPS, 0.05 ms | FC 32/64Gb, FC-NVMe, iSCSI, NFS, SMB, NVMe/RoCE, S3 | Mission-critical, finance, govt, carrier |
| **Dorado 8000/18000 V7 (2025)** | SmartMatrix 4.0, up to 64/128 controllers | 500 PB+ | >100M IOPS, 0.03 ms | FC, RoCE, NVMe/TCP, NFS, SMB, S3 | AI workloads, converged block/file/object |
**Key features**: SmartMatrix tolerates the failure of 7 out of 8 controllers, FlashEver (3-gen online HW upgrade in 10 years), RAID-TP (triple SSD failure), DPU-based SmartNIC, ML-based I/O prefetch, 100% ransomware detection (Tolly), #1 SPC-1 benchmark.
---
### Dell PowerStore & PowerMax
| Model | Architecture | Max capacity | IOPS / Latency | Protocols | Use case |
|-------|-------------|--------------|----------------|-----------|----------|
| **PowerStore 1500/5500/9500 (Gen 3)** | Active-active dual-node, PCIe Gen5, DDR5, RDMA 200GbE | 1.2 PB raw, 5.8 PBe | 3× IOPS vs Gen2 | FC 32/64Gb, iSCSI, NVMe/FC, NVMe/TCP, NFSv4, SMB3 | Midrange-to-high-end, VMware, containerized |
| **PowerMax 2500/8500** | Scale-out NVMe, Dynamic Fabric, up to 16 nodes | 8.8 PBe (2500), 18 PBe (8500) | 6 nines availability | FC 64Gb, FICON, NVMe/FC, NVMe/TCP, iSCSI, NFS, SMB | Mission-critical, mainframe, OLTP, cyber vault |
**Key features**: PowerStore 6:1 DRR guarantee, unified block/file/vVols out of box, Cyber Detect AI anomaly; PowerMax 5:1 DRR, Secure Snapshots 65M, SRDF/Metro, Flexible RAID up to 92% efficient, FIPS 140-3.
---
### HPE Alletra
| Model | Architecture | Max capacity | IOPS / Latency | Protocols | Use case |
|-------|-------------|--------------|----------------|-----------|----------|
| **Alletra 5000** | Active-active hybrid flash, dual controller | 1.2 PB raw | 99.9999% guarantee | FC, iSCSI | Mixed primary + secondary, cost-efficient hybrid |
| **Alletra 6000** | Active-active all-NVMe, dual controller | ~368 TB usable | <100 µs | FC, iSCSI | Business-critical DB, VDI, VMware |
| **Alletra 9000** | Active-active all-NVMe, multi-node scale-out | 24 PB+ usable | ~2-3M IOPS, <150 µs | FC, iSCSI, NVMe/FC | Mission-critical ERP, AI, consolidation |
| **Alletra Storage MP** | Disaggregated modular, block + file + object | 5.8 PB block, 11.8 PB object | 100% availability guarantee | FC, iSCSI, NVMe/FC, NFS, SMB, S3 | Multi-protocol consolidation, AI/analytics |
**Key features**: Triple Parity RAID (5000), InfoSight AI Ops, HPE GreenLake as-a-service, non-disruptive controller upgrades (MP), 100% data availability guarantee.
---
### Infinidat
| Model | Architecture | Max capacity | IOPS / Latency | Protocols | Use case |
|-------|-------------|--------------|----------------|-----------|----------|
| **InfiniBox SSA G4** | Triple-active controller, AMD EPYC PCIe 5.0, DDR5 | 1.97 PB usable / 5.9 PBe | 2.24M IOPS, 35 µs | FC 32Gb, 25/100GbE, NVMe-oF/TCP, iSCSI, NFS, SMB, S3 | Mission-critical Oracle/SQL, multi-site DR |
| **InfiniBox G4 Hybrid** | Triple-active hybrid (HDD + flash cache) | 10.9 PB raw / 32.8 PBe | 2.24M IOPS, 64 GB/s | FC, Ethernet, NVMe-oF, iSCSI, NFS, SMB, S3 | Backup, massive unstructured data |
**Key features**: Only 3-way active on the market, Neural Cache (ML-driven), InfiniRAID, Immutable snapshots, 100% availability + 1-min snapshot recovery guarantee, everything included in base price (no extra licensing).
---
### Pure Storage FlashArray
| Model | Architecture | Max capacity | IOPS / Latency | Protocols | Use case |
|-------|-------------|--------------|----------------|-----------|----------|
| **FlashArray//X (X20-X90 R5)** | Active-active, NVMe DirectFlash | 1.2 PB raw / 4.4 PBe | 250 µs, 5:1 DRR | FC, NVMe/FC, NVMe/RoCE, NVMe/TCP, iSCSI, NFS, SMB | Mission-critical DB, VMware, enterprise |
| **FlashArray//C (C50-C90 R5)** | Active-active, QLC DirectFlash | 4.2 PB raw / 16.3 PBe | 5:1 DRR | FC, NVMe-oF, iSCSI, NFS, SMB | Capacity-optimized, backup, file |
| **FlashArray//XL (XL190)** | Active-active, 40 DirectFlash modules | 1.9 PB raw / 9.4 PBe | >4M IOPS, <100 µs, 45 GB/s | FC 64Gb, 100GbE RoCE, NVMe/FC, NVMe/TCP, NFS, SMB | Largest DB consolidation, OLTP |
**Key features**: DirectFlash (no FTL layer), 99.9999% availability, Evergreen (never forklift upgrade), Purity OS unified across entire portfolio, ActiveCluster/ActiveDR, Pure1 AIOps.
---
### Lenovo ThinkSystem
| Model | Architecture | Max capacity | IOPS / Latency | Protocols | Use case |
|-------|-------------|--------------|----------------|-----------|----------|
| **DM Series** (DM3200F/5200F/7200F) | Active-active, all-NVMe, NetApp ONTAP | 1.8 PB raw / 6.8 PBe | Up to 120 NVMe SSD | FC 64Gb, iSCSI, NVMe/FC, NFS, SMB, S3 | Unified block/file, AI/ML, VMware |
| **DG Series** (DG5200/7200) | Active-active, all-QLC, ONTAP | 7.4 PB raw / 27 PBe | QLC economics | FC, NVMe/FC, NVMe/TCP, iSCSI, NFS, SMB, S3 | Capacity-optimized, backup, archive |
| **DE Series** (DE4000F-DE6600F) | Active-active, SAS/NVMe hybrid | 1.84 PB raw | 2M IOPS, <100 µs, 44 GB/s | FC 32Gb, iSCSI 25Gb, NVMe/FC, SAS, NVMe/RoCE | HPC, analytics, video surveillance |
**Key features**: DM/DG use ONTAP (SnapMirror, SnapVault, FabricPool, RAID-DP/RAID-TEC); cluster scale-out up to 12 HA pairs; DE series best price/performance in portfolio.
---
### Synology
| Model | Architecture | Max capacity | Protocols | Use case |
|-------|-------------|--------------|-----------|----------|
| **UC3200/UC3400** | Active-active dual-controller, SAS backend | 576 TB raw | iSCSI, FC 16Gb, 10/25GbE | SMB/midmarket SAN, VMware, HA |
| **DS/RS Series** (RS3626xs+, RS6426xs+) | Single-controller / HA pair, Btrfs | 864 TB raw, 1 PB volume | SMB, NFS, iSCSI, FC (HBA) | SME all-in-one NAS/SAN, backup, surveillance |
**Key features**: DSM UC for SAN, Synology HA, Snapshot Replication (16K snapshots), VMware VAAI/ODX/ALUA, Surveillance Station, low TCO.
---
### Vendor comparison — overview
| Vendor | Flagship | Max IOPS | Max capacity | Latency | Availability guarantee | Main differentiator |
|--------|----------|----------|-------------|---------|---------------------|----------------------|
| **Hitachi** | VSP 5600 | 33M | 287 PBe | 39 µs | 8 nines (HW) | Mainframe + open; 65-node cluster |
| **Huawei** | Dorado 18000 V7 | >100M | 500 PB+ | 0.03 ms | 99.99999% | SmartMatrix; #1 SPC-1 |
| **Dell** | PowerMax 8500 | — | 18 PBe | — | 6 nines | SRDF/Metro; mainframe |
| **HPE** | Alletra 9000/MP | ~3M | 11.8 PBe | <150 µs | 100% data guarantee | InfoSight AIOps; GreenLake |
| **Infinidat** | InfiniBox SSA G4 | 2.24M | 32.8 PBe | 35 µs | 100% availability | 3-way active; Neural Cache |
| **Pure** | FlashArray//XL | >4M | 16.3 PBe | <100 µs | 99.9999% | DirectFlash; Evergreen |
| **Lenovo** | DM7200F | — | 27 PBe | — | — | ONTAP ecosystem; broad portfolio |
| **Synology** | UC3400 | 690K | 576 TB | — | — | Lowest price for active-active SAN |
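
To put the availability column in perspective, "N nines" can be converted to permitted downtime per year. A minimal sketch, assuming a 365-day year and that availability is measured over the whole year:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s, ignoring leap years


def downtime_seconds_per_year(nines: int) -> float:
    """Maximum yearly downtime allowed by an availability of N nines.

    Example: 6 nines means 99.9999 % availability, i.e. an
    unavailability of 10**-6.
    """
    if nines < 1:
        raise ValueError("nines must be >= 1")
    return SECONDS_PER_YEAR * 10.0 ** -nines


if __name__ == "__main__":
    for n in (4, 5, 6, 7, 8):
        print(f"{n} nines: {downtime_seconds_per_year(n):.3f} s/year")
```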
---
### Storage selection by use case
| Use case | Recommendation | Rationale |
|----------|-----------|-------------|
| **Mainframe + open hybrid** | Hitachi VSP / Dell PowerMax | Only ones with FICON + FC simultaneously |
| **AI/ML training** | Huawei Dorado V7 / Pure //XL | Highest IOPS, lowest latency |
| **Enterprise DB (Oracle, SQL Server)** | Infinidat / Pure //X | Low latency, consistent performance |
| **Virtualization (VMware, Hyper-V)** | Dell PowerStore / HPE Alletra 6000 | VAAI, vVols, InfoSight |
| **SMB / SME** | Synology / Lenovo DE | Low TCO, simple management |
| **Object storage / backup** | Pure //C / Lenovo DG / Infinidat Hybrid | QLC economics, high capacity |
| **Multi-protocol consolidation** | HPE Alletra MP / Huawei Dorado | Block + file + object in one platform |
## Decision diagram — storage platform selection
```mermaid
flowchart TD
Start(["Storage requirement"]) --> PROTO{"Access type"}
PROTO -->|"Block (SAN)"| BLOCK
PROTO -->|"File (NAS)"| FILE
PROTO -->|"Object"| OBJECT
BLOCK --> BPERF{"Performance tier"}
BPERF -->|"Tier 0/1<br/>< 100 µs, > 1M IOPS"| BT1["Infinidat / Pure //XL<br/>Huawei Dorado V7<br/>FC-NVMe, NVMe-oF"]
BPERF -->|"Tier 2<br/>100-500 µs"| BT2["Dell PowerStore / HPE Alletra 6000<br/>Hitachi VSP / Lenovo DM<br/>FC 32G, iSCSI 25GbE"]
BPERF -->|"Tier 3<br/>SME / low-cost"| BT3["Synology UC3400<br/>Lenovo DE / Dell PowerVault<br/>iSCSI, SAS"]
BLOCK --> BECOS{"Ecosystem"}
BECOS -->|"Mainframe"| BMF["Hitachi VSP / Dell PowerMax<br/>FICON + FC simultaneously"]
BECOS -->|"VMware"| BVM["Dell PowerStore / HPE Alletra<br/>VAAI, vVols, InfoSight"]
BECOS -->|"Oracle / SQL Server"| BDB["Infinidat / Pure //X<br/>Lowest latency"]
FILE --> FSIZE{"Scaling"}
FSIZE -->|"Enterprise"| FE["HPE Alletra MP (file)<br/>Lenovo DM / Dell PowerScale<br/>NFS, SMB, multi-protocol"]
FSIZE -->|"SMB"| FS["Synology DS/RS<br/>Lenovo DE / TrueNAS<br/>Btrfs, NFS, SMB, low TCO"]
OBJECT --> OUSE{"Use case"}
OUSE -->|"Backup / archive"| OB["Pure //C / Infinidat Hybrid<br/>Lenovo DG<br/>QLC, erasure coding, low cost/TB"]
OUSE -->|"AI/ML data lake"| OM["MinIO / Pure //C<br/>High throughput S3<br/>NVMe direct, erasure coding"]
OUSE -->|"Kubernetes PVC"| OK["Ceph RBD / Longhorn / Linstor<br/>SDS on K8s<br/>CSI, replication, snapshots"]
```
## OpenStack Storage
OpenStack offers three main storage services:
| Service | Type | Description |
|--------|-----|-------|
| **Cinder** | Block storage | Persistent volumes for instances (iSCSI, NFS, Ceph RBD) |
| **Swift** | Object storage | RESTful object store (S3-compatible via middleware) |
| **Manila** | File storage | Shared file systems (NFS, CIFS) as a managed service |
### Cinder (Block Storage)
- Multi-backend support: LVM, Ceph RBD, NFS, iSCSI, Fibre Channel
- Snapshots, cloning, encryption at rest
- Cinder scheduler for volume distribution across backends
- QoS specs for IOPS/bandwidth limits
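
Volume distribution across backends can be pictured as a filter step followed by a weighing step. The sketch below is a toy model of that idea only; the `Backend` class, its fields, and the "most free space wins" rule are assumptions for illustration and do not reproduce Cinder's actual scheduler code or configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    """Hypothetical view of a storage backend as seen by a scheduler."""
    name: str
    volume_type: str
    free_gb: int


def schedule(backends: list[Backend], size_gb: int, volume_type: str) -> Backend:
    """Pick a backend for a new volume: filter first, then weigh."""
    # Filter: keep backends of the right type with enough free capacity.
    candidates = [
        b for b in backends
        if b.volume_type == volume_type and b.free_gb >= size_gb
    ]
    if not candidates:
        raise RuntimeError("no valid backend found")
    # Weigh: prefer the backend with the most free space.
    return max(candidates, key=lambda b: b.free_gb)


if __name__ == "__main__":
    pool = [
        Backend("ceph-ssd", "fast", free_gb=4_000),
        Backend("ceph-ssd-2", "fast", free_gb=9_000),
        Backend("lvm-hdd", "bulk", free_gb=50_000),
    ]
    print(schedule(pool, size_gb=500, volume_type="fast").name)
```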
### Swift (Object Storage)
- Alternative to S3 for on-prem object storage
- Ring-based data distribution (consistent hashing)
- Multi-region replication (syncopy)
- Stateless REST API (no single point of failure)
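
Ring-based distribution can be sketched as "hash the object path, keep the top bits as a partition number, look the partition up in a replica-to-device table". The version below is simplified: a real Swift cluster also mixes a per-cluster hash prefix/suffix into the path, and the table here is built by a made-up round-robin rule rather than by the ring builder.

```python
import hashlib
import struct


def partition_for(path: str, part_power: int) -> int:
    """Map an object path to one of 2**part_power partitions."""
    digest = hashlib.md5(path.encode("utf-8")).digest()
    # Use the top `part_power` bits of the first 32 bits of the hash.
    return struct.unpack_from(">I", digest)[0] >> (32 - part_power)


def devices_for(path: str, part_power: int, replica2part2dev: list[list[int]]) -> list[int]:
    """Return the device id holding each replica of the object."""
    part = partition_for(path, part_power)
    return [row[part] for row in replica2part2dev]


if __name__ == "__main__":
    PART_POWER = 4                      # 16 partitions (toy value)
    DEVICES = 6
    # Hypothetical table: replica r of partition p lives on device (p + r) % 6.
    table = [[(p + r) % DEVICES for p in range(2 ** PART_POWER)] for r in range(3)]
    print(devices_for("/AUTH_demo/photos/cat.jpg", PART_POWER, table))
```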
### Manila (Shared File Systems)
- Managed NFS/CIFS for sharing between instances
- Backends: NetApp, Dell EMC, CephFS, GlusterFS
- Access rules (IP-based, cert-based, user-based)
- Use case: HPC cluster home directories, NAS for legacy apps
### Container storage (OpenStack + Ceph)
Ceph is the most common storage backend for OpenStack and can serve every service: Cinder volumes and Glance images on RBD, Manila shares on CephFS, and a Swift-compatible object API through RGW.
## Big Data storage
### HDFS cluster
HDFS is the primary storage for the Hadoop ecosystem (on-prem). Typical configuration:
| Parameter | Value | Note |
|-----------|-------|------|
| **Disk per DataNode** | 8-24 × HDD (14-22 TB) + 2× NVMe (metadata, cache) | Balance capacity / performance |
| **Replication factor** | 3× | Rack-aware |
| **Network** | 2× 25/100 GbE (data) + 1× 1 GbE (management) | Data + replication traffic |
| **RAM** | 64-256 GB (OS cache + metadata) | HDFS cache + OS buffer cache |
| **CPU** | 16-32 cores | HDFS overhead is low |
| **NameNode HA** | Active + Standby + JN (JournalNode) | Quorum-based HA |
| **Use case** | Sequential read/write, large files, Spark on YARN | |
**Model cluster — 1 PB usable:**
- 14× DataNode (12× 18 TB HDD, 2× 1.9 TB NVMe)
- 2× NameNode (HA, 256 GB RAM)
- 3× JournalNode (small VMs)
- Replication 3× → raw ~ 3.0 PB (14 × 12 × 18 TB = 3,024 TB) for ~ 1 PB usable
- Network: 25 GbE for data, 100 GbE for shuffle-heavy Spark
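
A rough sizing check for a cluster like this: usable capacity is raw HDD capacity divided by the replication factor. The sketch uses decimal units (1 PB = 1,000 TB) and ignores headroom for temporary data, non-DFS usage and rebalancing, so treat the result as a lower bound on node count.

```python
import math


def usable_tb(nodes: int, disks_per_node: int, disk_tb: float, replication: int = 3) -> float:
    """Usable HDFS capacity in TB for a given DataNode layout."""
    return nodes * disks_per_node * disk_tb / replication


def nodes_for_usable(target_tb: float, disks_per_node: int, disk_tb: float,
                     replication: int = 3) -> int:
    """Smallest DataNode count that reaches `target_tb` of usable capacity."""
    raw_needed_tb = target_tb * replication
    return math.ceil(raw_needed_tb / (disks_per_node * disk_tb))


if __name__ == "__main__":
    n = nodes_for_usable(1_000, disks_per_node=12, disk_tb=18)
    print(n, "DataNodes ->", usable_tb(n, 12, 18), "TB usable")
```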
### Object storage as Data Lake (S3/GCS/MinIO)
For new projects (Spark on K8s, Iceberg/Delta, lakehouse), object storage is preferred over HDFS:
| Platform | Advantages | Limits |
|----------|-----------|--------|
| **MinIO** (on-prem) | S3 API, erasure coding, NVMe direct, high throughput | Single tenant (per cluster) |
| **Pure //C** (on-prem) | QLC NVMe, dedupe, S3 + NFS | Higher $/TB |
| **AWS S3** (cloud) | Unlimited capacity, Iceberg/Delta support | Egress fees |
| **Azure ADLS** (cloud) | Hierarchical namespace (HNS), POSIX-like ACLs | Vendor lock-in |
| **GCP GCS** (cloud) | Uniform + fine-grained ACLs, object versioning | Region restrictions |
### Comparison: HDFS vs Object Storage for Big Data
| Criteria | HDFS | Object Storage (S3/MinIO) |
|----------|------|-------------------------|
| **Architecture** | Master/worker (NameNode is a SPOF unless HA is configured) | Distributed, no SPOF (erasure coding) |
| **Consistency** | Strong (single writer per file) | Strong read-after-write (AWS S3 since December 2020, MinIO); check other S3-compatible stores |
| **Throughput** | High (rack-aware, locality) | High (network-bound) |
| **Scaling** | Horizontal (DataNode) | Horizontal (stateless) |
| **Cost** | Low (HDD) | Medium (S3 API) |
| **Metadata** | NameNode (1M blocks ~ 1 GB RAM) | Object-level (flat namespace) |
| **Spark integration** | Native (locality-optimized) | S3A connector (Hadoop-compatible filesystem) |
| **2026 trend** | Legacy, declining | Standard for new projects |
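
The metadata row explains why small files hurt HDFS: NameNode memory grows with the number of blocks, not with the data volume. A back-of-the-envelope sketch using the "1M blocks ~ 1 GB RAM" rule of thumb from the table, assuming the default 128 MiB block size and uniformly sized files:

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3
PIB = 1024 ** 5


def estimate_blocks(total_bytes: int, avg_file_bytes: int, block_bytes: int = 128 * MIB) -> int:
    """Estimate the HDFS block count for data made of equally sized files."""
    files = math.ceil(total_bytes / avg_file_bytes)
    blocks_per_file = math.ceil(avg_file_bytes / block_bytes)  # a file uses at least 1 block
    return files * blocks_per_file


def namenode_heap_gb(blocks: int, gb_per_million_blocks: float = 1.0) -> float:
    """Apply the rule of thumb: about 1 GB of NameNode RAM per million blocks."""
    return blocks / 1_000_000 * gb_per_million_blocks


if __name__ == "__main__":
    for label, size in [("1 GiB files", GIB), ("1 MiB files", MIB)]:
        blocks = estimate_blocks(PIB, size)
        print(f"1 PiB of {label}: {blocks:,} blocks, ~{namenode_heap_gb(blocks):,.1f} GB heap")
```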
For more information about Big Data see [BIG-DATA.en.md](BIG-DATA.en.md).
## Sources
Links, books and standards: [sources/infrastructure/sources.en.md](sources/infrastructure/sources.en.md)
### Recommended reading
| Book | Authors | ISBN | Description |
|-------|--------|------|-------|
| Storage Systems | Ganger, Gibson | 978-1680837540 | Textbook covering the design, implementation and operation of storage systems — from device characteristics through OS, databases and networking to server distribution and large-scale systems. An essential resource for storage infrastructure architects. |
*Last revision: 2026-06-03*