18.6.2026

2026-06-18 16:25:33 +02:00
parent b53714113c
commit ef3c2f75b1
43 changed files with 3637 additions and 129 deletions
--- a/STORAGE.en.md
+++ b/STORAGE.en.md
@@ -270,9 +270,60 @@ OpenStack offers three main storage services:

 Ceph is the most common storage backend for OpenStack: Cinder (RBD), Swift (RGW), Manila (CephFS), Glance (RBD images).

+## Big Data storage
+
+### HDFS cluster
+
+HDFS is the primary storage for the Hadoop ecosystem (on-prem). Typical configuration:
+
+| Parameter | Value | Note |
+|-----------|-------|------|
+| **Disks per DataNode** | 8–24 × HDD (14–22 TB) + 2× NVMe (metadata, cache) | Balance capacity / performance |
+| **Replication factor** | 3× | Rack-aware |
+| **Network** | 2× 25/100 GbE (data) + 1× 1 GbE (management) | Data + replication traffic |
+| **RAM** | 64–256 GB (OS cache + metadata) | HDFS cache + OS buffer cache |
+| **CPU** | 16–32 cores | HDFS overhead is low |
+| **NameNode HA** | Active + Standby + JN (JournalNode) | Quorum-based HA |
+| **Use case** | Sequential read/write, large files, Spark on YARN | |
+
+**Model cluster — 1 PB usable:**
+
+- 14× DataNode (12× 18 TB HDD, 2× 1.9 TB NVMe)
+- 2× NameNode (HA, 256 GB RAM)
+- 3× JournalNode (small VMs)
+- Replication 3× → raw ≈ 3.0 PB HDD (14 × 12 × 18 TB), ≈ 1.0 PB usable
+- Network: 25 GbE for data, 100 GbE for shuffle-heavy Spark
+
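+As a sanity check on the sizing above, the relation between raw and usable capacity under replication can be written out. This is a sketch that assumes decimal units (1 PB = 1000 TB), counts only the HDD tier, and ignores filesystem overhead and free-space headroom; the symbols $N$ (DataNodes), $d$ (disks per node), $c$ (disk capacity) and $r$ (replication factor) are introduced here for illustration.
+
+$$
+C_{\text{raw}} = N \cdot d \cdot c, \qquad C_{\text{usable}} = \frac{C_{\text{raw}}}{r}
+$$
+
+With $d = 12$, $c = 18\ \text{TB}$ and $r = 3$, each DataNode contributes $216\ \text{TB}$ raw, or $72\ \text{TB}$ usable, so reaching 1 PB usable takes
+
+$$
+N \ge \left\lceil \frac{1000\ \text{TB} \cdot 3}{12 \cdot 18\ \text{TB}} \right\rceil = \lceil 13.9 \rceil = 14
+$$
+
+DataNodes, before any headroom for growth or rebalancing.
+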
+### Object storage as Data Lake (S3/GCS/MinIO)
+
+For new projects (Spark on K8s, Iceberg/Delta, lakehouse), object storage is preferred over HDFS:
+
+| Platform | Advantages | Limits |
+|----------|-----------|--------|
+| **MinIO** (on-prem) | S3 API, erasure coding, NVMe direct, high throughput | Single tenant (per cluster) |
+| **Pure //C** (on-prem) | QLC NVMe, dedupe, S3 + NFS | Higher $/TB |
+| **AWS S3** (cloud) | Unlimited capacity, Iceberg/Delta support | Egress fees |
+| **Azure ADLS** (cloud) | Hierarchical namespace (HNS), POSIX-like ACLs | Vendor lock-in |
+| **GCP GCS** (cloud) | Uniform + fine-grained ACLs, object versioning | Region restrictions |
+
+### Comparison: HDFS vs Object Storage for Big Data
+
+| Criteria | HDFS | Object Storage (S3/MinIO) |
+|----------|------|-------------------------|
+| **Architecture** | Master/worker (NameNode is a SPOF unless HA is configured) | Distributed, no SPOF (erasure coding) |
+| **Consistency** | Strong (single writer per file) | Strong read-after-write (AWS S3 since December 2020, MinIO) |
+| **Throughput** | High (rack-aware, locality) | High (network-bound) |
+| **Scaling** | Horizontal (DataNode) | Horizontal (stateless) |
+| **Cost** | Low (HDD) | Medium (S3 API) |
+| **Metadata** | NameNode (1M blocks ~ 1 GB RAM) | Object-level (flat namespace) |
+| **Spark integration** | Native (locality-optimized) | S3A connector (Hadoop-compatible file system) |
+| **2026 trend** | Legacy, declining | Standard for new projects |
+
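+The NameNode rule of thumb in the table (about 1 GB of RAM per 1M blocks) can be turned into a rough heap estimate. A sketch, assuming the default 128 MiB block size, files that fill whole blocks, and 1 PB (decimal) of stored data before replication; small files push the block count, and therefore the heap, well above this best case.
+
+$$
+n_{\text{blocks}} \approx \frac{10^{15}\ \text{B}}{128 \cdot 2^{20}\ \text{B}} \approx 7.5 \times 10^{6}, \qquad \text{heap} \approx \frac{n_{\text{blocks}}}{10^{6}} \cdot 1\ \text{GB} \approx 7.5\ \text{GB}
+$$
+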
+For more on Big Data, see [BIG-DATA.en.md](BIG-DATA.en.md).
+
 ## Sources

-Links, books and standards: [sources/infrastructure/sources.md](sources/infrastructure/sources.md)
+Links, books and standards: [sources/infrastructure/sources.en.md](sources/infrastructure/sources.en.md)

 ### Recommended reading