Files
knowledge-base/BIG-DATA.en.md
Stanislav Hubacek ef3c2f75b1 18.6.2026
2026-06-18 16:25:33 +02:00

233 lines
11 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# 🗄️ Big Data — ecosystem, architecture, tools
## Overview
The Big Data ecosystem in 2026: "Hadoop is dead, and yet it's everywhere." HDFS has shrunk, MapReduce is effectively gone, the Cloudera/Hortonworks era is over. But YARN lives on, the Hive Metastore has changed clothes into Iceberg/Delta, and the lakehouse pattern (cheap object storage + table format + distributed engine) is the inheritance Hadoop left behind.
The modern Big Data stack has 8 layers:
1. **Storage** — HDFS, S3, GCS, ABFS, MinIO
2. **Table format** — Apache Iceberg, Delta Lake, Apache Hudi, Apache Paimon
3. **Catalog** — Hive Metastore, Unity Catalog, Polaris, Nessie, AWS Glue
4. **Batch processing** — Apache Spark, Trino-on-Spark, Dremio
5. **Stream processing** — Apache Flink, Spark Structured Streaming, Kafka Streams
6. **Distributed SQL** — Trino, Presto, StarRocks, ClickHouse
7. **Transformation** — dbt, SQLMesh
8. **Orchestration** — Apache Airflow 3.0, Dagster, Prefect, Kestra
---
## Storage
### HDFS (Hadoop Distributed File System)
| Feature | Detail |
|---------|--------|
| **Architecture** | Master/worker: NameNode (metadata) + DataNode (data) |
| **Replication** | Default 3×, configurable (rack-aware) |
| **Block size** | Default 128 MB (range 64 MB 256 MB) |
| **Limits** | NameNode memory ~ 1 GB / 1 million blocks; ~1000 DataNodes per cluster |
| **Use case** | On-prem clusters, sequential read/write, large files |
| **Status 2026** | Declining — most projects migrate to object storage (S3, GCS, MinIO) |
HDFS remains relevant for on-prem environments where object storage is unavailable, or for specific use cases (YARN clusters, Spark shuffle). For new projects, object storage is recommended.
### Object storage as Data Lake
| Platform | Service | Use case |
|----------|--------|----------|
| **AWS** | S3 | Primary data lake, Iceberg/Delta on S3 |
| **Azure** | ADLS Gen2 / Blob | Data lake for Azure ecosystem |
| **GCP** | GCS | Data lake for GCP (Dataproc, BigQuery) |
| **On-prem** | MinIO | S3-compatible object storage on own HW |
### HDFS capacity planning
| Data size | Configuration |
|-----------|-------------|
| **< 100 TB** | 35 DataNodes, 10 GbE, replication 3× |
| **100 TB 1 PB** | 520 DataNodes, 25/100 GbE, rack-aware, NameNode HA |
| **1 PB+** | 20+ DataNodes, 100 GbE, Federation (multiple NameNodes) |
---
## Open Table Formats
Table formats bring ACID transactions, schema evolution, and time travel to data lake object storage.
| Format | Organization | Engine compatibility | Streaming | Catalog |
|--------|-------------|---------------------|-----------|---------|
| **Apache Iceberg** | Apache Foundation | Spark, Flink, Trino, Dremio, Athena, Snowflake | Flink sink, snapshot-based | REST catalog, Polaris, Glue, Hive |
| **Delta Lake** | Linux Foundation (Databricks) | Spark (native), Trino, Flink (limited), Athena | Spark Streaming, DLT | Unity Catalog (proprietary), Hive |
| **Apache Hudi** | Apache Foundation | Spark, Flink, Trino (connector) | Built-in CDC, incremental | Hive, Glue (limited) |
| **Apache Paimon** | Apache Foundation | Flink (native), Spark | LSM-tree, changelog mode | Hive, REST |
**Recommendation 2026:**
- **Iceberg** — broadest multi-engine support, vendor-neutral, open catalog (Polaris)
- **Delta Lake** — best for Spark/Databricks ecosystem, UniForm for cross-format reads
- **Hudi** — losing momentum, only if already in production
- **Paimon** — emerging, Flink-native, LSM architecture
---
## Processing Engines
### Apache Spark
Dominant batch processing engine and unifying engine (batch + streaming + SQL + ML).
| Feature | Detail |
|---------|--------|
| **Version 2026** | Spark 4.x (4.1.0), native Kubernetes support, Structured Streaming, Delta Lake integration |
| **API** | Scala, Java, Python (PySpark), SQL, R (SparkR) |
| **Batch** | DataFrame/Dataset, RDD, SQL queries — 10100× faster than MapReduce |
| **Streaming** | Structured Streaming (micro-batch), latency ~100 ms 5 s |
| **SQL** | Spark SQL, ANSI SQL, Hive compatible |
| **ML** | MLlib, SparkML, MLflow integration |
| **Scheduler** | YARN, Kubernetes (production-ready since Spark 3.x), standalone |
| **Fault tolerance** | RDD lineage, checkpointing |
**When to use Spark:**
- Batch ETL/ELT pipelines
- Unified engine for batch + streaming (team preference)
- Machine learning pipelines (MLlib, SparkML)
- SQL analytics on large datasets
### Apache Flink
Highest-performance engine for true streaming (per-event processing).
| Feature | Detail |
|---------|--------|
| **Version 2026** | Flink 2.x (streaming-first, batch as bounded stream) |
| **API** | DataStream API, Table/SQL API, ProcessFunction (low-level) |
| **Latency** | < 100 ms (true streaming, Chandy-Lamport checkpointing) |
| **State management** | Managed state (ValueState, ListState, MapState), RocksDB backend |
| **Event time** | Native, watermarks, out-of-order handling |
| **Batch** | Batch as bounded stream (same runtime) |
| **Deployment** | YARN, Kubernetes, standalone |
| **Economics** | Higher memory requirements (managed state), requires careful tuning |
**When to use Flink:**
- Fraud detection, real-time bidding, IoT (< 100 ms latency)
- Complex stateful stream processing
- CDC pipelines
- Event-driven architectures
### Trino (ex PrestoSQL)
Distributed SQL query engine — federated queries across various sources.
| Feature | Detail |
|---------|--------|
| **Architecture** | Coordinator + Worker (no storage, no scheduler) |
| **Connectors** | Iceberg, Delta, Hive, HDFS, S3, GCS, ADLS, PostgreSQL, MySQL, Kafka, Elasticsearch |
| **Use case** | Interactive SQL, federated queries, lakehouse queries |
| **Version 2026** | Trino 470+, Iceberg native, Delta Lake connector |
---
## Spark vs Flink vs Trino comparison
| Criteria | Spark | Flink | Trino |
|----------|-------|-------|-------|
| **Primary use case** | Batch + unifying | True streaming | Interactive SQL |
| **Streaming latency** | 100 ms 5 s (micro-batch) | < 100 ms (true streaming) | N/A |
| **Throughput** | High (batch-optimized) | High (pipeline-optimized) | Medium (ad-hoc) |
| **State management** | State store (external) | Managed state (embedded) | N/A |
| **SQL support** | Spark SQL | Flink SQL | ANSI SQL (broadest) |
| **ML/AI** | MLlib, SparkML | — | — |
| **Kubernetes** | Native (production) | Native (production) | Native (production) |
| **Learning curve** | Medium | High | Low |
| **Operational complexity** | Medium | High | Medium |
---
## Orchestration
| Tool | Version 2026 | Use case |
|------|-------------|----------|
| **Apache Airflow** | 3.0+ (taskflow API, dynamic tasks, deferrable operators) | Universal orchestration, largest ecosystem |
| **Dagster** | 1.x (asset-oriented, software-defined assets) | Data pipelines, observability, asset lineage |
| **Prefect** | 3.x (native async, workers, blocks) | Python-native, serverless workers |
| **Kestra** | 1.x (YAML-native, declarative) | Event-driven orchestration |
| **Apache NiFi** | 2.x (flow-based, visual) | Data ingestion, CDC, streaming |
---
## Lakehouse architecture
Lakehouse combines data lake flexibility (object storage) with data warehouse performance and governance.
```
┌──────────────────────────────────────────────────────┐
│ Query Engines │
│ Trino Spark SQL Flink SQL Dremio Athena │
└─────────────────────────┬────────────────────────────┘
┌─────────────────────────▼────────────────────────────┐
│ Table Format Layer │
│ Apache Iceberg / Delta Lake / Hudi │
│ (ACID, time travel, schema evolution) │
└─────────────────────────┬────────────────────────────┘
┌─────────────────────────▼────────────────────────────┐
│ Storage Layer │
│ S3 / GCS / ADLS / MinIO / HDFS │
│ (Parquet / ORC / Avro) │
└──────────────────────────────────────────────────────┘
```
For Iceberg details see [DATABASES.en.md — Apache Iceberg Lakehouse](DATABASES.en.md#apache-iceberg-lakehouse).
---
## Big Data Infrastructure
### Cluster sizing
| Component | Spark (batch) | Flink (streaming) | Trino (SQL) |
|-----------|--------------|-------------------|-------------|
| **CPU** | 1664 cores/node | 1632 cores/node | 832 cores/node |
| **RAM** | 64256 GB/node | 64256 GB/node (incl. managed state) | 64256 GB/node |
| **Storage** | HDFS / object storage | Object storage (checkpoints) | None (stateless) |
| **Network** | 25100 GbE (shuffle-heavy) | 25100 GbE (checkpointing) | 25100 GbE |
| **Disk** | NVMe (scratch, shuffle) | NVMe (RocksDB state backend) | — |
| **Cluster size** | 5200+ nodes | 3100+ nodes | 550 nodes |
### Network considerations
- **Spark shuffle** — heavy network traffic between nodes; recommend 25100 GbE, ideally no oversubscription
- **Flink checkpointing** — periodic state writes to object storage; requires stable latency
- **HDFS rack awareness** — optimizes replication across racks
- **Data locality** — HDFS: local disk reads; object storage: network-bound
### Kubernetes vs YARN
| Criteria | YARN | Kubernetes |
|----------|------|-----------|
| **Resource isolation** | Cgroups (YARN containers) | Cgroups + namespaces (pods) |
| **Ecosystem fit** | Hadoop-native (HDFS, Hive, Spark) | Cloud-native, Spark, Flink, Trino |
| **Operational complexity** | Lower (single cluster manager) | Higher (requires K8s cluster) |
| **Multi-tenant isolation** | YARN queues (Capacity/Fair Scheduler) | Namespaces, ResourceQuotas, LimitRanges |
| **Stateful workloads** | Limited | StatefulSets, PVC, Operators |
| **2026 trend** | Legacy (declining) | Standard for new projects |
---
## Cloud deployment
| Cloud | Batch processing | Streaming | SQL | Managed K8s |
|-------|-----------------|-----------|-----|-------------|
| **AWS** | EMR (Spark, Hive, Flink) | Kinesis, MSK (Kafka), EMR Flink | Athena (Trino), Redshift | EKS |
| **Azure** | HDInsight (Spark, Hive), Synapse | Event Hubs, HDInsight Flink | Synapse SQL, Azure Data Explorer | AKS |
| **GCP** | Dataproc (Spark, Flink, Hive, Trino) | Pub/Sub, Dataflow (Beam), Dataproc Flink | BigQuery | GKE |
---
## Sources
Links, books and standards: [sources/infrastructure/sources.en.md](sources/infrastructure/sources.en.md)
*Last revision: 2026-06-18*