Files

Stanislav Hubacek ef3c2f75b1 18.6.2026

2026-06-18 16:25:33 +02:00

11 KiB

Raw Permalink Blame History

🗄️ Big Data — ecosystem, architecture, tools

Overview

The Big Data ecosystem in 2026: "Hadoop is dead, and yet it's everywhere." HDFS has shrunk, MapReduce is effectively gone, the Cloudera/Hortonworks era is over. But YARN lives on, the Hive Metastore has changed clothes into Iceberg/Delta, and the lakehouse pattern (cheap object storage + table format + distributed engine) is the inheritance Hadoop left behind.

The modern Big Data stack has 8 layers:

Storage — HDFS, S3, GCS, ABFS, MinIO
Table format — Apache Iceberg, Delta Lake, Apache Hudi, Apache Paimon
Catalog — Hive Metastore, Unity Catalog, Polaris, Nessie, AWS Glue
Batch processing — Apache Spark, Trino-on-Spark, Dremio
Stream processing — Apache Flink, Spark Structured Streaming, Kafka Streams
Distributed SQL — Trino, Presto, StarRocks, ClickHouse
Transformation — dbt, SQLMesh
Orchestration — Apache Airflow 3.0, Dagster, Prefect, Kestra

Storage

HDFS (Hadoop Distributed File System)

Feature	Detail
Architecture	Master/worker: NameNode (metadata) + DataNode (data)
Replication	Default 3×, configurable (rack-aware)
Block size	Default 128 MB (range 64 MB – 256 MB)
Limits	NameNode memory ~ 1 GB / 1 million blocks; ~1000 DataNodes per cluster
Use case	On-prem clusters, sequential read/write, large files
Status 2026	Declining — most projects migrate to object storage (S3, GCS, MinIO)

HDFS remains relevant for on-prem environments where object storage is unavailable, or for specific use cases (YARN clusters, Spark shuffle). For new projects, object storage is recommended.

Object storage as Data Lake

Platform	Service	Use case
AWS	S3	Primary data lake, Iceberg/Delta on S3
Azure	ADLS Gen2 / Blob	Data lake for Azure ecosystem
GCP	GCS	Data lake for GCP (Dataproc, BigQuery)
On-prem	MinIO	S3-compatible object storage on own HW

HDFS capacity planning

Data size	Configuration
< 100 TB	3–5 DataNodes, 10 GbE, replication 3×
100 TB – 1 PB	5–20 DataNodes, 25/100 GbE, rack-aware, NameNode HA
1 PB+	20+ DataNodes, 100 GbE, Federation (multiple NameNodes)

Open Table Formats

Table formats bring ACID transactions, schema evolution, and time travel to data lake object storage.

Format	Organization	Engine compatibility	Streaming	Catalog
Apache Iceberg	Apache Foundation	Spark, Flink, Trino, Dremio, Athena, Snowflake	Flink sink, snapshot-based	REST catalog, Polaris, Glue, Hive
Delta Lake	Linux Foundation (Databricks)	Spark (native), Trino, Flink (limited), Athena	Spark Streaming, DLT	Unity Catalog (proprietary), Hive
Apache Hudi	Apache Foundation	Spark, Flink, Trino (connector)	Built-in CDC, incremental	Hive, Glue (limited)
Apache Paimon	Apache Foundation	Flink (native), Spark	LSM-tree, changelog mode	Hive, REST

Recommendation 2026:

Iceberg — broadest multi-engine support, vendor-neutral, open catalog (Polaris)
Delta Lake — best for Spark/Databricks ecosystem, UniForm for cross-format reads
Hudi — losing momentum, only if already in production
Paimon — emerging, Flink-native, LSM architecture

Processing Engines

Apache Spark

Dominant batch processing engine and unifying engine (batch + streaming + SQL + ML).

Feature	Detail
Version 2026	Spark 4.x (4.1.0), native Kubernetes support, Structured Streaming, Delta Lake integration
API	Scala, Java, Python (PySpark), SQL, R (SparkR)
Batch	DataFrame/Dataset, RDD, SQL queries — 10–100× faster than MapReduce
Streaming	Structured Streaming (micro-batch), latency ~100 ms – 5 s
SQL	Spark SQL, ANSI SQL, Hive compatible
ML	MLlib, SparkML, MLflow integration
Scheduler	YARN, Kubernetes (production-ready since Spark 3.x), standalone
Fault tolerance	RDD lineage, checkpointing

When to use Spark:

Batch ETL/ELT pipelines
Unified engine for batch + streaming (team preference)
Machine learning pipelines (MLlib, SparkML)
SQL analytics on large datasets

Apache Flink

Highest-performance engine for true streaming (per-event processing).

Feature	Detail
Version 2026	Flink 2.x (streaming-first, batch as bounded stream)
API	DataStream API, Table/SQL API, ProcessFunction (low-level)
Latency	< 100 ms (true streaming, Chandy-Lamport checkpointing)
State management	Managed state (ValueState, ListState, MapState), RocksDB backend
Event time	Native, watermarks, out-of-order handling
Batch	Batch as bounded stream (same runtime)
Deployment	YARN, Kubernetes, standalone
Economics	Higher memory requirements (managed state), requires careful tuning

When to use Flink:

Fraud detection, real-time bidding, IoT (< 100 ms latency)
Complex stateful stream processing
CDC pipelines
Event-driven architectures

Trino (ex PrestoSQL)

Distributed SQL query engine — federated queries across various sources.

Feature	Detail
Architecture	Coordinator + Worker (no storage, no scheduler)
Connectors	Iceberg, Delta, Hive, HDFS, S3, GCS, ADLS, PostgreSQL, MySQL, Kafka, Elasticsearch
Use case	Interactive SQL, federated queries, lakehouse queries
Version 2026	Trino 470+, Iceberg native, Delta Lake connector

Spark vs Flink vs Trino comparison

Criteria	Spark	Flink	Trino
Primary use case	Batch + unifying	True streaming	Interactive SQL
Streaming latency	100 ms – 5 s (micro-batch)	< 100 ms (true streaming)	N/A
Throughput	High (batch-optimized)	High (pipeline-optimized)	Medium (ad-hoc)
State management	State store (external)	Managed state (embedded)	N/A
SQL support	Spark SQL	Flink SQL	ANSI SQL (broadest)
ML/AI	MLlib, SparkML	—	—
Kubernetes	Native (production)	Native (production)	Native (production)
Learning curve	Medium	High	Low
Operational complexity	Medium	High	Medium

Orchestration

Tool	Version 2026	Use case
Apache Airflow	3.0+ (taskflow API, dynamic tasks, deferrable operators)	Universal orchestration, largest ecosystem
Dagster	1.x (asset-oriented, software-defined assets)	Data pipelines, observability, asset lineage
Prefect	3.x (native async, workers, blocks)	Python-native, serverless workers
Kestra	1.x (YAML-native, declarative)	Event-driven orchestration
Apache NiFi	2.x (flow-based, visual)	Data ingestion, CDC, streaming

Lakehouse architecture

Lakehouse combines data lake flexibility (object storage) with data warehouse performance and governance.

┌──────────────────────────────────────────────────────┐
│                    Query Engines                      │
│   Trino    Spark SQL    Flink SQL    Dremio    Athena  │
└─────────────────────────┬────────────────────────────┘
                          │
┌─────────────────────────▼────────────────────────────┐
│                  Table Format Layer                    │
│         Apache Iceberg / Delta Lake / Hudi            │
│       (ACID, time travel, schema evolution)           │
└─────────────────────────┬────────────────────────────┘
                          │
┌─────────────────────────▼────────────────────────────┐
│                    Storage Layer                       │
│           S3 / GCS / ADLS / MinIO / HDFS              │
│               (Parquet / ORC / Avro)                  │
└──────────────────────────────────────────────────────┘

For Iceberg details see DATABASES.en.md — Apache Iceberg Lakehouse.

Big Data Infrastructure

Cluster sizing

Component	Spark (batch)	Flink (streaming)	Trino (SQL)
CPU	16–64 cores/node	16–32 cores/node	8–32 cores/node
RAM	64–256 GB/node	64–256 GB/node (incl. managed state)	64–256 GB/node
Storage	HDFS / object storage	Object storage (checkpoints)	None (stateless)
Network	25–100 GbE (shuffle-heavy)	25–100 GbE (checkpointing)	25–100 GbE
Disk	NVMe (scratch, shuffle)	NVMe (RocksDB state backend)	—
Cluster size	5–200+ nodes	3–100+ nodes	5–50 nodes

Network considerations

Spark shuffle — heavy network traffic between nodes; recommend 25–100 GbE, ideally no oversubscription
Flink checkpointing — periodic state writes to object storage; requires stable latency
HDFS rack awareness — optimizes replication across racks
Data locality — HDFS: local disk reads; object storage: network-bound

Kubernetes vs YARN

Criteria	YARN	Kubernetes
Resource isolation	Cgroups (YARN containers)	Cgroups + namespaces (pods)
Ecosystem fit	Hadoop-native (HDFS, Hive, Spark)	Cloud-native, Spark, Flink, Trino
Operational complexity	Lower (single cluster manager)	Higher (requires K8s cluster)
Multi-tenant isolation	YARN queues (Capacity/Fair Scheduler)	Namespaces, ResourceQuotas, LimitRanges
Stateful workloads	Limited	StatefulSets, PVC, Operators
2026 trend	Legacy (declining)	Standard for new projects

Cloud deployment

Cloud	Batch processing	Streaming	SQL	Managed K8s
AWS	EMR (Spark, Hive, Flink)	Kinesis, MSK (Kafka), EMR Flink	Athena (Trino), Redshift	EKS
Azure	HDInsight (Spark, Hive), Synapse	Event Hubs, HDInsight Flink	Synapse SQL, Azure Data Explorer	AKS
GCP	Dataproc (Spark, Flink, Hive, Trino)	Pub/Sub, Dataflow (Beam), Dataproc Flink	BigQuery	GKE

Sources

Links, books and standards: sources/infrastructure/sources.en.md

Last revision: 2026-06-18

11 KiB Raw Permalink Blame History Unescape Escape