Files
knowledge-base/BIG-DATA.en.md
Stanislav Hubacek ef3c2f75b1 18.6.2026
2026-06-18 16:25:33 +02:00

11 KiB
Raw Permalink Blame History

🗄️ Big Data — ecosystem, architecture, tools

Overview

The Big Data ecosystem in 2026: "Hadoop is dead, and yet it's everywhere." HDFS has shrunk, MapReduce is effectively gone, the Cloudera/Hortonworks era is over. But YARN lives on, the Hive Metastore has changed clothes into Iceberg/Delta, and the lakehouse pattern (cheap object storage + table format + distributed engine) is the inheritance Hadoop left behind.

The modern Big Data stack has 8 layers:

  1. Storage — HDFS, S3, GCS, ABFS, MinIO
  2. Table format — Apache Iceberg, Delta Lake, Apache Hudi, Apache Paimon
  3. Catalog — Hive Metastore, Unity Catalog, Polaris, Nessie, AWS Glue
  4. Batch processing — Apache Spark, Trino-on-Spark, Dremio
  5. Stream processing — Apache Flink, Spark Structured Streaming, Kafka Streams
  6. Distributed SQL — Trino, Presto, StarRocks, ClickHouse
  7. Transformation — dbt, SQLMesh
  8. Orchestration — Apache Airflow 3.0, Dagster, Prefect, Kestra

Storage

HDFS (Hadoop Distributed File System)

Feature Detail
Architecture Master/worker: NameNode (metadata) + DataNode (data)
Replication Default 3×, configurable (rack-aware)
Block size Default 128 MB (range 64 MB 256 MB)
Limits NameNode memory ~ 1 GB / 1 million blocks; ~1000 DataNodes per cluster
Use case On-prem clusters, sequential read/write, large files
Status 2026 Declining — most projects migrate to object storage (S3, GCS, MinIO)

HDFS remains relevant for on-prem environments where object storage is unavailable, or for specific use cases (YARN clusters, Spark shuffle). For new projects, object storage is recommended.

Object storage as Data Lake

Platform Service Use case
AWS S3 Primary data lake, Iceberg/Delta on S3
Azure ADLS Gen2 / Blob Data lake for Azure ecosystem
GCP GCS Data lake for GCP (Dataproc, BigQuery)
On-prem MinIO S3-compatible object storage on own HW

HDFS capacity planning

Data size Configuration
< 100 TB 35 DataNodes, 10 GbE, replication 3×
100 TB 1 PB 520 DataNodes, 25/100 GbE, rack-aware, NameNode HA
1 PB+ 20+ DataNodes, 100 GbE, Federation (multiple NameNodes)

Open Table Formats

Table formats bring ACID transactions, schema evolution, and time travel to data lake object storage.

Format Organization Engine compatibility Streaming Catalog
Apache Iceberg Apache Foundation Spark, Flink, Trino, Dremio, Athena, Snowflake Flink sink, snapshot-based REST catalog, Polaris, Glue, Hive
Delta Lake Linux Foundation (Databricks) Spark (native), Trino, Flink (limited), Athena Spark Streaming, DLT Unity Catalog (proprietary), Hive
Apache Hudi Apache Foundation Spark, Flink, Trino (connector) Built-in CDC, incremental Hive, Glue (limited)
Apache Paimon Apache Foundation Flink (native), Spark LSM-tree, changelog mode Hive, REST

Recommendation 2026:

  • Iceberg — broadest multi-engine support, vendor-neutral, open catalog (Polaris)
  • Delta Lake — best for Spark/Databricks ecosystem, UniForm for cross-format reads
  • Hudi — losing momentum, only if already in production
  • Paimon — emerging, Flink-native, LSM architecture

Processing Engines

Apache Spark

Dominant batch processing engine and unifying engine (batch + streaming + SQL + ML).

Feature Detail
Version 2026 Spark 4.x (4.1.0), native Kubernetes support, Structured Streaming, Delta Lake integration
API Scala, Java, Python (PySpark), SQL, R (SparkR)
Batch DataFrame/Dataset, RDD, SQL queries — 10100× faster than MapReduce
Streaming Structured Streaming (micro-batch), latency ~100 ms 5 s
SQL Spark SQL, ANSI SQL, Hive compatible
ML MLlib, SparkML, MLflow integration
Scheduler YARN, Kubernetes (production-ready since Spark 3.x), standalone
Fault tolerance RDD lineage, checkpointing

When to use Spark:

  • Batch ETL/ELT pipelines
  • Unified engine for batch + streaming (team preference)
  • Machine learning pipelines (MLlib, SparkML)
  • SQL analytics on large datasets

Highest-performance engine for true streaming (per-event processing).

Feature Detail
Version 2026 Flink 2.x (streaming-first, batch as bounded stream)
API DataStream API, Table/SQL API, ProcessFunction (low-level)
Latency < 100 ms (true streaming, Chandy-Lamport checkpointing)
State management Managed state (ValueState, ListState, MapState), RocksDB backend
Event time Native, watermarks, out-of-order handling
Batch Batch as bounded stream (same runtime)
Deployment YARN, Kubernetes, standalone
Economics Higher memory requirements (managed state), requires careful tuning

When to use Flink:

  • Fraud detection, real-time bidding, IoT (< 100 ms latency)
  • Complex stateful stream processing
  • CDC pipelines
  • Event-driven architectures

Trino (ex PrestoSQL)

Distributed SQL query engine — federated queries across various sources.

Feature Detail
Architecture Coordinator + Worker (no storage, no scheduler)
Connectors Iceberg, Delta, Hive, HDFS, S3, GCS, ADLS, PostgreSQL, MySQL, Kafka, Elasticsearch
Use case Interactive SQL, federated queries, lakehouse queries
Version 2026 Trino 470+, Iceberg native, Delta Lake connector

Criteria Spark Flink Trino
Primary use case Batch + unifying True streaming Interactive SQL
Streaming latency 100 ms 5 s (micro-batch) < 100 ms (true streaming) N/A
Throughput High (batch-optimized) High (pipeline-optimized) Medium (ad-hoc)
State management State store (external) Managed state (embedded) N/A
SQL support Spark SQL Flink SQL ANSI SQL (broadest)
ML/AI MLlib, SparkML
Kubernetes Native (production) Native (production) Native (production)
Learning curve Medium High Low
Operational complexity Medium High Medium

Orchestration

Tool Version 2026 Use case
Apache Airflow 3.0+ (taskflow API, dynamic tasks, deferrable operators) Universal orchestration, largest ecosystem
Dagster 1.x (asset-oriented, software-defined assets) Data pipelines, observability, asset lineage
Prefect 3.x (native async, workers, blocks) Python-native, serverless workers
Kestra 1.x (YAML-native, declarative) Event-driven orchestration
Apache NiFi 2.x (flow-based, visual) Data ingestion, CDC, streaming

Lakehouse architecture

Lakehouse combines data lake flexibility (object storage) with data warehouse performance and governance.

┌──────────────────────────────────────────────────────┐
│                    Query Engines                      │
│   Trino    Spark SQL    Flink SQL    Dremio    Athena  │
└─────────────────────────┬────────────────────────────┘
                          │
┌─────────────────────────▼────────────────────────────┐
│                  Table Format Layer                    │
│         Apache Iceberg / Delta Lake / Hudi            │
│       (ACID, time travel, schema evolution)           │
└─────────────────────────┬────────────────────────────┘
                          │
┌─────────────────────────▼────────────────────────────┐
│                    Storage Layer                       │
│           S3 / GCS / ADLS / MinIO / HDFS              │
│               (Parquet / ORC / Avro)                  │
└──────────────────────────────────────────────────────┘

For Iceberg details see DATABASES.en.md — Apache Iceberg Lakehouse.


Big Data Infrastructure

Cluster sizing

Component Spark (batch) Flink (streaming) Trino (SQL)
CPU 1664 cores/node 1632 cores/node 832 cores/node
RAM 64256 GB/node 64256 GB/node (incl. managed state) 64256 GB/node
Storage HDFS / object storage Object storage (checkpoints) None (stateless)
Network 25100 GbE (shuffle-heavy) 25100 GbE (checkpointing) 25100 GbE
Disk NVMe (scratch, shuffle) NVMe (RocksDB state backend)
Cluster size 5200+ nodes 3100+ nodes 550 nodes

Network considerations

  • Spark shuffle — heavy network traffic between nodes; recommend 25100 GbE, ideally no oversubscription
  • Flink checkpointing — periodic state writes to object storage; requires stable latency
  • HDFS rack awareness — optimizes replication across racks
  • Data locality — HDFS: local disk reads; object storage: network-bound

Kubernetes vs YARN

Criteria YARN Kubernetes
Resource isolation Cgroups (YARN containers) Cgroups + namespaces (pods)
Ecosystem fit Hadoop-native (HDFS, Hive, Spark) Cloud-native, Spark, Flink, Trino
Operational complexity Lower (single cluster manager) Higher (requires K8s cluster)
Multi-tenant isolation YARN queues (Capacity/Fair Scheduler) Namespaces, ResourceQuotas, LimitRanges
Stateful workloads Limited StatefulSets, PVC, Operators
2026 trend Legacy (declining) Standard for new projects

Cloud deployment

Cloud Batch processing Streaming SQL Managed K8s
AWS EMR (Spark, Hive, Flink) Kinesis, MSK (Kafka), EMR Flink Athena (Trino), Redshift EKS
Azure HDInsight (Spark, Hive), Synapse Event Hubs, HDInsight Flink Synapse SQL, Azure Data Explorer AKS
GCP Dataproc (Spark, Flink, Hive, Trino) Pub/Sub, Dataflow (Beam), Dataproc Flink BigQuery GKE

Sources

Links, books and standards: sources/infrastructure/sources.en.md

Last revision: 2026-06-18