18.6.2026
This commit is contained in:
232
BIG-DATA.en.md
Normal file
232
BIG-DATA.en.md
Normal file
@@ -0,0 +1,232 @@
|
||||
# 🗄️ Big Data — ecosystem, architecture, tools
|
||||
|
||||
## Overview
|
||||
|
||||
The Big Data ecosystem in 2026: "Hadoop is dead, and yet it's everywhere." HDFS has shrunk, MapReduce is effectively gone, the Cloudera/Hortonworks era is over. But YARN lives on, the Hive Metastore has changed clothes into Iceberg/Delta, and the lakehouse pattern (cheap object storage + table format + distributed engine) is the inheritance Hadoop left behind.
|
||||
|
||||
The modern Big Data stack has 8 layers:
|
||||
|
||||
1. **Storage** — HDFS, S3, GCS, ABFS, MinIO
|
||||
2. **Table format** — Apache Iceberg, Delta Lake, Apache Hudi, Apache Paimon
|
||||
3. **Catalog** — Hive Metastore, Unity Catalog, Polaris, Nessie, AWS Glue
|
||||
4. **Batch processing** — Apache Spark, Trino-on-Spark, Dremio
|
||||
5. **Stream processing** — Apache Flink, Spark Structured Streaming, Kafka Streams
|
||||
6. **Distributed SQL** — Trino, Presto, StarRocks, ClickHouse
|
||||
7. **Transformation** — dbt, SQLMesh
|
||||
8. **Orchestration** — Apache Airflow 3.0, Dagster, Prefect, Kestra
|
||||
|
||||
---
|
||||
|
||||
## Storage
|
||||
|
||||
### HDFS (Hadoop Distributed File System)
|
||||
|
||||
| Feature | Detail |
|
||||
|---------|--------|
|
||||
| **Architecture** | Master/worker: NameNode (metadata) + DataNode (data) |
|
||||
| **Replication** | Default 3×, configurable (rack-aware) |
|
||||
| **Block size** | Default 128 MB (range 64 MB – 256 MB) |
|
||||
| **Limits** | NameNode memory ~ 1 GB / 1 million blocks; ~1000 DataNodes per cluster |
|
||||
| **Use case** | On-prem clusters, sequential read/write, large files |
|
||||
| **Status 2026** | Declining — most projects migrate to object storage (S3, GCS, MinIO) |
|
||||
|
||||
HDFS remains relevant for on-prem environments where object storage is unavailable, or for specific use cases (YARN clusters, Spark shuffle). For new projects, object storage is recommended.
|
||||
|
||||
### Object storage as Data Lake
|
||||
|
||||
| Platform | Service | Use case |
|
||||
|----------|--------|----------|
|
||||
| **AWS** | S3 | Primary data lake, Iceberg/Delta on S3 |
|
||||
| **Azure** | ADLS Gen2 / Blob | Data lake for Azure ecosystem |
|
||||
| **GCP** | GCS | Data lake for GCP (Dataproc, BigQuery) |
|
||||
| **On-prem** | MinIO | S3-compatible object storage on own HW |
|
||||
|
||||
### HDFS capacity planning
|
||||
|
||||
| Data size | Configuration |
|
||||
|-----------|-------------|
|
||||
| **< 100 TB** | 3–5 DataNodes, 10 GbE, replication 3× |
|
||||
| **100 TB – 1 PB** | 5–20 DataNodes, 25/100 GbE, rack-aware, NameNode HA |
|
||||
| **1 PB+** | 20+ DataNodes, 100 GbE, Federation (multiple NameNodes) |
|
||||
|
||||
---
|
||||
|
||||
## Open Table Formats
|
||||
|
||||
Table formats bring ACID transactions, schema evolution, and time travel to data lake object storage.
|
||||
|
||||
| Format | Organization | Engine compatibility | Streaming | Catalog |
|
||||
|--------|-------------|---------------------|-----------|---------|
|
||||
| **Apache Iceberg** | Apache Foundation | Spark, Flink, Trino, Dremio, Athena, Snowflake | Flink sink, snapshot-based | REST catalog, Polaris, Glue, Hive |
|
||||
| **Delta Lake** | Linux Foundation (Databricks) | Spark (native), Trino, Flink (limited), Athena | Spark Streaming, DLT | Unity Catalog (proprietary), Hive |
|
||||
| **Apache Hudi** | Apache Foundation | Spark, Flink, Trino (connector) | Built-in CDC, incremental | Hive, Glue (limited) |
|
||||
| **Apache Paimon** | Apache Foundation | Flink (native), Spark | LSM-tree, changelog mode | Hive, REST |
|
||||
|
||||
**Recommendation 2026:**
|
||||
- **Iceberg** — broadest multi-engine support, vendor-neutral, open catalog (Polaris)
|
||||
- **Delta Lake** — best for Spark/Databricks ecosystem, UniForm for cross-format reads
|
||||
- **Hudi** — losing momentum, only if already in production
|
||||
- **Paimon** — emerging, Flink-native, LSM architecture
|
||||
|
||||
---
|
||||
|
||||
## Processing Engines
|
||||
|
||||
### Apache Spark
|
||||
|
||||
Dominant batch processing engine and unifying engine (batch + streaming + SQL + ML).
|
||||
|
||||
| Feature | Detail |
|
||||
|---------|--------|
|
||||
| **Version 2026** | Spark 4.x (4.1.0), native Kubernetes support, Structured Streaming, Delta Lake integration |
|
||||
| **API** | Scala, Java, Python (PySpark), SQL, R (SparkR) |
|
||||
| **Batch** | DataFrame/Dataset, RDD, SQL queries — 10–100× faster than MapReduce |
|
||||
| **Streaming** | Structured Streaming (micro-batch), latency ~100 ms – 5 s |
|
||||
| **SQL** | Spark SQL, ANSI SQL, Hive compatible |
|
||||
| **ML** | MLlib, SparkML, MLflow integration |
|
||||
| **Scheduler** | YARN, Kubernetes (production-ready since Spark 3.x), standalone |
|
||||
| **Fault tolerance** | RDD lineage, checkpointing |
|
||||
|
||||
**When to use Spark:**
|
||||
- Batch ETL/ELT pipelines
|
||||
- Unified engine for batch + streaming (team preference)
|
||||
- Machine learning pipelines (MLlib, SparkML)
|
||||
- SQL analytics on large datasets
|
||||
|
||||
### Apache Flink
|
||||
|
||||
Highest-performance engine for true streaming (per-event processing).
|
||||
|
||||
| Feature | Detail |
|
||||
|---------|--------|
|
||||
| **Version 2026** | Flink 2.x (streaming-first, batch as bounded stream) |
|
||||
| **API** | DataStream API, Table/SQL API, ProcessFunction (low-level) |
|
||||
| **Latency** | < 100 ms (true streaming, Chandy-Lamport checkpointing) |
|
||||
| **State management** | Managed state (ValueState, ListState, MapState), RocksDB backend |
|
||||
| **Event time** | Native, watermarks, out-of-order handling |
|
||||
| **Batch** | Batch as bounded stream (same runtime) |
|
||||
| **Deployment** | YARN, Kubernetes, standalone |
|
||||
| **Economics** | Higher memory requirements (managed state), requires careful tuning |
|
||||
|
||||
**When to use Flink:**
|
||||
- Fraud detection, real-time bidding, IoT (< 100 ms latency)
|
||||
- Complex stateful stream processing
|
||||
- CDC pipelines
|
||||
- Event-driven architectures
|
||||
|
||||
### Trino (ex PrestoSQL)
|
||||
|
||||
Distributed SQL query engine — federated queries across various sources.
|
||||
|
||||
| Feature | Detail |
|
||||
|---------|--------|
|
||||
| **Architecture** | Coordinator + Worker (no storage, no scheduler) |
|
||||
| **Connectors** | Iceberg, Delta, Hive, HDFS, S3, GCS, ADLS, PostgreSQL, MySQL, Kafka, Elasticsearch |
|
||||
| **Use case** | Interactive SQL, federated queries, lakehouse queries |
|
||||
| **Version 2026** | Trino 470+, Iceberg native, Delta Lake connector |
|
||||
|
||||
---
|
||||
|
||||
## Spark vs Flink vs Trino comparison
|
||||
|
||||
| Criteria | Spark | Flink | Trino |
|
||||
|----------|-------|-------|-------|
|
||||
| **Primary use case** | Batch + unifying | True streaming | Interactive SQL |
|
||||
| **Streaming latency** | 100 ms – 5 s (micro-batch) | < 100 ms (true streaming) | N/A |
|
||||
| **Throughput** | High (batch-optimized) | High (pipeline-optimized) | Medium (ad-hoc) |
|
||||
| **State management** | State store (external) | Managed state (embedded) | N/A |
|
||||
| **SQL support** | Spark SQL | Flink SQL | ANSI SQL (broadest) |
|
||||
| **ML/AI** | MLlib, SparkML | — | — |
|
||||
| **Kubernetes** | Native (production) | Native (production) | Native (production) |
|
||||
| **Learning curve** | Medium | High | Low |
|
||||
| **Operational complexity** | Medium | High | Medium |
|
||||
|
||||
---
|
||||
|
||||
## Orchestration
|
||||
|
||||
| Tool | Version 2026 | Use case |
|
||||
|------|-------------|----------|
|
||||
| **Apache Airflow** | 3.0+ (taskflow API, dynamic tasks, deferrable operators) | Universal orchestration, largest ecosystem |
|
||||
| **Dagster** | 1.x (asset-oriented, software-defined assets) | Data pipelines, observability, asset lineage |
|
||||
| **Prefect** | 3.x (native async, workers, blocks) | Python-native, serverless workers |
|
||||
| **Kestra** | 1.x (YAML-native, declarative) | Event-driven orchestration |
|
||||
| **Apache NiFi** | 2.x (flow-based, visual) | Data ingestion, CDC, streaming |
|
||||
|
||||
---
|
||||
|
||||
## Lakehouse architecture
|
||||
|
||||
Lakehouse combines data lake flexibility (object storage) with data warehouse performance and governance.
|
||||
|
||||
```
|
||||
┌──────────────────────────────────────────────────────┐
|
||||
│ Query Engines │
|
||||
│ Trino Spark SQL Flink SQL Dremio Athena │
|
||||
└─────────────────────────┬────────────────────────────┘
|
||||
│
|
||||
┌─────────────────────────▼────────────────────────────┐
|
||||
│ Table Format Layer │
|
||||
│ Apache Iceberg / Delta Lake / Hudi │
|
||||
│ (ACID, time travel, schema evolution) │
|
||||
└─────────────────────────┬────────────────────────────┘
|
||||
│
|
||||
┌─────────────────────────▼────────────────────────────┐
|
||||
│ Storage Layer │
|
||||
│ S3 / GCS / ADLS / MinIO / HDFS │
|
||||
│ (Parquet / ORC / Avro) │
|
||||
└──────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
For Iceberg details see [DATABASES.en.md — Apache Iceberg Lakehouse](DATABASES.en.md#apache-iceberg-lakehouse).
|
||||
|
||||
---
|
||||
|
||||
## Big Data Infrastructure
|
||||
|
||||
### Cluster sizing
|
||||
|
||||
| Component | Spark (batch) | Flink (streaming) | Trino (SQL) |
|
||||
|-----------|--------------|-------------------|-------------|
|
||||
| **CPU** | 16–64 cores/node | 16–32 cores/node | 8–32 cores/node |
|
||||
| **RAM** | 64–256 GB/node | 64–256 GB/node (incl. managed state) | 64–256 GB/node |
|
||||
| **Storage** | HDFS / object storage | Object storage (checkpoints) | None (stateless) |
|
||||
| **Network** | 25–100 GbE (shuffle-heavy) | 25–100 GbE (checkpointing) | 25–100 GbE |
|
||||
| **Disk** | NVMe (scratch, shuffle) | NVMe (RocksDB state backend) | — |
|
||||
| **Cluster size** | 5–200+ nodes | 3–100+ nodes | 5–50 nodes |
|
||||
|
||||
### Network considerations
|
||||
|
||||
- **Spark shuffle** — heavy network traffic between nodes; recommend 25–100 GbE, ideally no oversubscription
|
||||
- **Flink checkpointing** — periodic state writes to object storage; requires stable latency
|
||||
- **HDFS rack awareness** — optimizes replication across racks
|
||||
- **Data locality** — HDFS: local disk reads; object storage: network-bound
|
||||
|
||||
### Kubernetes vs YARN
|
||||
|
||||
| Criteria | YARN | Kubernetes |
|
||||
|----------|------|-----------|
|
||||
| **Resource isolation** | Cgroups (YARN containers) | Cgroups + namespaces (pods) |
|
||||
| **Ecosystem fit** | Hadoop-native (HDFS, Hive, Spark) | Cloud-native, Spark, Flink, Trino |
|
||||
| **Operational complexity** | Lower (single cluster manager) | Higher (requires K8s cluster) |
|
||||
| **Multi-tenant isolation** | YARN queues (Capacity/Fair Scheduler) | Namespaces, ResourceQuotas, LimitRanges |
|
||||
| **Stateful workloads** | Limited | StatefulSets, PVC, Operators |
|
||||
| **2026 trend** | Legacy (declining) | Standard for new projects |
|
||||
|
||||
---
|
||||
|
||||
## Cloud deployment
|
||||
|
||||
| Cloud | Batch processing | Streaming | SQL | Managed K8s |
|
||||
|-------|-----------------|-----------|-----|-------------|
|
||||
| **AWS** | EMR (Spark, Hive, Flink) | Kinesis, MSK (Kafka), EMR Flink | Athena (Trino), Redshift | EKS |
|
||||
| **Azure** | HDInsight (Spark, Hive), Synapse | Event Hubs, HDInsight Flink | Synapse SQL, Azure Data Explorer | AKS |
|
||||
| **GCP** | Dataproc (Spark, Flink, Hive, Trino) | Pub/Sub, Dataflow (Beam), Dataproc Flink | BigQuery | GKE |
|
||||
|
||||
---
|
||||
|
||||
## Sources
|
||||
|
||||
Links, books and standards: [sources/infrastructure/sources.en.md](sources/infrastructure/sources.en.md)
|
||||
|
||||
*Last revision: 2026-06-18*
|
||||
Reference in New Issue
Block a user