324 lines
15 KiB
Markdown
324 lines
15 KiB
Markdown
# 🗄️ Database Architecture
|
|
|
|
## Database Classification
|
|
|
|
### Relational (SQL)
|
|
|
|
| DB | License | Use Case | Details |
|
|
|----|---------|----------|--------|
|
|
| **PostgreSQL** | Open source | Universal, geospatial, analytics, AI | [POSTGRESQL.md](POSTGRESQL.md) |
|
|
| **MySQL / MariaDB** | Open source | Web, LAMP stack, e-commerce | [MYSQL.md](MYSQL.md) |
|
|
| **Microsoft SQL Server** | Proprietary | Enterprise .NET, Windows ecosystem | — |
|
|
| **Oracle DB** | Proprietary | Enterprise, finance, mainframe, RAC cluster | [ORACLE.md](ORACLE.md) |
|
|
| **Amazon Aurora** | Managed | MySQL/PostgreSQL compatible, cloud-native | — |
|
|
|
|
### NoSQL
|
|
|
|
| Type | DB | Use Case | Details |
|
|
|-----|----|----------|--------|
|
|
| **Document** | MongoDB, Couchbase | JSON data, flexible schema | [MONGODB.md](MONGODB.md) |
|
|
| **Key-Value / Cache** | Redis, Memcached, DynamoDB | Cache, session store, real-time | [REDIS.md](REDIS.md) |
|
|
| **Wide-column** | Cassandra, ScyllaDB | Time-series, IoT, big data | [CASSANDRA.md](CASSANDRA.md) |
|
|
| **Vector** | Pinecone, Qdrant, Milvus, pgvector | Embeddings, RAG, semantic search | [VEKTOROVE-DB.md](VEKTOROVE-DB.md) |
|
|
| **Graph** | Neo4j, Dgraph | Relationships, recommendations, social graphs | — |
|
|
|
|
### Storage Engines
|
|
|
|
Common concepts across databases: [DATABASE-ENGINES.en.md](DATABASE-ENGINES.en.md)
|
|
|
|
---
|
|
|
|
## Transaction Isolation Levels
|
|
|
|
| Level | Dirty Read | Non-repeatable Read | Phantom Read | Serialization Anomaly |
|
|
|--------|-----------|---------------------|-------------|----------------------|
|
|
| **Read Uncommitted** | Yes (possible) | Yes | Yes | Yes |
|
|
| **Read Committed** | No (prevented) | Yes | Yes | Yes |
|
|
| **Repeatable Read** | No | No | No (PostgreSQL: No) | Yes |
|
|
| **Serializable** | No | No | No | No |
|
|
|
|
**Anomalies**:
|
|
- **Dirty Read** — reading data from an uncommitted transaction (data may be rolled back)
|
|
- **Non-repeatable Read** — same query returns different data (another transaction updated the row in the meantime)
|
|
- **Phantom Read** — same query returns new rows (another transaction inserted data matching the condition)
|
|
- **Serialization Anomaly** — the result of transactions is not equivalent to any serial order
|
|
|
|
### PostgreSQL vs MySQL Differences
|
|
|
|
- **PostgreSQL**: Read Uncommitted behaves like Read Committed. Repeatable Read = Snapshot Isolation (also prevents phantom reads). Serializable = SSI.
|
|
- **MySQL InnoDB**: Repeatable Read uses next-key locking (prevents phantom reads).
|
|
|
|
---
|
|
|
|
## CAP Theorem
|
|
|
|
In a distributed system, only 2 out of 3 are possible: **C**onsistency, **A**vailability, **P**artition tolerance.
|
|
|
|
In practice: P is always required, we choose between CP (consistency) and AP (availability).
|
|
|
|
### PACELC Extension
|
|
|
|
PACELC extends CAP with behavior under normal conditions (no partition):
|
|
- **P**artition → **A**vailability vs **C**onsistency
|
|
- **E**lse (no partition) → **L**atency vs **C**onsistency
|
|
|
|
| DB | Partition Choice | Else Choice |
|
|
|----|----------------|------------|
|
|
| Cassandra | AP (availability) | LC (low latency, eventual consistency) |
|
|
| DynamoDB (default) | AP | LC |
|
|
| MongoDB | CP (primary) | LC |
|
|
| PostgreSQL (single) | CP | CC |
|
|
| CockroachDB | CP | CC |
|
|
|
|
### Quorum Details
|
|
|
|
- **R** (read quorum) + **W** (write quorum) > **N** (replication factor)
|
|
- Typical: N=3, R=2, W=2 (tolerates 1 node down)
|
|
- **Sloppy quorum** — when a node is unavailable, data is temporarily stored on another node
|
|
- **Hinted handoff** — temporary write to another node with a hint, data is transferred upon recovery
|
|
|
|
---
|
|
|
|
## Replication
|
|
|
|
| Type | Description | Latency |
|
|
|-----|-------|---------|
|
|
| Synchronous | Write confirmed only after replication to all nodes | High, but consistent |
|
|
| Asynchronous | Write confirmed immediately, replication in the background | Low, possible data loss |
|
|
| Semi-synchronous | Confirmation from majority of nodes | Compromise |
|
|
|
|
### Topologies
|
|
|
|
- **Leader-Follower** (Master-Slave) — reads from replicas
|
|
- **Leader-Leader** (Multi-master) — writes to multiple nodes
|
|
- **Quorum-based** — R + W > N (Cassandra, DynamoDB)
|
|
|
|
---
|
|
|
|
## Sharding
|
|
|
|
Data distribution across nodes based on a shard key.
|
|
|
|
```
|
|
┌─────────┐
|
|
│ Proxy │
|
|
│ Router │
|
|
└────┬────┘
|
|
┌──────────┼──────────┐
|
|
┌────▼───┐ ┌───▼────┐ ┌───▼────┐
|
|
│Shard A │ │Shard B │ │Shard C │
|
|
│ 0-100 │ │101-200 │ │201-300 │
|
|
└────────┘ └────────┘ └────────┘
|
|
```
|
|
|
|
### Methods
|
|
|
|
| Method | Description | Advantage | Disadvantage |
|
|
|--------|-------|--------|----------|
|
|
| **Hash-based** | `shard_id = hash(key) % N` | Even distribution | Loss of range queries |
|
|
| **Range-based** | Data by range (A-M, N-Z) | Preserves ordering | Hot spots |
|
|
| **Consistent hashing** | Hash ring, vnodes | Min. rebalancing when number of shards changes | More complex |
|
|
|
|
### Routing
|
|
|
|
- **Proxy-based** — application goes to proxy, which routes (Vitess, ProxySQL, mongos)
|
|
- **Client-side** — application knows which shard to target
|
|
- **DNS-based** — each shard has its own endpoint
|
|
|
|
---
|
|
|
|
## Data Consistency Patterns
|
|
|
|
| Pattern | Description | Example |
|
|
|---------|-------|---------|
|
|
| **Strong consistency** | After a write, every read sees the latest data | Single DB, Raft, Spanner |
|
|
| **Eventual consistency** | After a write, data propagates over time | DNS, DynamoDB (default), Cassandra |
|
|
| **Read-after-write** | The author always sees their own write (others are eventual) | Social networks, comments |
|
|
| **Causal consistency** | Causally dependent operations are seen in the correct order | COPS, Orbe, MongoDB (causal clusters) |
|
|
| **Monotonic reads** | You do not see older data after seeing newer data | Cassandra (MONOTONIC_READ consistency) |
|
|
| **Monotonic writes** | Writes from a single client are in order | Queue-based, single leader |
|
|
|
|
---
|
|
|
|
## Data Migration
|
|
|
|
### Schema Migration
|
|
|
|
```
|
|
V1__initial_schema.sql
|
|
V2__add_users_table.sql
|
|
V3__add_email_index.sql
|
|
V4__add_orders_table.sql
|
|
```
|
|
|
|
### Zero-Downtime Migration
|
|
|
|
1. **Expand** — add new column/table (application tolerates both states)
|
|
2. **Migrate** — backfill data, update application to new schema
|
|
3. **Contract** — remove old column/table
|
|
|
|
### Tools
|
|
|
|
| Tool | Language | Strategy | Zero-Downtime | Rollback |
|
|
|---------|-------|-----------|--------------|----------|
|
|
| **Flyway** | Java (multi-lang CLI) | Versioned SQL | Limited (additive only) | `undo` (limited, enterprise) |
|
|
| **Liquibase** | Java (multi-lang CLI) | Changesets (XML/YAML/JSON/SQL) | Yes (changeset design) | `rollback <count>` |
|
|
| **Alembic** | Python | Auto-generation, versioned | Yes (branching) | `downgrade` |
|
|
| **Prisma Migrate** | TypeScript | Declarative schema → diff | Yes (shadow DB) | `migrate diff` |
|
|
| **gh-ost** | Go | Triggerless online DDL (MySQL) | Yes (binlog stream) | No (progressive) |
|
|
| **pgroll** | Go | Online schema migration (PG) | Yes (views, multiple versions) | Yes (immediate) |
|
|
|
|
---
|
|
|
|
## SQL Antipatterns
|
|
|
|
Based on *More SQL Antipatterns* (Karwin, 2026) — 14 new antipatterns:
|
|
|
|
### Language Antipatterns
|
|
|
|
| Antipattern | Problem | Solution |
|
|
|-------------|---------|--------|
|
|
| **Fear of JOINs** | Manual pairing in application instead of JOIN | Use JOIN correctly |
|
|
| **Relational Division** | Finding sets in WHERE | Relational division (subquery with GROUP BY/HAVING) |
|
|
| **Pagination via OFFSET** | OFFSET is O(n) — the larger the offset, the slower | Keyset pagination (WHERE id > last_seen) |
|
|
| **Non-Sargable queries** | Functions on columns in WHERE (`WHERE YEAR(date) = 2026`) | Rewrite as range condition |
|
|
|
|
### Optimization Antipatterns
|
|
|
|
| Antipattern | Problem | Solution |
|
|
|-------------|---------|--------|
|
|
| **Premature denormalization** | Denormalization without reason | Measure, then optimize |
|
|
| **JSON overuse** | JSON as a universal solution | Use JSON only for genuinely flexible data |
|
|
| **Cacheless transactions** | Relying on query cache (removed in MySQL 8) | Application-level caching |
|
|
|
|
### Application Antipatterns
|
|
|
|
| Antipattern | Problem | Solution |
|
|
|-------------|---------|--------|
|
|
| **Polling** | Regularly querying for changes | LISTEN/NOTIFY, Kafka, Change Data Capture |
|
|
| **Transaction encapsulation** | Each model manages its own transaction | Unit of Work pattern |
|
|
| **Fear of deadlocks** | Trying to prevent all deadlocks | Mitigation, not prevention |
|
|
| **Data hoarding** | Storing everything forever | Data retention policies, archiving |
|
|
|
|
### Mini-Antipatterns
|
|
|
|
- `LIMIT` without `ORDER BY` — nondeterministic results
|
|
- `NATURAL JOIN` — fragile, implicit join condition
|
|
- `N+1 queries` — query in a loop instead of JOIN/batch
|
|
- Redundant indexes — duplicate/overlapping indexes unnecessarily slow writes
|
|
|
|
---
|
|
|
|
## Designing Data-Intensive Applications (2nd Edition)
|
|
|
|
*Kleppmann, Riccomini (2026)* — substantially revised edition.
|
|
|
|
### What's New Compared to 1st Edition
|
|
|
|
| Area | What's New |
|
|
|--------|-----------|
|
|
| **Cloud-native** | Storage = object store (S3, Blob), not local disk. Separation of control/data/compute plane |
|
|
| **AI workloads** | Vector indexes, DataFrames as a data model, batch processing for training data |
|
|
| **Local-first software** | DuckDB, PGlite, SQLite — databases running on laptop/edge, sync when connected |
|
|
| **Formal methods** | Randomized testing, formal verification (important for AI-generated code) |
|
|
| **Legal & ethics** | GDPR, ethics of predictive analytics, bias, algorithmic accountability |
|
|
| **Streaming → SQL views** | Materialize, incremental view maintenance — streaming as SQL |
|
|
|
|
### Key Principles (unchanged)
|
|
|
|
**Reliability**, **Scalability**, **Maintainability** — the three pillars of good data systems.
|
|
|
|
---
|
|
|
|
## Apache Iceberg Lakehouse
|
|
|
|
Based on *Architecting an Apache Iceberg Lakehouse* (Merced, 2026):
|
|
|
|
### What is a Data Lakehouse
|
|
|
|
An architecture combining the flexibility and low cost of a **data lake** (object storage) with the performance and governance of a **data warehouse**. Apache Iceberg is an open source table format.
|
|
|
|
### Iceberg Metadata Architecture
|
|
|
|
```
|
|
Table metadata (.metadata.json)
|
|
└── Snapshot manifest list
|
|
└── Manifests (file-level stats)
|
|
└── Data files (Parquet/ORC/Avro)
|
|
```
|
|
|
|
### Key Features
|
|
|
|
| Feature | Description |
|
|
|-----------|-------|
|
|
| **ACID transactions** | Safe concurrent read/write |
|
|
| **Schema evolution** | Add/drop/rename columns without rewrite |
|
|
| **Time travel** | Query historical snapshots |
|
|
| **Partition evolution** | Change partition strategy without data rewrite |
|
|
| **Hidden partitioning** | Automatic partition filters (user does not need to specify) |
|
|
| **Multi-engine** | Spark, Flink, Trino, Dremio, Snowflake over the same data |
|
|
|
|
### When to Use Iceberg
|
|
|
|
- Multi-tool access to the same governed data
|
|
- ACID on lake data
|
|
- Streaming + batch in a single table
|
|
- Reducing duplication (one canonical copy instead of ETL to warehouse)
|
|
|
|
---
|
|
|
|
## Best Practices
|
|
|
|
- **Connection pooling** — PgBouncer, RDS Proxy, ProxySQL
|
|
- **Indexing based on query patterns** — do not have unnecessary indexes
|
|
- **Read replicas** for reporting and analytics
|
|
- **Backup & recovery** — point-in-time recovery (PITR), regular tests
|
|
- **Query monitoring** — slow query log, pg_stat_statements, performance_schema
|
|
- **Encryption at rest & in transit**
|
|
- **Migrations in CI/CD** — part of the pipeline, not manual
|
|
- **Choose DB based on workload** — no single universal DB (polyglot persistence)
|
|
|
|
---
|
|
|
|
## Database License Model Comparison
|
|
|
|
| DB | License | Price (self-hosted) | Price (managed cloud) | Vendor lock-in | Note |
|
|
|----|---------|-------------------|---------------------|----------------|----------|
|
|
| **PostgreSQL** | PostgreSQL license (MIT-like) | $0 | ~$0.10-1.00/hr (RDS, CloudSQL, Aurora) | Low | Fully open source, no restrictions |
|
|
| **MySQL** | GPL v2 / Commercial (Oracle) | $0 (GPL) / ~$2,000/server/year (commercial) | ~$0.10-1.00/hr (RDS, PlanetScale) | Medium (Oracle owned) | GPL = need to release application? (depends on distribution) |
|
|
| **MariaDB** | GPL v2 / Business Source | $0 (GPL) | ~$0.10-1.00/hr (SkySQL) | Low | Fully compatible MySQL fork, no Oracle influence |
|
|
| **Oracle SE2** | Proprietary (per core) | ~$17,500/core + 22% support/year | ~$1-5/hr (RDS, OCI) | High | Core factor 0.5 (EPYC/Xeon), max 16 threads |
|
|
| **Oracle EE** | Proprietary (per core + options) | ~$47,500/core + options + 22% support | ~$2-30/hr (OCI, RDS) | High | Options double the price (RAC, partitioning, compression) |
|
|
| **SQL Server Standard** | Proprietary (per core + CAL) | ~$1,000/core + $200/CAL | ~$0.20-1.00/hr (Azure SQL) | Medium | Windows Server license required additionally |
|
|
| **SQL Server Enterprise** | Proprietary (per core + CAL) | ~$7,000/core + $200/CAL | ~$1-5/hr (Azure SQL) | Medium | AlwaysOn, partitioning, in-memory OLTP |
|
|
| **MongoDB** | SSPL (Community) / Commercial (Enterprise) | $0 (Community) / ~$10k/server/year (Enterprise) | ~$0.10-5.00/hr (Atlas) | Medium | SSPL restricts managed cloud services |
|
|
| **Redis** | RSALv2 + SSPL (7.4+) / BSD (Valkey) | $0 (Valkey) | ~$0.10-1.00/hr (ElastiCache, Memorystore → Valkey) | Low (Valkey) | Redis 7.4+ license change → Valkey fork |
|
|
| **Cassandra** | Apache 2.0 | $0 | ~$0.10-1.00/hr (Keyspaces, Amazon Managed) | Low | Fully open source, no restrictions |
|
|
| **ScyllaDB** | Apache 2.0 (OSS) / Enterprise | $0 (OSS) / Enterprise subscription | ~$0.50-3.00/hr (ScyllaDB Cloud) | Low (OSS) | Enterprise: monitoring, security, support |
|
|
| **CockroachDB** | BSL (Business Source License) / Enterprise | $0 (core) / Enterprise subscription | ~$0.50-3.00/hr (CockroachDB Cloud) | Medium | BSL: converts to MIT after 3 years. Enterprise: multi-region, backup |
|
|
|
|
**Key Recommendations**:
|
|
- **Lowest TCO**: PostgreSQL (no license, broadest cloud support)
|
|
- **Highest vendor lock-in**: Oracle (PL/SQL, proprietary options, expensive migration)
|
|
- **License risk**: Redis (license change) → use Valkey for new projects
|
|
- **Cloud-native licensing**: MongoDB Atlas, CockroachDB Cloud, ScyllaDB Cloud — pay-per-use, no license management
|
|
|
|
## Resources
|
|
|
|
Links, books and standards: [sources/databases/sources.md](sources/databases/sources.md)
|
|
|
|
### Recommended Reading
|
|
|
|
| Book | Authors | ISBN | Key Takeaway |
|
|
|-------|--------|------|----------------|
|
|
| Database Internals | Alex Petrov | 978-1492040346 | In-depth explanation of storage engines (B-Tree, LSM-Tree, WAL, MVCC), distributed systems |
|
|
| Designing Data-Intensive Applications (2nd ed.) | Kleppmann, Riccomini | — | Cloud-native, AI, local-first, formal methods |
|
|
| High Performance MySQL (4th ed.) | Schwartz, Zaitsev, Tkachenko | 978-1492075292 | MySQL architecture, schema/index optimization |
|
|
| Expert Oracle Architecture (3rd ed.) | Kyte, Kuhn | 978-1484249602 | Oracle architecture, RAC, Data Guard, tuning |
|
|
| AI-Ready PostgreSQL 18 | Kumar, Linster | — | PostgreSQL as a unified platform for AI |
|
|
| More SQL Antipatterns | Bill Karwin (2026) | — | 14 antipatterns, keyset pagination |
|
|
| Vector Databases | Borwankar (2026) | — | Embeddings, vector indexes, RAG |
|
|
| Architecting an Apache Iceberg Lakehouse | Merced (2026) | — | Lakehouse architecture, Iceberg metadata |
|
|
|
|
*Last revision: 2026-06-03*
|