knowledge-base/DATABASE-ENGINES.en.md

# ⚙️ Storage Engines and Transaction Models

## B-Tree vs LSM-Tree

Two dominant storage engine approaches in modern databases.

| Property | B-Tree | LSM-Tree |
|-----------|--------|----------|
| **Write** | In-place update (random I/O on page) | Append-only (sequential I/O) |
| **Read** | Fast (directly in page, O(log N)) | Slower (merge from multiple SSTables, bloom filters) |
| **Write amplification** | Lower (page rewrite) | Higher (compaction, SSTable merge) |
| **Read amplification** | Lower (1 page read) | Higher (multiple SSTables to search) |
| **Compression** | Worse (page fragmentation) | Better (compact SSTable, block compression) |
| **Range scan** | Fast (linked list at leaf level) | Fast (SSTables are sorted) |
| **Space amplification** | Low | Higher (awaits compaction) |
| **Typical DBs** | PostgreSQL, MySQL (InnoDB), SQLite, Oracle | Cassandra, RocksDB, LevelDB, ScyllaDB, MongoDB (WiredTiger) |

### When to Choose Which Engine

**B-Tree** — when:
- You need fast point lookups (PK lookup, unique ID)
- Workload is read-heavy (most queries = SELECT by key)
- You need range queries on primary key
- Transactional workload (OLTP) with short queries

**LSM-Tree** — when:
- You need high write throughput (write-heavy)
- Append-only workload (logs, time-series, IoT)
- Data compression is important (saves space)
- Write amplification is not a concern (sufficient I/O capacity)

## Write-Ahead Log (WAL)

Append-only log guaranteeing that no operation is lost on crash:

```text
1. Transaction BEGIN → WAL entry
2. Data modification → WAL entry (before page modification)
3. Transaction COMMIT → flush WAL to disk (COMMIT confirmed only after flush)
4. Checkpoint → flush dirty pages → WAL up to checkpoint point can be deleted
```

- **Write-ahead** — WAL is written before the data page
- **Checkpoint** — point from which WAL is needed for recovery
- **Redo log** (InnoDB) — similar concept, used to replay missing changes
- **Group commit** — multiple transactions flush WAL at once (higher throughput)

## MVCC (Multi-Version Concurrency Control)

Each transaction sees a snapshot of data as of the start time. Old row versions remain in the table.

### Implementations

| DB | Mechanism | Vacuum/GC | Isolation Levels |
|----|------------|-----------|-----------------|
| **PostgreSQL** | Heap tuple (xmin/xmax) — old versions in main table | VACUUM (autovacuum) | RU, RC, RR, Serializable (SSI) |
| **MySQL InnoDB** | Undo log — old versions in undo segments | Purge (automatic) | RU, RC, RR, Serializable |
| **MSSQL** | Tempdb version store | Automatic (row versioning) | RC (snapshot), Serializable |
| **Oracle** | Undo tablespace | Automatic (undo retention) | RC, Serializable, Read-only |
| **MongoDB WiredTiger** | MVCC at document level | Automatic (eviction) | Snapshot isolation |
| **Cassandra** | No MVCC (value overwrite) | Compaction (merge SSTable) | — |

### Anomalies

| Level | Dirty Read | Non-repeatable Read | Phantom Read | Serialization Anomaly |
|--------|-----------|---------------------|-------------|----------------------|
| **Read Uncommitted** | Yes | Yes | Yes | Yes |
| **Read Committed** | No | Yes | Yes | Yes |
| **Repeatable Read** | No | No | No (PG: no, MySQL: next-key locking) | Yes |
| **Serializable** | No | No | No | No |

- **Dirty Read** — reading data from an uncommitted transaction
- **Non-repeatable Read** — same query returns different data
- **Phantom Read** — same query returns new rows
- **Serialization Anomaly** — result of transactions is not equivalent to any serial order

## Index Types

| Type | Algorithm | Use Case | DB Support |
|-----|-----------|----------|------------|
| **B-tree** | Balanced tree | `=`, `<`, `>`, `BETWEEN`, `IN`, `LIKE (prefix)` | All (default) |
| **Hash** | Hash table | Only `=` (equality) | PostgreSQL (hash index), MySQL (MEMORY) |
| **GiST** | Generalized Search Tree | Geometry, full-text, intervals, IP ranges | PostgreSQL |
| **GIN** | Generalized Inverted Index | JSONB, arrays, full-text (contains, overlaps) | PostgreSQL |
| **BRIN** | Block Range Index | Time-series, logs (data in order) — extremely small | PostgreSQL |
| **SP-GiST** | Space-partitioned | Quadrants, KD-tree, radix tree | PostgreSQL |
| **R-tree** | Spatial tree | Geospatial data | MySQL (MyISAM/InnoDB), SQLite |
| **Clustered index** | B-tree + data in leaves | PK lookup (InnoDB) — data stored with index | MySQL InnoDB, MSSQL |
| **Full-text** | Inverted index | Text search (stemming, relevance) | MySQL, PostgreSQL, MSSQL |

## Resources

Links, books and standards: [sources/databases/sources.md](sources/databases/sources.md)

### Recommended Reading

| Book | Authors | ISBN | Description |
|-------|--------|------|-------|
| Database Internals | Alex Petrov | 978-1492040346 | In-depth explanation of storage engines (B-Tree, LSM-Tree, WAL, MVCC), distributed systems (partitioning, replication, consensus) |

*Last revision: 2026-06-03*