102 lines
4.9 KiB
Markdown
102 lines
4.9 KiB
Markdown
# ⚙️ Storage Engines and Transaction Models
|
|
|
|
## B-Tree vs LSM-Tree
|
|
|
|
Two dominant storage engine approaches in modern databases.
|
|
|
|
| Property | B-Tree | LSM-Tree |
|
|
|-----------|--------|----------|
|
|
| **Write** | In-place update (random I/O on page) | Append-only (sequential I/O) |
|
|
| **Read** | Fast (directly in page, O(log N)) | Slower (merge from multiple SSTables, bloom filters) |
|
|
| **Write amplification** | Lower (page rewrite) | Higher (compaction, SSTable merge) |
|
|
| **Read amplification** | Lower (1 page read) | Higher (multiple SSTables to search) |
|
|
| **Compression** | Worse (page fragmentation) | Better (compact SSTable, block compression) |
|
|
| **Range scan** | Fast (linked list at leaf level) | Fast (SSTables are sorted) |
|
|
| **Space amplification** | Low | Higher (awaits compaction) |
|
|
| **Typical DBs** | PostgreSQL, MySQL (InnoDB), SQLite, Oracle | Cassandra, RocksDB, LevelDB, ScyllaDB, MongoDB (WiredTiger) |
|
|
|
|
### When to Choose Which Engine
|
|
|
|
**B-Tree** — when:
|
|
- You need fast point lookups (PK lookup, unique ID)
|
|
- Workload is read-heavy (most queries = SELECT by key)
|
|
- You need range queries on primary key
|
|
- Transactional workload (OLTP) with short queries
|
|
|
|
**LSM-Tree** — when:
|
|
- You need high write throughput (write-heavy)
|
|
- Append-only workload (logs, time-series, IoT)
|
|
- Data compression is important (saves space)
|
|
- Write amplification is not a concern (sufficient I/O capacity)
|
|
|
|
## Write-Ahead Log (WAL)
|
|
|
|
Append-only log guaranteeing that no operation is lost on crash:
|
|
|
|
```text
|
|
1. Transaction BEGIN → WAL entry
|
|
2. Data modification → WAL entry (before page modification)
|
|
3. Transaction COMMIT → flush WAL to disk (COMMIT confirmed only after flush)
|
|
4. Checkpoint → flush dirty pages → WAL up to checkpoint point can be deleted
|
|
```
|
|
|
|
- **Write-ahead** — WAL is written before the data page
|
|
- **Checkpoint** — point from which WAL is needed for recovery
|
|
- **Redo log** (InnoDB) — similar concept, used to replay missing changes
|
|
- **Group commit** — multiple transactions flush WAL at once (higher throughput)
|
|
|
|
## MVCC (Multi-Version Concurrency Control)
|
|
|
|
Each transaction sees a snapshot of data as of the start time. Old row versions remain in the table.
|
|
|
|
### Implementations
|
|
|
|
| DB | Mechanism | Vacuum/GC | Isolation Levels |
|
|
|----|------------|-----------|-----------------|
|
|
| **PostgreSQL** | Heap tuple (xmin/xmax) — old versions in main table | VACUUM (autovacuum) | RU, RC, RR, Serializable (SSI) |
|
|
| **MySQL InnoDB** | Undo log — old versions in undo segments | Purge (automatic) | RU, RC, RR, Serializable |
|
|
| **MSSQL** | Tempdb version store | Automatic (row versioning) | RC (snapshot), Serializable |
|
|
| **Oracle** | Undo tablespace | Automatic (undo retention) | RC, Serializable, Read-only |
|
|
| **MongoDB WiredTiger** | MVCC at document level | Automatic (eviction) | Snapshot isolation |
|
|
| **Cassandra** | No MVCC (value overwrite) | Compaction (merge SSTable) | — |
|
|
|
|
### Anomalies
|
|
|
|
| Level | Dirty Read | Non-repeatable Read | Phantom Read | Serialization Anomaly |
|
|
|--------|-----------|---------------------|-------------|----------------------|
|
|
| **Read Uncommitted** | Yes | Yes | Yes | Yes |
|
|
| **Read Committed** | No | Yes | Yes | Yes |
|
|
| **Repeatable Read** | No | No | No (PG: no, MySQL: next-key locking) | Yes |
|
|
| **Serializable** | No | No | No | No |
|
|
|
|
- **Dirty Read** — reading data from an uncommitted transaction
|
|
- **Non-repeatable Read** — same query returns different data
|
|
- **Phantom Read** — same query returns new rows
|
|
- **Serialization Anomaly** — result of transactions is not equivalent to any serial order
|
|
|
|
## Index Types
|
|
|
|
| Type | Algorithm | Use Case | DB Support |
|
|
|-----|-----------|----------|------------|
|
|
| **B-tree** | Balanced tree | `=`, `<`, `>`, `BETWEEN`, `IN`, `LIKE (prefix)` | All (default) |
|
|
| **Hash** | Hash table | Only `=` (equality) | PostgreSQL (hash index), MySQL (MEMORY) |
|
|
| **GiST** | Generalized Search Tree | Geometry, full-text, intervals, IP ranges | PostgreSQL |
|
|
| **GIN** | Generalized Inverted Index | JSONB, arrays, full-text (contains, overlaps) | PostgreSQL |
|
|
| **BRIN** | Block Range Index | Time-series, logs (data in order) — extremely small | PostgreSQL |
|
|
| **SP-GiST** | Space-partitioned | Quadrants, KD-tree, radix tree | PostgreSQL |
|
|
| **R-tree** | Spatial tree | Geospatial data | MySQL (MyISAM/InnoDB), SQLite |
|
|
| **Clustered index** | B-tree + data in leaves | PK lookup (InnoDB) — data stored with index | MySQL InnoDB, MSSQL |
|
|
| **Full-text** | Inverted index | Text search (stemming, relevance) | MySQL, PostgreSQL, MSSQL |
|
|
|
|
## Resources
|
|
|
|
Links, books and standards: [sources/databases/sources.en.md](sources/databases/sources.en.md)
|
|
|
|
### Recommended Reading
|
|
|
|
| Book | Authors | ISBN | Description |
|
|
|-------|--------|------|-------|
|
|
| Database Internals | Alex Petrov | 978-1492040346 | In-depth explanation of storage engines (B-Tree, LSM-Tree, WAL, MVCC), distributed systems (partitioning, replication, consensus) |
|
|
|
|
*Last revision: 2026-06-03*
|