4.9 KiB
4.9 KiB
⚙️ Storage Engines and Transaction Models
B-Tree vs LSM-Tree
Two dominant storage engine approaches in modern databases.
| Property | B-Tree | LSM-Tree |
|---|---|---|
| Write | In-place update (random I/O on page) | Append-only (sequential I/O) |
| Read | Fast (directly in page, O(log N)) | Slower (merge from multiple SSTables, bloom filters) |
| Write amplification | Lower (page rewrite) | Higher (compaction, SSTable merge) |
| Read amplification | Lower (1 page read) | Higher (multiple SSTables to search) |
| Compression | Worse (page fragmentation) | Better (compact SSTable, block compression) |
| Range scan | Fast (linked list at leaf level) | Fast (SSTables are sorted) |
| Space amplification | Low | Higher (awaits compaction) |
| Typical DBs | PostgreSQL, MySQL (InnoDB), SQLite, Oracle | Cassandra, RocksDB, LevelDB, ScyllaDB, MongoDB (WiredTiger) |
When to Choose Which Engine
B-Tree — when:
- You need fast point lookups (PK lookup, unique ID)
- Workload is read-heavy (most queries = SELECT by key)
- You need range queries on primary key
- Transactional workload (OLTP) with short queries
LSM-Tree — when:
- You need high write throughput (write-heavy)
- Append-only workload (logs, time-series, IoT)
- Data compression is important (saves space)
- Write amplification is not a concern (sufficient I/O capacity)
Write-Ahead Log (WAL)
Append-only log guaranteeing that no operation is lost on crash:
1. Transaction BEGIN → WAL entry
2. Data modification → WAL entry (before page modification)
3. Transaction COMMIT → flush WAL to disk (COMMIT confirmed only after flush)
4. Checkpoint → flush dirty pages → WAL up to checkpoint point can be deleted
- Write-ahead — WAL is written before the data page
- Checkpoint — point from which WAL is needed for recovery
- Redo log (InnoDB) — similar concept, used to replay missing changes
- Group commit — multiple transactions flush WAL at once (higher throughput)
MVCC (Multi-Version Concurrency Control)
Each transaction sees a snapshot of data as of the start time. Old row versions remain in the table.
Implementations
| DB | Mechanism | Vacuum/GC | Isolation Levels |
|---|---|---|---|
| PostgreSQL | Heap tuple (xmin/xmax) — old versions in main table | VACUUM (autovacuum) | RU, RC, RR, Serializable (SSI) |
| MySQL InnoDB | Undo log — old versions in undo segments | Purge (automatic) | RU, RC, RR, Serializable |
| MSSQL | Tempdb version store | Automatic (row versioning) | RC (snapshot), Serializable |
| Oracle | Undo tablespace | Automatic (undo retention) | RC, Serializable, Read-only |
| MongoDB WiredTiger | MVCC at document level | Automatic (eviction) | Snapshot isolation |
| Cassandra | No MVCC (value overwrite) | Compaction (merge SSTable) | — |
Anomalies
| Level | Dirty Read | Non-repeatable Read | Phantom Read | Serialization Anomaly |
|---|---|---|---|---|
| Read Uncommitted | Yes | Yes | Yes | Yes |
| Read Committed | No | Yes | Yes | Yes |
| Repeatable Read | No | No | No (PG: no, MySQL: next-key locking) | Yes |
| Serializable | No | No | No | No |
- Dirty Read — reading data from an uncommitted transaction
- Non-repeatable Read — same query returns different data
- Phantom Read — same query returns new rows
- Serialization Anomaly — result of transactions is not equivalent to any serial order
Index Types
| Type | Algorithm | Use Case | DB Support |
|---|---|---|---|
| B-tree | Balanced tree | =, <, >, BETWEEN, IN, LIKE (prefix) |
All (default) |
| Hash | Hash table | Only = (equality) |
PostgreSQL (hash index), MySQL (MEMORY) |
| GiST | Generalized Search Tree | Geometry, full-text, intervals, IP ranges | PostgreSQL |
| GIN | Generalized Inverted Index | JSONB, arrays, full-text (contains, overlaps) | PostgreSQL |
| BRIN | Block Range Index | Time-series, logs (data in order) — extremely small | PostgreSQL |
| SP-GiST | Space-partitioned | Quadrants, KD-tree, radix tree | PostgreSQL |
| R-tree | Spatial tree | Geospatial data | MySQL (MyISAM/InnoDB), SQLite |
| Clustered index | B-tree + data in leaves | PK lookup (InnoDB) — data stored with index | MySQL InnoDB, MSSQL |
| Full-text | Inverted index | Text search (stemming, relevance) | MySQL, PostgreSQL, MSSQL |
Resources
Links, books and standards: sources/databases/sources.md
Recommended Reading
| Book | Authors | ISBN | Description |
|---|---|---|---|
| Database Internals | Alex Petrov | 978-1492040346 | In-depth explanation of storage engines (B-Tree, LSM-Tree, WAL, MVCC), distributed systems (partitioning, replication, consensus) |
Last revision: 2026-06-03