# ⚙️ Storage Engines and Transaction Models ## B-Tree vs LSM-Tree Two dominant storage engine approaches in modern databases. | Property | B-Tree | LSM-Tree | |-----------|--------|----------| | **Write** | In-place update (random I/O on page) | Append-only (sequential I/O) | | **Read** | Fast (directly in page, O(log N)) | Slower (merge from multiple SSTables, bloom filters) | | **Write amplification** | Lower (page rewrite) | Higher (compaction, SSTable merge) | | **Read amplification** | Lower (1 page read) | Higher (multiple SSTables to search) | | **Compression** | Worse (page fragmentation) | Better (compact SSTable, block compression) | | **Range scan** | Fast (linked list at leaf level) | Fast (SSTables are sorted) | | **Space amplification** | Low | Higher (awaits compaction) | | **Typical DBs** | PostgreSQL, MySQL (InnoDB), SQLite, Oracle | Cassandra, RocksDB, LevelDB, ScyllaDB, MongoDB (WiredTiger) | ### When to Choose Which Engine **B-Tree** — when: - You need fast point lookups (PK lookup, unique ID) - Workload is read-heavy (most queries = SELECT by key) - You need range queries on primary key - Transactional workload (OLTP) with short queries **LSM-Tree** — when: - You need high write throughput (write-heavy) - Append-only workload (logs, time-series, IoT) - Data compression is important (saves space) - Write amplification is not a concern (sufficient I/O capacity) ## Write-Ahead Log (WAL) Append-only log guaranteeing that no operation is lost on crash: ```text 1. Transaction BEGIN → WAL entry 2. Data modification → WAL entry (before page modification) 3. Transaction COMMIT → flush WAL to disk (COMMIT confirmed only after flush) 4. Checkpoint → flush dirty pages → WAL up to checkpoint point can be deleted ``` - **Write-ahead** — WAL is written before the data page - **Checkpoint** — point from which WAL is needed for recovery - **Redo log** (InnoDB) — similar concept, used to replay missing changes - **Group commit** — multiple transactions flush WAL at once (higher throughput) ## MVCC (Multi-Version Concurrency Control) Each transaction sees a snapshot of data as of the start time. Old row versions remain in the table. ### Implementations | DB | Mechanism | Vacuum/GC | Isolation Levels | |----|------------|-----------|-----------------| | **PostgreSQL** | Heap tuple (xmin/xmax) — old versions in main table | VACUUM (autovacuum) | RU, RC, RR, Serializable (SSI) | | **MySQL InnoDB** | Undo log — old versions in undo segments | Purge (automatic) | RU, RC, RR, Serializable | | **MSSQL** | Tempdb version store | Automatic (row versioning) | RC (snapshot), Serializable | | **Oracle** | Undo tablespace | Automatic (undo retention) | RC, Serializable, Read-only | | **MongoDB WiredTiger** | MVCC at document level | Automatic (eviction) | Snapshot isolation | | **Cassandra** | No MVCC (value overwrite) | Compaction (merge SSTable) | — | ### Anomalies | Level | Dirty Read | Non-repeatable Read | Phantom Read | Serialization Anomaly | |--------|-----------|---------------------|-------------|----------------------| | **Read Uncommitted** | Yes | Yes | Yes | Yes | | **Read Committed** | No | Yes | Yes | Yes | | **Repeatable Read** | No | No | No (PG: no, MySQL: next-key locking) | Yes | | **Serializable** | No | No | No | No | - **Dirty Read** — reading data from an uncommitted transaction - **Non-repeatable Read** — same query returns different data - **Phantom Read** — same query returns new rows - **Serialization Anomaly** — result of transactions is not equivalent to any serial order ## Index Types | Type | Algorithm | Use Case | DB Support | |-----|-----------|----------|------------| | **B-tree** | Balanced tree | `=`, `<`, `>`, `BETWEEN`, `IN`, `LIKE (prefix)` | All (default) | | **Hash** | Hash table | Only `=` (equality) | PostgreSQL (hash index), MySQL (MEMORY) | | **GiST** | Generalized Search Tree | Geometry, full-text, intervals, IP ranges | PostgreSQL | | **GIN** | Generalized Inverted Index | JSONB, arrays, full-text (contains, overlaps) | PostgreSQL | | **BRIN** | Block Range Index | Time-series, logs (data in order) — extremely small | PostgreSQL | | **SP-GiST** | Space-partitioned | Quadrants, KD-tree, radix tree | PostgreSQL | | **R-tree** | Spatial tree | Geospatial data | MySQL (MyISAM/InnoDB), SQLite | | **Clustered index** | B-tree + data in leaves | PK lookup (InnoDB) — data stored with index | MySQL InnoDB, MSSQL | | **Full-text** | Inverted index | Text search (stemming, relevance) | MySQL, PostgreSQL, MSSQL | ## Resources Links, books and standards: [sources/databases/sources.en.md](sources/databases/sources.en.md) ### Recommended Reading | Book | Authors | ISBN | Description | |-------|--------|------|-------| | Database Internals | Alex Petrov | 978-1492040346 | In-depth explanation of storage engines (B-Tree, LSM-Tree, WAL, MVCC), distributed systems (partitioning, replication, consensus) | *Last revision: 2026-06-03*