Files
knowledge-base/DATABASE-ENGINES.en.md
Stanislav Hubacek 3fa11ef0f6 comiiit
2026-06-11 15:27:28 +02:00

4.9 KiB

⚙️ Storage Engines and Transaction Models

B-Tree vs LSM-Tree

Two dominant storage engine approaches in modern databases.

Property B-Tree LSM-Tree
Write In-place update (random I/O on page) Append-only (sequential I/O)
Read Fast (directly in page, O(log N)) Slower (merge from multiple SSTables, bloom filters)
Write amplification Lower (page rewrite) Higher (compaction, SSTable merge)
Read amplification Lower (1 page read) Higher (multiple SSTables to search)
Compression Worse (page fragmentation) Better (compact SSTable, block compression)
Range scan Fast (linked list at leaf level) Fast (SSTables are sorted)
Space amplification Low Higher (awaits compaction)
Typical DBs PostgreSQL, MySQL (InnoDB), SQLite, Oracle Cassandra, RocksDB, LevelDB, ScyllaDB, MongoDB (WiredTiger)

When to Choose Which Engine

B-Tree — when:

  • You need fast point lookups (PK lookup, unique ID)
  • Workload is read-heavy (most queries = SELECT by key)
  • You need range queries on primary key
  • Transactional workload (OLTP) with short queries

LSM-Tree — when:

  • You need high write throughput (write-heavy)
  • Append-only workload (logs, time-series, IoT)
  • Data compression is important (saves space)
  • Write amplification is not a concern (sufficient I/O capacity)

Write-Ahead Log (WAL)

Append-only log guaranteeing that no operation is lost on crash:

1. Transaction BEGIN → WAL entry
2. Data modification → WAL entry (before page modification)
3. Transaction COMMIT → flush WAL to disk (COMMIT confirmed only after flush)
4. Checkpoint → flush dirty pages → WAL up to checkpoint point can be deleted
  • Write-ahead — WAL is written before the data page
  • Checkpoint — point from which WAL is needed for recovery
  • Redo log (InnoDB) — similar concept, used to replay missing changes
  • Group commit — multiple transactions flush WAL at once (higher throughput)

MVCC (Multi-Version Concurrency Control)

Each transaction sees a snapshot of data as of the start time. Old row versions remain in the table.

Implementations

DB Mechanism Vacuum/GC Isolation Levels
PostgreSQL Heap tuple (xmin/xmax) — old versions in main table VACUUM (autovacuum) RU, RC, RR, Serializable (SSI)
MySQL InnoDB Undo log — old versions in undo segments Purge (automatic) RU, RC, RR, Serializable
MSSQL Tempdb version store Automatic (row versioning) RC (snapshot), Serializable
Oracle Undo tablespace Automatic (undo retention) RC, Serializable, Read-only
MongoDB WiredTiger MVCC at document level Automatic (eviction) Snapshot isolation
Cassandra No MVCC (value overwrite) Compaction (merge SSTable)

Anomalies

Level Dirty Read Non-repeatable Read Phantom Read Serialization Anomaly
Read Uncommitted Yes Yes Yes Yes
Read Committed No Yes Yes Yes
Repeatable Read No No No (PG: no, MySQL: next-key locking) Yes
Serializable No No No No
  • Dirty Read — reading data from an uncommitted transaction
  • Non-repeatable Read — same query returns different data
  • Phantom Read — same query returns new rows
  • Serialization Anomaly — result of transactions is not equivalent to any serial order

Index Types

Type Algorithm Use Case DB Support
B-tree Balanced tree =, <, >, BETWEEN, IN, LIKE (prefix) All (default)
Hash Hash table Only = (equality) PostgreSQL (hash index), MySQL (MEMORY)
GiST Generalized Search Tree Geometry, full-text, intervals, IP ranges PostgreSQL
GIN Generalized Inverted Index JSONB, arrays, full-text (contains, overlaps) PostgreSQL
BRIN Block Range Index Time-series, logs (data in order) — extremely small PostgreSQL
SP-GiST Space-partitioned Quadrants, KD-tree, radix tree PostgreSQL
R-tree Spatial tree Geospatial data MySQL (MyISAM/InnoDB), SQLite
Clustered index B-tree + data in leaves PK lookup (InnoDB) — data stored with index MySQL InnoDB, MSSQL
Full-text Inverted index Text search (stemming, relevance) MySQL, PostgreSQL, MSSQL

Resources

Links, books and standards: sources/databases/sources.md

Book Authors ISBN Description
Database Internals Alex Petrov 978-1492040346 In-depth explanation of storage engines (B-Tree, LSM-Tree, WAL, MVCC), distributed systems (partitioning, replication, consensus)

Last revision: 2026-06-03