commit

2026-06-11 15:25:40 +02:00
parent 95d1839f05
commit 3fa11ef0f6
50 changed files with 9336 additions and 33 deletions
--- a/DATABASE-ENGINES.en.md
+++ b/DATABASE-ENGINES.en.md
@@ -0,0 +1,101 @@
+# ⚙️ Storage Engines and Transaction Models
+
+## B-Tree vs LSM-Tree
+
+B-trees and LSM-trees are the two dominant storage engine designs in modern databases.
+
+| Property | B-Tree | LSM-Tree |
+|-----------|--------|----------|
+| **Write** | In-place update (random I/O on page) | Append-only (sequential I/O) |
+| **Read** | Fast (one root-to-leaf traversal, O(log N)) | Slower (may check the memtable and several SSTables; Bloom filters skip most of them) |
+| **Write amplification** | Lower (page rewrite) | Higher (compaction, SSTable merge) |
+| **Read amplification** | Lower (one root-to-leaf path; upper levels are usually cached) | Higher (multiple SSTables to search) |
+| **Compression** | Worse (page fragmentation) | Better (compact SSTable, block compression) |
+| **Range scan** | Fast (leaf pages are linked) | Good (each SSTable is sorted, but a scan merges several) |
+| **Space amplification** | Low | Higher (obsolete versions remain until compaction) |
+| **Typical DBs** | PostgreSQL, MySQL (InnoDB), SQLite, Oracle, MongoDB (WiredTiger default) | Cassandra, RocksDB, LevelDB, ScyllaDB |
+
+### When to Choose Which Engine
+
+**B-Tree** — when:
+- You need fast point lookups (PK lookup, unique ID)
+- Workload is read-heavy (most queries = SELECT by key)
+- You need range queries on primary key
+- Transactional workload (OLTP) with short queries
+
+**LSM-Tree** — when:
+- You need high write throughput (write-heavy)
+- Append-only workload (logs, time-series, IoT)
+- Data compression is important (saves space)
+- Write amplification is not a concern (sufficient I/O capacity)
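+
+The trade-off above can be made concrete with a toy model. This is a minimal sketch, not how any of the listed engines is implemented: `TinyLSM`, `memtable_limit`, and the method names are hypothetical, and real engines add a WAL, Bloom filters, and leveled or tiered compaction.
+
+```python
+import bisect
+
+
+class TinyLSM:
+    """Toy LSM-tree: one in-memory memtable plus immutable sorted runs."""
+
+    def __init__(self, memtable_limit=2):
+        self.memtable = {}
+        self.sstables = []  # newest first; each run is a sorted list of (key, value)
+        self.memtable_limit = memtable_limit
+
+    def put(self, key, value):
+        # Writes only touch memory; nothing on "disk" is updated in place.
+        self.memtable[key] = value
+        if len(self.memtable) >= self.memtable_limit:
+            self.flush()
+
+    def flush(self):
+        # Turn the memtable into a new sorted run (sequential write).
+        if self.memtable:
+            self.sstables.insert(0, sorted(self.memtable.items()))
+            self.memtable = {}
+
+    def get(self, key):
+        if key in self.memtable:
+            return self.memtable[key]
+        for table in self.sstables:  # newest run wins
+            i = bisect.bisect_left(table, (key,))
+            if i < len(table) and table[i][0] == key:
+                return table[i][1]
+        return None
+
+    def compact(self):
+        # Merge all runs into one, keeping only the newest value per key.
+        merged = {}
+        for table in reversed(self.sstables):  # oldest first, newer overwrite
+            merged.update(table)
+        self.sstables = [sorted(merged.items())] if merged else []
+```
+
+A read may have to probe every run until `compact()` merges them, which is the read amplification listed in the table; compaction in turn rewrites data that was already written once.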
+
+## Write-Ahead Log (WAL)
+
+An append-only log that guarantees no committed change is lost in a crash:
+
+```text
+1. Transaction BEGIN → WAL entry
+2. Data modification → WAL entry (before page modification)
+3. Transaction COMMIT → flush WAL to disk (COMMIT confirmed only after flush)
+4. Checkpoint → flush dirty pages → WAL older than the checkpoint can be recycled or deleted
+```
+
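+- **Write-ahead**: a change is logged to the WAL before the modified data page is written to disk
+- **Checkpoint**: all pages dirtied before this point have been flushed, so crash recovery replays the WAL only from the last checkpoint onward
+- **Redo log** (InnoDB): the same concept; replayed during crash recovery to reapply changes that had not yet reached the data files
+- **Group commit**: commits from multiple transactions share a single WAL flush (higher throughput)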
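+
+The commit and recovery steps can be sketched as follows. This is an illustration under simplifying assumptions, not any engine's actual log format: `TinyWAL`, `recover`, and the JSON-lines record layout are hypothetical, and checkpoints, torn writes, and group commit are left out.
+
+```python
+import json
+import os
+
+
+class TinyWAL:
+    """Toy write-ahead log: log first, fsync on commit, then touch pages."""
+
+    def __init__(self, path):
+        self.pages = {}  # stand-in for data pages
+        self.log = open(path, "a", encoding="utf-8")
+
+    def _append(self, record):
+        self.log.write(json.dumps(record) + "\n")
+
+    def commit(self, txid, changes):
+        for key, value in changes.items():
+            self._append({"tx": txid, "op": "set", "key": key, "value": value})
+        self._append({"tx": txid, "op": "commit"})
+        self.log.flush()
+        os.fsync(self.log.fileno())  # the commit is durable only after this returns
+        self.pages.update(changes)   # pages are modified after the log is safe
+
+
+def recover(path):
+    """Rebuild page state by replaying only transactions that committed."""
+    pending, pages = {}, {}
+    with open(path, encoding="utf-8") as log:
+        for line in log:
+            rec = json.loads(line)
+            if rec["op"] == "set":
+                pending.setdefault(rec["tx"], {})[rec["key"]] = rec["value"]
+            elif rec["op"] == "commit":
+                pages.update(pending.pop(rec["tx"], {}))
+    return pages
+```
+
+Records of a transaction without a commit record are ignored during replay, which is why a crash before the flush in step 3 loses nothing that was acknowledged.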
+
+## MVCC (Multi-Version Concurrency Control)
+
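+Each transaction reads from a consistent snapshot instead of blocking on writers. The snapshot is taken at transaction start under Repeatable Read or Snapshot isolation, and at the start of each statement under Read Committed. Superseded row versions are kept (in the table itself or in an undo/version store, depending on the engine) until no active snapshot needs them.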
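+
+A PostgreSQL-style visibility check based on `xmin`/`xmax` can be sketched as follows. This is a deliberately simplified model: `RowVersion` and `visible` are hypothetical names, a snapshot is reduced to a single transaction ID, and it is assumed that every transaction in `committed` with a lower ID committed before the snapshot was taken. A transaction's own writes and in-progress transaction lists are ignored.
+
+```python
+from dataclasses import dataclass
+from typing import Optional
+
+
+@dataclass
+class RowVersion:
+    value: str
+    xmin: int                   # transaction that created this version
+    xmax: Optional[int] = None  # transaction that deleted or superseded it
+
+
+def visible(version, snapshot_xid, committed):
+    """Return True if the version is visible to the given snapshot."""
+    created = version.xmin in committed and version.xmin < snapshot_xid
+    deleted = (
+        version.xmax is not None
+        and version.xmax in committed
+        and version.xmax < snapshot_xid
+    )
+    return created and not deleted
+
+
+# Transaction 10 inserted "v1"; transaction 20 updated it to "v2".
+versions = [RowVersion("v1", xmin=10, xmax=20), RowVersion("v2", xmin=20)]
+committed = {10, 20}
+
+
+def read(snapshot_xid):
+    return [v.value for v in versions if visible(v, snapshot_xid, committed)]
+```
+
+In this model `read(15)` sees only the old version and `read(25)` only the new one; both versions must stay stored until no snapshot older than transaction 20 remains, which is the work that VACUUM or purge performs.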
+
+### Implementations
+
+| DB | Mechanism | Vacuum/GC | Isolation Levels |
+|----|------------|-----------|-----------------|
+| **PostgreSQL** | Heap tuple (xmin/xmax) — old versions in main table | VACUUM (autovacuum) | RU, RC, RR, Serializable (SSI) |
+| **MySQL InnoDB** | Undo log — old versions in undo segments | Purge (automatic) | RU, RC, RR, Serializable |
+| **MSSQL** | Tempdb version store | Automatic (version store cleanup) | RU, RC (locking or RCSI), RR, Snapshot, Serializable |
+| **Oracle** | Undo tablespace | Automatic (undo retention) | RC, Serializable, Read-only |
+| **MongoDB WiredTiger** | MVCC at document level | Automatic (eviction) | Snapshot isolation |
+| **Cassandra** | No MVCC (value overwrite) | Compaction (merge SSTable) | — |
+
+### Anomalies
+
+| Level | Dirty Read | Non-repeatable Read | Phantom Read | Serialization Anomaly |
+|--------|-----------|---------------------|-------------|----------------------|
+| **Read Uncommitted** | Yes | Yes | Yes | Yes |
+| **Read Committed** | No | Yes | Yes | Yes |
+| **Repeatable Read** | No | No | Allowed by the SQL standard; prevented in PostgreSQL (snapshot) and InnoDB (snapshot reads, next-key locks for locking reads) | Yes |
+| **Serializable** | No | No | No | No |
+
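+- **Dirty Read**: reading data written by a transaction that has not yet committed
+- **Non-repeatable Read**: re-reading a row within one transaction returns different values because another transaction committed a change in between
+- **Phantom Read**: re-running a query with the same search condition returns a different set of rows because another transaction committed inserts or deletes in between
+- **Serialization Anomaly**: the result of a group of concurrent transactions is not equivalent to running them in any serial order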
+
+## Index Types
+
+| Type | Algorithm | Use Case | DB Support |
+|-----|-----------|----------|------------|
+| **B-tree** | Balanced tree | `=`, `<`, `>`, `BETWEEN`, `IN`, `LIKE (prefix)` | All (default) |
+| **Hash** | Hash table | Only `=` (equality) | PostgreSQL (hash index), MySQL (MEMORY) |
+| **GiST** | Generalized Search Tree | Geometry, full-text, intervals, IP ranges | PostgreSQL |
+| **GIN** | Generalized Inverted Index | JSONB, arrays, full-text (contains, overlaps) | PostgreSQL |
+| **BRIN** | Block Range Index | Time-series, logs (data in order) — extremely small | PostgreSQL |
+| **SP-GiST** | Space-partitioned GiST | Quadtrees, k-d trees, radix trees (points, IP addresses, text prefixes) | PostgreSQL |
+| **R-tree** | Spatial tree | Geospatial data | MySQL (MyISAM/InnoDB), SQLite |
+| **Clustered index** | B-tree + data in leaves | PK lookup (InnoDB) — data stored with index | MySQL InnoDB, MSSQL |
+| **Full-text** | Inverted index | Text search (stemming, relevance) | MySQL, PostgreSQL, MSSQL |
+
+## Resources
+
+Links, books and standards: [sources/databases/sources.md](sources/databases/sources.md)
+
+### Recommended Reading
+
+| Book | Authors | ISBN | Description |
+|-------|--------|------|-------|
+| Database Internals | Alex Petrov | 978-1492040347 | In-depth explanation of storage engines (B-Tree, LSM-Tree, WAL, MVCC) and distributed systems (partitioning, replication, consensus) |
+
+*Last revision: 2026-06-03*