# ⚙️ Storage Engines and Transaction Models

## B-Tree vs LSM-Tree

B-Trees and LSM-Trees are the two dominant storage engine designs in modern databases.

| Property | B-Tree | LSM-Tree |
|----------|--------|----------|
| **Write** | In-place page update (random I/O) | Append-only (sequential I/O) |
| **Read** | Fast (one root-to-leaf traversal, O(log N)) | Slower (may merge results from several SSTables; Bloom filters skip most of them) |
| **Write amplification** | Typically lower (page rewrite) | Typically higher (data is rewritten during compaction and SSTable merges) |
| **Read amplification** | Lower (a few pages, upper levels usually cached) | Higher (multiple SSTables to search) |
| **Compression** | Worse (page fragmentation) | Better (compact SSTables, block compression) |
| **Range scan** | Fast (leaf pages are linked) | Fast (each SSTable is sorted, but results from several must be merged) |
| **Space amplification** | Low | Higher (obsolete versions remain until compaction) |
| **Typical DBs** | PostgreSQL, MySQL (InnoDB), SQLite, Oracle, MongoDB (WiredTiger, B-tree by default) | Cassandra, RocksDB, LevelDB, ScyllaDB |

### When to Choose Which Engine

Choose a **B-Tree** when:

- You need fast point lookups (primary-key or unique-ID lookups)
- The workload is read-heavy (most queries are `SELECT` by key)
- You need range queries on the primary key
- The workload is transactional (OLTP) with short queries

Choose an **LSM-Tree** when:

- You need high write throughput (write-heavy workload)
- The workload is append-only (logs, time series, IoT)
- Data compression matters (it saves space)
- Write amplification is not a concern (sufficient I/O capacity)

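
The LSM-Tree write and read paths compared above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: everything lives in memory, there are no Bloom filters or levels, and the class `TinyLSM` and its methods are hypothetical names, not the API of any real engine. It shows why reads may probe several SSTables and how compaction reduces that.

```python
import bisect


class TinyLSM:
    """Toy LSM-tree: a mutable memtable plus immutable sorted runs (SSTables)."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}        # newest writes
        self.sstables = []        # sorted [(key, value), ...] runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # On disk this would be one sequential write of a sorted file.
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        """Return (value, number_of_sstables_probed)."""
        if key in self.memtable:
            return self.memtable[key], 0
        probed = 0
        for run in reversed(self.sstables):      # newest run first
            probed += 1
            i = bisect.bisect_left(run, (key,))  # binary search inside the run
            if i < len(run) and run[i][0] == key:
                return run[i][1], probed
        return None, probed

    def compact(self):
        # Merge all runs, keeping only the newest value for each key.
        merged = {}
        for run in self.sstables:                # oldest first, newer overwrite
            merged.update(run)
        self.sstables = [sorted(merged.items())] if merged else []


db = TinyLSM(memtable_limit=2)
for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    db.put(k, v)

print(db.get("b"))   # (2, 2): found only in the older of two SSTables
db.compact()
print(db.get("b"))   # (2, 1): a single SSTable remains after compaction
```

The number of runs probed per lookup is the read amplification from the table; the rewriting done in `compact` is the source of write amplification.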
## Write-Ahead Log (WAL)

An append-only log that guarantees no committed change is lost in a crash:

```text
1. Transaction BEGIN → WAL entry
2. Data modification → WAL entry (before page modification)
3. Transaction COMMIT → flush WAL to disk (COMMIT confirmed only after flush)
4. Checkpoint → flush dirty pages → WAL before the checkpoint can be deleted
```

- **Write-ahead**: the WAL record is written before the data page it describes
- **Checkpoint**: the point up to which all changes are already in the data files; recovery needs only the WAL written after it
- **Redo log** (InnoDB): a similar concept, used to replay changes that had not yet reached the data files
- **Group commit**: multiple transactions share one WAL flush (higher throughput)

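
The commit and recovery rules above can be sketched in a few lines. This is a minimal illustration, not how any particular database formats its log: the JSON-lines record layout, the `TinyWAL` class and the `recover` function are assumptions made for the example. The key points are that `commit` returns only after `os.fsync`, and that recovery replays only writes belonging to transactions with a commit record.

```python
import json
import os
import tempfile


class TinyWAL:
    """Append-only log; a transaction is durable once commit() has returned."""

    def __init__(self, path):
        self.path = path
        self.f = open(path, "a", encoding="utf-8")

    def append(self, record):
        self.f.write(json.dumps(record) + "\n")

    def commit(self, txid):
        self.append({"tx": txid, "op": "commit"})
        self.f.flush()
        os.fsync(self.f.fileno())   # force the log to stable storage


def recover(path):
    """Rebuild state by replaying writes of committed transactions only."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                break               # torn final record from a crash: stop here
    committed = {r["tx"] for r in records if r["op"] == "commit"}
    state = {}
    for r in records:
        if r["op"] == "set" and r["tx"] in committed:
            state[r["key"]] = r["value"]
    return state


path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = TinyWAL(path)
wal.append({"tx": 1, "op": "set", "key": "x", "value": 10})
wal.commit(1)
wal.append({"tx": 2, "op": "set", "key": "x", "value": 99})  # never committed
wal.f.flush()

print(recover(path))   # {'x': 10}
```

Group commit would amount to calling `os.fsync` once for several pending commit records instead of once per transaction.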
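## MVCC (Multi-Version Concurrency Control)

Each transaction reads from a consistent snapshot of the data, taken at transaction start (or at statement start under Read Committed). Writers create new row versions instead of overwriting, and old versions are kept, in the table itself or in a separate undo area, until no snapshot needs them.
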
### Implementations

| DB | Mechanism | Vacuum/GC | Isolation Levels |
|----|-----------|-----------|------------------|
| **PostgreSQL** | Heap tuples (xmin/xmax): old versions stay in the main table | VACUUM (autovacuum) | RU (behaves as RC), RC, RR, Serializable (SSI) |
| **MySQL InnoDB** | Undo log: old versions in undo segments | Purge (automatic) | RU, RC, RR, Serializable |
| **MSSQL** | Version store in tempdb | Automatic (row versioning cleanup) | RU, RC (locking or row-versioned), RR, Snapshot, Serializable |
| **Oracle** | Undo tablespace | Automatic (undo retention) | RC, Serializable, Read-only |
| **MongoDB WiredTiger** | MVCC at document level | Automatic (eviction) | Snapshot isolation |
| **Cassandra** | No MVCC (value overwrite, last write wins) | Compaction (SSTable merge) | n/a |

RU = Read Uncommitted, RC = Read Committed, RR = Repeatable Read, SSI = Serializable Snapshot Isolation.

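
The xmin/xmax mechanism listed for PostgreSQL can be approximated with a small visibility check. This is a deliberately simplified sketch: a snapshot is modeled as the set of transaction ids that had committed when it was taken, while the real rules also cover in-progress transaction lists, aborted transactions and command ids. `RowVersion`, `visible` and `read` are hypothetical names for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowVersion:
    value: str
    xmin: int                    # transaction that created this version
    xmax: Optional[int] = None   # transaction that deleted or replaced it


def visible(version, snapshot, my_txid):
    """snapshot: ids of transactions committed when the snapshot was taken."""
    created = version.xmin == my_txid or version.xmin in snapshot
    deleted = version.xmax is not None and (
        version.xmax == my_txid or version.xmax in snapshot
    )
    return created and not deleted


# Transaction 1 inserted the row; transaction 3 later updated it,
# which marks the old version with xmax=3 and adds a new version.
versions = [
    RowVersion("balance=100", xmin=1, xmax=3),
    RowVersion("balance=80", xmin=3),
]


def read(snapshot, my_txid):
    return [v.value for v in versions if visible(v, snapshot, my_txid)]


print(read(snapshot={1}, my_txid=2))      # ['balance=100']
print(read(snapshot={1, 3}, my_txid=4))   # ['balance=80']
```

Both versions sit side by side until no snapshot can still see the old one; removing it is the job of the Vacuum/GC column in the table.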
### Anomalies

"Yes" means the anomaly can occur at that isolation level.

| Isolation level | Dirty Read | Non-repeatable Read | Phantom Read | Serialization Anomaly |
|-----------------|------------|---------------------|--------------|-----------------------|
| **Read Uncommitted** | Yes | Yes | Yes | Yes |
| **Read Committed** | No | Yes | Yes | Yes |
| **Repeatable Read** | No | No | Allowed by the SQL standard; prevented in PostgreSQL (snapshot) and MySQL InnoDB (next-key locking) | Yes |
| **Serializable** | No | No | No | No |

- **Dirty Read**: reading data written by a transaction that has not committed
- **Non-repeatable Read**: re-reading the same row within one transaction returns different values, because another transaction committed a change in between
- **Phantom Read**: re-running the same query within one transaction returns a different set of rows, because another transaction committed inserts or deletes in between
- **Serialization Anomaly**: the combined result of the transactions is not equivalent to running them one at a time in any order

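
A serialization anomaly is easiest to see with write skew, the classic case that snapshot-based Repeatable Read still permits. The simulation below is a sketch with made-up names (`go_off_call`, `run_snapshot`, `run_serial`, and the doctors `alice` and `bob`); it models snapshots as plain dictionary copies rather than using a real database.

```python
def go_off_call(snapshot, me):
    """Rule: a doctor may go off call only if another doctor stays on call."""
    if sum(snapshot.values()) >= 2:
        return {me: False}   # this transaction's write set
    return {}


def run_snapshot():
    db = {"alice": True, "bob": True}
    snap_a, snap_b = dict(db), dict(db)    # both snapshots taken up front
    writes_a = go_off_call(snap_a, "alice")
    writes_b = go_off_call(snap_b, "bob")
    # The write sets touch different rows, so neither commit is rejected.
    assert not (writes_a.keys() & writes_b.keys())
    db.update(writes_a)
    db.update(writes_b)
    return db


def run_serial(order):
    db = {"alice": True, "bob": True}
    for me in order:                       # each sees the previous commit
        db.update(go_off_call(dict(db), me))
    return db


print(run_snapshot())                  # {'alice': False, 'bob': False}
print(run_serial(["alice", "bob"]))    # {'alice': False, 'bob': True}
print(run_serial(["bob", "alice"]))    # {'alice': True, 'bob': False}
```

Neither serial order leaves both doctors off call, so the concurrent result matches no serial execution. Under Serializable, one of the two transactions would be aborted or blocked instead.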
## Index Types

| Type | Algorithm | Use Case | DB Support |
|------|-----------|----------|------------|
| **B-tree** | Balanced tree | `=`, `<`, `>`, `BETWEEN`, `IN`, `LIKE 'prefix%'` | All (default) |
| **Hash** | Hash table | Equality (`=`) only | PostgreSQL (hash index), MySQL (MEMORY engine) |
| **GiST** | Generalized Search Tree | Geometry, full-text, intervals, IP ranges | PostgreSQL |
| **GIN** | Generalized Inverted Index | JSONB, arrays, full-text (contains, overlaps) | PostgreSQL |
| **BRIN** | Block Range Index | Time series, logs (values follow physical row order); extremely small index | PostgreSQL |
| **SP-GiST** | Space-partitioned GiST | Quadtrees, k-d trees, radix trees | PostgreSQL |
| **R-tree** | Spatial tree | Geospatial data | MySQL (MyISAM/InnoDB), SQLite |
| **Clustered index** | B-tree with row data in the leaves | Primary-key lookup (InnoDB): the row is stored with the index | MySQL InnoDB, MSSQL |
| **Full-text** | Inverted index | Text search (stemming, relevance) | MySQL, PostgreSQL, MSSQL |

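
To show why BRIN suits data stored in order and why it stays so small, here is a rough sketch of the idea: keep only a (min, max) summary per block of rows and skip blocks whose range cannot match. The functions `build_brin` and `brin_candidate_blocks` are hypothetical and ignore everything PostgreSQL does beyond the basic summary (page ranges, operator classes, maintenance).

```python
def build_brin(values, block_size):
    """One (min, max) summary per block of `block_size` consecutive rows."""
    return [
        (min(values[i:i + block_size]), max(values[i:i + block_size]))
        for i in range(0, len(values), block_size)
    ]


def brin_candidate_blocks(summaries, lo, hi):
    """Blocks whose [min, max] range overlaps the query range [lo, hi]."""
    return [b for b, (mn, mx) in enumerate(summaries) if mx >= lo and mn <= hi]


ordered = list(range(1000))                        # e.g. timestamps in arrival order
scattered = [(i * 37) % 1000 for i in range(1000)]  # same values, no physical order

brin_ordered = build_brin(ordered, block_size=100)
brin_scattered = build_brin(scattered, block_size=100)

print(len(brin_ordered))                                   # 10 summaries for 1000 rows
print(brin_candidate_blocks(brin_ordered, 500, 509))       # [5]
print(len(brin_candidate_blocks(brin_scattered, 500, 509)))  # 10
```

With ordered data the query touches one block out of ten; with scattered data every block's range overlaps the query, so the index filters nothing and a B-tree would be the better choice.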
## Resources

Links, books and standards: [sources/databases/sources.md](sources/databases/sources.md)

### Recommended Reading

| Book | Authors | ISBN | Description |
|------|---------|------|-------------|
| Database Internals | Alex Petrov | 978-1492040346 | In-depth explanation of storage engines (B-Tree, LSM-Tree, WAL, MVCC) and distributed systems (partitioning, replication, consensus) |

*Last revision: 2026-06-03*