Stanislav Hubacek 95d1839f05 First batch
2026-06-11 15:27:28 +02:00


# 🗄️ Database architecture

## Database classification

### Relational (SQL)

| DB | License | Use case | Detail |
|----|---------|----------|--------|
| **PostgreSQL** | Open source | General-purpose, geospatial, analytics, AI | [POSTGRESQL.md](POSTGRESQL.md) |
| **MySQL / MariaDB** | Open source | Web, LAMP stack, e-commerce | [MYSQL.md](MYSQL.md) |
| **Microsoft SQL Server** | Proprietary | Enterprise .NET, Windows ecosystem | - |
| **Oracle DB** | Proprietary | Enterprise, finance, mainframe, RAC cluster | [ORACLE.md](ORACLE.md) |
| **Amazon Aurora** | Managed | MySQL/PostgreSQL-compatible, cloud-native | - |

### NoSQL

| Type | DB | Use case | Detail |
|-----|----|----------|--------|
| **Document** | MongoDB, Couchbase | JSON data, flexible schema | [MONGODB.md](MONGODB.md) |
| **Key-Value / Cache** | Redis, Memcached, DynamoDB | Cache, session store, real-time | [REDIS.md](REDIS.md) |
| **Wide-column** | Cassandra, ScyllaDB | Time-series, IoT, big data | [CASSANDRA.md](CASSANDRA.md) |
| **Vector** | Pinecone, Qdrant, Milvus, pgvector | Embeddings, RAG, semantic search | [VEKTOROVE-DB.md](VEKTOROVE-DB.md) |
| **Graph** | Neo4j, Dgraph | Relationships, recommendations, social graphs | - |

### Storage engines

Concepts shared across databases: [DATABAZOVE-ENGINY.md](DATABAZOVE-ENGINY.md)

---
## Transaction isolation levels
| Level | Dirty Read | Non-repeatable Read | Phantom Read | Serialization Anomaly |
|--------|-----------|---------------------|-------------|----------------------|
| **Read Uncommitted** | Yes (possible) | Yes | Yes | Yes |
| **Read Committed** | No (prevented) | Yes | Yes | Yes |
| **Repeatable Read** | No | No | Yes per the SQL standard (PostgreSQL: No) | Yes |
| **Serializable** | No | No | No | No |

**Anomalies**:

- **Dirty Read**: reading data written by an uncommitted transaction (the data may still be rolled back)
- **Non-repeatable Read**: the same query returns different data within one transaction (another transaction updated the row in the meantime)
- **Phantom Read**: the same query returns new rows (another transaction inserted rows matching the condition)
- **Serialization Anomaly**: the outcome of the transactions is not equivalent to any serial order of execution

### PostgreSQL vs MySQL differences

- **PostgreSQL**: Read Uncommitted behaves like Read Committed. Repeatable Read is implemented as Snapshot Isolation, which also prevents phantom reads. Serializable uses Serializable Snapshot Isolation (SSI).
- **MySQL InnoDB**: Repeatable Read uses next-key locking, which prevents phantom reads.
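
The table marks Repeatable Read as still open to serialization anomalies. The sketch below simulates one such anomaly, write skew, under snapshot isolation. It is a toy in-memory model, not any database's API: the `SnapshotStore` class, the `go_off_call` rule, and the on-call doctors scenario are all illustrative assumptions. Both transactions pass their check against their own snapshot and write different keys, so no write-write conflict is detected, yet the final state could not result from running them one after the other. A serializable level such as PostgreSQL's SSI is designed to abort one of the two transactions in a case like this.

```python
class SnapshotStore:
    """Toy key-value store with snapshot reads and first-committer-wins writes."""

    def __init__(self, data):
        self.data = dict(data)
        self.versions = {key: 0 for key in data}  # commit counter per key

    def begin(self):
        # Each transaction works on a private snapshot taken at start.
        return {"snapshot": dict(self.data),
                "seen": dict(self.versions),
                "writes": {}}

    def commit(self, txn):
        # Snapshot isolation only detects write-write conflicts on the same key.
        for key in txn["writes"]:
            if self.versions[key] != txn["seen"][key]:
                raise RuntimeError(f"write-write conflict on {key!r}")
        for key, value in txn["writes"].items():
            self.data[key] = value
            self.versions[key] += 1


def go_off_call(txn, doctor):
    """Hypothetical rule: a doctor may leave only if another one stays on call."""
    on_call = [d for d, active in txn["snapshot"].items() if active]
    if len(on_call) >= 2:
        txn["writes"][doctor] = False


store = SnapshotStore({"alice": True, "bob": True})
t1, t2 = store.begin(), store.begin()   # both snapshots show two doctors on call
go_off_call(t1, "alice")
go_off_call(t2, "bob")
store.commit(t1)
store.commit(t2)                        # no conflict: different keys were written

# Nobody is on call any more, which no serial order of t1 and t2 allows.
print(store.data)
```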
---
## CAP theorem

In a distributed system, only two of the following three can be guaranteed at once: **C**onsistency, **A**vailability, **P**artition tolerance.
In practice, partition tolerance is always required, so the real choice is between CP (consistency) and AP (availability).

### PACELC extension

PACELC extends CAP with the behavior under normal conditions (no partition):

- **P**artition → **A**vailability vs **C**onsistency
- **E**lse (no partition) → **L**atency vs **C**onsistency

| DB | On partition (PA/PC) | Else (EL/EC) |
|----|----------------|------------|
| Cassandra | PA (availability) | EL (low latency, eventual consistency) |
| DynamoDB (default) | PA | EL |
| MongoDB | PC (single primary) | EL |
| PostgreSQL (single) | PC | EC |
| CockroachDB | PC | EC |
### Quorum details

- **R** (read quorum) + **W** (write quorum) > **N** (replication factor)
- Typical setup: N=3, R=2, W=2 (tolerates 1 node being down)
- **Sloppy quorum**: when a node is unavailable, the data is temporarily stored on another node
- **Hinted handoff**: a temporary write to another node carries a hint, and the data is handed over once the original node recovers
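
The quorum rule above can be checked with a few lines of arithmetic. This is a minimal sketch assuming a strict quorum (no sloppy quorum or hinted handoff); the function name and the returned keys are illustrative. With N=3, R=2, W=2 every read set overlaps every write set and one node may be down; with R=1, W=1 reads can miss the latest write.

```python
def quorum_properties(n: int, r: int, w: int) -> dict:
    """Derive basic guarantees from replication factor n and quorums r, w."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return {
        # Any read set shares at least one node with any write set.
        "read_write_sets_overlap": r + w > n,
        # Nodes that may be down while reads / writes still succeed.
        "read_failures_tolerated": n - r,
        "write_failures_tolerated": n - w,
    }


for n, r, w in [(3, 2, 2), (3, 1, 1), (5, 3, 3)]:
    print((n, r, w), quorum_properties(n, r, w))
```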
---
## Replication

| Type | Description | Latency |
|-----|-------|---------|
| Synchronous | Write is acknowledged only after it has been replicated to all nodes | High, but consistent |
| Asynchronous | Write is acknowledged immediately, replication happens in the background | Low, data loss possible |
| Semi-synchronous | Write is acknowledged once a majority of nodes confirm it | Compromise |
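
The latency column can be made concrete with a small model. The sketch below is a simplification under stated assumptions: the leader's own write time is ignored, the replica round-trip times are made up, and semi-synchronous is taken to mean "wait for a majority", as in the table.

```python
def commit_latency_ms(replica_latencies_ms, mode):
    """Time until the leader can acknowledge a write to the client."""
    ordered = sorted(replica_latencies_ms)
    if mode == "synchronous":        # wait for every replica
        return ordered[-1]
    if mode == "asynchronous":       # do not wait for any replica
        return 0
    if mode == "semi-synchronous":   # wait for a majority of replicas
        majority = len(ordered) // 2 + 1
        return ordered[majority - 1]
    raise ValueError(f"unknown mode: {mode}")


latencies = [2, 5, 120]  # hypothetical round-trip times; one replica is slow
for mode in ("synchronous", "semi-synchronous", "asynchronous"):
    print(mode, commit_latency_ms(latencies, mode))
```

A single slow replica dominates the synchronous case, while the majority rule hides it.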
### Topologies

- **Leader-Follower** (Master-Slave): writes go to the leader, reads can be served from replicas
- **Leader-Leader** (Multi-master): writes are accepted on multiple nodes
- **Quorum-based**: R + W > N (Cassandra, DynamoDB)
---
## Sharding

Data is distributed across nodes according to a shard key.
```
          ┌─────────┐
          │  Proxy  │
          │ Router  │
          └────┬────┘
    ┌──────────┼──────────┐
┌───▼────┐ ┌───▼────┐ ┌───▼────┐
│Shard A │ │Shard B │ │Shard C │
│ 0-100  │ │101-200 │ │201-300 │
└────────┘ └────────┘ └────────┘
```
### Methods

| Method | Description | Advantage | Disadvantage |
|--------|-------|--------|----------|
| **Hash-based** | `shard_id = hash(key) % N` | Even distribution | Range queries are no longer efficient |
| **Range-based** | Data split by range (A-M, N-Z) | Preserves ordering | Hot spots |
| **Consistent hashing** | Hash ring, vnodes | Minimal data movement when the number of shards changes | More complex |
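
To illustrate the trade-off between the first and last rows, the sketch below counts how many keys change shard when a fourth shard is added. It is a minimal, assumption-laden model: SHA-256 as the hash, 100 virtual nodes per shard, and made-up key names; production systems use their own hash functions and ring layouts.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    # Python's built-in hash() is salted per process, so use a stable digest.
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)


def modulo_shard(key: str, shards: list) -> str:
    return shards[stable_hash(key) % len(shards)]


class HashRing:
    def __init__(self, shards, vnodes=100):
        # Each shard is placed on the ring at `vnodes` pseudo-random points.
        self.points = sorted(
            (stable_hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self.hashes = [h for h, _ in self.points]

    def shard_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self.hashes, stable_hash(key)) % len(self.points)
        return self.points[idx][1]


keys = [f"user:{i}" for i in range(10_000)]
old, new = ["A", "B", "C"], ["A", "B", "C", "D"]

moved_modulo = sum(modulo_shard(k, old) != modulo_shard(k, new) for k in keys)
ring_old, ring_new = HashRing(old), HashRing(new)
moved_ring = sum(ring_old.shard_for(k) != ring_new.shard_for(k) for k in keys)

print(f"modulo: {moved_modulo / len(keys):.0%} of keys moved")
print(f"ring:   {moved_ring / len(keys):.0%} of keys moved")
```

In expectation, modulo placement relocates about three quarters of the keys when going from 3 to 4 shards, while the ring relocates only the keys that now belong to the new shard.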
### Routing

- **Proxy-based**: the application connects to a proxy, which routes each query (Vitess, ProxySQL, mongos)
- **Client-side**: the application knows which shard to talk to
- **DNS-based**: each shard has its own endpoint
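
Client-side routing for the range-based layout in the diagram (0-100, 101-200, 201-300) can be sketched with a binary search over the range boundaries. The endpoint names are hypothetical placeholders, and integer keys are assumed.

```python
import bisect

# Inclusive upper bound of each shard's key range, mirroring the diagram.
UPPER_BOUNDS = [100, 200, 300]
SHARDS = ["shard-a", "shard-b", "shard-c"]  # hypothetical endpoint names


def route(key: int) -> str:
    """Return the shard whose range contains `key`."""
    if not 0 <= key <= UPPER_BOUNDS[-1]:
        raise KeyError(f"key {key} is outside every shard range")
    # bisect_left finds the first bound that is >= key.
    return SHARDS[bisect.bisect_left(UPPER_BOUNDS, key)]


for key in (0, 100, 101, 250):
    print(key, route(key))
```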
---
## Data consistency patterns
| Pattern | Description | Example |
|---------|-------|---------|
| **Strong consistency** | After a write, every read sees the latest data | Single DB, Raft, Spanner |
| **Eventual consistency** | After a write, the data propagates over time | DNS, DynamoDB (default), Cassandra |
| **Read-after-write** | Authors always see their own writes (others see them eventually) | Social networks, comments |
| **Causal consistency** | Causally dependent operations are seen in the correct order | COPS, Orbe, MongoDB (causally consistent sessions) |
| **Monotonic reads** | Once you have seen newer data, you never see older data | Reads pinned to a single replica (sticky sessions) |
| **Monotonic writes** | Writes from one client are applied in order | Queue-based, single leader |
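
One way to obtain monotonic reads without pinning a client to one replica is to have the session remember the newest version it has observed and skip replicas that lag behind it. The sketch below is a toy model under that assumption; the `Replica` and `Session` classes and the integer version counter are illustrative, not a specific database's mechanism. The second read is offered the stale replica first and still returns the newer value.

```python
class Replica:
    def __init__(self, name, version, value):
        self.name, self.version, self.value = name, version, value


class Session:
    """Remembers the newest version this client has observed."""

    def __init__(self):
        self.last_seen = 0

    def read(self, replicas):
        # Skip replicas that lag behind what this session has already seen.
        for replica in replicas:
            if replica.version >= self.last_seen:
                self.last_seen = replica.version
                return replica.value
        raise RuntimeError("no replica is fresh enough for this session")


fresh = Replica("r1", version=7, value="new")
stale = Replica("r2", version=5, value="old")

session = Session()
first = session.read([fresh, stale])    # observes version 7
second = session.read([stale, fresh])   # the stale replica is skipped
print(first, second)
```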
---
## Data migration

### Schema migrations
```
V1__initial_schema.sql
V2__add_users_table.sql
V3__add_email_index.sql
V4__add_orders_table.sql
```
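
Versioned file names like these have to be applied in numeric order, not alphabetical order (otherwise `V10` would sort before `V2`). A minimal sketch of that ordering follows, assuming plain integer versions and the `V<number>__<description>.sql` pattern shown above; the `V10__add_audit_log.sql` file and the `pending_migrations` helper are hypothetical.

```python
import re

MIGRATION_NAME = re.compile(r"^V(\d+)__(\w+)\.sql$")


def pending_migrations(filenames, applied_version):
    """Return migrations newer than `applied_version`, in numeric order."""
    parsed = []
    for name in filenames:
        match = MIGRATION_NAME.match(name)
        if match is None:
            raise ValueError(f"unexpected migration file name: {name}")
        parsed.append((int(match.group(1)), name))
    return [name for version, name in sorted(parsed) if version > applied_version]


files = [
    "V10__add_audit_log.sql",   # hypothetical later migration
    "V2__add_users_table.sql",
    "V1__initial_schema.sql",
    "V3__add_email_index.sql",
]
print(pending_migrations(files, applied_version=1))
```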
### Zero-downtime migrations

1. **Expand**: add the new column or table (the application tolerates both states)
2. **Migrate**: backfill the data and update the application to the new schema
3. **Contract**: remove the old column or table
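
The three steps can be sketched end to end against an in-memory SQLite database. This is an illustration only: the `users` table, the `fullname` and `display_name` columns, and the batch size are assumptions, and a real rollout would deploy application changes between the steps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.executemany("INSERT INTO users (fullname) VALUES (?)",
                 [("Ada Lovelace",), ("Alan Turing",)])

# 1. Expand: add the new column; old application code keeps working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# 2. Migrate: backfill in small batches to keep each transaction short.
BATCH = 1
while True:
    cur = conn.execute(
        """UPDATE users SET display_name = fullname
           WHERE id IN (SELECT id FROM users
                        WHERE display_name IS NULL LIMIT ?)""",
        (BATCH,),
    )
    conn.commit()
    if cur.rowcount == 0:
        break

# 3. Contract: drop the old column once nothing reads it any more.
# DROP COLUMN needs SQLite 3.35+; older versions require a table rebuild.
if sqlite3.sqlite_version_info >= (3, 35, 0):
    conn.execute("ALTER TABLE users DROP COLUMN fullname")

rows = conn.execute("SELECT id, display_name FROM users ORDER BY id").fetchall()
print(rows)
```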
### Tools

| Tool | Language | Strategy | Zero-downtime | Rollback |
|---------|-------|-----------|--------------|----------|
| **Flyway** | Java (multi-lang CLI) | Versioned SQL | Limited (additive changes only) | `undo` (limited, enterprise) |
| **Liquibase** | Java (multi-lang CLI) | Changesets (XML/YAML/JSON/SQL) | Yes (changeset design) | `rollback-count <n>` |
| **Alembic** | Python | Auto-generation, versioned | Yes (branching) | `downgrade` |
| **Prisma Migrate** | TypeScript | Declarative schema → diff | Yes (shadow DB) | `migrate diff` |
| **gh-ost** | Go | Triggerless online DDL (MySQL) | Yes (binlog stream) | No (progressive) |
| **pgroll** | Go | Online schema migrations (PG) | Yes (views, multiple versions) | Yes (instant) |
---
## SQL Antipatterns
Based on *More SQL Antipatterns* (Karwin, 2026), which covers 14 new antipatterns:

### Language antipatterns

| Antipattern | Problem | Solution |
|-------------|---------|--------|
| **Fear of JOINs** | Matching rows manually in the application instead of using JOIN | Use JOINs properly |
| **Relational Division** | Trying to match a whole set of values in WHERE | Relational division (subquery with GROUP BY/HAVING) |
| **Pagination via OFFSET** | OFFSET is O(n): the larger the offset, the slower the query | Keyset pagination (WHERE id > last_seen) |
| **Non-Sargable queries** | Function applied to a column in WHERE (`WHERE YEAR(date) = 2026`) | Rewrite as a range condition |
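
Keyset pagination from the table can be sketched with an in-memory SQLite database. The `items` table, its contents, and the page size are illustrative assumptions; the pattern is the `WHERE id > last_seen ... ORDER BY id LIMIT n` query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items (id, name) VALUES (?, ?)",
                 [(i, f"item-{i}") for i in range(1, 11)])


def fetch_page(last_seen_id, page_size=4):
    # The primary-key index lets the database seek straight to the next row
    # instead of scanning and discarding OFFSET rows.
    return conn.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, page_size),
    ).fetchall()


pages, last_seen = [], 0
while True:
    page = fetch_page(last_seen)
    if not page:
        break
    pages.append([row[0] for row in page])
    last_seen = page[-1][0]   # the client passes this back for the next page

print(pages)
```

The sargable rewrite follows the same idea of letting an index do the work: instead of `WHERE YEAR(date) = 2026`, filter with `WHERE date >= '2026-01-01' AND date < '2027-01-01'`.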
### Optimization antipatterns

| Antipattern | Problem | Solution |
|-------------|---------|--------|
| **Premature denormalization** | Denormalizing without a measured reason | Measure first, then optimize |
| **JSON overuse** | Treating JSON as a universal solution | Use JSON only for genuinely flexible data |
| **Cacheless transactions** | Relying on the query cache (removed in MySQL 8) | Application-level caching |
### Application antipatterns

| Antipattern | Problem | Solution |
|-------------|---------|--------|
| **Polling** | Querying for changes at regular intervals | LISTEN/NOTIFY, Kafka, Change Data Capture |
| **Transaction encapsulation** | Every model manages its own transaction | Unit of Work pattern |
| **Fear of deadlocks** | Trying to prevent every deadlock | Mitigation, not prevention |
| **Data hoarding** | Storing everything forever | Data retention policies, archiving |
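
"Mitigation, not prevention" for deadlocks commonly means retrying the transaction that was chosen as the victim. The sketch below shows one possible shape of such a retry loop; `DeadlockDetected` is a stand-in for whatever error a real driver raises, and the attempt count and delays are arbitrary assumptions.

```python
import random
import time


class DeadlockDetected(Exception):
    """Stand-in for the driver-specific error raised for a deadlock victim."""


def run_with_retry(transaction, max_attempts=5, base_delay=0.01):
    """Run `transaction`, retrying with backoff when it loses a deadlock."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except DeadlockDetected:
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter so the retries do not collide again.
            time.sleep(base_delay * 2 ** (attempt - 1) * random.random())


attempts = []


def flaky_transfer():
    attempts.append(1)
    if len(attempts) < 3:          # simulate losing two deadlocks in a row
        raise DeadlockDetected()
    return "committed"


result = run_with_retry(flaky_transfer)
print(result, "after", len(attempts), "attempts")
```

The transaction body must be safe to run again from the start, since everything it did before the deadlock was rolled back.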
### Mini-antipatterns

- `LIMIT` without `ORDER BY`: nondeterministic results
- `NATURAL JOIN`: fragile, implicit join condition
- `N+1 queries`: a query inside a loop instead of a JOIN or batch
- Redundant indexes: duplicate or overlapping indexes slow down writes for no benefit
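
The N+1 pattern is easiest to see by counting round trips. The sketch below uses an in-memory SQLite database with made-up `authors` and `books` tables; the `run` helper and its counter are illustrative. The loop issues one query per author on top of the initial one, while the JOIN fetches the same data in a single query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Ada'), (2, 'Alan');
    INSERT INTO books VALUES (1, 1, 'Notes'), (2, 2, 'Machines'), (3, 2, 'Codes');
""")

stats = {"queries": 0}


def run(sql, params=()):
    """Execute a query and count it as one round trip to the database."""
    stats["queries"] += 1
    return conn.execute(sql, params).fetchall()


# N+1: one query for the authors, then one more query per author.
titles_by_author = {}
for author_id, name in run("SELECT id, name FROM authors ORDER BY id"):
    rows = run("SELECT title FROM books WHERE author_id = ? ORDER BY id",
               (author_id,))
    titles_by_author[name] = [title for (title,) in rows]
n_plus_one_queries = stats["queries"]

# Single JOIN: the same data in one round trip.
stats["queries"] = 0
joined = run(
    "SELECT a.name, b.title FROM authors a "
    "JOIN books b ON b.author_id = a.id ORDER BY a.id, b.id"
)
join_queries = stats["queries"]

print(n_plus_one_queries, "queries vs", join_queries)
```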
---
## Designing Data-Intensive Applications (2nd edition)

*Kleppmann, Riccomini (2026)*: a substantially revised edition.

### What is new compared to the 1st edition

| Area | What is new |
|--------|-----------|
| **Cloud-native** | Storage = object store (S3, Blob) rather than local disk. Separation of control, data, and compute planes |
| **AI workloads** | Vector indexes, DataFrames as a data model, batch processing for training data |
| **Local-first software** | DuckDB, PGlite, SQLite: databases running on a laptop or at the edge, syncing when connected |
| **Formal methods** | Randomized testing, formal verification (important for AI-generated code) |
| **Legal & ethics** | GDPR, ethics of predictive analytics, bias, algorithmic accountability |
| **Streaming → SQL views** | Materialize, incremental view maintenance: streaming expressed as SQL |

### Key principles (unchanged)

**Reliability**, **Scalability**, and **Maintainability**: the three pillars of good data systems.

---
## Apache Iceberg Lakehouse
Based on *Architecting an Apache Iceberg Lakehouse* (Merced, 2026):

### What is a data lakehouse

An architecture that combines the flexibility and low cost of a **data lake** (object storage) with the performance and governance of a **data warehouse**. Apache Iceberg is an open source table format.

### Iceberg metadata architecture
```
Table metadata (.metadata.json)
└── Snapshot manifest list
    └── Manifests (file-level stats)
        └── Data files (Parquet/ORC/Avro)
```
### Key features

| Feature | Description |
|-----------|-------|
| **ACID transactions** | Safe concurrent reads and writes |
| **Schema evolution** | Add, drop, or rename columns without rewriting data |
| **Time travel** | Query historical snapshots |
| **Partition evolution** | Change the partitioning strategy without rewriting data |
| **Hidden partitioning** | Partition filters are applied automatically (users do not have to specify them) |
| **Multi-engine** | Spark, Flink, Trino, Dremio, Snowflake on top of the same data |
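
Time travel follows from the snapshot-based metadata layout: every commit adds a new snapshot and leaves the older ones untouched. The sketch below is a deliberately simplified toy model of that idea, not the Iceberg API or file format; the `Snapshot` and `Table` classes, the file names, and the millisecond timestamps are all assumptions, and commits are assumed to arrive in timestamp order.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    data_files: tuple  # immutable set of file paths visible in this snapshot


@dataclass
class Table:
    snapshots: list = field(default_factory=list)

    def commit(self, timestamp_ms, added_files):
        # A commit never edits old snapshots; it appends a new one.
        previous = self.snapshots[-1].data_files if self.snapshots else ()
        self.snapshots.append(
            Snapshot(len(self.snapshots) + 1, timestamp_ms,
                     previous + tuple(added_files))
        )

    def files_as_of(self, timestamp_ms):
        # Time travel: latest snapshot committed at or before the timestamp.
        visible = [s for s in self.snapshots if s.timestamp_ms <= timestamp_ms]
        if not visible:
            raise LookupError("no snapshot exists at that time")
        return visible[-1].data_files


table = Table()
table.commit(1_000, ["a.parquet"])
table.commit(2_000, ["b.parquet"])
print(table.files_as_of(1_500))
print(table.files_as_of(2_000))
```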
### When to use Iceberg

- Multiple tools need access to the same governed data
- ACID guarantees on data lake storage
- Streaming and batch writes into a single table
- Less duplication (one canonical copy instead of ETL into a warehouse)
---
## Best practices

- **Connection pooling**: PgBouncer, RDS Proxy, ProxySQL
- **Index according to query patterns**: avoid unnecessary indexes
- **Read replicas** for reporting and analytics
- **Backup & recovery**: point-in-time recovery (PITR), regular restore tests
- **Query monitoring**: slow query log, pg_stat_statements, performance_schema
- **Encryption at rest & in transit**
- **Migrations in CI/CD**: part of the pipeline, not a manual step
- **Choose the DB by workload**: there is no single universal DB (polyglot persistence)
---
## Comparison of database licensing models

| DB | License | Cost (self-hosted) | Cost (managed cloud) | Vendor lock-in | Note |
|----|---------|-------------------|---------------------|----------------|----------|
| **PostgreSQL** | PostgreSQL License (MIT-like) | $0 | ~$0.10-1.00/hr (RDS, Cloud SQL, Aurora) | Low | Fully open source, no restrictions |
| **MySQL** | GPL v2 / Commercial (Oracle) | $0 (GPL) / ~$2,000/server/year (commercial) | ~$0.10-1.00/hr (RDS, PlanetScale) | Medium (owned by Oracle) | GPL may require releasing your application's source, depending on how it is distributed |
| **MariaDB** | GPL v2 / Business Source | $0 (GPL) | ~$0.10-1.00/hr (SkySQL) | Low | Largely compatible fork of MySQL, no Oracle influence |
| **Oracle SE2** | Proprietary (per socket) | ~$17,500/socket + 22% support/year | ~$1-5/hr (RDS, OCI) | High | Limited to 2 sockets and 16 CPU threads |
| **Oracle EE** | Proprietary (per core + options) | ~$47,500/processor license + options + 22% support/year | ~$2-30/hr (OCI, RDS) | High | Core factor 0.5 (EPYC/Xeon); options can double the price (RAC, partitioning, compression) |
| **SQL Server Standard** | Proprietary (per core + CAL) | ~$1,000/core + $200/CAL | ~$0.20-1.00/hr (Azure SQL) | Medium | A Windows Server license is needed on top when running on Windows |
| **SQL Server Enterprise** | Proprietary (per core + CAL) | ~$7,000/core + $200/CAL | ~$1-5/hr (Azure SQL) | Medium | AlwaysOn, partitioning, in-memory OLTP |
| **MongoDB** | SSPL (Community) / Commercial (Enterprise) | $0 (Community) / ~$10k/server/year (Enterprise) | ~$0.10-5.00/hr (Atlas) | Medium | SSPL restricts offering it as a managed cloud service |
| **Redis** | RSALv2 + SSPL (7.4+) / BSD (Valkey) | $0 (Valkey) | ~$0.10-1.00/hr (ElastiCache, Memorystore → Valkey) | Low (Valkey) | The license change in Redis 7.4+ led to the Valkey fork |
| **Cassandra** | Apache 2.0 | $0 | ~$0.10-1.00/hr (Amazon Keyspaces) | Low | Fully open source, no restrictions |
| **ScyllaDB** | AGPL v3 (OSS) / Enterprise | $0 (OSS) / Enterprise subscription | ~$0.50-3.00/hr (ScyllaDB Cloud) | Low (OSS) | Enterprise: monitoring, security, support |
| **CockroachDB** | BSL (Business Source License) / Enterprise | $0 (core) / Enterprise subscription | ~$0.50-3.00/hr (CockroachDB Cloud) | Medium | BSL: code converts to Apache 2.0 after 3 years. Enterprise: multi-region, backup |

**Key recommendations**:

- **Lowest TCO**: PostgreSQL (no license fees, broadest cloud support)
- **Highest vendor lock-in**: Oracle (PL/SQL, proprietary options, expensive migration)
- **License risk**: Redis (license change) → use Valkey for new projects
- **Cloud-native licensing**: MongoDB Atlas, CockroachDB Cloud, ScyllaDB Cloud: pay-per-use, no license management
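
To get a feel for the lock-in cost, the approximate Oracle EE figures from the table can be turned into a rough estimate. The sketch assumes that the ~$47,500 list price applies per processor license, that x86 cores count with a core factor of 0.5, and that support is 22% of the license cost per year. It ignores options and negotiated discounts, and the function name is illustrative.

```python
def oracle_ee_first_year_cost(cores, core_factor=0.5,
                              list_price=47_500, support_rate=0.22):
    """Rough license + first-year support estimate, before options and discounts."""
    licenses = cores * core_factor          # processor licenses required
    license_cost = licenses * list_price
    support_cost = license_cost * support_rate
    return license_cost, support_cost


license_cost, support_cost = oracle_ee_first_year_cost(cores=16)
print(f"licenses: ${license_cost:,.0f}, support/year: ${support_cost:,.0f}")
```

For a 16-core server this works out to 8 processor licenses, which is why the same hardware running PostgreSQL at $0 in license fees dominates the TCO comparison.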

## Sources

Links, books, and standards: [sources/databases/sources.md](sources/databases/sources.md)

### Recommended reading

| Book | Authors | ISBN | Key contribution |
|-------|--------|------|----------------|
| Database Internals | Alex Petrov | 978-1492040346 | In-depth treatment of storage engines (B-Tree, LSM-Tree, WAL, MVCC) and distributed systems |
| Designing Data-Intensive Applications (2nd ed.) | Kleppmann, Riccomini | - | Cloud-native, AI, local-first, formal methods |
| High Performance MySQL (4th ed.) | Botros, Tinley | 978-1492075292 | MySQL architecture, schema and index optimization |
| Expert Oracle Architecture (3rd ed.) | Kyte, Kuhn | 978-1484249602 | Oracle architecture, RAC, Data Guard, tuning |
| AI-Ready PostgreSQL 18 | Kumar, Linster | - | PostgreSQL as a unified platform for AI |
| More SQL Antipatterns | Bill Karwin (2026) | - | 14 antipatterns, keyset pagination |
| Vector Databases | Borwankar (2026) | - | Embeddings, vector indexes, RAG |
| Architecting an Apache Iceberg Lakehouse | Merced (2026) | - | Lakehouse architecture, Iceberg metadata |

*Last revised: 2026-06-03*