# 🐘 PostgreSQL

## Overview

PostgreSQL is an advanced open-source relational database that emphasizes extensibility, SQL standards conformance, and reliability. It has been developed under the PostgreSQL name since 1996 (building on the earlier Berkeley POSTGRES project), with a strong community and an active release cycle (one major version per year).

## Architecture

### Process model

```text
Postmaster (supervisor)
├── Backend process (1 per connection)
├── Background writer
├── WAL writer
├── Checkpointer
├── Autovacuum launcher
├── Stats collector (PostgreSQL 14 and earlier)
├── Logical replication launcher
└── Archiver (WAL archiving)
```

Each connection gets its own OS process (not a thread). Advantage: isolation and stability. Disadvantage: a higher memory footprint with thousands of connections, so a connection pooler such as PgBouncer is usually needed at that scale.

### MVCC (Multi-Version Concurrency Control)

Every query runs against a snapshot of the data. Under the default READ COMMITTED level the snapshot is taken per statement; REPEATABLE READ and SERIALIZABLE keep one snapshot for the whole transaction. Old row versions (tuples) remain in the table:

- INSERT creates a new tuple with `xmin = current_xid`
- DELETE marks the tuple with `xmax = current_xid` (it does not disappear immediately)
- UPDATE = DELETE of the old version + INSERT of the new one
- VACUUM reclaims the space of dead tuples that are no longer visible to any active snapshot

### VACUUM and autovacuum

| Parameter | Description | Default |
|-----------|-------------|---------|
| `autovacuum_vacuum_threshold` | Min. dead rows to trigger vacuum | 50 |
| `autovacuum_vacuum_scale_factor` | Fraction of table size added to the threshold | 0.2 (20%) |
| `autovacuum_analyze_threshold` | Min. changed rows to trigger ANALYZE | 50 |
| `autovacuum_vacuum_cost_limit` | Cost budget that throttles vacuum I/O | -1 (falls back to `vacuum_cost_limit` = 200) |
| `autovacuum_naptime` | Interval between autovacuum checks | 1 min |
| `deadlock_timeout` | Lock wait before deadlock detection runs (a lock setting, not an autovacuum one) | 1 s |

Autovacuum vacuums a table once its dead tuples exceed `autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × row count`.

**Signs of insufficient vacuum**: table growth (bloat), degraded index scan performance, XID wraparound hazard.

### WAL (Write-Ahead Log)

Append-only log of all changes, used for crash recovery and replication:

```conf
wal_level = replica          # or logical
archive_mode = on
archive_command = 'aws s3 cp %p s3://backups/pg-wal/%f'
```

**PITR (Point-In-Time Recovery)**:

1. Restore the base backup (taken with `pg_basebackup`)
2. Set the recovery target, e.g. `recovery_target_time = '2026-06-03 10:30:00 UTC'`
3. Start the server in recovery mode so that it replays the archived WAL up to the target time

### Replication slots

- **Physical**: guarantees that the primary does not remove WAL until the replica has consumed it
- **Logical**: for logical replication (selective tables, data transformation)
- **Risk**: if the replica fails or falls behind, retained WAL keeps growing and can fill the disk; `max_slot_wal_keep_size` (PostgreSQL 13+) caps the retention
- Monitoring: `pg_replication_slots`, `pg_stat_replication`

### Configuration

Main files (per Obe & Hsu):

- `postgresql.conf`: memory, network, logging, storage
- `pg_hba.conf`: client authentication rules (which hosts, users, and databases may connect, and how)
- `pg_ident.conf`: OS user to PostgreSQL role mapping

### AI-Ready

PostgreSQL 18 (Kumar, Linster, 2026) presents PostgreSQL 18 as a unified platform for transactions, analytics, and AI:

| Area | Technique |
|------|-----------|
| Vectors | pgvector: embeddings stored directly in table rows |
| Hybrid pattern | Semantic recall → SQL filtering |
| LLM integration | PostgreSQL + MCP (Model Context Protocol) |
| Embedding pipeline | Batch and stream embedding generation |

**Hybrid query**:

```sql
-- '[0.1, 0.3, ...]' stands for the (elided) query embedding;
-- <-> is pgvector's Euclidean (L2) distance operator.
SELECT p.*
FROM products p
JOIN product_embeddings pe ON p.id = pe.product_id
WHERE pe.embedding <-> '[0.1, 0.3, ...]' < 0.8
  AND p.in_stock = true
  AND p.price < 100.00
ORDER BY pe.embedding <-> '[0.1, 0.3, ...]'
LIMIT 10;
```

### Extensions

| Extension | Purpose |
|-----------|---------|
| pgvector | Vector search for AI/embeddings |
| PostGIS | Geographic data, spatial queries |
| pg_stat_statements | Query performance monitoring |
| pg_duckdb | Analytical queries (DuckDB engine inside PG) |
| pg_search | Full-text and hybrid search |
| pg_cron | DB job scheduling |
| citus | Horizontal scaling (sharding) |
| timescaledb | Time-series optimization |
| pgaudit | Audit logging |

## Connection pooling

| Pooler | Type | Protocol |
|--------|------|----------|
| PgBouncer | Proxy (transaction/session) | PostgreSQL wire |
| Odyssey | Proxy (multithreaded) | PostgreSQL wire |
| pgpool-II | Proxy (replication, load balancing) | PostgreSQL wire |
| RDS Proxy | Managed proxy (AWS) | PostgreSQL wire |

**PgBouncer modes**:

- **Session pooling**: the server connection is held for the entire client session → little reduction in connection count
- **Transaction pooling**: the server connection is returned to the pool after each transaction → more efficient, but the application must not rely on session state (session-level `SET`, advisory locks, `LISTEN`)

## Recommendations: where PostgreSQL is better

| Area | PostgreSQL | Competition | Why PG |
|------|-----------|-------------|--------|
| **Extensibility** | Extensions, custom types, operators, index methods | MySQL: limited | Anything from vectors to full-text search can be added inside the DB |
| **SQL standard** | High conformance to the SQL standard | MySQL: deviations (e.g. GROUP BY handling in older versions, ALTER TABLE) | Portability, fewer surprises |
| **Geospatial data** | PostGIS (gold standard for GIS) | MySQL GIS (limited) | De facto open-source standard for GIS |
| **Consistency** | Serializable Snapshot Isolation (SSI), foreign keys, CHECK and exclusion constraints | MySQL: MyISAM has no foreign keys or transactions; InnoDB implements SERIALIZABLE with locking rather than SSI | Suitable for financial and critical systems |
| **Concurrent read/write** | MVCC: readers do not block writers and vice versa | MySQL: MyISAM table locks make readers and writers block each other (InnoDB also uses MVCC) | Better read scalability than lock-based engines |
| **AI/vectors** | pgvector extension inside the DB | Separate vector DB (extra network hop and latency) | Hybrid queries in a single SQL statement |
| **License** | PostgreSQL license (MIT-like) | MySQL: dual license (GPL / commercial, Oracle) | No vendor lock-in |

### When to use PostgreSQL

- **Enterprise applications**: require ACID, referential integrity, complex transactions
- **Geographic systems**: GIS, map applications, location services
- **Financial systems**: accounting, banking, compliance (audit logging, SSI)
- **AI / RAG applications**: hybrid vector + relational queries in one DB
- **Analytics on relational data**: pg_duckdb, materialized views, window functions
- **Multi-tenant applications**: row-level security, schemas per tenant

## PostgreSQL licensing

| Variant | License | Price | Restrictions |
|---------|---------|-------|-------------|
| **PostgreSQL** | PostgreSQL license (MIT-like) | $0 | None: can be used, modified, and distributed in commercial products. No "commercial license" needed |
| **Amazon Aurora PostgreSQL** | Proprietary (AWS) | ~$0.10-1.00/hour (varies by instance size and region) | AWS-managed, PostgreSQL-compatible. The permissive PostgreSQL license lets AWS build on PG code |
| **YugabyteDB** | Apache 2.0 | $0 (core) | PostgreSQL-compatible distributed SQL, built on the PG query layer |
| **TimescaleDB** | Apache 2.0 (core) / Timescale License (additional features) | $0 (self-hosted) | Time-series extension for PostgreSQL. Features such as compression are under the Timescale License, which restricts offering them as a managed service |

**Key point**: The PostgreSQL license is one of the most liberal. It allows cloud providers (AWS, GCP, Azure) to offer PostgreSQL as a managed service without restrictions. This differs from MongoDB (SSPL) and Redis (RSALv2/SSPLv1 since 2024, with AGPLv3 added as an option in Redis 8). Thanks to this, PostgreSQL has some of the broadest cloud support of any database.

**Impact on choice**: No license fees and no licensing-driven vendor lock-in. From a licensing perspective, PostgreSQL is a low-risk choice for most projects.

### When to use something else

- **Simple web / blog** → SQLite (lighter in embedded scenarios)
- **High-throughput key-value** → Redis (in-memory, typically much lower latency)
- **Time-series at massive scale** → TimescaleDB (itself a PostgreSQL extension), InfluxDB
- **Globally distributed data** → CockroachDB, Spanner
- **Full-text search primarily** → Elasticsearch

## Sources

References, books, and standards: [sources/databases/sources.en.md](sources/databases/sources.en.md)

### Recommended reading

| Book | Authors | ISBN | Description |
|------|---------|------|-------------|
| PostgreSQL: Up and Running (3rd ed.) | Regina Obe, Leo Hsu | 978-1491962935 | Practical guide to administration, configuration, and extensions |
| AI-Ready PostgreSQL 18 | Kumar, Linster | n/a | PostgreSQL as a unified platform for AI workloads |

*Last revision: 2026-06-03*
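
**Worked example: autovacuum trigger point.** The defaults listed under "VACUUM and autovacuum" combine into a single per-table trigger: autovacuum vacuums a table once its dead tuples exceed `autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × row count`. The sketch below only does that arithmetic; the function names are hypothetical helpers, not part of PostgreSQL, and the inputs are assumed to come from `pg_stat_user_tables.n_dead_tup` and `pg_class.reltuples`.

```python
def autovacuum_vacuum_trigger(reltuples: float,
                              threshold: int = 50,
                              scale_factor: float = 0.2) -> float:
    """Dead-tuple count above which autovacuum vacuums a table.

    Defaults mirror autovacuum_vacuum_threshold (50) and
    autovacuum_vacuum_scale_factor (0.2).
    """
    return threshold + scale_factor * reltuples


def needs_vacuum(n_dead_tup: int, reltuples: float,
                 threshold: int = 50, scale_factor: float = 0.2) -> bool:
    """True when the dead tuples exceed the trigger point (hypothetical helper)."""
    return n_dead_tup > autovacuum_vacuum_trigger(reltuples, threshold, scale_factor)


if __name__ == "__main__":
    # Estimated row counts (reltuples) for a small, medium, and large table.
    for rows in (1_000, 1_000_000, 500_000_000):
        default = autovacuum_vacuum_trigger(rows)
        tuned = autovacuum_vacuum_trigger(rows, scale_factor=0.01)
        print(f"{rows:>11,} rows: default trigger ~{default:,.0f} dead tuples, "
              f"with scale_factor=0.01 ~{tuned:,.0f}")
```

With the defaults, a table of 500 million rows accumulates roughly 100 million dead tuples before autovacuum starts, which is why large tables are often given a lower per-table `autovacuum_vacuum_scale_factor`.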