🐘 PostgreSQL

Overview

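PostgreSQL is an advanced open-source relational database with an emphasis on extensibility, SQL standards conformance, and reliability. It has been developed under the PostgreSQL name since 1996 (its roots go back to the Berkeley POSTGRES project of the 1980s), has a strong community, and ships a new major version roughly once a year.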

Architecture

Process model

Postmaster (supervisor)
    ├── Backend process (1 per connection)
    ├── WAL writer
    ├── Checkpointer
    ├── Autovacuum launcher
    ├── Stats collector (PostgreSQL 14 and earlier; removed in 15, statistics now live in shared memory)
    ├── Logical replication launcher
    └── Archiver (WAL archiving)

Each connection is served by its own OS process, not a thread. Advantage: isolation and stability. Disadvantage: a higher memory footprint and process overhead with thousands of connections, so a connection pooler such as PgBouncer is usually needed at that scale.
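The process model can be observed from inside the database. The following is a minimal sketch, assuming PostgreSQL 10 or newer (where pg_stat_activity has the backend_type column); the exact set of process types listed depends on the server version and configuration.

```sql
-- Count server processes by type; client connections appear as 'client backend'.
SELECT backend_type, count(*) AS processes
FROM pg_stat_activity
GROUP BY backend_type
ORDER BY processes DESC;

-- Compare the current number of client backends with the configured ceiling.
SELECT count(*) FILTER (WHERE backend_type = 'client backend') AS client_backends,
       current_setting('max_connections')::int                 AS max_connections
FROM pg_stat_activity;
```

If client_backends regularly approaches max_connections, that is usually the point at which a pooler becomes worth considering.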

MVCC (Multi-Version Concurrency Control)

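Each query sees a consistent snapshot of the data. Under the default READ COMMITTED isolation level, every statement takes a fresh snapshot; under REPEATABLE READ and SERIALIZABLE, the snapshot is taken at the transaction's first statement and kept until the transaction ends. Old row versions (tuples) remain in the table:

  • INSERT creates a new tuple with xmin = current_xid
  • DELETE marks the tuple with xmax = current_xid (it does not disappear immediately)
  • UPDATE = mark the old tuple with xmax + INSERT a new tuple version
  • VACUUM removes dead tuples that are no longer visible to any active snapshot and makes the space reusable (it usually does not return it to the OS)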
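The tuple versioning described above can be watched through the hidden system columns xmin, xmax, and ctid. This is a small sketch; the table name mvcc_demo is hypothetical, and the concrete transaction IDs will differ on every system.

```sql
-- Hypothetical scratch table; TEMP so it disappears at the end of the session.
CREATE TEMP TABLE mvcc_demo (id int PRIMARY KEY, note text);

INSERT INTO mvcc_demo VALUES (1, 'first');

-- xmin holds the ID of the inserting transaction; ctid is the physical tuple location.
SELECT xmin, xmax, ctid, id, note FROM mvcc_demo;

-- An UPDATE writes a new tuple version instead of overwriting the old one.
UPDATE mvcc_demo SET note = 'second' WHERE id = 1;

-- The visible row now normally shows a newer xmin and a different ctid;
-- the old version stays in the table as a dead tuple until VACUUM removes it.
SELECT xmin, xmax, ctid, id, note FROM mvcc_demo;
```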

VACUUM and autovacuum

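| Parameter | Description | Default |
|---|---|---|
| autovacuum_vacuum_threshold | Min. number of dead rows before vacuum is triggered | 50 |
| autovacuum_vacuum_scale_factor | Fraction of the table's rows added to the threshold | 0.2 (20%) |
| autovacuum_analyze_threshold | Min. number of changed rows before ANALYZE is triggered | 50 |
| autovacuum_vacuum_cost_limit | Cost budget that throttles autovacuum I/O | -1 (uses vacuum_cost_limit, default 200) |
| autovacuum_naptime | Interval between autovacuum runs on a database | 1 min |
| deadlock_timeout | Lock wait time before deadlock detection runs (a locking setting, not autovacuum-specific) | 1 s |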

Signs of insufficient vacuum: table growth (bloat), degraded index scan performance, XID wraparound hazard.
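Autovacuum processes a table once its dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × (estimated row count). The sketch below compares that trigger point with the current dead-tuple count and also reports XID age per database. It assumes the global settings apply; per-table storage parameters set with ALTER TABLE are not taken into account.

```sql
-- Tables with the most dead tuples, next to the point at which autovacuum would start.
-- Assumption: global settings only; per-table overrides are ignored here.
SELECT s.schemaname,
       s.relname,
       s.n_dead_tup,
       current_setting('autovacuum_vacuum_threshold')::bigint
         + current_setting('autovacuum_vacuum_scale_factor')::float8
           * greatest(c.reltuples, 0) AS vacuum_trigger_point,
       s.last_autovacuum
FROM pg_stat_user_tables AS s
JOIN pg_class AS c ON c.oid = s.relid
ORDER BY s.n_dead_tup DESC
LIMIT 20;

-- XID age per database; values approaching ~2 billion indicate wraparound risk.
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
```

With the defaults, a table of 1,000,000 rows is vacuumed after roughly 50 + 0.2 × 1,000,000 = 200,050 dead tuples, which is why large, frequently updated tables often get a lower per-table scale factor.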

WAL (Write-Ahead Log)

Append-only log of all changes for crash recovery and replication:

wal_level = replica                 # or logical
archive_mode = on
archive_command = 'aws s3 cp %p s3://backups/pg-wal/%f'
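Whether archiving actually works is worth verifying after enabling it. A minimal check, assuming the settings above have been applied and the server restarted (wal_level and archive_mode both require a restart):

```sql
-- Confirm the running values.
SHOW wal_level;
SHOW archive_mode;

-- Archiver statistics: a growing failed_count means archive_command is failing
-- and WAL segments are piling up in pg_wal.
SELECT archived_count,
       last_archived_wal,
       last_archived_time,
       failed_count,
       last_failed_wal,
       last_failed_time
FROM pg_stat_archiver;
```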

PITR (Point-In-Time Recovery):

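  1. Restore a base backup (taken with pg_basebackup) into the data directory
  2. Configure recovery: set restore_command to fetch archived WAL, set recovery_target_time = '2026-06-03 10:30:00 UTC', and create recovery.signal in the data directory (PostgreSQL 12+)
  3. Start the server; it replays the archived WAL up to the target time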

Replication slots

  • Physical: guarantees that the primary does not remove WAL until the replica has consumed it
  • Logical: for logical replication (selective tables, data transformation)
  • Risk: if the replica fails or falls behind, WAL accumulates on the primary and can fill the disk; max_slot_wal_keep_size (PostgreSQL 13+) can cap the retained WAL
  • Monitoring: pg_replication_slots, pg_stat_replication
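The amount of WAL each slot holds back can be estimated from its restart_lsn. This is a sketch intended to be run on the primary (pg_current_wal_lsn() is not available during recovery); any alerting threshold on top of it is left to the reader.

```sql
-- WAL retained per replication slot, largest first.
-- An inactive slot with a growing retained_wal is the classic disk-full risk.
SELECT slot_name,
       slot_type,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;

-- Replay lag of connected replicas, in bytes.
SELECT application_name,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```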

Configuration

Main files (per Obe & Hsu):

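  • postgresql.conf: memory, network, logging, storage
  • pg_hba.conf: client authentication rules (which users may connect to which databases, from which hosts, and with which authentication method)
  • pg_ident.conf: OS user to PostgreSQL role mapping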
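The location and effective content of these files can be inspected from SQL. A short sketch, assuming a role with sufficient privileges (superuser or pg_read_all_settings) and PostgreSQL 10 or newer for pg_hba_file_rules:

```sql
-- Where the server reads its configuration from.
SHOW config_file;
SHOW hba_file;
SHOW ident_file;

-- Settings that come from a configuration file rather than from defaults.
SELECT name, setting, unit, sourcefile, sourceline
FROM pg_settings
WHERE source = 'configuration file'
ORDER BY name;

-- Parsed pg_hba.conf rules as the server sees them.
SELECT line_number, type, database, user_name, address, auth_method
FROM pg_hba_file_rules;

-- Reload the files without a restart (not every setting can be changed this way).
SELECT pg_reload_conf();
```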

AI-Ready PostgreSQL 18

(Kumar, Linster, 2026) — PostgreSQL 18 as a unified platform for transactions, analytics, and AI:

| Area | Technique |
|---|---|
| Vectors | pgvector: embeddings stored directly in table rows |
| Hybrid pattern | Semantic recall → SQL filtering |
| LLM integration | PostgreSQL + MCP (Model Context Protocol) |
| Embedding pipeline | Batch and stream embedding generation |

Hybrid query:

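-- <-> is pgvector's Euclidean (L2) distance operator; smaller means more similar.
SELECT p.*
FROM products p
JOIN product_embeddings pe ON p.id = pe.product_id
WHERE pe.embedding <-> '[0.1, 0.3, ...]' < 0.8   -- semantic recall
  AND p.in_stock = true                          -- relational filters
  AND p.price < 100.00
ORDER BY pe.embedding <-> '[0.1, 0.3, ...]'
LIMIT 10;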
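The hybrid query presupposes a schema along the following lines. This is only a sketch: the column list of products, the vector dimension of 3, and the index name are assumptions made for illustration (real embedding models produce hundreds or thousands of dimensions), and the HNSW index type requires pgvector 0.5.0 or newer.

```sql
-- pgvector is installed under the extension name "vector".
CREATE EXTENSION IF NOT EXISTS vector;

-- Hypothetical relational table matching the columns used in the query.
CREATE TABLE products (
    id       bigint PRIMARY KEY,
    name     text NOT NULL,
    price    numeric(10,2) NOT NULL,
    in_stock boolean NOT NULL DEFAULT true
);

-- One embedding per product; dimension 3 is a placeholder.
CREATE TABLE product_embeddings (
    product_id bigint PRIMARY KEY REFERENCES products (id),
    embedding  vector(3) NOT NULL
);

-- Approximate nearest-neighbour index for the <-> (L2) operator.
CREATE INDEX product_embeddings_hnsw
    ON product_embeddings USING hnsw (embedding vector_l2_ops);
```

The index serves the ORDER BY ... LIMIT part of the query; the relational predicates are applied as ordinary filters.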

Extensions

| Extension | Purpose |
|---|---|
| pgvector | Vector search for AI/embeddings |
| PostGIS | Geographic data, spatial queries |
| pg_stat_statements | Query performance monitoring |
| pg_duckdb | Analytical queries (DuckDB engine inside PG) |
| pg_search | Full-text and hybrid search |
| pg_cron | DB job scheduling |
| citus | Horizontal scaling (sharding) |
| timescaledb | Time-series optimization |
| pgaudit | Audit logging |
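Extensions are enabled per database with CREATE EXTENSION once their packages are installed on the server. The sketch below uses pg_stat_statements as the example; it assumes the module is already listed in shared_preload_libraries and PostgreSQL 13 or newer (older versions name the timing columns total_time and mean_time).

```sql
-- Which of these extensions are available on this server, and which are installed?
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name IN ('vector', 'postgis', 'pg_stat_statements', 'pg_cron', 'pgaudit')
ORDER BY name;

-- Enable query statistics in the current database.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 10 statements by total execution time.
SELECT calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```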

Connection pooling

| Pooler | Type | Protocol |
|---|---|---|
| PgBouncer | Proxy (transaction/session) | PostgreSQL wire |
| Odyssey | Proxy (multithreaded) | PostgreSQL wire |
| pgpool-II | Proxy (replication, load balancing) | PostgreSQL wire |
| RDS Proxy | Managed proxy (AWS) | PostgreSQL wire |

PgBouncer modes:

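  • Session pooling: a server connection stays assigned to the client for the entire session, so it saves connection setup cost but does not reduce the number of server connections
  • Transaction pooling: the server connection returns to the pool when the transaction completes, which is more efficient but requires the application to avoid session-level state (session-wide SET, LISTEN, session advisory locks, temporary tables)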
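One common way to stay compatible with transaction pooling is to scope settings to the transaction instead of the session. A minimal sketch; the 5-second timeout is an arbitrary illustrative value:

```sql
BEGIN;

-- SET LOCAL lasts only until COMMIT/ROLLBACK, so nothing leaks to the next
-- client that receives this server connection from the pool.
SET LOCAL statement_timeout = '5s';

SELECT count(*) FROM pg_class;

COMMIT;

-- By contrast, a plain "SET statement_timeout = ..." outside a transaction changes
-- the server session and may affect whichever client gets the connection next.
```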

Recommendations — where PostgreSQL is better

| Area | PostgreSQL | Competition | Why PG |
|---|---|---|---|
| Extensibility | Extensions, custom types, operators, index methods | MySQL: limited | Anything from vectors to full-text search can be added inside the DB |
| SQL standard | High conformance to the SQL standard | MySQL: historical deviations (GROUP BY, ALTER TABLE) | Portability, fewer surprises |
| Geospatial data | PostGIS (de facto standard for open-source GIS) | MySQL GIS (limited) | Most complete open-source choice for GIS |
| Consistency | SSI serializable, foreign keys, CHECK, exclusion constraints | MySQL: MyISAM has no foreign keys or transactions; InnoDB defaults to REPEATABLE READ and implements SERIALIZABLE with locking rather than SSI | Suitable for financial and critical systems |
| Concurrent read/write | MVCC: readers do not block writers and vice versa | MySQL: MyISAM uses table-level locks; InnoDB also provides MVCC | Predictable behavior under mixed read/write load |
| AI/vectors | pgvector extension inside the DB | Separate vector DB (extra network hop and data synchronization) | Hybrid queries in a single SQL statement |
| License | PostgreSQL License (permissive, BSD/MIT-like) | MySQL: dual license (GPL / commercial, Oracle) | No vendor lock-in |
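Exclusion constraints from the Consistency row are a good illustration of what the extensible type and index system makes possible. A small sketch with a hypothetical room_booking table: the database itself rejects overlapping bookings for the same room.

```sql
-- btree_gist lets a GiST index handle plain equality on room_id.
CREATE EXTENSION IF NOT EXISTS btree_gist;

-- Hypothetical table: no two rows may share a room with overlapping time ranges.
CREATE TABLE room_booking (
    room_id int       NOT NULL,
    during  tstzrange NOT NULL,
    EXCLUDE USING gist (room_id WITH =, during WITH &&)
);

INSERT INTO room_booking VALUES (1, '[2026-06-03 10:00+00, 2026-06-03 11:00+00)');

-- This second insert overlaps the first one and fails with a
-- "conflicting key value violates exclusion constraint" error.
INSERT INTO room_booking VALUES (1, '[2026-06-03 10:30+00, 2026-06-03 11:30+00)');
```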

When to use PostgreSQL

  • Enterprise applications — require ACID, referential integrity, complex transactions
  • Geographic systems — GIS, map applications, location services
  • Financial systems — accounting, banking, compliance (audit logging, SSI)
  • AI / RAG applications — hybrid vector + relational queries in one DB
  • Analytics on relational data — pg_duckdb, materialized views, window functions
  • Multi-tenant applications — row-level security, schemas per tenant
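For the multi-tenant case, row-level security can be sketched as follows. The invoices table, the policy name, and the custom setting app.tenant_id are all hypothetical; the application is assumed to set app.tenant_id for each session or transaction.

```sql
-- Hypothetical shared table with a tenant discriminator column.
CREATE TABLE invoices (
    id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    tenant_id int NOT NULL,
    amount    numeric(12,2) NOT NULL
);

ALTER TABLE invoices ENABLE ROW LEVEL SECURITY;
-- Without FORCE, the table owner bypasses the policy.
ALTER TABLE invoices FORCE ROW LEVEL SECURITY;

-- Rows are visible and writable only for the tenant named in app.tenant_id.
-- The second argument (true) makes current_setting return NULL instead of
-- raising an error when the setting is missing, so no rows match.
CREATE POLICY tenant_isolation ON invoices
    USING (tenant_id = current_setting('app.tenant_id', true)::int);

-- Per-request usage by the application:
SET app.tenant_id = '42';
SELECT * FROM invoices;   -- returns only tenant 42's rows
```

Superusers and roles with BYPASSRLS still see all rows, so the application should connect with an ordinary role.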

PostgreSQL licensing

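| Variant | License | Price | Restrictions |
|---|---|---|---|
| PostgreSQL | PostgreSQL License (permissive, BSD/MIT-like) | $0 | None: can be used, modified, and distributed in commercial products. No "commercial license" needed |
| Amazon Aurora PostgreSQL | Proprietary (AWS) | ~$0.10-1.00/hour (depends on instance class and region) | AWS-managed, PostgreSQL-compatible. AWS can build on PG code thanks to the PostgreSQL License |
| YugabyteDB | Apache 2.0 | $0 (core) | PostgreSQL-compatible distributed SQL, built on the PG query layer |
| TimescaleDB | Apache 2.0 (core) / Timescale License (TSL, source-available) | $0 (self-hosted) | Time-series extension for PostgreSQL. TSL-licensed features such as compression are free to self-host but may not be offered as a managed service; tiered storage belongs to the vendor's cloud offering; multi-node was deprecated and later removed in the 2.x series |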

Key point: The PostgreSQL License is one of the most permissive licenses in use. It allows cloud providers (AWS, GCP, Azure) to offer PostgreSQL as a managed service without restrictions. This differs from MongoDB (SSPL) and Redis (RSALv2/SSPLv1 since 2024, with AGPLv3 added as an option in Redis 8). As a result, PostgreSQL has some of the broadest cloud support of any database.

Impact on choice: The license itself adds no licensing risk, no vendor lock-in, and no license fees. Managed services and third-party extensions carry their own terms and costs, so check those separately.

When to use something else

  • Simple web / blog → SQLite (lighter in embedded scenarios)
  • High-throughput key-value → Redis (in-memory, typically much lower latency)
  • Time-series at massive scale → TimescaleDB (itself a PostgreSQL extension), InfluxDB
  • Globally distributed data → CockroachDB, Spanner
  • Full-text search primarily → Elasticsearch

Sources

References, books, and standards: sources/databases/sources.md

| Book | Authors | ISBN | Description |
|---|---|---|---|
| PostgreSQL: Up and Running (3rd ed.) | Regina Obe, Leo Hsu | 978-1491962935 | Practical guide to administration, configuration, and extensions |
| AI-Ready PostgreSQL 18 | Kumar, Linster | | PostgreSQL as a unified platform for AI workloads |

Last revision: 2026-06-03