comiiit

2026-06-11 15:25:40 +02:00
parent 95d1839f05
commit 3fa11ef0f6
50 changed files with 9336 additions and 33 deletions
--- a/CASSANDRA.en.md
+++ b/CASSANDRA.en.md
@@ -0,0 +1,135 @@
+# 🐘 Cassandra & ScyllaDB
+
+## Overview
+
+Apache Cassandra is a distributed wide-column NoSQL database designed for high availability and linear scalability with no single point of failure. Inspired by the Amazon Dynamo paper (2007) and Google Bigtable. ScyllaDB is a C++ reimplementation compatible with the Cassandra protocol, with drastically lower latency and higher throughput.
+
+## Architecture (Dynamo-inspired)
+
+### Consistent hashing
+
+Data is divided on a hash ring, each node is responsible for a token range:
+
+```text
+   0 ─── node A ─── hash(key1)
+         │
+   90 ─── node B ─── hash(key2)
+         │
+  180 ─── node C ─── hash(key3)
+         │
+  270 ─── node D ─── hash(key4)
+```
+
+- Adding/removing a node affects only K/N keys (thanks to virtual nodes)
+- **Virtual nodes** (vnodes) — each physical node has ~100-200 tokens on the ring (more even distribution)
+
+### Quorum (N, R, W)
+
+- N = replication factor (typically 3)
+- R = read quorum (typically 2)
+- W = write quorum (typically 2)
+- Condition: R + W > N (for strong-ish consistency)
+- **Sloppy quorum** — when a node is unavailable, data is temporarily stored on another
+- **Hinted handoff** — temporary write with hint, data transferred upon recovery
+
+### Gossip protocol
+
+Decentralized dissemination of membership information — each node periodically communicates with 1-3 random nodes. No central point of failure.
+
+### Vector clocks
+
+Capturing causality of object versions. On conflict (partition merge), both versions are returned — application merges.
+
+### Merkle trees
+
+Anti-entropy — hash tree for detecting divergence between replicas. Fast detection of which data ranges differ.
+
+### Write path
+
+```text
+Client → Coordinator → [1. Write to commit log (disk)]
+                       [2. Write to memtable (RAM)]
+                       [3. Acknowledge client]
+                       → [4. Flush memtable → SSTable (periodically)]
+                       → [5. Compaction (merge SSTables)]
+```
+
+### Read path
+
+```text
+Client → Coordinator → [1. Check bloom filter]
+                       [2. Check row cache / key cache]
+                       [3. Read from SSTable (disk)]
+                       [4. Merge with memtable]
+                       [5. Repair if stale (read repair)]
+```
+
+## Cassandra vs ScyllaDB
+
+| Feature | Cassandra | ScyllaDB |
+|---------|-----------|----------|
+| **Language** | Java (JVM) | C++ (seastar framework) |
+| **Architecture** | Thread-per-connection | Shared-nothing, CPU sharding |
+| **Latency** | 5-20 ms (typical) | 1-3 ms (typical) |
+| **Throughput** | Good | 5-10× higher on same HW |
+| **GC pauses** | Yes (JVM) | No (no GC) |
+| **NUMA** | OS-dependent | Native NUMA aware |
+| **Workload** | Standard | High-throughput, real-time |
+| **Price** | Open source | Open source + Enterprise |
+
+## Data model
+
+- **Keyspace** = namespace (analogy to DB)
+- **Table** = column definition (not schema-less)
+- **Partition key** = hash key for ring distribution
+- **Clustering columns** = ordering within a partition
+- **Primary key** = Partition key + Clustering columns
+
+## Recommendations — where Cassandra is better
+
+| Area | Cassandra | Competition | Why Cassandra |
+|------|-----------|-------------|---------------|
+| **Write throughput** | Linear scaling, no master bottleneck | PostgreSQL (master writes) | Every node writes, no single point of failure |
+| **Availability** | AP from CAP — always writable | MongoDB (CP, primary down = read-only) | "Always-writeable" philosophy |
+| **Multi-DC** | Native, per-DC replication | CockroachDB (complex) | Simple configuration, latency-tolerant |
+| **Time-series** | Wide-row model, TTL, compaction | InfluxDB (specialized) | Can combine with other workloads |
+| **IoT / sensor data** | Linear scaling, no master | MongoDB (sharding complex) | Predictable performance under growth |
+| **Geographic distribution** | Native multi-DC, hinted handoff | Spanner (vendor lock-in) | Open source, no dependencies |
+
+### When to use Cassandra / ScyllaDB
+
+- **IoT / sensor data ingest** — millions of writes/s, no data loss
+- **Time-series at massive scale** — metrics, logs, event data
+- **User activity history** — write-heavy workloads
+- **Multi-DC applications** — data available in every location
+- **Recommendation systems** — wide-row model for "what user has seen"
+- **Message / event store** — high-throughput append with TTL
+
+### When to use something else
+
+- **Relations, JOINs, transactions** → PostgreSQL (Cassandra has no JOINs, limited transactions)
+- **Full-text search** → Elasticsearch
+- **Aggregation / OLAP** → ClickHouse (Cassandra is not an analytical DB)
+- **Small data (< 100 GB)** → PostgreSQL (Cassandra overhead not worth it)
+- **Frequent reads by secondary keys** → DynamoDB (SADA indexes) — Cassandra has limited secondary indexes
+
+### ScyllaDB specific
+
+ScyllaDB is advantageous when:
+- You need 5-10× higher throughput on the same HW
+- You have latency-sensitive workload (real-time scoring, ad-tech)
+- You want to eliminate JVM/GC issues
+- You need predictable performance (P99 < 5 ms)
+
+## Sources
+
+References, books, and standards: [sources/databases/sources.md](sources/databases/sources.md)
+
+### Recommended reading
+
+| Paper / Book | Authors | Description |
+|--------------|---------|-------------|
+| Dynamo: Amazon's Highly Available Key-value Store (SOSP 2007) | DeCandia et al. | Foundational paper for Cassandra architecture |
+| Cassandra: The Definitive Guide (3rd ed.) | E. Hewitt | Comprehensive guide to deployment and operations |
+
+*Last revision: 2026-06-03*