commit

2026-06-11 15:25:40 +02:00
parent 95d1839f05
commit 3fa11ef0f6
50 changed files with 9336 additions and 33 deletions
--- a/MONITORING.en.md
+++ b/MONITORING.en.md
@@ -0,0 +1,502 @@
+# 📊 Monitoring and observability
+
+## OpenMetrics standard
+
+OpenMetrics (CNCF sandbox) is the de-facto standard for metric exposition in cloud-native environments:
+
+- Supports text representation and Protocol Buffers
+- Derived from the Prometheus text exposition format
+- Specifies metric types: counter, gauge, histogram, gaugehistogram, summary, stateset, info, unknown
+- `_total` suffix for counters, `_bucket` for histogram buckets
+- Metadata: HELP, TYPE, UNIT, (timestamp optional)
+
+The standard is developed within [OpenObservability](https://github.com/OpenObservability/OpenMetrics).
+
+## New tools and trends (2024–2026)
+
+| Tool | Description |
+|------|-------------|
+| **Grafana Sigil** | AI observability for LLM agents (OTel-native) |
+| **InfraLens** | eBPF-based, zero-instrumentation network observability |
+| **Ingero** | GPU causal observability (eBPF, CUDA tracing) |
+| **GreptimeDB** | Unified observability DB — replaces Prometheus + Loki + ES |
+| **Netdata** | AI-powered full-stack monitoring, 800+ integrations, edge ML |
+
+## Three pillars of observability
+
+1. **Logs** — timestamped event records, structured or unstructured (ERROR, WARN, INFO)
+2. **Metrics** — numerical data over time (latency, error rate, CPU utilization)
+3. **Traces** — request tracking across services (distributed tracing)
+
+## SLI / SLO / SLA
+
+| Term | Meaning | Example |
+|------|---------|---------|
+| **SLI** (Service Level Indicator) | Measured metric | Latency p99 = 250ms |
+| **SLO** (Service Level Objective) | Target value | 99.9 % of requests < 300ms |
+| **SLA** (Service Level Agreement) | Legal commitment | 99.95 % uptime |
+
+### Error budget
+
+`Error Budget = 100 % - SLO`
+- If the SLO is 99.9 %, the error budget is 0.1 % of the time window (about 43 minutes in 30 days)
+- While error budget remains, the team can deploy new features
+- When exhausted — freeze on deploys, stability is priority
+
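The error budget formula above can be turned into a small calculation. This is a minimal sketch, not part of the original file: the function names and the 30-day default window are assumptions chosen for illustration.

```python
def error_budget_fraction(slo_percent: float) -> float:
    """Error Budget = 100 % - SLO, returned as a fraction of the window."""
    return (100.0 - slo_percent) / 100.0


def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of unavailability the SLO tolerates in the window (assumed 30 days)."""
    return error_budget_fraction(slo_percent) * window_days * 24 * 60


def can_deploy(slo_percent: float, downtime_minutes: float, window_days: int = 30) -> bool:
    """True while budget remains; False means the budget is spent and deploys freeze."""
    return downtime_minutes < error_budget_minutes(slo_percent, window_days)


print(round(error_budget_minutes(99.9), 1))   # 43.2
print(can_deploy(99.9, downtime_minutes=50))  # False
```

A time-based budget is only one option; a request-based budget (allowed failed requests out of total requests) follows the same arithmetic.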
+## Pyramid of metrics — RED vs USE vs 4 Golden Signals
+
+### 4 Golden Signals (Google SRE)
+
+1. **Latency** — request processing time (distinguish success vs error latency)
+2. **Traffic** — number of requests / throughput (RPS, QPS, throughput)
+3. **Errors** — explicit errors (5xx, 4xx) and implicit (success with wrong result)
+4. **Saturation** — how "full" the service is (CPU, memory, queue depth, connection pool)
+
+### USE (for infrastructure)
+- **U**tilization — how busy the resource is (% time active)
+- **S**aturation — how much is waiting in queue (run queue, I/O wait)
+- **E**rrors — errors (dropped packets, disk errors, OOM)
+
+### RED (for services)
+- **R**ate — requests per second
+- **E**rrors — number of erroneous requests
+- **D**uration — latency (distribution, percentiles)
+
+| Methodology | Focus | Typical metrics |
+|-------------|-------|-----------------|
+| **4 Golden Signals** | Services + infrastructure | Latency, RPS, errors, saturation |
+| **USE** | Infrastructure | CPU util, I/O saturation, disk errors |
+| **RED** | Microservices | RPS, error rate, p50/p95/p99 latency |
+
+## PromQL examples
+
+| Expression | Description |
+|------------|-------------|
+| `rate(http_requests_total[5m])` | Requests per second (average over 5 min) |
+| `increase(http_requests_total[1h])` | Total increase over 1 hour |
+| `sum by (status) (rate(http_requests_total[5m]))` | Requests aggregated by status code |
+| `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))` | p99 latency |
+| `avg_over_time(cpu_usage[1h])` | Average CPU utilization over an hour |
+| `topk(5, sum(rate(http_requests_total[5m])) by (service))` | Top 5 services by RPS |
+| `max_over_time(memory_usage[24h])` | Max memory usage over 24h |
+| `rate(node_network_receive_drop_total[5m]) > 0` | Network interfaces dropping received packets |
+| `(1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))` | CPU utilization (1 - idle) |
+| `rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])` | Average latency |
+| `absent(metric)` | Alert when metric is missing |
+
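The p99 row in the table relies on `histogram_quantile`, which estimates a quantile from cumulative buckets by linear interpolation. The sketch below illustrates that idea in Python under simplifying assumptions: buckets are passed as `(upper_bound, cumulative_count)` pairs, the lowest bucket starts at 0, and the function name and sample data are made up for this example. It is not Prometheus's implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with the last upper_bound equal to math.inf (the +Inf bucket).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls into the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical latency buckets in seconds: le="0.1", le="0.3", le="1", le="+Inf".
buckets = [(0.1, 50), (0.3, 90), (1.0, 99), (math.inf, 100)]
print(round(histogram_quantile(0.5, buckets), 3))
print(round(histogram_quantile(0.95, buckets), 3))
```

Because of the interpolation, the result can never be more precise than the bucket layout, so bucket boundaries should be placed near the latency targets that matter.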
+## Recording rules
+
+Pre-aggregation of frequently used PromQL queries to reduce query load.
+
+### When to use
+- Complex queries used across multiple dashboards
+- Queries over raw data with high cardinality
+- Frequently queried aggregations (e.g., p99 latency over last month)
+
+### Example
+
+```yaml
+groups:
+  - name: service_rules
+    interval: 1m
+    rules:
+      - record: job:http_requests:rate5m
+        expr: sum(rate(http_requests_total[5m])) by (job)
+      - record: instance:cpu:utilization
+        expr: (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance))
+      - record: service:http_latency:p99
+        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
+```
+
+- **record** — new metric name (convention: `level:metric:operations`)
+- **interval** — how often the rule evaluates (typically 1-5 min)
+
+## Observability tools
+
+### Metrics
+| Tool | Description |
+|------|-------------|
+| Prometheus | Pull-based, time-series DB, powerful query language (PromQL) |
+| Grafana | Visualization, dashboards, alerting |
+| Zabbix | Enterprise monitoring, agent + agentless (SNMP/IPMI/JMX), auto-discovery, trigger-based alerting |
+| Datadog | SaaS, APM, logs, metrics in one |
+| New Relic | APM, browser monitoring |
+| CloudWatch | AWS native |
+| Azure Monitor | Azure native |
+| Google Cloud Ops | GCP native |
+
+### Logging
+| Tool | Description |
+|------|-------------|
+| ELK Stack | Elasticsearch, Logstash, Kibana |
+| Loki | Grafana Loki — lightweight, Prometheus-like |
+| Splunk | Enterprise log management |
+| Fluentd / Fluent Bit | Log collector and forwarder |
+| Vector | High-performance log/metric collector |
+
+### Tracing
+| Tool | Description |
+|------|-------------|
+| Jaeger | Open-source distributed tracing |
+| Zipkin | Open-source distributed tracing |
+| OpenTelemetry | Standard for instrumentation (logs, metrics, traces) |
+| Datadog APM | SaaS tracing |
+| AWS X-Ray | AWS tracing |
+
+## OpenTelemetry detail
+
+### Span attributes
+
+```yaml
+resource:
+  attributes:
+    - service.name: "payment-service"
+    - service.version: "1.2.3"
+    - deployment.environment: "production"
+scope:
+  name: "io.opentelemetry.payment"
+spans:
+  - name: "processPayment"
+    kind: SPAN_KIND_INTERNAL
+    attributes:
+      - payment.method: "credit_card"
+      - payment.amount: 2499
+      - payment.currency: "CZK"
+    events:
+      - name: "authorization.complete"
+        timestamp: 1717428000000000000
+```
+
+### Context propagation (W3C TraceContext)
+
+- **`traceparent`** — header carrying trace-id, span-id, trace flags
+  - Format: `00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01`
+  - Version (00) | Trace-ID (32 hex) | Span-ID (16 hex) | TraceFlags (01 = sampled)
+- **`tracestate`** — vendor-specific data, compatible cross-provider
+- Propagation happens via HTTP headers, gRPC metadata, message queue properties
+
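The `traceparent` layout described above can be checked with a few lines of Python. This is a sketch that only handles the version `00` layout shown in the example; the `TraceParent` type and `parse_traceparent` function are names invented for illustration, not an OpenTelemetry API.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - span-id (16 hex) - trace-flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    # Bit 0 of the trace flags is the "sampled" flag.
    return TraceParent(version, trace_id, span_id, bool(int(flags, 16) & 0x01))


print(parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"))
```

In practice an OpenTelemetry SDK propagator does this parsing; the sketch is only meant to make the field layout concrete.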
+### Sampling
+
+| Type | Description | Use case |
+|------|-------------|----------|
+| **Head-based** | Sampling decision at trace start (based on ID) | Simple, deterministic |
+| **Tail-based** | Decision after trace completion (based on result, latency) | Better sampling, more complex |
+
+- Tail-based sampling: often used for critical traces (5xx, p99+, slow traces)
+- Tools: Grafana Tempo (tail-based), Jaeger (head-based), OTel Collector (head + tail)
+
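The difference between the two sampling types can be sketched as two decision functions. Both are illustrative assumptions rather than the algorithm of any particular tool: the head-based function compares the low 64 bits of the trace ID with a ratio, and the tail-based function uses a hypothetical span shape (`start_ms`, `end_ms`, `error`) and an assumed 1000 ms threshold.

```python
def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Head-based: decide at trace start, using only the trace ID.

    Every service that sees the same trace ID reaches the same decision.
    """
    low64 = int(trace_id_hex[-16:], 16)
    return low64 < int(ratio * (1 << 64))


def tail_sample(spans, slow_ms: float = 1000.0) -> bool:
    """Tail-based: decide after the trace is complete, using its outcome."""
    has_error = any(span.get("error", False) for span in spans)
    duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    return has_error or duration >= slow_ms


trace_id = "0af7651916cd43dd8448eb211c80319c"
print(head_sample(trace_id, 0.5))  # False
print(head_sample(trace_id, 0.6))  # True

fast_ok = [{"start_ms": 0, "end_ms": 120}]
fast_failed = [{"start_ms": 0, "end_ms": 120, "error": True}]
print(tail_sample(fast_ok), tail_sample(fast_failed))  # False True
```

The tail-based variant needs all spans of a trace buffered in one place before it can decide, which is the main source of its extra complexity.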
+## Alerting
+
+### Principles
+
+- **Alert on symptom, not cause** — "500 errors" instead of "high CPU"
+- **Reduce noise** — flapping alerts, alert fatigue
+- **Runbook for every alert** — what to do when alert fires
+- **Alert severity** — P0 (critical), P1 (high), P2 (medium), P3 (low)
+
+### Alertmanager (Prometheus)
+
+```yaml
+route:
+  receiver: "team-pager"
+  group_by: ["alertname", "cluster"]
+  group_wait: 30s
+  group_interval: 5m
+  repeat_interval: 4h
+  routes:
+    - matchers:
+        - severity="critical"
+      receiver: "team-pager"
+      repeat_interval: 1h
+    - matchers:
+        - severity="warning"
+      receiver: "team-slack"
+
+receivers:
+  - name: "team-pager"
+    pagerduty_configs:
+      - routing_key: "<KEY>"
+        severity: "{{ .CommonLabels.severity }}"
+  - name: "team-slack"
+    slack_configs:
+      - channel: "#alerts"
+        title: "{{ .GroupLabels.alertname }}"
+```
+
+**Concepts**:
+- **Grouping** — grouping alerts by labels (noise reduction, e.g., all down instances in a cluster)
+- **Inhibition** — suppression of less severe alerts when a more severe one exists (e.g., nodedown inhibits pod alerts)
+- **Silencing** — temporary alert suppression (matching labels + duration)
+- **Routing tree** — hierarchical routing by label match (severity, service, team)
+
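Grouping and the routing tree can be pictured with plain dictionaries. The following sketch is a simplified model, not Alertmanager code: it supports equality matching only, a single level of child routes, and the alert data and function names are invented for the example.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Grouping: alerts with the same values for the group_by labels share a notification."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


def pick_receiver(labels, routes, default_receiver):
    """Routing: the first child route whose labels all match wins, else the root receiver."""
    for route in routes:
        if all(labels.get(name) == value for name, value in route["match"].items()):
            return route["receiver"]
    return default_receiver


alerts = [
    {"labels": {"alertname": "InstanceDown", "cluster": "eu-1", "severity": "critical"}},
    {"labels": {"alertname": "InstanceDown", "cluster": "eu-1", "severity": "critical"}},
    {"labels": {"alertname": "HighLatency", "cluster": "eu-1", "severity": "warning"}},
]
routes = [
    {"match": {"severity": "critical"}, "receiver": "team-pager"},
    {"match": {"severity": "warning"}, "receiver": "team-slack"},
]

for key, members in group_alerts(alerts, ["alertname", "cluster"]).items():
    receiver = pick_receiver(members[0]["labels"], routes, "team-pager")
    print(key, len(members), receiver)
```

Two `InstanceDown` alerts from the same cluster end up in one group, so the pager receives one notification instead of two.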
+### Event and incident management
+
+- PagerDuty, Opsgenie, OnCall (Grafana)
+- Escalation policies
+- On-call rotations
+
+## Structured logging
+
+```json
+{
+  "timestamp": "2026-06-03T10:30:00Z",
+  "level": "ERROR",
+  "service": "payment-service",
+  "trace_id": "abc123",
+  "user_id": "u456",
+  "message": "Payment gateway timeout",
+  "duration_ms": 1200,
+  "error": {
+    "type": "TimeoutError",
+    "message": "Gateway did not respond in 1000ms"
+  }
+}
+```
+
+### Required fields of structured log
+
+| Field | Description | Example |
+|-------|-------------|---------|
+| `timestamp` | ISO 8601 / RFC 3339 | `2026-06-03T10:30:00Z` |
+| `level` | Log level (RFC 5424) | `ERROR`, `WARN`, `INFO`, `DEBUG` |
+| `message` | Human-readable message | `Payment processed` |
+| `service` | Service name | `payment-service` |
+| `trace_id` | Correlation across services | `abc123def456` |
+
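One way to emit the required fields above is a custom formatter for Python's standard `logging` module. This is a minimal sketch: the `JsonFormatter` class name is an assumption, and the trace ID is passed through `extra=` here, whereas a real service would more likely take it from the active trace context.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": timestamp.isoformat().replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Attach exception details in the same shape as the "error" object above.
            payload["error"] = {
                "type": record.exc_info[0].__name__,
                "message": str(record.exc_info[1]),
            }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="payment-service"))
logger = logging.getLogger("payment")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment gateway timeout", extra={"trace_id": "abc123"})
```

Note that Python names its warning level `WARNING`, so a mapping step is needed if downstream tooling expects `WARN`.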
+### RFC 5424 log levels
+
+| Number | Level | Usage |
+|--------|-------|-------|
+| 0 | EMERG | System unusable |
+| 1 | ALERT | Immediate action required |
+| 2 | CRIT | Critical error |
+| 3 | ERROR | Error (non-critical) |
+| 4 | WARN | Warning |
+| 5 | NOTICE | Normal but significant event |
+| 6 | INFO | Informational message |
+| 7 | DEBUG | Debugging (disabled in production) |
+
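In syslog messages the severity number from the table is combined with a facility code into the PRI value, computed as `facility * 8 + severity` per RFC 5424. The sketch below shows that arithmetic; the dictionary and function names are assumptions, and facility 16 is `local0`.

```python
# Severity numbers as listed in the table above.
SEVERITY = {
    "EMERG": 0, "ALERT": 1, "CRIT": 2, "ERROR": 3,
    "WARN": 4, "NOTICE": 5, "INFO": 6, "DEBUG": 7,
}

LOCAL0 = 16  # facility code for local0


def syslog_pri(level: str, facility: int = LOCAL0) -> int:
    """PRI = facility * 8 + severity."""
    return facility * 8 + SEVERITY[level]


def is_at_least(level: str, threshold: str) -> bool:
    """Lower numbers are more severe, so the comparison is inverted."""
    return SEVERITY[level] <= SEVERITY[threshold]


print(syslog_pri("ERROR"))           # 131
print(is_at_least("CRIT", "ERROR"))  # True
print(is_at_least("INFO", "WARN"))   # False
```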
+### Correlation ID (traceparent)
+
+- Generated at system entry (API gateway, frontend, message consumer)
+- Propagated in HTTP header `X-Correlation-ID` / `traceparent`
+- Enables linking logs across microservices (→ Grafana Explore, Kibana Discover)
+- Implementation: middleware in app, service mesh (Envoy), API gateway
+
+## Distributed tracing detail
+
+### Span kinds
+
+| Kind | Description | Example |
+|------|-------------|---------|
+| **CLIENT** | Calling downstream service (outbound) | HTTP client calling API |
+| **SERVER** | Processing incoming request | HTTP handler |
+| **INTERNAL** | Local operation within service | Computation, transformation |
+| **PRODUCER** | Sending message to queue | Kafka producer |
+| **CONSUMER** | Receiving message from queue | Kafka consumer |
+
+### Trace context chain
+
+```
+Trace: abc123
+├── Span: /checkout (SERVER, root)
+│   ├── Span: validateCart (INTERNAL)
+│   ├── Span: POST /orders (CLIENT → payment-service)
+│   │   └── Span: /processPayment (SERVER)
+│   │       ├── Span: authorizeCard (INTERNAL)
+│   │       └── Span: chargeCard (CLIENT → bank-gateway)
+│   │           └── Span: /charge (SERVER, external)
+│   └── Span: sendConfirmation (PRODUCER → kafka)
+│       └── Span: consumeConfirmation (CONSUMER → email-service)
+```
+
+- **W3C TraceContext** — standardized cross-service tracing
+- **Baggage** — key-value context (tenant, user role) propagated across service boundaries alongside the trace context
+
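A tree like the one above is reconstructed by a tracing backend from flat span records, each of which carries its own ID and its parent's ID. The following sketch shows that reconstruction on a subset of the spans; the dictionary keys and the short span IDs are made up for the example.

```python
from collections import defaultdict


def render_trace(spans):
    """Build the parent/child hierarchy from flat spans and return indented lines."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append("  " * depth + f'{span["name"]} ({span["kind"]})')
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # the root span has no parent
    return lines


spans = [
    {"span_id": "a", "parent_id": None, "name": "/checkout", "kind": "SERVER"},
    {"span_id": "b", "parent_id": "a", "name": "validateCart", "kind": "INTERNAL"},
    {"span_id": "c", "parent_id": "a", "name": "POST /orders", "kind": "CLIENT"},
    {"span_id": "d", "parent_id": "c", "name": "/processPayment", "kind": "SERVER"},
    {"span_id": "e", "parent_id": "a", "name": "sendConfirmation", "kind": "PRODUCER"},
]
print("\n".join(render_trace(spans)))
```

The CLIENT span `c` and the SERVER span `d` live in different services; they are linked only because the `traceparent` header carried `c` as the parent ID across the call.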
+## Grafana
+
+### Provisioning dashboards as code
+
+```yaml
+apiVersion: 1
+providers:
+  - name: "default"
+    orgId: 1
+    folder: "Services"
+    type: file
+    options:
+      path: /etc/grafana/provisioning/dashboards
+```
+
+Dashboard JSON files in git → CI/CD → automatic import into Grafana.
+
+### Variables
+
+- **Query variable** — dynamic values (e.g., list of service names from PromQL: `label_values(up, service)`)
+- **Interval variable** — `$__auto_interval`, `$__interval` for variable time range
+- **Custom variable** — manual list of values (env: prod, staging, dev)
+- **Chained variable** — dependent variable (select namespace → show pods in namespace)
+
+### Annotations
+
+- Drawing events in graphs (deploys, incidents, config changes)
+- Sources: Prometheus alerts, Loki logs, GitHub Actions, custom API
+- Use case: "Deploy at 14:30 → spike in latency at 14:31 → correlation"
+
+## On-call best practices
+
+### Escalation policies
+
+```
+Level 1: Primary on-call (response within 5 min)
+    └── timeout 15 min
+Level 2: Secondary / senior engineer (response within 15 min)
+    └── timeout 15 min
+Level 3: Engineering manager / incident commander
+```
+
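The escalation chain above amounts to a lookup from "minutes without acknowledgement" to the level that should currently be paged. This sketch encodes the two 15-minute timeouts from the diagram; the data structure and function name are assumptions, and real paging tools configure this declaratively.

```python
# (target, timeout in minutes before escalating further); None means last level.
ESCALATION_POLICY = [
    ("primary on-call", 15),
    ("secondary / senior engineer", 15),
    ("engineering manager / incident commander", None),
]


def who_to_page(minutes_unacknowledged: float):
    """Return (level, target) responsible after the given time without an ack."""
    elapsed = 0
    for level, (target, timeout) in enumerate(ESCALATION_POLICY, start=1):
        if timeout is None or minutes_unacknowledged < elapsed + timeout:
            return level, target
        elapsed += timeout
    return len(ESCALATION_POLICY), ESCALATION_POLICY[-1][0]


for minutes in (0, 15, 30):
    print(minutes, who_to_page(minutes))
```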
+### Incident severity matrix
+
+| Severity | Description | Response | Communication |
+|----------|-------------|----------|---------------|
+| **P0 (Critical)** | Service completely unavailable, data loss, security breach | Immediate, 24/7 | Status page + Stakeholder update |
+| **P1 (High)** | Major functionality degraded, part of users affected | Within 15 min | Slack channel + Team lead |
+| **P2 (Medium)** | Non-critical feature broken, workaround exists | Within 1 h | Slack channel |
+| **P3 (Low)** | Cosmetic issue, no user impact | Next business day | Jira ticket |
+
+### Postmortem
+
+- **Blameless** — goal is to learn, not blame
+- **Structure**: Timeline, detection, root cause, resolution, action items
+- **SRE principle**: every incident → postmortem → systemic improvement
+- **Tools**: Jira, Incident.io, PagerDuty postmortem, Google Docs
+
+## Observability patterns
+
+### Best practices
+
+- **Dashboard for each level** — executive, service, troubleshooting
+- **Synthetic monitoring** — heartbeat checks, browser tests (Playwright, Cypress)
+- **APM** — Application Performance Monitoring (database queries, external calls)
+- **Anomaly detection** — ML-based outlier detection
+- **Retention policy** — raw data short term, aggregations long term
+- **Unified log format** — JSON, structured data
+
+## Recommended literature
+
+### Classic books
+
+| Book | Authors | ISBN | Key topics |
+|------|---------|------|------------|
+| **Site Reliability Engineering** | Beyer, Jones, Petoff, Murphy | 978-1491929124 | How Google runs production systems — SRE principles, error budgets, toil, SLI/SLO |
+| **The Site Reliability Workbook** | Beyer, Murphy, Rensin, Kawahara, Thorne | 978-1492029502 | Practical companion to SRE — case studies from Evernote, Home Depot, NY Times; SLO implementation, monitoring, on-call |
+| **Observability Engineering** | Majors, Fong-Jones, Miranda | 978-1492076445 | First comprehensive book on observability — structured events, iterative hypothesis verification, core analysis loop; 2nd edition in 2026 (32 new chapters on AI, cost governance) |
+
+### Cloud and monitoring
+
+| Book | Author | ISBN/Year | Topics |
+|------|--------|-----------|--------|
+| **Cloud Observability in Action** | Michael Hausenblas | Manning, 2023 | Practical guide to observability in cloud-native environments — signal types (logs, metrics, traces, profiles), OTel Collector, SLOs, signal correlation, developer observability; open-source tools |
+| **Mastering Prometheus** | William Hegedus | 978-1-80512-566-2 | Advanced Prometheus techniques — TSDB internals, custom service discovery, cardinality, remote storage (VictoriaMetrics, Mimir), SLO-based alerting; author is SRE manager at Akamai and Prometheus/Thanos contributor |
+| **Observability with Grafana** | Chapman, Holmes | 978-1-80324-964-3 | Complete guide to LGTM stack (Loki, Grafana, Tempo, Mimir) — OTel instrumentation, LogQL/PromQL/TraceQL, AI/ML alerting, real user monitoring with Faro, Pyroscope profiling, k6 load testing |
+| **Hands-On Monitoring and Alerting with Prometheus** | Muhammad Badawy | 978-9349887565 | Practical Prometheus guide — installation, configuration, service discovery, labeling, PromQL, Alertmanager, monitoring Linux, Windows, Docker, databases |
+
+### AI and observability
+
+| Book | Authors | ISBN/Year | Topics |
+|------|---------|-----------|--------|
+| **Observability in the AI-Native Era** | Lipsig, Grabner, Rati | 978-1-80638-959-9 | Connecting observability with AIOps — ML-based anomaly detection, root-cause analysis, self-healing systems, OTel + Prometheus + Grafana + Dynatrace/Datadog, compliance |
+| **Open Source Observability** | Corless, Pawar | O'Reilly, 2025 | Report on disaggregated, modular observability stacks — flexibility, cost efficiency, data autonomy, blueprint for custom solutions from open-source components |
+
+## Detailed tool overview
+
+Extended information on the tools listed in the "New tools and trends (2024–2026)" table:
+
+### Grafana Sigil
+
+AI observability product from Grafana Labs. OpenTelemetry-native SDK for instrumenting LLM agents:
+
+- **Repository**: `github.com/grafana/sigil-sdk` (Go SDK) + `sigil-app` (Grafana plugin)
+- **Features**: tracking conversations, generation, tool usage, cost tracking, quality evaluation
+- **Growing problem**: 500M+ conversations, 5M+ agents in production (GrafanaCON 2026)
+- **Integration**: automatic connection with Prometheus (metrics), Tempo (traces), AI Observability API
+
+### InfraLens
+
+Zero-instrumentation Kubernetes observability built on eBPF:
+
+- **Repository**: `github.com/Herenn/Infralens` (Apache 2.0, Go)
+- **Features**: automatic detection of service-to-service communication, topology visualization, AI-powered documentation
+- **Architecture**: eBPF agent + Go backend + React frontend
+- **Status**: early-stage (1 star, 10 commits), but eBPF-based observability concept is proven (Grafana Beyla, Cilium Hubble, Pixie)
+
+### Ingero
+
+GPU causal observability agent — first of its kind:
+
+- **Repository**: `github.com/ingero-io/ingero` (Apache 2.0)
+- **Features**: eBPF tracing from Linux kernel events through CUDA API to Python source code
+- **Overhead**: < 2 %, zero code changes, single binary
+- **MCP server**: native Model Context Protocol support — AI assistants can directly query GPU data
+- **Use case**: diagnosis of GPU stalls, scheduler preemptions, CUDA memory spikes — causal chains instead of plain metrics
+- **Version**: v0.19.0 (2026), active development
+
+### GreptimeDB
+
+Unified observability database — one backend for metrics, logs and traces:
+
+- **Repository**: `github.com/GreptimeTeam/greptimedb` (Apache 2.0, Rust)
+- **Architecture**: compute-storage disaggregation, object storage first (S3, GCS, Azure Blob), columnar storage
+- **Querying**: SQL + PromQL in a single query, JOIN between metrics and logs possible
+- **Drop-in replacement**: Prometheus (PromQL, remote write), Loki (Push API), Elasticsearch (bulk API), Jaeger (Query API)
+- **Cost reduction**: the project claims up to 50× lower costs than traditional observability stacks
+- **Roadmap 2026**: v1.0 GA (Q1 2026), v1.1–v1.3 (Vector Index, AI Functions, Auto Rollup, adaptive resource management)
+- **GreptimeDB Enterprise**: enhanced security, HA, enterprise support
+
+### Netdata
+
+Open-source, real-time monitoring platform for entire infrastructure:
+
+- **Repository**: `github.com/netdata/netdata` (GPLv3+, C; 79k★)
+- **Features**: per-second metrics, ML-based anomaly detection, AI-powered troubleshooting, 800+ integrations
+- **Zero configuration**: auto-discovery, pre-configured alerts, ready dashboards
+- **Architecture**: distributed agent → Netdata Cloud (optional), data stays local
+- **Energy efficiency**: according to University of Amsterdam study, the most efficient tool for monitoring Docker systems
+- **Netdata Cloud**: free tier (5 nodes), paid from $12/node/month
+- **Licensing**: agent GPLv3+, dashboard NCUL1, cloud closed-source
+
+## OpenStack Monitoring
+
+OpenStack provides several services for telemetry and monitoring:
+
+### Ceilometer (Telemetry)
+
+- Metric collection (CPU, memory, network, storage) from compute, network and storage nodes
+- Publishing to Gnocchi (time-series DB) or Panko (event storage)
+- Notifications via oslo.messaging (RabbitMQ) — pipeline transformations
+- Alarming: Aodh — threshold-based alarms, metric combinations
+
+### Monasca
+
+- More modern alternative to Ceilometer (primarily developed for telco use cases)
+- Architecture: Monasca API → Log API → Transform → Threshold Engine → Notifier
+- Backend: InfluxDB/Gnocchi, Kafka, Elasticsearch
+- Supports alerting, notifications, graph dashboards
+
+### Prometheus + OpenStack Exporter
+
+- OpenStack-exporter for Prometheus (exports metrics from Ceilometer / API)
+- Service discovery via Prometheus
+- Grafana dashboards for visualization
+
+### Masakari (VM High Availability)
+
+- Detection and automatic recovery of VMs on hypervisor failure (host failure)
+- Evacuation of instances to healthy compute node
+- Integration with Pacemaker for cluster management
+
+## Sources
+
+Links, books and standards: [sources/monitoring/sources.md](sources/monitoring/sources.md)
+
+*Last revision: 2026-06-03*