knowledge-base/MONITORING.en.md
Stanislav Hubacek 3fa11ef0f6 comiiit
2026-06-11 15:27:28 +02:00

# 📊 Monitoring and observability
## OpenMetrics standard
OpenMetrics (CNCF sandbox) is the de-facto standard for metric exposition in cloud-native environments:
- Supports text representation and Protocol Buffers
- Foundation for Prometheus exposition format
- Specifies metric types: counter, gauge, histogram, summary, gaugehistogram, stateset, info, unknown
- `_total` suffix for cumulative values, `_bucket` for histograms
- Metadata: HELP, TYPE, UNIT, (timestamp optional)
The standard is developed within [OpenObservability](https://github.com/OpenObservability/OpenMetrics).
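
The exposition rules above can be illustrated with a minimal hand-written renderer. This is only a sketch: `render_counter` is a hypothetical helper, not part of any Prometheus client library, and it skips label-value escaping, `# UNIT` lines, and timestamps.

```python
def render_counter(family, help_text, samples):
    """Render one counter family in OpenMetrics text format.

    samples maps a tuple of (label, value) pairs to the cumulative count.
    Label values are assumed to need no escaping (illustration only).
    """
    lines = [
        f"# TYPE {family} counter",
        f"# HELP {family} {help_text}",
    ]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        braces = "{" + label_str + "}" if label_str else ""
        # Counter samples carry the _total suffix; the family name does not.
        lines.append(f"{family}_total{braces} {value}")
    # OpenMetrics requires an explicit end-of-exposition marker.
    lines.append("# EOF")
    return "\n".join(lines) + "\n"
```

In practice an official client library would produce this output; the sketch only shows where `HELP`, `TYPE`, and the `_total` suffix end up.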
## New tools and trends (2024–2026)
| Tool | Description |
|------|-------------|
| **Grafana Sigil** | AI observability for LLM agents (OTel-native) |
| **InfraLens** | eBPF-based, zero-instrumentation network observability |
| **Ingero** | GPU causal observability (eBPF, CUDA tracing) |
| **GreptimeDB** | Unified observability DB — replaces Prometheus + Loki + ES |
| **Netdata** | AI-powered full-stack monitoring, 800+ integrations, edge ML |
## Three pillars of observability
1. **Logs** — unstructured event data (ERROR, WARN, INFO)
2. **Metrics** — numerical data over time (latency, error rate, CPU utilization)
3. **Traces** — request tracking across services (distributed tracing)
## SLI / SLO / SLA
| Term | Meaning | Example |
|------|---------|---------|
| **SLI** (Service Level Indicator) | Measured metric | Latency p99 = 250ms |
| **SLO** (Service Level Objective) | Target value | 99.9 % of requests < 300ms |
| **SLA** (Service Level Agreement) | Legal commitment | 99.95 % uptime |
### Error budget
`Error Budget = 100 % - SLO`
- If SLO is 99.9 %, error budget is 0.1 % of time
- While error budget remains, the team can deploy new features
- When exhausted — freeze on deploys, stability is priority
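
As a worked example of the formula above, the budget can be converted into allowed downtime for a given window. The function name and the 30-day default window are assumptions chosen for illustration.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the allowed 'bad' minutes in the window for a given SLO."""
    # Error Budget = 100 % - SLO, expressed as a fraction of the window.
    budget_fraction = (100.0 - slo_percent) / 100.0
    return budget_fraction * window_days * 24 * 60
```

For a 99.9 % SLO over 30 days this gives roughly 43.2 minutes; a 99 % SLO over 7 days gives roughly 100.8 minutes.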
## Pyramid of metrics — RED vs USE vs 4 Golden Signals
### 4 Golden Signals (Google SRE)
1. **Latency** — request processing time (distinguish success vs error latency)
2. **Traffic** — number of requests / throughput (RPS, QPS, throughput)
3. **Errors** — explicit errors (5xx, 4xx) and implicit (success with wrong result)
4. **Saturation** — how "full" the service is (CPU, memory, queue depth, connection pool)
### USE (for infrastructure)
- **U**tilization — how busy the resource is (% time active)
- **S**aturation — how much is waiting in queue (run queue, I/O wait)
- **E**rrors — errors (dropped packets, disk errors, OOM)
### RED (for services)
- **R**ate — requests per second
- **E**rrors — number of erroneous requests
- **D**uration — latency (distribution, percentiles)
| Methodology | Focus | Typical metrics |
|-------------|-------|-----------------|
| **4 Golden Signals** | Services + infrastructure | Latency, RPS, errors, saturation |
| **USE** | Infrastructure | CPU util, I/O saturation, disk errors |
| **RED** | Microservices | RPS, error rate, p50/p95/p99 latency |
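
To make the RED signals concrete, here is a small sketch that derives them from raw request records. The record shape `(status_code, duration_seconds)`, the choice to count only 5xx as errors, and the nearest-rank percentile are all assumptions for illustration; a real setup would compute these in the metrics backend.

```python
def red_summary(requests, window_seconds):
    """Compute Rate, Errors, Duration from (status_code, duration_seconds) records."""
    total = len(requests)
    if total == 0:
        return {"rate": 0.0, "error_ratio": 0.0, "p95_seconds": None}
    # Errors: here only server-side failures (5xx) are counted.
    errors = sum(1 for status, _ in requests if status >= 500)
    durations = sorted(duration for _, duration in requests)
    # Nearest-rank p95: ceil(0.95 * total), done in integer arithmetic.
    rank = -(-95 * total // 100)
    return {
        "rate": total / window_seconds,   # requests per second
        "error_ratio": errors / total,    # share of failed requests
        "p95_seconds": durations[rank - 1],
    }
```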
## PromQL examples
| Expression | Description |
|------------|-------------|
| `rate(http_requests_total[5m])` | Requests per second (average over 5 min) |
| `increase(http_requests_total[1h])` | Total increase over 1 hour |
| `sum by (status) (rate(http_requests_total[5m]))` | Requests aggregated by status code |
| `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))` | p99 latency |
| `avg_over_time(cpu_usage[1h])` | Average CPU utilization over an hour |
| `topk(5, sum(rate(http_requests_total[5m])) by (service))` | Top 5 services by RPS |
| `max_over_time(memory_usage[24h])` | Max memory usage over 24h |
| `rate(node_network_receive_drop_total[5m]) > 0` | Network interfaces dropping inbound packets (use `node_network_transmit_drop_total` for outbound) |
| `(1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))` | CPU utilization (1 - idle) |
| `rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])` | Average latency (`_sum` and `_count` are counters, so use `rate`, not `delta`) |
| `absent(metric)` | Alert when metric is missing |
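
The `histogram_quantile` row is easier to reason about once the interpolation is spelled out. The sketch below approximates the classic-histogram behaviour (linear interpolation inside the bucket that contains the target rank); it is a simplified illustration, not Prometheus source code, and it assumes cumulative buckets with at least one finite bound plus a `+Inf` bucket.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]          # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls beyond the last finite bucket: report its bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan
```

The interpolation is why bucket boundaries matter: the estimate can never be more precise than the bucket that contains the requested rank.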
## Recording rules
Pre-aggregation of frequently used PromQL queries to reduce query load.
### When to use
- Complex queries used across multiple dashboards
- Queries over raw data with high cardinality
- Frequently queried aggregations (e.g., p99 latency over last month)
### Example
```yaml
groups:
  - name: service_rules
    interval: 1m
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: instance:cpu:utilization
        expr: (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance))
      - record: service:http_latency:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
```
- **record** — new metric name (convention: `level:metric:aggregation`)
- **interval** — how often the rule evaluates (typically 1-5 min)
## Monitoring tools
### Metrics
| Tool | Description |
|------|-------------|
| Prometheus | Pull-based, time-series DB, powerful query language (PromQL) |
| Grafana | Visualization, dashboards, alerting |
| Zabbix | Enterprise monitoring, agent + agentless (SNMP/IPMI/JMX), auto-discovery, trigger-based alerting |
| Datadog | SaaS, APM, logs, metrics in one |
| New Relic | APM, browser monitoring |
| CloudWatch | AWS native |
| Azure Monitor | Azure native |
| Google Cloud Ops | GCP native |
### Logging
| Tool | Description |
|------|-------------|
| ELK Stack | Elasticsearch, Logstash, Kibana |
| Loki | Grafana Loki — lightweight, Prometheus-like |
| Splunk | Enterprise log management |
| Fluentd / Fluent Bit | Log collector and forwarder |
| Vector | High-performance log/metric collector |
### Tracing
| Tool | Description |
|------|-------------|
| Jaeger | Open-source distributed tracing |
| Zipkin | Open-source distributed tracing |
| OpenTelemetry | Standard for instrumentation (logs, metrics, traces) |
| Datadog APM | SaaS tracing |
| AWS X-Ray | AWS tracing |
## OpenTelemetry detail
### Span attributes
```yaml
resource:
  attributes:
    - service.name: "payment-service"
    - service.version: "1.2.3"
    - deployment.environment: "production"
scope:
  name: "io.opentelemetry.payment"
spans:
  - name: "processPayment"
    kind: SPAN_KIND_INTERNAL
    attributes:
      - payment.method: "credit_card"
      - payment.amount: 2499
      - payment.currency: "CZK"
    events:
      - name: "authorization.complete"
        timestamp: 1717428000000000000
```
### Context propagation (W3C TraceContext)
- **`traceparent`** — header carrying trace-id, span-id, trace flags
  - Format: `00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01`
  - Version (00) | Trace-ID (32 hex) | Span-ID (16 hex) | TraceFlags (01 = sampled)
- **`tracestate`** — vendor-specific data, compatible cross-provider
- Propagation happens via HTTP headers, gRPC metadata, message queue properties
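
The `traceparent` layout described above can be checked with a short parser. This is a sketch for the version `00` layout only; `parse_traceparent` and the `TraceParent` tuple are hypothetical names, and production code would normally rely on an OpenTelemetry propagator instead.

```python
import re
from typing import NamedTuple, Optional

# version - trace-id (32 hex) - span-id (16 hex) - trace flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)

class TraceParent(NamedTuple):
    version: str
    trace_id: str
    span_id: str
    sampled: bool

def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per W3C Trace Context.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, span_id, bool(int(flags, 16) & 0x01))
```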
### Sampling
| Type | Description | Use case |
|------|-------------|----------|
| **Head-based** | Sampling decision at trace start (based on ID) | Simple, deterministic |
| **Tail-based** | Decision after trace completion (based on result, latency) | Better sampling, more complex |
- Tail-based sampling: often used for critical traces (5xx, p99+, slow traces)
- Tools: Grafana Tempo (tail-based), Jaeger (head-based), OTel Collector (head + tail)
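
Head-based sampling "based on ID" can be sketched as a deterministic threshold test on the trace ID, so that every service reaches the same decision for the same trace without coordination. The function below is an illustration of the idea under that assumption, not the sampler shipped by any SDK.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministically keep roughly `ratio` of all traces (0.0 to 1.0)."""
    # Treat the low 64 bits of the trace ID as a uniformly random number.
    low64 = int(trace_id_hex, 16) & 0xFFFFFFFFFFFFFFFF
    # Keep the trace if that number falls below the ratio-scaled threshold.
    return low64 < int(ratio * (1 << 64))
```

Because the decision depends only on the trace ID, it can be made at the start of the trace; tail-based sampling instead needs the finished trace (status, latency) and therefore buffering in a collector.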
## Alerting
### Principles
- **Alert on symptom, not cause** — "500 errors" instead of "high CPU"
- **Reduce noise** — flapping alerts, alert fatigue
- **Runbook for every alert** — what to do when alert fires
- **Alert severity** — P0 (critical), P1 (high), P2 (medium), P3 (low)
### Alertmanager (Prometheus)
```yaml
route:
  receiver: "team-pager"
  group_by: ["alertname", "cluster"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: "team-pager"
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: "team-slack"
receivers:
  - name: "team-pager"
    pagerduty_configs:
      - routing_key: "<KEY>"
        severity: "{{ .CommonLabels.severity }}"
  - name: "team-slack"
    slack_configs:
      - channel: "#alerts"
        title: "{{ .GroupLabels.alertname }}"
```
**Concepts**:
- **Grouping** — grouping alerts by labels (noise reduction, e.g., all down instances in a cluster)
- **Inhibition** — suppression of less severe alerts when a more severe one exists (e.g., nodedown inhibits pod alerts)
- **Silencing** — temporary alert suppression (matching labels + duration)
- **Routing tree** — hierarchical routing by label match (severity, service, team)
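
The grouping concept can be illustrated independently of Alertmanager itself. The sketch below buckets alerts by the values of the `group_by` labels, which is the idea behind sending one notification per group; the alert dictionaries and the `group_alerts` helper are assumptions for illustration only.

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        # Alerts sharing the same values for all group_by labels form one group.
        key = tuple((label, alert["labels"].get(label, "")) for label in group_by)
        groups[key].append(alert)
    return dict(groups)
```

With `group_by: ["alertname", "cluster"]`, ten `InstanceDown` alerts from one cluster collapse into a single group, and therefore a single notification.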
### ESM (Event / Incident Management)
- PagerDuty, Opsgenie, OnCall (Grafana)
- Escalation policies
- On-call rotations
## Structured logging
```json
{
  "timestamp": "2026-06-03T10:30:00Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123",
  "user_id": "u456",
  "message": "Payment gateway timeout",
  "duration_ms": 1200,
  "error": {
    "type": "TimeoutError",
    "message": "Gateway did not respond in 1000ms"
  }
}
```
### Required fields of structured log
| Field | Description | Example |
|-------|-------------|---------|
| `timestamp` | ISO 8601 / RFC 3339 | `2026-06-03T10:30:00Z` |
| `level` | Log level (RFC 5424) | `ERROR`, `WARN`, `INFO`, `DEBUG` |
| `message` | Human-readable message | `Payment processed` |
| `service` | Service name | `payment-service` |
| `trace_id` | Correlation across services | `abc123def456` |
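
One way to emit these required fields from Python is a small formatter for the standard `logging` module. This is a sketch: the `JsonFormatter` class and the convention of passing `trace_id` via `extra=` are assumptions, and note that Python names the warning level `WARNING` rather than `WARN`.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Format log records as single-line JSON with the required fields."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            # RFC 3339 timestamp in UTC with a trailing "Z"
            "timestamp": timestamp.isoformat().replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            # trace_id is expected via logger.error(..., extra={"trace_id": ...})
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)
```

A handler would use it via `handler.setFormatter(JsonFormatter("payment-service"))`, after which `logger.error("Payment gateway timeout", extra={"trace_id": "abc123"})` produces one JSON object per line.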
### RFC 5424 log levels
| Number | Level | Usage |
|--------|-------|-------|
| 0 | EMERG | System unusable |
| 1 | ALERT | Immediate action required |
| 2 | CRIT | Critical error |
| 3 | ERROR | Error (non-critical) |
| 4 | WARN | Warning |
| 5 | NOTICE | Normal but significant event |
| 6 | INFO | Informational message |
| 7 | DEBUG | Debugging (disabled in production) |
### Correlation ID (traceparent)
- Generated at system entry (API gateway, frontend, message consumer)
- Propagated in HTTP header `X-Correlation-ID` / `traceparent`
- Enables linking logs across microservices (→ Grafana Explore, Kibana Discover)
- Implementation: middleware in app, service mesh (Envoy), API gateway
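
The "generate at entry, propagate afterwards" rule can be sketched as a framework-independent helper that middleware would call on each incoming request. The helper name and the plain-dict header representation are assumptions; real frameworks expose their own header objects.

```python
import uuid

def ensure_correlation_id(headers: dict) -> dict:
    """Return a copy of headers that is guaranteed to carry X-Correlation-ID."""
    result = dict(headers)
    # HTTP header names are case-insensitive, so compare in lower case.
    has_id = any(name.lower() == "x-correlation-id" for name in headers)
    if not has_id:
        # Entry point of the system: mint a new ID for this request.
        result["X-Correlation-ID"] = uuid.uuid4().hex
    return result
```

The same returned headers would then be attached to every outbound call and written into each log line, which is what makes cross-service log linking possible.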
## Distributed tracing detail
### Span kinds
| Kind | Description | Example |
|------|-------------|---------|
| **CLIENT** | Calling downstream service (outbound) | HTTP client calling API |
| **SERVER** | Processing incoming request | HTTP handler |
| **INTERNAL** | Local operation within service | Computation, transformation |
| **PRODUCER** | Sending message to queue | Kafka producer |
| **CONSUMER** | Receiving message from queue | Kafka consumer |
### Trace context chain
```
Trace: abc123
├── Span: /checkout (SERVER, root)
│   ├── Span: validateCart (INTERNAL)
│   ├── Span: POST /orders (CLIENT → payment-service)
│   │   └── Span: /processPayment (SERVER)
│   │       ├── Span: authorizeCard (INTERNAL)
│   │       └── Span: chargeCard (CLIENT → bank-gateway)
│   │           └── Span: /charge (SERVER, external)
│   └── Span: sendConfirmation (PRODUCER → kafka)
│       └── Span: consumeConfirmation (CONSUMER → email-service)
```
- **W3C TraceContext** — standardized cross-service tracing
- **Baggage** — transport of contextual data (tenant, user role) between spans
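
A trace backend rebuilds a hierarchy like the one above from flat span records, using each span's parent reference. The sketch below shows that reconstruction; the field names `span_id`, `parent_id`, `name`, and `kind` are assumptions for illustration rather than a specific wire format.

```python
from collections import defaultdict

def render_trace(spans):
    """Render flat span records as an indented tree, roots first."""
    children = defaultdict(list)
    for span in spans:
        # Root spans have parent_id None.
        children[span["parent_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append("  " * depth + f'{span["name"]} ({span["kind"]})')
            walk(span["span_id"], depth + 1)

    walk(None, 0)
    return "\n".join(lines)
```

Spans can arrive out of order and from different services; only the shared trace ID and the parent references are needed to stitch them together.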
## Grafana
### Provisioning dashboards as code
```yaml
apiVersion: 1
providers:
  - name: "default"
    orgId: 1
    folder: "Services"
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards
```
Dashboard JSON files in git → CI/CD → automatic import into Grafana.
### Variables
- **Query variable** — dynamic values (e.g., list of service names from PromQL: `label_values(up, service)`)
- **Interval variable** — `$__auto_interval`, `$__interval` for variable time range
- **Custom variable** — manual list of values (env: prod, staging, dev)
- **Chained variable** — dependent variable (select namespace → show pods in namespace)
### Annotations
- Drawing events in graphs (deploys, incidents, config changes)
- Sources: Prometheus alerts, Loki logs, GitHub Actions, custom API
- Use case: "Deploy at 14:30 → spike in latency at 14:31 → correlation"
## On-call best practices
### Escalation policies
```
Level 1: Primary on-call (response within 5 min)
└── timeout 15 min
Level 2: Secondary / senior engineer (response within 15 min)
└── timeout 15 min
Level 3: Engineering manager / incident commander
```
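
The escalation chain above reduces to a small calculation: given how long an alert has gone unacknowledged, which level is currently paged? The function and its default timeouts mirror the example policy and are assumptions for illustration; tools such as PagerDuty or Grafana OnCall implement this internally.

```python
def escalation_level(minutes_unacknowledged: float, timeouts=(15, 15)) -> int:
    """Return the level (1-based) that should currently be paged."""
    level = 1
    elapsed = 0
    for timeout in timeouts:
        elapsed += timeout
        # Each expired timeout hands the alert to the next level.
        if minutes_unacknowledged >= elapsed:
            level += 1
    return level
```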
### Incident severity matrix
| Severity | Description | Response | Communication |
|----------|-------------|----------|---------------|
| **P0 (Critical)** | Service completely unavailable, data loss, security breach | Immediate, 24/7 | Status page + Stakeholder update |
| **P1 (High)** | Major functionality degraded, part of users affected | Within 15 min | Slack channel + Team lead |
| **P2 (Medium)** | Non-critical feature broken, workaround exists | Within 1 h | Slack channel |
| **P3 (Low)** | Cosmetic issue, no user impact | Next business day | Jira ticket |
### Postmortem
- **Blameless** — goal is to learn, not blame
- **Structure**: Timeline, detection, root cause, resolution, action items
- **SRE principle**: every incident → postmortem → systemic improvement
- **Tools**: Jira, Incident.io, PagerDuty postmortem, Google Docs
## Monitoring patterns
### Best practices
- **Dashboard for each level** — executive, service, troubleshooting
- **Synthetic monitoring** — heartbeat checks, browser tests (Playwright, Cypress)
- **APM** — Application Performance Monitoring (database queries, external calls)
- **Anomaly detection** — ML-based outlier detection
- **Retention policy** — raw data short term, aggregations long term
- **Unified log format** — JSON, structured data
## Recommended literature
### Classic books
| Book | Authors | ISBN | Key topics |
|------|---------|------|------------|
| **Site Reliability Engineering** | Beyer, Jones, Petoff, Murphy | 978-1491929124 | How Google runs production systems — SRE principles, error budgets, toil, SLI/SLO |
| **The Site Reliability Workbook** | Beyer, Murphy, Rensin, Kawahara, Thorne | 978-1492029502 | Practical companion to SRE — case studies from Evernote, Home Depot, NY Times; SLO implementation, monitoring, on-call |
| **Observability Engineering** | Majors, Fong-Jones, Miranda | 978-1492076445 | First comprehensive book on observability — structured events, iterative hypothesis verification, core analysis loop; 2nd edition in 2026 (32 new chapters on AI, cost governance) |
### Cloud and monitoring
| Book | Author | ISBN/Year | Topics |
|------|--------|-----------|--------|
| **Cloud Observability in Action** | Michael Hausenblas | Manning, 2023 | Practical guide to observability in cloud-native environments — signal types (logs, metrics, traces, profiles), OTel Collector, SLOs, signal correlation, developer observability; open-source tools |
| **Mastering Prometheus** | William Hegedus | 978-1-80512-566-2 | Advanced Prometheus techniques — TSDB internals, custom service discovery, cardinality, remote storage (VictoriaMetrics, Mimir), SLO-based alerting; author is SRE manager at Akamai and Prometheus/Thanos contributor |
| **Observability with Grafana** | Chapman, Holmes | 978-1-80324-964-3 | Complete guide to LGTM stack (Loki, Grafana, Tempo, Mimir) — OTel instrumentation, LogQL/PromQL/TraceQL, AI/ML alerting, real user monitoring with Faro, Pyroscope profiling, k6 load testing |
| **Hands-On Monitoring and Alerting with Prometheus** | Muhammad Badawy | 978-9349887565 | Practical Prometheus guide — installation, configuration, service discovery, labeling, PromQL, Alertmanager, monitoring Linux, Windows, Docker, databases |
### AI and observability
| Book | Authors | ISBN/Year | Topics |
|------|---------|-----------|--------|
| **Observability in the AI-Native Era** | Lipsig, Grabner, Rati | 978-1-80638-959-9 | Connecting observability with AIOps — ML-based anomaly detection, root-cause analysis, self-healing systems, OTel + Prometheus + Grafana + Dynatrace/Datadog, compliance |
| **Open Source Observability** | Corless, Pawar | O'Reilly, 2025 | Report on disaggregated, modular observability stacks — flexibility, cost efficiency, data autonomy, blueprint for custom solutions from open-source components |
## Detailed tool overview
Extended information on the tools listed in the "New tools and trends" table:
### Grafana Sigil
AI observability product from Grafana Labs. OpenTelemetry-native SDK for instrumenting LLM agents:
- **Repository**: `github.com/grafana/sigil-sdk` (Go SDK) + `sigil-app` (Grafana plugin)
- **Features**: tracking conversations, generation, tool usage, cost tracking, quality evaluation
- **Growing problem**: 500M+ conversations, 5M+ agents in production (GrafanaCON 2026)
- **Integration**: automatic connection with Prometheus (metrics), Tempo (traces), AI Observability API
### InfraLens
Zero-instrumentation Kubernetes observability built on eBPF:
- **Repository**: `github.com/Herenn/Infralens` (Apache 2.0, Go)
- **Features**: automatic detection of service-to-service communication, topology visualization, AI-powered documentation
- **Architecture**: eBPF agent + Go backend + React frontend
- **Status**: early-stage (1 star, 10 commits), but eBPF-based observability concept is proven (Grafana Beyla, Cilium Hubble, Pixie)
### Ingero
GPU causal observability agent — first of its kind:
- **Repository**: `github.com/ingero-io/ingero` (Apache 2.0)
- **Features**: eBPF tracing from Linux kernel events through CUDA API to Python source code
- **Overhead**: < 2 %, zero code changes, single binary
- **MCP server**: native Model Context Protocol support — AI assistants can directly query GPU data
- **Use case**: diagnosis of GPU stalls, scheduler preemptions, CUDA memory spikes — causal chains instead of plain metrics
- **Version**: v0.19.0 (2026), active development
### GreptimeDB
Unified observability database — one backend for metrics, logs and traces:
- **Repository**: `github.com/GreptimeTeam/greptimedb` (Apache 2.0, Rust)
- **Architecture**: compute-storage disaggregation, object storage first (S3, GCS, Azure Blob), columnar storage
- **Querying**: SQL + PromQL in a single query, JOIN between metrics and logs possible
- **Drop-in replacement**: Prometheus (PromQL, remote write), Loki (Push API), Elasticsearch (bulk API), Jaeger (Query API)
- **Cost reduction**: up to 50× lower costs compared to traditional solutions (project's own claim)
- **Roadmap 2026**: v1.0 GA (Q1 2026), v1.1–v1.3 (Vector Index, AI Functions, Auto Rollup, adaptive resource management)
- **GreptimeDB Enterprise**: enhanced security, HA, enterprise support
### Netdata
Open-source, real-time monitoring platform for entire infrastructure:
- **Repository**: `github.com/netdata/netdata` (GPLv3+, C; 79k★)
- **Features**: per-second metrics, ML-based anomaly detection, AI-powered troubleshooting, 800+ integrations
- **Zero configuration**: auto-discovery, pre-configured alerts, ready dashboards
- **Architecture**: distributed agent → Netdata Cloud (optional), data stays local
- **Energy efficiency**: according to University of Amsterdam study, the most efficient tool for monitoring Docker systems
- **Netdata Cloud**: free tier (5 nodes), paid from $12/node/month
- **Licensing**: agent GPLv3+, dashboard NCUL1, cloud closed-source
## OpenStack Monitoring
OpenStack provides several services for telemetry and monitoring:
### Ceilometer (Telemetry)
- Metric collection (CPU, memory, network, storage) from compute, network and storage nodes
- Publishing to Gnocchi (time-series DB) or Panko (event storage)
- Notifications via oslo.messaging (RabbitMQ) — pipeline transformations
- Alarming: Aodh — threshold-based alarms, metric combinations
### Monasca
- More modern alternative to Ceilometer (primarily developed for telco use cases)
- Architecture: Monasca API → Log API → Transform → Threshold Engine → Notifier
- Backend: InfluxDB/Gnocchi, Kafka, Elasticsearch
- Supports alerting, notifications, graph dashboards
### Prometheus + OpenStack Exporter
- OpenStack-exporter for Prometheus (exports metrics from Ceilometer / API)
- Service discovery via Prometheus
- Grafana dashboards for visualization
### Masakari (VM High Availability)
- Detection and automatic recovery of VMs on hypervisor failure (host failure)
- Evacuation of instances to healthy compute node
- Integration with Pacemaker for cluster management
## Sources
Links, books and standards: [sources/monitoring/sources.md](sources/monitoring/sources.md)
*Last revision: 2026-06-03*