# 📊 Monitoring and observability

## OpenMetrics standard

OpenMetrics (CNCF sandbox) is the de-facto standard for metric exposition in cloud-native environments:

- Supports text representation and Protocol Buffers
- Builds on the Prometheus exposition format
- Specifies metric types: counter, gauge, histogram, summary, gaugehistogram, stateset
- `_total` suffix for cumulative values, `_bucket` for histograms
- Metadata: HELP, TYPE, UNIT (timestamp optional)

The standard is developed within [OpenObservability](https://github.com/OpenObservability/OpenMetrics).

## New tools and trends (2024–2026)

| Tool | Description |
|------|-------------|
| **Grafana Sigil** | AI observability for LLM agents (OTel-native) |
| **InfraLens** | eBPF-based, zero-instrumentation network observability |
| **Ingero** | GPU causal observability (eBPF, CUDA tracing) |
| **GreptimeDB** | Unified observability DB; replaces Prometheus + Loki + ES |
| **Netdata** | AI-powered full-stack monitoring, 800+ integrations, edge ML |

## Three pillars of observability

1. **Logs**: timestamped event records, structured or unstructured (ERROR, WARN, INFO)
2. **Metrics**: numerical data over time (latency, error rate, CPU utilization)
3. **Traces**: request tracking across services (distributed tracing)

## SLI / SLO / SLA

| Term | Meaning | Example |
|------|---------|---------|
| **SLI** (Service Level Indicator) | Measured metric | Latency p99 = 250 ms |
| **SLO** (Service Level Objective) | Target value | 99.9 % of requests < 300 ms |
| **SLA** (Service Level Agreement) | Legal commitment | 99.95 % uptime |

### Error budget

`Error Budget = 100 % - SLO`

- If the SLO is 99.9 %, the error budget is 0.1 % of the time (about 43 minutes in a 30-day month)
- While error budget remains, the team can deploy new features
- When it is exhausted, feature deploys are frozen and stability takes priority

## Pyramid of metrics: RED vs USE vs 4 Golden Signals

### 4 Golden Signals (Google SRE)

1. **Latency**: request processing time (distinguish the latency of successful requests from that of failed ones)
2. **Traffic**: demand placed on the service (RPS, QPS, throughput)
3. **Errors**: explicit errors (5xx, 4xx) and implicit ones (success with a wrong result)
4. **Saturation**: how "full" the service is (CPU, memory, queue depth, connection pool)

### USE (for infrastructure)

- **U**tilization: how busy the resource is (% of time active)
- **S**aturation: how much work is waiting in a queue (run queue, I/O wait)
- **E**rrors: error events (dropped packets, disk errors, OOM)

### RED (for services)

- **R**ate: requests per second
- **E**rrors: number of failed requests
- **D**uration: latency (distribution, percentiles)

| Methodology | Focus | Typical metrics |
|-------------|-------|-----------------|
| **4 Golden Signals** | Services + infrastructure | Latency, RPS, errors, saturation |
| **USE** | Infrastructure | CPU util, I/O saturation, disk errors |
| **RED** | Microservices | RPS, error rate, p50/p95/p99 latency |

## PromQL examples

| Expression | Description |
|------------|-------------|
| `rate(http_requests_total[5m])` | Requests per second (average over 5 min) |
| `increase(http_requests_total[1h])` | Total increase over 1 hour |
| `sum by (status) (rate(http_requests_total[5m]))` | Requests aggregated by status code |
| `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))` | p99 latency |
| `avg_over_time(cpu_usage[1h])` | Average CPU utilization over an hour |
| `topk(5, sum(rate(http_requests_total[5m])) by (service))` | Top 5 services by RPS |
| `max_over_time(memory_usage[24h])` | Max memory usage over 24h |
| `rate(node_network_receive_drop_total[5m]) > 0` | Network interfaces dropping received packets |
| `(1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))` | CPU utilization (1 - idle) |
| `rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])` | Average latency (`rate`, not `delta`, because both series are counters) |
| `absent(metric)` | Returns 1 when the metric is missing (basis for "metric missing" alerts) |

## Recording rules

Pre-aggregation of frequently used PromQL queries to reduce query
load.

### When to use

- Complex queries used across multiple dashboards
- Queries over raw data with high cardinality
- Frequently queried aggregations (e.g., p99 latency over last month)

### Example

```yaml
groups:
  - name: service_rules
    interval: 1m
    rules:
      # Request rate per job
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      # CPU utilization per instance (1 - idle share)
      - record: instance:cpu:utilization
        expr: (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance))
      # p99 latency per service
      - record: service:http_latency:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
```

- **record**: new metric name (convention: `level:metric:operations`)
- **interval**: how often the rule evaluates (typically 1-5 min)

## Monitoring tools

### Metrics

| Tool | Description |
|------|-------------|
| Prometheus | Pull-based, time-series DB, powerful query language (PromQL) |
| Grafana | Visualization, dashboards, alerting |
| Zabbix | Enterprise monitoring, agent + agentless (SNMP/IPMI/JMX), auto-discovery, trigger-based alerting |
| Datadog | SaaS, APM, logs, metrics in one |
| New Relic | APM, browser monitoring |
| CloudWatch | AWS native |
| Azure Monitor | Azure native |
| Google Cloud Ops | GCP native |

### Logging

| Tool | Description |
|------|-------------|
| ELK Stack | Elasticsearch, Logstash, Kibana |
| Loki | Grafana Loki: lightweight, Prometheus-like |
| Splunk | Enterprise log management |
| Fluentd / Fluent Bit | Log collector and forwarder |
| Vector | High-performance log/metric collector |

### Tracing

| Tool | Description |
|------|-------------|
| Jaeger | Open-source distributed tracing |
| Zipkin | Open-source distributed tracing |
| OpenTelemetry | Standard for instrumentation (logs, metrics, traces) |
| Datadog APM | SaaS tracing |
| AWS X-Ray | AWS tracing |

## OpenTelemetry detail

### Span attributes

```yaml
resource:
  attributes:
    - service.name: "payment-service"
    - service.version: "1.2.3"
    - deployment.environment: "production"
scope:
  name: "io.opentelemetry.payment"
spans:
  - name: "processPayment"
    kind: SPAN_KIND_INTERNAL
    attributes:
      - payment.method: "credit_card"
      - payment.amount: 2499
      - payment.currency: "CZK"
    events:
      - name: "authorization.complete"
        timestamp: 1717428000000000000
```

### Context propagation (W3C TraceContext)

- **`traceparent`**: header carrying trace-id, span-id, trace flags
  - Format: `00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01`
  - Version (00) | Trace-ID (32 hex) | Span-ID (16 hex) | TraceFlags (01 = sampled)
- **`tracestate`**: vendor-specific data that stays compatible across providers
- Propagation happens via HTTP headers, gRPC metadata, message queue properties

### Sampling

| Type | Description | Use case |
|------|-------------|----------|
| **Head-based** | Sampling decision at trace start (based on ID) | Simple, deterministic |
| **Tail-based** | Decision after trace completion (based on result, latency) | More selective, but more complex to operate |

- Tail-based sampling: often used to keep critical traces (5xx, p99+, slow traces)
- Tools: Grafana Tempo (tail-based), Jaeger (head-based), OTel Collector (head + tail)

## Alerting

### Principles

- **Alert on symptom, not cause**: "500 errors" instead of "high CPU"
- **Reduce noise**: avoid flapping alerts and alert fatigue
- **Runbook for every alert**: what to do when the alert fires
- **Alert severity**: P0 (critical), P1 (high), P2 (medium), P3 (low)

### Alertmanager (Prometheus)

```yaml
route:
  receiver: "team-pager"          # default receiver
  group_by: ["alertname", "cluster"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # `matchers` replaces the deprecated `match` field
    - matchers:
        - severity="critical"
      receiver: "team-pager"
      repeat_interval: 1h
    - matchers:
        - severity="warning"
      receiver: "team-slack"
receivers:
  - name: "team-pager"
    pagerduty_configs:
      - routing_key: ""
        severity: "{{ .CommonLabels.severity }}"
  - name: "team-slack"
    slack_configs:
      - channel: "#alerts"
        title: "{{ .GroupLabels.alertname }}"
```

**Concepts**:

- **Grouping**: grouping alerts by labels (noise reduction, e.g., all down instances
in a cluster)
- **Inhibition**: suppression of less severe alerts while a more severe one is firing (e.g., a node-down alert inhibits pod alerts)
- **Silencing**: temporary alert suppression (matching labels + duration)
- **Routing tree**: hierarchical routing by label match (severity, service, team)

### ESM (Event / Incident Management)

- PagerDuty, Opsgenie, OnCall (Grafana)
- Escalation policies
- On-call rotations

## Structured logging

```json
{
  "timestamp": "2026-06-03T10:30:00Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123",
  "user_id": "u456",
  "message": "Payment gateway timeout",
  "duration_ms": 1200,
  "error": {
    "type": "TimeoutError",
    "message": "Gateway did not respond in 1000ms"
  }
}
```

### Required fields of structured log

| Field | Description | Example |
|-------|-------------|---------|
| `timestamp` | ISO 8601 / RFC 3339 | `2026-06-03T10:30:00Z` |
| `level` | Log level (RFC 5424) | `ERROR`, `WARN`, `INFO`, `DEBUG` |
| `message` | Human-readable message | `Payment processed` |
| `service` | Service name | `payment-service` |
| `trace_id` | Correlation across services | `abc123def456` |

### RFC 5424 log levels

| Number | Level | Usage |
|--------|-------|-------|
| 0 | EMERG | System unusable |
| 1 | ALERT | Immediate action required |
| 2 | CRIT | Critical error |
| 3 | ERROR | Error (non-critical) |
| 4 | WARN | Warning |
| 5 | NOTICE | Normal but significant event |
| 6 | INFO | Informational message |
| 7 | DEBUG | Debugging (usually disabled in production) |

### Correlation ID (traceparent)

- Generated at system entry (API gateway, frontend, message consumer)
- Propagated in HTTP header `X-Correlation-ID` / `traceparent`
- Enables linking logs across microservices (→ Grafana Explore, Kibana Discover)
- Implementation: middleware in app, service mesh (Envoy), API gateway

## Distributed tracing detail

### Span kinds

| Kind | Description | Example |
|------|-------------|---------|
| **CLIENT** | Calling downstream service (outbound) | HTTP client calling API |
| **SERVER** | Processing incoming request | HTTP handler |
| **INTERNAL** | Local operation within service | Computation, transformation |
| **PRODUCER** | Sending message to queue | Kafka producer |
| **CONSUMER** | Receiving message from queue | Kafka consumer |

### Trace context chain

```
Trace: abc123
├── Span: /checkout (SERVER, root)
│   ├── Span: validateCart (INTERNAL)
│   ├── Span: POST /orders (CLIENT → payment-service)
│   │   └── Span: /processPayment (SERVER)
│   │       ├── Span: authorizeCard (INTERNAL)
│   │       └── Span: chargeCard (CLIENT → bank-gateway)
│   │           └── Span: /charge (SERVER, external)
│   └── Span: sendConfirmation (PRODUCER → kafka)
│       └── Span: consumeConfirmation (CONSUMER → email-service)
```

- **W3C TraceContext**: standardized cross-service tracing
- **Baggage**: transport of contextual data (tenant, user role) between spans

## Grafana

### Provisioning dashboards as code

```yaml
apiVersion: 1
providers:
  - name: "default"
    orgId: 1
    folder: "Services"
    type: file
    options:
      # Directory Grafana scans for dashboard JSON files
      path: /etc/grafana/provisioning/dashboards
```

Dashboard JSON in git → CI/CD → automatic import into Grafana.
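
The git → CI/CD → import flow above can be guarded by a small check that runs in the pipeline before dashboard files reach the provisioning path. The following is a minimal sketch in Python, not part of Grafana's tooling: the function name `validate_dashboard` and the list of required fields (`uid`, `title`, `panels`) are assumptions chosen for illustration, so adjust them to your own conventions.

```python
import json

# Fields this sketch expects in every dashboard file.
# Assumption for illustration only; Grafana itself does not enforce this list.
REQUIRED_FIELDS = ("uid", "title", "panels")


def validate_dashboard(raw: str) -> list[str]:
    """Return a list of problems found in one dashboard JSON document."""
    try:
        dashboard = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    if not isinstance(dashboard, dict):
        return ["top-level value must be a JSON object"]

    # Report every missing field, in the order listed in REQUIRED_FIELDS.
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if name not in dashboard
    ]

    # An explicit, non-empty uid keeps the dashboard identity stable
    # when the same file is imported again.
    uid = dashboard.get("uid")
    if uid is not None and (not isinstance(uid, str) or not uid):
        problems.append("uid must be a non-empty string")

    return problems
```

In a CI job, a script like this would read each `*.json` file under the dashboards directory, call `validate_dashboard` on its contents, and fail the build if any problem list is non-empty.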
### Variables

- **Query variable**: dynamic values (e.g., list of service names from PromQL: `label_values(up, service)`)
- **Interval variable**: `$__auto_interval`, `$__interval` for a step that follows the selected time range
- **Custom variable**: manual list of values (env: prod, staging, dev)
- **Chained variable**: dependent variable (select namespace → show pods in namespace)

### Annotations

- Drawing events in graphs (deploys, incidents, config changes)
- Sources: Prometheus alerts, Loki logs, GitHub Actions, custom API
- Use case: "Deploy at 14:30 → spike in latency at 14:31 → correlation"

## On-call best practices

### Escalation policies

```
Level 1: Primary on-call (response within 5 min)
    └── timeout 15 min
Level 2: Secondary / senior engineer (response within 15 min)
    └── timeout 15 min
Level 3: Engineering manager / incident commander
```

### Incident severity matrix

| Severity | Description | Response | Communication |
|----------|-------------|----------|---------------|
| **P0 (Critical)** | Service completely unavailable, data loss, security breach | Immediate, 24/7 | Status page + Stakeholder update |
| **P1 (High)** | Major functionality degraded, part of users affected | Within 15 min | Slack channel + Team lead |
| **P2 (Medium)** | Non-critical feature broken, workaround exists | Within 1 h | Slack channel |
| **P3 (Low)** | Cosmetic issue, no user impact | Next business day | Jira ticket |

### Postmortem

- **Blameless**: the goal is to learn, not to blame
- **Structure**: Timeline, detection, root cause, resolution, action items
- **SRE principle**: every incident → postmortem → systemic improvement
- **Tools**: Jira, Incident.io, PagerDuty postmortem, Google Docs

## Logging patterns

### Best practices

- **Dashboard for each level**: executive, service, troubleshooting
- **Synthetic monitoring**: heartbeat checks, browser tests (Playwright, Cypress)
- **APM**: Application Performance Monitoring (database queries, external calls)
- **Anomaly detection**: ML-based
outlier detection
- **Retention policy**: raw data short term, aggregations long term
- **Unified log format**: JSON, structured data

## Recommended literature

### Classic books

| Book | Authors | ISBN | Key topics |
|------|---------|------|------------|
| **Site Reliability Engineering** | Beyer, Jones, Petoff, Murphy | 978-1491929124 | How Google runs production systems: SRE principles, error budgets, toil, SLI/SLO |
| **The Site Reliability Workbook** | Beyer, Murphy, Rensin, Kawahara, Thorne | 978-1492029502 | Practical companion to SRE: case studies from Evernote, Home Depot, NY Times; SLO implementation, monitoring, on-call |
| **Observability Engineering** | Majors, Fong-Jones, Miranda | 978-1492076445 | First comprehensive book on observability: structured events, iterative hypothesis verification, core analysis loop; 2nd edition in 2026 (32 new chapters on AI, cost governance) |

### Cloud and monitoring

| Book | Author | ISBN/Year | Topics |
|------|--------|-----------|--------|
| **Cloud Observability in Action** | Michael Hausenblas | Manning, 2023 | Practical guide to observability in cloud-native environments: signal types (logs, metrics, traces, profiles), OTel Collector, SLOs, signal correlation, developer observability; open-source tools |
| **Mastering Prometheus** | William Hegedus | 978-1-80512-566-2 | Advanced Prometheus techniques: TSDB internals, custom service discovery, cardinality, remote storage (VictoriaMetrics, Mimir), SLO-based alerting; author is SRE manager at Akamai and Prometheus/Thanos contributor |
| **Observability with Grafana** | Chapman, Holmes | 978-1-80324-964-3 | Complete guide to LGTM stack (Loki, Grafana, Tempo, Mimir): OTel instrumentation, LogQL/PromQL/TraceQL, AI/ML alerting, real user monitoring with Faro, Pyroscope profiling, k6 load testing |
| **Hands-On Monitoring and Alerting with Prometheus** | Muhammad Badawy | 978-9349887565 | Practical Prometheus guide: installation, configuration, service discovery, labeling, PromQL, Alertmanager, monitoring Linux, Windows, Docker, databases |

### AI and observability

| Book | Authors | ISBN/Year | Topics |
|------|---------|-----------|--------|
| **Observability in the AI-Native Era** | Lipsig, Grabner, Rati | 978-1-80638-959-9 | Connecting observability with AIOps: ML-based anomaly detection, root-cause analysis, self-healing systems, OTel + Prometheus + Grafana + Dynatrace/Datadog, compliance |
| **Open Source Observability** | Corless, Pawar | O'Reilly, 2025 | Report on disaggregated, modular observability stacks: flexibility, cost efficiency, data autonomy, blueprint for custom solutions from open-source components |

## Detailed tool overview

Extended information on the tools listed under "New tools and trends":

### Grafana Sigil

AI observability product from Grafana Labs. OpenTelemetry-native SDK for instrumenting LLM agents:

- **Repository**: `github.com/grafana/sigil-sdk` (Go SDK) + `sigil-app` (Grafana plugin)
- **Features**: tracking conversations, generation, tool usage, cost tracking, quality evaluation
- **Growing problem**: 500M+ conversations, 5M+ agents in production (GrafanaCON 2026)
- **Integration**: automatic connection with Prometheus (metrics), Tempo (traces), AI Observability API

### InfraLens

Zero-instrumentation Kubernetes observability built on eBPF:

- **Repository**: `github.com/Herenn/Infralens` (Apache 2.0, Go)
- **Features**: automatic detection of service-to-service communication, topology visualization, AI-powered documentation
- **Architecture**: eBPF agent + Go backend + React frontend
- **Status**: early-stage (1 star, 10 commits), but the eBPF-based observability concept is proven (Grafana Beyla, Cilium Hubble, Pixie)

### Ingero

GPU causal observability agent, described as the first of its kind:

- **Repository**: `github.com/ingero-io/ingero` (Apache 2.0)
- **Features**: eBPF tracing from Linux kernel events through CUDA API to Python source code
- **Overhead**: < 2 %, zero code changes, single binary
- **MCP server**: native Model Context Protocol support, so AI assistants can directly query GPU data
- **Use case**: diagnosis of GPU stalls, scheduler preemptions, CUDA memory spikes; causal chains instead of plain metrics
- **Version**: v0.19.0 (2026), active development

### GreptimeDB

Unified observability database with one backend for metrics, logs and traces:

- **Repository**: `github.com/GreptimeTeam/greptimedb` (Apache 2.0, Rust)
- **Architecture**: compute-storage disaggregation, object storage first (S3, GCS, Azure Blob), columnar storage
- **Querying**: SQL + PromQL in a single query, JOIN between metrics and logs possible
- **Drop-in replacement**: Prometheus (PromQL, remote write), Loki (Push API), Elasticsearch (bulk API), Jaeger (Query API)
- **Cost reduction**: up to 50× lower costs compared to traditional solutions
- **Roadmap 2026**: v1.0 GA (Q1 2026), v1.1–v1.3 (Vector Index, AI Functions, Auto Rollup, adaptive resource management)
- **GreptimeDB Enterprise**: enhanced security, HA, enterprise support

### Netdata

Open-source, real-time monitoring platform for entire infrastructure:

- **Repository**: `github.com/netdata/netdata` (GPLv3+, C; 79k★)
- **Features**: per-second metrics, ML-based anomaly detection, AI-powered troubleshooting, 800+ integrations
- **Zero configuration**: auto-discovery, pre-configured alerts, ready dashboards
- **Architecture**: distributed agent → Netdata Cloud (optional), data stays local
- **Energy efficiency**: according to a University of Amsterdam study, the most efficient tool for monitoring Docker systems
- **Netdata Cloud**: free tier (5 nodes), paid from $12/node/month
- **Licensing**: agent GPLv3+, dashboard NCUL1, cloud closed-source

## OpenStack Monitoring

OpenStack provides several services for telemetry and monitoring:

### Ceilometer (Telemetry)

- Metric collection (CPU, memory, network, storage) from compute, network and storage nodes
- Publishing to Gnocchi (time-series DB) or Panko (event storage)
- Notifications via
oslo.messaging (RabbitMQ), with pipeline transformations
- Alarming: Aodh (threshold-based alarms, metric combinations)

### Monasca

- More modern alternative to Ceilometer (primarily developed for telco use cases)
- Architecture: Monasca API → Log API → Transform → Threshold Engine → Notifier
- Backend: InfluxDB/Gnocchi, Kafka, Elasticsearch
- Supports alerting, notifications, graph dashboards

### Prometheus + OpenStack Exporter

- OpenStack-exporter for Prometheus (exports metrics from Ceilometer / API)
- Service discovery via Prometheus
- Grafana dashboards for visualization

### Masakari (VM High Availability)

- Detection and automatic recovery of VMs on hypervisor failure (host failure)
- Evacuation of instances to a healthy compute node
- Integration with Pacemaker for cluster management

## Sources

Links, books and standards: [sources/monitoring/sources.en.md](sources/monitoring/sources.en.md)

*Last revision: 2026-06-03*