knowledge-base/MONITORING.en.md
Stanislav Hubacek ef3c2f75b1 18.6.2026
2026-06-18 16:25:33 +02:00


📊 Monitoring and observability

OpenMetrics standard

OpenMetrics (CNCF sandbox) is the de-facto standard for metric exposition in cloud-native environments:

  • Supports text representation and Protocol Buffers
  • Foundation for Prometheus exposition format
  • Specifies metric types: counter, gauge, histogram, gaugehistogram, summary, stateset, info, unknown
  • _total suffix for cumulative values, _bucket for histograms
  • Metadata: HELP, TYPE, UNIT, (timestamp optional)

The standard is developed within OpenObservability.
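
The conventions above (TYPE/UNIT/HELP metadata, the _total suffix, the closing EOF marker) can be illustrated with a minimal sketch that renders one counter family in the OpenMetrics text format. The function names and the sample metric are hypothetical, and label-value escaping is deliberately left out.

```python
def render_counter(name, help_text, samples, unit=None):
    """Render one counter family as OpenMetrics text (no label escaping)."""
    lines = [f"# TYPE {name} counter"]
    if unit:
        # The spec expects the family name to end with the unit, e.g. *_seconds.
        lines.append(f"# UNIT {name} {unit}")
    lines.append(f"# HELP {name} {help_text}")
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        # Counter samples carry the _total suffix on top of the family name.
        lines.append(f"{name}_total{{{label_str}}} {value}")
    return "\n".join(lines)


def render_exposition(families):
    """Join rendered families and terminate the exposition with # EOF."""
    return "\n".join(families) + "\n# EOF\n"


if __name__ == "__main__":
    family = render_counter(
        "http_requests",
        "Total HTTP requests.",
        [({"method": "GET", "code": "200"}, 1027)],
    )
    print(render_exposition([family]), end="")
```

In practice a client library such as prometheus_client generates this output; the sketch only shows what ends up on the wire.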

| Tool | Description |
| --- | --- |
| Grafana Sigil | AI observability for LLM agents (OTel-native) |
| InfraLens | eBPF-based, zero-instrumentation network observability |
| Ingero | GPU causal observability (eBPF, CUDA tracing) |
| GreptimeDB | Unified observability DB; replaces Prometheus + Loki + ES |
| Netdata | AI-powered full-stack monitoring, 800+ integrations, edge ML |

Three pillars of observability

  1. Logs: timestamped event records, structured or unstructured (ERROR, WARN, INFO)
  2. Metrics: numerical data over time (latency, error rate, CPU utilization)
  3. Traces: request tracking across services (distributed tracing)

SLI / SLO / SLA

| Term | Meaning | Example |
| --- | --- | --- |
| SLI (Service Level Indicator) | Measured metric | Latency p99 = 250 ms |
| SLO (Service Level Objective) | Target value | 99.9 % of requests < 300 ms |
| SLA (Service Level Agreement) | Legal commitment | 99.95 % uptime |

Error budget

Error Budget = 100 % - SLO

  • If SLO is 99.9 %, error budget is 0.1 % of time
  • While error budget remains, the team can deploy new features
  • When exhausted — freeze on deploys, stability is priority
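
As a worked example of the formula above, the following sketch converts an SLO into minutes of allowed downtime. The 30-day window and the function names are assumptions chosen for illustration, not part of the original text.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Allowed downtime in minutes for a time-based SLO over the window."""
    budget_fraction = (100.0 - slo_percent) / 100.0  # Error Budget = 100 % - SLO
    return budget_fraction * window_days * 24 * 60


def budget_remaining(slo_percent, downtime_minutes, window_days=30):
    """Minutes of budget left; a negative value means the budget is exhausted."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


if __name__ == "__main__":
    # A 99.9 % SLO over 30 days leaves roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(99.9), 1))
    # After a 50-minute outage the budget is overspent, so deploys would freeze.
    print(round(budget_remaining(99.9, 50), 1))
```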

Pyramid of metrics — RED vs USE vs 4 Golden Signals

4 Golden Signals (Google SRE)

  1. Latency — request processing time (distinguish success vs error latency)
  2. Traffic — number of requests / throughput (RPS, QPS, throughput)
  3. Errors — explicit errors (5xx, 4xx) and implicit (success with wrong result)
  4. Saturation — how "full" the service is (CPU, memory, queue depth, connection pool)

USE (for infrastructure)

  • Utilization — how busy the resource is (% time active)
  • Saturation — how much is waiting in queue (run queue, I/O wait)
  • Errors — errors (dropped packets, disk errors, OOM)

RED (for services)

  • Rate — requests per second
  • Errors — number of erroneous requests
  • Duration — latency (distribution, percentiles)

| Methodology | Focus | Typical metrics |
| --- | --- | --- |
| 4 Golden Signals | Services + infrastructure | Latency, RPS, errors, saturation |
| USE | Infrastructure | CPU util, I/O saturation, disk errors |
| RED | Microservices | RPS, error rate, p50/p95/p99 latency |
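
To make the RED metrics concrete, here is a small sketch that derives rate, error ratio and latency percentiles from raw request records. The record shape (status code, duration in seconds), the choice of 5xx as the error criterion and the nearest-rank percentile are assumptions for the example; a real system would compute these from histograms in the metrics backend.

```python
import math


def red_summary(requests, window_seconds):
    """requests: non-empty list of (status_code, duration_seconds) in the window."""
    durations = sorted(duration for _, duration in requests)
    n = len(requests)
    errors = sum(1 for status, _ in requests if status >= 500)

    def percentile(p):
        # Nearest-rank percentile: smallest value covering p percent of samples.
        return durations[math.ceil(p * n / 100) - 1]

    return {
        "rate": n / window_seconds,   # Rate: requests per second
        "error_ratio": errors / n,    # Errors: share of failed requests
        "p50": percentile(50),        # Duration: latency distribution
        "p95": percentile(95),
        "p99": percentile(99),
    }


if __name__ == "__main__":
    sample = [(200, 0.1), (200, 0.2), (200, 0.3), (500, 1.5)]
    print(red_summary(sample, window_seconds=2))
```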

PromQL examples

| Expression | Description |
| --- | --- |
| `rate(http_requests_total[5m])` | Requests per second (average over 5 min) |
| `increase(http_requests_total[1h])` | Total increase over 1 hour |
| `sum by (status) (rate(http_requests_total[5m]))` | Request rate aggregated by status code |
| `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))` | p99 latency |
| `avg_over_time(cpu_usage[1h])` | Average CPU utilization over an hour |
| `topk(5, sum(rate(http_requests_total[5m])) by (service))` | Top 5 services by RPS |
| `max_over_time(memory_usage[24h])` | Max memory usage over 24 h |
| `rate(node_network_receive_drop_total[5m]) > 0` | Network interfaces dropping inbound packets |
| `(1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))` | CPU utilization (1 - idle) |
| `rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])` | Average latency (rate() rather than delta(), because _sum and _count are counters) |
| `absent(metric)` | Returns 1 when the metric is missing; used to alert on absent series |
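
The histogram_quantile expression estimates a percentile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below is a simplified Python rendering of that idea, assuming classic histograms with a final +Inf bucket and 0 < q <= 1; it is not the Prometheus implementation and ignores native histograms and other edge cases.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound and
    ending with (math.inf, total_count).
    """
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[idx]
    if math.isinf(upper):
        # Rank falls into the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    # The lowest bucket is assumed to start at 0.
    lower, prev_count = (0.0, 0) if idx == 0 else buckets[idx - 1]
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


if __name__ == "__main__":
    buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
    print(histogram_quantile(0.7, buckets))
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest.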

Recording rules

Pre-aggregation of frequently used PromQL queries to reduce query load.

When to use

  • Complex queries used across multiple dashboards
  • Queries over raw data with high cardinality
  • Frequently queried aggregations (e.g., p99 latency over last month)

Example

groups:
  - name: service_rules
    interval: 1m
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: instance:cpu:utilization
        expr: (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance))
      - record: service:http_latency:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

  • record: new metric name (convention: level:metric:operations)
  • interval: how often the rules in the group are evaluated (typically 1-5 min)

Tools

Metrics

| Tool | Description |
| --- | --- |
| Prometheus | Pull-based, time-series DB, powerful query language (PromQL) |
| Grafana | Visualization, dashboards, alerting |
| Zabbix | Enterprise monitoring, agent + agentless (SNMP/IPMI/JMX), auto-discovery, trigger-based alerting |
| Datadog | SaaS, APM, logs, metrics in one |
| New Relic | APM, browser monitoring |
| CloudWatch | AWS native |
| Azure Monitor | Azure native |
| Google Cloud Ops | GCP native |

Logging

| Tool | Description |
| --- | --- |
| ELK Stack | Elasticsearch, Logstash, Kibana |
| Loki | Grafana Loki: lightweight, Prometheus-like |
| Splunk | Enterprise log management |
| Fluentd / Fluent Bit | Log collector and forwarder |
| Vector | High-performance log/metric collector |

Tracing

| Tool | Description |
| --- | --- |
| Jaeger | Open-source distributed tracing |
| Zipkin | Open-source distributed tracing |
| OpenTelemetry | Standard for instrumentation (logs, metrics, traces) |
| Datadog APM | SaaS tracing |
| AWS X-Ray | AWS tracing |

OpenTelemetry detail

Span attributes

resource:
  attributes:
    - service.name: "payment-service"
    - service.version: "1.2.3"
    - deployment.environment: "production"
scope:
  name: "io.opentelemetry.payment"
spans:
  - name: "processPayment"
    kind: SPAN_KIND_INTERNAL
    attributes:
      - payment.method: "credit_card"
      - payment.amount: 2499
      - payment.currency: "CZK"
    events:
      - name: "authorization.complete"
        timestamp: 1717428000000000000

Context propagation (W3C TraceContext)

  • traceparent — header carrying trace-id, span-id, trace flags
    • Format: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
    • Version (00) | Trace-ID (32 hex) | Span-ID (16 hex) | TraceFlags (01 = sampled)
  • tracestate — vendor-specific data, compatible cross-provider
  • Propagation happens via HTTP headers, gRPC metadata, message queue properties
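
The traceparent layout described above can be checked with a short parser. This is a sketch for the version 00 format only; the class and function names are hypothetical, and a production service would normally rely on the OpenTelemetry SDK propagators instead.

```python
import re
from typing import NamedTuple, Optional

# version - trace-id (32 hex) - parent/span-id (16 hex) - trace flags
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid according to W3C Trace Context.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


if __name__ == "__main__":
    print(parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"))
```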

Sampling

| Type | Description | Use case |
| --- | --- | --- |
| Head-based | Sampling decision at trace start (based on ID) | Simple, deterministic |
| Tail-based | Decision after trace completion (based on result, latency) | Better sampling, more complex |

  • Tail-based sampling: often used for critical traces (5xx, p99+, slow traces)
  • Tools: Grafana Tempo (tail-based), Jaeger (head-based), OTel Collector (head + tail)
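
The difference between the two strategies can be sketched in a few lines. The head-based function decides from the trace ID alone, so every service reaches the same verdict without coordination; the tail-based function needs the finished spans. Using the lower 64 bits of the trace ID, the span dictionary shape and the 1000 ms threshold are assumptions made for this example, not the behavior of any particular SDK or collector.

```python
def head_sample(trace_id: str, ratio: float) -> bool:
    """Deterministic head-based decision from the lower 64 bits of the trace ID."""
    threshold = int(ratio * 2**64)
    return int(trace_id[-16:], 16) < threshold


def tail_sample(spans, latency_threshold_ms=1000) -> bool:
    """Tail-based decision: keep the trace if any span failed or was slow."""
    return any(
        span.get("error", False) or span["duration_ms"] > latency_threshold_ms
        for span in spans
    )


if __name__ == "__main__":
    trace_id = "0af7651916cd43dd8448eb211c80319c"
    print(head_sample(trace_id, 0.6))
    print(tail_sample([{"duration_ms": 40}, {"duration_ms": 1800}]))
```

Head-based sampling keeps overhead low but may drop the rare failing trace; tail-based sampling keeps it at the cost of buffering complete traces before deciding.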

Alerting

Principles

  • Alert on symptom, not cause — "500 errors" instead of "high CPU"
  • Reduce noise — flapping alerts, alert fatigue
  • Runbook for every alert — what to do when alert fires
  • Alert severity — P0 (critical), P1 (high), P2 (medium), P3 (low)

Alertmanager (Prometheus)

route:
  receiver: "team-pager"
  group_by: ["alertname", "cluster"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: "team-pager"
      repeat_interval: 1h
    - matchers:
        - severity="warning"
      receiver: "team-slack"

receivers:
  - name: "team-pager"
    pagerduty_configs:
      - routing_key: "<KEY>"
        severity: "{{ .CommonLabels.severity }}"
  - name: "team-slack"
    slack_configs:
      - channel: "#alerts"
        title: "{{ .GroupLabels.alertname }}"

Concepts:

  • Grouping — grouping alerts by labels (noise reduction, e.g., all down instances in a cluster)
  • Inhibition — suppression of less severe alerts when a more severe one exists (e.g., nodedown inhibits pod alerts)
  • Silencing — temporary alert suppression (matching labels + duration)
  • Routing tree — hierarchical routing by label match (severity, service, team)
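
Grouping and inhibition are easier to reason about with a toy model. The sketch below treats each alert as a plain dictionary of labels and mimics the two concepts; the function names and the equality-only matching are simplifications for illustration, not Alertmanager's actual code.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels (one notification per bucket)."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


def _matches(labels, matchers):
    return all(labels.get(name) == value for name, value in matchers.items())


def is_inhibited(target, firing, source_match, target_match, equal):
    """True if a firing source alert suppresses the target alert."""
    if not _matches(target, target_match):
        return False
    return any(
        source is not target
        and _matches(source, source_match)
        and all(source.get(name) == target.get(name) for name in equal)
        for source in firing
    )


if __name__ == "__main__":
    firing = [
        {"alertname": "NodeDown", "severity": "critical", "node": "n1", "cluster": "eu"},
        {"alertname": "PodCrash", "severity": "warning", "node": "n1", "cluster": "eu"},
        {"alertname": "PodCrash", "severity": "warning", "node": "n2", "cluster": "eu"},
    ]
    print(list(group_alerts(firing, ["alertname", "cluster"])))
    print(is_inhibited(firing[1], firing, {"alertname": "NodeDown"}, {"severity": "warning"}, ["node"]))
```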

ESM (Event / Incident Management)

  • PagerDuty, Opsgenie, OnCall (Grafana)
  • Escalation policies
  • On-call rotations

Structured logging

{
  "timestamp": "2026-06-03T10:30:00Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123",
  "user_id": "u456",
  "message": "Payment gateway timeout",
  "duration_ms": 1200,
  "error": {
    "type": "TimeoutError",
    "message": "Gateway did not respond in 1000ms"
  }
}

Required fields of structured log

| Field | Description | Example |
| --- | --- | --- |
| timestamp | ISO 8601 / RFC 3339 | 2026-06-03T10:30:00Z |
| level | Log level (RFC 5424) | ERROR, WARN, INFO, DEBUG |
| message | Human-readable message | Payment processed |
| service | Service name | payment-service |
| trace_id | Correlation across services | abc123def456 |
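
One way to emit these required fields from a Python service is a custom formatter for the standard logging module. This is a minimal sketch: the JsonFormatter class name is hypothetical, and passing trace_id through the extra argument is an assumption about how the application supplies the correlation ID.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format log records as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat().replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            # trace_id is expected via logger.error(..., extra={"trace_id": ...}).
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter("payment-service"))
    logger = logging.getLogger("payment")
    logger.addHandler(handler)
    logger.error("Payment gateway timeout", extra={"trace_id": "abc123"})
```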

RFC 5424 log levels

| Number | Level | Usage |
| --- | --- | --- |
| 0 | EMERG | System unusable |
| 1 | ALERT | Immediate action required |
| 2 | CRIT | Critical error |
| 3 | ERROR | Error (non-critical) |
| 4 | WARN | Warning |
| 5 | NOTICE | Normal but significant event |
| 6 | INFO | Informational message |
| 7 | DEBUG | Debugging (disabled in production) |

Correlation ID (traceparent)

  • Generated at system entry (API gateway, frontend, message consumer)
  • Propagated in HTTP header X-Correlation-ID / traceparent
  • Enables linking logs across microservices (→ Grafana Explore, Kibana Discover)
  • Implementation: middleware in app, service mesh (Envoy), API gateway

Distributed tracing detail

Span kinds

| Kind | Description | Example |
| --- | --- | --- |
| CLIENT | Calling downstream service (outbound) | HTTP client calling API |
| SERVER | Processing incoming request | HTTP handler |
| INTERNAL | Local operation within service | Computation, transformation |
| PRODUCER | Sending message to queue | Kafka producer |
| CONSUMER | Receiving message from queue | Kafka consumer |

Trace context chain

Trace: abc123
├── Span: /checkout (SERVER, root)
│   ├── Span: validateCart (INTERNAL)
│   ├── Span: POST /orders (CLIENT → payment-service)
│   │   └── Span: /processPayment (SERVER)
│   │       ├── Span: authorizeCard (INTERNAL)
│   │       └── Span: chargeCard (CLIENT → bank-gateway)
│   │           └── Span: /charge (SERVER, external)
│   └── Span: sendConfirmation (PRODUCER → kafka)
│       └── Span: consumeConfirmation (CONSUMER → email-service)

  • W3C TraceContext: standardized cross-service tracing
  • Baggage: key-value context (tenant, user role) propagated alongside the trace context across service boundaries

Grafana

Provisioning dashboards as code

apiVersion: 1
providers:
  - name: "default"
    orgId: 1
    folder: "Services"
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards

Dashboard JSON files are kept in git; the CI/CD pipeline deploys them to the provisioning path and Grafana imports them automatically.

Variables

  • Query variable: dynamic values (e.g., list of service names from PromQL: label_values(up, service))
  • Interval variable: $__auto_interval, $__interval for variable time range
  • Custom variable: manual list of values (env: prod, staging, dev)
  • Chained variable: dependent variable (select namespace → show pods in namespace)

Annotations

  • Drawing events in graphs (deploys, incidents, config changes)
  • Sources: Prometheus alerts, Loki logs, GitHub Actions, custom API
  • Use case: "Deploy at 14:30 → spike in latency at 14:31 → correlation"

On-call best practices

Escalation policies

Level 1: Primary on-call (response within 5 min)
    └── timeout 15 min
Level 2: Secondary / senior engineer (response within 15 min)
    └── timeout 15 min
Level 3: Engineering manager / incident commander
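
The escalation chain above amounts to a lookup from "minutes without acknowledgement" to the level that should currently be paged. The sketch below encodes the two 15-minute timeouts from the diagram; the data structure and function name are hypothetical.

```python
# (level name, minutes after the alert fired at which this level is paged)
ESCALATION_LEVELS = [
    ("primary on-call", 0),
    ("secondary / senior engineer", 15),                # after the 15 min timeout of level 1
    ("engineering manager / incident commander", 30),   # after a further 15 min timeout
]


def escalation_level(minutes_unacknowledged):
    """Return the highest level that has been paged so far."""
    current = ESCALATION_LEVELS[0][0]
    for name, starts_at in ESCALATION_LEVELS:
        if minutes_unacknowledged >= starts_at:
            current = name
    return current


if __name__ == "__main__":
    for minutes in (0, 14, 15, 45):
        print(minutes, escalation_level(minutes))
```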

Incident severity matrix

| Severity | Description | Response | Communication |
| --- | --- | --- | --- |
| P0 (Critical) | Service completely unavailable, data loss, security breach | Immediate, 24/7 | Status page + Stakeholder update |
| P1 (High) | Major functionality degraded, part of users affected | Within 15 min | Slack channel + Team lead |
| P2 (Medium) | Non-critical feature broken, workaround exists | Within 1 h | Slack channel |
| P3 (Low) | Cosmetic issue, no user impact | Next business day | Jira ticket |

Postmortem

  • Blameless — goal is to learn, not blame
  • Structure: Timeline, detection, root cause, resolution, action items
  • SRE principle: every incident → postmortem → systemic improvement
  • Tools: Jira, Incident.io, PagerDuty postmortem, Google Docs

Logging patterns

Best practices

  • Dashboard for each level — executive, service, troubleshooting
  • Synthetic monitoring — heartbeat checks, browser tests (Playwright, Cypress)
  • APM — Application Performance Monitoring (database queries, external calls)
  • Anomaly detection — ML-based outlier detection
  • Retention policy — raw data short term, aggregations long term
  • Unified log format — JSON, structured data

Classic books

| Book | Authors | ISBN | Key topics |
| --- | --- | --- | --- |
| Site Reliability Engineering | Beyer, Jones, Petoff, Murphy | 978-1491929124 | How Google runs production systems: SRE principles, error budgets, toil, SLI/SLO |
| The Site Reliability Workbook | Beyer, Murphy, Rensin, Kawahara, Thorne | 978-1492029502 | Practical companion to SRE: case studies from Evernote, Home Depot, NY Times; SLO implementation, monitoring, on-call |
| Observability Engineering | Majors, Fong-Jones, Miranda | 978-1492076445 | First comprehensive book on observability: structured events, iterative hypothesis verification, core analysis loop; 2nd edition in 2026 (32 new chapters on AI, cost governance) |

Cloud and monitoring

| Book | Author | ISBN/Year | Topics |
| --- | --- | --- | --- |
| Cloud Observability in Action | Michael Hausenblas | Manning, 2023 | Practical guide to observability in cloud-native environments: signal types (logs, metrics, traces, profiles), OTel Collector, SLOs, signal correlation, developer observability; open-source tools |
| Mastering Prometheus | William Hegedus | 978-1-80512-566-2 | Advanced Prometheus techniques: TSDB internals, custom service discovery, cardinality, remote storage (VictoriaMetrics, Mimir), SLO-based alerting; author is SRE manager at Akamai and Prometheus/Thanos contributor |
| Observability with Grafana | Chapman, Holmes | 978-1-80324-964-3 | Complete guide to LGTM stack (Loki, Grafana, Tempo, Mimir): OTel instrumentation, LogQL/PromQL/TraceQL, AI/ML alerting, real user monitoring with Faro, Pyroscope profiling, k6 load testing |
| Hands-On Monitoring and Alerting with Prometheus | Muhammad Badawy | 978-9349887565 | Practical Prometheus guide: installation, configuration, service discovery, labeling, PromQL, Alertmanager, monitoring Linux, Windows, Docker, databases |

AI and observability

| Book | Authors | ISBN/Year | Topics |
| --- | --- | --- | --- |
| Observability in the AI-Native Era | Lipsig, Grabner, Rati | 978-1-80638-959-9 | Connecting observability with AIOps: ML-based anomaly detection, root-cause analysis, self-healing systems, OTel + Prometheus + Grafana + Dynatrace/Datadog, compliance |
| Open Source Observability | Corless, Pawar | O'Reilly, 2025 | Report on disaggregated, modular observability stacks: flexibility, cost efficiency, data autonomy, blueprint for custom solutions from open-source components |

Detailed tool overview

Extended information on tools from the table above:

Grafana Sigil

AI observability product from Grafana Labs. OpenTelemetry-native SDK for instrumenting LLM agents:

  • Repository: github.com/grafana/sigil-sdk (Go SDK) + sigil-app (Grafana plugin)
  • Features: tracking conversations, generation, tool usage, cost tracking, quality evaluation
  • Growing problem: 500M+ conversations, 5M+ agents in production (GrafanaCON 2026)
  • Integration: automatic connection with Prometheus (metrics), Tempo (traces), AI Observability API

InfraLens

Zero-instrumentation Kubernetes observability built on eBPF:

  • Repository: github.com/Herenn/Infralens (Apache 2.0, Go)
  • Features: automatic detection of service-to-service communication, topology visualization, AI-powered documentation
  • Architecture: eBPF agent + Go backend + React frontend
  • Status: early-stage (1 star, 10 commits), but eBPF-based observability concept is proven (Grafana Beyla, Cilium Hubble, Pixie)

Ingero

GPU causal observability agent — first of its kind:

  • Repository: github.com/ingero-io/ingero (Apache 2.0)
  • Features: eBPF tracing from Linux kernel events through CUDA API to Python source code
  • Overhead: < 2 %, zero code changes, single binary
  • MCP server: native Model Context Protocol support — AI assistants can directly query GPU data
  • Use case: diagnosis of GPU stalls, scheduler preemptions, CUDA memory spikes — causal chains instead of plain metrics
  • Version: v0.19.0 (2026), active development

GreptimeDB

Unified observability database — one backend for metrics, logs and traces:

  • Repository: github.com/GreptimeTeam/greptimedb (Apache 2.0, Rust)
  • Architecture: compute-storage disaggregation, object storage first (S3, GCS, Azure Blob), columnar storage
  • Querying: SQL + PromQL in a single query, JOIN between metrics and logs possible
  • Drop-in replacement: Prometheus (PromQL, remote write), Loki (Push API), Elasticsearch (bulk API), Jaeger (Query API)
  • Cost reduction: up to 50× lower costs compared to traditional solutions
  • Roadmap 2026: v1.0 GA (Q1 2026), v1.1 to v1.3 (Vector Index, AI Functions, Auto Rollup, adaptive resource management)
  • GreptimeDB Enterprise: enhanced security, HA, enterprise support

Netdata

Open-source, real-time monitoring platform for entire infrastructure:

  • Repository: github.com/netdata/netdata (GPLv3+, C; 79k★)
  • Features: per-second metrics, ML-based anomaly detection, AI-powered troubleshooting, 800+ integrations
  • Zero configuration: auto-discovery, pre-configured alerts, ready dashboards
  • Architecture: distributed agent → Netdata Cloud (optional), data stays local
  • Energy efficiency: according to University of Amsterdam study, the most efficient tool for monitoring Docker systems
  • Netdata Cloud: free tier (5 nodes), paid from $12/node/month
  • Licensing: agent GPLv3+, dashboard NCUL1, cloud closed-source

OpenStack Monitoring

OpenStack provides several services for telemetry and monitoring:

Ceilometer (Telemetry)

  • Metric collection (CPU, memory, network, storage) from compute, network and storage nodes
  • Publishing to Gnocchi (time-series DB) or Panko (event storage)
  • Notifications via oslo.messaging (RabbitMQ) — pipeline transformations
  • Alarming: Aodh — threshold-based alarms, metric combinations

Monasca

  • More modern alternative to Ceilometer (primarily developed for telco use cases)
  • Architecture: Monasca API → Log API → Transform → Threshold Engine → Notifier
  • Backend: InfluxDB/Gnocchi, Kafka, Elasticsearch
  • Supports alerting, notifications, graph dashboards

Prometheus + OpenStack Exporter

  • OpenStack-exporter for Prometheus (exports metrics from Ceilometer / API)
  • Service discovery via Prometheus
  • Grafana dashboards for visualization

Masakari (VM High Availability)

  • Detection and automatic recovery of VMs on hypervisor failure (host failure)
  • Evacuation of instances to healthy compute node
  • Integration with Pacemaker for cluster management

Sources

Links, books and standards: sources/monitoring/sources.en.md

Last revision: 2026-06-03