📊 Monitoring and observability
OpenMetrics standard
OpenMetrics (CNCF sandbox) is the de-facto standard for metric exposition in cloud-native environments:
- Supports text representation and Protocol Buffers
- Builds on and extends the Prometheus text exposition format
- Specifies metric types: counter, gauge, histogram, summary, gaugehistogram, stateset, info, unknown
- `_total` suffix for cumulative counter values, `_bucket` for histogram buckets
- Metadata: HELP, TYPE, UNIT (timestamps are optional)
The standard is developed within OpenObservability.
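The exposition rules above can be illustrated with a small Python sketch that renders one counter family as OpenMetrics text. The helper names `render_counter` and `render_exposition` and the sample metric are hypothetical, label-value escaping is left out, and a real service would normally rely on a client library instead of building the text by hand.

```python
def render_counter(family, help_text, samples, unit=""):
    """Render one counter family as OpenMetrics text lines.

    family  -- metric family name WITHOUT the _total suffix
    samples -- list of (labels_dict, value) pairs
    """
    lines = [f"# TYPE {family} counter"]
    if unit:
        lines.append(f"# UNIT {family} {unit}")
    lines.append(f"# HELP {family} {help_text}")
    for labels, value in samples:
        # Assumption: label values need no escaping in this sketch.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        braces = f"{{{label_str}}}" if label_str else ""
        # Counter samples carry the _total suffix.
        lines.append(f"{family}_total{braces} {value}")
    return "\n".join(lines)


def render_exposition(families):
    # An OpenMetrics exposition is terminated by the "# EOF" marker.
    return "\n".join(families) + "\n# EOF\n"


text = render_exposition([
    render_counter(
        "http_requests",
        "Total HTTP requests",
        [({"method": "GET", "status": "200"}, 1027)],
    )
])
print(text)
```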
New tools and trends (2024–2026)
| Tool | Description |
|---|---|
| Grafana Sigil | AI observability for LLM agents (OTel-native) |
| InfraLens | eBPF-based, zero-instrumentation network observability |
| Ingero | GPU causal observability (eBPF, CUDA tracing) |
| GreptimeDB | Unified observability DB — replaces Prometheus + Loki + ES |
| Netdata | AI-powered full-stack monitoring, 800+ integrations, edge ML |
Three pillars of observability
- Logs: timestamped event records, structured or unstructured (ERROR, WARN, INFO)
- Metrics: numerical data over time (latency, error rate, CPU utilization)
- Traces: request tracking across services (distributed tracing)
SLI / SLO / SLA
| Term | Meaning | Example |
|---|---|---|
| SLI (Service Level Indicator) | Measured metric | Latency p99 = 250ms |
| SLO (Service Level Objective) | Target value | 99.9 % of requests < 300ms |
| SLA (Service Level Agreement) | Legal commitment | 99.95 % uptime |
Error budget
Error Budget = 100 % - SLO
- If SLO is 99.9 %, error budget is 0.1 % of time
- While error budget remains, the team can deploy new features
- When exhausted — freeze on deploys, stability is priority
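A worked example of the error budget arithmetic, as a minimal Python sketch. It assumes a time-based SLO and a 30-day window; the function names and the window length are illustrative choices, not part of any standard.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Allowed downtime in minutes for a time-based SLO over the window."""
    budget_fraction = (100.0 - slo_percent) / 100.0
    return budget_fraction * window_days * 24 * 60


def budget_remaining(slo_percent, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative when overspent)."""
    total = error_budget_minutes(slo_percent, window_days)
    return (total - downtime_minutes) / total


print(round(error_budget_minutes(99.9), 1))    # 43.2 minutes per 30 days
print(round(budget_remaining(99.9, 21.6), 2))  # 0.5, half the budget is left
```

A negative result from `budget_remaining` corresponds to the "exhausted" case, where deploys are frozen.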
Pyramid of metrics — RED vs USE vs 4 Golden Signals
4 Golden Signals (Google SRE)
- Latency — request processing time (distinguish success vs error latency)
- Traffic — number of requests / throughput (RPS, QPS, throughput)
- Errors — explicit errors (5xx, 4xx) and implicit (success with wrong result)
- Saturation — how "full" the service is (CPU, memory, queue depth, connection pool)
USE (for infrastructure)
- Utilization — how busy the resource is (% time active)
- Saturation — how much is waiting in queue (run queue, I/O wait)
- Errors — errors (dropped packets, disk errors, OOM)
RED (for services)
- Rate — requests per second
- Errors — number of erroneous requests
- Duration — latency (distribution, percentiles)
| Methodology | Focus | Typical metrics |
|---|---|---|
| 4 Golden Signals | Services + infrastructure | Latency, RPS, errors, saturation |
| USE | Infrastructure | CPU util, I/O saturation, disk errors |
| RED | Microservices | RPS, error rate, p50/p95/p99 latency |
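To make the RED signals concrete, the following Python sketch derives rate, error ratio and latency percentiles from raw request records. The record shape (status code, duration), the nearest-rank percentile method and the choice to count only 5xx as errors are assumptions made for illustration.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted list (p in 0..100)."""
    if not sorted_values:
        raise ValueError("no values")
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def red_metrics(requests, window_seconds):
    """requests: list of (status_code, duration_seconds) tuples."""
    durations = sorted(duration for _, duration in requests)
    errors = sum(1 for status, _ in requests if status >= 500)
    return {
        "rate_rps": len(requests) / window_seconds,   # Rate
        "error_ratio": errors / len(requests),        # Errors
        "p50_s": percentile(durations, 50),           # Duration
        "p99_s": percentile(durations, 99),
    }


sample = [(200, 0.12), (200, 0.08), (500, 1.40), (200, 0.20), (404, 0.05)]
metrics = red_metrics(sample, window_seconds=10)
print(metrics)
```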
PromQL examples
| Expression | Description |
|---|---|
| `rate(http_requests_total[5m])` | Per-second request rate, averaged over 5 min |
| `increase(http_requests_total[1h])` | Total increase over 1 hour |
| `sum by (status) (rate(http_requests_total[5m]))` | Request rate aggregated by status code |
| `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))` | p99 latency |
| `avg_over_time(cpu_usage[1h])` | Average CPU utilization over an hour |
| `topk(5, sum(rate(http_requests_total[5m])) by (service))` | Top 5 services by RPS |
| `max_over_time(memory_usage[24h])` | Max memory usage over 24h |
| `rate(node_network_receive_drop_total[5m]) > 0` | Network interfaces dropping inbound packets |
| `(1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))` | CPU utilization (1 - idle) |
| `rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])` | Average latency (counters need `rate`, not `delta`) |
| `absent(metric)` | Returns 1 when the metric is missing (use as an alert condition) |
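The p99 row relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation. The Python sketch below mirrors that idea in simplified form; it is not the Prometheus implementation, and the bucket values are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative histogram buckets.

    buckets -- list of (upper_bound, cumulative_count) pairs, with
               (math.inf, total_count) as the last bucket.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Quantile falls into +Inf (or an empty bucket):
                # report the highest bound seen so far.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))
```

The interpolation step explains why bucket boundaries matter: the estimate can never be more precise than the width of the bucket the quantile falls into.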
Recording rules
Pre-aggregation of frequently used PromQL queries to reduce query load.
When to use
- Complex queries used across multiple dashboards
- Queries over raw data with high cardinality
- Frequently queried aggregations (e.g., p99 latency over last month)
Example
groups:
  - name: service_rules
    interval: 1m
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: instance:cpu:utilization
        expr: (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance))
      - record: service:http_latency:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
- record: new metric name (convention: `level:metric:operations`)
- interval: how often the rule group is evaluated (typically 1-5 min)
Metrics — tools
Metrics
| Tool | Description |
|---|---|
| Prometheus | Pull-based, time-series DB, powerful query language (PromQL) |
| Grafana | Visualization, dashboards, alerting |
| Zabbix | Enterprise monitoring, agent + agentless (SNMP/IPMI/JMX), auto-discovery, trigger-based alerting |
| Datadog | SaaS, APM, logs, metrics in one |
| New Relic | APM, browser monitoring |
| CloudWatch | AWS native |
| Azure Monitor | Azure native |
| Google Cloud Ops | GCP native |
Logging
| Tool | Description |
|---|---|
| ELK Stack | Elasticsearch, Logstash, Kibana |
| Loki | Grafana Loki — lightweight, Prometheus-like |
| Splunk | Enterprise log management |
| Fluentd / Fluent Bit | Log collector and forwarder |
| Vector | High-performance log/metric collector |
Tracing
| Tool | Description |
|---|---|
| Jaeger | Open-source distributed tracing |
| Zipkin | Open-source distributed tracing |
| OpenTelemetry | Standard for instrumentation (logs, metrics, traces) |
| Datadog APM | SaaS tracing |
| AWS X-Ray | AWS tracing |
OpenTelemetry detail
Span attributes
resource:
  attributes:
    - service.name: "payment-service"
    - service.version: "1.2.3"
    - deployment.environment: "production"
scope:
  name: "io.opentelemetry.payment"
spans:
  - name: "processPayment"
    kind: SPAN_KIND_INTERNAL
    attributes:
      - payment.method: "credit_card"
      - payment.amount: 2499
      - payment.currency: "CZK"
    events:
      - name: "authorization.complete"
        timestamp: 1717428000000000000
Context propagation (W3C TraceContext)
- `traceparent`: header carrying trace-id, span-id, trace flags
  - Format: `00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01`
  - Version (00) | Trace-ID (32 hex) | Span-ID (16 hex) | TraceFlags (01 = sampled)
- `tracestate`: vendor-specific data, compatible cross-provider
- Propagation happens via HTTP headers, gRPC metadata, message queue properties
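The `traceparent` layout can be checked with a few lines of Python. This is a minimal sketch for version `00` headers only; the `TraceParent` type and `parse_traceparent` function are hypothetical names, and production code would normally use the propagators shipped with an OpenTelemetry SDK.

```python
import re
from typing import NamedTuple, Optional

# version - trace-id (32 hex) - span-id (16 hex) - flags, all lowercase hex
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None when it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff is forbidden; all-zero IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    # The lowest bit of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, span_id, bool(int(flags, 16) & 0x01))


parsed = parse_traceparent(
    "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
)
print(parsed)
```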
Sampling
| Type | Description | Use case |
|---|---|---|
| Head-based | Sampling decision at trace start (based on ID) | Simple, deterministic |
| Tail-based | Decision after trace completion (based on result, latency) | Better sampling, more complex |
- Tail-based sampling: often used for critical traces (5xx, p99+, slow traces)
- Tools: Grafana Tempo (tail-based), Jaeger (head-based), OTel Collector (head + tail)
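The difference between the two sampling types can be sketched in Python. Both functions are illustrative assumptions: the head-based variant compares the lower 64 bits of the trace ID against a ratio threshold, and the tail-based variant uses a hypothetical span shape (`status_code`, `start_ms`, `end_ms`) and a made-up 500 ms latency threshold.

```python
def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Head-based: decide at trace start, deterministically from the ID."""
    threshold = int(ratio * 2**64)
    # Every service that sees the same trace ID reaches the same decision.
    return int(trace_id_hex[-16:], 16) < threshold


def tail_sample(spans, latency_threshold_ms=500.0) -> bool:
    """Tail-based: decide after completion, keep errors and slow traces."""
    has_error = any(span["status_code"] >= 500 for span in spans)
    duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    return has_error or duration > latency_threshold_ms


trace_id = "0af7651916cd43dd8448eb211c80319c"
fast_ok = [{"status_code": 200, "start_ms": 0, "end_ms": 120}]
failed = [{"status_code": 503, "start_ms": 0, "end_ms": 80}]
print(head_sample(trace_id, 0.5), tail_sample(fast_ok), tail_sample(failed))
```

The tail-based function needs the whole trace in memory before it can decide, which is the source of the extra complexity noted in the table.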
Alerting
Principles
- Alert on symptom, not cause — "500 errors" instead of "high CPU"
- Reduce noise — flapping alerts, alert fatigue
- Runbook for every alert — what to do when alert fires
- Alert severity — P0 (critical), P1 (high), P2 (medium), P3 (low)
Alertmanager (Prometheus)
route:
  receiver: "team-pager"
  group_by: ["alertname", "cluster"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: "team-pager"
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: "team-slack"
receivers:
  - name: "team-pager"
    pagerduty_configs:
      - routing_key: "<KEY>"
        severity: "{{ .CommonLabels.severity }}"
  - name: "team-slack"
    slack_configs:
      - channel: "#alerts"
        title: "{{ .GroupLabels.alertname }}"
Concepts:
- Grouping — grouping alerts by labels (noise reduction, e.g., all down instances in a cluster)
- Inhibition — suppression of less severe alerts when a more severe one exists (e.g., nodedown inhibits pod alerts)
- Silencing — temporary alert suppression (matching labels + duration)
- Routing tree — hierarchical routing by label match (severity, service, team)
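Routing and grouping can be pictured with a short Python sketch that mirrors the configuration above: child routes are tried in order, the root receiver is the fallback, and alerts are then bucketed by the `group_by` labels. This is a simplified model for illustration, not how Alertmanager is implemented; the alert label sets are invented.

```python
from collections import defaultdict

# Child routes, evaluated in order; the first match wins.
ROUTES = [
    ({"severity": "critical"}, "team-pager"),
    ({"severity": "warning"}, "team-slack"),
]
DEFAULT_RECEIVER = "team-pager"
GROUP_BY = ("alertname", "cluster")


def route(labels):
    """Pick the receiver for one alert's label set."""
    for matchers, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER


def group_alerts(alerts):
    """Bucket alerts per receiver and group_by label values."""
    groups = defaultdict(list)
    for labels in alerts:
        key = (route(labels),) + tuple(labels.get(name, "") for name in GROUP_BY)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "InstanceDown", "cluster": "eu-1", "severity": "critical", "instance": "a"},
    {"alertname": "InstanceDown", "cluster": "eu-1", "severity": "critical", "instance": "b"},
    {"alertname": "HighLatency", "cluster": "eu-1", "severity": "warning"},
]
groups = group_alerts(alerts)
print(list(groups))
```

Two down instances in the same cluster end up in one group, so the pager receives a single notification instead of two.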
ESM (Event / Incident Management)
- PagerDuty, Opsgenie, OnCall (Grafana)
- Escalation policies
- On-call rotations
Structured logging
{
  "timestamp": "2026-06-03T10:30:00Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123",
  "user_id": "u456",
  "message": "Payment gateway timeout",
  "duration_ms": 1200,
  "error": {
    "type": "TimeoutError",
    "message": "Gateway did not respond in 1000ms"
  }
}
Required fields of structured log
| Field | Description | Example |
|---|---|---|
| `timestamp` | ISO 8601 / RFC 3339 | 2026-06-03T10:30:00Z |
| `level` | Log level (RFC 5424) | ERROR, WARN, INFO, DEBUG |
| `message` | Human-readable message | Payment processed |
| `service` | Service name | payment-service |
| `trace_id` | Correlation across services | abc123def456 |
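One way to emit these required fields is a custom formatter for Python's standard `logging` module. The sketch below is an assumption about how this could be wired up: `JsonFormatter` is a hypothetical class, the `trace_id` is passed through `extra`, and Python reports the warning level as `WARNING` rather than `WARN`.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object with the required fields."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            # trace_id is optional and supplied via logging's `extra` argument
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="payment-service"))
logger.addHandler(handler)
logger.error("Payment gateway timeout", extra={"trace_id": "abc123"})
```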
RFC 5424 log levels
| Number | Level | Usage |
|---|---|---|
| 0 | EMERG | System unusable |
| 1 | ALERT | Immediate action required |
| 2 | CRIT | Critical error |
| 3 | ERROR | Error (non-critical) |
| 4 | WARN | Warning |
| 5 | NOTICE | Normal but significant event |
| 6 | INFO | Informational message |
| 7 | DEBUG | Debugging (disabled in production) |
Correlation ID (traceparent)
- Generated at system entry (API gateway, frontend, message consumer)
- Propagated in the HTTP header `X-Correlation-ID` / `traceparent`
- Enables linking logs across microservices (→ Grafana Explore, Kibana Discover)
- Implementation: middleware in app, service mesh (Envoy), API gateway
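The generate-or-reuse logic of such a middleware can be sketched framework-independently in Python. The function names are hypothetical, headers are modeled as a plain dict, and a UUID is just one possible ID format.

```python
import uuid

HEADER = "X-Correlation-ID"


def ensure_correlation_id(headers: dict) -> str:
    """Reuse the incoming correlation ID, or create one at system entry."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == HEADER.lower() and value:
            return value
    return uuid.uuid4().hex


def outgoing_headers(incoming: dict) -> dict:
    """Headers to attach to downstream calls so logs can be joined later."""
    return {HEADER: ensure_correlation_id(incoming)}


print(outgoing_headers({"x-correlation-id": "abc123"}))
print(outgoing_headers({}))
```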
Distributed tracing detail
Span kinds
| Kind | Description | Example |
|---|---|---|
| CLIENT | Calling downstream service (outbound) | HTTP client calling API |
| SERVER | Processing incoming request | HTTP handler |
| INTERNAL | Local operation within service | Computation, transformation |
| PRODUCER | Sending message to queue | Kafka producer |
| CONSUMER | Receiving message from queue | Kafka consumer |
Trace context chain
Trace: abc123
├── Span: /checkout (SERVER, root)
│   ├── Span: validateCart (INTERNAL)
│   ├── Span: POST /orders (CLIENT → payment-service)
│   │   └── Span: /processPayment (SERVER)
│   │       ├── Span: authorizeCard (INTERNAL)
│   │       └── Span: chargeCard (CLIENT → bank-gateway)
│   │           └── Span: /charge (SERVER, external)
│   └── Span: sendConfirmation (PRODUCER → kafka)
│       └── Span: consumeConfirmation (CONSUMER → email-service)
- W3C TraceContext — standardized cross-service tracing
- Baggage — transport of contextual data (tenant, user role) between spans
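A trace backend reconstructs a chain like the one above from flat span records that each point at their parent. The Python sketch below shows that reconstruction under an assumed record shape (`span_id`, `parent_id`, `name`, `kind`); the IDs are invented.

```python
from collections import defaultdict


def render_trace(spans):
    """Return indented lines for spans linked by parent_id (None = root)."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        # Depth-first traversal keeps each child under its parent.
        for span in children[parent_id]:
            lines.append("  " * depth + f'{span["name"]} ({span["kind"]})')
            walk(span["span_id"], depth + 1)

    walk(None, 0)
    return lines


spans = [
    {"span_id": "a", "parent_id": None, "name": "/checkout", "kind": "SERVER"},
    {"span_id": "b", "parent_id": "a", "name": "validateCart", "kind": "INTERNAL"},
    {"span_id": "c", "parent_id": "a", "name": "POST /orders", "kind": "CLIENT"},
    {"span_id": "d", "parent_id": "c", "name": "/processPayment", "kind": "SERVER"},
]
tree = render_trace(spans)
print("\n".join(tree))
```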
Grafana
Provisioning dashboards as code
apiVersion: 1
providers:
  - name: "default"
    orgId: 1
    folder: "Services"
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards
Dashboards JSON in git → CI/CD → automatic import into Grafana.
Variables
- Query variable: dynamic values (e.g., list of service names from PromQL: `label_values(up, service)`)
- Interval variable: `$__auto_interval`, `$__interval` for variable time range
- Custom variable: manual list of values (env: prod, staging, dev)
- Chained variable: dependent variable (select namespace → show pods in namespace)
Annotations
- Drawing events in graphs (deploys, incidents, config changes)
- Sources: Prometheus alerts, Loki logs, GitHub Actions, custom API
- Use case: "Deploy at 14:30 → spike in latency at 14:31 → correlation"
On-call best practices
Escalation policies
Level 1: Primary on-call (response within 5 min)
└── timeout 15 min
Level 2: Secondary / senior engineer (response within 15 min)
└── timeout 15 min
Level 3: Engineering manager / incident commander
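The escalation chain can be expressed as a small lookup, shown here as a Python sketch. It assumes the timeouts run back to back from the moment the alert fires and that an acknowledgement (not modeled here) stops the escalation; the function name is hypothetical.

```python
# (level name, minutes unacknowledged before the next level is paged)
ESCALATION = [
    ("Primary on-call", 15),
    ("Secondary / senior engineer", 15),
    ("Engineering manager / incident commander", None),
]


def current_level(minutes_unacknowledged):
    """Return (level number, name) responsible after the given wait."""
    elapsed = 0
    for index, (name, timeout) in enumerate(ESCALATION, start=1):
        # The last level has no timeout and keeps the page.
        if timeout is None or minutes_unacknowledged < elapsed + timeout:
            return index, name
        elapsed += timeout
    return len(ESCALATION), ESCALATION[-1][0]


for minutes in (0, 15, 30):
    print(minutes, current_level(minutes))
```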
Incident severity matrix
| Severity | Description | Response | Communication |
|---|---|---|---|
| P0 (Critical) | Service completely unavailable, data loss, security breach | Immediate, 24/7 | Status page + Stakeholder update |
| P1 (High) | Major functionality degraded, part of users affected | Within 15 min | Slack channel + Team lead |
| P2 (Medium) | Non-critical feature broken, workaround exists | Within 1 h | Slack channel |
| P3 (Low) | Cosmetic issue, no user impact | Next business day | Jira ticket |
Postmortem
- Blameless — goal is to learn, not blame
- Structure: Timeline, detection, root cause, resolution, action items
- SRE principle: every incident → postmortem → systemic improvement
- Tools: Jira, Incident.io, PagerDuty postmortem, Google Docs
Logging patterns
Best practices
- Dashboard for each level — executive, service, troubleshooting
- Synthetic monitoring — heartbeat checks, browser tests (Playwright, Cypress)
- APM — Application Performance Monitoring (database queries, external calls)
- Anomaly detection — ML-based outlier detection
- Retention policy — raw data short term, aggregations long term
- Unified log format — JSON, structured data
Recommended literature
Classic books
| Book | Authors | ISBN | Key topics |
|---|---|---|---|
| Site Reliability Engineering | Beyer, Jones, Petoff, Murphy | 978-1491929124 | How Google runs production systems — SRE principles, error budgets, toil, SLI/SLO |
| The Site Reliability Workbook | Beyer, Murphy, Rensin, Kawahara, Thorne | 978-1492029502 | Practical companion to SRE — case studies from Evernote, Home Depot, NY Times; SLO implementation, monitoring, on-call |
| Observability Engineering | Majors, Fong-Jones, Miranda | 978-1492076445 | First comprehensive book on observability — structured events, iterative hypothesis verification, core analysis loop; 2nd edition in 2026 (32 new chapters on AI, cost governance) |
Cloud and monitoring
| Book | Author | ISBN/Year | Topics |
|---|---|---|---|
| Cloud Observability in Action | Michael Hausenblas | Manning, 2023 | Practical guide to observability in cloud-native environments — signal types (logs, metrics, traces, profiles), OTel Collector, SLOs, signal correlation, developer observability; open-source tools |
| Mastering Prometheus | William Hegedus | 978-1-80512-566-2 | Advanced Prometheus techniques — TSDB internals, custom service discovery, cardinality, remote storage (VictoriaMetrics, Mimir), SLO-based alerting; author is SRE manager at Akamai and Prometheus/Thanos contributor |
| Observability with Grafana | Chapman, Holmes | 978-1-80324-964-3 | Complete guide to LGTM stack (Loki, Grafana, Tempo, Mimir) — OTel instrumentation, LogQL/PromQL/TraceQL, AI/ML alerting, real user monitoring with Faro, Pyroscope profiling, k6 load testing |
| Hands-On Monitoring and Alerting with Prometheus | Muhammad Badawy | 978-9349887565 | Practical Prometheus guide — installation, configuration, service discovery, labeling, PromQL, Alertmanager, monitoring Linux, Windows, Docker, databases |
AI and observability
| Book | Authors | ISBN/Year | Topics |
|---|---|---|---|
| Observability in the AI-Native Era | Lipsig, Grabner, Rati | 978-1-80638-959-9 | Connecting observability with AIOps — ML-based anomaly detection, root-cause analysis, self-healing systems, OTel + Prometheus + Grafana + Dynatrace/Datadog, compliance |
| Open Source Observability | Corless, Pawar | O'Reilly, 2025 | Report on disaggregated, modular observability stacks — flexibility, cost efficiency, data autonomy, blueprint for custom solutions from open-source components |
Detailed tool overview
Extended information on tools from the table above:
Grafana Sigil
AI observability product from Grafana Labs. OpenTelemetry-native SDK for instrumenting LLM agents:
- Repository: github.com/grafana/sigil-sdk (Go SDK) + sigil-app (Grafana plugin)
- Features: tracking conversations, generation, tool usage, cost tracking, quality evaluation
- Growing problem: 500M+ conversations, 5M+ agents in production (GrafanaCON 2026)
- Integration: automatic connection with Prometheus (metrics), Tempo (traces), AI Observability API
InfraLens
Zero-instrumentation Kubernetes observability built on eBPF:
- Repository: github.com/Herenn/Infralens (Apache 2.0, Go)
- Features: automatic detection of service-to-service communication, topology visualization, AI-powered documentation
- Architecture: eBPF agent + Go backend + React frontend
- Status: early-stage (1 star, 10 commits), but eBPF-based observability concept is proven (Grafana Beyla, Cilium Hubble, Pixie)
Ingero
GPU causal observability agent — first of its kind:
- Repository: github.com/ingero-io/ingero (Apache 2.0)
- Features: eBPF tracing from Linux kernel events through CUDA API to Python source code
- Overhead: < 2 %, zero code changes, single binary
- MCP server: native Model Context Protocol support — AI assistants can directly query GPU data
- Use case: diagnosis of GPU stalls, scheduler preemptions, CUDA memory spikes — causal chains instead of plain metrics
- Version: v0.19.0 (2026), active development
GreptimeDB
Unified observability database — one backend for metrics, logs and traces:
- Repository: github.com/GreptimeTeam/greptimedb (Apache 2.0, Rust)
- Architecture: compute-storage disaggregation, object storage first (S3, GCS, Azure Blob), columnar storage
- Querying: SQL + PromQL in a single query, JOIN between metrics and logs possible
- Drop-in replacement: Prometheus (PromQL, remote write), Loki (Push API), Elasticsearch (bulk API), Jaeger (Query API)
- Cost reduction: up to 50× lower costs compared to traditional solutions
- Roadmap 2026: v1.0 GA (Q1 2026), v1.1–v1.3 (Vector Index, AI Functions, Auto Rollup, adaptive resource management)
- GreptimeDB Enterprise: enhanced security, HA, enterprise support
Netdata
Open-source, real-time monitoring platform for entire infrastructure:
- Repository: github.com/netdata/netdata (GPLv3+, C; 79k★)
- Features: per-second metrics, ML-based anomaly detection, AI-powered troubleshooting, 800+ integrations
- Zero configuration: auto-discovery, pre-configured alerts, ready dashboards
- Architecture: distributed agent → Netdata Cloud (optional), data stays local
- Energy efficiency: according to University of Amsterdam study, the most efficient tool for monitoring Docker systems
- Netdata Cloud: free tier (5 nodes), paid from $12/node/month
- Licensing: agent GPLv3+, dashboard NCUL1, cloud closed-source
OpenStack Monitoring
OpenStack provides several services for telemetry and monitoring:
Ceilometer (Telemetry)
- Metric collection (CPU, memory, network, storage) from compute, network and storage nodes
- Publishing to Gnocchi (time-series DB) or Panko (event storage)
- Notifications via oslo.messaging (RabbitMQ) — pipeline transformations
- Alarming: Aodh — threshold-based alarms, metric combinations
Monasca
- More modern alternative to Ceilometer (primarily developed for telco use cases)
- Architecture: Monasca API → Log API → Transform → Threshold Engine → Notifier
- Backend: InfluxDB/Gnocchi, Kafka, Elasticsearch
- Supports alerting, notifications, graph dashboards
Prometheus + OpenStack Exporter
- OpenStack-exporter for Prometheus (exports metrics from Ceilometer / API)
- Service discovery via Prometheus
- Grafana dashboards for visualization
Masakari (VM High Availability)
- Detection and automatic recovery of VMs on hypervisor failure (host failure)
- Evacuation of instances to healthy compute node
- Integration with Pacemaker for cluster management
Sources
Links, books and standards: sources/monitoring/sources.md
Last revision: 2026-06-03