Stanislav Hubacek 3fa11ef0f6 comiiit

2026-06-11 15:27:28 +02:00

📊 Monitoring and observability

OpenMetrics standard

OpenMetrics (CNCF sandbox) is the de-facto standard for metric exposition in cloud-native environments:

Supports text representation and Protocol Buffers
Foundation for Prometheus exposition format
Specifies metric types: counter, gauge, histogram, gaugehistogram, summary, stateset, info, unknown
_total suffix for counter samples, _bucket (with an le label) for histogram buckets
Metadata: HELP, TYPE, UNIT, (timestamp optional)

The standard is developed within OpenObservability.
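
To make the text representation concrete, here is a minimal sketch that renders one counter family with its TYPE, UNIT and HELP metadata and the mandatory `# EOF` terminator. The function name `render_counter` and the metric `process_cpu_seconds` are illustrative assumptions, and label-value escaping is left out for brevity.

```python
def render_counter(name, help_text, unit, samples):
    """Render one counter family in OpenMetrics text format.

    `name` is the family name without the `_total` suffix; `samples`
    maps a tuple of (label, value) pairs to the cumulative count.
    Hypothetical helper: label values are not escaped here.
    """
    lines = [
        f"# TYPE {name} counter",
        f"# UNIT {name} {unit}",
        f"# HELP {name} {help_text}",
    ]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        braces = f"{{{label_str}}}" if label_str else ""
        # Counter samples carry the _total suffix; the family name does not.
        lines.append(f"{name}_total{braces} {value}")
    lines.append("# EOF")  # an OpenMetrics exposition must end with this line
    return "\n".join(lines) + "\n"


print(render_counter(
    "process_cpu_seconds",
    "CPU time consumed.",
    "seconds",
    {(("mode", "user"),): 12.5},
))
```

In practice a client library generates this output; the sketch only shows how the metadata lines and suffixes fit together.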

New tools and trends (2024–2026)

Tool	Description
Grafana Sigil	AI observability for LLM agents (OTel-native)
InfraLens	eBPF-based, zero-instrumentation network observability
Ingero	GPU causal observability (eBPF, CUDA tracing)
GreptimeDB	Unified observability DB — replaces Prometheus + Loki + ES
Netdata	AI-powered full-stack monitoring, 800+ integrations, edge ML

Three pillars of observability

Logs — timestamped event records, unstructured or structured (ERROR, WARN, INFO)
Metrics — numerical data over time (latency, error rate, CPU utilization)
Traces — request tracking across services (distributed tracing)

SLI / SLO / SLA

Term	Meaning	Example
SLI (Service Level Indicator)	Measured metric	Latency p99 = 250ms
SLO (Service Level Objective)	Target value	99.9 % of requests < 300ms
SLA (Service Level Agreement)	Legal commitment	99.95 % uptime

Error budget

Error Budget = 100 % - SLO

If the SLO is 99.9 %, the error budget is 0.1 % of the time (about 43 minutes in a 30-day window)
While error budget remains, the team can deploy new features
When the budget is exhausted, deploys are frozen and stability takes priority
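
The arithmetic behind the error budget can be sketched as follows, assuming a time-based SLO and a 30-day window. Both function names and the window length are assumptions for illustration.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Allowed 'bad' minutes for a time-based SLO over the window."""
    budget_fraction = (100.0 - slo_percent) / 100.0
    return window_days * 24 * 60 * budget_fraction


def budget_remaining(slo_percent, bad_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    if budget == 0:
        raise ValueError("a 100 % SLO leaves no error budget")
    return (budget - bad_minutes) / budget


print(round(error_budget_minutes(99.9), 1))        # budget for a 99.9 % SLO
print(round(budget_remaining(99.9, 21.6), 2))      # share left after 21.6 bad minutes
```

A team could gate deploys on `budget_remaining` staying above zero. The exact policy is a local decision, not something the formula prescribes.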

Pyramid of metrics — RED vs USE vs 4 Golden Signals

4 Golden Signals (Google SRE)

Latency — request processing time (distinguish success vs error latency)
Traffic — number of requests / throughput (RPS, QPS, throughput)
Errors — explicit errors (5xx, 4xx) and implicit (success with wrong result)
Saturation — how "full" the service is (CPU, memory, queue depth, connection pool)

USE (for infrastructure)

Utilization — how busy the resource is (% time active)
Saturation — how much is waiting in queue (run queue, I/O wait)
Errors — errors (dropped packets, disk errors, OOM)

RED (for services)

Rate — requests per second
Errors — number of failed requests per second
Duration — latency (distribution, percentiles)

Methodology	Focus	Typical metrics
4 Golden Signals	Services + infrastructure	Latency, RPS, errors, saturation
USE	Infrastructure	CPU util, I/O saturation, disk errors
RED	Microservices	RPS, error rate, p50/p95/p99 latency
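
As a rough illustration of the RED signals, the sketch below derives rate, error ratio and latency percentiles from raw request records. The record shape `(status_code, duration_seconds)` and the function name `red_summary` are assumptions. Production systems usually compute these from counters and histograms rather than from individual requests.

```python
import math


def red_summary(requests, window_seconds):
    """requests: list of (status_code, duration_seconds) seen in the window."""
    if not requests:
        raise ValueError("no requests in window")
    durations = sorted(d for _, d in requests)
    errors = sum(1 for status, _ in requests if status >= 500)

    def percentile(p):
        # Nearest-rank percentile over the raw durations.
        rank = max(1, math.ceil(p / 100 * len(durations)))
        return durations[rank - 1]

    return {
        "rate_rps": len(requests) / window_seconds,   # Rate
        "error_ratio": errors / len(requests),        # Errors
        "p50_s": percentile(50),                      # Duration
        "p99_s": percentile(99),
    }


print(red_summary([(200, 0.1), (200, 0.2), (500, 0.3), (200, 0.4)], 2))
```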

PromQL examples

Expression	Description
`rate(http_requests_total[5m])`	Requests per second (average over 5 min)
`increase(http_requests_total[1h])`	Total increase over 1 hour
`sum by (status) (rate(http_requests_total[5m]))`	Requests aggregated by status code
`histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))`	p99 latency
`avg_over_time(cpu_usage[1h])`	Average CPU utilization over an hour
`topk(5, sum(rate(http_requests_total[5m])) by (service))`	Top 5 services by RPS
`max_over_time(memory_usage[24h])`	Max memory usage over 24h
`rate(node_network_receive_drop_total[5m]) > 0`	Network interfaces dropping received packets
`(1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))`	CPU utilization (1 - idle)
`rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])`	Average latency over 5 min (rate, not delta, because _sum and _count are counters)
`absent(metric)`	Alert when metric is missing
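
To show what `rate()` does conceptually, here is a simplified sketch that computes a per-second increase from counter samples and compensates for counter resets. It is not Prometheus's actual implementation, which also extrapolates to the window boundaries. The name `simple_rate` is an assumption.

```python
def simple_rate(samples):
    """Per-second increase of a counter over the sampled window.

    samples: list of (unix_timestamp, counter_value), oldest first.
    """
    if len(samples) < 2:
        return None  # not enough points to compute a rate
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. process restart),
        # so the new value is counted as an increase from zero.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Counter resets between t=60 and t=120; total increase is 60 + 20 + 60.
print(simple_rate([(0, 100), (60, 160), (120, 20), (180, 80)]))
```

This also shows why `delta()` is unsuitable for counters: it takes the difference between the last and first value and would treat a reset as a large negative change.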

Recording rules

Pre-aggregation of frequently used PromQL queries to reduce query load.

When to use

Complex queries used across multiple dashboards
Queries over raw data with high cardinality
Frequently queried aggregations (e.g., p99 latency over last month)

Example

groups:
  - name: service_rules
    interval: 1m
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: instance:cpu:utilization
        expr: (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance))
      - record: service:http_latency:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

record — name of the new series (convention: level:metric:operations)
interval — how often the group's rules are evaluated (typically 1-5 min; defaults to the global evaluation_interval)
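
The `service:http_latency:p99` rule above relies on `histogram_quantile`, which estimates a quantile from cumulative buckets by linear interpolation. The following is a simplified sketch of that idea, assuming sorted cumulative buckets ending in a +Inf bucket. It omits edge cases that the real function handles.

```python
import math


def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), sorted, last bound = inf."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls into the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count


buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.99, buckets))
```

Because the result is interpolated, its accuracy depends on bucket boundaries: a p99 can never be more precise than the width of the bucket it falls into.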

Tools

Metrics

Tool	Description
Prometheus	Pull-based, time-series DB, powerful query language (PromQL)
Grafana	Visualization, dashboards, alerting
Zabbix	Enterprise monitoring, agent + agentless (SNMP/IPMI/JMX), auto-discovery, trigger-based alerting
Datadog	SaaS, APM, logs, metrics in one
New Relic	APM, browser monitoring
CloudWatch	AWS native
Azure Monitor	Azure native
Google Cloud Ops	GCP native

Logging

Tool	Description
ELK Stack	Elasticsearch, Logstash, Kibana
Loki	Grafana Loki — lightweight, Prometheus-like
Splunk	Enterprise log management
Fluentd / Fluent Bit	Log collector and forwarder
Vector	High-performance log/metric collector

Tracing

Tool	Description
Jaeger	Open-source distributed tracing
Zipkin	Open-source distributed tracing
OpenTelemetry	Standard for instrumentation (logs, metrics, traces)
Datadog APM	SaaS tracing
AWS X-Ray	AWS tracing

OpenTelemetry detail

Span attributes

resource:
  attributes:
    - service.name: "payment-service"
    - service.version: "1.2.3"
    - deployment.environment: "production"
scope:
  name: "io.opentelemetry.payment"
spans:
  - name: "processPayment"
    kind: SPAN_KIND_INTERNAL
    attributes:
      - payment.method: "credit_card"
      - payment.amount: 2499
      - payment.currency: "CZK"
    events:
      - name: "authorization.complete"
        timestamp: 1717428000000000000

Context propagation (W3C TraceContext)

traceparent — header carrying trace-id, span-id, trace flags
- Format: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
- Version (00) | Trace-ID (32 hex) | Span-ID (16 hex) | TraceFlags (01 = sampled)
tracestate — vendor-specific key-value pairs carried alongside traceparent, so that several tracing systems can take part in the same trace
Propagation happens via HTTP headers, gRPC metadata, message queue properties
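
The `traceparent` layout described above can be parsed with a few lines of code. This is a minimal sketch for version `00` headers only. The function name `parse_traceparent` is an assumption, and real services would normally rely on an OpenTelemetry propagator instead.

```python
import re

# version - trace-id (32 hex) - parent/span-id (16 hex) - trace flags
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return the header fields as a dict, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C specification.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest bit = sampled flag
    }


print(parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"))
```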

Sampling

Type	Description	Use case
Head-based	Sampling decision at trace start (based on ID)	Simple, deterministic
Tail-based	Decision after trace completion (based on result, latency)	Keeps errors and slow traces; requires buffering complete traces

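Tail-based sampling: often used to keep critical traces (5xx responses, latency above p99)
Tools: OTel Collector (head-based via probabilistic sampler, tail-based via tail_sampling processor), Jaeger (head-based, configured in the SDK or via remote sampling), Grafana Alloy in front of Tempo (tail-based)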
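
The difference between the two strategies can be sketched in a few lines. The head-based function decides from the trace ID alone, so every service reaches the same verdict. The tail-based function needs the finished spans. Both function names, the span dictionary shape and the 1-second threshold are assumptions, and this is not the algorithm of any specific SDK.

```python
def head_sample(trace_id_hex, ratio):
    """Deterministic head-based decision derived from the trace ID."""
    # Treat the low 64 bits of the 128-bit ID as a uniformly distributed number.
    low64 = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low64 < int(ratio * (1 << 64))


def tail_sample(spans, latency_threshold_s=1.0):
    """Tail-based decision over a finished trace: keep errors and slow traces."""
    has_error = any(s["status_code"] >= 500 for s in spans)
    duration = max(s["end"] for s in spans) - min(s["start"] for s in spans)
    return has_error or duration >= latency_threshold_s


trace_id = "0af7651916cd43dd8448eb211c80319c"
print(head_sample(trace_id, 0.5))
print(tail_sample([{"status_code": 503, "start": 0.0, "end": 0.2}]))
```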

Alerting

Principles

Alert on symptom, not cause — "500 errors" instead of "high CPU"
Reduce noise — flapping alerts, alert fatigue
Runbook for every alert — what to do when alert fires
Alert severity — P0 (critical), P1 (high), P2 (medium), P3 (low)

Alertmanager (Prometheus)

route:
  receiver: "team-pager"
  group_by: ["alertname", "cluster"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # `match` is deprecated in favour of `matchers`
    - matchers:
        - severity="critical"
      receiver: "team-pager"
      repeat_interval: 1h
    - matchers:
        - severity="warning"
      receiver: "team-slack"

receivers:
  - name: "team-pager"
    pagerduty_configs:
      - routing_key: "<KEY>"
        severity: "{{ .CommonLabels.severity }}"
  - name: "team-slack"
    slack_configs:
      - channel: "#alerts"
        title: "{{ .GroupLabels.alertname }}"

Concepts:

Grouping — batching alerts that share labels into a single notification (noise reduction, e.g., all down instances in a cluster)
Inhibition — suppression of less severe alerts while a more severe one is firing (e.g., NodeDown inhibits pod alerts on that node)
Silencing — temporary alert suppression (matching labels + duration)
Routing tree — hierarchical routing by label match (severity, service, team)
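
Grouping and inhibition can be illustrated with a small model of the logic. This is a conceptual sketch, not Alertmanager's code. The alert dictionary shape and both function names are assumptions, and inhibition is simplified to "a critical alert with the same cluster label mutes non-critical ones".

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


def inhibited(alert, firing, equal=("cluster",)):
    """True if a critical alert sharing the `equal` labels is firing."""
    if alert["labels"].get("severity") == "critical":
        return False  # critical alerts are never muted in this sketch
    return any(
        other["labels"].get("severity") == "critical"
        and all(other["labels"].get(l) == alert["labels"].get(l) for l in equal)
        for other in firing
    )


alerts = [
    {"labels": {"alertname": "InstanceDown", "cluster": "eu", "severity": "critical"}},
    {"labels": {"alertname": "PodCrash", "cluster": "eu", "severity": "warning"}},
    {"labels": {"alertname": "PodCrash", "cluster": "us", "severity": "warning"}},
]
print(list(group_alerts(alerts, ["alertname", "cluster"])))
print([inhibited(a, alerts) for a in alerts])
```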

ESM (Event / Incident Management)

PagerDuty, Opsgenie, OnCall (Grafana)
Escalation policies
On-call rotations

Structured logging

{
  "timestamp": "2026-06-03T10:30:00Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123",
  "user_id": "u456",
  "message": "Payment gateway timeout",
  "duration_ms": 1200,
  "error": {
    "type": "TimeoutError",
    "message": "Gateway did not respond in 1000ms"
  }
}

Required fields of structured log

Field	Description	Example
`timestamp`	ISO 8601 / RFC 3339	`2026-06-03T10:30:00Z`
`level`	Log level (RFC 5424)	`ERROR`, `WARN`, `INFO`, `DEBUG`
`message`	Human-readable message	`Payment processed`
`service`	Service name	`payment-service`
`trace_id`	Correlation across services	`abc123def456`
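
One way to emit the required fields from an application is a custom formatter for Python's standard `logging` module. The sketch below is a minimal example. The class name `JsonFormatter` and the convention of passing `trace_id` through `extra=` are assumptions. Note that Python names its warning level `WARNING`, not `WARN`.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format log records as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
            # Expected to arrive via logger.error(..., extra={"trace_id": ...}).
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("payment-service"))
logger = logging.getLogger("payment")
logger.addHandler(handler)
logger.error("Payment gateway timeout", extra={"trace_id": "abc123"})
```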

RFC 5424 log levels

Number	Level	Usage
0	EMERG	System unusable
1	ALERT	Immediate action required
2	CRIT	Critical error
3	ERROR	Error (non-critical)
4	WARN	Warning
5	NOTICE	Normal but significant event
6	INFO	Informational message
7	DEBUG	Debugging (disabled in production)

Correlation ID (traceparent)

Generated at system entry (API gateway, frontend, message consumer)
Propagated in HTTP header X-Correlation-ID / traceparent
Enables linking logs across microservices (→ Grafana Explore, Kibana Discover)
Implementation: middleware in app, service mesh (Envoy), API gateway
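
At the system entry point, the middleware logic amounts to "reuse the incoming ID or create one, then forward it". A framework-independent sketch follows, assuming headers arrive as a plain dictionary. The function name `ensure_correlation_id` is hypothetical.

```python
import uuid


def ensure_correlation_id(headers):
    """Return (correlation_id, headers_to_forward) for an incoming request."""
    # HTTP header names are case-insensitive; normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    correlation_id = lowered.get("x-correlation-id") or uuid.uuid4().hex
    # The same ID goes into every downstream call and every log line.
    return correlation_id, {"X-Correlation-ID": correlation_id}


cid, outgoing = ensure_correlation_id({"x-correlation-id": "abc123def456"})
print(cid, outgoing)
```

When tracing is already in place, the trace ID from `traceparent` can serve as the correlation ID, which avoids maintaining two identifiers.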

Distributed tracing detail

Span kinds

Kind	Description	Example
CLIENT	Calling downstream service (outbound)	HTTP client calling API
SERVER	Processing incoming request	HTTP handler
INTERNAL	Local operation within service	Computation, transformation
PRODUCER	Sending message to queue	Kafka producer
CONSUMER	Receiving message from queue	Kafka consumer

Trace context chain

Trace: abc123
├── Span: /checkout (SERVER, root)
│   ├── Span: validateCart (INTERNAL)
│   ├── Span: POST /orders (CLIENT → payment-service)
│   │   └── Span: /processPayment (SERVER)
│   │       ├── Span: authorizeCard (INTERNAL)
│   │       └── Span: chargeCard (CLIENT → bank-gateway)
│   │           └── Span: /charge (SERVER, external)
│   └── Span: sendConfirmation (PRODUCER → kafka)
│       └── Span: consumeConfirmation (CONSUMER → email-service)

W3C TraceContext — standardized cross-service tracing
Baggage — key-value context (tenant, user role) propagated across service boundaries alongside the trace context
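
A trace like the one drawn above is stored as a flat list of spans, each pointing to its parent. The tree is reconstructed from those parent references. A minimal sketch, with a hypothetical span dictionary shape and function name:

```python
def render_trace(spans):
    """spans: list of dicts with span_id, parent_id (None for root), name, kind."""
    children = {}
    for span in spans:
        children.setdefault(span["parent_id"], []).append(span)

    lines = []

    def walk(parent_id, depth):
        # Depth-first traversal: each span is followed by its own children.
        for span in children.get(parent_id, []):
            lines.append(f"{'  ' * depth}{span['name']} ({span['kind']})")
            walk(span["span_id"], depth + 1)

    walk(None, 0)
    return lines


spans = [
    {"span_id": "a", "parent_id": None, "name": "/checkout", "kind": "SERVER"},
    {"span_id": "b", "parent_id": "a", "name": "validateCart", "kind": "INTERNAL"},
    {"span_id": "c", "parent_id": "a", "name": "POST /orders", "kind": "CLIENT"},
    {"span_id": "d", "parent_id": "c", "name": "/processPayment", "kind": "SERVER"},
]
print("\n".join(render_trace(spans)))
```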

Grafana

Provisioning dashboards as code

apiVersion: 1
providers:
  - name: "default"
    orgId: 1
    folder: "Services"
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards

Dashboards JSON in git → CI/CD → automatic import into Grafana.

Variables

Query variable — dynamic values (e.g., list of service names from PromQL: label_values(up, service))
Interval variable — $__auto_interval, $__interval for variable time range
Custom variable — manual list of values (env: prod, staging, dev)
Chained variable — dependent variable (select namespace → show pods in namespace)

Annotations

Drawing events in graphs (deploys, incidents, config changes)
Sources: Prometheus alerts, Loki logs, GitHub Actions, custom API
Use case: "Deploy at 14:30 → spike in latency at 14:31 → correlation"

On-call best practices

Escalation policies

Level 1: Primary on-call (response within 5 min)
    └── timeout 15 min
Level 2: Secondary / senior engineer (response within 15 min)
    └── timeout 15 min
Level 3: Engineering manager / incident commander
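
The escalation chain above can be expressed as data plus a small lookup, which is roughly what paging tools evaluate internally. A sketch under the assumption that each level's timeout starts when the previous one expires. The names `ESCALATION` and `current_level` are hypothetical.

```python
# (who is paged, minutes before escalating to the next level)
ESCALATION = [
    ("primary on-call", 15),
    ("secondary / senior engineer", 15),
    ("engineering manager / incident commander", None),  # last level, no timeout
]


def current_level(minutes_unacknowledged, policy=ESCALATION):
    """Return who should be paged for an alert left unacknowledged this long."""
    elapsed = 0
    for name, timeout in policy:
        if timeout is None or minutes_unacknowledged < elapsed + timeout:
            return name
        elapsed += timeout
    return policy[-1][0]


for minutes in (0, 15, 30):
    print(minutes, current_level(minutes))
```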

Incident severity matrix

Severity	Description	Response	Communication
P0 (Critical)	Service completely unavailable, data loss, security breach	Immediate, 24/7	Status page + Stakeholder update
P1 (High)	Major functionality degraded, part of users affected	Within 15 min	Slack channel + Team lead
P2 (Medium)	Non-critical feature broken, workaround exists	Within 1 h	Slack channel
P3 (Low)	Cosmetic issue, no user impact	Next business day	Jira ticket

Postmortem

Blameless — goal is to learn, not blame
Structure: Timeline, detection, root cause, resolution, action items
SRE principle: every incident → postmortem → systemic improvement
Tools: Jira, Incident.io, PagerDuty postmortem, Google Docs

Logging patterns

Best practices

Dashboard for each level — executive, service, troubleshooting
Synthetic monitoring — heartbeat checks, browser tests (Playwright, Cypress)
APM — Application Performance Monitoring (database queries, external calls)
Anomaly detection — ML-based outlier detection
Retention policy — raw data short term, aggregations long term
Unified log format — JSON, structured data

Recommended literature

Classic books

Book	Authors	ISBN	Key topics
Site Reliability Engineering	Beyer, Jones, Petoff, Murphy	978-1491929124	How Google runs production systems — SRE principles, error budgets, toil, SLI/SLO
The Site Reliability Workbook	Beyer, Murphy, Rensin, Kawahara, Thorne	978-1492029502	Practical companion to SRE — case studies from Evernote, Home Depot, NY Times; SLO implementation, monitoring, on-call
Observability Engineering	Majors, Fong-Jones, Miranda	978-1492076445	First comprehensive book on observability — structured events, iterative hypothesis verification, core analysis loop; 2nd edition in 2026 (32 new chapters on AI, cost governance)

Cloud and monitoring

Book	Author	ISBN/Year	Topics
Cloud Observability in Action	Michael Hausenblas	Manning, 2023	Practical guide to observability in cloud-native environments — signal types (logs, metrics, traces, profiles), OTel Collector, SLOs, signal correlation, developer observability; open-source tools
Mastering Prometheus	William Hegedus	978-1-80512-566-2	Advanced Prometheus techniques — TSDB internals, custom service discovery, cardinality, remote storage (VictoriaMetrics, Mimir), SLO-based alerting; author is SRE manager at Akamai and Prometheus/Thanos contributor
Observability with Grafana	Chapman, Holmes	978-1-80324-964-3	Complete guide to LGTM stack (Loki, Grafana, Tempo, Mimir) — OTel instrumentation, LogQL/PromQL/TraceQL, AI/ML alerting, real user monitoring with Faro, Pyroscope profiling, k6 load testing
Hands-On Monitoring and Alerting with Prometheus	Muhammad Badawy	978-9349887565	Practical Prometheus guide — installation, configuration, service discovery, labeling, PromQL, Alertmanager, monitoring Linux, Windows, Docker, databases

AI and observability

Book	Authors	ISBN/Year	Topics
Observability in the AI-Native Era	Lipsig, Grabner, Rati	978-1-80638-959-9	Connecting observability with AIOps — ML-based anomaly detection, root-cause analysis, self-healing systems, OTel + Prometheus + Grafana + Dynatrace/Datadog, compliance
Open Source Observability	Corless, Pawar	O'Reilly, 2025	Report on disaggregated, modular observability stacks — flexibility, cost efficiency, data autonomy, blueprint for custom solutions from open-source components

Detailed tool overview

Extended information on the tools listed in the "New tools and trends (2024–2026)" table:

Grafana Sigil

AI observability product from Grafana Labs. OpenTelemetry-native SDK for instrumenting LLM agents:

Repository: github.com/grafana/sigil-sdk (Go SDK) + sigil-app (Grafana plugin)
Features: tracking conversations, generation, tool usage, cost tracking, quality evaluation
Growing problem: 500M+ conversations, 5M+ agents in production (GrafanaCON 2026)
Integration: automatic connection with Prometheus (metrics), Tempo (traces), AI Observability API

InfraLens

Zero-instrumentation Kubernetes observability built on eBPF:

Repository: github.com/Herenn/Infralens (Apache 2.0, Go)
Features: automatic detection of service-to-service communication, topology visualization, AI-powered documentation
Architecture: eBPF agent + Go backend + React frontend
Status: early-stage (1 star, 10 commits), but eBPF-based observability concept is proven (Grafana Beyla, Cilium Hubble, Pixie)

Ingero

GPU causal observability agent — first of its kind:

Repository: github.com/ingero-io/ingero (Apache 2.0)
Features: eBPF tracing from Linux kernel events through CUDA API to Python source code
Overhead: < 2 %, zero code changes, single binary
MCP server: native Model Context Protocol support — AI assistants can directly query GPU data
Use case: diagnosis of GPU stalls, scheduler preemptions, CUDA memory spikes — causal chains instead of plain metrics
Version: v0.19.0 (2026), active development

GreptimeDB

Unified observability database — one backend for metrics, logs and traces:

Repository: github.com/GreptimeTeam/greptimedb (Apache 2.0, Rust)
Architecture: compute-storage disaggregation, object storage first (S3, GCS, Azure Blob), columnar storage
Querying: SQL + PromQL in a single query, JOIN between metrics and logs possible
Drop-in replacement: Prometheus (PromQL, remote write), Loki (Push API), Elasticsearch (bulk API), Jaeger (Query API)
Cost reduction: vendor-claimed, up to 50× lower costs compared to traditional solutions
Roadmap 2026: v1.0 GA (Q1 2026), v1.1–v1.3 (Vector Index, AI Functions, Auto Rollup, adaptive resource management)
GreptimeDB Enterprise: enhanced security, HA, enterprise support

Netdata

Open-source, real-time monitoring platform for entire infrastructure:

Repository: github.com/netdata/netdata (GPLv3+, C; 79k★)
Features: per-second metrics, ML-based anomaly detection, AI-powered troubleshooting, 800+ integrations
Zero configuration: auto-discovery, pre-configured alerts, ready dashboards
Architecture: distributed agent → Netdata Cloud (optional), data stays local
Energy efficiency: according to University of Amsterdam study, the most efficient tool for monitoring Docker systems
Netdata Cloud: free tier (5 nodes), paid from $12/node/month
Licensing: agent GPLv3+, dashboard NCUL1, cloud closed-source

OpenStack Monitoring

OpenStack provides several services for telemetry and monitoring:

Ceilometer (Telemetry)

Metric collection (CPU, memory, network, storage) from compute, network and storage nodes
Publishing to Gnocchi (time-series DB) or Panko (event storage)
Notifications via oslo.messaging (RabbitMQ) — pipeline transformations
Alarming: Aodh — threshold-based alarms, metric combinations

Monasca

More modern alternative to Ceilometer (primarily developed for telco use cases)
Architecture: Monasca API → Log API → Transform → Threshold Engine → Notifier
Backend: InfluxDB/Gnocchi, Kafka, Elasticsearch
Supports alerting, notifications, graph dashboards

Prometheus + OpenStack Exporter

OpenStack-exporter for Prometheus (exports metrics from Ceilometer / API)
Service discovery via Prometheus
Grafana dashboards for visualization

Masakari (VM High Availability)

Detection and automatic recovery of VMs on hypervisor failure (host failure)
Evacuation of instances to healthy compute node
Integration with Pacemaker for cluster management

Sources

Links, books and standards: sources/monitoring/sources.md

Last revision: 2026-06-03
