OpenTelemetry in practice: turn traces, metrics, and logs into insight
Observability is not a tool; it’s a habit. OpenTelemetry (OTel) gives you a vendor‑neutral standard for emitting traces, metrics, and logs. This post shows how to instrument production systems so you can answer real questions quickly: what broke, who is affected, and where to fix it.
The three pillars, unified
Traces show request lifecycles across services. Metrics summarize health over time. Logs carry details and context. OTel ties them together with shared context: the same trace ID appears in spans, log lines, and metric exemplars, so a single incident story emerges instead of three separate dashboards.
Instrumentation strategy
- Start at the edge (gateway/web) and propagate context (traceparent) everywhere: HTTP, gRPC, messaging.
- Add automatic instrumentation first (HTTP clients/servers, DB drivers). Then layer custom spans for business steps (checkout, quote, shipment create); a sketch follows this list.
- Record attributes you actually query (tenant, country, plan, error code). Avoid high‑cardinality junk.
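A minimal sketch of that layering in Python, assuming the opentelemetry-instrumentation-requests package is installed and an SDK/exporter is configured elsewhere; the attribute keys and the checkout flow are illustrative, not a fixed schema:

import requests
from opentelemetry import trace
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.trace import Status, StatusCode

# Automatic instrumentation: every outgoing requests call becomes a client span,
# and the traceparent header is injected into outgoing calls for propagation.
RequestsInstrumentor().instrument()

# Custom spans for business steps sit on top of the automatic ones.
tracer = trace.get_tracer("shop.checkout")

def checkout(order: dict):
    with tracer.start_as_current_span("checkout") as span:
        # Record only attributes you will actually query; keep cardinality low.
        span.set_attribute("app.tenant", order["tenant"])
        span.set_attribute("app.plan", order["plan"])
        resp = requests.post("https://payments.example/charge", json=order)
        if resp.status_code >= 400:
            span.set_attribute("app.error_code", resp.status_code)
            span.set_status(Status(StatusCode.ERROR))
        return resp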
Collectors and pipelines
Run an OpenTelemetry Collector near your services. Receive OTLP, batch, sample, and export to your backends (Azure Monitor, Grafana Tempo/Loki, Datadog, etc.). Separate production from staging exporters; never mix.
Minimal collector snippet:
receivers:
  otlp:
    protocols: { http: {}, grpc: {} }
processors:
  batch: {}
  tail_sampling:            # ships in the collector-contrib distribution
    policies:
      - name: errors        # keep every trace that contains an error
        type: status_code
        status_code: { status_codes: [ERROR] }
exporters:
  otlphttp/tempo: { endpoint: "https://tempo.example/api/traces" }
service:
  pipelines:
    traces: { receivers: [otlp], processors: [tail_sampling, batch], exporters: [otlphttp/tempo] }
Metrics that matter
Start with the USE (Utilization, Saturation, Errors) and RED (Rate, Errors, Duration) methods. Emit SLO‑aligned metrics (latency P95/P99, error rate) per route and per tenant. Use exemplars to link spikes to traces.
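A sketch of RED-style metrics with the OTel metrics API; the metric names, unit, and attribute keys here are assumptions rather than mandated conventions, and whether exemplars are attached depends on your SDK version and exporter:

import time
from opentelemetry import metrics

meter = metrics.get_meter("shop.http")

# Duration histogram and error counter, both dimensioned by route and tenant.
duration_ms = meter.create_histogram(
    "http.server.request.duration", unit="ms",
    description="Server-side request latency",
)
errors = meter.create_counter(
    "http.server.errors",
    description="Requests that ended in an error",
)

def handle(route: str, tenant: str, do_work):
    start = time.monotonic()
    try:
        return do_work()
    except Exception:
        errors.add(1, {"http.route": route, "app.tenant": tenant})
        raise
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000.0
        # Recording while a span is active lets exemplar-capable backends link
        # a latency spike back to the trace that caused it.
        duration_ms.record(elapsed_ms, {"http.route": route, "app.tenant": tenant})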
Logs without noise
Structure logs (JSON) with a consistent schema: timestamp, severity, trace‑id/span‑id, event, and fields. Sample chatty logs; keep personally identifiable information out.
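A sketch of that schema using the standard logging module plus the OTel trace API; the field names are illustrative, and many OTel logging integrations can inject the trace and span IDs for you:

import json
import logging
from opentelemetry import trace

class JsonTraceFormatter(logging.Formatter):
    """Emit one JSON object per log line, correlated with the active span."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "severity": record.levelname,
            "event": record.getMessage(),
        }
        ctx = trace.get_current_span().get_span_context()
        if ctx.is_valid:
            entry["trace_id"] = format(ctx.trace_id, "032x")
            entry["span_id"] = format(ctx.span_id, "016x")
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonTraceFormatter())
logging.getLogger("shop").addHandler(handler)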
Dashboards that tell a story
One page per service:
- SLOs + burn rates at the top
- Golden paths by route/queue topic with latency and errors over time
- Top error codes and affected tenants
- Link to recent deployments; overlays for incidents
Alerts that don’t wake you up for nothing
Alert on symptoms, not causes: SLO burn, error rate, queue depth, saturation. Use multi‑window, multi‑burn rate alerts to balance speed and noise. Page humans only when action is needed.
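To make the multi-window, multi-burn-rate idea concrete, here is a small sketch; the 99.9% target, the 1h/5m window pair, and the 14.4 threshold follow the commonly cited SRE-workbook example and are assumptions to tune for your own SLOs:

# Sketch: multi-window, multi-burn-rate paging condition for an availability SLO.
SLO_TARGET = 0.999
ERROR_BUDGET = 1.0 - SLO_TARGET          # 0.1% of requests may fail

def burn_rate(error_ratio: float) -> float:
    """How many times faster than 'exactly on budget' the budget is burning."""
    return error_ratio / ERROR_BUDGET

def should_page(err_1h: float, err_5m: float) -> bool:
    # The long window (1h) shows the problem is sustained; the short window (5m)
    # shows it is still happening, so alerts stop firing after recovery.
    return burn_rate(err_1h) > 14.4 and burn_rate(err_5m) > 14.4

# Example: 2% errors over the last hour, 3% over the last 5 minutes.
print(should_page(err_1h=0.02, err_5m=0.03))   # True: page a human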
Cost and sampling
Full tracing is expensive at scale. Use tail‑based sampling to keep all errors and a representative slice of normal traffic. Downsample DEBUG logs in production; keep INFO for the last N minutes during incidents.
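The collector snippet above covers the tail-based part. On the SDK side, a parent-based ratio sampler is a common way to cap normal traffic before it leaves the process; the 10% ratio is an assumption, and anything dropped here never reaches the collector's error-keeping policy:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~10% of locally started traces; honor the sampling decision of any
# propagated parent so distributed traces stay complete. Ratio is an assumption.
provider = TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.10)))
trace.set_tracer_provider(provider)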
Rollout playbook
1) Instrument gateways and a single service.
2) Add DB spans and external calls.
3) Turn on tail sampling.
4) Create one dashboard and two SLOs.
5) Review weekly; prune unused signals.
Good observability pays for itself: faster incident resolution, confidence in deploys, and the ability to ask new questions without shipping new code.