OpenTelemetry in practice: turn traces, metrics, and logs into insight
Observability is not a tool; it’s a habit. OpenTelemetry (OTel) gives you a vendor‑neutral standard for emitting traces, metrics, and logs. This post shows how to instrument production systems so you can answer real questions quickly: what broke, who is affected, and where to fix it.
The three pillars, unified
Traces show request lifecycles across services. Metrics summarize health over time. Logs carry details and context. OTel ties them together with shared context: the same trace ID appears in spans, log lines, and metric exemplars, so a single incident story emerges instead of three separate dashboards.
Instrumentation strategy
- Start at the edge (gateway/web) and propagate context (traceparent) everywhere: HTTP, gRPC, messaging.
- Add automatic instrumentation first (HTTP clients/servers, DB drivers). Then layer custom spans for business steps (checkout, quote, shipment create); a sketch follows this list.
- Record attributes you actually query (tenant, country, plan, error code). Avoid high‑cardinality junk.
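A minimal sketch of that layering in Python, assuming the opentelemetry-instrumentation-requests package is installed and an SDK/exporter is configured elsewhere; the attribute keys and the checkout flow are illustrative, not a fixed schema:

import requests
from opentelemetry import trace
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.trace import Status, StatusCode

# Automatic instrumentation: every outgoing requests call becomes a client span,
# and the traceparent header is injected into outgoing calls for propagation.
RequestsInstrumentor().instrument()

# Custom spans for business steps sit on top of the automatic ones.
tracer = trace.get_tracer("shop.checkout")

def checkout(order: dict):
    with tracer.start_as_current_span("checkout") as span:
        # Record only attributes you will actually query; keep cardinality low.
        span.set_attribute("app.tenant", order["tenant"])
        span.set_attribute("app.plan", order["plan"])
        resp = requests.post("https://payments.example/charge", json=order)
        if resp.status_code >= 400:
            span.set_attribute("app.error_code", resp.status_code)
            span.set_status(Status(StatusCode.ERROR))
        return resp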
Collectors and pipelines
Run an OpenTelemetry Collector near your services. Receive OTLP, batch, sample, and export to your backends (Azure Monitor, Grafana Tempo/Loki, Datadog, etc.). Separate production from staging exporters; never mix.
Minimal collector snippet:
receivers:
  otlp:
    protocols: { http: {}, grpc: {} }
processors:
  batch: {}
  tail_sampling:            # ships in the collector-contrib distribution
    policies:
      - name: errors        # keep every trace that contains an error
        type: status_code
        status_code: { status_codes: [ERROR] }
exporters:
  otlphttp/tempo: { endpoint: "https://tempo.example/api/traces" }
service:
  pipelines:
    traces: { receivers: [otlp], processors: [tail_sampling, batch], exporters: [otlphttp/tempo] }
Metrics that matter
Start with the USE (Utilization, Saturation, Errors) and RED (Rate, Errors, Duration) methods. Emit SLO‑aligned metrics (latency P95/P99, error rate) per route and per tenant. Use exemplars to link spikes to traces.
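A sketch of RED-style metrics with the OTel metrics API; the metric names, unit, and attribute keys here are assumptions rather than mandated conventions, and whether exemplars are attached depends on your SDK version and exporter:

import time
from opentelemetry import metrics

meter = metrics.get_meter("shop.http")

# Duration histogram and error counter, both dimensioned by route and tenant.
duration_ms = meter.create_histogram(
    "http.server.request.duration", unit="ms",
    description="Server-side request latency",
)
errors = meter.create_counter(
    "http.server.errors",
    description="Requests that ended in an error",
)

def handle(route: str, tenant: str, do_work):
    start = time.monotonic()
    try:
        return do_work()
    except Exception:
        errors.add(1, {"http.route": route, "app.tenant": tenant})
        raise
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000.0
        # Recording while a span is active lets exemplar-capable backends link
        # a latency spike back to the trace that caused it.
        duration_ms.record(elapsed_ms, {"http.route": route, "app.tenant": tenant})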
Logs without noise
Structure logs (JSON) with a consistent schema: timestamp, severity, trace‑id/span‑id, event, and fields. Sample chatty logs; keep personally identifiable information out.
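A sketch of that schema using the standard logging module plus the OTel trace API; the field names are illustrative, and many OTel logging integrations can inject the trace and span IDs for you:

import json
import logging
from opentelemetry import trace

class JsonTraceFormatter(logging.Formatter):
    """Emit one JSON object per log line, correlated with the active span."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "severity": record.levelname,
            "event": record.getMessage(),
        }
        ctx = trace.get_current_span().get_span_context()
        if ctx.is_valid:
            entry["trace_id"] = format(ctx.trace_id, "032x")
            entry["span_id"] = format(ctx.span_id, "016x")
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonTraceFormatter())
logging.getLogger("shop").addHandler(handler)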
Dashboards that tell a story
One page per service:
- SLOs + burn rates at the top
- Golden paths by route/queue topic with latency and errors over time
- Top error codes and affected tenants
- Link to recent deployments; overlays for incidents
Alerts that don’t wake you up for nothing
Alert on symptoms, not causes: SLO burn, error rate, queue depth, saturation. Use multi‑window, multi‑burn rate alerts to balance speed and noise. Page humans only when action is needed.
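To make the multi-window, multi-burn-rate idea concrete, here is a small sketch; the 99.9% target, the 1h/5m window pair, and the 14.4 threshold follow the commonly cited SRE-workbook example and are assumptions to tune for your own SLOs:

# Sketch: multi-window, multi-burn-rate paging condition for an availability SLO.
SLO_TARGET = 0.999
ERROR_BUDGET = 1.0 - SLO_TARGET          # 0.1% of requests may fail

def burn_rate(error_ratio: float) -> float:
    """How many times faster than 'exactly on budget' the budget is burning."""
    return error_ratio / ERROR_BUDGET

def should_page(err_1h: float, err_5m: float) -> bool:
    # The long window (1h) shows the problem is sustained; the short window (5m)
    # shows it is still happening, so alerts stop firing after recovery.
    return burn_rate(err_1h) > 14.4 and burn_rate(err_5m) > 14.4

# Example: 2% errors over the last hour, 3% over the last 5 minutes.
print(should_page(err_1h=0.02, err_5m=0.03))   # True: page a human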
Cost and sampling
Full tracing is expensive at scale. Use tail‑based sampling to keep all errors and a representative slice of normal traffic. Downsample DEBUG logs in production; keep INFO for the last N minutes during incidents.
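The collector snippet above covers the tail-based part. On the SDK side, a parent-based ratio sampler is a common way to cap normal traffic before it leaves the process; the 10% ratio is an assumption, and anything dropped here never reaches the collector's error-keeping policy:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~10% of locally started traces; honor the sampling decision of any
# propagated parent so distributed traces stay complete. Ratio is an assumption.
provider = TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.10)))
trace.set_tracer_provider(provider)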
Rollout playbook
1) Instrument gateways and a single service.
2) Add DB spans and external calls.
3) Turn on tail sampling.
4) Create one dashboard and two SLOs.
5) Review weekly; prune unused signals.
Good observability pays for itself: faster incident resolution, confidence in deploys, and the ability to ask new questions without shipping new code.