Reliable queues with Azure Service Bus: patterns that survive incidents
Queues decouple producers from consumers and smooth load spikes, but only when you treat failure as a normal case. This expanded guide shows how to build Service Bus systems that keep working through retries, restarts, and network gremlins—and how to observe, test, and budget capacity so incidents don’t become all‑hands fire drills.
Architecture at a glance
- Producers write to a queue or topic (pub/sub). Use topics when multiple services react to the same event.
- Consumers run as horizontally scalable workers with explicit concurrency limits.
- DLQs capture messages that repeatedly fail. A replay tool drains DLQs when fixes ship.
- Observability emits metrics and traces with correlation IDs so you can answer “what failed and why?” quickly.
Common topology:
- Commands → queues (one consumer updates state)
- Events → topics with subscriptions per consumer (fan‑out)
Idempotency first
Design consumers to safely handle duplicates. Use a dedupe key (message ID or business key) and track processed IDs in a fast store with TTL. If the same work arrives twice, do nothing.
Idempotency storage options:
- Cache (Redis/Azure Cache for Redis) with TTL ~24–72h
- Database table with unique constraint on (dedupe_key) and a small retention job
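A minimal sketch of the cache option, assuming StackExchange.Redis and a key derived from the message ID (the connection string, key prefix, TTL, and the messageId variable are illustrative):

using StackExchange.Redis;

// Normally a shared singleton; created inline here for brevity.
var redis = await ConnectionMultiplexer.ConnectAsync("my-redis-connection-string");
var db = redis.GetDatabase();

// SET ... NX with expiry: returns false when the key already exists, i.e. this is a duplicate delivery.
bool firstDelivery = await db.StringSetAsync(
    $"dedupe:{messageId}",
    DateTimeOffset.UtcNow.ToString("O"),
    expiry: TimeSpan.FromHours(48),
    when: When.NotExists);

if (!firstDelivery)
{
    // Duplicate: acknowledge the message without re-running the handler.
}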
Outbox pattern (producer side): write the business change and message payload in one DB transaction, then a background process publishes and marks sent. This prevents “DB committed but message lost” scenarios.
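A sketch of that flow, assuming a SQL Outbox table with Payload and Sent columns plus the Microsoft.Data.SqlClient and Azure.Messaging.ServiceBus packages; table, queue, and variable names are illustrative:

using Azure.Messaging.ServiceBus;
using Microsoft.Data.SqlClient;

// 1) Producer: business change and outbox row committed in one transaction.
await using var conn = new SqlConnection(dbConnectionString);
await conn.OpenAsync();
await using var tx = conn.BeginTransaction();

var update = new SqlCommand("UPDATE Orders SET Status = 'Paid' WHERE Id = @id", conn, tx);
update.Parameters.AddWithValue("@id", orderId);
await update.ExecuteNonQueryAsync();

var insert = new SqlCommand("INSERT INTO Outbox (Payload, Sent) VALUES (@payload, 0)", conn, tx);
insert.Parameters.AddWithValue("@payload", payloadJson);
await insert.ExecuteNonQueryAsync();

tx.Commit();

// 2) Background publisher: read unsent rows, send, then mark them sent.
await using var client = new ServiceBusClient(serviceBusConnectionString);
ServiceBusSender sender = client.CreateSender("orders");
// ...for each unsent outbox row:
await sender.SendMessageAsync(new ServiceBusMessage(payloadJson) { MessageId = outboxRowId.ToString() });
// then: UPDATE Outbox SET Sent = 1 WHERE Id = @outboxRowId

If the publisher crashes between send and the "mark sent" update, the row is sent again on the next pass; the consumer's dedupe key (here, the outbox row ID used as MessageId) absorbs the duplicate.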
Retry and backoff
Handle transient errors with exponential backoff and jitter. Limit concurrent message processing to match downstream capacity. Avoid tight retry loops that amplify incidents.
Recommended knobs (per processor):
- MaxConcurrentCalls: start small (2–8) and scale with monitoring
- Retry policy: exponential, max 3–5 attempts before dead-letter
- Circuit breaker around downstreams to reduce pressure when they fail (see the sketch after this list)
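One way to wire the backoff and circuit breaker around a downstream call, sketched with the Polly library; the thresholds, httpClient, downstreamUrl, and content are illustrative:

using Polly;

var jitter = new Random();

// Exponential backoff with jitter for transient downstream failures.
var retry = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(
        retryCount: 4,
        sleepDurationProvider: attempt =>
            TimeSpan.FromMilliseconds(Math.Pow(2, attempt) * 250 + jitter.Next(0, 250)));

// Stop hammering a failing dependency: open after 5 consecutive failures, probe again after 30s.
var breaker = Policy
    .Handle<HttpRequestException>()
    .CircuitBreakerAsync(exceptionsAllowedBeforeBreaking: 5, durationOfBreak: TimeSpan.FromSeconds(30));

// Retry wraps the breaker; when the circuit is open, BrokenCircuitException is not retried and the call fails fast.
var resilient = Policy.WrapAsync(retry, breaker);

await resilient.ExecuteAsync(() => httpClient.PostAsync(downstreamUrl, content));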
Poison messages and dead-letter queues
When a message fails repeatedly, dead-letter it with a reason and diagnostic context. Build a small replay tool that fixes the root cause and requeues safely.
DLQ workflow:
- DLQ alert fires (count or age threshold)
- Triage sample messages: identify root cause (bad payload, downstream outage, code defect)
- Fix or add compensating logic
- Replay through a controlled tool in batches with backoff
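A sketch of the controlled replay step, assuming Azure.Messaging.ServiceBus; the queue name, batch size, and delay are illustrative:

using Azure.Messaging.ServiceBus;

await using var client = new ServiceBusClient(connectionString);
ServiceBusReceiver dlq = client.CreateReceiver(
    "orders", new ServiceBusReceiverOptions { SubQueue = SubQueue.DeadLetter });
ServiceBusSender sender = client.CreateSender("orders");

while (true)
{
    var batch = await dlq.ReceiveMessagesAsync(maxMessages: 50, maxWaitTime: TimeSpan.FromSeconds(5));
    if (batch.Count == 0) break;

    foreach (ServiceBusReceivedMessage dead in batch)
    {
        // Optionally transform the payload here if the root cause was a bad message shape.
        var clone = new ServiceBusMessage(dead);   // copies the body and application properties
        await sender.SendMessageAsync(clone);
        await dlq.CompleteMessageAsync(dead);      // remove from the DLQ only after the resend succeeds
    }

    await Task.Delay(TimeSpan.FromSeconds(2));     // pace the replay so consumers aren't flooded
}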
Ordering
For strict ordering, use sessions (FIFO) and single-consumer semantics per session. Otherwise, design handlers to be commutative so order doesn’t matter.
Session tips:
- Use a stable session ID (e.g., OrderId) to preserve per‑entity order
- Process one session per worker to avoid interleaving
- Set session idle timeout to move stuck sessions along
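A sketch of both sides, assuming Azure.Messaging.ServiceBus, a session-enabled queue, and OrderId as the session key; handler and variable names are illustrative:

using Azure.Messaging.ServiceBus;

// Producer: a stable SessionId keeps per-order messages in FIFO order.
await sender.SendMessageAsync(new ServiceBusMessage(payload) { SessionId = orderId });

// Consumer: one session is owned by one worker at a time; messages within it are serialized.
var options = new ServiceBusSessionProcessorOptions
{
    MaxConcurrentSessions = 8,                     // parallelism across different orders
    MaxConcurrentCallsPerSession = 1,              // strict order within a single order
    SessionIdleTimeout = TimeSpan.FromSeconds(30)  // release a quiet session so others make progress
};

ServiceBusSessionProcessor processor = client.CreateSessionProcessor("orders", options);
processor.ProcessMessageAsync += args => HandleOrderMessageAsync(args); // hypothetical per-order handler returning Task
processor.ProcessErrorAsync += args => Task.CompletedTask;              // log in real code
await processor.StartProcessingAsync();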
Observability
Emit metrics: messages processed, failures, DLQ count, processing latency. Add correlation IDs through logs for end-to-end tracing.
What to log per message:
- correlationId (propagated from the producer) and messageId
- dedupeKey, retry count, handler name
- Downstream response codes and durations
Key metrics and alerts:
- Queue length, age of oldest message
- Success rate, error rate (by exception type)
- DLQ depth and age; alerts when age > N minutes
- Handler latency P95/P99
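A sketch of what that looks like inside a handler, assuming System.Diagnostics.Metrics and Microsoft.Extensions.Logging; metric names, property names, and the msg/logger variables are illustrative:

using System.Diagnostics;
using System.Diagnostics.Metrics;
using Microsoft.Extensions.Logging;

// Created once per process in real code; shown inline for brevity.
var meter = new Meter("OrderWorker");
var processed = meter.CreateCounter<long>("messages_processed");
var latencyMs = meter.CreateHistogram<double>("handler_latency_ms");

// Pull the correlation ID off the message and scope every log line to it.
string? correlationId = msg.ApplicationProperties.TryGetValue("correlationId", out var value)
    ? value?.ToString()
    : msg.CorrelationId;

using (logger.BeginScope(new Dictionary<string, object?>
{
    ["correlationId"] = correlationId,
    ["messageId"] = msg.MessageId,
    ["deliveryCount"] = msg.DeliveryCount
}))
{
    var stopwatch = Stopwatch.StartNew();
    await HandleBusinessLogicAsync(msg.Body.ToString());
    latencyMs.Record(stopwatch.Elapsed.TotalMilliseconds);
    processed.Add(1);
}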
Example handler (C#)
public async Task HandleAsync(ProcessMessageEventArgs args)
{
    var msg = args.Message;
    var key = msg.MessageId; // dedupe key; a business key from the payload also works

    // Duplicate delivery: the work is already done, so just settle the message.
    if (await cache.ExistsAsync(key)) { await args.CompleteMessageAsync(msg); return; }

    try
    {
        await HandleBusinessLogicAsync(msg.Body.ToString());
        await cache.SetAsync(key, true, TimeSpan.FromHours(24)); // mark processed before completing
        await args.CompleteMessageAsync(msg);
    }
    catch (TransientException)
    {
        await args.AbandonMessageAsync(msg); // back on the queue; delivery count increments and it is retried
    }
    catch (Exception ex)
    {
        await args.DeadLetterMessageAsync(msg, "handler_error", ex.Message); // poison: park it with a reason
    }
}
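For context, a handler like this is typically attached to a ServiceBusProcessor roughly as follows; the queue name and settings are illustrative, and explicit settlement in the handler is why AutoCompleteMessages is off:

using Azure.Messaging.ServiceBus;

await using var client = new ServiceBusClient(connectionString);

ServiceBusProcessor processor = client.CreateProcessor("orders", new ServiceBusProcessorOptions
{
    MaxConcurrentCalls = 4,        // start small; scale from measurements
    PrefetchCount = 20,
    AutoCompleteMessages = false   // the handler completes/abandons/dead-letters explicitly
});

processor.ProcessMessageAsync += HandleAsync;
processor.ProcessErrorAsync += args =>
{
    // Transport and entity-level errors (not handler exceptions) land here.
    Console.Error.WriteLine($"{args.ErrorSource}: {args.Exception.Message}");
    return Task.CompletedTask;
};

await processor.StartProcessingAsync();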
Throughput knobs
Tune PrefetchCount, MaxConcurrentCalls, and message size. Batch send where possible; compress large payloads but keep them under the tier's message size limits.
Capacity planning rough cut:
- Max sustainable rate per worker ≈ MaxConcurrentCalls × (1 / average processing seconds)
- Keep CPU < 70% and watch downstream saturation before increasing concurrency
- Prefer more workers with moderate concurrency over one worker with huge concurrency
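As a rough illustration with made-up numbers: at MaxConcurrentCalls = 8 and an average handler time of 250 ms, one worker sustains roughly 8 × 4 = 32 messages per second, so three workers give about 96 messages per second of headroom, provided the downstream can keep up.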
Disaster recovery
Enable geo-redundant namespaces or paired regions; make producers resilient to failover. Persist outbox operations so messages aren’t lost when databases commit but send fails.
Failover plan:
- Producers: write via alias/Geo‑DR connection string; on failover, reconnect and resume outbox drain
- Consumers: detect namespace alias change; restart workers with exponential backoff
Testing and chaos drills
- Unit test handlers with realistic payloads and edge cases
- Contract fixtures: validate payload schemas and versioning between services
- Integration tests using Service Bus emulator or ephemeral namespaces
- Chaos: pause downstream dependency or inject timeouts; ensure retries and DLQ behavior match expectations
Security and governance
- Use Managed Identity for producers/consumers; assign least‑privilege roles
- Queue/topic names and access as code (Bicep/Terraform) with reviews
- Encrypt sensitive fields; avoid large PII payloads—store references
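A minimal sketch of passwordless auth, assuming Azure.Identity and that the identity already holds the Azure Service Bus Data Sender/Receiver roles (namespace name is illustrative):

using Azure.Identity;
using Azure.Messaging.ServiceBus;

// DefaultAzureCredential resolves the managed identity in Azure and a developer login locally.
var client = new ServiceBusClient(
    "my-namespace.servicebus.windows.net",
    new DefaultAzureCredential());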
End‑to‑end flow (command + events)
- API receives request → writes domain change and outbox row
- Outbox publisher sends command to queue; worker processes; emits domain event to topic
- Subscribed services react to event; any failure routes to their DLQs
- Observability links all spans with a shared correlation ID
Runbook snippet: DLQ replay
- Stop auto‑scalers to reduce noise
- Sample 20 messages; categorize root cause
- If payload fix needed, run a transform step; if downstream transient, requeue in batches of 50 with 1–5s delay
- Monitor error budget while replaying; halt on new spikes
Reference configuration (worker)
{
  "ServiceBus": {
    "PrefetchCount": 20,
    "MaxConcurrentCalls": 8,
    "MaxAutoRenewDurationSeconds": 300,
    "Retry": { "Mode": "Exponential", "MaxRetries": 5, "DelayMs": 500, "MaxDelayMs": 15000 }
  }
}
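One way to map these keys onto the SDK options, assuming Microsoft.Extensions.Configuration and an IConfiguration instance from the host; note that MaxAutoRenewDurationSeconds corresponds to MaxAutoLockRenewalDuration in Azure.Messaging.ServiceBus:

using Azure.Messaging.ServiceBus;
using Microsoft.Extensions.Configuration;

var sb = configuration.GetSection("ServiceBus");

var clientOptions = new ServiceBusClientOptions
{
    RetryOptions = new ServiceBusRetryOptions
    {
        Mode = ServiceBusRetryMode.Exponential,   // matches "Mode": "Exponential" above
        MaxRetries = sb.GetValue<int>("Retry:MaxRetries"),
        Delay = TimeSpan.FromMilliseconds(sb.GetValue<int>("Retry:DelayMs")),
        MaxDelay = TimeSpan.FromMilliseconds(sb.GetValue<int>("Retry:MaxDelayMs"))
    }
};

var processorOptions = new ServiceBusProcessorOptions
{
    PrefetchCount = sb.GetValue<int>("PrefetchCount"),
    MaxConcurrentCalls = sb.GetValue<int>("MaxConcurrentCalls"),
    MaxAutoLockRenewalDuration = TimeSpan.FromSeconds(sb.GetValue<int>("MaxAutoRenewDurationSeconds")),
    AutoCompleteMessages = false
};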
Production checklist
- Idempotency keys stored with TTL; outbox on producers
- Exponential retries with max attempts; circuit breakers
- DLQ alerts on depth and age; replay tool with audit trail
- Sessions for entities that require ordering
- Prefetch and concurrency sized from measurements
- Correlation IDs + metrics + traces; dashboards/alerts
- Managed Identity; infra as code; secrets out of code
- Chaos drills quarterly; runbooks updated
With these patterns, Service Bus becomes a safety net—not a source of surprises. Your queues can absorb spikes, isolate failures, and recover gracefully—while giving operators the tools to see and fix issues fast.