How to Build a Reliable Webhook Delivery System
You'll need queues, retries, state machines, idempotency, observability, and six months you didn't budget for. Here's the real architecture.
The Deceptively Simple Starting Point
Every webhook delivery system starts the same way: receive an event, POST it to a URL, check the status code. Fifty lines of code. It works on localhost, it works in staging, and it works in production until it doesn't.
We've written before about why webhooks break in production. This post is different. This is the engineering guide for teams who've decided to build the infrastructure themselves. I've done it. I know exactly what you're about to walk into.
This isn't theory. I built this system because my own services needed it. Every architectural decision here comes from real production failures, not whitepapers.
Let's walk through what a production-grade webhook delivery system actually requires.
Component 1: The Ingestion Layer
Before you deliver anything, you need to receive events reliably. This means:
- An HTTP endpoint that accepts events, validates the payload schema, and returns a 202 (accepted, not processed) within milliseconds
- A durable write. The event must be persisted before you acknowledge it. If your process crashes between the 202 response and writing to the store, the event is gone forever
- Schema validation at the gate. Malformed payloads should be rejected with a 400, not discovered three retries later in a handler
The temptation is to process inline: receive the event, call the destination, return the result. This falls apart the moment your destination is slow, down, or rate-limited. You've coupled your ingestion availability to your destination's availability.
Separate ingestion from delivery. Accept fast, process async. This is the first architectural decision that separates toy systems from real ones.
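A minimal sketch of that split, with stdlib pieces standing in for real infrastructure (sqlite for the durable store, a hypothetical two-field schema): validate, persist, then acknowledge with 202, and let a separate worker handle delivery.

```python
import json
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (id TEXT PRIMARY KEY, payload TEXT, status TEXT)")

REQUIRED_FIELDS = {"type", "data"}  # illustrative minimal schema

def ingest(raw_body: bytes) -> tuple[int, dict]:
    """Validate at the gate, write durably, then return 202 (accepted, not processed)."""
    try:
        event = json.loads(raw_body)
    except ValueError:
        return 400, {"error": "malformed JSON"}
    if not isinstance(event, dict) or not REQUIRED_FIELDS <= event.keys():
        return 400, {"error": "missing required fields"}
    event_id = event.get("id") or str(uuid.uuid4())
    # Durable write BEFORE the ack: if we crash after the commit, the
    # dispatcher still finds the event; if we crash before it, the
    # caller never received a 202 and knows to retry.
    db.execute("INSERT INTO events VALUES (?, ?, 'pending')",
               (event_id, json.dumps(event)))
    db.commit()
    return 202, {"id": event_id}
```

Note that no delivery happens in the request path: the handler's availability no longer depends on any destination's availability.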
Component 2: The Event Store
You need a durable, queryable record of every event that enters the system. Not just for debugging, but for replay, audit, and idempotency.
Your schema needs at minimum:
- Event ID (caller-provided or system-generated UUID)
- Source identifier
- Event type
- Payload (stored verbatim, never transform on write)
- Received timestamp
- Delivery status (pending, delivering, delivered, failed, dead-lettered)
This store is your source of truth. When something goes wrong, and it will, you need to answer: what event came in, when, from whom, and what happened to it. If you can't answer that in under a minute, your on-call engineer is going to have a long night.
A Postgres table works fine here until you're processing thousands of events per second. Don't prematurely reach for Kafka. You'll know when you need it.
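The minimal schema above can be sketched as DDL. Shown here via sqlite3 so it runs anywhere; in production this would be a Postgres table with the same columns, and the status values match the delivery states listed earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE events (
    event_id    TEXT PRIMARY KEY,   -- caller-provided or generated UUID
    source      TEXT NOT NULL,      -- source identifier
    event_type  TEXT NOT NULL,
    payload     TEXT NOT NULL,      -- stored verbatim, never transformed on write
    received_at TEXT NOT NULL,      -- e.g. ISO-8601 UTC timestamp
    status      TEXT NOT NULL DEFAULT 'pending'
                CHECK (status IN ('pending', 'delivering', 'delivered',
                                  'failed', 'dead-lettered'))
)
""")
```

The CHECK constraint is cheap insurance: an invalid status written by a buggy worker fails loudly at the database instead of silently corrupting your source of truth.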
Component 3: The Dispatch Queue
Events move from the store to a dispatch queue. This is where most teams reach for SQS, RabbitMQ, or Redis streams. Any of them work. What matters is how you configure them:
- Visibility timeout must exceed your maximum delivery attempt time (including TCP connect, TLS handshake, request, and response read). Set it too low and you'll deliver the same event concurrently from two workers.
- Ordering guarantees matter more than you think. If a user updates their email and then deletes their account, processing those out of order creates a ghost record. FIFO per-source or per-entity is the safe default, but it kills throughput. You'll spend a week on this tradeoff alone.
- Backpressure. When the queue grows faster than you drain it, what happens? If the answer is "the queue gets bigger until we run out of memory," you don't have a system. You have a time bomb.
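One way to make the backpressure question explicit is a hard capacity on the dispatch queue, sketched here with the stdlib `queue` module (the limit of 1000 is arbitrary): when the queue is full, enqueue fails fast and the ingestion layer can signal retry-later upstream instead of growing until the process dies.

```python
import queue

# Hypothetical bounded dispatch queue. In a real system this role is
# played by SQS/RabbitMQ/Redis plus an explicit depth alarm.
dispatch_q: "queue.Queue[str]" = queue.Queue(maxsize=1000)

def enqueue(event_id: str) -> bool:
    """Returns False under backpressure instead of blocking or growing unbounded."""
    try:
        dispatch_q.put_nowait(event_id)
        return True
    except queue.Full:
        return False  # caller decides: shed, slow producers, or return 503
```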
Component 4: The Retry Engine
This is where teams consistently underestimate the complexity.
A naive retry is: delivery failed, wait, try again. A production retry engine needs:
Exponential backoff with jitter. Without jitter, when a destination comes back online after an outage, every queued retry hits it at the same instant. You've just created a thundering herd that takes it down again. The formula is straightforward, min(base * 2^attempt + random_jitter, max_delay), but getting the constants right for your traffic patterns takes iteration.
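The formula translates directly to code. The constants here (1-second base, 5-minute cap, uniform jitter up to one base interval) are illustrative starting points, not recommendations; as noted, tuning them takes iteration.

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, max_delay: float = 300.0) -> float:
    """min(base * 2^attempt + random_jitter, max_delay).

    The jitter spreads retries out so a recovering destination isn't
    hit by every queued attempt at the same instant.
    """
    jitter = random.uniform(0, base)  # "full jitter" variants also work
    return min(base * 2 ** attempt + jitter, max_delay)
```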
Per-destination retry budgets. An endpoint that's been returning 503 for six hours shouldn't consume the same retry resources as one that had a single transient timeout. You need circuit-breaker logic: after N consecutive failures, back off aggressively or pause delivery to that endpoint entirely.
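The consecutive-failure breaker can be sketched in a few lines; the threshold of 5 is an assumption, and a production version would add a half-open state that probes the endpoint after a cooldown.

```python
class CircuitBreaker:
    """Per-endpoint breaker: opens after N consecutive failures,
    resets on any success."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, success: bool) -> None:
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1

    @property
    def open(self) -> bool:
        # While open, the dispatcher skips or heavily throttles this endpoint.
        return self.consecutive_failures >= self.threshold
```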
Retry state machine. Each delivery attempt is a state transition:
pending -> delivering -> delivered
                      -> retrying (schedule next attempt)
                      -> failed (retries exhausted, dead letter)
Every transition must be atomic and durable. If your worker crashes mid-delivery, the event must return to a retryable state, not vanish and not deliver twice.
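The standard way to get atomic transitions without distributed locks is a compare-and-set in the event store: the UPDATE only succeeds if the row is still in the expected state, so if two workers race on the same event, exactly one wins. A sketch using sqlite in place of Postgres:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO events VALUES ('e1', 'pending')")
conn.commit()

def transition(event_id: str, from_status: str, to_status: str) -> bool:
    """Atomic compare-and-set: the WHERE clause guards the expected
    current state, so a concurrent worker that already moved the row
    gets rowcount 0 and backs off."""
    cur = conn.execute(
        "UPDATE events SET status = ? WHERE id = ? AND status = ?",
        (to_status, event_id, from_status))
    conn.commit()
    return cur.rowcount == 1
```

A worker that crashes mid-delivery leaves the row in `delivering`; a reaper process can return stale `delivering` rows to `retrying` after a timeout.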
Status code interpretation. Not all failures are retryable. A 500 is temporary; retry it. A 410 (Gone) means the endpoint was deliberately removed; stop retrying and alert. A 429 means slow down and respect the Retry-After header if present. A timeout might mean the server is processing slowly or that it's completely dead. You need different strategies for each.
Component 5: Idempotency at the Infrastructure Layer
Your delivery system will, at some point, deliver the same event more than once. Network partitions, worker crashes during the acknowledgment window, queue visibility timeout races: at-least-once delivery is the best you can guarantee without extraordinary complexity.
The question is: where do you handle deduplication?
If you push it to every consumer, every handler on every endpoint must implement its own idempotency logic. That's a maintenance burden that scales linearly with your integration count, and one missed handler means duplicate charges or duplicate orders.
If you handle it at the infrastructure layer, you need a deduplication store, typically a fast key-value lookup (Redis, or a Postgres table with a unique constraint on event ID + endpoint). Before dispatching, check if this event was already successfully delivered to this endpoint. If yes, skip it.
The catch: your deduplication window. You can't store every event ID forever. Most systems use a rolling window, 24 to 72 hours. Events replayed outside that window will be delivered again. Document this clearly, because someone will replay a week-old event and be surprised.
Component 6: Endpoint Health Tracking
Knowing that a single delivery failed isn't enough. You need to know that an endpoint is unhealthy.
Track per-endpoint metrics:
- Success rate over rolling windows (1 hour, 24 hours)
- Current consecutive failure count
- Average response latency
- Last successful delivery timestamp
When an endpoint's health degrades past a threshold, you need automated responses: reduce delivery rate, activate circuit breaker, notify the endpoint owner. Without this, a single failing endpoint consumes your retry budget and starves healthy endpoints of delivery capacity.
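A minimal per-endpoint tracker might look like this. It windows by attempt count for brevity; a real one would window by time (the 1-hour and 24-hour windows above), and the degradation thresholds here are placeholders.

```python
from collections import deque

class EndpointHealth:
    """Tracks rolling success rate and consecutive failures per endpoint."""

    def __init__(self, window: int = 100):
        self.outcomes: deque[bool] = deque(maxlen=window)
        self.consecutive_failures = 0
        self.last_success_at: float | None = None

    def record(self, success: bool, at: float = 0.0) -> None:
        self.outcomes.append(success)
        if success:
            self.consecutive_failures = 0
            self.last_success_at = at
        else:
            self.consecutive_failures += 1

    @property
    def success_rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 1.0

    def degraded(self, min_rate: float = 0.9, max_failures: int = 10) -> bool:
        # Crossing either threshold triggers the automated responses:
        # rate reduction, circuit breaker, owner notification.
        return self.success_rate < min_rate or self.consecutive_failures >= max_failures
```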
This is one of those components nobody thinks about until it causes an outage. You'll end up building a small monitoring system inside your delivery system.
Component 7: Security
Every event in transit must be signed. The standard approach is HMAC-SHA256: hash the raw request body with a shared secret, include the signature in a header. The destination verifies it before processing.
But signing creates operational complexity:
- Secret rotation. You need to support two active secrets simultaneously during rotation: verify against both, sign with the new one. This means your signing logic isn't just hmac(secret, body). It's hmac(secrets[current], body) with verification against every active secret.
- Replay protection. A signature alone doesn't prevent an attacker from capturing a signed request and replaying it. Include a timestamp in the signed payload. The destination should reject events older than a tolerance window (typically 5 minutes).
- Payload integrity. Sign the exact bytes you send. If any middleware transforms the payload between signing and delivery (JSON reformatting, encoding changes), the signature breaks. This bug is subtle and infuriating to debug.
Component 8: Observability
You're now operating a distributed system with queues, workers, retries, state machines, and external HTTP calls. You need to see inside it.
At minimum:
- Structured logs on every state transition (event received, delivery attempted, delivery succeeded/failed, retried, dead-lettered)
- Metrics: delivery latency percentiles, retry rates, queue depth, per-endpoint success rates, dead-letter queue size
- Traces: ideally a single trace ID that follows an event from ingestion through every delivery attempt to final disposition
- Alerting: dead-letter queue growing, endpoint circuit breaker tripped, delivery latency exceeding SLA
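For the first bullet, one structured line per state transition is enough to reconstruct any event's history. A sketch of the shape such a log record might take; the field names are illustrative, and a real system would route this through its logging library rather than stdout.

```python
import json
import time

def log_transition(event_id: str, from_status: str, to_status: str,
                   trace_id: str, **fields) -> str:
    """Emit one structured log line per state transition, carrying the
    trace ID that follows the event from ingestion to final disposition."""
    record = {
        "ts": fields.pop("ts", time.time()),
        "event_id": event_id,
        "trace_id": trace_id,
        "transition": f"{from_status}->{to_status}",
        **fields,  # free-form context: endpoint, attempt number, status code
    }
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line
```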
If you're already running Prometheus/Grafana or Datadog, instrumenting the system is straightforward. Building the instrumentation points into every component from day one is the discipline that separates a maintainable system from one that becomes a black box under pressure.
The Iceberg
If you've been keeping count, you're now maintaining: an ingestion API, a durable event store, a dispatch queue with ordering and backpressure, a retry engine with exponential backoff and circuit breakers, a state machine with atomic transitions, an idempotency layer with windowed deduplication, per-endpoint health tracking, signature generation with key rotation, replay protection, and a full observability stack.
That's eight components, each with its own failure modes, edge cases, and operational burden. And we haven't discussed multi-tenancy, rate limiting, event filtering, transformation pipelines, or geographic distribution.
And then there's the question nobody asks until it's too late: what happens when the thing responding to your webhooks isn't a human, but an AI agent? An agent that provisions resources, issues refunds, or updates configuration based on incoming events needs more than a delivery pipeline. It needs governed execution. Spending limits, approval gates, policy enforcement, an immutable audit trail. That's an entire control plane on top of the delivery system you just built.
Most teams that set out to build this estimate a quarter. Most teams that ship it production-ready report six to twelve months of dedicated engineering. And then you maintain it. Forever.
The Alternative
I know what it takes because I've done it. I built this system for my own services. My own projects needed reliable event delivery, and nothing on the market fit. The architecture above isn't hypothetical; it's the blueprint of what became Duerelay.
My own services are still Duerelay's first customers. Every event they process goes through the same control plane you'd use. And when I started connecting AI agents to those events, the need for governed execution became obvious. That's why Duerelay includes an Agent Control Plane: policy enforcement, spend controls, approval workflows, and full observability for every action an agent takes in response to a webhook.
If you have the team, the time, and the appetite, build it. It's a fascinating engineering challenge. But if your goal is to ship product and not maintain webhook infrastructure, the control plane is ready.
Ready to govern your webhooks?
Start with the sandbox. No credit card, no commitment. 500 events/day to see the control plane in action.
Open Sandbox →