Engineering March 27, 2026 · 8 min read

Why Webhooks Break in Production — and What We Built to Fix It

Every engineering team that depends on webhooks eventually hits the same wall: a silent failure, a duplicate charge, a retry storm that takes down a queue. We spent years watching these incidents repeat across companies. This is the story of why we built a webhook control plane.

The Problem Nobody Talks About

Webhooks are the connective tissue of modern software. Stripe sends a payment_intent.succeeded event and your app provisions access. Shopify fires orders/create and your fulfillment pipeline picks it up. GitHub sends a push event and your CI/CD pipeline springs to life.

This works beautifully in development. Then you go to production.

In production, the source sends a webhook and your server happens to be mid-deploy. The event is lost. Or your handler throws an exception on the third retry and the source gives up. Or, worse, the event arrives twice and your billing system processes the same charge for the same customer two minutes apart.

These are not edge cases. They are the default behavior of webhooks when left ungoverned.

The Three Ways Webhooks Fail

After years of watching production incidents pile up, we found that webhook failures cluster into three categories:

1. Silent drops

The webhook source sends an event. Your endpoint returns a 500, or a timeout, or your load balancer eats the request during a rolling deploy. The source retries a few times using exponential backoff, then stops. No alert fires on your side because your application never saw the event in the first place.
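
To see how short that retry window can be, here is a minimal sketch of an exponential backoff schedule. The base delay, multiplier, and attempt count are illustrative assumptions, not any particular provider's retry policy.

```python
def backoff_schedule(
    base_seconds: float = 1.0, factor: float = 2.0, max_attempts: int = 5
) -> list[float]:
    """Delay before each retry for a hypothetical webhook source."""
    return [base_seconds * factor**attempt for attempt in range(max_attempts)]


def retry_window(schedule: list[float]) -> float:
    """Total time the source keeps trying before it gives up."""
    return sum(schedule)


schedule = backoff_schedule()
print(schedule)                # [1.0, 2.0, 4.0, 8.0, 16.0]
print(retry_window(schedule))  # 31.0
```

With these assumed numbers, any outage longer than 31 seconds outlasts the source's retries, and the event is gone.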

The customer notices two days later when their subscription wasn't activated. Your support team opens a ticket. An engineer digs through logs manually. This cycle repeats.

2. Duplicate processing

The source sends an event. Your handler processes it and returns a 200, but the response doesn't reach the source before its timeout expires, so the source retries. Your handler processes the event again. The customer gets charged twice, or their account is downgraded and re-upgraded in rapid succession, or two competing fulfillment orders are dispatched.

Idempotency is the textbook answer, but implementing it correctly across every handler, for every event type, with distributed state, is a project in itself.
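
For a single handler in a single process, the core idea fits in a few lines. The sketch below assumes the source attaches a stable, unique ID to every event and reuses it on retries. The in-memory set is a stand-in for a shared store such as a database table with a unique constraint, and the class name is hypothetical.

```python
import threading
from typing import Callable


class IdempotentHandler:
    """Wraps a handler so each event ID is processed at most once."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self._handler = handler
        self._seen: set[str] = set()  # stand-in for a shared, durable store
        self._lock = threading.Lock()

    def handle(self, event: dict) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        event_id = event["id"]
        with self._lock:
            if event_id in self._seen:
                return False
            # Claim the ID before running the handler so a concurrent retry is rejected.
            self._seen.add(event_id)
        try:
            self._handler(event)
        except Exception:
            # Release the claim so a later retry can be processed.
            with self._lock:
                self._seen.discard(event_id)
            raise
        return True


charges: list[int] = []
billing = IdempotentHandler(lambda event: charges.append(event["amount"]))

billing.handle({"id": "evt_1", "amount": 500})  # processed
billing.handle({"id": "evt_1", "amount": 500})  # retry of the same event, skipped
print(charges)  # [500]
```

The hard part is everything this sketch leaves out: a store shared across instances, expiry of old IDs, and handlers that fail halfway through a side effect.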

3. Cascade failures

A spike in webhook traffic overwhelms your consumer. Processing slows down. The backlog grows. Retries from the source pile on top of the backlog. Your queue fills up. Other, unrelated events start failing because they share the same processing pipeline.

Now one misbehaving source has taken down event processing for every integration.

What Teams Usually Build

Most teams hit these problems and reach for the same toolkit:

A retry queue, usually SQS or Redis, to buffer events and retry on failure
Idempotency keys stored in a database, checked before each handler runs
Dead-letter queues that collect failed events for manual review
Log-based debugging: grep through CloudWatch or Datadog to piece together what happened
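
A stripped-down version of the first and third items might look like the sketch below. It uses an in-process deque in place of SQS or Redis, and the function name, event shape, and attempt limit are assumptions made for illustration.

```python
from collections import deque
from typing import Callable


def drain(
    queue: deque, handler: Callable[[dict], None], max_attempts: int = 3
) -> list[dict]:
    """Process every queued (event, attempts) pair, retrying failures.

    Returns the dead-letter list: events that failed on every attempt.
    A real worker would wait between attempts; the delay is omitted here.
    """
    dead_letters: list[dict] = []
    while queue:
        event, attempts = queue.popleft()
        try:
            handler(event)
        except Exception as exc:
            attempts += 1
            if attempts >= max_attempts:
                dead_letters.append(
                    {"event": event, "error": str(exc), "attempts": attempts}
                )
            else:
                queue.append((event, attempts))  # back of the queue for another try
    return dead_letters


def handle_order(event: dict) -> None:
    if "sku" not in event:
        raise ValueError("missing sku")


queue = deque([
    ({"type": "orders/create", "sku": "A-1"}, 0),
    ({"type": "orders/create"}, 0),  # malformed, will never succeed
])
dead = drain(queue, handle_order)
print(len(dead), dead[0]["attempts"])  # 1 3
```

Note what ends up in the dead-letter list: the event and an error string, with no record of which policy applied or who was allowed to send it.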

This works. For a while. But it's bespoke infrastructure that every team rebuilds from scratch, maintained by engineers who'd rather be building product. And it doesn't address the deeper question: who governs the pipeline?

The deeper question

Retry logic tells you how to recover from a failure. It doesn't tell you whether an event should have been processed in the first place, whether the source was authorized to send it, or what policy was in effect when the decision was made.

The Control Plane Approach

We started Duerelay with a simple premise: webhook delivery is an infrastructure problem, not an application problem. The same way you don't write your own TLS stack or build your own load balancer, you shouldn't be hand-rolling webhook governance.

A webhook control plane sits between the source and your application. Every event passes through it. The control plane handles four responsibilities:

Authentication. Verify that the webhook actually came from who it claims to be. Validate signatures. Reject unauthorized sources before the event ever touches your application.
Policy enforcement. Apply rules before execution: rate limits, budget caps, content validation. If an event violates a policy, it is not silently dropped; it is explicitly denied, with a full audit record.
Deterministic execution. Deliver each event exactly once. Handle retries with idempotency guarantees built into the infrastructure layer. If a delivery fails, the control plane retries with full context, not a blind re-send.
Observability. Every event, every delivery attempt, every policy decision is traced and auditable. Not as an afterthought bolted onto application logs, but as a first-class property of the pipeline.
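
As an illustration of the authentication step, here is a minimal sketch of signature validation using HMAC-SHA256 over the raw request body. This is a generic scheme, not Duerelay's implementation or any specific provider's format. Real providers differ in header names, encodings, and whether a timestamp is included in the signed payload, and the secret below is a made-up value.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def is_authentic(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Check the signature with a constant-time comparison."""
    expected = sign(secret, body).encode()
    return hmac.compare_digest(expected, received_signature.encode())


secret = b"whsec_example"  # hypothetical shared secret
body = b'{"type": "payment_intent.succeeded"}'
signature = sign(secret, body)

print(is_authentic(secret, body, signature))         # True
print(is_authentic(secret, body + b" ", signature))  # False
```

The comparison runs over the exact bytes received. Parsing the JSON and re-serializing it before verification can change the bytes, and a valid signature would then fail to match.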

What Changes When You Have Governance

When the pipeline itself enforces the rules, the failure modes change fundamentally:

Silent drops become impossible. Every event is acknowledged, tracked, and retried with full visibility. If delivery fails after all retries, an incident is created, not a silent void.

Duplicate processing is handled at the infrastructure layer. Your application code doesn't need to know about idempotency keys or deduplication windows. The control plane handles it.

Cascade failures are contained. Per-source rate limiting and policy gates mean that one misbehaving integration can't take down the entire pipeline. Events are isolated by source, by endpoint, by policy.
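
One common way to get this kind of isolation is a token bucket per source. The sketch below is an assumption about how such a limiter could look, not a description of Duerelay's internals. The class names and parameters are hypothetical, and the clock is injectable so the example is deterministic.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate_per_second`."""

    def __init__(
        self, rate_per_second: float, capacity: float, clock: Callable[[], float]
    ) -> None:
        self.rate = rate_per_second
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerSourceLimiter:
    """Keeps an independent bucket for each webhook source."""

    def __init__(
        self,
        rate_per_second: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._rate = rate_per_second
        self._capacity = capacity
        self._clock = clock
        self._buckets: dict[str, TokenBucket] = {}

    def allow(self, source: str) -> bool:
        if source not in self._buckets:
            self._buckets[source] = TokenBucket(self._rate, self._capacity, self._clock)
        return self._buckets[source].allow()


# A frozen clock means no refill happens during the example.
limiter = PerSourceLimiter(rate_per_second=1, capacity=2, clock=lambda: 0.0)
print([limiter.allow("shopify") for _ in range(3)])  # [True, True, False]
print(limiter.allow("stripe"))                       # True
```

The burst from one source exhausts only that source's bucket, so events from the other source are still admitted.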

And when an AI agent needs to take an action triggered by a webhook (provisioning a resource, executing a refund, updating a configuration), the control plane ensures that action runs under a governed identity, with spend limits, approval gates, and an immutable audit trail.
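
To make the idea concrete, here is a minimal sketch of a spend policy that gates an agent's actions and records every decision. All names, thresholds, and the three decision values are assumptions for illustration, not Duerelay's API. The records are frozen dataclasses, but a plain Python list only approximates an immutable audit trail.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AuditRecord:
    agent: str
    action: str
    amount_cents: int
    decision: str  # "allow", "deny", or "needs_approval"
    reason: str


@dataclass
class SpendPolicy:
    limit_cents: int           # total the agent may spend
    approval_above_cents: int  # single actions above this need a human
    spent_cents: int = 0
    audit_log: list[AuditRecord] = field(default_factory=list)

    def evaluate(self, agent: str, action: str, amount_cents: int) -> str:
        if self.spent_cents + amount_cents > self.limit_cents:
            decision, reason = "deny", "spend limit exceeded"
        elif amount_cents > self.approval_above_cents:
            decision, reason = "needs_approval", "amount above approval threshold"
        else:
            decision, reason = "allow", "within policy"
            self.spent_cents += amount_cents
        # Every decision is recorded, including denials.
        self.audit_log.append(AuditRecord(agent, action, amount_cents, decision, reason))
        return decision


policy = SpendPolicy(limit_cents=10_000, approval_above_cents=5_000)
print(policy.evaluate("refund-agent", "refund", 3_000))  # allow
print(policy.evaluate("refund-agent", "refund", 6_000))  # needs_approval
print(policy.evaluate("refund-agent", "refund", 9_000))  # deny
print(len(policy.audit_log))                             # 3
```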

Why Now

Webhooks have been around for over a decade. So why build a control plane now?

Two things changed. First, webhooks moved from "nice to have" integrations to revenue-critical infrastructure. When your billing runs on Stripe webhooks and your fulfillment runs on Shopify webhooks and your deployment pipeline runs on GitHub webhooks, a failure in any of those chains costs real money and erodes customer trust.

Second, AI agents entered the picture. Automated systems are now taking actions in response to webhook events, not just logging them. An agent that processes a refund webhook needs governed access, spending limits, and a clear audit trail. The stakes are too high for a retry queue and a prayer.

What's Next

This blog is where we'll share what we're learning as we build Duerelay. Expect deep dives into webhook reliability patterns, event-driven architecture, policy design for automated systems, and the engineering decisions behind the control plane.

If you're dealing with webhook reliability problems today, we'd love to hear what you're running into. Every failure pattern we learn about makes the control plane better for everyone.

Get started with Duerelay or reach out. We read every message.

Ready to govern your webhooks?

Start with the sandbox. No credit card, no commitment. 500 events/day to see the control plane in action.

Duerelay Team
