Engineering · April 7, 2026 · 7 min read

Your Webhook Retry Logic Is Probably Wrong

You can try all of this for 30 days on Core AI Monthly, no credit card required. But first, let's talk about why the retry code you wrote last quarter is quietly losing events.

Try It Before We Tell You What's Broken

We recently opened Core AI Monthly as a 30-day free trial with no credit card required. No payment form, no checkout screen, no "we'll charge you if you forget." You sign up, you're on the plan, and the clock starts.

We did this because telling you what's wrong with hand-rolled retry logic is one thing. Letting you watch governed retries handle your own traffic is another. Everything we describe below is something the control plane handles for you out of the box.

Start your 30-day trial and see it in action. Or keep reading to understand what your current retry code is getting wrong.

The Five Mistakes

Every team that builds webhook retry logic from scratch makes a variation of the same mistakes. We've watched this pattern repeat across dozens of integrations. These aren't edge cases or theoretical concerns. They're the bugs you discover at 2 AM when your payment processor fires a spike and your queue goes sideways.

1. You're retrying on the wrong status codes

Most retry implementations look something like this:

if (response.status >= 500) {
  scheduleRetry(event);
}

This misses several critical cases. A 429 (rate limited) should be retried, but this code treats it as a permanent failure. A timeout with no response never reaches the status check at all, because there is no response object to inspect. And a 301 redirect that your HTTP client silently follows to an error page can come back as a 200, so the delivery is marked successful even though nothing was processed.

Meanwhile, you're probably retrying on 501 (Not Implemented) and 505 (HTTP Version Not Supported), which will never succeed no matter how many times you try.

The right retry surface is not a simple range. It's a decision matrix that accounts for transient failures, rate limits, timeouts, network errors, and genuinely permanent rejections. Collapsing that into status >= 500 is the first thing that breaks.
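That decision matrix can be sketched as a small classifier. This is our own sketch, not a drop-in implementation: the function name is ours, and the exact groupings follow the cases discussed in this post.

```javascript
// Classify a delivery outcome into "success", "retry", or "dead-letter".
// `result` is { status } for an HTTP response, or { error } when the
// request never completed (timeout, DNS failure, connection reset).
function classifyDelivery(result) {
  if (result.error) return "retry";            // network error or timeout: no status to inspect
  const s = result.status;
  if (s >= 200 && s < 300) return "success";
  if (s === 429) return "retry";               // rate limited: back off and try again
  if (s === 408) return "retry";               // request timed out at the server
  if (s === 501 || s === 505) return "dead-letter"; // will never succeed, however often you retry
  if (s >= 500) return "retry";                // other 5xx: transient server failure
  if (s >= 400) return "dead-letter";          // remaining 4xx: permanent rejection
  return "dead-letter";                        // a raw 3xx here means redirects weren't followed
}
```

Note that status codes alone can't catch the redirect-to-error-page case; for that you also need to inspect whether the final URL is the one you delivered to.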

2. Your backoff is a thundering herd factory

Exponential backoff is the right idea. But vanilla exponential backoff without jitter is a coordination mechanism, not a recovery mechanism.

Here's what happens. Your endpoint goes down for 90 seconds. During that window, 200 events arrive and fail. All 200 events schedule their first retry at T+1 minute. All 200 hit the endpoint simultaneously. The endpoint, which just recovered, goes down again. Now all 200 schedule retry two at T+4 minutes. They all hit simultaneously again.

You've turned a 90-second blip into a 30-minute outage.

The fix is full jitter: randomize the delay within each backoff window so retries spread across time instead of clustering at the same instant. Without jitter, exponential backoff makes cascading failures worse, not better.

The formula that actually works

delay = random(0, min(cap, base * 2^attempt)). Full jitter. Not "add a little randomness" — truly random within the entire window. This is the approach AWS recommends and what we use internally.
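In JavaScript, that formula is a one-liner. The base and cap values below are illustrative defaults, not recommendations for your workload.

```javascript
// Full jitter: pick a delay uniformly at random from the entire
// backoff window, rather than adding noise around a fixed point.
function fullJitterDelay(attempt, baseMs = 1000, capMs = 60000) {
  const windowMs = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * windowMs; // anywhere in [0, windowMs)
}
```

Because each retry lands anywhere in the window, the 200 events from the outage example spread across the whole interval instead of firing at the same instant.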

3. You have no dead-letter strategy

Events that exhaust all retries need to go somewhere. Most implementations do one of three things, all wrong: drop the event silently, log it somewhere nobody monitors, or park it in a table with no way to send it again.

A real dead-letter strategy has three components: an alert that fires the moment an event exhausts its retries, a failed event that is inspectable with full context (payload, headers, and every delivery attempt with its response), and a one-click replay path for when the underlying issue is fixed.
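A minimal dead-letter record that supports inspection and replay might look like the following sketch; the field names are ours.

```javascript
// Capture everything needed to inspect and replay a dead-lettered event.
// An alert should fire at the moment this record is created.
function toDeadLetter(event, attempts) {
  return {
    eventId: event.id,
    payload: event.payload,
    headers: event.headers,
    attempts,                                  // every delivery attempt with its response
    deadLetteredAt: new Date().toISOString(),
    replayable: true,                          // replay re-enqueues this record as-is
  };
}
```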

4. You're not deduplicating at the infrastructure layer

Your endpoint recovers and processes an event. It returns 200. But the response takes 31 seconds, your retry system already fired attempt two at the 30-second timeout, and now the event is processed twice.

"Just add idempotency keys" is the textbook answer. In practice it pushes the problem into every consumer: each handler needs a durable store of keys it has already seen, every producer needs a consistent key scheme, and someone has to decide how long keys live before the store grows without bound.

Deduplication belongs in the infrastructure layer, not in application code. When the delivery pipeline itself guarantees that each event is delivered exactly once, your handlers don't need to know about idempotency at all.
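At the infrastructure layer, deduplication reduces to checking a seen-set keyed by event ID before dispatch. This is a deliberately naive in-memory sketch; a real pipeline needs a persistent store with expiry.

```javascript
// Deduplicating dispatcher: deliver each event ID at most once,
// even if the retry system re-enqueues it after a slow 200.
function makeDispatcher(deliver) {
  const delivered = new Set();
  return function dispatch(event) {
    if (delivered.has(event.id)) return false; // duplicate: drop it
    delivered.add(event.id);
    deliver(event);
    return true;
  };
}
```

Because the check lives in front of delivery, the handlers behind it never see a duplicate and never need idempotency logic of their own.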

5. You have no per-source isolation

This is the one that causes the worst incidents. You have 12 webhook sources feeding into the same retry queue. Source A starts sending malformed payloads that always fail. Your retry queue fills up with source A's retries. Sources B through L, all perfectly healthy, start experiencing delays because they're sharing queue capacity with a poison source.

One bad integration takes down event processing for every integration. We've seen this bring down payment processing because a low-priority notification webhook was misconfigured and its retries consumed all queue workers.

Per-source isolation means each source gets its own retry budget, its own backoff schedule, and its own dead-letter path. A misbehaving source can only hurt itself.
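The isolation boundary can be sketched as one retry queue and budget per source. The class name and budget value below are illustrative.

```javascript
// Each source gets its own retry queue with a fixed budget, so a
// poison source can only exhaust its own capacity.
class SourceQueues {
  constructor(retryBudget = 100) {
    this.queues = new Map(); // sourceId -> pending retries
    this.retryBudget = retryBudget;
  }
  enqueueRetry(sourceId, event) {
    if (!this.queues.has(sourceId)) this.queues.set(sourceId, []);
    const queue = this.queues.get(sourceId);
    if (queue.length >= this.retryBudget) return false; // this source's budget is spent
    queue.push(event);
    return true;
  }
}
```

When source A's budget is exhausted, its retries are rejected (or dead-lettered) while sources B through L keep their full capacity.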

What Changes With a Control Plane

All five of these mistakes share a root cause: retry logic is infrastructure, but it's being implemented as application code. Every team rebuilds the same thing, makes the same mistakes, and discovers them in production.

A webhook control plane moves these responsibilities into the infrastructure layer: retry classification, jittered backoff, dead-letter handling with alerting and replay, deduplication, and per-source isolation.

You stop writing retry code. You stop debugging retry code. You stop explaining to your VP of Engineering why a notification webhook took down billing.

See It on Your Own Traffic

The best way to understand the difference is to see it. Start a 30-day free trial of Core AI Monthly — no credit card, no commitment. Point your webhook sources at Duerelay, send real traffic, and watch how the control plane handles the exact scenarios we described above.

Those scenarios will find your retry queue eventually. As written, it won't handle them correctly.

Ready to govern your webhooks?

Start with the sandbox. No credit card, no commitment. 500 events/day to see the control plane in action.

Open Sandbox →
Duerelay Team
Engineering