Webhook Retry Logic Examples: Patterns, Code, Best Practices

Webhook retry logic examples with code, patterns, and best practices to prevent lost events, duplicates, and retry storms—learn how to retry safely.

WebhookGuide

April 29, 2026

Introduction

Webhooks fail in production for ordinary reasons: a receiver times out, a network path drops, a service goes down briefly, or an API returns 429 Too Many Requests because of rate limiting. Without retry logic, those failures become lost events. With the wrong retry logic, they become duplicate processing, retry storms, and noisy incidents.

This guide explains webhook retry logic, how retries work, which HTTP status codes should trigger a retry, and how to implement safe handling in event-driven architecture and distributed systems. It also covers exponential backoff, fixed interval retries, linear backoff, jitter, idempotency, duplicate delivery, testing, and monitoring.

For a broader view of delivery behavior, see webhook delivery retries. For retry patterns and mechanisms, see webhook delivery retry mechanisms. For the surrounding system design, webhook architecture best practices shows how retries fit into a resilient webhook stack.

What webhook retry logic is and why it matters

Webhook retry logic is the policy a sender uses to re-attempt failed deliveries after transient failures such as timeouts, HTTP 429 Too Many Requests, HTTP 5xx server errors, or brief outages. A one-off resend is just another try; a real strategy defines how many attempts to make, how long to wait between them, and when to stop. That difference matters because webhooks in event-driven architecture often need eventual consistency, not instant success.

Good retry logic keeps integrations working during short disruptions and protects customer trust by making missed events less likely. It also has to be safe in distributed systems, where duplicate delivery can happen even if the first attempt actually succeeded. That is why webhook best practices emphasize idempotency, backoff, and observability: retries should improve reliability without creating duplicate processing or retry storms.

How webhook retries work in practice

A webhook retry starts when an event is created, such as a Stripe payment update, a GitHub push, or a Shopify order event. The sender sends an HTTP POST to the receiver, then evaluates the response: 2xx usually means success, while HTTP 5xx server errors, HTTP 429 Too Many Requests, timeouts, DNS failures, connection resets, and other network failures usually trigger another attempt.

The sender uses its retry policy to decide whether the failure looks transient or permanent. Clear responses help: a 400 or 401 tells the sender not to keep retrying, while a 503 suggests a temporary outage. Retries are then scheduled with backoff until the event succeeds, the max attempt count is reached, or the retry window expires. After that, systems usually hand the event to a dead-letter queue or manual recovery flow.
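That transient-vs-permanent decision can be sketched as a small classifier, assuming the sender ends up with a status code, or `None` when the request never completed (timeout, DNS failure, connection reset):

```python
def is_retryable(status):
    """Decide whether a failed delivery attempt is worth retrying.

    status is the HTTP status code, or None when no response arrived
    (timeout, DNS failure, connection reset).
    """
    if status is None:
        return True            # no definitive answer: retry
    if status in (408, 429):
        return True            # request timeout or rate limiting
    if 500 <= status <= 599:
        return True            # temporary server-side failure
    return False               # 400/401/403/404 etc.: fix the cause, don't retry
```

Real senders often layer per-endpoint overrides on top of a default table like this, but the shape stays the same: unknown or server-side failures retry, client-side mistakes fail fast.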

Common webhook retry strategies

Fixed interval retries resend on a schedule like every 30 seconds. They are simple and fine for low-volume webhooks, but they can hammer a receiver that is already struggling. Linear backoff slows that pressure by increasing waits in a steady pattern, such as 30s, 60s, 90s, which is easier on webhook performance but still predictable.

Exponential backoff is the default choice for most production distributed systems: each failure waits longer than the last, so the sender recovers quickly from brief blips without flooding the receiver. Add jitter on top of exponential backoff to randomize retry timing and avoid the thundering herd problem when many deliveries fail together. Poorly implemented retry logic can turn a small outage into a retry storm.

Use simpler strategies only when the integration is low-risk, low-volume, and easy to replay. For production webhooks, webhook best practices usually mean exponential backoff with jitter.
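The delay math itself is small. One widely used variant, sometimes called "full jitter," draws the entire delay at random below the capped exponential value; this sketch assumes that variant:

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Capped exponential backoff with "full jitter":
    draw uniformly from [0, min(cap, base * 2**(attempt - 1))]."""
    return random.uniform(0, min(cap, base * 2 ** (attempt - 1)))
```

Additive jitter (a small random term added on top of the exponential delay) works too; what matters is that two senders failing at the same moment do not wake up at the same moment.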

Choosing the right retry policy

Use exponential backoff with jitter for most production webhook retry logic. It spaces retries farther apart after repeated failures, which reduces pressure on a struggling receiver and avoids synchronized retry spikes in an event-driven architecture. Stripe, GitHub, and Shopify-style delivery systems commonly use this pattern because it protects webhook performance under load.

Choose fixed interval retries only for low-volume, low-risk integrations where synchronized retries won’t overload the receiver. Use linear backoff if you want a simple middle ground, but it is usually less resilient than exponential backoff.

Set policy by event criticality, receiver capacity, and delivery deadline. For unrecoverable failures, cap attempts and route the event to a dead-letter queue or manual recovery path so operators can replay it later.

Which HTTP status codes should trigger a retry?

Treat HTTP 5xx server errors as retryable: 500, 502, 503, and 504 usually mean the receiver failed temporarily. Retry on timeouts, network failures, DNS failures, and connection resets too, because the sender never got a definitive success response.

Most HTTP 4xx client errors should not be retried. 400, 401, 403, and 404 usually point to bad input, expired credentials, or a wrong endpoint, so repeating the same request will fail again. Common exceptions are 408 Request Timeout, which is safe to retry, and HTTP 429 Too Many Requests, which is often retryable, especially when the receiver includes a Retry-After header.

Should 429 Too Many Requests be retried? Usually yes, but only after respecting the Retry-After header when present and only if the sender can safely delay delivery. If the receiver is rate limiting aggressively, backoff plus jitter helps avoid making the overload worse.
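Parsing Retry-After is a common stumbling block because the header can carry either delay-seconds or an HTTP-date. A minimal sketch using only the Python standard library:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_retry_after(value, now=None):
    """Parse a Retry-After header value: delay-seconds or an HTTP-date.
    Returns a non-negative delay in seconds, or None if unparseable."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)                 # delay-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

Clamping to zero matters: a Retry-After date already in the past should mean "retry now," not a negative sleep.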

Webhook retry logic examples

Simple fixed-delay loop, shown in Python (the same shape works in Node.js or Go):

```python
import time

def deliver_fixed_delay(send_http_webhook, max_attempts=5, delay=30):
    """Resend on a fixed schedule until success, a permanent failure, or the cap."""
    for attempt in range(1, max_attempts + 1):
        status = send_http_webhook()           # HTTP status code, or None on a network error
        if status is not None and 200 <= status < 300:
            return "success"
        if status is None or status == 429 or status >= 500:
            print(f"delivery attempt {attempt} failed (status={status})")
            time.sleep(delay)                  # fixed interval before the next attempt
        else:
            return "failure"                   # permanent 4xx: stop immediately
    return "failure"
```

Production pattern: use exponential backoff with jitter and a cap. In Python:

```python
import random
import time

def deliver_with_backoff(send_http_webhook, max_attempts=6, base=1.0, cap=60.0):
    """Retry with capped exponential backoff plus jitter; Retry-After wins when present."""
    for attempt in range(1, max_attempts + 1):
        status, retry_after = send_http_webhook()   # (status or None, Retry-After seconds or None)
        if status is not None and 200 <= status < 300:
            print(f"delivered on attempt {attempt}")
            return True
        if retry_after is not None:
            delay = retry_after                     # receiver's hint overrides local backoff
        elif status is None or status == 429 or status >= 500:
            delay = min(cap, base * 2 ** (attempt - 1)) + random.uniform(0, 1)
        else:
            return False                            # permanent 4xx: fail fast
        print(f"retrying attempt={attempt} delay={delay:.1f}s status={status}")
        time.sleep(delay)
    return False
```

Respect the Retry-After header when present; it overrides your local delay. In Node.js, Python, or Go, keep the same policy and swap only the HTTP client, sleep, and logging calls.

Idempotency and duplicate delivery handling

Idempotency means the same webhook event produces the same final result even if it arrives multiple times. Retries create duplicate delivery when the sender times out or loses the response after the receiver already processed the event, so the sender sends it again. Use event IDs or idempotency keys to detect repeats and skip duplicate side effects.

Store processed IDs in a deduplication table, a Redis cache with TTL, or a PostgreSQL unique constraint on event_id. For orders, use a safe upsert so order_123 is created once. For payments, never charge twice; record the first successful capture and ignore later duplicates. Retries are only reliable when duplicate processing cannot create extra side effects.
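A minimal sketch of that deduplication gate, using SQLite's unique constraint to stand in for the PostgreSQL table (`processed_events` and `process_once` are illustrative names, not a standard API):

```python
import sqlite3

def init_dedup(conn):
    """Create the dedup table; a PRIMARY KEY enforces one row per event ID."""
    conn.execute("CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)")

def process_once(conn, event_id, handler):
    """Run handler(event_id) only the first time event_id is seen."""
    try:
        conn.execute("INSERT INTO processed_events (event_id) VALUES (?)", (event_id,))
        conn.commit()
    except sqlite3.IntegrityError:
        return False           # duplicate delivery: skip the side effects
    handler(event_id)
    return True
```

In production you would run the insert and the business side effect inside one transaction so a crash between them cannot leave the event half-processed.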

How to implement webhook retry logic in code

A practical implementation usually has four parts: send the HTTP request, classify the result, compute the next delay, and persist delivery attempts so retries survive process restarts. In Node.js, Python, or Go, the same logic applies even though the syntax changes.

A sender should record the event ID, delivery attempts, last status code, and next retry time in durable storage such as PostgreSQL, Redis, Kafka, or RabbitMQ metadata. That makes it possible to resume retries after a crash and to support replay later.
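As a sketch, that per-event state might look like this (`DeliveryRecord` and its field names are illustrative, not a standard schema):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass
class DeliveryRecord:
    """Durable per-event state so retries survive process restarts."""
    event_id: str
    attempts: int = 0
    last_status: Optional[int] = None
    next_retry_at: Optional[datetime] = None
    state: str = "pending"       # pending | delivered | dead_lettered

    def record_failure(self, status, delay_seconds):
        """Bump the attempt count and schedule the next retry."""
        self.attempts += 1
        self.last_status = status
        self.next_retry_at = datetime.now(timezone.utc) + timedelta(seconds=delay_seconds)
```

A crashed worker can then scan for records whose `next_retry_at` has passed and resume exactly where delivery left off.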

A simple implementation flow looks like this:

  1. Send the webhook over HTTP.
  2. If the response is 2xx, mark the event delivered.
  3. If the response is retryable, increment the delivery attempt count.
  4. Compute delay using exponential backoff, fixed interval retries, or linear backoff.
  5. Add jitter to avoid synchronized retries.
  6. Stop after the max attempt count or retry window.
  7. Move the event to a dead-letter queue or manual recovery queue if it still fails.
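The seven steps above can be sketched in one loop. `send_webhook` is a stand-in callable, and the computed delays are returned rather than slept so the schedule is easy to inspect:

```python
import random

def run_delivery(send_webhook, event_id, max_attempts=5, base=1.0, cap=60.0):
    """Walk one event through the retry flow; returns (delay schedule, final state)."""
    schedule = []
    for attempt in range(1, max_attempts + 1):
        status = send_webhook(event_id)                     # step 1: HTTP POST
        if status is not None and 200 <= status < 300:
            return schedule, "delivered"                    # step 2: success
        if status is not None and 400 <= status < 500 and status not in (408, 429):
            break                                           # permanent failure: stop early
        delay = min(cap, base * 2 ** (attempt - 1))         # step 4: exponential backoff
        delay += random.uniform(0, 1)                       # step 5: jitter
        schedule.append(delay)                              # steps 3 and 6: track attempts
    return schedule, "dead-lettered"                        # step 7: hand off for recovery
```

A real worker would persist the schedule and sleep between attempts; the stop conditions and state transitions are what carry over.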

The important part is not the language; it is the retry policy, idempotency handling, and observability around each attempt.

Best practices for implementing webhook retries

Retry only transient failures: timeouts, 429, 5xx, DNS errors, and connection resets. Do not retry permanent failures like invalid signatures, malformed payloads, or schema validation errors; those need code fixes, not another delivery attempt. Use exponential backoff with jitter to avoid the thundering herd problem when many events fail at once.

Set both a maximum retry count and a maximum retry window so retries stop predictably. Keep payloads stable across attempts, and log the event ID, attempt number, status code, and next retry time for monitoring and alerting. When retries are exhausted, move the event to a dead-letter queue and support replay or manual recovery after the root cause is fixed.
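That dual stop condition is a one-line check once both limits are tracked (names here are hypothetical):

```python
from datetime import datetime, timedelta, timezone

def should_stop(attempts, first_attempt_at, max_attempts=10,
                max_window=timedelta(hours=24), now=None):
    """Stop once either the attempt cap or the total retry window is exhausted."""
    now = now or datetime.now(timezone.utc)
    return attempts >= max_attempts or (now - first_attempt_at) >= max_window
```

Checking both limits matters: long backoff delays can exhaust a 24-hour window well before the attempt cap is reached, and a short outage can burn through attempts long before the window closes.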

What should happen after the final retry fails?

After the final retry fails, the event should not disappear silently. Mark it as failed, store the final error details, and route it to a dead-letter queue or a durable failed-delivery store. That gives operators a place to inspect the payload, the last response, and the full sequence of delivery attempts.

From there, teams can choose manual recovery, replay, or a targeted fix. This is where observability, logging, monitoring, and alerting matter most: they tell you which failures are transient and which are permanent.

Log, monitor, and alert on failed deliveries

Treat observability as part of retry logic, not an afterthought. Log every delivery attempt with the event ID, endpoint URL, attempt number, HTTP status code, latency, and next retry time so you can trace a failure end to end.

Track metrics that show retry health: failure rate, retry success rate, age of the oldest pending retry, and dead-letter queue volume. Alert on spikes in HTTP 429 Too Many Requests, repeated timeouts, or sustained 5xx responses. Use dashboards to compare integrations and separate sender-side issues from receiver-side outages.

How do I test webhook retry logic?

Write unit tests for backoff math: confirm exponential delays grow as expected, jitter stays within the allowed range, max attempts stop retries, and permanent failures exit immediately. For integration testing, mock HTTP 5xx server errors, HTTP 429 Too Many Requests, timeouts, DNS failures, and other network failures so you can verify retry timing and logging without hitting a real endpoint.
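A few such unit tests, written against a hypothetical `backoff` helper that mirrors the capped-exponential-plus-jitter formula used earlier:

```python
import random

def backoff(attempt, base=1.0, cap=60.0):
    """Capped exponential delay plus jitter in [0, 1)."""
    return min(cap, base * 2 ** (attempt - 1)) + random.random()

def test_exponential_part_grows():
    raw = [min(60.0, 2.0 ** (a - 1)) for a in range(1, 8)]
    assert raw == sorted(raw)                      # non-decreasing until the cap

def test_cap_holds():
    for attempt in range(1, 30):
        assert backoff(attempt) < 60.0 + 1.0       # cap plus maximum jitter

def test_jitter_stays_in_range():
    for attempt in range(1, 10):
        jitter = backoff(attempt) - min(60.0, 2.0 ** (attempt - 1))
        assert 0.0 <= jitter < 1.0
```

Seeding the random generator in test setup makes jitter assertions reproducible when a test does fail.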

Use end-to-end tests to replay the same event IDs multiple times and ensure your handler blocks duplicate delivery while still accepting a valid replay after recovery. Test a receiver coming back online after several failed attempts, then confirm the next retry succeeds. Also verify dead-letter queue behavior and manual replay paths after the failure is resolved.

Common webhook retry mistakes

Aggressive schedules, like retrying every second, can overload a receiver that is already failing and turn a brief outage into a longer one. Ignoring idempotency causes duplicate delivery to create duplicate orders, double charges, or repeated ticket updates. Retrying permanent HTTP 4xx client errors such as invalid signatures or malformed payloads wastes resources and hides bugs that need code fixes, not another attempt.

Missing jitter and max limits can synchronize senders into retry storms, especially after an outage clears. Weak logging, monitoring, and broader observability make it hard to see which event failed, how often, and whether retries are helping or making webhook performance worse.

Conclusion

The safest default for webhook retry logic is still the same: exponential backoff with jitter, a hard cap on attempts and total retry time, and clear stop conditions for permanent failures. That combination gives transient issues time to recover without creating retry storms or masking bad requests that should fail fast.

Reliable webhooks also depend on idempotency. Every receiver should assume duplicate delivery can happen and protect against it with event IDs, idempotency keys, or deduplication storage. If your handler can safely process the same event twice, retries become a resilience feature instead of a source of corruption.

After final failure, you still need a recovery path: strong observability, alerting, and a dead-letter queue or equivalent backlog for manual replay and investigation. Review your current webhooks now: check retry policy, status handling, duplicate protection, and monitoring gaps.