Webhook Troubleshooting Checklist: Fix Delivery Issues Fast
Use this webhook troubleshooting checklist to quickly isolate delivery, processing, and downstream errors, fix failures fast, and restore events.
WebhookGuide
April 10, 2026
Introduction to webhook troubleshooting
A webhook can work for weeks and then fail after a deploy, secret rotation, provider-side change, or new network rule. The hard part is that the symptoms often look the same whether the problem is delivery, your handler, or a downstream dependency like a database or queue.
A webhook troubleshooting checklist is a repeatable incident workflow for isolating where a webhook broke: delivery failures, processing failures, or downstream application errors. Delivery failures mean the request never arrives or comes back with a non-2xx response. Processing failures mean your endpoint receives the webhook but cannot parse, authenticate, or handle it correctly. Downstream errors mean the webhook handler works, but something it calls next fails.
The fastest path is usually simple: confirm the event was sent, inspect the HTTP status code and response, then trace logs using event IDs and correlation IDs before digging into signatures and application logic. That approach applies across Stripe webhooks, GitHub webhooks, Shopify webhooks, and Twilio webhooks alike.
What should a webhook troubleshooting checklist include?
A useful checklist should cover delivery, authentication, payload validation, processing, retries, monitoring, and escalation. It should also tell you what to capture during an incident so you can compare provider-side behavior with your own logs.
At minimum, include:
- The provider event ID, request ID, and timestamp
- The endpoint URL and recent deploy or config changes
- HTTP status codes, especially 2xx, 4xx, and 5xx responses
- Request headers, raw request body, and payload schema
- HMAC signatures and shared secret rotation status
- Retry history and exponential backoff behavior
- Idempotency and deduplication checks
- Structured logging, distributed tracing, and correlation IDs
- A dead-letter queue or other place to park poison messages
- Escalation criteria for provider support or your platform team
Webhook troubleshooting checklist: quick triage steps
- Confirm delivery happened. Find the provider’s event ID, timestamp, and request ID first; compare them with your own logs.
- Check the HTTP status code. A 2xx response usually means the provider accepted the webhook. 4xx or 5xx responses point to a handler, auth, or server issue and often trigger retry logic.
- Verify endpoint health. Test DNS resolution, TLS/SSL certificates, and firewall rules. Make sure the endpoint returns quickly and isn’t timing out under load.
- Inspect application logs. Search for the matching correlation ID, event ID, or request headers to see where processing stopped.
- Validate the payload and signature. Compare the raw request body against the expected payload schema before blaming the provider.
- Review retries. Repeated attempts can create duplicates or push failed deliveries into a dead-letter queue.
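The triage steps above mostly hinge on one question: what does the provider-reported status tell you about where the failure sits? A minimal sketch, assuming a simple mapping from status code to the three failure classes used in this checklist (the function name and buckets are illustrative, not any provider's API):

```python
def classify_delivery(status_code: int, timed_out: bool = False) -> str:
    """Rough triage bucket for one provider-reported delivery attempt."""
    if timed_out:
        return "delivery"      # request never completed: network, DNS, TLS, timeout
    if 200 <= status_code < 300:
        return "accepted"      # provider considers this delivered
    if 400 <= status_code < 500:
        return "processing"    # handler rejected it: auth, signature, schema, parsing
    if status_code >= 500:
        return "downstream"    # server error or failing dependency behind the handler
    return "delivery"          # 3xx or anything unexpected: investigate routing
```

The buckets line up with the checklist sections: `delivery` problems send you to DNS/TLS/firewalls, `processing` to auth and payload validation, `downstream` to logs and dependencies.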
What should I check first when a webhook fails?
Start with the provider delivery record and your server logs. If the provider shows a timeout, connection failure, or non-2xx response, you already know the failure happened before your handler returned a successful response.
Then check three things in order:
- Did the request reach the endpoint? Look for the event ID, request ID, and timestamp in your logs.
- Did the endpoint return a valid response? Confirm the HTTP status code and whether the response was 2xx, 4xx, or 5xx.
- Did authentication or parsing fail? Review the raw request body, request headers, and signature validation result.
If you still cannot find the request, move to DNS, TLS, SSL certificates, firewalls, and rate limiting. Those issues often prevent the webhook from reaching your application at all.
Common webhook delivery problems
DNS failures, expired SSL certificates, and TLS handshake errors stop the request before your app ever sees it. In provider consoles, these usually appear as connection errors, certificate validation failures, or repeated timeouts rather than clean HTTP status codes.
Firewall rules, WAFs, and IP allowlists often block provider traffic with no obvious app log entry. Some systems reject the connection outright; others return generic 4xx responses or silently drop packets. Malformed payloads or a changed payload schema break JSON parsing and validation, causing 400-series errors even when delivery succeeded.
Rate limiting creates bursts of 429 responses or delayed retries after traffic spikes. Slow dependencies inside your handler—database calls, external APIs, queue backlogs—often surface as latency, timeouts, or 5xx responses. Strong webhook observability helps separate endpoint health from downstream failures.
Why do webhook requests fail with non-2xx responses?
Providers usually treat non-2xx responses as delivery failures because they cannot confirm your handler accepted the event. A 4xx response often means the request was rejected by your application, while a 5xx response usually means your server or downstream dependency failed.
Common causes include:
- Invalid authentication or signature verification failures
- Missing required fields in the payload schema
- Application exceptions during parsing or business logic
- Rate limiting or WAF rules returning 429 or 403 responses
If the provider retries on non-2xx responses, make sure your handler is safe to run more than once.
How to debug webhook requests step by step
Start in the provider console or delivery log and capture the exact request, response code, and retry history for the failed webhook. Compare the event ID and any correlation IDs with your server logs, then inspect the request headers, raw request body, and payload schema to confirm the event matches what your consumer expects.
Check authentication before business logic runs: validate HMAC signatures against the shared secret, or verify the Bearer token path if that’s how the provider authenticates. If the signature fails, the request should stop there.
Use structured logging and distributed tracing to pinpoint whether the failure is in routing, validation, or downstream processing, and watch latency for timeouts. After the fix, replay the same payload to confirm the issue is resolved without regressions.
Why are my webhook events timing out?
Slow database queries, external API calls, or synchronous processing can trigger timeouts even when your endpoint is healthy. The provider only sees that your handler did not respond fast enough, so it may retry using retry logic and exponential backoff.
To reduce timeouts:
- Return a fast 2xx response as soon as the event is validated
- Move expensive work to a background job or queue
- Set explicit request and job time limits
- Monitor latency at the handler, queue, and downstream service levels
If timeouts happen only during traffic spikes, check rate limiting, thread exhaustion, connection pool limits, and cold starts in serverless environments such as AWS Lambda.
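The "validate fast, acknowledge fast, process later" pattern above can be sketched as follows. This is a minimal in-process illustration: the `queue.Queue` stands in for a real broker such as SQS, Redis, or Celery, and the function names are assumptions, not a framework API:

```python
import json
import queue
import threading

# Stand-in for a durable queue; in production, enqueue to SQS, Redis, etc.
events = queue.Queue()

def handle_webhook(raw_body: bytes) -> int:
    """Return an HTTP status code quickly; defer expensive work."""
    try:
        event = json.loads(raw_body)
    except json.JSONDecodeError:
        return 400             # malformed payload: reject, nothing to enqueue
    events.put(event)          # cheap, non-blocking handoff
    return 200                 # acknowledge before any slow processing runs

def worker():
    while True:
        event = events.get()
        # ... slow database writes, external API calls, emails go here ...
        events.task_done()

threading.Thread(target=worker, daemon=True).start()
```

The handler's only jobs are to validate the payload and hand it off; everything that can be slow happens in the worker, outside the provider's timeout window.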
Why am I receiving duplicate webhook events?
Under at-least-once delivery, retries can create duplicate webhooks. Prevent double charges or duplicate records with idempotency, deduplication, and unique event IDs.
A practical approach is to store the provider event ID or delivery ID before processing. If the same ID arrives again, return a 2xx response and skip the side effect. For payments or order updates, use database uniqueness constraints or a processed-events table so duplicates cannot create a second write.
Duplicate events can also appear when a provider retries after a timeout even though your handler eventually completed. That is why fast acknowledgment matters.
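The store-the-ID-first approach can be sketched like this. The in-memory set is a stand-in for a processed-events table; in a real system you would rely on a database unique constraint on the event ID so that concurrent duplicates also collapse to one write:

```python
processed: set[str] = set()    # stand-in for a processed-events table

def process_once(event_id: str, apply_side_effect) -> str:
    """Record the provider event ID before processing; skip repeats."""
    if event_id in processed:
        return "duplicate"     # already handled: ack with 2xx, no side effect
    processed.add(event_id)
    apply_side_effect()        # the charge, order update, email, etc.
    return "processed"
```

On the second delivery of the same event ID, the handler still returns success to the provider, but the side effect runs exactly once.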
How do I handle out-of-order webhook events?
Order is not guaranteed in distributed systems, so never assume events arrive strictly in sequence. Use state reconciliation, version checks, and timestamp-aware updates to handle late-arriving events safely.
For example, if a later event says an order was canceled but an earlier event still says it was paid, your handler should compare the current stored state, the event timestamp, and any version number before overwriting data. If the provider includes event IDs or sequence numbers, store them and process only the newest valid state transition.
This matters for systems that emit bursts of updates, such as Stripe webhooks, GitHub webhooks, Shopify webhooks, and Twilio webhooks.
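The compare-before-overwrite logic described above can be sketched as a guard on a sequence number. This assumes the provider includes a monotonically increasing `seq` field (a timestamp works the same way, but is weaker if two events can share a timestamp); the field names are illustrative:

```python
def apply_event(stored: dict, event: dict) -> dict:
    """Only accept an update if it is newer than the stored state."""
    if event["seq"] <= stored.get("seq", -1):
        return stored          # stale or duplicate delivery: keep current state
    return {"seq": event["seq"], "status": event["status"]}
```

A late-arriving "paid" event with a lower sequence number than the stored "canceled" state is simply ignored instead of clobbering the newer state.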
How do I verify a webhook signature?
Most providers sign the raw request body with an HMAC signature and a shared secret. To verify it correctly:
- Capture the raw request body before any parsing or normalization.
- Recompute the HMAC signature using the provider’s documented algorithm.
- Compare the computed value with the signature in the request headers.
- Reject the request if the signature does not match.
- Rotate the shared secret carefully and test both old and new values during the transition window if the provider supports it.
If the provider uses a Bearer token instead of or in addition to HMAC signatures, verify the token in the authorization header before processing the payload.
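The verification steps above look roughly like this in Python. The HMAC-SHA256-over-raw-body, hex-digest scheme shown here is common (GitHub's `X-Hub-Signature-256` header, for example, is `sha256=` plus this digest), but check your provider's documentation for the exact algorithm, encoding, and header name:

```python
import hashlib
import hmac

def verify_signature(raw_body: bytes, header_sig: str, secret: bytes) -> bool:
    """Recompute an HMAC-SHA256 over the raw body and compare safely."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking the secret via timing.
    return hmac.compare_digest(expected, header_sig)
```

Note that `raw_body` must be the bytes exactly as received; re-serializing parsed JSON will usually change whitespace or key order and break the signature.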
What logs are most useful for webhook debugging?
The most useful logs are the ones that let you match one provider delivery to one application execution. Capture:
- Event ID
- Request ID
- Correlation ID
- Timestamp
- HTTP status code
- Signature validation result
- Payload type or event type
- Handler latency
- Downstream dependency errors
Structured logging makes these fields searchable. Distributed tracing helps you see whether the failure happened in the webhook handler, a queue worker, a database call, or an external API request. If you use Datadog, New Relic, Sentry, Prometheus, or Grafana, make sure the webhook path is visible in both logs and metrics.
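A minimal sketch of one structured log line per delivery, assuming JSON-formatted logs; the field names are illustrative, and any real setup would come from your logging framework's JSON formatter rather than a hand-rolled helper:

```python
import json
import logging

logger = logging.getLogger("webhooks")

def log_delivery(event_id: str, correlation_id: str, status: int,
                 signature_ok: bool, latency_ms: float) -> str:
    """Emit one JSON record per delivery so every field is searchable."""
    record = {
        "event_id": event_id,
        "correlation_id": correlation_id,
        "http_status": status,
        "signature_ok": signature_ok,
        "latency_ms": latency_ms,
    }
    line = json.dumps(record)
    logger.info(line)          # one line joins the provider delivery to this run
    return line
```

With records shaped like this, "find every delivery of event X" and "find every signature failure in the last hour" become simple log queries.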
How can I test webhooks locally?
Expose your local handler with ngrok or Cloudflare Tunnel so Stripe, GitHub, or Shopify can send real events to your machine while you debug. Then replay captured requests with Postman or curl to verify the headers, payload shape, and exact response codes your provider expects.
For signature checks, preserve the raw request body and verify the HMAC signatures with the same shared secret your provider uses. To test failure handling, return 4xx responses and 5xx responses, delay replies past the provider timeout, and send malformed JSON or missing fields.
Use this loop to reproduce duplicate events, out-of-order events, and partial downstream outages without production traffic. The same approach works for AWS Lambda, Node.js/Express, Python/Flask/FastAPI, and Ruby on Rails handlers in local or staging environments.
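When replaying a captured payload, the signature has to be recomputed over the body you actually send, or your own verification will (correctly) reject it. A small helper can build the pieces of a replay request; the `X-Hub-Signature-256` header name and `sha256=` prefix are modeled on GitHub-style webhooks and should be swapped for your provider's convention:

```python
import hashlib
import hmac

def build_replay_request(url: str, raw_body: bytes, secret: bytes) -> dict:
    """Build URL, body, and headers for replaying a captured webhook."""
    digest = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    return {
        "url": url,
        "body": raw_body,
        "headers": {
            "Content-Type": "application/json",
            "X-Hub-Signature-256": f"sha256={digest}",  # freshly signed body
        },
    }
```

Feed the result to curl, Postman, or an HTTP client against your local or tunneled endpoint; because the signature matches the replayed body, the request exercises the full verification path.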
What tools help debug webhook delivery issues?
The right tools depend on where the failure occurs:
- Postman and curl for replaying requests and checking headers
- ngrok and Cloudflare Tunnel for exposing local services
- Datadog, New Relic, Sentry, Prometheus, and Grafana for observability
- Provider dashboards for Stripe webhooks, GitHub webhooks, Shopify webhooks, and Twilio webhooks
Use these tools together so you can compare provider delivery records with application behavior instead of guessing.
How do I make webhook handling more reliable?
Design handlers to return a fast 2xx response, then push expensive work to a queue or background job. That prevents the timeout failures covered earlier when you call databases, payment APIs, or email services synchronously. Use idempotency and deduplication so repeated deliveries do not double-charge, double-send, or double-write; enforce this with unique database constraints on event IDs or provider delivery IDs.
Build retry logic with exponential backoff, but stop after a defined limit and move poison messages to a dead-letter queue for manual review. Monitor delivery success rate, latency, error rates, retry counts, and signature failures with structured logging and distributed tracing. Tools like Datadog, New Relic, Sentry, Prometheus, and Grafana help you spot regressions before customers do.
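The retry-then-park pattern above can be sketched as follows. The list stands in for a real dead-letter queue, and `base_delay` is zero here only so the sketch runs instantly; a production consumer would use something like one second doubling per attempt, plus jitter:

```python
import time

dead_letter: list = []    # stand-in for a real dead-letter queue

def process_with_retries(event: dict, handler, max_attempts: int = 5,
                         base_delay: float = 0.0) -> bool:
    """Retry with exponential backoff, then park the event for review."""
    for attempt in range(max_attempts):
        try:
            handler(event)
            return True
        except Exception:
            time.sleep(base_delay * (2 ** attempt))   # exponential backoff
    dead_letter.append(event)   # poison message: stop retrying, review manually
    return False
```

The hard cap matters: without it, a poison message retries forever, burns worker capacity, and hides healthier events behind it in the queue.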
Webhook troubleshooting checklist template and escalation guidance
Use this checklist as a runbook entry you can copy for any incident. Work through it in order so you separate delivery problems from authentication, schema, processing, retry, and monitoring issues.
1) Delivery
- Confirm the provider attempted delivery.
- Record the event ID, timestamp, and request ID.
- Verify the endpoint URL, DNS resolution, TLS certificate, and any recent deploy or config change.
- Check whether the provider shows a timeout, connection failure, or a non-2xx response.
2) Authentication
- Validate the webhook signature against the raw request body.
- Confirm the HMAC signature matches the expected algorithm.
- Check that the shared secret is current, especially after rotation.
- If the provider uses a Bearer token, confirm the authorization header is present and unmodified.
3) Payload validation
- Inspect the raw payload before parsing.
- Confirm the schema matches what your handler expects.
- Check for missing fields, renamed properties, or provider-specific event type changes.
4) Processing
- Confirm your handler returns the expected HTTP status code quickly.
- Review application logs for exceptions, timeouts, and downstream failures.
- Use structured logging and correlation IDs so you can trace one delivery across services.
- Check whether queue workers, database calls, or external API calls are blocking completion.
5) Retries
- Test how the provider retries failed deliveries.
- Look for duplicate events, backoff patterns, and repeated non-2xx responses.
- Confirm your deduplication or idempotency logic is working.
6) Monitoring
- Review alerting, dashboards, and delivery-rate trends.
- Look for missing logs, gaps in metrics, or a spike in failures.
- Confirm your monitoring captures provider-side and app-side errors.
- Recheck recent deploys, secret rotations, and infrastructure changes.
Evidence to collect before escalation
Gather these items before you hand the incident off:
- Timestamps for each failed delivery
- Event IDs
- Request IDs
- Response codes and HTTP status codes
- Relevant log excerpts from your app, queue, or gateway
- Signature validation details, including the HMAC signature result
- Any correlation IDs used in structured logging
- Latency measurements and timeout values
- Notes on DNS, TLS, SSL certificates, firewalls, and rate limiting
When to escalate
Escalate to the provider or platform team when local debugging no longer explains the failure:
- The provider shows an outage or degraded delivery service
- Delivery queues are backing up on the provider side
- Signature mismatches continue after shared secret rotation and verification
- Network restrictions, firewall rules, WAF policies, or IP allowlists are outside your control
- The same failure reproduces across multiple endpoints or environments
When you escalate, include the provider name, event IDs, request IDs, timestamps, response codes, and a short summary of what you already tested.