Webhook Troubleshooting Checklist: Fix Delivery Issues Fast
Use this webhook troubleshooting checklist to quickly isolate delivery, processing, and downstream errors, fix failures fast, and restore events.
WebhookGuide
April 10, 2026
Introduction to webhook troubleshooting
A webhook can work for weeks and then fail after a deploy, secret rotation, provider-side change, or new network rule. The hard part is that the symptoms often look the same whether the problem is delivery, your handler, or a downstream dependency like a database or queue.
A webhook troubleshooting checklist is a repeatable incident workflow for isolating where a webhook broke: delivery failures, processing failures, or downstream application errors. Delivery failures mean the request never arrives or comes back with a non-2xx response. Processing failures mean your endpoint receives the webhook but cannot parse, authenticate, or handle it correctly. Downstream errors mean the webhook handler works, but something it calls next fails.
The fastest path is usually simple: confirm the event was sent, inspect the HTTP status code and response, then trace logs using event IDs and correlation IDs before digging into signatures and application logic. That approach applies across Stripe webhooks, GitHub webhooks, Shopify webhooks, and Twilio webhooks alike.
What should a webhook troubleshooting checklist include?
A useful checklist should cover delivery, authentication, payload validation, processing, retries, monitoring, and escalation. It should also tell you what to capture during an incident so you can compare provider-side behavior with your own logs.
At minimum, include:
- The provider event ID, request ID, and timestamp
- The endpoint URL and recent deploy or config changes
- HTTP status codes, especially 2xx, 4xx, and 5xx responses
- Request headers, raw request body, and payload schema
- HMAC signatures and shared secret rotation status
- Retry history and exponential backoff behavior
- Idempotency and deduplication checks
- Structured logging, distributed tracing, and correlation IDs
- A dead-letter queue or other place to park poison messages
- Escalation criteria for provider support or your platform team
Webhook troubleshooting checklist: quick triage steps
- Confirm delivery happened. Find the provider’s event ID, timestamp, and request ID first; compare them with your own logs.
- Check the HTTP status code. A 2xx response usually means the provider accepted the webhook. 4xx or 5xx responses point to a handler, auth, or server issue and often trigger retry logic.
- Verify endpoint health. Test DNS resolution, TLS/SSL certificates, and firewall rules. Make sure the endpoint returns quickly and isn’t timing out under load.
- Inspect application logs. Search for the matching correlation ID, event ID, or request headers to see where processing stopped.
- Validate the payload and signature. Compare the raw request body against the expected payload schema before blaming the provider.
- Review retries. Repeated attempts can create duplicates or push failed deliveries into a dead-letter queue.
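The triage steps above mostly hinge on one question: what does the provider-reported status tell you about where the failure sits? A minimal sketch, assuming a simple mapping from status code to the three failure classes used in this checklist (the function name and buckets are illustrative, not any provider's API):

```python
def classify_delivery(status_code: int, timed_out: bool = False) -> str:
    """Rough triage bucket for one provider-reported delivery attempt."""
    if timed_out:
        return "delivery"      # request never completed: network, DNS, TLS, timeout
    if 200 <= status_code < 300:
        return "accepted"      # provider considers this delivered
    if 400 <= status_code < 500:
        return "processing"    # handler rejected it: auth, signature, schema, parsing
    if status_code >= 500:
        return "downstream"    # server error or failing dependency behind the handler
    return "delivery"          # 3xx or anything unexpected: investigate routing
```

The buckets line up with the checklist sections: `delivery` problems send you to DNS/TLS/firewalls, `processing` to auth and payload validation, `downstream` to logs and dependencies.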
What should I check first when a webhook fails?
Start with the provider delivery record and your server logs. If the provider shows a timeout, connection failure, or non-2xx response, you already know the failure happened before your handler returned a successful response.
Then check three things in order:
- Did the request reach the endpoint? Look for the event ID, request ID, and timestamp in your logs.
- Did the endpoint return a valid response? Confirm the HTTP status code and whether the response was 2xx, 4xx, or 5xx.
- Did authentication or parsing fail? Review the raw request body, request headers, and signature validation result.
If you still cannot find the request, move to DNS, TLS, SSL certificates, firewalls, and rate limiting. Those issues often prevent the webhook from reaching your application at all.
Common webhook delivery problems
DNS failures, expired SSL certificates, and TLS handshake errors stop the request before your app ever sees it. In provider consoles, these usually appear as connection errors, certificate validation failures, or repeated timeouts rather than clean HTTP status codes.
Firewall rules, WAFs, and IP allowlists often block provider traffic with no obvious app log entry. Some systems reject the connection outright; others return generic 4xx responses or silently drop packets. Malformed payloads or a changed payload schema break JSON parsing and validation, causing 400-series errors even when delivery succeeded.
Rate limiting creates bursts of 429 responses or delayed retries after traffic spikes. Slow dependencies inside your handler—database calls, external APIs, queue backlogs—often surface as latency, timeouts, or 5xx responses. Strong webhook observability helps separate endpoint health from downstream failures.
Why do webhook requests fail with non-2xx responses?
Providers usually treat non-2xx responses as delivery failures because they cannot confirm your handler accepted the event. A 4xx response often means the request was rejected by your application, while a 5xx response usually means your server or downstream dependency failed.
Common causes include:
- Invalid authentication or signature verification failures
- Missing required fields in the payload schema
- Application exceptions during parsing or business logic
- Rate limiting or WAF rules returning 429 or 403 responses
If the provider retries on non-2xx responses, make sure your handler is safe to run more than once.
How to debug webhook requests step by step
Start in the provider console or delivery log and capture the exact request, response code, and retry history for the failed webhook. Compare the event ID and any correlation IDs with your server logs, then inspect the request headers, raw request body, and payload schema to confirm the event matches what your consumer expects.
Check authentication before business logic runs: validate HMAC signatures against the shared secret, or verify the Bearer token path if that’s how the provider authenticates. If the signature fails, the request should stop there.
Use structured logging and distributed tracing to pinpoint whether the failure is in routing, validation, or downstream processing, and watch latency for timeouts. After the fix, replay the same payload to confirm the issue is resolved without regressions.
Why are my webhook events timing out?
Slow database queries, external API calls, or synchronous processing can trigger timeouts even when your endpoint is healthy. The provider only sees that your handler did not respond fast enough, so it may retry using retry logic and exponential backoff.
To reduce timeouts:
- Return a fast 2xx response as soon as the event is validated
- Move expensive work to a background job or queue
- Set explicit request and job time limits
- Monitor latency at the handler, queue, and downstream service levels
If timeouts happen only during traffic spikes, check rate limiting, thread exhaustion, connection pool limits, and cold starts in serverless environments such as AWS Lambda.
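The "validate fast, acknowledge fast, process later" pattern above can be sketched as follows. This is a minimal in-process illustration: the `queue.Queue` stands in for a real broker such as SQS, Redis, or Celery, and the function names are assumptions, not a framework API:

```python
import json
import queue
import threading

# Stand-in for a durable queue; in production, enqueue to SQS, Redis, etc.
events = queue.Queue()

def handle_webhook(raw_body: bytes) -> int:
    """Return an HTTP status code quickly; defer expensive work."""
    try:
        event = json.loads(raw_body)
    except json.JSONDecodeError:
        return 400             # malformed payload: reject, nothing to enqueue
    events.put(event)          # cheap, non-blocking handoff
    return 200                 # acknowledge before any slow processing runs

def worker():
    while True:
        event = events.get()
        # ... slow database writes, external API calls, emails go here ...
        events.task_done()

threading.Thread(target=worker, daemon=True).start()
```

The handler's only jobs are to validate the payload and hand it off; everything that can be slow happens in the worker, outside the provider's timeout window.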
Why am I receiving duplicate webhook events?
Under at-least-once delivery, retries can create duplicate webhooks. Prevent double charges or duplicate records with idempotency, deduplication, and unique event IDs.
A practical approach is to store the provider event ID or delivery ID before processing. If the same ID arrives again, return a 2xx response and skip the side effect. For payments or order updates, use database uniqueness constraints or a processed-events table so duplicates cannot create a second write.
Duplicate events can also appear when a provider retries after a timeout even though your handler eventually completed. That is why fast acknowledgment matters.
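The store-the-ID-first approach can be sketched like this. The in-memory set is a stand-in for a processed-events table; in a real system you would rely on a database unique constraint on the event ID so that concurrent duplicates also collapse to one write:

```python
processed: set[str] = set()    # stand-in for a processed-events table

def process_once(event_id: str, apply_side_effect) -> str:
    """Record the provider event ID before processing; skip repeats."""
    if event_id in processed:
        return "duplicate"     # already handled: ack with 2xx, no side effect
    processed.add(event_id)
    apply_side_effect()        # the charge, order update, email, etc.
    return "processed"
```

On the second delivery of the same event ID, the handler still returns success to the provider, but the side effect runs exactly once.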
How do I handle out-of-order webhook events?
Order is not guaranteed in distributed systems, so never assume events arrive strictly in sequence. Use state reconciliation, version checks, and timestamp-aware updates to handle late-arriving events safely.
For example, if a later event says an order was canceled but an earlier event still says it was paid, your handler should compare the current stored state, the event timestamp, and any version number before overwriting data. If the provider includes event IDs or sequence numbers, store them and process only the newest valid state transition.
This matters for systems that emit bursts of updates, such as Stripe webhooks, GitHub webhooks, Shopify webhooks, and Twilio webhooks.
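The compare-before-overwrite logic described above can be sketched as a guard on a sequence number. This assumes the provider includes a monotonically increasing `seq` field (a timestamp works the same way, but is weaker if two events can share a timestamp); the field names are illustrative:

```python
def apply_event(stored: dict, event: dict) -> dict:
    """Only accept an update if it is newer than the stored state."""
    if event["seq"] <= stored.get("seq", -1):
        return stored          # stale or duplicate delivery: keep current state
    return {"seq": event["seq"], "status": event["status"]}
```

A late-arriving "paid" event with a lower sequence number than the stored "canceled" state is simply ignored instead of clobbering the newer state.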
How do I verify a webhook signature?
Most providers sign the raw request body with an HMAC signature and a shared secret. To verify it correctly:
- Capture the raw request body before any parsing or normalization.
- Recompute the HMAC signature using the provider’s documented algorithm.
- Compare the computed value with the signature in the request headers.
- Reject the request if the signature does not match.
- Rotate the shared secret carefully and test both old and new values during the transition window if the provider supports it.
If the provider uses a Bearer token instead of or in addition to HMAC signatures, verify the token in the authorization header before processing the payload.
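The verification steps above look roughly like this in Python. The HMAC-SHA256-over-raw-body, hex-digest scheme shown here is common (GitHub's `X-Hub-Signature-256` header, for example, is `sha256=` plus this digest), but check your provider's documentation for the exact algorithm, encoding, and header name:

```python
import hashlib
import hmac

def verify_signature(raw_body: bytes, header_sig: str, secret: bytes) -> bool:
    """Recompute an HMAC-SHA256 over the raw body and compare safely."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking the secret via timing.
    return hmac.compare_digest(expected, header_sig)
```

Note that `raw_body` must be the bytes exactly as received; re-serializing parsed JSON will usually change whitespace or key order and break the signature.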
What logs are most useful for webhook debugging?
The most useful logs are the ones that let you match one provider delivery to one application execution. Capture:
- Event ID
- Request ID
- Correlation ID
- Timestamp
- HTTP status code
- Signature validation result
- Payload type or event type
- Handler latency
- Downstream dependency errors
Structured logging makes these fields searchable. Distributed tracing helps you see whether the failure happened in the webhook handler, a queue worker, a database call, or an external API request. If you use Datadog, New Relic, Sentry, Prometheus, or Grafana, make sure the webhook path is visible in both logs and metrics.
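A minimal sketch of one structured log line per delivery, assuming JSON-formatted logs; the field names are illustrative, and any real setup would come from your logging framework's JSON formatter rather than a hand-rolled helper:

```python
import json
import logging

logger = logging.getLogger("webhooks")

def log_delivery(event_id: str, correlation_id: str, status: int,
                 signature_ok: bool, latency_ms: float) -> str:
    """Emit one JSON record per delivery so every field is searchable."""
    record = {
        "event_id": event_id,
        "correlation_id": correlation_id,
        "http_status": status,
        "signature_ok": signature_ok,
        "latency_ms": latency_ms,
    }
    line = json.dumps(record)
    logger.info(line)          # one line joins the provider delivery to this run
    return line
```

With records shaped like this, "find every delivery of event X" and "find every signature failure in the last hour" become simple log queries.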
How can I test webhooks locally?
Expose your local handler with ngrok or Cloudflare Tunnel so Stripe, GitHub, or Shopify can send real events to your machine while you debug. Then replay captured requests with Postman or curl to verify the headers, payload shape, and exact response codes your provider expects.
For signature checks, preserve the raw request body and verify the HMAC signatures with the same shared secret your provider uses. To test failure handling, return 4xx responses and 5xx responses, delay replies past the provider timeout, and send malformed JSON or missing fields.
Use this loop to reproduce duplicate events, out-of-order events, and partial downstream outages without production traffic. The same approach works for AWS Lambda, Node.js/Express, Python/Flask/FastAPI, and Ruby on Rails handlers in local or staging environments.
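When replaying a captured payload, the signature has to be recomputed over the body you actually send, or your own verification will (correctly) reject it. A small helper can build the pieces of a replay request; the `X-Hub-Signature-256` header name and `sha256=` prefix are modeled on GitHub-style webhooks and should be swapped for your provider's convention:

```python
import hashlib
import hmac

def build_replay_request(url: str, raw_body: bytes, secret: bytes) -> dict:
    """Build URL, body, and headers for replaying a captured webhook."""
    digest = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    return {
        "url": url,
        "body": raw_body,
        "headers": {
            "Content-Type": "application/json",
            "X-Hub-Signature-256": f"sha256={digest}",  # freshly signed body
        },
    }
```

Feed the result to curl, Postman, or an HTTP client against your local or tunneled endpoint; because the signature matches the replayed body, the request exercises the full verification path.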
What tools help debug webhook delivery issues?
The right tools depend on where the failure occurs:
- Postman and curl for replaying requests and checking headers
- ngrok and Cloudflare Tunnel for exposing local services
- Datadog, New Relic, Sentry, Prometheus, and Grafana for observability
- Provider dashboards for Stripe webhooks, GitHub webhooks, Shopify webhooks, and Twilio webhooks
Use these tools together so you can compare provider delivery records with application behavior instead of guessing.
How do I make webhook handling more reliable?
Design handlers to return a fast 2xx response, then push expensive work to a queue or background job. That prevents the timeout failures covered earlier when you call databases, payment APIs, or email services synchronously. Use idempotency and deduplication so repeated deliveries do not double-charge, double-send, or double-write; enforce this with unique database constraints on event IDs or provider delivery IDs.
Build retry logic with exponential backoff, but stop after a defined limit and move poison messages to a dead-letter queue for manual review. Monitor delivery success rate, latency, error rates, retry counts, and signature failures with structured logging and distributed tracing. Tools like Datadog, New Relic, Sentry, Prometheus, and Grafana help you spot regressions before customers do.
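The retry-then-park pattern above can be sketched as follows. The list stands in for a real dead-letter queue, and `base_delay` is zero here only so the sketch runs instantly; a production consumer would use something like one second doubling per attempt, plus jitter:

```python
import time

dead_letter: list = []    # stand-in for a real dead-letter queue

def process_with_retries(event: dict, handler, max_attempts: int = 5,
                         base_delay: float = 0.0) -> bool:
    """Retry with exponential backoff, then park the event for review."""
    for attempt in range(max_attempts):
        try:
            handler(event)
            return True
        except Exception:
            time.sleep(base_delay * (2 ** attempt))   # exponential backoff
    dead_letter.append(event)   # poison message: stop retrying, review manually
    return False
```

The hard cap matters: without it, a poison message retries forever, burns worker capacity, and hides healthier events behind it in the queue.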
Webhook troubleshooting checklist template and escalation guidance
Use this checklist as a runbook entry you can copy for any incident. Work through it in order so you separate delivery problems from authentication, schema, processing, retry, and monitoring issues.
1) Delivery
- Confirm the provider attempted delivery.
- Record the event ID, timestamp, and request ID.
- Verify the endpoint URL, DNS resolution, TLS certificate, and any recent deploy or config change.
- Check whether the provider shows a timeout, connection failure, or a non-2xx response.
2) Authentication
- Validate the webhook signature against the raw request body.
- Confirm the HMAC signature matches the expected algorithm.
- Check that the shared secret is current, especially after rotation.
- If the provider uses a Bearer token, confirm the authorization header is present and unmodified.
3) Payload validation
- Inspect the raw payload before parsing.
- Confirm the schema matches what your handler expects.
- Check for missing fields, renamed properties, or provider-specific event type changes.
4) Processing
- Confirm your handler returns the expected HTTP status code quickly.
- Review application logs for exceptions, timeouts, and downstream failures.
- Use structured logging and correlation IDs so you can trace one delivery across services.
- Check whether queue workers, database calls, or external API calls are blocking completion.
5) Retries
- Test how the provider retries failed deliveries.
- Look for duplicate events, backoff patterns, and repeated non-2xx responses.
- Confirm your deduplication or idempotency logic is working.
6) Monitoring
- Review alerting, dashboards, and delivery-rate trends.
- Look for missing logs, gaps in metrics, or a spike in failures.
- Confirm your monitoring captures provider-side and app-side errors.
- Recheck recent deploys, secret rotations, and infrastructure changes.
Evidence to collect before escalation
Gather these items before you hand the incident off:
- Timestamps for each failed delivery
- Event IDs
- Request IDs
- Response codes and HTTP status codes
- Relevant log excerpts from your app, queue, or gateway
- Signature validation details, including the HMAC signature result
- Any correlation IDs used in structured logging
- Latency measurements and timeout values
- Notes on DNS, TLS, SSL certificates, firewalls, and rate limiting
When to escalate
Escalate to the provider or platform team when local debugging no longer explains the failure:
- The provider shows an outage or degraded delivery service
- Delivery queues are backing up on the provider side
- Signature mismatches continue after shared secret rotation and verification
- Network restrictions, firewall rules, WAF policies, or IP allowlists are outside your control
- The same failure reproduces across multiple endpoints or environments
When you escalate, include the provider name, event IDs, request IDs, timestamps, response codes, and a short summary of what you already tested.