How to Build a Reliable, Scalable Webhook Delivery System from Scratch

Learn how to design a fault-tolerant webhook system with retries, idempotency, and secure delivery at scale using Node.js and PostgreSQL.

I’ve been thinking a lot about the silent workhorses of the modern web lately. You know, those little automated messengers that tell your accounting software about a new sale, or ping your team’s chat when a customer files a support ticket. They’re called webhooks, and while they seem simple—just an HTTP POST request—building a system that reliably delivers them at scale is a surprisingly complex puzzle. I’ve seen too many projects treat them as an afterthought, leading to lost notifications and frustrated users. So, let’s build one properly, from the ground up.

Think about the last time you received a payment confirmation email instantly after checking out online. That immediacy is often powered by a webhook. Your goal is to create a system that never loses a message, even if a subscriber’s server is down, and does so securely for thousands of events per minute. How do you even start designing for that level of reliability?

The foundation is a clear mental model. Your system has two main jobs: accepting events and delivering them. The trick is to never do these jobs at the same time. When an event comes in, like user.payment_succeeded, your job is to record it and queue it for delivery—fast. The actual work of calling external URLs happens separately, in the background. This separation is what lets your API remain responsive.
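To make that split concrete, here is a minimal in-memory sketch. The names acceptEvent and drainQueue and the array-backed queue are illustrative assumptions, not part of the stack built later in this article; the only point is that accepting an event returns before any delivery happens.

```javascript
// Illustrative only: an array stands in for the durable queue used later.
const pendingQueue = [];
let nextEventId = 1;

// Accept path: record the event and return immediately. No outbound HTTP here.
function acceptEvent(eventType, payload) {
  const event = { id: nextEventId++, eventType, payload };
  pendingQueue.push(event);
  return { status: 202, eventId: event.id }; // 202 Accepted: queued, not yet delivered
}

// Delivery path: runs separately and hands each queued event to `deliver`.
async function drainQueue(deliver) {
  let delivered = 0;
  while (pendingQueue.length > 0) {
    const event = pendingQueue.shift();
    await deliver(event);
    delivered += 1;
  }
  return delivered;
}
```

The caller of acceptEvent gets its answer as soon as the event is queued. How long drainQueue takes, or whether a subscriber is slow, never shows up in that response time.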

We’ll use a simple stack: Express.js to receive events, PostgreSQL to store everything, and a job queue (pg-boss, which keeps its jobs in PostgreSQL) to manage the delivery work. Let’s start with the database, the single source of truth.

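-- Core table for who wants notifications
CREATE TABLE webhook_subscriptions (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(), -- built in since PostgreSQL 13
    url TEXT NOT NULL,
    secret TEXT NOT NULL,
    events TEXT[] NOT NULL,
    active BOOLEAN NOT NULL DEFAULT true,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Incoming events, looked up by the idempotency middleware further down.
-- Assumed shape: the queries in this article only rely on id and idempotency_key.
CREATE TABLE webhook_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_type TEXT NOT NULL,
    payload JSONB NOT NULL,
    idempotency_key TEXT UNIQUE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- The delivery ledger: one row per delivery, updated on every attempt
CREATE TABLE webhook_deliveries (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    subscription_id UUID NOT NULL REFERENCES webhook_subscriptions(id),
    payload JSONB NOT NULL,
    status TEXT NOT NULL,
    attempt_count INTEGER NOT NULL DEFAULT 0,
    next_retry_at TIMESTAMPTZ,
    response_status INTEGER,
    error_message TEXT,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);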

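See the webhook_deliveries table? That’s your audit trail. Each delivery gets one row that records how many attempts were made, the last response status, and the last error. This is non-negotiable for debugging. When a user asks, “Why didn’t I get the notification?”, you can show them exactly what happened. The status field tells the story: pending while a delivery is waiting or scheduled for a retry, success once the subscriber accepts it, and exhausted when every retry has failed.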

Now, let’s handle the incoming event. The first rule is idempotency. If the same event somehow gets sent twice, you should only process it once. A client can provide a unique key with each event.

// src/middleware/idempotency.js
import { query } from '../db.js';

export async function idempotencyMiddleware(req, res, next) {
  const idempotencyKey = req.headers['idempotency-key'];

  if (!idempotencyKey) {
    return next(); // Client chooses to skip idempotency
  }

  try {
    // Have we seen this key before?
    const result = await query(
      'SELECT id FROM webhook_events WHERE idempotency_key = $1',
      [idempotencyKey]
    );

    if (result.rows.length > 0) {
      // Duplicate request: return the original event instead of reprocessing
      return res.status(200).json({
        message: 'Event already processed',
        eventId: result.rows[0].id
      });
    }

    // Pass the key along; the route handler must store it with the event row.
    // A unique constraint on idempotency_key is still needed to catch
    // two identical requests that arrive at the same moment.
    req.idempotencyKey = idempotencyKey;
    next();
  } catch (err) {
    // Express 4 does not catch rejected promises from async middleware
    next(err);
  }
}
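
The check above has a gap: two identical requests arriving at the same moment can both pass the SELECT before either one is stored. One way to close it is to let the database decide with a unique constraint. The sketch below is hedged accordingly: recordEventOnce is a hypothetical helper, the webhook_events columns it writes are assumptions, and query is any function with the node-postgres shape (sql, params) returning an object with rows.

```javascript
// Hypothetical helper. Assumes webhook_events has a UNIQUE idempotency_key
// column plus event_type and payload columns.
async function recordEventOnce(query, { idempotencyKey, eventType, payload }) {
  // The unique constraint settles the race: only one concurrent INSERT wins.
  const inserted = await query(
    `INSERT INTO webhook_events (idempotency_key, event_type, payload)
     VALUES ($1, $2, $3)
     ON CONFLICT (idempotency_key) DO NOTHING
     RETURNING id`,
    [idempotencyKey, eventType, JSON.stringify(payload)]
  );

  if (inserted.rows.length > 0) {
    return { duplicate: false, eventId: inserted.rows[0].id };
  }

  // No row came back, so another request already stored this key.
  const existing = await query(
    'SELECT id FROM webhook_events WHERE idempotency_key = $1',
    [idempotencyKey]
  );
  return { duplicate: true, eventId: existing.rows[0].id };
}
```

A route handler could call this first and only publish the event when duplicate is false.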

Security is next. When you send a webhook, how does the receiver know it’s truly from you and not an imposter? You sign it, like a digital wax seal. We use an HMAC signature.

// src/utils/signing.js
import crypto from 'crypto';

export function generateSignature(payload, secret) {
  const hmac = crypto.createHmac('sha256', secret);
  hmac.update(JSON.stringify(payload));
  return `sha256=${hmac.digest('hex')}`;
}

// The subscriber verifies this on their end
export function verifySignature(payload, signature, secret) {
  const expected = Buffer.from(generateSignature(payload, secret));
  const received = Buffer.from(String(signature));

  // timingSafeEqual throws if the lengths differ, so reject those first
  if (received.length !== expected.length) {
    return false;
  }

  // Use timing-safe comparison!
  return crypto.timingSafeEqual(received, expected);
}
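
One caveat on the subscriber side: re-serializing a parsed body with JSON.stringify is not guaranteed to reproduce the bytes that were signed, especially when the receiver is not written in JavaScript. A safer habit is to verify against the raw request body. Here is a minimal sketch of that idea; signBody and verifyRawBody are hypothetical names, and the sha256= prefix simply mirrors the format used above.

```javascript
import crypto from 'crypto';

// Sender side (for the demo): sign the exact string that goes on the wire.
function signBody(rawBody, secret) {
  const digest = crypto.createHmac('sha256', secret).update(rawBody).digest('hex');
  return `sha256=${digest}`;
}

// Receiver side: recompute over the raw body and compare in constant time.
// `rawBody` must be the request body as received, before any JSON parsing.
function verifyRawBody(rawBody, signatureHeader, secret) {
  if (typeof signatureHeader !== 'string') return false;

  const expected = Buffer.from(signBody(rawBody, secret));
  const received = Buffer.from(signatureHeader);

  // timingSafeEqual throws on a length mismatch, so check the length first.
  if (received.length !== expected.length) return false;
  return crypto.timingSafeEqual(received, expected);
}
```

In an Express receiver, that usually means capturing the raw body (for example with express.raw) on the webhook route and parsing the JSON only after the signature checks out.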

Here’s a question: what happens when you send a webhook and the subscriber’s server takes 30 seconds to respond? Or doesn’t respond at all? Your delivery system can’t wait around. This is where the job queue becomes essential. We’ll use a simple, PostgreSQL-based queue. When an event occurs, we quickly insert jobs for each subscriber and immediately respond to our API caller. The heavy lifting happens later.

// src/publisher.js
import PgBoss from 'pg-boss';
import { query } from './db.js';

export const boss = new PgBoss(process.env.DATABASE_URL);

// pg-boss has to be started before it can send jobs; start once and reuse.
// Recent pg-boss versions may also require the queue to be created first
// (boss.createQueue), so check the docs for the version you install.
let bossReady;
function ensureStarted() {
  bossReady ??= boss.start();
  return bossReady;
}

export async function publishEvent(eventType, payload) {
  await ensureStarted();

  // 1. Find all active subscribers for this event
  const subscribers = await query(
    `SELECT id, url, secret FROM webhook_subscriptions 
     WHERE $1 = ANY(events) AND active = true`,
    [eventType]
  );

  // 2. Create a delivery record and queue job for each
  for (const sub of subscribers.rows) {
    const deliveryResult = await query(
      `INSERT INTO webhook_deliveries 
       (subscription_id, payload, status) 
       VALUES ($1, $2, 'pending') 
       RETURNING id`,
      [sub.id, JSON.stringify(payload)]
    );

    const deliveryId = deliveryResult.rows[0].id;

    // 3. Send to background job queue
    await boss.send('webhook-delivery', {
      deliveryId,
      subscription: sub,
      payload
    });
  }
}

The worker process is the heart of the system. It pulls jobs from the queue and attempts delivery. But it needs to be smart about failures. A simple “try again in 5 seconds” approach will overwhelm a struggling server. You need exponential backoff.

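// src/worker.js
import axios from 'axios';
import PgBoss from 'pg-boss';
import { query } from './db.js';
import { generateSignature } from './utils/signing.js';

const MAX_ATTEMPTS = 5;

// The worker runs as its own process, so it keeps its own pg-boss client
// for re-queueing retries. The client is started lazily on first use.
const boss = new PgBoss(process.env.DATABASE_URL);
let bossReady;

export async function processDeliveryJob(job) {
  const { deliveryId, subscription, payload } = job.data;
  
  try {
    const signature = generateSignature(payload, subscription.secret);
    
    const response = await axios.post(subscription.url, payload, {
      headers: {
        'Content-Type': 'application/json',
        'X-Webhook-Signature': signature,
        'User-Agent': 'YourApp-Webhooks/1.0'
      },
      timeout: 10000 // 10 second timeout
    });

    // Success!
    await query(
      `UPDATE webhook_deliveries 
       SET status = 'success', 
           response_status = $1,
           attempt_count = attempt_count + 1
       WHERE id = $2`,
      [response.status, deliveryId]
    );

  } catch (error) {
    // Attempts made so far, including the one that just failed
    const attempt = (await getCurrentAttempt(deliveryId)) + 1;
    // HTTP status if the subscriber answered; null for timeouts and network errors
    const responseStatus = error.response?.status ?? null;
    
    if (attempt >= MAX_ATTEMPTS) {
      // Too many failures
      await query(
        `UPDATE webhook_deliveries 
         SET status = 'exhausted', 
             error_message = $1,
             response_status = $2,
             attempt_count = attempt_count + 1
         WHERE id = $3`,
        [error.message, responseStatus, deliveryId]
      );
    } else {
      // Schedule a retry with exponential backoff
      const delay = Math.pow(2, attempt) * 1000; // 2s, 4s, 8s, 16s
      const nextRetry = new Date(Date.now() + delay);
      
      await query(
        `UPDATE webhook_deliveries 
         SET status = 'pending',
             next_retry_at = $1,
             error_message = $2,
             response_status = $3,
             attempt_count = attempt_count + 1
         WHERE id = $4`,
        [nextRetry, error.message, responseStatus, deliveryId]
      );

      // Re-queue the job for later; pg-boss schedules via the startAfter option
      bossReady ??= boss.start();
      await bossReady;
      await boss.send('webhook-delivery', job.data, { startAfter: nextRetry });
    }
  }
}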

async function getCurrentAttempt(deliveryId) {
  const result = await query(
    'SELECT attempt_count FROM webhook_deliveries WHERE id = $1',
    [deliveryId]
  );
  return result.rows[0]?.attempt_count || 0;
}
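
Plain doubling has one weakness: when many deliveries fail at the same moment, they all retry at the same moment too. A common refinement is to cap the delay and add random jitter. The sketch below is one way to do that, not the only one; backoffDelayMs and its defaults (1 second base, 1 hour cap) are assumptions for illustration, and the random parameter exists only so the function can be tested.

```javascript
// Hypothetical helper: capped exponential backoff with "equal jitter".
// Half of the delay is fixed, the other half is random.
function backoffDelayMs(
  attempt,
  { baseMs = 1000, capMs = 60 * 60 * 1000, random = Math.random } = {}
) {
  // Double the base for every attempt, but never exceed the cap.
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  const half = ceiling / 2;
  // Result lies in [ceiling / 2, ceiling).
  return Math.floor(half + random() * half);
}
```

With these defaults, attempt 1 waits between 1 and 2 seconds and attempt 2 between 2 and 4 seconds, and no retry is ever scheduled more than an hour out.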

Notice the state management? Each delivery moves through a clear lifecycle: it starts as pending, a failed attempt leaves it pending with a next_retry_at in the future, and it ends as either success or exhausted. This makes the system’s behavior predictable and monitorable.
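
One way to keep that lifecycle honest is to write the allowed transitions down in code. This is a small sketch, assuming the three status values the worker actually writes (pending, success, exhausted); the canTransition helper is hypothetical.

```javascript
// Allowed status transitions. A failed attempt that will be retried
// stays in 'pending'; 'success' and 'exhausted' are terminal.
const TRANSITIONS = new Map([
  ['pending', ['pending', 'success', 'exhausted']],
  ['success', []],
  ['exhausted', []],
]);

function canTransition(from, to) {
  const allowed = TRANSITIONS.get(from) ?? [];
  return allowed.includes(to);
}
```

A guard like this can run before each UPDATE, so a late or duplicated job cannot flip a finished delivery back to pending.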

What about when a subscriber’s endpoint is completely down for hours? Continuously hammering it wastes resources. This is where a circuit breaker pattern helps. Think of it like a fuse. After too many consecutive failures, you “trip the circuit” and stop sending for a while.

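// src/utils/circuitBreaker.js
// Minimal in-memory version: state is per process and lost on restart,
// and the cool-down described below is not implemented here.
const failureTracker = new Map(); // subscription_id -> consecutive failure count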

export function shouldAttemptDelivery(subscriptionId) {
  const failures = failureTracker.get(subscriptionId) || 0;
  
  if (failures > 10) {
    // Circuit is open - don't attempt
    return false;
  }
  
  return true;
}

export function recordFailure(subscriptionId) {
  const current = failureTracker.get(subscriptionId) || 0;
  failureTracker.set(subscriptionId, current + 1);
}

export function recordSuccess(subscriptionId) {
  // Reset on success
  failureTracker.delete(subscriptionId);
}

You’d integrate this check at the start of your processDeliveryJob function. If the circuit is open, you can skip the attempt and schedule a check for later. After a cool-down period, you allow one test request through to see if the service has recovered.
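
Here is one way the cool-down and the single test request could look. Treat it as a sketch under stated assumptions: createCircuitBreaker, the threshold of 10 failures, and the 5 minute cool-down are illustrative choices, the state still lives in one process, and the now parameter exists only so the clock can be controlled in tests.

```javascript
// Hypothetical circuit breaker with a cool-down and a single probe request.
function createCircuitBreaker({
  failureThreshold = 10,
  coolDownMs = 5 * 60 * 1000,
  now = Date.now,
} = {}) {
  const circuits = new Map(); // subscriptionId -> { failures, openedAt, probing }

  function get(id) {
    if (!circuits.has(id)) {
      circuits.set(id, { failures: 0, openedAt: null, probing: false });
    }
    return circuits.get(id);
  }

  return {
    shouldAttempt(id) {
      const c = get(id);
      if (c.openedAt === null) return true;              // closed: deliver normally
      if (now() - c.openedAt < coolDownMs) return false; // open: still cooling down
      if (c.probing) return false;                       // a probe is already in flight
      c.probing = true;                                  // half-open: let one probe through
      return true;
    },
    recordSuccess(id) {
      circuits.delete(id); // back to closed with a clean slate
    },
    recordFailure(id) {
      const c = get(id);
      c.failures += 1;
      c.probing = false;
      // Open the circuit, or re-open it and restart the cool-down after a failed probe.
      if (c.failures >= failureThreshold) c.openedAt = now();
    },
  };
}
```

The worker would call shouldAttempt before sending, and recordSuccess or recordFailure after. When shouldAttempt returns false, the job can be re-queued for after the cool-down instead of being counted as a failed attempt.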

Building this changes how you think about your application. Events become first-class citizens. You start seeing notification opportunities everywhere: “When this data changes, who might want to know?” The system you’ve built handles the complexity of delivery so you can focus on the business logic of what to send.

The final piece is visibility. You need a way to see the health of your webhook deliveries. A simple dashboard that shows success rates, common failure status codes, and pending retries is invaluable. It turns a black box into a transparent system you can understand and trust.
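
The numbers behind such a dashboard can start out very simple. The sketch below summarizes rows from webhook_deliveries in plain JavaScript; summarizeDeliveries is a hypothetical helper, and it assumes the worker stores the HTTP status of failed attempts in response_status. In production you would more likely compute the same figures with a SQL aggregate.

```javascript
// Hypothetical helper: only `status` and `response_status` are read from each row.
function summarizeDeliveries(rows) {
  const summary = {
    total: rows.length,
    success: 0,
    pending: 0,
    exhausted: 0,
    successRate: null,    // success / (success + exhausted), null if nothing finished
    failuresByStatus: {}, // e.g. { '500': 3, '503': 1 }
  };

  for (const row of rows) {
    if (row.status === 'success') summary.success += 1;
    else if (row.status === 'pending') summary.pending += 1;
    else if (row.status === 'exhausted') summary.exhausted += 1;

    // Count the last HTTP status of deliveries that have not succeeded.
    if (row.status !== 'success' && row.response_status != null) {
      const code = String(row.response_status);
      summary.failuresByStatus[code] = (summary.failuresByStatus[code] || 0) + 1;
    }
  }

  const finished = summary.success + summary.exhausted;
  if (finished > 0) summary.successRate = summary.success / finished;
  return summary;
}
```

Pending deliveries are left out of the success rate on purpose, since they have not finished yet and may still go either way.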

This infrastructure might seem like overkill for sending a simple HTTP request. But that’s the point. Reliability isn’t about handling the easy cases; it’s about gracefully managing the inevitable failures. Lost notifications erode trust. A robust webhook system builds it.

What problems have you faced with notifications in your own projects? Did you find a simpler solution, or did you need to add even more complexity? I’d love to hear about your experiences. If this guide helped you think differently about backend infrastructure, please consider sharing it with others who might be facing similar challenges. Your comments and questions are always welcome below.

