Background Jobs Architecture: BullMQ, Retries, Dead Letter Queues, and Concurrency
Build a production background job system with BullMQ: configure job queues, implement exponential backoff retries, route failed jobs to dead letter queues, tune worker concurrency, and monitor queue health with Prometheus.
Background jobs are where most production reliability problems hide. The HTTP request returns 200, the job is queued, and somewhere in the background it silently fails: no retry, no alerting, no dead letter queue. The user never finds out their export wasn't generated or their email wasn't sent.
BullMQ on Redis solves this with persistent queues, configurable retries, and first-class dead letter queue (DLQ) support. The architecture question isn't whether to use a job queue; it's how to configure it correctly for your failure modes.
Queue Architecture Overview
Producer (API server)
  │
  ├── emailQueue        → email-worker        (concurrency: 10)
  ├── exportQueue       → export-worker       (concurrency: 3, heavy CPU)
  ├── notificationQueue → notification-worker (concurrency: 20)
  └── webhookQueue      → webhook-worker      (concurrency: 5)

Failed jobs (after maxAttempts) → DLQ (dead letter queue)
  └── Manual review / reprocessing
Queue and Worker Setup
// src/queues/index.ts
import { Queue } from "bullmq";
import { Redis } from "ioredis";
// BullMQ requires maxRetriesPerRequest: null on its connections.
// Queues can share one ioredis connection; each Worker should get its own,
// because workers issue blocking Redis commands that would stall a shared one.
const connection = new Redis(process.env.REDIS_URL!, {
maxRetriesPerRequest: null, // Required by BullMQ
enableReadyCheck: false,
});
// Shared queue options
const defaultQueueOptions = {
connection,
defaultJobOptions: {
attempts: 3,
backoff: {
type: "exponential" as const,
delay: 1000, // First retry after 1s, then 2s (doubles each retry)
},
removeOnComplete: {
age: 24 * 3600, // Keep completed jobs for 24 hours
count: 1000, // Keep last 1000 completed jobs
},
removeOnFail: false, // Keep failed jobs for DLQ processing
},
};
// Define queue payloads as discriminated union for type safety
export interface EmailJobData {
type: "welcome" | "password-reset" | "invoice" | "notification";
to: string;
userId: string;
templateData: Record<string, unknown>;
}
export interface ExportJobData {
type: "csv" | "pdf" | "xlsx";
userId: string;
tenantId: string;
filters: Record<string, unknown>;
notifyEmail: string;
}
export interface WebhookJobData {
url: string;
payload: unknown;
secret: string;
tenantId: string;
eventType: string;
attemptNumber?: number;
}
// Create queues
export const emailQueue = new Queue<EmailJobData>("email", defaultQueueOptions);
export const exportQueue = new Queue<ExportJobData>("export", {
...defaultQueueOptions,
defaultJobOptions: {
...defaultQueueOptions.defaultJobOptions,
attempts: 2, // Exports are expensive, so limit retries
// Note: BullMQ has no per-job timeout option (Bull v3's `timeout` was dropped);
// enforce long-running limits inside the processor instead
},
});
export const webhookQueue = new Queue<WebhookJobData>("webhook", {
...defaultQueueOptions,
defaultJobOptions: {
...defaultQueueOptions.defaultJobOptions,
attempts: 5, // Webhooks: retry more aggressively
backoff: {
type: "exponential",
delay: 5000, // Start at 5s for webhooks
},
},
});
// Dead letter queue: receives jobs that exhausted all retries
export const dlq = new Queue("dead-letter", { connection });
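The exponential backoff configured above yields a predictable retry schedule: with BullMQ's built-in exponential strategy, the delay before the n-th retry is `delay * 2^(n - 1)`. A small helper (hypothetical, not part of the BullMQ API) makes the schedule explicit for a given queue config:

```typescript
// Retry schedule for BullMQ's built-in exponential backoff:
// the delay before retry n is baseDelayMs * 2^(n - 1).
// `attempts` counts the initial try, so there are (attempts - 1) retries.
// Hypothetical helper for illustration; not part of the BullMQ API.
export function backoffSchedule(baseDelayMs: number, attempts: number): number[] {
  const delays: number[] = [];
  for (let retry = 1; retry < attempts; retry++) {
    delays.push(baseDelayMs * 2 ** (retry - 1));
  }
  return delays;
}

// Default queue (attempts: 3, delay: 1000): retries after 1s, then 2s.
// Webhook queue (attempts: 5, delay: 5000): retries after 5s, 10s, 20s, 40s.
```

Useful when tuning: the total time a job can spend retrying is the sum of this schedule, which bounds how stale a failed webhook or email can get before it lands in the DLQ.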
Worker Implementation
// src/workers/email.worker.ts
import { Worker, Job, UnrecoverableError } from "bullmq";
import { Redis } from "ioredis";
import { EmailJobData, dlq } from "../queues";
// sendWelcomeEmail, sendInvoiceEmail, sendPasswordResetEmail, and
// EmailAddressInvalidError are app-specific helpers defined elsewhere
const connection = new Redis(process.env.REDIS_URL!, {
maxRetriesPerRequest: null,
});
export const emailWorker = new Worker<EmailJobData>(
"email",
async (job: Job<EmailJobData>) => {
const { type, to, userId, templateData } = job.data;
// Update progress for long-running jobs
await job.updateProgress(10);
try {
// Route to appropriate email handler
switch (type) {
case "welcome":
await sendWelcomeEmail(to, templateData);
break;
case "invoice":
await sendInvoiceEmail(to, templateData);
break;
case "password-reset":
await sendPasswordResetEmail(to, templateData);
break;
default:
// UnrecoverableError tells BullMQ NOT to retry: the job fails immediately
throw new UnrecoverableError(`Unknown email type: ${type}`);
}
await job.updateProgress(100);
// Return value is stored with completed job
return { sentAt: new Date().toISOString(), provider: "resend" };
} catch (error) {
// Classify errors: recoverable vs unrecoverable
if (error instanceof EmailAddressInvalidError) {
// Don't retry invalid addresses; retrying would only burn through attempts
throw new UnrecoverableError(
`Invalid email address: ${to}. ${error.message}`
);
}
// All other errors: let BullMQ retry according to backoff config
throw error;
}
},
{
connection,
concurrency: 10, // Process 10 emails simultaneously per worker process
// Rate limiting: max 100 emails per 10 seconds (Resend free tier)
limiter: {
max: 100,
duration: 10_000,
},
}
);
// Handle job lifecycle events
emailWorker.on("completed", (job, result) => {
console.log(`[email] Job ${job.id} completed:`, result);
});
emailWorker.on("failed", async (job, error) => {
if (!job) return;
console.error(`[email] Job ${job.id} failed (attempt ${job.attemptsMade}):`, error.message);
// Move to DLQ after all retries exhausted
if (job.attemptsMade >= (job.opts.attempts ?? 3)) {
await dlq.add(
"failed-email",
{
originalQueue: "email",
jobId: job.id,
jobData: job.data,
error: error.message,
failedAt: new Date().toISOString(),
attemptsMade: job.attemptsMade,
},
{ removeOnComplete: false }
);
}
});
emailWorker.on("error", (error) => {
console.error("[email] Worker error:", error);
});
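The recoverable-vs-unrecoverable decision in the catch block is worth factoring out as workers multiply. One sketch (helper name and status-code policy are assumptions, not BullMQ conventions) classifies failures by HTTP status, the kind of signal an email or webhook provider typically returns:

```typescript
// Classify a provider failure by HTTP status:
// - 429 (rate limited) and 5xx are transient: worth retrying with backoff
// - other 4xx (bad request, auth, invalid recipient) are permanent: don't retry
// Hypothetical helper for illustration.
export type FailureKind = "retry" | "unrecoverable";

export function classifyHttpFailure(status: number): FailureKind {
  if (status === 429) return "retry"; // rate-limited: back off and retry
  if (status >= 400 && status < 500) return "unrecoverable"; // permanent client error
  return "retry"; // 5xx and anything else: assume transient
}
```

In a worker's catch block this pairs naturally with `UnrecoverableError`: `if (classifyHttpFailure(status) === "unrecoverable") throw new UnrecoverableError(...)`, otherwise rethrow and let the backoff config handle it.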
Dead Letter Queue Processing
// src/workers/dlq.worker.ts
// Process the DLQ: alert, log to database, enable manual retry
import { Worker, Job } from "bullmq";
import { Redis } from "ioredis";
// db, alertOncall, and sendSystemEmail are app-specific helpers defined elsewhere
const connection = new Redis(process.env.REDIS_URL!, {
maxRetriesPerRequest: null,
});
interface DLQJobData {
originalQueue: string;
jobId: string | undefined;
jobData: unknown;
error: string;
failedAt: string;
attemptsMade: number;
}
const dlqWorker = new Worker<DLQJobData>(
"dead-letter",
async (job: Job<DLQJobData>) => {
const { originalQueue, jobId, jobData, error, failedAt, attemptsMade } =
job.data;
// 1. Persist to database for audit trail and manual review
await db.query(
`INSERT INTO failed_jobs
(original_queue, original_job_id, job_data, error_message, failed_at, attempts_made)
VALUES ($1, $2, $3, $4, $5, $6)`,
[
originalQueue,
jobId,
JSON.stringify(jobData),
error,
failedAt,
attemptsMade,
]
);
// 2. Alert on-call if the failure rate is high
const recentFailures = await db.query<{ count: string }>(
`SELECT COUNT(*)::text FROM failed_jobs
WHERE original_queue = $1
AND failed_at > NOW() - INTERVAL '1 hour'`,
[originalQueue]
);
const failureCount = parseInt(recentFailures.rows[0].count, 10);
if (failureCount >= 10) {
await alertOncall({
severity: "warning",
message: `${failureCount} jobs failed in ${originalQueue} queue in the last hour`,
queue: originalQueue,
runbook: "https://runbooks.internal/job-queue-failures",
});
}
// 3. Notify affected user (where applicable)
if (originalQueue === "export") {
const exportData = jobData as { notifyEmail: string; type: string };
await sendSystemEmail(exportData.notifyEmail, "export-failed", {
exportType: exportData.type,
supportLink: "https://viprasol.com/contact",
});
}
},
{ connection, concurrency: 1 } // Process DLQ items one at a time
);
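Manual reprocessing then needs a way to turn a `failed_jobs` row back into something `Queue.addBulk` accepts. A sketch (helper name is hypothetical; the row shape mirrors the INSERT above):

```typescript
// Map a failed_jobs row back into a BullMQ bulk-add entry so an operator can
// re-enqueue it on the original queue after fixing the underlying issue.
// Hypothetical helper; row shape mirrors the failed_jobs INSERT above.
interface FailedJobRow {
  original_queue: string;
  original_job_id: string | null;
  job_data: string; // JSON string as stored in the database
}

export function toReenqueueEntry(row: FailedJobRow) {
  return {
    name: `replay-${row.original_queue}`,
    data: JSON.parse(row.job_data),
    opts: {
      attempts: 1, // one manual replay; don't re-enter the retry loop
      jobId: row.original_job_id
        ? `replay-${row.original_job_id}` // stable ID keeps replays idempotent
        : undefined,
    },
  };
}
```

An admin endpoint could then do something like `await emailQueue.addBulk(rows.map(toReenqueueEntry))` after the root cause is fixed.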
Job Producers: Enqueueing Jobs
// src/services/email.service.ts
import { emailQueue } from "../queues";
export class EmailService {
// Simple enqueue
async sendWelcomeEmail(userId: string, email: string): Promise<void> {
await emailQueue.add(
"welcome",
{
type: "welcome",
to: email,
userId,
templateData: { signupDate: new Date().toISOString() },
},
{
// Job-level overrides
jobId: `welcome-${userId}`, // Idempotent: won't add duplicate
delay: 5 * 60_000, // Send 5 minutes after signup
}
);
}
// Bulk enqueue with rate limiting
async sendBulkNotifications(
userIds: string[],
message: string
): Promise<void> {
const jobs = userIds.map((userId, index) => ({
name: "notification",
data: {
type: "notification" as const,
to: `user-${userId}@example.com`,
userId,
templateData: { message },
},
opts: {
delay: index * 100, // Stagger by 100ms to avoid thundering herd
priority: 10, // Lower priority than transactional emails
},
}));
await emailQueue.addBulk(jobs);
}
// Scheduled/recurring jobs
async scheduleMonthlyReport(tenantId: string): Promise<void> {
await emailQueue.add(
"monthly-report",
{
type: "notification",
to: "admin@tenant.com",
userId: "system",
templateData: { tenantId, reportMonth: new Date().toISOString() },
},
{
repeat: {
pattern: "0 8 1 * *", // 8am on the 1st of every month (cron)
tz: "America/New_York",
},
jobId: `monthly-report-${tenantId}`, // Stable ID for the repeated job
}
);
}
}
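The webhook worker itself isn't shown here, but the `secret` field on WebhookJobData implies signed deliveries. A common pattern (a sketch; the scheme is an illustrative assumption, not a standard) is an HMAC-SHA256 over the serialized payload, verified with a constant-time comparison on the receiving side:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sign a webhook payload with the per-tenant secret; the receiver recomputes
// the HMAC over the raw body and compares. Illustrative sketch: header name
// and exact scheme vary by product and are assumptions here.
export function signWebhookPayload(payload: unknown, secret: string): string {
  const body = JSON.stringify(payload);
  return createHmac("sha256", secret).update(body).digest("hex");
}

// Constant-time comparison avoids leaking signature bytes via timing
export function verifyWebhookSignature(
  body: string,
  signature: string,
  secret: string
): boolean {
  const expected = createHmac("sha256", secret).update(body).digest("hex");
  const a = Buffer.from(signature, "hex");
  const b = Buffer.from(expected, "hex");
  return a.length === b.length && timingSafeEqual(a, b);
}
```

The worker would send the signature in a header (e.g. `X-Webhook-Signature`, name hypothetical) alongside the JSON body, and consumers verify before trusting the event.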
Queue Monitoring
// src/monitoring/queue-metrics.ts
import { Queue } from "bullmq";
import { Gauge, Counter, register } from "prom-client";
const queueDepth = new Gauge({
name: "bullmq_queue_depth",
help: "Number of jobs waiting in queue",
labelNames: ["queue", "status"],
registers: [register],
});
const jobsProcessed = new Counter({
name: "bullmq_jobs_processed_total",
help: "Total jobs processed",
labelNames: ["queue", "outcome"],
registers: [register],
});
export async function collectQueueMetrics(queues: Queue[]) {
for (const queue of queues) {
const counts = await queue.getJobCounts(
"waiting",
"active",
"delayed",
"failed",
"completed"
);
queueDepth.set({ queue: queue.name, status: "waiting" }, counts.waiting);
queueDepth.set({ queue: queue.name, status: "active" }, counts.active);
queueDepth.set({ queue: queue.name, status: "delayed" }, counts.delayed);
queueDepth.set({ queue: queue.name, status: "failed" }, counts.failed);
// Alert if DLQ is growing
if (queue.name === "dead-letter" && counts.waiting > 50) {
console.error(`DLQ depth: ${counts.waiting}. Investigate failed jobs.`);
}
}
}
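The gauge updates inside collectQueueMetrics can be factored into a pure mapping from the `getJobCounts` result to label/value pairs, which makes the metric shape unit-testable without Redis or prom-client (helper name is hypothetical):

```typescript
// Flatten a BullMQ getJobCounts() result into gauge samples. Pure function:
// the Redis call and the prom-client Gauge stay at the edges.
// Hypothetical helper for illustration.
export function toDepthSamples(
  queueName: string,
  counts: Record<string, number>
): Array<{ labels: { queue: string; status: string }; value: number }> {
  const statuses = ["waiting", "active", "delayed", "failed"] as const;
  return statuses.map((status) => ({
    labels: { queue: queueName, status },
    value: counts[status] ?? 0, // missing statuses report 0, not NaN
  }));
}
```

collectQueueMetrics then reduces to `toDepthSamples(queue.name, counts).forEach((s) => queueDepth.set(s.labels, s.value))`.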
// BullMQ Board โ web UI for queue inspection (development + staging)
// npm install @bull-board/fastify @bull-board/api
import { createBullBoard } from "@bull-board/api";
import { BullMQAdapter } from "@bull-board/api/bullMQAdapter";
import { FastifyAdapter } from "@bull-board/fastify";
import type { FastifyInstance } from "fastify";
export function setupBullBoard(
app: FastifyInstance,
queues: Queue[]
): void {
const serverAdapter = new FastifyAdapter();
serverAdapter.setBasePath("/admin/queues");
createBullBoard({
queues: queues.map((q) => new BullMQAdapter(q)),
serverAdapter,
});
app.register(serverAdapter.registerPlugin(), {
prefix: "/admin/queues",
basePath: "/admin/queues",
});
}
Concurrency Tuning Guide
| Worker Type | Concurrency | Reasoning |
|---|---|---|
| Email sending | 10–20 | I/O-bound, limited by provider rate limits |
| PDF/report generation | 2–4 | CPU-bound, memory-intensive |
| Webhook delivery | 5–10 | I/O-bound, external latency |
| Database-heavy jobs | 3–5 | Limited by connection pool |
| External API calls | 10–20 | I/O-bound, watch API rate limits |
| Image processing | 2–3 | CPU + memory intensive |
Rule of thumb: sustained throughput (jobs/sec) ≈ (concurrency × workers × 1000) / avg_job_duration_ms
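As a worked example: 2 worker processes at concurrency 10, with jobs averaging 500 ms, sustain about 40 jobs/sec. A small helper (hypothetical) captures the arithmetic:

```typescript
// Sustained throughput for a pool of identical workers: each of the
// (workers * concurrency) slots completes one job per avgJobDurationMs,
// i.e. 1000 / avgJobDurationMs jobs per second per slot.
// Hypothetical helper for illustration.
export function sustainedThroughput(
  concurrency: number,
  workers: number,
  avgJobDurationMs: number
): number {
  return (concurrency * workers * 1000) / avgJobDurationMs;
}

// Example: 2 workers x concurrency 10, 500 ms jobs
// sustainedThroughput(10, 2, 500) -> 40 jobs/sec
```

Run the arithmetic in reverse when capacity planning: given a target jobs/sec and a measured job duration, solve for the worker count you need.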
See Also
- WebSocket Scaling: real-time delivery alternative
- Database Connection Pooling: pool sizing for workers
- Incident On-Call Culture: alerting on job failures
- Serverless Architecture: when to use Lambda instead
Working With Viprasol
Background job reliability is non-negotiable for SaaS products where users expect async operations (exports, report generation, email sequences) to complete correctly. We design job queue architectures with proper retry policies, dead letter queues, concurrency limits, and monitoring so failures are caught and resolved before users notice.
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.