AWS SQS Dead Letter Queue in 2026: Poison Pills, Redrive Policy, and Failure Alerting
Every SQS queue needs a Dead Letter Queue. Without one, poison pill messages (messages your consumer can never successfully process) are redelivered again and again until the retention period expires, blocking the queue and burning Lambda invocations or worker CPU on every attempt. The DLQ is your safety net: after N failed attempts, the message is parked somewhere safe where you can inspect it, fix the bug, and redrive it back to the source queue.
This post covers the complete DLQ setup: Terraform configuration, consumer error handling patterns that distinguish retriable from permanent failures, CloudWatch alarms when messages land in the DLQ, and the redrive API to replay messages after fixing the underlying bug.
Queue Architecture
Source Queue → Consumer Lambda/Worker
↓ (after maxReceiveCount failures)
Dead Letter Queue → Alert → Manual inspection → Redrive → Source Queue
Every queue should have a corresponding DLQ. The DLQ itself should have a much longer retention period—you want messages to stay there long enough for you to diagnose and fix the issue.
Terraform Configuration
# terraform/sqs.tf
# Dead Letter Queue — long retention for inspection
resource "aws_sqs_queue" "order_processing_dlq" {
  name                      = "${var.name}-${var.environment}-order-processing-dlq"
  message_retention_seconds = 1209600 # 14 days (max)

  # Optional: FIFO DLQ if source is FIFO
  # fifo_queue = true

  tags = merge(var.common_tags, {
    Purpose = "dead-letter-queue"
    Source  = "${var.name}-${var.environment}-order-processing"
  })
}
# Source Queue with DLQ redrive policy
resource "aws_sqs_queue" "order_processing" {
  name                       = "${var.name}-${var.environment}-order-processing"
  message_retention_seconds  = 86400 # 1 day
  visibility_timeout_seconds = 300   # AWS recommends at least 6x the Lambda timeout for event source mappings

  # After 3 failed receives, the message moves to the DLQ
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.order_processing_dlq.arn
    maxReceiveCount     = 3
  })

  tags = var.common_tags
}
# Restrict which source queues may use this queue as their DLQ
resource "aws_sqs_queue_redrive_allow_policy" "order_processing_dlq" {
  queue_url = aws_sqs_queue.order_processing_dlq.url

  redrive_allow_policy = jsonencode({
    redrivePermission = "byQueue"
    sourceQueueArns   = [aws_sqs_queue.order_processing.arn]
  })
}
# CloudWatch alarm: alert when ANY message lands in the DLQ
resource "aws_cloudwatch_metric_alarm" "dlq_not_empty" {
  alarm_name          = "${var.name}-${var.environment}-order-dlq-not-empty"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 300       # 5 minutes
  statistic           = "Maximum" # Queue depth is a gauge; alarm on the peak, not a sum
  threshold           = 0         # Alert on the first message
  alarm_description   = "Messages in order processing DLQ — requires investigation"
  treat_missing_data  = "notBreaching"

  dimensions = {
    QueueName = aws_sqs_queue.order_processing_dlq.name
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}
# Lambda event source mapping
resource "aws_lambda_event_source_mapping" "order_processor" {
  event_source_arn                   = aws_sqs_queue.order_processing.arn
  function_name                      = aws_lambda_function.order_processor.arn
  batch_size                         = 10
  maximum_batching_window_in_seconds = 5 # Wait up to 5s to fill a batch

  # Partial batch failure — report individual message failures
  function_response_types = ["ReportBatchItemFailures"]
}
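After `terraform apply`, it's worth confirming the redrive policy actually took effect before wiring up consumers. A minimal sketch (the script path and SOURCE_QUEUE_URL variable are placeholders; the RedrivePolicy queue attribute comes back as a JSON string):

// scripts/verify-redrive.ts (hypothetical path)
import { SQSClient, GetQueueAttributesCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({ region: process.env.AWS_REGION });

async function verifyRedrivePolicy(queueUrl: string) {
  const { Attributes } = await sqs.send(
    new GetQueueAttributesCommand({
      QueueUrl: queueUrl,
      AttributeNames: ["RedrivePolicy"],
    })
  );
  if (!Attributes?.RedrivePolicy) {
    throw new Error(`No redrive policy configured on ${queueUrl}`);
  }
  // The attribute is a JSON string: { deadLetterTargetArn, maxReceiveCount }
  const policy = JSON.parse(Attributes.RedrivePolicy);
  console.log(`DLQ: ${policy.deadLetterTargetArn}, maxReceiveCount: ${policy.maxReceiveCount}`);
}

verifyRedrivePolicy(process.env.SOURCE_QUEUE_URL!).catch(console.error);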
Consumer: Handling Partial Batch Failures
With ReportBatchItemFailures, your Lambda can fail specific messages in a batch without failing the entire batch:
// functions/order-processor/handler.ts
import { SQSHandler, SQSBatchResponse, SQSRecord } from "aws-lambda";
// Assumed import path for your database client (e.g., Prisma); adjust to your project.
import { db } from "../../lib/db";

export const handler: SQSHandler = async (event): Promise<SQSBatchResponse> => {
  const batchItemFailures: SQSBatchResponse["batchItemFailures"] = [];

  await Promise.allSettled(
    event.Records.map(async (record) => {
      try {
        const body = JSON.parse(record.body);
        await processOrder(body);
      } catch (error) {
        console.error(`Failed to process message ${record.messageId}:`, error);

        // Only report as failure if it's retriable.
        // Permanent failures (bad data) should be handled differently.
        if (isRetriableError(error)) {
          batchItemFailures.push({ itemIdentifier: record.messageId });
        } else {
          // Permanent failure — log, alert, but don't retry
          await recordPermanentFailure(record, error);
          // Not added to batchItemFailures — the message will be deleted from the queue
        }
      }
    })
  );

  return { batchItemFailures };
};
// Exported so the retry/discard boundary can be unit-tested in isolation
export function isRetriableError(error: unknown): boolean {
  if (error instanceof Error) {
    // Network errors, timeouts, rate limits → retry
    if (error.message.includes("ECONNRESET")) return true;
    if (error.message.includes("timeout")) return true;
    if (error.message.includes("rate limit")) return true;
    if (error.message.includes("503")) return true;

    // Business logic errors → don't retry (would always fail)
    if (error.message.includes("ValidationError")) return false;
    if (error.message.includes("NOT_FOUND")) return false;
    if (error.message.includes("DUPLICATE")) return false;
  }
  return true; // Default: retry unknown errors
}
async function processOrder(body: { orderId: string; [key: string]: unknown }) {
  // Your business logic
  const order = await db.order.findUnique({ where: { id: body.orderId } });
  if (!order) {
    throw new Error(`NOT_FOUND: order ${body.orderId}`);
  }
  // ... process the order
}
async function recordPermanentFailure(record: SQSRecord, error: unknown) {
  // Persist to DB for human review
  await db.queueFailure.create({
    data: {
      queueName: "order-processing",
      messageId: record.messageId,
      body: record.body,
      errorMessage: error instanceof Error ? error.message : String(error),
      receiveCount: parseInt(record.attributes.ApproximateReceiveCount, 10),
      failedAt: new Date(),
    },
  });
}
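Because isRetriableError is pure, the retry/discard boundary is easy to lock down with unit tests. A minimal sketch, assuming Vitest and that the function is exported as above:

// functions/order-processor/handler.test.ts (hypothetical path)
import { expect, test } from "vitest";
import { isRetriableError } from "./handler";

test("transient infrastructure errors are retried", () => {
  expect(isRetriableError(new Error("ECONNRESET"))).toBe(true);
  expect(isRetriableError(new Error("upstream returned 503"))).toBe(true);
});

test("business logic errors are not retried", () => {
  expect(isRetriableError(new Error("NOT_FOUND: order 42"))).toBe(false);
  expect(isRetriableError(new Error("ValidationError: missing sku"))).toBe(false);
});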
DLQ Inspector Worker
A separate worker that reads from the DLQ, classifies failures, and takes action:
// lib/queues/dlq-inspector.ts
import {
  SQSClient,
  ReceiveMessageCommand,
  DeleteMessageCommand,
  SendMessageCommand,
} from "@aws-sdk/client-sqs";
// Assumed import path for your database client; adjust to your project.
import { db } from "../db";

const sqs = new SQSClient({ region: process.env.AWS_REGION });

interface DLQMessage {
  messageId: string;
  receiptHandle: string;
  body: unknown;
  attributes: {
    ApproximateReceiveCount: string;
    SentTimestamp: string;
  };
}
export async function inspectDLQ(dlqUrl: string, sourceQueueUrl: string) {
  const response = await sqs.send(
    new ReceiveMessageCommand({
      QueueUrl: dlqUrl,
      MaxNumberOfMessages: 10,
      AttributeNames: ["All"],
      WaitTimeSeconds: 5,
    })
  );

  const messages = response.Messages ?? [];
  if (messages.length === 0) return { processed: 0, redriven: 0, discarded: 0 };

  let redriven = 0;
  let discarded = 0;

  for (const message of messages) {
    const msg: DLQMessage = {
      messageId: message.MessageId!,
      receiptHandle: message.ReceiptHandle!,
      body: JSON.parse(message.Body ?? "{}"),
      attributes: message.Attributes as DLQMessage["attributes"],
    };

    const decision = await classifyDLQMessage(msg);

    if (decision === "redrive") {
      // Send back to the source queue for reprocessing
      await sqs.send(
        new SendMessageCommand({
          QueueUrl: sourceQueueUrl,
          MessageBody: message.Body!,
          MessageAttributes: {
            RedriveCount: {
              DataType: "Number",
              StringValue: "1",
            },
          },
        })
      );
      redriven++;
    } else {
      discarded++;
    }

    // Delete from the DLQ regardless of the decision
    await sqs.send(
      new DeleteMessageCommand({
        QueueUrl: dlqUrl,
        ReceiptHandle: message.ReceiptHandle!,
      })
    );
  }

  return { processed: messages.length, redriven, discarded };
}
async function classifyDLQMessage(msg: DLQMessage): Promise<"redrive" | "discard"> {
  // Check whether the underlying data issue is now fixed
  const body = msg.body as { orderId?: string };
  if (body.orderId) {
    const order = await db.order.findUnique({ where: { id: body.orderId } });
    if (!order) return "discard"; // Order deleted — no point redriving
  }

  const receiveCount = parseInt(msg.attributes.ApproximateReceiveCount, 10);
  if (receiveCount > 10) return "discard"; // Too many attempts — give up

  return "redrive";
}
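To run the inspector periodically, a thin scheduled Lambda works well. A sketch, assuming an EventBridge schedule triggers it; the file path and the ORDER_DLQ_URL / ORDER_QUEUE_URL environment variables are placeholders:

// functions/dlq-inspector/handler.ts (hypothetical wiring)
import { ScheduledHandler } from "aws-lambda";
import { inspectDLQ } from "../../lib/queues/dlq-inspector";

export const handler: ScheduledHandler = async () => {
  const result = await inspectDLQ(
    process.env.ORDER_DLQ_URL!,
    process.env.ORDER_QUEUE_URL!
  );
  console.log("DLQ inspection complete:", JSON.stringify(result));
};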
Manual Redrive via AWS Console / API
AWS provides a native redrive capability (no need to write custom code):
// lib/queues/redrive.ts
import {
  SQSClient,
  StartMessageMoveTaskCommand,
  ListMessageMoveTasksCommand,
} from "@aws-sdk/client-sqs";

const sqs = new SQSClient({ region: process.env.AWS_REGION });

/**
 * Redrive all messages from the DLQ back to the source queue.
 * Uses the native StartMessageMoveTask API (no Lambda needed).
 */
export async function startRedrive(
  dlqArn: string,
  sourceQueueArn: string,
  maxMessagesPerSecond: number = 50 // Throttle to avoid overwhelming the consumer
) {
  const { TaskHandle } = await sqs.send(
    new StartMessageMoveTaskCommand({
      SourceArn: dlqArn,
      DestinationArn: sourceQueueArn,
      MaxNumberOfMessagesPerSecond: maxMessagesPerSecond,
    })
  );
  return TaskHandle;
}

// Note: for a redrive, the "source" of the move task is the DLQ itself.
export async function checkRedriveStatus(dlqArn: string) {
  const { Results } = await sqs.send(
    new ListMessageMoveTasksCommand({ SourceArn: dlqArn })
  );
  return Results?.map((r) => ({
    taskHandle: r.TaskHandle,
    status: r.Status,
    movedMessages: r.ApproximateNumberOfMessagesMoved,
    remainingMessages: r.ApproximateNumberOfMessagesToMove,
    failureReason: r.FailureReason,
  }));
}
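Usage looks like this (a sketch; the script path and ARN environment variables are placeholders, and ListMessageMoveTasks is queried with the DLQ ARN because the DLQ is the source of the move):

// scripts/redrive-orders.ts (hypothetical path)
import { startRedrive, checkRedriveStatus } from "../lib/queues/redrive";

async function main() {
  const dlqArn = process.env.ORDER_DLQ_ARN!;
  const sourceArn = process.env.ORDER_QUEUE_ARN!;

  const taskHandle = await startRedrive(dlqArn, sourceArn, 25);
  console.log(`Redrive started: ${taskHandle}`);

  // Poll until the move task leaves the RUNNING state
  for (;;) {
    const tasks = await checkRedriveStatus(dlqArn);
    const task = tasks?.find((t) => t.taskHandle === taskHandle) ?? tasks?.[0];
    console.log(`status=${task?.status} moved=${task?.movedMessages}`);
    if (task?.status !== "RUNNING") break;
    await new Promise((resolve) => setTimeout(resolve, 5_000));
  }
}

main().catch(console.error);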
DLQ Monitoring Dashboard
// app/api/admin/dlq-status/route.ts
import { SQSClient, GetQueueAttributesCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({ region: process.env.AWS_REGION });

const QUEUES = [
  { name: "Order Processing", dlqUrl: process.env.ORDER_DLQ_URL! },
  { name: "Email Delivery", dlqUrl: process.env.EMAIL_DLQ_URL! },
  { name: "Webhook Events", dlqUrl: process.env.WEBHOOK_DLQ_URL! },
];

export async function GET() {
  const statuses = await Promise.all(
    QUEUES.map(async (queue) => {
      const attrs = await sqs.send(
        new GetQueueAttributesCommand({
          QueueUrl: queue.dlqUrl,
          AttributeNames: [
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
          ],
        })
      );

      const visible = parseInt(attrs.Attributes?.ApproximateNumberOfMessages ?? "0", 10);
      const inFlight = parseInt(attrs.Attributes?.ApproximateNumberOfMessagesNotVisible ?? "0", 10);

      return {
        name: queue.name,
        dlqUrl: queue.dlqUrl,
        messagesVisible: visible,
        messagesInFlight: inFlight,
        total: visible + inFlight,
        status: visible + inFlight > 0 ? "needs_attention" : "healthy",
      };
    })
  );

  const hasIssues = statuses.some((s) => s.status === "needs_attention");

  return Response.json({
    queues: statuses,
    overallStatus: hasIssues ? "degraded" : "healthy",
    checkedAt: new Date().toISOString(),
  });
}
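Any client that can issue an HTTP request can consume this route, for example a status-page widget or an external uptime check. A minimal sketch (the file path and base URL are placeholders):

// lib/monitoring/dlq-health-check.ts (hypothetical path)
export async function checkDLQHealth(baseUrl: string): Promise<boolean> {
  const res = await fetch(`${baseUrl}/api/admin/dlq-status`);
  if (!res.ok) throw new Error(`DLQ status endpoint returned ${res.status}`);

  const { overallStatus, queues } = await res.json();
  if (overallStatus !== "healthy") {
    const bad = queues
      .filter((q: { status: string }) => q.status === "needs_attention")
      .map((q: { name: string }) => q.name);
    console.warn("DLQs need attention:", bad);
  }
  return overallStatus === "healthy";
}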
Choosing maxReceiveCount
| Use Case | Recommended maxReceiveCount | Reasoning |
|---|---|---|
| Idempotent, fast processing | 3 | Quick to DLQ on bug |
| External API calls (flaky) | 5–7 | Allow for transient failures |
| Long processing (5+ min) | 2–3 | Visibility timeout complexity |
| Critical financial operations | 5 | Balance retry vs. duplicate risk |
| Webhook delivery | 7 | Target server may be down temporarily |
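Whatever value you choose, remember that retries are not instant: each failed receive hides the message for the full visibility timeout, so the worst case before a poison pill reaches the DLQ is roughly maxReceiveCount multiplied by the visibility timeout (ignoring backlog and polling gaps). A quick sanity check, using the Terraform values above:

// Back-of-the-envelope: how long can a poison pill take to reach the DLQ?
function worstCaseSecondsToDLQ(maxReceiveCount: number, visibilityTimeoutSeconds: number): number {
  // Each failed attempt holds the message invisible for up to one visibility timeout
  return maxReceiveCount * visibilityTimeoutSeconds;
}

// With maxReceiveCount = 3 and visibility_timeout_seconds = 300: 900s (15 minutes)
console.log(worstCaseSecondsToDLQ(3, 300));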
Cost and Timeline
| Component | Timeline | Cost (USD) |
|---|---|---|
| DLQ + redrive Terraform | 0.5 day | $300–$500 |
| Partial batch failure handling | 0.5–1 day | $400–$800 |
| DLQ inspector worker | 1 day | $600–$1,000 |
| CloudWatch alarms + alerting | 0.5 day | $300–$500 |
| DLQ dashboard UI | 1 day | $600–$1,000 |
| Full DLQ system | 3–5 days | $3,000–$5,000 |
SQS cost for the DLQ: negligible. SQS bills per API request rather than per stored message, so an idle DLQ costs nothing, and only failed messages generate DLQ traffic, which should be rare.
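As a rough model (the ~$0.40 per million requests figure is an assumption based on standard-queue pricing; verify against current AWS pricing for your region):

// Rough DLQ cost model; the pricing figure is an assumption, check current rates
function monthlyDLQCostUSD(failedMessagesPerMonth: number): number {
  const requestsPerMessage = 3; // approximate: move to DLQ + receive + delete
  const pricePerMillionRequests = 0.4; // standard queues, after free tier
  return (failedMessagesPerMonth * requestsPerMessage * pricePerMillionRequests) / 1_000_000;
}

// 10,000 failures/month works out to about a cent
console.log(monthlyDLQCostUSD(10_000));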
See Also
- AWS SQS Worker Pattern — The primary queue worker pattern
- AWS Step Functions — Orchestration with built-in retry/catch
- AWS CloudWatch Observability — DLQ monitoring dashboards
- AWS Lambda Cold Start Optimization — Optimizing the Lambda consumers
Working With Viprasol
We design and implement SQS queue architectures for production workloads—from basic queues through complex multi-queue pipelines with DLQ monitoring, alerting, and automated redrive. Our cloud team has built queue systems handling millions of messages per day.
What we deliver:
- SQS + DLQ Terraform modules with redrive policy
- Partial batch failure handling in Lambda consumers
- Failure classification (retriable vs permanent)
- CloudWatch alarms with SNS/PagerDuty integration
- DLQ inspector and redrive tooling
See our cloud infrastructure services or contact us to design your SQS architecture.
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.