AWS SQS Dead Letter Queue in 2026: Poison Pills, Redrive Policy, and Failure Alerting
Every SQS queue needs a Dead Letter Queue. Without one, poison pill messages (messages your consumer can never successfully process) are redelivered again and again until the retention period expires, blocking the queue and burning Lambda invocations or worker CPU on every attempt. The DLQ is your safety net: after N failed attempts, the message is parked somewhere safe where you can inspect it, fix the bug, and redrive it back to the source queue.
This post covers the complete DLQ setup: Terraform configuration, consumer error handling patterns that distinguish retriable from permanent failures, CloudWatch alarms when messages land in the DLQ, and the redrive API to replay messages after fixing the underlying bug.
Queue Architecture
Source Queue → Consumer Lambda/Worker
↓ (after maxReceiveCount failures)
Dead Letter Queue → Alert → Manual inspection → Redrive → Source Queue
Every queue should have a corresponding DLQ. The DLQ itself should have a much longer retention period—you want messages to stay there long enough for you to diagnose and fix the issue.
Terraform Configuration
# terraform/sqs.tf
# Dead Letter Queue — long retention for inspection
resource "aws_sqs_queue" "order_processing_dlq" {
  name                      = "${var.name}-${var.environment}-order-processing-dlq"
  message_retention_seconds = 1209600 # 14 days (max)

  # Optional: FIFO DLQ if source is FIFO
  # fifo_queue = true

  tags = merge(var.common_tags, {
    Purpose = "dead-letter-queue"
    Source  = "${var.name}-${var.environment}-order-processing"
  })
}
# Source Queue with DLQ redrive policy
resource "aws_sqs_queue" "order_processing" {
  name                       = "${var.name}-${var.environment}-order-processing"
  message_retention_seconds  = 86400 # 1 day
  visibility_timeout_seconds = 300   # AWS recommends at least 6x the Lambda timeout for event source mappings

  # After 3 failed receives, the message moves to the DLQ
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.order_processing_dlq.arn
    maxReceiveCount     = 3
  })

  tags = var.common_tags
}
# Restrict which source queues may use this queue as their DLQ
resource "aws_sqs_queue_redrive_allow_policy" "order_processing_dlq" {
  queue_url = aws_sqs_queue.order_processing_dlq.url

  redrive_allow_policy = jsonencode({
    redrivePermission = "byQueue"
    sourceQueueArns   = [aws_sqs_queue.order_processing.arn]
  })
}
# CloudWatch alarm: alert when ANY message lands in the DLQ
resource "aws_cloudwatch_metric_alarm" "dlq_not_empty" {
  alarm_name          = "${var.name}-${var.environment}-order-dlq-not-empty"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 300       # 5 minutes
  statistic           = "Maximum" # Queue depth is a gauge; alarm on the peak, not a sum
  threshold           = 0         # Alert on the first message
  alarm_description   = "Messages in order processing DLQ — requires investigation"
  treat_missing_data  = "notBreaching"

  dimensions = {
    QueueName = aws_sqs_queue.order_processing_dlq.name
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}
# Lambda event source mapping
resource "aws_lambda_event_source_mapping" "order_processor" {
  event_source_arn                   = aws_sqs_queue.order_processing.arn
  function_name                      = aws_lambda_function.order_processor.arn
  batch_size                         = 10
  maximum_batching_window_in_seconds = 5 # Wait up to 5s to fill a batch

  # Partial batch failure — report individual message failures
  function_response_types = ["ReportBatchItemFailures"]
}
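After `terraform apply`, it's worth confirming the redrive policy actually took effect before wiring up consumers. A minimal sketch (the script path and SOURCE_QUEUE_URL variable are placeholders; the RedrivePolicy queue attribute comes back as a JSON string):

// scripts/verify-redrive.ts (hypothetical path)
import { SQSClient, GetQueueAttributesCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({ region: process.env.AWS_REGION });

async function verifyRedrivePolicy(queueUrl: string) {
  const { Attributes } = await sqs.send(
    new GetQueueAttributesCommand({
      QueueUrl: queueUrl,
      AttributeNames: ["RedrivePolicy"],
    })
  );
  if (!Attributes?.RedrivePolicy) {
    throw new Error(`No redrive policy configured on ${queueUrl}`);
  }
  // The attribute is a JSON string: { deadLetterTargetArn, maxReceiveCount }
  const policy = JSON.parse(Attributes.RedrivePolicy);
  console.log(`DLQ: ${policy.deadLetterTargetArn}, maxReceiveCount: ${policy.maxReceiveCount}`);
}

verifyRedrivePolicy(process.env.SOURCE_QUEUE_URL!).catch(console.error);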
Consumer: Handling Partial Batch Failures
With ReportBatchItemFailures, your Lambda can fail specific messages in a batch without failing the entire batch:
// functions/order-processor/handler.ts
import { SQSHandler, SQSBatchResponse, SQSRecord } from "aws-lambda";
// Assumed import path for your database client (e.g., Prisma); adjust to your project.
import { db } from "../../lib/db";

export const handler: SQSHandler = async (event): Promise<SQSBatchResponse> => {
  const batchItemFailures: SQSBatchResponse["batchItemFailures"] = [];

  await Promise.allSettled(
    event.Records.map(async (record) => {
      try {
        const body = JSON.parse(record.body);
        await processOrder(body);
      } catch (error) {
        console.error(`Failed to process message ${record.messageId}:`, error);

        // Only report as failure if it's retriable.
        // Permanent failures (bad data) should be handled differently.
        if (isRetriableError(error)) {
          batchItemFailures.push({ itemIdentifier: record.messageId });
        } else {
          // Permanent failure — log, alert, but don't retry
          await recordPermanentFailure(record, error);
          // Not added to batchItemFailures — the message will be deleted from the queue
        }
      }
    })
  );

  return { batchItemFailures };
};
// Exported so the retry/discard boundary can be unit-tested in isolation
export function isRetriableError(error: unknown): boolean {
  if (error instanceof Error) {
    // Network errors, timeouts, rate limits → retry
    if (error.message.includes("ECONNRESET")) return true;
    if (error.message.includes("timeout")) return true;
    if (error.message.includes("rate limit")) return true;
    if (error.message.includes("503")) return true;

    // Business logic errors → don't retry (would always fail)
    if (error.message.includes("ValidationError")) return false;
    if (error.message.includes("NOT_FOUND")) return false;
    if (error.message.includes("DUPLICATE")) return false;
  }
  return true; // Default: retry unknown errors
}
async function processOrder(body: { orderId: string; [key: string]: unknown }) {
  // Your business logic
  const order = await db.order.findUnique({ where: { id: body.orderId } });
  if (!order) {
    throw new Error(`NOT_FOUND: order ${body.orderId}`);
  }
  // ... process the order
}
async function recordPermanentFailure(record: SQSRecord, error: unknown) {
  // Persist to DB for human review
  await db.queueFailure.create({
    data: {
      queueName: "order-processing",
      messageId: record.messageId,
      body: record.body,
      errorMessage: error instanceof Error ? error.message : String(error),
      receiveCount: parseInt(record.attributes.ApproximateReceiveCount, 10),
      failedAt: new Date(),
    },
  });
}
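Because isRetriableError is pure, the retry/discard boundary is easy to lock down with unit tests. A minimal sketch, assuming Vitest and that the function is exported as above:

// functions/order-processor/handler.test.ts (hypothetical path)
import { expect, test } from "vitest";
import { isRetriableError } from "./handler";

test("transient infrastructure errors are retried", () => {
  expect(isRetriableError(new Error("ECONNRESET"))).toBe(true);
  expect(isRetriableError(new Error("upstream returned 503"))).toBe(true);
});

test("business logic errors are not retried", () => {
  expect(isRetriableError(new Error("NOT_FOUND: order 42"))).toBe(false);
  expect(isRetriableError(new Error("ValidationError: missing sku"))).toBe(false);
});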
DLQ Inspector Worker
A separate worker that reads from the DLQ, classifies failures, and takes action:
// lib/queues/dlq-inspector.ts
import {
  SQSClient,
  ReceiveMessageCommand,
  DeleteMessageCommand,
  SendMessageCommand,
} from "@aws-sdk/client-sqs";
// Assumed import path for your database client; adjust to your project.
import { db } from "../db";

const sqs = new SQSClient({ region: process.env.AWS_REGION });

interface DLQMessage {
  messageId: string;
  receiptHandle: string;
  body: unknown;
  attributes: {
    ApproximateReceiveCount: string;
    SentTimestamp: string;
  };
}
export async function inspectDLQ(dlqUrl: string, sourceQueueUrl: string) {
  const response = await sqs.send(
    new ReceiveMessageCommand({
      QueueUrl: dlqUrl,
      MaxNumberOfMessages: 10,
      AttributeNames: ["All"],
      WaitTimeSeconds: 5,
    })
  );

  const messages = response.Messages ?? [];
  if (messages.length === 0) return { processed: 0, redriven: 0, discarded: 0 };

  let redriven = 0;
  let discarded = 0;

  for (const message of messages) {
    const msg: DLQMessage = {
      messageId: message.MessageId!,
      receiptHandle: message.ReceiptHandle!,
      body: JSON.parse(message.Body ?? "{}"),
      attributes: message.Attributes as DLQMessage["attributes"],
    };

    const decision = await classifyDLQMessage(msg);

    if (decision === "redrive") {
      // Send back to the source queue for reprocessing
      await sqs.send(
        new SendMessageCommand({
          QueueUrl: sourceQueueUrl,
          MessageBody: message.Body!,
          MessageAttributes: {
            RedriveCount: {
              DataType: "Number",
              StringValue: "1",
            },
          },
        })
      );
      redriven++;
    } else {
      discarded++;
    }

    // Delete from the DLQ regardless of the decision
    await sqs.send(
      new DeleteMessageCommand({
        QueueUrl: dlqUrl,
        ReceiptHandle: message.ReceiptHandle!,
      })
    );
  }

  return { processed: messages.length, redriven, discarded };
}
async function classifyDLQMessage(msg: DLQMessage): Promise<"redrive" | "discard"> {
  // Check whether the underlying data issue is now fixed
  const body = msg.body as { orderId?: string };
  if (body.orderId) {
    const order = await db.order.findUnique({ where: { id: body.orderId } });
    if (!order) return "discard"; // Order deleted — no point redriving
  }

  const receiveCount = parseInt(msg.attributes.ApproximateReceiveCount, 10);
  if (receiveCount > 10) return "discard"; // Too many attempts — give up

  return "redrive";
}
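To run the inspector periodically, a thin scheduled Lambda works well. A sketch, assuming an EventBridge schedule triggers it; the file path and the ORDER_DLQ_URL / ORDER_QUEUE_URL environment variables are placeholders:

// functions/dlq-inspector/handler.ts (hypothetical wiring)
import { ScheduledHandler } from "aws-lambda";
import { inspectDLQ } from "../../lib/queues/dlq-inspector";

export const handler: ScheduledHandler = async () => {
  const result = await inspectDLQ(
    process.env.ORDER_DLQ_URL!,
    process.env.ORDER_QUEUE_URL!
  );
  console.log("DLQ inspection complete:", JSON.stringify(result));
};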
Manual Redrive via AWS Console / API
AWS provides a native redrive capability (no need to write custom code):
// lib/queues/redrive.ts
import {
  SQSClient,
  StartMessageMoveTaskCommand,
  ListMessageMoveTasksCommand,
} from "@aws-sdk/client-sqs";

const sqs = new SQSClient({ region: process.env.AWS_REGION });

/**
 * Redrive all messages from the DLQ back to the source queue.
 * Uses the native StartMessageMoveTask API (no Lambda needed).
 */
export async function startRedrive(
  dlqArn: string,
  sourceQueueArn: string,
  maxMessagesPerSecond: number = 50 // Throttle to avoid overwhelming the consumer
) {
  const { TaskHandle } = await sqs.send(
    new StartMessageMoveTaskCommand({
      SourceArn: dlqArn,
      DestinationArn: sourceQueueArn,
      MaxNumberOfMessagesPerSecond: maxMessagesPerSecond,
    })
  );
  return TaskHandle;
}

// Note: for a redrive, the "source" of the move task is the DLQ itself.
export async function checkRedriveStatus(dlqArn: string) {
  const { Results } = await sqs.send(
    new ListMessageMoveTasksCommand({ SourceArn: dlqArn })
  );
  return Results?.map((r) => ({
    taskHandle: r.TaskHandle,
    status: r.Status,
    movedMessages: r.ApproximateNumberOfMessagesMoved,
    remainingMessages: r.ApproximateNumberOfMessagesToMove,
    failureReason: r.FailureReason,
  }));
}
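Usage looks like this (a sketch; the script path and ARN environment variables are placeholders, and ListMessageMoveTasks is queried with the DLQ ARN because the DLQ is the source of the move):

// scripts/redrive-orders.ts (hypothetical path)
import { startRedrive, checkRedriveStatus } from "../lib/queues/redrive";

async function main() {
  const dlqArn = process.env.ORDER_DLQ_ARN!;
  const sourceArn = process.env.ORDER_QUEUE_ARN!;

  const taskHandle = await startRedrive(dlqArn, sourceArn, 25);
  console.log(`Redrive started: ${taskHandle}`);

  // Poll until the move task leaves the RUNNING state
  for (;;) {
    const tasks = await checkRedriveStatus(dlqArn);
    const task = tasks?.find((t) => t.taskHandle === taskHandle) ?? tasks?.[0];
    console.log(`status=${task?.status} moved=${task?.movedMessages}`);
    if (task?.status !== "RUNNING") break;
    await new Promise((resolve) => setTimeout(resolve, 5_000));
  }
}

main().catch(console.error);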
DLQ Monitoring Dashboard
// app/api/admin/dlq-status/route.ts
import { SQSClient, GetQueueAttributesCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({ region: process.env.AWS_REGION });

const QUEUES = [
  { name: "Order Processing", dlqUrl: process.env.ORDER_DLQ_URL! },
  { name: "Email Delivery", dlqUrl: process.env.EMAIL_DLQ_URL! },
  { name: "Webhook Events", dlqUrl: process.env.WEBHOOK_DLQ_URL! },
];

export async function GET() {
  const statuses = await Promise.all(
    QUEUES.map(async (queue) => {
      const attrs = await sqs.send(
        new GetQueueAttributesCommand({
          QueueUrl: queue.dlqUrl,
          AttributeNames: [
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
          ],
        })
      );

      const visible = parseInt(attrs.Attributes?.ApproximateNumberOfMessages ?? "0", 10);
      const inFlight = parseInt(attrs.Attributes?.ApproximateNumberOfMessagesNotVisible ?? "0", 10);

      return {
        name: queue.name,
        dlqUrl: queue.dlqUrl,
        messagesVisible: visible,
        messagesInFlight: inFlight,
        total: visible + inFlight,
        status: visible + inFlight > 0 ? "needs_attention" : "healthy",
      };
    })
  );

  const hasIssues = statuses.some((s) => s.status === "needs_attention");

  return Response.json({
    queues: statuses,
    overallStatus: hasIssues ? "degraded" : "healthy",
    checkedAt: new Date().toISOString(),
  });
}
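Any client that can issue an HTTP request can consume this route, for example a status-page widget or an external uptime check. A minimal sketch (the file path and base URL are placeholders):

// lib/monitoring/dlq-health-check.ts (hypothetical path)
export async function checkDLQHealth(baseUrl: string): Promise<boolean> {
  const res = await fetch(`${baseUrl}/api/admin/dlq-status`);
  if (!res.ok) throw new Error(`DLQ status endpoint returned ${res.status}`);

  const { overallStatus, queues } = await res.json();
  if (overallStatus !== "healthy") {
    const bad = queues
      .filter((q: { status: string }) => q.status === "needs_attention")
      .map((q: { name: string }) => q.name);
    console.warn("DLQs need attention:", bad);
  }
  return overallStatus === "healthy";
}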
Choosing maxReceiveCount
| Use Case | Recommended maxReceiveCount | Reasoning |
|---|---|---|
| Idempotent, fast processing | 3 | Quick to DLQ on bug |
| External API calls (flaky) | 5–7 | Allow for transient failures |
| Long processing (5+ min) | 2–3 | Visibility timeout complexity |
| Critical financial operations | 5 | Balance retry vs. duplicate risk |
| Webhook delivery | 7 | Target server may be down temporarily |
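Whatever value you choose, remember that retries are not instant: each failed receive hides the message for the full visibility timeout, so the worst case before a poison pill reaches the DLQ is roughly maxReceiveCount multiplied by the visibility timeout (ignoring backlog and polling gaps). A quick sanity check, using the Terraform values above:

// Back-of-the-envelope: how long can a poison pill take to reach the DLQ?
function worstCaseSecondsToDLQ(maxReceiveCount: number, visibilityTimeoutSeconds: number): number {
  // Each failed attempt holds the message invisible for up to one visibility timeout
  return maxReceiveCount * visibilityTimeoutSeconds;
}

// With maxReceiveCount = 3 and visibility_timeout_seconds = 300: 900s (15 minutes)
console.log(worstCaseSecondsToDLQ(3, 300));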
Cost and Timeline
| Component | Timeline | Cost (USD) |
|---|---|---|
| DLQ + redrive Terraform | 0.5 day | $300–$500 |
| Partial batch failure handling | 0.5–1 day | $400–$800 |
| DLQ inspector worker | 1 day | $600–$1,000 |
| CloudWatch alarms + alerting | 0.5 day | $300–$500 |
| DLQ dashboard UI | 1 day | $600–$1,000 |
| Full DLQ system | 3–5 days | $3,000–$5,000 |
SQS cost for the DLQ: negligible. SQS bills per API request rather than per stored message, so an idle DLQ costs nothing, and only failed messages generate DLQ traffic, which should be rare.
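As a rough model (the ~$0.40 per million requests figure is an assumption based on standard-queue pricing; verify against current AWS pricing for your region):

// Rough DLQ cost model; the pricing figure is an assumption, check current rates
function monthlyDLQCostUSD(failedMessagesPerMonth: number): number {
  const requestsPerMessage = 3; // approximate: move to DLQ + receive + delete
  const pricePerMillionRequests = 0.4; // standard queues, after free tier
  return (failedMessagesPerMonth * requestsPerMessage * pricePerMillionRequests) / 1_000_000;
}

// 10,000 failures/month works out to about a cent
console.log(monthlyDLQCostUSD(10_000));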
See Also
- AWS SQS Worker Pattern — The primary queue worker pattern
- AWS Step Functions — Orchestration with built-in retry/catch
- AWS CloudWatch Observability — DLQ monitoring dashboards
- AWS Lambda Cold Start Optimization — Optimizing the Lambda consumers
Working With Viprasol
We design and implement SQS queue architectures for production workloads—from basic queues through complex multi-queue pipelines with DLQ monitoring, alerting, and automated redrive. Our cloud team has built queue systems handling millions of messages per day.
What we deliver:
- SQS + DLQ Terraform modules with redrive policy
- Partial batch failure handling in Lambda consumers
- Failure classification (retriable vs permanent)
- CloudWatch alarms with SNS/PagerDuty integration
- DLQ inspector and redrive tooling
See our cloud infrastructure services or contact us to design your SQS architecture.
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.