AWS SQS Dead Letter Queue in 2026: Poison Pills, Redrive Policy, and Failure Alerting

Master AWS SQS Dead Letter Queues in 2026: poison pill detection, maxReceiveCount redrive policy, DLQ monitoring with CloudWatch, manual redrive, and Terraform configuration.

Viprasol Tech Team
January 24, 2027
13 min read

Every SQS queue needs a Dead Letter Queue. Without one, poison pill messages (messages your consumer can never successfully process) cycle through receive, fail, and retry forever, wasting queue throughput and burning your Lambda invocations or worker CPU; on a FIFO queue they also block every later message in the same message group. The DLQ is your safety net: after N failed attempts, the message is parked somewhere safe where you can inspect it, fix the bug, and redrive it back to the source queue.

This post covers the complete DLQ setup: Terraform configuration, consumer error handling patterns that distinguish retriable from permanent failures, CloudWatch alarms when messages land in the DLQ, and the redrive API to replay messages after fixing the underlying bug.


Queue Architecture

Source Queue → Consumer Lambda/Worker
    ↓ (after maxReceiveCount failures)
Dead Letter Queue → Alert → Manual inspection → Redrive → Source Queue

Every queue should have a corresponding DLQ. The DLQ itself should have a much longer retention period—you want messages to stay there long enough for you to diagnose and fix the issue.
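SQS tracks delivery attempts per message via the ApproximateReceiveCount attribute; once the count exceeds the redrive policy's maxReceiveCount, the message moves to the DLQ. A minimal sketch of a helper that surfaces this in consumer logs (the helper names are our own, and the default of 3 mirrors the redrive policy configured below):

```typescript
// Sketch: SQS reports delivery attempts via the ApproximateReceiveCount
// message attribute. Once the count exceeds the redrive policy's
// maxReceiveCount, the message moves to the DLQ, so a consumer can tell
// when it is handling a message's final attempt and log accordingly.
// (Helper names are illustrative; the default of 3 matches the policy.)

export function attemptsRemaining(
  approximateReceiveCount: string,
  maxReceiveCount: number = 3
): number {
  const received = parseInt(approximateReceiveCount, 10);
  return Math.max(0, maxReceiveCount - received);
}

export function isFinalAttempt(
  approximateReceiveCount: string,
  maxReceiveCount: number = 3
): boolean {
  return attemptsRemaining(approximateReceiveCount, maxReceiveCount) === 0;
}
```

A consumer might log at error level only on the final attempt, since earlier failures will be retried anyway.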


Terraform Configuration

# terraform/sqs.tf

# Dead Letter Queue — long retention for inspection
resource "aws_sqs_queue" "order_processing_dlq" {
  name                       = "${var.name}-${var.environment}-order-processing-dlq"
  message_retention_seconds  = 1209600  # 14 days (max)
  
  # Optional: FIFO DLQ if source is FIFO
  # fifo_queue = true

  tags = merge(var.common_tags, {
    Purpose = "dead-letter-queue"
    Source  = "${var.name}-${var.environment}-order-processing"
  })
}

# Source Queue with DLQ redrive policy
resource "aws_sqs_queue" "order_processing" {
  name                       = "${var.name}-${var.environment}-order-processing"
  message_retention_seconds  = 86400    # 1 day
  visibility_timeout_seconds = 300      # AWS recommends >= 6x the Lambda timeout
  
  # After 3 failures, message moves to DLQ
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.order_processing_dlq.arn
    maxReceiveCount     = 3
  })

  tags = var.common_tags
}

# Allow only the source queue to use this queue as its DLQ
resource "aws_sqs_queue_redrive_allow_policy" "order_processing_dlq" {
  queue_url = aws_sqs_queue.order_processing_dlq.url

  redrive_allow_policy = jsonencode({
    redrivePermission = "byQueue"
    sourceQueueArns   = [aws_sqs_queue.order_processing.arn]
  })
}

# CloudWatch alarm: alert when ANY message lands in DLQ
resource "aws_cloudwatch_metric_alarm" "dlq_not_empty" {
  alarm_name          = "${var.name}-${var.environment}-order-dlq-not-empty"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 300       # 5 minutes
  statistic           = "Maximum"  # Gauge metric; Sum would add up samples within the period
  threshold           = 0         # Alert on first message
  alarm_description   = "Messages in order processing DLQ — requires investigation"
  treat_missing_data  = "notBreaching"

  dimensions = {
    QueueName = aws_sqs_queue.order_processing_dlq.name
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}

# Lambda event source mapping
resource "aws_lambda_event_source_mapping" "order_processor" {
  event_source_arn                   = aws_sqs_queue.order_processing.arn
  function_name                      = aws_lambda_function.order_processor.arn
  batch_size                         = 10
  maximum_batching_window_in_seconds = 5  # Wait up to 5s to fill batch
  
  # Partial batch failure — report individual message failures
  function_response_types = ["ReportBatchItemFailures"]
}
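After applying, you can verify the wiring: GetQueueAttributes returns the source queue's RedrivePolicy attribute as a JSON string containing deadLetterTargetArn and maxReceiveCount. A sketch of a parser a health check could use (the interface name is ours):

```typescript
// Sketch: SQS returns the RedrivePolicy queue attribute as a JSON string
// with deadLetterTargetArn and maxReceiveCount fields. Parsing it lets a
// health check assert the DLQ wiring matches what Terraform applied.

interface RedrivePolicy {
  deadLetterTargetArn: string;
  maxReceiveCount: number;
}

export function parseRedrivePolicy(raw: string | undefined): RedrivePolicy | null {
  if (!raw) return null; // queue has no DLQ configured
  const parsed = JSON.parse(raw);
  return {
    deadLetterTargetArn: String(parsed.deadLetterTargetArn),
    maxReceiveCount: Number(parsed.maxReceiveCount),
  };
}
```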

Consumer: Handling Partial Batch Failures

With ReportBatchItemFailures, your Lambda can fail specific messages in a batch without failing the entire batch:

// functions/order-processor/handler.ts
import { SQSHandler, SQSBatchResponse } from "aws-lambda";
import { db } from "../../lib/db"; // Prisma client (import path assumed)

export const handler: SQSHandler = async (event): Promise<SQSBatchResponse> => {
  const batchItemFailures: SQSBatchResponse["batchItemFailures"] = [];

  await Promise.allSettled(
    event.Records.map(async (record) => {
      try {
        const body = JSON.parse(record.body);
        await processOrder(body);
      } catch (error) {
        console.error(`Failed to process message ${record.messageId}:`, error);

        // Only report as failure if it's retriable
        // Permanent failures (bad data) should be handled differently
        if (isRetriableError(error)) {
          batchItemFailures.push({ itemIdentifier: record.messageId });
        } else {
          // Permanent failure — log, alert, but don't retry
          await recordPermanentFailure(record, error);
          // Don't add to batchItemFailures — message will be deleted from queue
        }
      }
    })
  );

  return { batchItemFailures };
};

function isRetriableError(error: unknown): boolean {
  if (error instanceof Error) {
    // Network errors, timeouts, rate limits → retry
    if (error.message.includes("ECONNRESET")) return true;
    if (error.message.includes("timeout")) return true;
    if (error.message.includes("rate limit")) return true;
    if (error.message.includes("503")) return true;

    // Business logic errors → don't retry (would always fail)
    if (error.message.includes("ValidationError")) return false;
    if (error.message.includes("NOT_FOUND")) return false;
    if (error.message.includes("DUPLICATE")) return false;
  }
  return true; // Default: retry unknown errors
}

async function processOrder(body: { orderId: string; [key: string]: unknown }) {
  // Your business logic
  const order = await db.order.findUnique({ where: { id: body.orderId } });
  
  if (!order) {
    throw new Error(`NOT_FOUND: order ${body.orderId}`);
  }

  // ... process the order
}

async function recordPermanentFailure(
  record: { messageId: string; body: string; attributes: { ApproximateReceiveCount: string } },
  error: unknown
) {
  // Persist to DB for human review
  await db.queueFailure.create({
    data: {
      queueName: "order-processing",
      messageId: record.messageId,
      body: record.body,
      errorMessage: error instanceof Error ? error.message : String(error),
      receiveCount: parseInt(record.attributes.ApproximateReceiveCount, 10),
      failedAt: new Date(),
    },
  });
}
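Matching on error message strings, as isRetriableError does above, works but is brittle: a reworded message silently changes retry behavior. An alternative sketch using typed error classes (the class names are our own):

```typescript
// Alternative sketch: typed error classes instead of matching on message
// strings. Throw RetriableError for transient faults (network, 503, rate
// limit) and PermanentError for business-rule failures; classification
// then becomes an instanceof check instead of fragile string matching.

class RetriableError extends Error {}
class PermanentError extends Error {}

function isRetriable(error: unknown): boolean {
  if (error instanceof PermanentError) return false;
  if (error instanceof RetriableError) return true;
  return true; // unknown errors default to retry, as above
}
```

With this approach, processOrder would throw new PermanentError(`order ${body.orderId} not found`) instead of encoding NOT_FOUND in the message string.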

DLQ Inspector Worker

A separate worker that reads from the DLQ, classifies failures, and takes action:

// lib/queues/dlq-inspector.ts
import {
  SQSClient,
  ReceiveMessageCommand,
  DeleteMessageCommand,
  SendMessageCommand,
} from "@aws-sdk/client-sqs";
import { db } from "../db"; // Prisma client (import path assumed)

const sqs = new SQSClient({ region: process.env.AWS_REGION });

interface DLQMessage {
  messageId: string;
  receiptHandle: string;
  body: unknown;
  attributes: {
    ApproximateReceiveCount: string;
    SentTimestamp: string;
  };
}

export async function inspectDLQ(dlqUrl: string, sourceQueueUrl: string) {
  const response = await sqs.send(
    new ReceiveMessageCommand({
      QueueUrl: dlqUrl,
      MaxNumberOfMessages: 10,
      AttributeNames: ["All"],
      WaitTimeSeconds: 5,
    })
  );

  const messages = response.Messages ?? [];
  if (messages.length === 0) return { processed: 0 };

  let redriven = 0;
  let discarded = 0;

  for (const message of messages) {
    const msg: DLQMessage = {
      messageId: message.MessageId!,
      receiptHandle: message.ReceiptHandle!,
      body: JSON.parse(message.Body ?? "{}"),
      attributes: message.Attributes as DLQMessage["attributes"],
    };

    const decision = await classifyDLQMessage(msg);

    if (decision === "redrive") {
      // Send back to source queue for reprocessing
      await sqs.send(
        new SendMessageCommand({
          QueueUrl: sourceQueueUrl,
          MessageBody: message.Body!,
          MessageAttributes: {
            RedriveCount: {
              DataType: "Number",
              StringValue: "1",
            },
          },
        })
      );
      redriven++;
    } else {
      discarded++;
    }

    // Delete from DLQ regardless of decision
    await sqs.send(
      new DeleteMessageCommand({
        QueueUrl: dlqUrl,
        ReceiptHandle: message.ReceiptHandle!,
      })
    );
  }

  return { processed: messages.length, redriven, discarded };
}

async function classifyDLQMessage(msg: DLQMessage): Promise<"redrive" | "discard"> {
  // Check if underlying data issue is now fixed
  const body = msg.body as any;

  if (body.orderId) {
    const order = await db.order.findUnique({ where: { id: body.orderId } });
    if (!order) return "discard"; // Order deleted — no point redriving
  }

  const receiveCount = parseInt(msg.attributes.ApproximateReceiveCount, 10);
  if (receiveCount > 10) return "discard"; // Too many attempts — give up

  return "redrive";
}
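The inspector stamps a RedriveCount message attribute on everything it sends back. Tracking that count prevents a message from ping-ponging between the source queue and the DLQ forever. A sketch of the counting logic (the cap of 3 is an assumption; tune it per queue):

```typescript
// Sketch: read the RedriveCount attribute set by the inspector above,
// increment it on each manual redrive, and give up past a cap so a broken
// message cannot cycle between the source queue and the DLQ indefinitely.
// (The cap of 3 is an assumption.)

type MessageAttributes = Record<string, { StringValue?: string }> | undefined;

export function nextRedriveCount(attrs: MessageAttributes, cap: number = 3): number | null {
  const current = parseInt(attrs?.RedriveCount?.StringValue ?? "0", 10);
  const next = current + 1;
  return next > cap ? null : next; // null means discard instead of redriving
}
```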

Manual Redrive via AWS Console / API

AWS provides a native redrive capability (no need to write custom code):

// lib/queues/redrive.ts
import {
  SQSClient,
  StartMessageMoveTaskCommand,
  ListMessageMoveTasksCommand,
} from "@aws-sdk/client-sqs";

const sqs = new SQSClient({ region: process.env.AWS_REGION });

/**
 * Redrive all messages from DLQ back to source queue.
 * Uses AWS native StartMessageMoveTask (no Lambda needed).
 */
export async function startRedrive(
  dlqArn: string,
  sourceQueueArn: string,
  maxMessagesPerSecond: number = 50  // Throttle to avoid overwhelming consumer
) {
  const { TaskHandle } = await sqs.send(
    new StartMessageMoveTaskCommand({
      SourceArn: dlqArn,
      DestinationArn: sourceQueueArn,
      MaxNumberOfMessagesPerSecond: maxMessagesPerSecond,
    })
  );

  return TaskHandle;
}

export async function checkRedriveStatus(sourceArn: string) {
  const { Results } = await sqs.send(
    new ListMessageMoveTasksCommand({ SourceArn: sourceArn })
  );

  return Results?.map((r) => ({
    taskHandle: r.TaskHandle,
    status: r.Status,
    movedMessages: r.ApproximateNumberOfMessagesMoved,
    remainingMessages: r.ApproximateNumberOfMessagesToMove,
    failureReason: r.FailureReason,
  }));
}
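StartMessageMoveTask is asynchronous: the call returns immediately while SQS moves messages in the background. A sketch of a completion check built on the Status values the ListMessageMoveTasks API reports (RUNNING, COMPLETED, CANCELLING, CANCELLED, FAILED):

```typescript
// Sketch: a redrive task runs in the background after StartMessageMoveTask
// returns. ListMessageMoveTasks reports one of these Status values; a
// poller only needs to know which of them are terminal.

type MoveTaskStatus = "RUNNING" | "COMPLETED" | "CANCELLING" | "CANCELLED" | "FAILED";

export function isTerminal(status: MoveTaskStatus): boolean {
  return status === "COMPLETED" || status === "CANCELLED" || status === "FAILED";
}
```

A cron job or CLI wrapper can call checkRedriveStatus above in a loop and stop once every task's status is terminal.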

DLQ Monitoring Dashboard

// app/api/admin/dlq-status/route.ts
import { SQSClient, GetQueueAttributesCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({ region: process.env.AWS_REGION });

const QUEUES = [
  { name: "Order Processing", dlqUrl: process.env.ORDER_DLQ_URL! },
  { name: "Email Delivery",   dlqUrl: process.env.EMAIL_DLQ_URL! },
  { name: "Webhook Events",   dlqUrl: process.env.WEBHOOK_DLQ_URL! },
];

export async function GET() {
  const statuses = await Promise.all(
    QUEUES.map(async (queue) => {
      const attrs = await sqs.send(
        new GetQueueAttributesCommand({
          QueueUrl: queue.dlqUrl,
          AttributeNames: [
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
          ],
        })
      );

      const visible = parseInt(attrs.Attributes?.ApproximateNumberOfMessages ?? "0", 10);
      const inFlight = parseInt(attrs.Attributes?.ApproximateNumberOfMessagesNotVisible ?? "0", 10);

      return {
        name: queue.name,
        dlqUrl: queue.dlqUrl,
        messagesVisible: visible,
        messagesInFlight: inFlight,
        total: visible + inFlight,
        status: visible + inFlight > 0 ? "needs_attention" : "healthy",
      };
    })
  );

  const hasIssues = statuses.some((s) => s.status === "needs_attention");

  return Response.json({
    queues: statuses,
    overallStatus: hasIssues ? "degraded" : "healthy",
    checkedAt: new Date().toISOString(),
  });
}

Choosing maxReceiveCount

| Use case | Recommended maxReceiveCount | Reasoning |
| --- | --- | --- |
| Idempotent, fast processing | 3 | Quick to DLQ on a real bug |
| External API calls (flaky) | 5–7 | Allow for transient failures |
| Long processing (5+ min) | 2–3 | Visibility timeout complexity |
| Critical financial operations | 5 | Balance retry vs. duplicate risk |
| Webhook delivery | 7 | Target server may be down temporarily |

Cost and Timeline

| Component | Timeline | Cost (USD) |
| --- | --- | --- |
| DLQ + redrive Terraform | 0.5 day | $300–$500 |
| Partial batch failure handling | 0.5–1 day | $400–$800 |
| DLQ inspector worker | 1 day | $600–$1,000 |
| CloudWatch alarms + alerting | 0.5 day | $300–$500 |
| DLQ dashboard UI | 1 day | $600–$1,000 |
| Full DLQ system | 3–5 days | $3,000–$5,000 |

SQS cost for the DLQ itself is negligible: you're only charged for messages that actually land in the DLQ (i.e., failures), which should be rare.


Working With Viprasol

We design and implement SQS queue architectures for production workloads—from basic queues through complex multi-queue pipelines with DLQ monitoring, alerting, and automated redrive. Our cloud team has built queue systems handling millions of messages per day.

What we deliver:

  • SQS + DLQ Terraform modules with redrive policy
  • Partial batch failure handling in Lambda consumers
  • Failure classification (retriable vs permanent)
  • CloudWatch alarms with SNS/PagerDuty integration
  • DLQ inspector and redrive tooling

See our cloud infrastructure services or contact us to design your SQS architecture.

About the Author

Viprasol Tech Team

Custom Software Development Specialists

The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.

MT4/MT5 EA Development · AI Agent Systems · SaaS Development · Algorithmic Trading
