
AWS RDS Aurora: Serverless v2, Read Replicas, Failover, and Performance Insights

Deploy production PostgreSQL on AWS Aurora: Serverless v2 auto-scaling, read replica routing, failover testing, Performance Insights query analysis, and Terraform IaC for Aurora clusters.

Viprasol Tech Team
November 15, 2026
13 min read

Aurora is PostgreSQL-compatible, but it's not just managed PostgreSQL. The storage layer is fundamentally different — data is distributed across six copies in three AZs, with writes acknowledged once four of the six copies confirm — giving you 99.99% availability and point-in-time recovery without running your own replication. Aurora Serverless v2 adds automatic capacity scaling from 0.5 to 256 ACUs, making it cost-effective for variable workloads that would otherwise require over-provisioned instances.
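The quorum arithmetic is worth internalizing. As a toy model (illustrative only — not Aurora's actual implementation), writes need 4 of 6 acknowledgements and reads need 3 of 6, so the two quorums always overlap and losing an entire AZ (two copies) blocks neither:

```typescript
// Illustrative model of Aurora's storage quorum (not the real implementation):
// 6 copies across 3 AZs, write quorum of 4, read quorum of 3.
const COPIES = 6;
const WRITE_QUORUM = 4;
const READ_QUORUM = 3;

// Can the cluster still accept writes / serve reads with `failedCopies` down?
export function canWrite(failedCopies: number): boolean {
  return COPIES - failedCopies >= WRITE_QUORUM;
}
export function canRead(failedCopies: number): boolean {
  return COPIES - failedCopies >= READ_QUORUM;
}

// Losing one AZ = losing 2 of 6 copies: writes still have quorum.
console.log(canWrite(2)); // true  — 4 copies remain, meets write quorum
console.log(canRead(3));  // true  — 3 copies remain, meets read quorum
console.log(canWrite(3)); // false — writes pause until storage self-heals
```

Because 4 + 3 > 6, any read quorum is guaranteed to include at least one copy from the latest successful write.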

This post covers production Aurora setup: Serverless v2 cluster configuration, read replica routing, failover behavior and testing, Performance Insights for query analysis, and Terraform IaC.

Aurora vs Standard RDS PostgreSQL

| Feature | Aurora PostgreSQL | RDS PostgreSQL |
|---|---|---|
| Storage | Distributed (6 copies, 3 AZs) | Single-AZ EBS (Multi-AZ = 2 copies) |
| Failover time | ~30 seconds | ~60–120 seconds |
| Read replicas | Up to 15 (in-cluster) | Up to 15 (separate instances, separate storage) |
| Replica lag | ~10–20ms | ~100ms–several seconds |
| Serverless option | v2 (sub-minute scaling) | No |
| Storage auto-scaling | Yes (to 128 TiB) | Yes (to 64 TiB) |
| Cost | ~20% higher than RDS | Baseline |
| Global databases | Yes (cross-region) | No (manual replication) |

1. Aurora Serverless v2 Cluster (Terraform)

# infrastructure/aurora/main.tf

locals {
  cluster_identifier = "${var.project}-${var.environment}-aurora"
  db_name           = replace(var.project, "-", "_")
}

# Subnet group
resource "aws_db_subnet_group" "aurora" {
  name       = "${local.cluster_identifier}-subnet-group"
  subnet_ids = var.private_subnet_ids

  tags = {
    Name        = "${local.cluster_identifier}-subnet-group"
    Environment = var.environment
  }
}

# Security group: only accept connections from ECS tasks / Lambda
resource "aws_security_group" "aurora" {
  name        = "${local.cluster_identifier}-sg"
  description = "Aurora cluster security group"
  vpc_id      = var.vpc_id

  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = var.allowed_security_group_ids
    description     = "PostgreSQL from app tier"
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = { Name = "${local.cluster_identifier}-sg" }
}

# Parameter group for PostgreSQL tuning
resource "aws_rds_cluster_parameter_group" "aurora" {
  family      = "aurora-postgresql16"
  name        = "${local.cluster_identifier}-params"
  description = "Custom parameter group for ${local.cluster_identifier}"

  parameter {
    name         = "shared_preload_libraries"
    value        = "pg_stat_statements,auto_explain"
    apply_method = "pending-reboot"  # Static parameter — takes effect after restart
  }
  parameter {
    name  = "log_min_duration_statement"
    value = "1000"  # Log queries > 1 second
  }
  parameter {
    name  = "auto_explain.log_min_duration"
    value = "500"   # Auto-explain queries > 500ms
  }
  parameter {
    name  = "auto_explain.log_analyze"
    value = "1"
  }
  parameter {
    name  = "work_mem"
    value = "65536"  # 64MB per sort/hash operation
  }
  parameter {
    name  = "random_page_cost"
    value = "1.0"   # Aurora storage is fast; default 4.0 is too conservative
  }
}

# Aurora cluster
resource "aws_rds_cluster" "main" {
  cluster_identifier     = local.cluster_identifier
  engine                 = "aurora-postgresql"
  engine_version         = "16.4"
  database_name          = local.db_name
  master_username        = "postgres"
  manage_master_user_password = true  # AWS Secrets Manager rotation

  db_subnet_group_name   = aws_db_subnet_group.aurora.name
  vpc_security_group_ids = [aws_security_group.aurora.id]
  db_cluster_parameter_group_name = aws_rds_cluster_parameter_group.aurora.name

  # Serverless v2 capacity range
  serverlessv2_scaling_configuration {
    min_capacity = var.environment == "prod" ? 2.0 : 0.5  # ACUs
    max_capacity = var.environment == "prod" ? 64.0 : 8.0
  }

  # Backup and recovery
  backup_retention_period   = var.environment == "prod" ? 14 : 7
  preferred_backup_window   = "03:00-04:00"  # UTC
  deletion_protection       = var.environment == "prod"
  skip_final_snapshot       = var.environment != "prod"
  final_snapshot_identifier = var.environment == "prod" ? "${local.cluster_identifier}-final" : null

  # Storage encryption
  storage_encrypted = true
  kms_key_id        = var.kms_key_arn

  # CloudWatch logging
  enabled_cloudwatch_logs_exports = ["postgresql"]

  # Performance Insights (covered later)
  # Enabled at instance level

  # Apply immediately for non-prod
  apply_immediately = var.environment != "prod"

  tags = {
    Project     = var.project
    Environment = var.environment
    ManagedBy   = "terraform"
  }

  lifecycle {
    ignore_changes = [
      # Don't recreate cluster if master password changes (managed by Secrets Manager)
      master_password,
    ]
  }
}

# Writer instance (Serverless v2)
resource "aws_rds_cluster_instance" "writer" {
  identifier         = "${local.cluster_identifier}-writer"
  cluster_identifier = aws_rds_cluster.main.id
  instance_class     = "db.serverless"  # Required for Serverless v2
  engine             = aws_rds_cluster.main.engine
  engine_version     = aws_rds_cluster.main.engine_version

  db_subnet_group_name = aws_db_subnet_group.aurora.name
  # Note: the cluster parameter group is attached at the cluster level above.
  # An instance-level override would need a separate aws_db_parameter_group.

  performance_insights_enabled          = true
  performance_insights_retention_period = var.environment == "prod" ? 731 : 7  # days
  performance_insights_kms_key_id       = var.kms_key_arn

  monitoring_interval = 60  # Enhanced Monitoring every 60 seconds
  monitoring_role_arn = aws_iam_role.rds_monitoring.arn

  auto_minor_version_upgrade = true

  tags = {
    Project     = var.project
    Environment = var.environment
    Role        = "writer"
  }
}

# Read replica (Serverless v2)
resource "aws_rds_cluster_instance" "reader" {
  count = var.environment == "prod" ? 2 : 1

  identifier         = "${local.cluster_identifier}-reader-${count.index + 1}"
  cluster_identifier = aws_rds_cluster.main.id
  instance_class     = "db.serverless"
  engine             = aws_rds_cluster.main.engine
  engine_version     = aws_rds_cluster.main.engine_version

  db_subnet_group_name = aws_db_subnet_group.aurora.name

  performance_insights_enabled          = true
  performance_insights_retention_period = var.environment == "prod" ? 731 : 7
  performance_insights_kms_key_id       = var.kms_key_arn

  promotion_tier = count.index + 1  # Lower = promoted first on failover

  tags = {
    Project     = var.project
    Environment = var.environment
    Role        = "reader-${count.index + 1}"
  }
}

# Outputs
output "writer_endpoint" {
  value = aws_rds_cluster.main.endpoint  # Always points to current writer
}

output "reader_endpoint" {
  value = aws_rds_cluster.main.reader_endpoint  # Load-balanced across replicas
}

output "cluster_identifier" {
  value = aws_rds_cluster.main.cluster_identifier
}

☁️ Is Your Cloud Costing Too Much?

Most teams overspend 30–40% on cloud — wrong instance types, no reserved pricing, bloated storage. We audit, right-size, and automate your infrastructure.

  • AWS, GCP, Azure certified engineers
  • Infrastructure as Code (Terraform, CDK)
  • Docker, Kubernetes, GitHub Actions CI/CD
  • Typical audit recovers $500–$3,000/month in savings

2. Read Replica Routing in Application Code

Aurora provides two endpoints: writer (single instance, always the current primary) and reader (load-balanced across all replicas). Route read-heavy queries to the reader endpoint.

// src/lib/db/aurora-client.ts
import { PrismaClient } from '@prisma/client';

// Writer: for mutations and strongly consistent reads
export const dbWriter = new PrismaClient({
  datasourceUrl: process.env.DATABASE_URL_WRITER, // Aurora cluster endpoint
  log: process.env.NODE_ENV === 'development' ? ['query', 'warn', 'error'] : ['warn', 'error'],
});

// Reader: for analytics, reports, non-critical reads
// Aurora reader endpoint load-balances across all replicas
export const dbReader = new PrismaClient({
  datasourceUrl: process.env.DATABASE_URL_READER, // Aurora reader endpoint
});

// Helper: use reader for queries, writer for mutations
export async function readQuery<T>(fn: (db: PrismaClient) => Promise<T>): Promise<T> {
  return fn(dbReader);
}

export async function writeQuery<T>(fn: (db: PrismaClient) => Promise<T>): Promise<T> {
  return fn(dbWriter);
}

// Usage:
// const users = await readQuery(db => db.user.findMany({ take: 100 }));
// await writeQuery(db => db.user.create({ data: { ... } }));

# .env
# Writer endpoint (cluster endpoint — auto-fails over to new primary)
DATABASE_URL_WRITER="postgresql://postgres:${PASSWORD}@${CLUSTER_ENDPOINT}:5432/myapp"

# Reader endpoint (load-balanced across replicas)
DATABASE_URL_READER="postgresql://postgres:${PASSWORD}@${READER_ENDPOINT}:5432/myapp"

Connection Pooling with RDS Proxy

Aurora Serverless v2 supports thousands of connections (max_connections scales with ACU capacity, up to 5,000), but per-connection overhead still adds up — especially with Lambda, where each concurrent invocation can open its own connection. Use RDS Proxy (an AWS-managed connection pooler that fills the same role as PgBouncer) or run PgBouncer yourself.

# RDS Proxy for connection pooling
resource "aws_db_proxy" "main" {
  name                   = "${local.cluster_identifier}-proxy"
  debug_logging          = false
  engine_family          = "POSTGRESQL"
  idle_client_timeout    = 1800
  require_tls            = true
  role_arn               = aws_iam_role.rds_proxy.arn
  vpc_security_group_ids = [aws_security_group.aurora.id]
  vpc_subnet_ids         = var.private_subnet_ids

  auth {
    auth_scheme = "SECRETS"
    iam_auth    = "DISABLED"
    secret_arn  = aws_rds_cluster.main.master_user_secret[0].secret_arn
  }

  tags = { Name = "${local.cluster_identifier}-proxy" }
}

resource "aws_db_proxy_default_target_group" "main" {
  db_proxy_name = aws_db_proxy.main.name

  connection_pool_config {
    connection_borrow_timeout    = 120  # seconds to wait for a connection
    max_connections_percent      = 90   # Use up to 90% of max_connections
    max_idle_connections_percent = 50
  }
}

resource "aws_db_proxy_target" "main" {
  db_cluster_identifier = aws_rds_cluster.main.cluster_identifier
  db_proxy_name         = aws_db_proxy.main.name
  target_group_name     = aws_db_proxy_default_target_group.main.name
}
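A quick sanity check on the pool settings above. This sketch assumes your cluster's effective max_connections is 5,000 (for Serverless v2 it is derived from max ACU memory — verify with `SHOW max_connections;` on your cluster); the 90% and 50% figures are the Terraform values above:

```typescript
// Back-of-envelope budget for RDS Proxy's connection_pool_config.
// ASSUMPTION: maxConnections = 5000 is illustrative — check your cluster.
export function proxyPoolBudget(
  maxConnections: number,
  maxConnectionsPercent: number,     // max_connections_percent
  maxIdleConnectionsPercent: number  // max_idle_connections_percent
) {
  // Connections the proxy may open against the database at peak
  const borrowable = Math.floor((maxConnections * maxConnectionsPercent) / 100);
  // Idle connections the proxy keeps warm before closing extras
  const idleCeiling = Math.floor((maxConnections * maxIdleConnectionsPercent) / 100);
  return { borrowable, idleCeiling };
}

const budget = proxyPoolBudget(5000, 90, 50);
console.log(budget); // { borrowable: 4500, idleCeiling: 2500 }
```

Leaving 10% of max_connections outside the proxy's reach keeps headroom for admin sessions and migrations that connect directly.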

3. Failover Testing

Aurora promotes a read replica to writer when the current writer fails. Test this in non-production environments to verify your application handles the ~30-second reconnection window.

#!/usr/bin/env bash
# scripts/test-aurora-failover.sh
set -euo pipefail

CLUSTER_ID="${1:?Usage: test-aurora-failover.sh <cluster-id>}"
REGION="${AWS_REGION:-us-east-1}"

get_writer() {
  aws rds describe-db-clusters \
    --db-cluster-identifier "$CLUSTER_ID" \
    --region "$REGION" \
    --query 'DBClusters[0].DBClusterMembers[?IsClusterWriter==`true`].DBInstanceIdentifier' \
    --output text
}

OLD_WRITER=$(get_writer)
echo "Initiating Aurora failover for cluster: $CLUSTER_ID (current writer: $OLD_WRITER)"
echo "Watch your application logs — there should be a ~30s reconnection window"

# Trigger failover (promotes the replica with the lowest promotion_tier)
aws rds failover-db-cluster \
  --db-cluster-identifier "$CLUSTER_ID" \
  --region "$REGION"

# The cluster status often reads "available" even mid-failover, so poll
# for the writer identity to change rather than relying on status alone.
echo "Monitoring failover status..."
while true; do
  WRITER=$(get_writer)
  echo "$(date): Writer=$WRITER"

  if [ -n "$WRITER" ] && [ "$WRITER" != "$OLD_WRITER" ]; then
    echo "Failover complete! New writer: $WRITER"
    break
  fi

  sleep 5
done

Application Retry Logic for Failover

// src/lib/db/retry.ts
import { PrismaClient, Prisma } from '@prisma/client';

// Aurora failover causes ~30s of connection errors
// This retry wrapper handles the reconnection window
const FAILOVER_ERROR_CODES = [
  'P1001',  // Can't reach database
  'P1002',  // Database connection timed out
  'P1008',  // Operations timed out
  'P1017',  // Server closed connection unexpectedly
];

export async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries: number = 5,
  baseDelayMs: number = 1000
): Promise<T> {
  let lastError: Error;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      lastError = err;

      const isRetryable =
        err instanceof Prisma.PrismaClientKnownRequestError &&
        FAILOVER_ERROR_CODES.includes(err.code);

      const isConnectionError =
        err.message?.includes('ECONNREFUSED') ||
        err.message?.includes('Connection terminated') ||
        err.message?.includes('read ECONNRESET');

      if (!isRetryable && !isConnectionError) throw err;
      if (attempt === maxRetries) throw err;

      const delay = baseDelayMs * Math.pow(2, attempt) + Math.random() * 1000;
      console.warn(`DB error (attempt ${attempt + 1}/${maxRetries + 1}), retrying in ${Math.round(delay)}ms:`, err.message);
      await new Promise((r) => setTimeout(r, delay));
    }
  }

  throw lastError!;
}
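Before wiring the wrapper into production code, it helps to sanity-check its behavior locally. This standalone sketch uses a simplified, Prisma-free copy of the retry loop and a fake call that fails twice (simulating the failover window) before succeeding:

```typescript
// Simplified, dependency-free copy of withRetry for a local sanity check.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 5,
  baseDelayMs = 10 // short delays for the demo; use ~1000ms in production
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) throw err;
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}

// Simulate a failover: the first two calls throw, the third succeeds.
let calls = 0;
const flaky = async (): Promise<string> => {
  calls++;
  if (calls < 3) throw new Error('read ECONNRESET');
  return 'ok';
};

const result = await withRetry(flaky);
console.log(result, calls); // "ok" 3
```

In the real wrapper the non-retryable path matters just as much: a constraint violation or syntax error should surface immediately rather than burn through the backoff schedule.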

⚙️ DevOps Done Right — Zero Downtime, Full Automation

Ship faster without breaking things. We build CI/CD pipelines, monitoring stacks, and auto-scaling infrastructure that your team can actually maintain.

  • Staging + production environments with feature flags
  • Automated security scanning in the pipeline
  • Uptime monitoring + alerting + runbook automation
  • On-call support handover docs included

4. Performance Insights

Performance Insights shows DB load by wait event, SQL query, and user — identifying bottlenecks without needing direct database access.

Read Performance Insights via AWS SDK

// src/scripts/analyze-performance.ts
import {
  PIClient,
  DescribeDimensionKeysCommand,
} from '@aws-sdk/client-pi';

const pi = new PIClient({ region: 'us-east-1' });

const RESOURCE_ID = process.env.PI_RESOURCE_ID!; // dbi-xxxxx from RDS console

// Get top SQL queries by average active sessions (last hour)
async function getTopQueries() {
  const endTime = new Date();
  const startTime = new Date(endTime.getTime() - 60 * 60 * 1000);

  const { Keys } = await pi.send(
    new DescribeDimensionKeysCommand({
      ServiceType: 'RDS',
      Identifier: RESOURCE_ID,
      StartTime: startTime,
      EndTime: endTime,
      Metric: 'db.load.avg',
      GroupBy: {
        Group: 'db.sql',
        Dimensions: ['db.sql.statement'],
        Limit: 10,
      },
    })
  );

  console.log('\n📊 Top 10 Queries by DB Load (last hour):');
  Keys?.forEach((key, i) => {
    const stmt = key.Dimensions?.['db.sql.statement'] ?? 'N/A';
    const load = key.Total?.toFixed(4) ?? '0';
    console.log(`\n${i + 1}. Load: ${load} avg active sessions`);
    console.log(`   SQL: ${stmt.slice(0, 200)}...`);
  });
}

// Get wait events (identify bottlenecks: LWLock, I/O, lock contention)
async function getWaitEvents() {
  const endTime = new Date();
  const startTime = new Date(endTime.getTime() - 60 * 60 * 1000);

  const { Keys } = await pi.send(
    new DescribeDimensionKeysCommand({
      ServiceType: 'RDS',
      Identifier: RESOURCE_ID,
      StartTime: startTime,
      EndTime: endTime,
      Metric: 'db.load.avg',
      GroupBy: {
        Group: 'db.wait_event',
        Dimensions: ['db.wait_event.name'],
        Limit: 10,
      },
    })
  );

  console.log('\n⏳ Top Wait Events:');
  Keys?.forEach((key) => {
    const event = key.Dimensions?.['db.wait_event.name'] ?? 'N/A';
    const load = key.Total?.toFixed(4) ?? '0';
    console.log(`  ${event}: ${load}`);
  });
}

await getTopQueries();
await getWaitEvents();

Cost Reference

| Configuration | Monthly Cost (us-east-1) | Use Case |
|---|---|---|
| Serverless v2, 0.5–8 ACU (dev) | $20–120/mo (usage-based) | Development / staging |
| Serverless v2, 2–32 ACU + 1 reader (prod) | $300–900/mo | Small–medium SaaS |
| db.r8g.2xlarge + 1 reader | ~$1,400/mo | High-throughput, predictable load |
| db.r8g.4xlarge (Global DB) | ~$4,200/mo | Multi-region with cross-region reads |
| + RDS Proxy | +$0.015/ACU-hour, 8-ACU minimum (~$88/mo) | Connection pooling |
| + Performance Insights (7 days) | Free | Query analysis |
| + Performance Insights (2 years) | +$0.02/vCPU/hour | Long-term analysis |
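To sanity-check Serverless v2 figures against your own workload, compute is billed per ACU-hour. This sketch assumes roughly $0.12/ACU-hour (us-east-1, Aurora Standard, at the time of writing — verify against current AWS pricing) and ignores storage and I/O:

```typescript
// Back-of-envelope Serverless v2 compute cost.
// ASSUMPTION: ~$0.12 per ACU-hour (us-east-1) — check current AWS pricing.
const ACU_HOUR_USD = 0.12;
const HOURS_PER_MONTH = 730;

export function monthlyComputeCost(avgAcus: number, instances = 1): number {
  return +(avgAcus * instances * ACU_HOUR_USD * HOURS_PER_MONTH).toFixed(2);
}

// A dev cluster idling near its 0.5 ACU floor:
console.log(monthlyComputeCost(0.5)); // 43.8
// A prod writer + one reader averaging 4 ACUs each:
console.log(monthlyComputeCost(4, 2)); // 700.8
```

Note the floor matters: a prod cluster with min_capacity = 2 pays for 2 ACUs per instance around the clock even when idle.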

Working With Viprasol

Migrating from self-managed PostgreSQL or standard RDS to Aurora, or right-sizing an existing Aurora cluster that's over-provisioned? We design Aurora architectures — Serverless v2 for variable workloads, provisioned for steady high-throughput, Global Databases for multi-region — with Terraform IaC, read replica routing, and Performance Insights dashboards.

Talk to our team → | Explore our cloud solutions →

About the Author

Viprasol Tech Team

Custom Software Development Specialists

The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.

MT4/MT5 EA Development · AI Agent Systems · SaaS Development · Algorithmic Trading

Need DevOps & Cloud Expertise?

Scale your infrastructure with confidence. AWS, GCP, Azure certified team.

Free consultation • No commitment • Response within 24 hours
