AWS Aurora 2026: Serverless v2, Replicas, Failover

Q: What is AWS Aurora 2026?

> Quick answer. Amazon Aurora is PostgreSQL/MySQL-compatible but re-architected: storage spans six copies across three Availability Zones and writes commit after four confirm, giving 99.99% availability and fast point-in-time recovery. Choose Aurora for bursty or mission-critical workloads (Serverless v2 auto-scales); choose plain RDS for predictable, cost-sensitive ones.

AWS RDS vs Aurora in 2026: Architecture, Cost, and When to Choose Each

Quick answer. Amazon Aurora is PostgreSQL/MySQL-compatible but re-architected: storage spans six copies across three Availability Zones and writes commit after four confirm, giving 99.99% availability and fast point-in-time recovery. Choose Aurora for bursty or mission-critical workloads (Serverless v2 auto-scales); choose plain RDS for predictable, cost-sensitive ones.

Aurora is PostgreSQL-compatible, but it's not just managed PostgreSQL. The storage layer is fundamentally different — distributed across six copies in three AZs, with writes acknowledged after four copies confirm — giving you 5 nines availability and point-in-time recovery without running your own replication. Aurora Serverless v2 adds automatic capacity scaling from 0.5 to 256 ACUs, making it cost-effective for variable workloads that would otherwise require over-provisioned instances.

This post covers production Aurora setup: Serverless v2 cluster configuration, read replica routing, failover behavior and testing, Performance Insights for query analysis, and Terraform IaC.

Aurora vs Standard RDS PostgreSQL

Feature	Aurora PostgreSQL	RDS PostgreSQL
Storage	Distributed (6 copies, 3 AZs)	Single-AZ EBS (Multi-AZ = 2 copies)
Failover time	~30 seconds	~60–120 seconds
Read replicas	Up to 15 (in-cluster)	Up to 5 (separate instances)
Replica lag	~10–20ms	~100ms–several seconds
Serverless option	v2 (sub-minute scaling)	No
Storage auto-scaling	Yes (to 128TB)	Yes (to 64TB)
Cost	~20% higher than RDS	Baseline
Global databases	Yes (cross-region)	No (manual replication)

1. Aurora Serverless v2 Cluster (Terraform)

# infrastructure/aurora/main.tf

locals {
  cluster_identifier = "${var.project}-${var.environment}-aurora"
  db_name           = replace(var.project, "-", "_")
}

# Subnet group
resource "aws_db_subnet_group" "aurora" {
  name       = "${local.cluster_identifier}-subnet-group"
  subnet_ids = var.private_subnet_ids

  tags = {
    Name        = "${local.cluster_identifier}-subnet-group"
    Environment = var.environment
  }
}

# Security group: only accept connections from ECS tasks / Lambda
resource "aws_security_group" "aurora" {
  name        = "${local.cluster_identifier}-sg"
  description = "Aurora cluster security group"
  vpc_id      = var.vpc_id

  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = var.allowed_security_group_ids
    description     = "PostgreSQL from app tier"
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = { Name = "${local.cluster_identifier}-sg" }
}

# Parameter group for PostgreSQL tuning
resource "aws_rds_cluster_parameter_group" "aurora" {
  family      = "aurora-postgresql16"
  name        = "${local.cluster_identifier}-params"
  description = "Custom parameter group for ${local.cluster_identifier}"

  parameter {
    name  = "shared_preload_libraries"
    value = "pg_stat_statements,auto_explain"
  }
  parameter {
    name  = "log_min_duration_statement"
    value = "1000"  # Log queries > 1 second
  }
  parameter {
    name  = "auto_explain.log_min_duration"
    value = "500"   # Auto-explain queries > 500ms
  }
  parameter {
    name  = "auto_explain.log_analyze"
    value = "1"
  }
  parameter {
    name  = "work_mem"
    value = "65536"  # 64MB per sort/hash operation
  }
  parameter {
    name  = "random_page_cost"
    value = "1.0"   # Aurora storage is fast; default 4.0 is too conservative
  }
}

# Aurora cluster
resource "aws_rds_cluster" "main" {
  cluster_identifier     = local.cluster_identifier
  engine                 = "aurora-postgresql"
  engine_version         = "16.4"
  database_name          = local.db_name
  master_username        = "postgres"
  manage_master_user_password = true  # AWS Secrets Manager rotation

  db_subnet_group_name   = aws_db_subnet_group.aurora.name
  vpc_security_group_ids = [aws_security_group.aurora.id]
  db_cluster_parameter_group_name = aws_rds_cluster_parameter_group.aurora.name

  # Serverless v2 capacity range
  serverlessv2_scaling_configuration {
    min_capacity = var.environment == "prod" ? 2.0 : 0.5  # ACUs
    max_capacity = var.environment == "prod" ? 64.0 : 8.0
  }

  # Backup and recovery
  backup_retention_period   = var.environment == "prod" ? 14 : 7
  preferred_backup_window   = "03:00-04:00"  # UTC
  deletion_protection       = var.environment == "prod"
  skip_final_snapshot       = var.environment != "prod"
  final_snapshot_identifier = var.environment == "prod" ? "${local.cluster_identifier}-final" : null

  # Storage encryption
  storage_encrypted = true
  kms_key_id        = var.kms_key_arn

  # CloudWatch logging
  enabled_cloudwatch_logs_exports = ["postgresql"]

  # Performance Insights (covered later)
  # Enabled at instance level

  # Apply immediately for non-prod
  apply_immediately = var.environment != "prod"

  tags = {
    Project     = var.project
    Environment = var.environment
    ManagedBy   = "terraform"
  }

  lifecycle {
    ignore_changes = [
      # Don't recreate cluster if master password changes (managed by Secrets Manager)
      master_password,
    ]
  }
}

# Writer instance (Serverless v2)
resource "aws_rds_cluster_instance" "writer" {
  identifier         = "${local.cluster_identifier}-writer"
  cluster_identifier = aws_rds_cluster.main.id
  instance_class     = "db.serverless"  # Required for Serverless v2
  engine             = aws_rds_cluster.main.engine
  engine_version     = aws_rds_cluster.main.engine_version

  db_subnet_group_name    = aws_db_subnet_group.aurora.name
  db_parameter_group_name = aws_rds_cluster_parameter_group.aurora.name

  performance_insights_enabled          = true
  performance_insights_retention_period = var.environment == "prod" ? 731 : 7  # days
  performance_insights_kms_key_id       = var.kms_key_arn

  monitoring_interval = 60  # Enhanced Monitoring every 60 seconds
  monitoring_role_arn = aws_iam_role.rds_monitoring.arn

  auto_minor_version_upgrade = true

  tags = {
    Project     = var.project
    Environment = var.environment
    Role        = "writer"
  }
}

# Read replica (Serverless v2)
resource "aws_rds_cluster_instance" "reader" {
  count = var.environment == "prod" ? 2 : 1

  identifier         = "${local.cluster_identifier}-reader-${count.index + 1}"
  cluster_identifier = aws_rds_cluster.main.id
  instance_class     = "db.serverless"
  engine             = aws_rds_cluster.main.engine
  engine_version     = aws_rds_cluster.main.engine_version

  db_subnet_group_name = aws_db_subnet_group.aurora.name

  performance_insights_enabled          = true
  performance_insights_retention_period = var.environment == "prod" ? 731 : 7
  performance_insights_kms_key_id       = var.kms_key_arn

  promotion_tier = count.index + 1  # Lower = promoted first on failover

  tags = {
    Project     = var.project
    Environment = var.environment
    Role        = "reader-${count.index + 1}"
  }
}

# Outputs
output "writer_endpoint" {
  value = aws_rds_cluster.main.endpoint  # Always points to current writer
}

output "reader_endpoint" {
  value = aws_rds_cluster.main.reader_endpoint  # Load-balanced across replicas
}

output "cluster_identifier" {
  value = aws_rds_cluster.main.cluster_identifier
}

☁️ Is Your Cloud Costing Too Much?

Most teams overspend 30–40% on cloud — wrong instance types, no reserved pricing, bloated storage. We audit, right-size, and automate your infrastructure.

AWS, GCP, Azure certified engineers
Infrastructure as Code (Terraform, CDK)
Docker, Kubernetes, GitHub Actions CI/CD
Typical audit recovers $500–$3,000/month in savings

Get a Free Cloud Audit WhatsApp

2. Read Replica Routing in Application Code

Aurora provides two endpoints: writer (single instance, always the current primary) and reader (load-balanced across all replicas). Route read-heavy queries to the reader endpoint.

// src/lib/db/aurora-client.ts
import { PrismaClient } from '@prisma/client';

// Writer: for mutations and strongly consistent reads
export const dbWriter = new PrismaClient({
  datasourceUrl: process.env.DATABASE_URL_WRITER, // Aurora cluster endpoint
  log: process.env.NODE_ENV === 'development' ? ['query', 'warn', 'error'] : ['warn', 'error'],
});

// Reader: for analytics, reports, non-critical reads
// Aurora reader endpoint load-balances across all replicas
export const dbReader = new PrismaClient({
  datasourceUrl: process.env.DATABASE_URL_READER, // Aurora reader endpoint
});

// Helper: use reader for queries, writer for mutations
export async function readQuery<T>(fn: (db: PrismaClient) => Promise<T>): Promise<T> {
  return fn(dbReader);
}

export async function writeQuery<T>(fn: (db: PrismaClient) => Promise<T>): Promise<T> {
  return fn(dbWriter);
}

// Usage:
// const users = await readQuery(db => db.user.findMany({ take: 100 }));
// await writeQuery(db => db.user.create({ data: { ... } }));

# .env
# Writer endpoint (cluster endpoint — auto-fails over to new primary)
DATABASE_URL_WRITER="postgresql://postgres:${PASSWORD}@${CLUSTER_ENDPOINT}:5432/myapp"

# Reader endpoint (load-balanced across replicas)
DATABASE_URL_READER="postgresql://postgres:${PASSWORD}@${READER_ENDPOINT}:5432/myapp"

PgBouncer Connection Pooling with Aurora

Aurora Serverless v2 supports up to 5,000 connections, but connection overhead still adds up. Use RDS Proxy (AWS-managed PgBouncer) or self-managed PgBouncer.

# RDS Proxy for connection pooling
resource "aws_db_proxy" "main" {
  name                   = "${local.cluster_identifier}-proxy"
  debug_logging          = false
  engine_family          = "POSTGRESQL"
  idle_client_timeout    = 1800
  require_tls            = true
  role_arn               = aws_iam_role.rds_proxy.arn
  vpc_security_group_ids = [aws_security_group.aurora.id]
  vpc_subnet_ids         = var.private_subnet_ids

  auth {
    auth_scheme = "SECRETS"
    iam_auth    = "DISABLED"
    secret_arn  = aws_rds_cluster.main.master_user_secret[0].secret_arn
  }

  tags = { Name = "${local.cluster_identifier}-proxy" }
}

resource "aws_db_proxy_default_target_group" "main" {
  db_proxy_name = aws_db_proxy.main.name

  connection_pool_config {
    connection_borrow_timeout    = 120  # seconds to wait for a connection
    max_connections_percent      = 90   # Use up to 90% of max_connections
    max_idle_connections_percent = 50
  }
}

resource "aws_db_proxy_target" "main" {
  db_cluster_identifier = aws_rds_cluster.main.cluster_identifier
  db_proxy_name         = aws_db_proxy.main.name
  target_group_name     = aws_db_proxy_default_target_group.main.name
}

3. Failover Testing

Aurora promotes a read replica to writer when the current writer fails. Test this in non-production environments to verify your application handles the ~30-second reconnection window.

#!/usr/bin/env bash
# scripts/test-aurora-failover.sh

CLUSTER_ID="${1:?Usage: test-aurora-failover.sh <cluster-id>}"

echo "Initiating Aurora failover for cluster: $CLUSTER_ID"
echo "Watch your application logs — there should be a ~30s reconnection window"

# Trigger failover (promotes lowest promotion_tier replica)
aws rds failover-db-cluster \
  --db-cluster-identifier "$CLUSTER_ID" \
  --region us-east-1

# Monitor until failover completes
echo "Monitoring failover status..."
while true; do
  STATUS=$(aws rds describe-db-clusters \
    --db-cluster-identifier "$CLUSTER_ID" \
    --query 'DBClusters[0].Status' \
    --output text)
  
  WRITER=$(aws rds describe-db-clusters \
    --db-cluster-identifier "$CLUSTER_ID" \
    --query 'DBClusters[0].DBClusterMembers[?IsClusterWriter==`true`].DBInstanceIdentifier' \
    --output text)

  echo "$(date): Status=$STATUS, Writer=$WRITER"
  
  if [ "$STATUS" = "available" ]; then
    echo "Failover complete! New writer: $WRITER"
    break
  fi
  
  sleep 5
done

Application Retry Logic for Failover

// src/lib/db/retry.ts
import { PrismaClient, Prisma } from '@prisma/client';

// Aurora failover causes ~30s of connection errors
// This retry wrapper handles the reconnection window
const FAILOVER_ERROR_CODES = [
  'P1001',  // Can't reach database
  'P1002',  // Database connection timed out
  'P1008',  // Operations timed out
  'P1017',  // Server closed connection unexpectedly
];

export async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries: number = 5,
  baseDelayMs: number = 1000
): Promise<T> {
  let lastError: Error;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      lastError = err;

      const isRetryable =
        err instanceof Prisma.PrismaClientKnownRequestError &&
        FAILOVER_ERROR_CODES.includes(err.code);

      const isConnectionError =
        err.message?.includes('ECONNREFUSED') ||
        err.message?.includes('Connection terminated') ||
        err.message?.includes('read ECONNRESET');

      if (!isRetryable && !isConnectionError) throw err;
      if (attempt === maxRetries) throw err;

      const delay = baseDelayMs * Math.pow(2, attempt) + Math.random() * 1000;
      console.warn(`DB error (attempt ${attempt + 1}/${maxRetries + 1}), retrying in ${Math.round(delay)}ms:`, err.message);
      await new Promise((r) => setTimeout(r, delay));
    }
  }

  throw lastError!;
}

aws - AWS Aurora 2026: Serverless v2, Replicas, Failover

⚙️ DevOps Done Right — Zero Downtime, Full Automation

Ship faster without breaking things. We build CI/CD pipelines, monitoring stacks, and auto-scaling infrastructure that your team can actually maintain.

Staging + production environments with feature flags
Automated security scanning in the pipeline
Uptime monitoring + alerting + runbook automation
On-call support handover docs included

Modernize My DevOps WhatsApp

4. Performance Insights

Performance Insights shows DB load by wait event, SQL query, and user — identifying bottlenecks without needing direct database access.

Read Performance Insights via AWS SDK

// src/scripts/analyze-performance.ts
import {
  PIClient,
  GetResourceMetricsCommand,
  DescribeDimensionKeysCommand,
} from '@aws-sdk/client-pi';

const pi = new PIClient({ region: 'us-east-1' });

const RESOURCE_ID = process.env.PI_RESOURCE_ID!; // dbi-xxxxx from RDS console

// Get top SQL queries by average active sessions (last hour)
async function getTopQueries() {
  const endTime = new Date();
  const startTime = new Date(endTime.getTime() - 60 * 60 * 1000);

  const { Keys } = await pi.send(
    new DescribeDimensionKeysCommand({
      ServiceType: 'RDS',
      Identifier: RESOURCE_ID,
      StartTime: startTime,
      EndTime: endTime,
      Metric: 'db.load.avg',
      GroupBy: {
        Group: 'db.sql',
        Dimensions: ['db.sql.statement'],
        Limit: 10,
      },
    })
  );

  console.log('\n📊 Top 10 Queries by DB Load (last hour):');
  Keys?.forEach((key, i) => {
    const stmt = key.Dimensions?.['db.sql.statement'] ?? 'N/A';
    const load = key.Total?.toFixed(4) ?? '0';
    console.log(`\n${i + 1}. Load: ${load} avg active sessions`);
    console.log(`   SQL: ${stmt.slice(0, 200)}...`);
  });
}

// Get wait events (identify bottlenecks: LWLock, I/O, lock contention)
async function getWaitEvents() {
  const endTime = new Date();
  const startTime = new Date(endTime.getTime() - 60 * 60 * 1000);

  const { Keys } = await pi.send(
    new DescribeDimensionKeysCommand({
      ServiceType: 'RDS',
      Identifier: RESOURCE_ID,
      StartTime: startTime,
      EndTime: endTime,
      Metric: 'db.load.avg',
      GroupBy: {
        Group: 'db.wait_event',
        Dimensions: ['db.wait_event.name'],
        Limit: 10,
      },
    })
  );

  console.log('\n⏳ Top Wait Events:');
  Keys?.forEach((key) => {
    const event = key.Dimensions?.['db.wait_event.name'] ?? 'N/A';
    const load = key.Total?.toFixed(4) ?? '0';
    console.log(`  ${event}: ${load}`);
  });
}

await getTopQueries();
await getWaitEvents();

Cost Reference

Configuration	Monthly Cost (us-east-1)	Use Case
Serverless v2, 0.5–8 ACU (dev)	$20–120/mo (usage-based)	Development / staging
Serverless v2, 2–32 ACU + 1 reader (prod)	$300–900/mo	Small-medium SaaS
2× `db.r8g.2xlarge` + 1 reader	~$1,400/mo	High-throughput, predictable load
3× `db.r8g.4xlarge` (Global DB)	~$4,200/mo	Multi-region with cross-region reads
+ RDS Proxy	+$0.015/hour (~$11/mo)	Connection pooling
+ Performance Insights (7 days)	Free	Query analysis
+ Performance Insights (2 years)	+$0.02/vCPU/hour	Long-term analysis

Viprasol in Action

Migrating from self-managed PostgreSQL or standard RDS to Aurora, or right-sizing an existing Aurora cluster that's over-provisioned? We design Aurora architectures — Serverless v2 for variable workloads, provisioned for steady high-throughput, Global Databases for multi-region — with Terraform IaC, read replica routing, and Performance Insights dashboards.

Talk to our team → | Explore our cloud solutions →

How an Aurora Serverless Read Replica Improves Scaling and Availability

An Aurora Serverless read replica lets you absorb read-heavy traffic without manually sizing instances. With Aurora Serverless v2, each reader scales its Aurora Capacity Units up and down based on live demand, so reporting dashboards, search queries, and analytics workloads draw from replicas while the writer stays focused on transactions. Because all readers share the same distributed storage volume, replication lag stays low and you avoid the overhead of copying data between nodes.

Adding a reader also strengthens availability. If the primary fails, Aurora can promote a replica during failover, typically restoring writes in well under a minute. Placing readers across multiple Availability Zones spreads risk further.

At Viprasol Tech, our senior engineers design Aurora topologies end to end, tuning capacity ranges, endpoints, and failover priorities so your database scales predictably under real load.

AWS Aurora 2026: Serverless v2, Replicas, Failover

AWS RDS vs Aurora in 2026: Architecture, Cost, and When to Choose Each

Aurora vs Standard RDS PostgreSQL

1. Aurora Serverless v2 Cluster (Terraform)

☁️ Is Your Cloud Costing Too Much?

2. Read Replica Routing in Application Code

PgBouncer Connection Pooling with Aurora

3. Failover Testing

Application Retry Logic for Failover

⚙️ DevOps Done Right — Zero Downtime, Full Automation

Recommended Reading

4. Performance Insights

Read Performance Insights via AWS SDK

Cost Reference

More on This Topic

Viprasol in Action

How an Aurora Serverless Read Replica Improves Scaling and Availability

External Resources

Viprasol Tech Team

Need DevOps & Cloud Expertise?

Making sense of your data at scale?

Related Articles

AWS Aurora Serverless v2: Setup, Auto-Pause, RDS Proxy

AWS RDS Proxy: Connection Pooling, IAM Auth, Failover

AWS RDS Read Replicas: Routing, Connection Pooling, Lag Monitoring

AWS IAM Least-Privilege Design: Policy Patterns, Condition Keys

PostgreSQL Backup and Restore in 2026: pg_dump, WAL Archiving, PITR

Terraform State Management: Remote State, Workspaces, Locking, Import