AWS RDS Aurora: Serverless v2, Read Replicas, Failover, and Performance Insights
Deploy production PostgreSQL on AWS Aurora: Serverless v2 auto-scaling, read replica routing, failover testing, Performance Insights query analysis, and Terraform IaC for Aurora clusters.
Aurora is PostgreSQL-compatible, but it's not just managed PostgreSQL. The storage layer is fundamentally different — distributed across six copies in three AZs, with writes acknowledged once four of the six copies confirm — giving you 99.99% availability and point-in-time recovery without running your own replication. Aurora Serverless v2 adds automatic capacity scaling from 0.5 to 256 ACUs (each ACU is roughly 2 GiB of memory plus corresponding CPU and networking), making it cost-effective for variable workloads that would otherwise require over-provisioned instances.
This post covers production Aurora setup: Serverless v2 cluster configuration, read replica routing, failover behavior and testing, Performance Insights for query analysis, and Terraform IaC.
Aurora vs Standard RDS PostgreSQL
| Feature | Aurora PostgreSQL | RDS PostgreSQL |
|---|---|---|
| Storage | Distributed (6 copies, 3 AZs) | Single-AZ EBS (Multi-AZ = 2 copies) |
| Failover time | ~30 seconds | ~60–120 seconds |
| Read replicas | Up to 15 (in-cluster) | Up to 5 (separate instances) |
| Replica lag | ~10–20ms | ~100ms–several seconds |
| Serverless option | v2 (sub-minute scaling) | No |
| Storage auto-scaling | Yes (to 128TB) | Yes (to 64TB) |
| Cost | ~20% higher than RDS | Baseline |
| Global databases | Yes (cross-region) | No (cross-region read replicas only) |
1. Aurora Serverless v2 Cluster (Terraform)
# infrastructure/aurora/main.tf
locals {
cluster_identifier = "${var.project}-${var.environment}-aurora"
db_name = replace(var.project, "-", "_")
}
# Subnet group
resource "aws_db_subnet_group" "aurora" {
name = "${local.cluster_identifier}-subnet-group"
subnet_ids = var.private_subnet_ids
tags = {
Name = "${local.cluster_identifier}-subnet-group"
Environment = var.environment
}
}
# Security group: only accept connections from ECS tasks / Lambda
resource "aws_security_group" "aurora" {
name = "${local.cluster_identifier}-sg"
description = "Aurora cluster security group"
vpc_id = var.vpc_id
ingress {
from_port = 5432
to_port = 5432
protocol = "tcp"
security_groups = var.allowed_security_group_ids
self = true # Also allow same-SG members (e.g. the RDS Proxy below) to reach the cluster
description = "PostgreSQL from app tier"
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = { Name = "${local.cluster_identifier}-sg" }
}
# Parameter group for PostgreSQL tuning
resource "aws_rds_cluster_parameter_group" "aurora" {
family = "aurora-postgresql16"
name = "${local.cluster_identifier}-params"
description = "Custom parameter group for ${local.cluster_identifier}"
parameter {
name = "shared_preload_libraries"
value = "pg_stat_statements,auto_explain"
apply_method = "pending-reboot" # Static parameter — takes effect after a restart
}
parameter {
name = "log_min_duration_statement"
value = "1000" # Log queries > 1 second
}
parameter {
name = "auto_explain.log_min_duration"
value = "500" # Auto-explain queries > 500ms
}
parameter {
name = "auto_explain.log_analyze"
value = "1"
}
parameter {
name = "work_mem"
value = "65536" # 64MB per sort/hash operation
}
parameter {
name = "random_page_cost"
value = "1.0" # Aurora storage is fast; default 4.0 is too conservative
}
}
# Aurora cluster
resource "aws_rds_cluster" "main" {
cluster_identifier = local.cluster_identifier
engine = "aurora-postgresql"
engine_version = "16.4"
database_name = local.db_name
master_username = "postgres"
manage_master_user_password = true # AWS Secrets Manager rotation
db_subnet_group_name = aws_db_subnet_group.aurora.name
vpc_security_group_ids = [aws_security_group.aurora.id]
db_cluster_parameter_group_name = aws_rds_cluster_parameter_group.aurora.name
# Serverless v2 capacity range
serverlessv2_scaling_configuration {
min_capacity = var.environment == "prod" ? 2.0 : 0.5 # ACUs
max_capacity = var.environment == "prod" ? 64.0 : 8.0
}
# Backup and recovery
backup_retention_period = var.environment == "prod" ? 14 : 7
preferred_backup_window = "03:00-04:00" # UTC
deletion_protection = var.environment == "prod"
skip_final_snapshot = var.environment != "prod"
final_snapshot_identifier = var.environment == "prod" ? "${local.cluster_identifier}-final" : null
# Storage encryption
storage_encrypted = true
kms_key_id = var.kms_key_arn
# CloudWatch logging
enabled_cloudwatch_logs_exports = ["postgresql"]
# Performance Insights (covered later)
# Enabled at instance level
# Apply immediately for non-prod
apply_immediately = var.environment != "prod"
tags = {
Project = var.project
Environment = var.environment
ManagedBy = "terraform"
}
lifecycle {
ignore_changes = [
# Don't recreate cluster if master password changes (managed by Secrets Manager)
master_password,
]
}
}
# Writer instance (Serverless v2)
resource "aws_rds_cluster_instance" "writer" {
identifier = "${local.cluster_identifier}-writer"
cluster_identifier = aws_rds_cluster.main.id
instance_class = "db.serverless" # Required for Serverless v2
engine = aws_rds_cluster.main.engine
engine_version = aws_rds_cluster.main.engine_version
db_subnet_group_name = aws_db_subnet_group.aurora.name
performance_insights_enabled = true
performance_insights_retention_period = var.environment == "prod" ? 731 : 7 # days
performance_insights_kms_key_id = var.kms_key_arn
monitoring_interval = 60 # Enhanced Monitoring every 60 seconds
monitoring_role_arn = aws_iam_role.rds_monitoring.arn
auto_minor_version_upgrade = true
tags = {
Project = var.project
Environment = var.environment
Role = "writer"
}
}
# Read replica (Serverless v2)
resource "aws_rds_cluster_instance" "reader" {
count = var.environment == "prod" ? 2 : 1
identifier = "${local.cluster_identifier}-reader-${count.index + 1}"
cluster_identifier = aws_rds_cluster.main.id
instance_class = "db.serverless"
engine = aws_rds_cluster.main.engine
engine_version = aws_rds_cluster.main.engine_version
db_subnet_group_name = aws_db_subnet_group.aurora.name
performance_insights_enabled = true
performance_insights_retention_period = var.environment == "prod" ? 731 : 7
performance_insights_kms_key_id = var.kms_key_arn
promotion_tier = count.index + 1 # Lower = promoted first on failover
tags = {
Project = var.project
Environment = var.environment
Role = "reader-${count.index + 1}"
}
}
# Outputs
output "writer_endpoint" {
value = aws_rds_cluster.main.endpoint # Always points to current writer
}
output "reader_endpoint" {
value = aws_rds_cluster.main.reader_endpoint # Load-balanced across replicas
}
output "cluster_identifier" {
value = aws_rds_cluster.main.cluster_identifier
}
2. Read Replica Routing in Application Code
Aurora provides two endpoints: writer (single instance, always the current primary) and reader (load-balanced across all replicas). Route read-heavy queries to the reader endpoint.
// src/lib/db/aurora-client.ts
import { PrismaClient } from '@prisma/client';
// Writer: for mutations and strongly consistent reads
export const dbWriter = new PrismaClient({
datasourceUrl: process.env.DATABASE_URL_WRITER, // Aurora cluster endpoint
log: process.env.NODE_ENV === 'development' ? ['query', 'warn', 'error'] : ['warn', 'error'],
});
// Reader: for analytics, reports, non-critical reads
// Aurora reader endpoint load-balances across all replicas
export const dbReader = new PrismaClient({
datasourceUrl: process.env.DATABASE_URL_READER, // Aurora reader endpoint
});
// Helper: use reader for queries, writer for mutations
export async function readQuery<T>(fn: (db: PrismaClient) => Promise<T>): Promise<T> {
return fn(dbReader);
}
export async function writeQuery<T>(fn: (db: PrismaClient) => Promise<T>): Promise<T> {
return fn(dbWriter);
}
// Usage:
// const users = await readQuery(db => db.user.findMany({ take: 100 }));
// await writeQuery(db => db.user.create({ data: { ... } }));
# .env
# Writer endpoint (cluster endpoint — auto-fails over to new primary)
DATABASE_URL_WRITER="postgresql://postgres:${PASSWORD}@${CLUSTER_ENDPOINT}:5432/myapp"
# Reader endpoint (load-balanced across replicas)
DATABASE_URL_READER="postgresql://postgres:${PASSWORD}@${READER_ENDPOINT}:5432/myapp"
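One caveat with the reader endpoint: replica lag (~10–20ms, occasionally more) means a read issued immediately after a write may not see it. A common mitigation is read-your-writes stickiness — pin a request's reads to the writer for a short window after any mutation. A minimal sketch; the 500ms window and class name are illustrative choices, not Aurora recommendations:

```typescript
// Pin reads to the writer briefly after a mutation so the caller always
// sees its own writes, despite replica lag on the reader endpoint.
const STICKINESS_MS = 500;

export class ReadRoutingContext {
  private lastWriteAt = 0;

  // Call after every mutation issued through the writer
  markWrite(now: number = Date.now()): void {
    this.lastWriteAt = now;
  }

  // true → route this read to the writer; false → reader endpoint is safe
  shouldUseWriter(now: number = Date.now()): boolean {
    return now - this.lastWriteAt < STICKINESS_MS;
  }
}
```

Instantiate one context per request (or per user session), call `markWrite()` inside `writeQuery`, and have `readQuery` consult `shouldUseWriter()` when picking a client.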
Connection Pooling: RDS Proxy and PgBouncer
Aurora Serverless v2's max_connections scales with capacity up to a hard cap of 5,000, but per-connection memory and handshake overhead still add up. Use RDS Proxy (AWS's managed connection pooler) or self-managed PgBouncer.
# RDS Proxy for connection pooling
resource "aws_db_proxy" "main" {
name = "${local.cluster_identifier}-proxy"
debug_logging = false
engine_family = "POSTGRESQL"
idle_client_timeout = 1800
require_tls = true
role_arn = aws_iam_role.rds_proxy.arn
vpc_security_group_ids = [aws_security_group.aurora.id]
vpc_subnet_ids = var.private_subnet_ids
auth {
auth_scheme = "SECRETS"
iam_auth = "DISABLED"
secret_arn = aws_rds_cluster.main.master_user_secret[0].secret_arn
}
tags = { Name = "${local.cluster_identifier}-proxy" }
}
resource "aws_db_proxy_default_target_group" "main" {
db_proxy_name = aws_db_proxy.main.name
connection_pool_config {
connection_borrow_timeout = 120 # seconds to wait for a connection
max_connections_percent = 90 # Use up to 90% of max_connections
max_idle_connections_percent = 50
}
}
resource "aws_db_proxy_target" "main" {
db_cluster_identifier = aws_rds_cluster.main.cluster_identifier
db_proxy_name = aws_db_proxy.main.name
target_group_name = aws_db_proxy_default_target_group.main.name
}
3. Failover Testing
Aurora promotes a read replica to writer when the current writer fails. Test this in non-production environments to verify your application handles the ~30-second reconnection window.
#!/usr/bin/env bash
# scripts/test-aurora-failover.sh
CLUSTER_ID="${1:?Usage: test-aurora-failover.sh <cluster-id>}"
echo "Initiating Aurora failover for cluster: $CLUSTER_ID"
echo "Watch your application logs — there should be a ~30s reconnection window"
# Capture the current writer before triggering the failover
INITIAL_WRITER=$(aws rds describe-db-clusters \
  --db-cluster-identifier "$CLUSTER_ID" \
  --query 'DBClusters[0].DBClusterMembers[?IsClusterWriter==`true`].DBInstanceIdentifier' \
  --output text)
echo "Current writer: $INITIAL_WRITER"
# Trigger failover (promotes the lowest promotion_tier replica)
aws rds failover-db-cluster \
  --db-cluster-identifier "$CLUSTER_ID" \
  --region us-east-1
# Poll until a different instance holds the writer role. Polling for a writer
# change is more reliable than the cluster status string, which can read
# "available" for most of the failover window.
echo "Monitoring failover status..."
while true; do
  WRITER=$(aws rds describe-db-clusters \
    --db-cluster-identifier "$CLUSTER_ID" \
    --query 'DBClusters[0].DBClusterMembers[?IsClusterWriter==`true`].DBInstanceIdentifier' \
    --output text)
  echo "$(date): Writer=$WRITER"
  if [ "$WRITER" != "$INITIAL_WRITER" ]; then
    echo "Failover complete! New writer: $WRITER"
    break
  fi
  sleep 5
done
Application Retry Logic for Failover
// src/lib/db/retry.ts
import { PrismaClient, Prisma } from '@prisma/client';
// Aurora failover causes ~30s of connection errors
// This retry wrapper handles the reconnection window
const FAILOVER_ERROR_CODES = [
'P1001', // Can't reach database
'P1002', // Database connection timed out
'P1008', // Operations timed out
'P1017', // Server closed connection unexpectedly
];
export async function withRetry<T>(
fn: () => Promise<T>,
maxRetries: number = 5,
baseDelayMs: number = 1000
): Promise<T> {
let lastError: Error;
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (err: any) {
lastError = err;
const isRetryable =
err instanceof Prisma.PrismaClientKnownRequestError &&
FAILOVER_ERROR_CODES.includes(err.code);
const isConnectionError =
err.message?.includes('ECONNREFUSED') ||
err.message?.includes('Connection terminated') ||
err.message?.includes('read ECONNRESET');
if (!isRetryable && !isConnectionError) throw err;
if (attempt === maxRetries) throw err;
const delay = baseDelayMs * Math.pow(2, attempt) + Math.random() * 1000;
console.warn(`DB error (attempt ${attempt + 1}/${maxRetries + 1}), retrying in ${Math.round(delay)}ms:`, err.message);
await new Promise((r) => setTimeout(r, delay));
}
}
throw lastError!;
}
4. Performance Insights
Performance Insights shows DB load by wait event, SQL query, and user — identifying bottlenecks without needing direct database access.
Read Performance Insights via AWS SDK
// src/scripts/analyze-performance.ts
import {
PIClient,
GetResourceMetricsCommand,
DescribeDimensionKeysCommand,
} from '@aws-sdk/client-pi';
const pi = new PIClient({ region: 'us-east-1' });
const RESOURCE_ID = process.env.PI_RESOURCE_ID!; // dbi-xxxxx from RDS console
// Get top SQL queries by average active sessions (last hour)
async function getTopQueries() {
const endTime = new Date();
const startTime = new Date(endTime.getTime() - 60 * 60 * 1000);
const { Keys } = await pi.send(
new DescribeDimensionKeysCommand({
ServiceType: 'RDS',
Identifier: RESOURCE_ID,
StartTime: startTime,
EndTime: endTime,
Metric: 'db.load.avg',
GroupBy: {
Group: 'db.sql',
Dimensions: ['db.sql.statement'],
Limit: 10,
},
})
);
console.log('\n📊 Top 10 Queries by DB Load (last hour):');
Keys?.forEach((key, i) => {
const stmt = key.Dimensions?.['db.sql.statement'] ?? 'N/A';
const load = key.Total?.toFixed(4) ?? '0';
console.log(`\n${i + 1}. Load: ${load} avg active sessions`);
console.log(` SQL: ${stmt.slice(0, 200)}...`);
});
}
// Get wait events (identify bottlenecks: LWLock, I/O, lock contention)
async function getWaitEvents() {
const endTime = new Date();
const startTime = new Date(endTime.getTime() - 60 * 60 * 1000);
const { Keys } = await pi.send(
new DescribeDimensionKeysCommand({
ServiceType: 'RDS',
Identifier: RESOURCE_ID,
StartTime: startTime,
EndTime: endTime,
Metric: 'db.load.avg',
GroupBy: {
Group: 'db.wait_event',
Dimensions: ['db.wait_event.name'],
Limit: 10,
},
})
);
console.log('\n⏳ Top Wait Events:');
Keys?.forEach((key) => {
const event = key.Dimensions?.['db.wait_event.name'] ?? 'N/A';
const load = key.Total?.toFixed(4) ?? '0';
console.log(` ${event}: ${load}`);
});
}
await getTopQueries();
await getWaitEvents();
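`GetResourceMetricsCommand` (imported above) returns raw time series rather than grouped keys — each entry in `MetricList` carries `DataPoints` of `{Timestamp, Value}`. Reducing a series to its average lets you alert when average active sessions approach your capacity ceiling. A small sketch against that response shape; the helper name is illustrative:

```typescript
// Shape matching a DataPoint from @aws-sdk/client-pi's GetResourceMetrics
// response (Value is optional — gaps appear when the instance was idle/paused).
interface DataPoint {
  Timestamp: Date;
  Value?: number;
}

// Average db.load.avg over a series, skipping missing values.
export function averageLoad(dataPoints: DataPoint[]): number {
  const values = dataPoints
    .map((p) => p.Value)
    .filter((v): v is number => typeof v === 'number');
  if (values.length === 0) return 0;
  return values.reduce((a, b) => a + b, 0) / values.length;
}
```

Rule of thumb: sustained average active sessions above your vCPU (or ACU-derived CPU) count means queries are queuing, and the wait-event breakdown above tells you why.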
Cost Reference
| Configuration | Monthly Cost (us-east-1) | Use Case |
|---|---|---|
| Serverless v2, 0.5–8 ACU (dev) | $20–120/mo (usage-based) | Development / staging |
| Serverless v2, 2–32 ACU + 1 reader (prod) | $300–900/mo | Small-medium SaaS |
| 2× db.r8g.2xlarge + 1 reader | ~$1,400/mo | High-throughput, predictable load |
| 3× db.r8g.4xlarge (Global DB) | ~$4,200/mo | Multi-region with cross-region reads |
| + RDS Proxy | +$0.015/hour (~$11/mo) | Connection pooling |
| + Performance Insights (7 days) | Free | Query analysis |
| + Performance Insights (2 years) | +$0.02/vCPU/hour | Long-term analysis |
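For back-of-envelope sizing, Serverless v2 compute is billed per ACU-hour. A tiny estimator — the $0.12/ACU-hour figure is the us-east-1 list price at the time of writing, so treat it as an assumption and confirm against the current pricing page; storage and I/O are billed separately:

```typescript
// Rough monthly compute estimate for Aurora Serverless v2.
// Assumption: ~$0.12/ACU-hour (us-east-1 list price; verify before budgeting).
const PRICE_PER_ACU_HOUR = 0.12;
const HOURS_PER_MONTH = 730;

export function monthlyServerlessCost(
  avgAcus: number,
  pricePerAcuHour = PRICE_PER_ACU_HOUR
): number {
  return avgAcus * pricePerAcuHour * HOURS_PER_MONTH;
}
```

At an average of 8 ACUs this lands around $700/mo — roughly where a provisioned instance of equivalent size breaks even, which is the usual crossover point for moving a steady workload off Serverless v2.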
See Also
- Database Connection Pooling: PgBouncer and RDS Proxy
- PostgreSQL Performance: Indexes, Query Plans, and Connection Pooling
- PostgreSQL Table Partitioning: Range, List, Hash, and pg_partman
- AWS ECS Fargate in Production: Task Definitions and Blue/Green Deploys
- Infrastructure Cost Engineering: Tagging, Rightsizing, and Reserved Capacity
Working With Viprasol
Migrating from self-managed PostgreSQL or standard RDS to Aurora, or right-sizing an existing Aurora cluster that's over-provisioned? We design Aurora architectures — Serverless v2 for variable workloads, provisioned for steady high-throughput, Global Databases for multi-region — with Terraform IaC, read replica routing, and Performance Insights dashboards.
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.