AWS RDS Read Replicas: Routing, Connection Pooling, Lag Monitoring, and Failover Patterns
Scale read-heavy PostgreSQL workloads with AWS RDS read replicas. Covers Prisma read/write splitting, PgBouncer per-replica connection pooling, replica lag monitoring, automatic failover with Aurora, and Terraform setup for multi-replica configurations.
Read replicas offload SELECT queries from your primary RDS instance, letting it focus on writes. The performance improvement can be significant: if 80% of your queries are reads, routing those reads to a replica sheds the bulk of the primary's query load. The implementation challenge is routing: making sure writes always go to the primary, reads go to replicas, and your application handles replica lag gracefully.
When Read Replicas Help (and When They Don't)
| Scenario | Read Replicas Help | Alternative |
|---|---|---|
| Read-heavy analytics (dashboards, reports) | ✅ Yes | Also consider materialized views |
| Read-heavy app queries (80%+ SELECTs) | ✅ Yes | |
| Write-heavy workloads | ❌ Minimal | Sharding, CQRS |
| Queries needing fresh data (< 100ms lag) | ❌ Risky | Route to primary |
| Full-text search | ❌ No | Use OpenSearch |
| < 100 RPS total | ❌ Overkill | Optimize indexes first |
Terraform: RDS with Read Replicas
# terraform/rds-replicas.tf
# Primary instance
resource "aws_db_instance" "primary" {
identifier = "${var.app_name}-primary"
engine = "postgres"
engine_version = "16.3"
instance_class = "db.t4g.large"
allocated_storage = 100
storage_type = "gp3"
storage_encrypted = true
db_name = var.db_name
username = var.db_username
password = var.db_password
db_subnet_group_name = aws_db_subnet_group.main.name
vpc_security_group_ids = [aws_security_group.rds.id]
multi_az = true # Synchronous standby for failover
# Enable enhanced monitoring for replica lag metrics
monitoring_interval = 60
monitoring_role_arn = aws_iam_role.rds_monitoring.arn
# Backup for Point-in-Time Recovery
backup_retention_period = 7
backup_window = "03:00-04:00"
maintenance_window = "sun:04:30-sun:05:30"
# Required for read replicas
parameter_group_name = aws_db_parameter_group.postgres16.name
tags = local.tags
}
# Read replica 1 (same region, different AZ)
resource "aws_db_instance" "replica_1" {
identifier = "${var.app_name}-replica-1"
replicate_source_db = aws_db_instance.primary.identifier
instance_class = "db.t4g.medium" # Can be smaller than primary
publicly_accessible = false
vpc_security_group_ids = [aws_security_group.rds.id]
# Replicas don't need backups (primary has them)
backup_retention_period = 0
skip_final_snapshot = true
# Auto-minor version upgrade
auto_minor_version_upgrade = true
tags = merge(local.tags, { Role = "replica" })
}
# Optional: replica in a different region (cross-region)
# resource "aws_db_instance" "replica_eu" {
# provider = aws.eu_west_1
# replicate_source_db = aws_db_instance.primary.arn # ARN for cross-region
# ...
# }
# CloudWatch alarm: replica lag > 30 seconds
resource "aws_cloudwatch_metric_alarm" "replica_lag" {
alarm_name = "${var.app_name}-replica-lag"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "ReplicaLag"
namespace = "AWS/RDS"
period = 60
statistic = "Average"
threshold = 30 # seconds
alarm_actions = [var.sns_alert_topic_arn]
dimensions = {
DBInstanceIdentifier = aws_db_instance.replica_1.identifier
}
}
output "primary_endpoint" {
value = aws_db_instance.primary.endpoint
}
output "replica_endpoint" {
value = aws_db_instance.replica_1.endpoint
}
Prisma: Read/Write Splitting
// lib/prisma.ts — separate clients for primary and replica
import { PrismaClient } from "@prisma/client";
function createPrismaClient(databaseUrl: string) {
return new PrismaClient({
datasources: { db: { url: databaseUrl } },
log: process.env.NODE_ENV === "development" ? ["query", "error"] : ["error"],
});
}
// Primary: all writes + reads requiring freshness
const primaryClient = createPrismaClient(process.env.DATABASE_URL!);
// Replica: read-only queries
const replicaClient = process.env.DATABASE_REPLICA_URL
? createPrismaClient(process.env.DATABASE_REPLICA_URL)
: primaryClient; // Fallback to primary if no replica configured
export const prisma = primaryClient;
export const prismaRO = replicaClient; // "RO" = read-only
// Usage pattern:
// import { prisma, prismaRO } from "@/lib/prisma";
// const user = await prismaRO.user.findUnique(...) ← reads from replica
// await prisma.user.update(...) ← writes to primary
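With separate primary and replica endpoints, it helps to give each endpoint its own PgBouncer pool so a read surge against the replica can't starve write connections to the primary. A minimal sketch of such a config (hostnames, pool sizes, and auth settings are placeholders, not values from this setup):

```ini
; pgbouncer.ini — one logical database per RDS endpoint (hosts are placeholders)
[databases]
app_primary = host=myapp-primary.abc123.us-east-1.rds.amazonaws.com port=5432 dbname=app
app_replica = host=myapp-replica-1.abc123.us-east-1.rds.amazonaws.com port=5432 dbname=app

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
default_pool_size = 20
max_client_conn = 500
```

When Prisma connects through PgBouncer in transaction mode, append `?pgbouncer=true` to the connection string so Prisma disables prepared statements, which transaction pooling does not support.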
Read/Write Routing Layer
// lib/db/router.ts — route queries based on operation type
import { prisma, prismaRO } from "@/lib/prisma";
import type { Prisma } from "@prisma/client";
// Read operations that can use replica (stale data acceptable)
export async function readFromReplica<T>(
query: (client: typeof prismaRO) => Promise<T>
): Promise<T> {
return query(prismaRO);
}
// Write + read: always use primary
export async function readFromPrimary<T>(
query: (client: typeof prisma) => Promise<T>
): Promise<T> {
return query(prisma);
}
// Practical usage:
// Reports, dashboards, list pages → replica
// Post-write reads ("read your writes") → primary
// Example: dashboard stats (stale ok, use replica)
export async function getDashboardStats(workspaceId: string) {
return readFromReplica((db) =>
db.project.groupBy({
by: ["status"],
where: { workspaceId },
_count: { id: true },
})
);
}
// Example: fetch user after updating (must use primary)
export async function updateAndFetchUser(userId: string, data: Prisma.UserUpdateInput) {
const updated = await prisma.user.update({ where: { id: userId }, data });
// Read from PRIMARY — replica may be 100-200ms behind
const fresh = await readFromPrimary((db) =>
db.user.findUniqueOrThrow({ where: { id: userId } })
);
return fresh;
}
Monitoring Replica Lag
// lib/monitoring/replica-lag.ts
import { CloudWatchClient, GetMetricStatisticsCommand } from "@aws-sdk/client-cloudwatch";
import { prisma, prismaRO } from "@/lib/prisma"; // used by getReadClient below
const cw = new CloudWatchClient({ region: process.env.AWS_REGION });
export async function getReplicaLagSeconds(): Promise<number> {
const now = new Date();
const past = new Date(now.getTime() - 5 * 60 * 1000); // Last 5 minutes
const result = await cw.send(
new GetMetricStatisticsCommand({
Namespace: "AWS/RDS",
MetricName: "ReplicaLag",
Dimensions: [
{ Name: "DBInstanceIdentifier", Value: process.env.RDS_REPLICA_ID! },
],
StartTime: past,
EndTime: now,
Period: 60,
Statistics: ["Average"],
})
);
const points = result.Datapoints ?? [];
// No datapoints = treat as healthy; alarm separately on missing metrics if needed
if (points.length === 0) return 0;
return Math.max(...points.map((p) => p.Average ?? 0));
}
// Health check: skip replica if lagging > 5 seconds
export async function getReadClient() {
const lagSeconds = await getReplicaLagSeconds();
if (lagSeconds > 5) {
console.warn(`[replica] Lag ${lagSeconds}s > 5s — routing to primary`);
return prisma; // Fall back to primary
}
return prismaRO;
}
Aurora: Automatic Failover and Cluster Endpoint
# For Aurora PostgreSQL: use cluster endpoint — auto-routes to current writer
resource "aws_rds_cluster" "aurora" {
cluster_identifier = "${var.app_name}-aurora"
engine = "aurora-postgresql"
engine_version = "16.3"
database_name = var.db_name
master_username = var.db_username
master_password = var.db_password
storage_encrypted = true
db_subnet_group_name = aws_db_subnet_group.main.name
vpc_security_group_ids = [aws_security_group.rds.id]
backup_retention_period = 7
preferred_backup_window = "03:00-04:00"
skip_final_snapshot = false
final_snapshot_identifier = "${var.app_name}-final"
}
# Writer instance
resource "aws_rds_cluster_instance" "writer" {
identifier = "${var.app_name}-writer"
cluster_identifier = aws_rds_cluster.aurora.id
instance_class = "db.t4g.large"
engine = aws_rds_cluster.aurora.engine
}
# Reader instance (Aurora auto-promotes to writer on failover)
resource "aws_rds_cluster_instance" "reader" {
identifier = "${var.app_name}-reader"
cluster_identifier = aws_rds_cluster.aurora.id
instance_class = "db.t4g.medium"
engine = aws_rds_cluster.aurora.engine
}
output "aurora_writer_endpoint" {
value = aws_rds_cluster.aurora.endpoint # Always points to current writer
}
output "aurora_reader_endpoint" {
value = aws_rds_cluster.aurora.reader_endpoint # Round-robins across readers
}
// Aurora endpoints in Prisma
// DATABASE_URL = aurora_writer_endpoint (writes)
// DATABASE_REPLICA_URL = aurora_reader_endpoint (reads)
// Aurora handles failover automatically — no DNS update needed
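During an Aurora failover the writer endpoint keeps its DNS name, but in-flight connections are dropped, so writes issued around the failover window can fail transiently. A generic retry-with-backoff wrapper handles this; the sketch below makes no attempt to classify errors (a real version should retry only connection-level failures, never constraint violations):

```typescript
// Hypothetical retry helper for transient connection errors around failover.
async function withRetry<T>(
  op: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 200
): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await op();
    } catch (err) {
      lastErr = err;
      // Exponential backoff: 200ms, 400ms, 800ms, ...
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
    }
  }
  throw lastErr;
}
```

Wrapping critical writes, e.g. `withRetry(() => prisma.user.update(...))`, rides out the brief window while Aurora promotes the reader.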
Cost Comparison
| Configuration | Monthly Cost (approx) |
|---|---|
| Single db.t4g.large (no replica) | ~$130 |
| Primary db.t4g.large + 1 replica db.t4g.medium | ~$190 |
| Aurora db.t4g.large writer + 1 reader | ~$220 |
| RDS Multi-AZ db.t4g.large (standby, no read traffic) | ~$260 |
See Also
- AWS RDS Aurora Serverless
- AWS RDS Aurora
- PostgreSQL Connection Pooling with PgBouncer
- PostgreSQL Performance Tuning
- AWS Cost Optimization
Working With Viprasol
Read replicas require correct routing discipline — a single misrouted write to a read-only endpoint causes an error; a single "read your own write" routed to a lagging replica causes a confusing user experience. Our team sets up RDS replicas with Terraform, implements prismaRO for replica reads and prisma for primary writes, adds replica lag CloudWatch alarms, and builds a getReadClient() health-check function that falls back to primary when lag exceeds 5 seconds.
What we deliver:
- RDS Terraform: primary Multi-AZ + replica (smaller instance class), enhanced monitoring, lag CloudWatch alarm
- `prisma` (primary) and `prismaRO` (replica) Prisma clients, with fallback to primary when no replica URL is set
- `readFromReplica()` and `readFromPrimary()` routing helpers
- `getReplicaLagSeconds()` CloudWatch metric query
- `getReadClient()` health check: lag > 5s → fall back to primary
- Aurora cluster Terraform: writer + reader endpoints, auto-failover
Talk to our team about your RDS scaling architecture →
Or explore our cloud infrastructure services.
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.