SaaS Multi-Region Deployment: PostgreSQL Replication, Latency Routing, and Disaster Recovery
Architect multi-region SaaS deployments: PostgreSQL read replicas with replication lag handling, Route 53 latency-based routing, active-passive and active-active patterns, RTO/RPO targets, and Terraform IaC.
Most SaaS products don't need multi-region until they have paying customers on multiple continents or an SLA that requires 99.99% uptime. Getting there prematurely adds enormous operational complexity. But when you do need it — whether for latency, compliance (GDPR data residency), or disaster recovery — you need a clear architecture before you're under pressure.
This post covers the two main patterns (active-passive and active-active), PostgreSQL replication with Aurora Global Database, latency-based routing with Route 53, handling replication lag in your application, and Terraform for the whole thing.
Pattern Comparison
| Pattern | RTO | RPO | Complexity | Cost | Use case |
|---|---|---|---|---|---|
| Single region | Hours (restore from backup) | Up to last backup | Low | Low | < 50K users, no SLA |
| Active-passive (warm standby) | 1–5 min | < 1 min | Medium | 1.5–2x | 99.9% SLA, DR |
| Active-passive (hot standby) | < 1 min | Seconds | Medium-High | 2x | 99.95% SLA |
| Active-active | Near-zero | Near-zero | Very High | 3–4x | 99.99% SLA, global users |
Start with active-passive. Active-active requires solving distributed writes — conflict resolution, two-phase commit or event sourcing — complexity most products don't need.
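As a sanity check, those SLA tiers translate directly into downtime budgets — a quick sketch (plain arithmetic, function name is mine):

```typescript
// Downtime budget implied by an SLA target — useful when picking a pattern.
function downtimeBudgetMinutesPerYear(slaPercent: number): number {
  const minutesPerYear = 365.25 * 24 * 60; // ≈ 525,960
  return minutesPerYear * (1 - slaPercent / 100);
}

for (const sla of [99.9, 99.95, 99.99]) {
  console.log(`${sla}% → ${downtimeBudgetMinutesPerYear(sla).toFixed(0)} min/year`);
}
// 99.9% allows ~526 min/year (~8.8 h); 99.99% allows only ~53 min/year
```

A 99.9% SLA leaves room to fail over manually from a warm standby; at 99.99%, one bad regional incident can consume the whole year's budget, which is what pushes teams toward active-active.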
1. Aurora Global Database (Recommended for AWS)
Aurora Global Database replicates to secondary regions at the storage layer, with typical replication lag under a second, and supports managed failover that usually completes in under a minute.
# infrastructure/aurora-global/main.tf
# Primary cluster (us-east-1)
resource "aws_rds_global_cluster" "main" {
global_cluster_identifier = "${var.project}-global"
engine = "aurora-postgresql"
engine_version = "16.3"
database_name = var.db_name
storage_encrypted = true
}
resource "aws_rds_cluster" "primary" {
provider = aws.us_east_1
cluster_identifier = "${var.project}-primary"
engine = "aurora-postgresql"
engine_version = "16.3"
global_cluster_identifier = aws_rds_global_cluster.main.id
db_subnet_group_name = aws_db_subnet_group.primary.name
vpc_security_group_ids = [aws_security_group.rds.id]
master_username = var.db_username
master_password = random_password.db.result
backup_retention_period = 7
preferred_backup_window = "03:00-04:00"
skip_final_snapshot = false
deletion_protection = true
tags = { Region = "primary", Environment = var.environment }
}
resource "aws_rds_cluster_instance" "primary" {
provider = aws.us_east_1
count = 2 # Writer + 1 reader in primary region
identifier = "${var.project}-primary-${count.index}"
cluster_identifier = aws_rds_cluster.primary.id
instance_class = "db.r8g.xlarge"
engine = "aurora-postgresql"
performance_insights_enabled = true
}
# Secondary cluster (eu-west-1) — read replica region
resource "aws_rds_cluster" "secondary" {
provider = aws.eu_west_1
cluster_identifier = "${var.project}-secondary"
engine = "aurora-postgresql"
engine_version = "16.3"
global_cluster_identifier = aws_rds_global_cluster.main.id
db_subnet_group_name = aws_db_subnet_group.secondary.name
vpc_security_group_ids = [aws_security_group.rds_eu.id]
# Secondary clusters don't take master credentials (storage is replicated)
# For an encrypted global cluster, also set storage_encrypted = true and a
# region-local kms_key_id here (KMS keys don't span regions)
skip_final_snapshot = false
tags = { Region = "secondary", Environment = var.environment }
lifecycle {
ignore_changes = [replication_source_identifier]
}
}
resource "aws_rds_cluster_instance" "secondary" {
provider = aws.eu_west_1
count = 1
identifier = "${var.project}-secondary-${count.index}"
cluster_identifier = aws_rds_cluster.secondary.id
instance_class = "db.r8g.large"
engine = "aurora-postgresql"
}
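The cluster resources above reference `aws.us_east_1` and `aws.eu_west_1` provider aliases, which must be declared somewhere in the module — a minimal sketch:

```hcl
# infrastructure/aurora-global/providers.tf
provider "aws" {
  alias  = "us_east_1"
  region = "us-east-1"
}

provider "aws" {
  alias  = "eu_west_1"
  region = "eu-west-1"
}
```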
2. Route 53 Latency-Based Routing
# infrastructure/dns/main.tf
# Health checks for each region's ALB
resource "aws_route53_health_check" "us_east_1" {
fqdn = aws_lb.us_east_1.dns_name
port = 443
type = "HTTPS"
resource_path = "/health"
failure_threshold = 3
request_interval = 10
tags = { Name = "${var.project}-hc-us-east-1" }
}
resource "aws_route53_health_check" "eu_west_1" {
fqdn = aws_lb.eu_west_1.dns_name
port = 443
type = "HTTPS"
resource_path = "/health"
failure_threshold = 3
request_interval = 10
tags = { Name = "${var.project}-hc-eu-west-1" }
}
# Latency-based records: Route 53 picks the nearest healthy region
resource "aws_route53_record" "api_us" {
zone_id = data.aws_route53_zone.main.zone_id
name = "api.${var.domain}"
type = "A"
set_identifier = "us-east-1"
latency_routing_policy {
region = "us-east-1"
}
health_check_id = aws_route53_health_check.us_east_1.id
alias {
name = aws_lb.us_east_1.dns_name
zone_id = aws_lb.us_east_1.zone_id
evaluate_target_health = true
}
}
resource "aws_route53_record" "api_eu" {
zone_id = data.aws_route53_zone.main.zone_id
name = "api.${var.domain}"
type = "A"
set_identifier = "eu-west-1"
latency_routing_policy {
region = "eu-west-1"
}
health_check_id = aws_route53_health_check.eu_west_1.id
alias {
name = aws_lb.eu_west_1.dns_name
zone_id = aws_lb.eu_west_1.zone_id
evaluate_target_health = true
}
}
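Latency routing fits active-active, where every region serves traffic. For strict active-passive you may prefer Route 53 failover routing instead, which sends all traffic to the primary while its health check passes. A sketch reusing the health checks above (record names are illustrative):

```hcl
resource "aws_route53_record" "api_primary" {
  zone_id         = data.aws_route53_zone.main.zone_id
  name            = "api.${var.domain}"
  type            = "A"
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.us_east_1.id

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = aws_lb.us_east_1.dns_name
    zone_id                = aws_lb.us_east_1.zone_id
    evaluate_target_health = true
  }
}

# The standby record mirrors this one with set_identifier = "secondary",
# failover_routing_policy type = "SECONDARY", and the eu-west-1 ALB alias.
```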
3. Application Layer: Read/Write Splitting
// src/lib/db/multi-region.ts
import { PrismaClient } from '@prisma/client';
// Separate clients for write (primary) and read (local replica).
// Exported so the health check and read-after-write modules can reuse them.
export const writeDb = new PrismaClient({
  datasourceUrl: process.env.DATABASE_URL_PRIMARY, // us-east-1 writer
  log: ['error'],
});
export const readDb = new PrismaClient({
  datasourceUrl: process.env.DATABASE_URL_REPLICA, // local regional replica
  log: ['error'],
});
// Typed wrapper that enforces read/write routing
export const db = {
// Reads go to local replica (low latency)
query: readDb,
// Writes always go to primary
mutation: writeDb,
};
// Usage example:
// await db.query.post.findMany({ where: { status: 'published' } });
// await db.mutation.post.create({ data: { ... } });
Handling Replication Lag
// src/lib/db/read-after-write.ts
// Problem: user writes, then immediately reads — might see stale data from replica
// Solution: route reads to primary for a short window after a write
import { AsyncLocalStorage } from 'async_hooks';
import { writeDb, readDb } from './multi-region';
const readAfterWriteStorage = new AsyncLocalStorage<{ expiresAt: number }>();
// Middleware: wrap each request in a context. AsyncLocalStorage.run() only
// sets the store for the duration of its callback, so the whole handler must
// execute inside it — markReadAfterWrite then mutates that shared store.
export function runWithRequestContext<T>(handler: () => Promise<T>): Promise<T> {
  return readAfterWriteStorage.run({ expiresAt: 0 }, handler);
}
export function markReadAfterWrite(windowMs = 5000) {
  // Signal: reads in this request should hit the primary for the next 5s
  const store = readAfterWriteStorage.getStore();
  if (store) store.expiresAt = Date.now() + windowMs;
}
export function getDb() {
  const store = readAfterWriteStorage.getStore();
  if (store && Date.now() < store.expiresAt) {
    // Within read-after-write window — use primary for reads too
    return writeDb;
  }
  return readDb;
}
// In your API routes:
export async function updateUserProfile(userId: string, data: ProfileInput) {
  // Write to primary
  const updated = await writeDb.user.update({ where: { id: userId }, data });
  // Signal: subsequent reads in this request should hit primary
  markReadAfterWrite();
  return updated;
}
// AsyncLocalStorage is per-process. If the follow-up read can land on a
// different instance (or region), use a Redis key with TTL instead:
import { redis } from '../redis';
async function isInReadAfterWriteWindow(userId: string): Promise<boolean> {
  const key = `raw:user:${userId}`;
  return (await redis.exists(key)) === 1;
}
async function setReadAfterWriteWindow(userId: string, ttlMs = 5000): Promise<void> {
  await redis.set(`raw:user:${userId}`, '1', 'PX', ttlMs); // ioredis-style TTL in ms
}
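The routing decision behind the Redis variant can be exercised without a server. Here is a sketch with an in-memory TTL map standing in for Redis (illustrative only — in production the window must live in shared storage, since instances in different regions don't share process memory):

```typescript
// In-memory stand-in for the Redis key-with-TTL approach.
const windows = new Map<string, number>(); // key → expiry timestamp (ms)

function setReadAfterWriteWindow(userId: string, ttlMs = 5000): void {
  windows.set(`raw:user:${userId}`, Date.now() + ttlMs);
}

function isInReadAfterWriteWindow(userId: string): boolean {
  const expiresAt = windows.get(`raw:user:${userId}`);
  return expiresAt !== undefined && Date.now() < expiresAt;
}

// Route reads: primary while inside the window, local replica otherwise
function pickDb(userId: string): 'primary' | 'replica' {
  return isInReadAfterWriteWindow(userId) ? 'primary' : 'replica';
}

setReadAfterWriteWindow('u1');
console.log(pickDb('u1')); // 'primary' — just wrote
console.log(pickDb('u2')); // 'replica' — no recent write
```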
4. Health Check Endpoint
// src/app/api/health/route.ts
import { NextResponse } from 'next/server';
import { writeDb, readDb } from '../../../lib/db/multi-region';
export const dynamic = 'force-dynamic';
export const runtime = 'nodejs';
export async function GET() {
const checks = await Promise.allSettled([
writeDb.$queryRaw`SELECT 1`,
readDb.$queryRaw`SELECT 1`,
checkRedis(),
]);
const [primary, replica, cache] = checks;
const healthy =
primary.status === 'fulfilled' &&
cache.status === 'fulfilled';
// Note: a replica failure alone is reported as degraded but doesn't fail
// the check — reads can fall back to primary, so we avoid a needless failover
const details = {
region: process.env.AWS_REGION ?? 'unknown',
primary: primary.status === 'fulfilled' ? 'ok' : 'degraded',
replica: replica.status === 'fulfilled' ? 'ok' : 'degraded',
cache: cache.status === 'fulfilled' ? 'ok' : 'degraded',
timestamp: new Date().toISOString(),
};
return NextResponse.json(details, {
status: healthy ? 200 : 503,
});
}
async function checkRedis(): Promise<void> {
const { redis } = await import('../../../lib/redis');
await redis.ping();
}
5. Failover Runbook (Automated)
# infrastructure/aurora-global/failover.tf
# CloudWatch alarm + Lambda to trigger Aurora failover automatically
resource "aws_cloudwatch_metric_alarm" "primary_db_down" {
alarm_name = "${var.project}-primary-db-unavailable"
comparison_operator = "LessThanOrEqualToThreshold" # fires when connections drop to 0
evaluation_periods = 3
metric_name = "DatabaseConnections"
namespace = "AWS/RDS"
period = 60
statistic = "Sum"
threshold = 0
treat_missing_data = "breaching" # Missing data = alarm
dimensions = {
DBClusterIdentifier = aws_rds_cluster.primary.cluster_identifier
}
alarm_actions = [aws_sns_topic.ops_alerts.arn]
# For auto-failover: trigger Lambda via SNS that calls
# aws rds failover-global-cluster --global-cluster-identifier ...
}
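For the runbook, the manual equivalent of that Lambda is a single CLI call — a sketch with placeholder identifiers (substitute your global cluster name and the ARN of the secondary cluster to promote):

```shell
# Managed failover of the Aurora global cluster to eu-west-1.
# Identifiers below are placeholders.
aws rds failover-global-cluster \
  --global-cluster-identifier myproject-global \
  --target-db-cluster-identifier arn:aws:rds:eu-west-1:123456789012:cluster:myproject-secondary
```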
Cost Reference
| Architecture | Monthly cost (medium SaaS) | Notes |
|---|---|---|
| Single region (us-east-1) | $800–2,000 | Baseline |
| Active-passive (2 regions) | $1,400–3,500 | ~1.7x single region |
| Active-active (2 regions) | $2,000–5,000 | ~2.5x single region |
| Aurora Global DB (3 regions) | $3,000–8,000 | Includes storage replication |
| Route 53 latency routing | +$2–15/mo | Negligible |
See Also
- AWS ECS Fargate in Production: Task Definitions and Blue/Green Deploys
- AWS CloudFront Edge: Caching Strategy and Lambda@Edge
- Kubernetes Cost Optimization: Right-Sizing, Spot Nodes, and Autoscaling
- Terraform Modules: Reusable Infrastructure and Remote State
- SaaS Audit Logging: Immutable Trails and SOC2 Compliance
Working With Viprasol
Approaching the scale where a single-region outage would cost you customers, or facing GDPR data residency requirements? We design and implement multi-region AWS architectures with Aurora Global Database, Route 53 latency routing, read-after-write consistency handling, and automated failover — with full Terraform IaC and documented runbooks.
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.