Zero-Downtime Database Migrations: Expand-Contract, Shadow Tables, and Safe Deploys
Run database migrations without downtime using expand-contract patterns, shadow tables, and safe deploy sequences. Covers Flyway, Liquibase, Prisma, and PostgreSQL-specific techniques.
The most dangerous moment in a software deploy is when the new application code and old database schema are simultaneously in use — or vice versa. A naive migration strategy locks tables, blocks reads, and turns a 5-minute deploy into a 45-minute outage window.
Zero-downtime migrations aren't magic. They're a set of sequenced steps that ensure the database and application code are always compatible, regardless of what's currently deployed.
Why Migrations Break Production
Three failure modes cause most migration-related outages:
1. Table locks on ALTER TABLE
Adding a column with a DEFAULT acquires an ACCESS EXCLUSIVE lock — the strongest lock, blocking all reads and writes. Before PostgreSQL 11 (or with a volatile default such as clock_timestamp()), this also forces a full table rewrite while the lock is held. On a 100GB table, that's 10–30 minutes of downtime.
2. Deploying code before the schema is ready
New code references users.phone_number before the column exists. Every request fails with a SQL error until migration completes.
3. Deploying schema changes that break old code
You drop the users.username column, but old app instances still running (blue/green deploy mid-cutover) immediately start throwing errors.
The Expand-Contract Pattern
The solution is to decouple schema changes from application changes across three phases:
Phase 1: EXPAND → Add new structure, keep old structure
Phase 2: MIGRATE → Backfill data, deploy new application code
Phase 3: CONTRACT → Remove old structure
Example: Renaming a Column
Scenario: rename users.fullname → users.full_name
-- PHASE 1: EXPAND (deploy with old app running)
-- Add new column, keep old one
ALTER TABLE users ADD COLUMN full_name TEXT;
-- Copy existing data (batch this on large tables — see the backfill pattern below)
UPDATE users SET full_name = fullname WHERE full_name IS NULL;
-- Trigger to keep both columns in sync during transition
CREATE OR REPLACE FUNCTION sync_fullname()
RETURNS TRIGGER AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    -- OLD does not exist on INSERT; just fill whichever side is missing
    IF NEW.full_name IS NULL THEN
      NEW.full_name := NEW.fullname;
    ELSIF NEW.fullname IS NULL THEN
      NEW.fullname := NEW.full_name;
    END IF;
  ELSIF NEW.fullname IS DISTINCT FROM OLD.fullname THEN
    NEW.full_name := NEW.fullname;
  ELSIF NEW.full_name IS DISTINCT FROM OLD.full_name THEN
    NEW.fullname := NEW.full_name;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER sync_user_name
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION sync_fullname();
// PHASE 2: MIGRATE — deploy new app code that reads/writes full_name
// But still writes to both columns via the trigger
// Old app instances still work because fullname still exists
// New application code:
export async function getUser(id: string): Promise<User> {
const { rows } = await db.query(
"SELECT id, email, full_name, created_at FROM users WHERE id = $1",
[id]
);
return rows[0];
}
-- PHASE 3: CONTRACT (after all old app instances are drained)
-- Drop the trigger and old column
DROP TRIGGER sync_user_name ON users;
DROP FUNCTION sync_fullname();
ALTER TABLE users DROP COLUMN fullname;
Adding NOT NULL Columns Safely
The naive approach locks the table:
-- ❌ DANGEROUS on PG 10 and earlier (or with a volatile default): full table
-- rewrite under ACCESS EXCLUSIVE lock. PG 11+ fills a non-volatile default
-- without a rewrite, but the staged sequence below is safe everywhere.
ALTER TABLE orders ADD COLUMN processed_at TIMESTAMPTZ NOT NULL DEFAULT NOW();
The safe approach:
-- STEP 1: Add nullable column (brief ACCESS EXCLUSIVE lock, but no table rewrite)
ALTER TABLE orders ADD COLUMN processed_at TIMESTAMPTZ;
-- STEP 2: Backfill in batches. Note: a DO block runs as one transaction, so row
-- locks accumulate until it commits — on very large tables, drive each batch as
-- its own transaction from an external script instead
DO $$
DECLARE
batch_size INT := 5000;
last_id BIGINT := 0;
max_id BIGINT;
BEGIN
SELECT MAX(id) INTO max_id FROM orders;
WHILE last_id <= max_id LOOP
UPDATE orders
SET processed_at = created_at -- or appropriate default
WHERE id > last_id
AND id <= last_id + batch_size
AND processed_at IS NULL;
last_id := last_id + batch_size;
PERFORM pg_sleep(0.1); -- Throttle to avoid replication lag
END LOOP;
END $$;
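Because the DO block above commits everything in a single transaction, its row locks are held until the very end. For a genuinely online backfill, drive the batches from outside the database and commit each one separately. A minimal sketch in TypeScript — `runBatch` is a hypothetical callback standing in for the real UPDATE through your driver (e.g. the pg library):

```typescript
// Sketch: drive the batched backfill from an external script so each batch
// commits in its own transaction, releasing row locks between batches.
// `runBatch` is a placeholder for the real UPDATE, e.g.:
//   UPDATE orders SET processed_at = created_at
//   WHERE id > $1 AND id <= $2 AND processed_at IS NULL
type RunBatch = (lowerId: number, upperId: number) => Promise<number>;

export async function backfillInBatches(
  maxId: number,
  batchSize: number,
  runBatch: RunBatch,
  sleepMs = 100, // throttle between batches to limit replication lag
): Promise<number> {
  let totalUpdated = 0;
  for (let lastId = 0; lastId <= maxId; lastId += batchSize) {
    // One transaction per call; locks released as soon as each batch commits
    totalUpdated += await runBatch(lastId, lastId + batchSize);
    await new Promise((resolve) => setTimeout(resolve, sleepMs));
  }
  return totalUpdated;
}
```

The same loop also makes progress observable (log `lastId` each iteration) and restartable: the `processed_at IS NULL` predicate makes re-running a partially completed backfill harmless.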
-- STEP 3: Add NOT NULL constraint using NOT VALID (validates without full scan)
ALTER TABLE orders
ADD CONSTRAINT orders_processed_at_not_null
CHECK (processed_at IS NOT NULL) NOT VALID;
-- STEP 4: Validate the constraint (takes only a SHARE UPDATE EXCLUSIVE lock — reads and writes continue)
ALTER TABLE orders VALIDATE CONSTRAINT orders_processed_at_not_null;
-- STEP 5 (optional): Convert to a real NOT NULL after validation
-- On PG 12+ this skips the full-table scan because the validated CHECK constraint already proves it
ALTER TABLE orders ALTER COLUMN processed_at SET NOT NULL;
ALTER TABLE orders DROP CONSTRAINT orders_processed_at_not_null;
Adding Indexes Without Locking
-- ❌ Blocks all writes (SHARE lock) for the entire index build; reads still work
CREATE INDEX idx_orders_user_id ON orders(user_id);
-- ✅ Builds concurrently — only takes ShareUpdateExclusiveLock (reads/writes continue)
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);
-- Note: CONCURRENTLY is slower and can't run inside a transaction
-- If it fails midway, it leaves an INVALID index behind; clean up with:
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_user_id;
-- Then retry
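Even a "fast" ALTER can cause an outage if it queues behind a long-running transaction: while the ALTER waits for its ACCESS EXCLUSIVE lock, every query arriving after it queues behind it in turn. The standard defence is a short lock_timeout plus retries. A sketch of that pattern — `execDdl` is a hypothetical stand-in for your driver's query call (e.g. pg's client.query):

```typescript
// Sketch: run DDL with a short lock_timeout so it gives up quickly instead of
// blocking traffic, then retry with backoff. `execDdl` stands in for a real
// driver call; lock_timeout is a standard PostgreSQL session setting.
export async function runDdlWithRetry(
  execDdl: (sql: string) => Promise<void>,
  ddl: string,
  { attempts = 5, lockTimeout = "2s", backoffMs = 1000 } = {},
): Promise<void> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      await execDdl(`SET lock_timeout = '${lockTimeout}'`);
      await execDdl(ddl); // fails fast with a lock-timeout error if blocked
      return;
    } catch (err) {
      if (attempt === attempts) throw err; // out of retries: surface the error
      // Back off before retrying so in-flight transactions can finish
      await new Promise((resolve) => setTimeout(resolve, backoffMs * attempt));
    }
  }
}
```

With this wrapper the worst case for application traffic is a 2-second stall per attempt, rather than an unbounded queue behind a stuck ALTER.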
Shadow Table Pattern for Structural Changes
When you need to change a table's structure dramatically (partition it, change primary key type, add a new sort column), shadow tables let you do it online:
-- Scenario: migrate users table from BIGINT id to UUID id
-- STEP 1: Create new table with target structure
CREATE TABLE users_new (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(), -- built-in since PG 13; needs pgcrypto before that
legacy_id BIGINT UNIQUE NOT NULL, -- keep for foreign key mapping
email TEXT UNIQUE NOT NULL,
name TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- STEP 2: Install CDC trigger on old table
CREATE OR REPLACE FUNCTION mirror_to_users_new()
RETURNS TRIGGER AS $$
BEGIN
IF TG_OP = 'INSERT' THEN
INSERT INTO users_new (legacy_id, email, name, created_at)
VALUES (NEW.id, NEW.email, NEW.name, NEW.created_at)
ON CONFLICT (legacy_id) DO NOTHING;
ELSIF TG_OP = 'UPDATE' THEN
UPDATE users_new SET email = NEW.email, name = NEW.name
WHERE legacy_id = NEW.id;
ELSIF TG_OP = 'DELETE' THEN
DELETE FROM users_new WHERE legacy_id = OLD.id;
END IF;
RETURN NULL;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER mirror_users
AFTER INSERT OR UPDATE OR DELETE ON users
FOR EACH ROW EXECUTE FUNCTION mirror_to_users_new();
-- STEP 3: Backfill historical data (batched)
INSERT INTO users_new (legacy_id, email, name, created_at)
SELECT id, email, name, created_at FROM users
ON CONFLICT (legacy_id) DO NOTHING;
-- STEP 4: Verify row counts match
SELECT
(SELECT COUNT(*) FROM users) AS old_count,
(SELECT COUNT(*) FROM users_new) AS new_count;
-- STEP 5: Cutover (brief lock window)
BEGIN;
LOCK TABLE users IN ACCESS EXCLUSIVE MODE;
-- Final sync of any rows that came in during backfill
INSERT INTO users_new (legacy_id, email, name, created_at)
SELECT id, email, name, created_at FROM users
ON CONFLICT (legacy_id) DO NOTHING;
-- Drop the mirror trigger: it targets users_new, which is about to be renamed
DROP TRIGGER mirror_users ON users;
-- Swap tables
ALTER TABLE users RENAME TO users_old;
ALTER TABLE users_new RENAME TO users;
-- Update sequences, views, FK references here
COMMIT;
-- STEP 6: After full validation, drop old table
-- DROP TABLE users_old; -- Wait at least 1 week to be safe
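Step 4's count check is a point-in-time snapshot; under live write traffic you want the gap to stay near zero before you take the cutover lock. A sketch of a convergence poll — `getOldCount` and `getNewCount` are hypothetical callbacks that would run the two COUNT(*) queries:

```typescript
// Sketch: poll old and new table counts until they converge (within maxGap)
// before starting the cutover transaction. The count getters are placeholders
// for SELECT COUNT(*) queries against users and users_new.
export async function waitForConvergence(
  getOldCount: () => Promise<number>,
  getNewCount: () => Promise<number>,
  { maxGap = 0, attempts = 30, intervalMs = 2000 } = {},
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    const [oldCount, newCount] = await Promise.all([getOldCount(), getNewCount()]);
    if (Math.abs(oldCount - newCount) <= maxGap) return true; // safe to cut over
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false; // never converged: investigate before cutting over
}
```

If the gap never closes, the usual culprits are a broken mirror trigger or a backfill that's still running — either way, abort the cutover rather than extending the lock window.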
Migration Tool Comparison
| Tool | Language | Approach | Version Control | Notes |
|---|---|---|---|---|
| Flyway | SQL/Java | Versioned (V1__name.sql) | Git-friendly | AWS RDS support |
| Liquibase | XML/SQL/YAML | Changeset-based | Checksum validation | Docker-native |
| Prisma | TypeScript | Auto-generated | Migration files | Full-stack JS |
| golang-migrate | SQL | Sequential versioned | Git-friendly | Minimal deps |
| Alembic | Python | Auto + manual | Git-friendly | SQLAlchemy native |
Prisma Migration Safety
// schema.prisma — Prisma generates migrations from schema diffs
model User {
id String @id @default(uuid())
email String @unique
fullName String? // Nullable first (expand phase)
legacyName String? // Old column kept during transition
createdAt DateTime @default(now())
}
# Generate the migration without applying it (--create-only), so you can review the SQL first
npx prisma migrate dev --create-only --name add_full_name_column
# Review the generated SQL before applying
cat prisma/migrations/20260904_add_full_name_column/migration.sql
# Apply in production (no prompts)
npx prisma migrate deploy
// prisma/migrations/20260904_add_full_name_column/migration.sql
-- Prisma generates this; review and modify if needed
ALTER TABLE "User" ADD COLUMN "fullName" TEXT;
-- Add your backfill manually before making it NOT NULL:
-- UPDATE "User" SET "fullName" = "legacyName" WHERE "fullName" IS NULL;
Deploy Sequence for Zero Downtime
// scripts/safe-deploy.ts
// Enforces correct migration + deploy ordering
interface DeployStep {
name: string;
phase: "pre-deploy" | "deploy" | "post-deploy";
run: () => Promise<void>;
rollback?: () => Promise<void>;
}
const deploySteps: DeployStep[] = [
{
name: "Run expand migrations",
phase: "pre-deploy",
run: async () => {
// Only additive migrations: new tables, new nullable columns, new indexes
await runMigrations({ tags: ["expand"] });
},
rollback: async () => {
await rollbackMigrations({ tags: ["expand"] });
},
},
{
name: "Deploy application",
phase: "deploy",
run: async () => {
await deployToECS({ desiredCount: 2, maxSurge: "100%", minHealthy: "50%" });
await waitForHealthCheck({ timeout: 300_000 });
},
rollback: async () => {
await rollbackECS();
},
},
{
name: "Run contract migrations",
phase: "post-deploy",
run: async () => {
// Only after all old instances are drained
await waitForOldInstanceDrain({ timeout: 60_000 });
await runMigrations({ tags: ["contract"] });
},
},
];
async function safeDeploy() {
const executed: DeployStep[] = [];
for (const step of deploySteps) {
console.log(`[${step.phase}] ${step.name}`);
try {
await step.run();
executed.push(step);
} catch (error) {
console.error(`Step failed: ${step.name}`, error);
// Rollback in reverse order
for (const completedStep of executed.reverse()) {
if (completedStep.rollback) {
console.log(`Rolling back: ${completedStep.name}`);
await completedStep.rollback().catch(console.error);
}
}
throw error;
}
}
}
Migration Tagging with Flyway
-- V1__create_users.sql (baseline)
CREATE TABLE users (
id BIGSERIAL PRIMARY KEY,
email TEXT UNIQUE NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- V2__add_full_name_expand.sql (expand phase)
-- Flyway has no comment-based tags; split expand and contract scripts via a
-- naming convention or separate -locations directories so they run at different stages
ALTER TABLE users ADD COLUMN full_name TEXT;
ALTER TABLE users ADD COLUMN legacy_name TEXT;
UPDATE users SET legacy_name = email; -- illustrative backfill of the transitional column
-- V3__add_full_name_index.sql
-- CONCURRENTLY cannot run inside a transaction, and Flyway wraps each migration
-- in one by default — configure this script to run non-transactionally
CREATE INDEX CONCURRENTLY idx_users_full_name ON users(full_name);
-- V4__drop_legacy_name_contract.sql (contract phase — run only after the deploy completes)
ALTER TABLE users DROP COLUMN legacy_name;
Cost of Migration Downtime
| Table Size | Naive ALTER TABLE | pg_repack | Shadow Table | CONCURRENTLY index |
|---|---|---|---|---|
| 1 GB | 30–120s downtime | ~5 min online | ~10 min online | ~3 min online |
| 10 GB | 5–30 min downtime | ~45 min online | ~90 min online | ~25 min online |
| 100 GB | 1–6h downtime | ~8h online | ~15h online | ~4h online |
| 1 TB | Days of downtime | 3–5 days online | 5–7 days online | 2–3 days online |
pg_repack is the standard tool for large online table rewrites — it builds a shadow copy internally, syncs changes via triggers, then does a brief lock swap. Note that it rewrites the table as-is (for bloat removal or reclustering); arbitrary schema changes still need the manual shadow-table pattern above.
# Install pg_repack (requires extension on server)
psql -c "CREATE EXTENSION pg_repack;"
# Repack a table online (no downtime)
pg_repack --host=db.prod.internal --dbname=myapp --table=orders --jobs=2
Monitoring During Migrations
-- Monitor long-running migrations in real time
SELECT
pid,
now() - pg_stat_activity.query_start AS duration,
query,
state,
wait_event_type,
wait_event
FROM pg_stat_activity
WHERE state != 'idle'
AND query_start < NOW() - INTERVAL '30 seconds'
ORDER BY duration DESC;
-- Check lock contention (simplified from the PostgreSQL wiki's lock-monitoring
-- query; matches relation-level locks only)
SELECT
blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS blocking_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.relation = blocked_locks.relation
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted
AND blocking_locks.granted;
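The queries above tell you what is stuck; deciding what to cancel is the risky part. One pattern is to cancel long-running blockers automatically while whitelisting the migration's own backend, with the actual cancellation going through pg_cancel_backend(pid). A sketch of the pure decision logic — the row shape mirrors the pg_stat_activity columns selected above:

```typescript
// Sketch: given rows derived from pg_stat_activity, pick blocker pids to
// cancel via pg_cancel_backend. Pure logic, so it can be tested without a DB.
interface ActivityRow {
  pid: number;
  durationMs: number; // now() - query_start
  query: string;
}

export function pidsToCancel(
  rows: ActivityRow[],
  thresholdMs: number,
  protectedPids: number[] = [], // e.g. the migration's own backend pid
): number[] {
  return rows
    .filter((r) => r.durationMs >= thresholdMs) // only long-running queries
    .filter((r) => !protectedPids.includes(r.pid)) // never cancel the migration itself
    .map((r) => r.pid);
}
```

Prefer pg_cancel_backend (cancels the current query) over pg_terminate_backend (kills the connection) — the former is far less disruptive to application connection pools.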
See Also
- PostgreSQL Performance Tuning for Production — query optimization
- PostgreSQL Extensions for Production Apps — advanced PG features
- Kubernetes Operators: Automating Complex Workloads — automated DB ops
- DevOps CI/CD Pipeline Setup — deployment automation
Working With Viprasol
Migration accidents are among the most costly engineering incidents — and they're almost entirely preventable with the right sequencing strategy. Our team has run zero-downtime migrations on PostgreSQL databases ranging from 10GB to 8TB, using expand-contract, shadow tables, and pg_repack.
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.