Zero Downtime Deployment: Blue-Green, Canary, and Feature Flags Explained
A production outage during deployment costs money, damages customer trust, and creates exactly the wrong incentive for engineering teams — if deploying is risky, teams deploy less frequently, which makes each deployment larger and riskier.
Zero downtime deployment breaks this cycle. When every deploy is safe, small, and reversible, teams ship more often, catch problems earlier, and build a culture where continuous improvement is the default.
This guide covers the four primary strategies — rolling updates, blue-green, canary, and feature flags — with production-ready implementation examples for Kubernetes, AWS, and common CI/CD pipelines.
Why Deployments Cause Downtime
Before choosing a strategy, understand the failure modes:
- Cold cutover: Old version stopped, new version started; gap between them = downtime
- Failed healthcheck: New version starts, fails health check, traffic stays on failing instance
- Incompatible database migration: New code assumes schema that doesn't exist yet (or vice versa)
- Connection draining: In-flight requests killed when pod/instance is terminated
- Dependency version mismatch: New service version requires updated sidecar/library not yet deployed
Each strategy addresses some of these. None addresses all of them without the others.
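The connection-draining failure mode, in particular, is fixed in application code rather than in deployment config: handle SIGTERM, stop accepting new connections, and let in-flight requests finish. A minimal Node.js sketch; the port and the 25-second hard deadline are illustrative, and the deadline must stay under your `terminationGracePeriodSeconds`:

```typescript
import * as http from "node:http";

// Stop accepting new connections on SIGTERM, let in-flight requests finish,
// then exit. This is the app-side half of connection draining; a preStop
// sleep in the Deployment spec is the platform-side half.
function installGracefulShutdown(server: http.Server, hardLimitMs = 25_000): void {
  process.on("SIGTERM", () => {
    server.close(() => process.exit(0)); // fires once all sockets have drained
    // Hard deadline in case a request hangs; keep it below terminationGracePeriodSeconds
    const timer = setTimeout(() => process.exit(1), hardLimitMs);
    timer.unref(); // don't keep the process alive just for this timer
  });
}

const server = http.createServer((req, res) => res.end("ok"));
installGracefulShutdown(server);
server.listen(3000);
```

Kubernetes sends SIGTERM, waits up to the grace period, then sends SIGKILL; code like this is what turns that window into an actual drain instead of dropped requests.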
Strategy 1: Rolling Update
The simplest zero-downtime approach. Replace instances one at a time, waiting for each to become healthy before proceeding.
Kubernetes Rolling Update
```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2          # Max extra pods during update
      maxUnavailable: 0    # Never reduce below desired replica count
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
        - name: api
          image: your-registry/api:${VERSION}
          ports:
            - containerPort: 3000
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health/live
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                # Allow in-flight requests to complete before pod terminates
                command: ["/bin/sh", "-c", "sleep 10"]
      terminationGracePeriodSeconds: 30
```
Readiness vs. Liveness probes — get this right:
- readinessProbe: "Is this pod ready to receive traffic?" If it fails, Kubernetes removes the pod from the load balancer
- livenessProbe: "Is this pod still alive?" If it fails, Kubernetes restarts the pod
```javascript
// Health check endpoints — these must be fast and accurate

// GET /health/ready — fails if DB is unavailable, queues are backed up, etc.
app.get('/health/ready', async (req, res) => {
  try {
    await db.raw('SELECT 1'); // Quick DB connectivity check
    res.json({ status: 'ready' });
  } catch {
    res.status(503).json({ status: 'not ready', reason: 'database unavailable' });
  }
});

// GET /health/live — only fails if the process itself is broken
app.get('/health/live', (req, res) => {
  res.json({ status: 'alive' });
});
```
Rolling update limitations:
- Both versions run simultaneously during rollout — API must be backward-compatible
- Slow for large clusters (rolling through 50 pods takes time)
- No traffic control (you can't route only 5% of traffic to the new version)
Strategy 2: Blue-Green Deployment
Two identical environments (blue = current, green = new). Traffic switches atomically from blue to green. Rollback = switch back to blue.
```
Before deploy:  100% traffic → BLUE  (v1.2)
During deploy:  green (v1.3) receives 0% traffic, being prepared
After deploy:   100% traffic → GREEN (v1.3)
Rollback:       100% traffic → BLUE  (v1.2) — instant
```
AWS ALB Blue-Green
```bash
#!/bin/bash
# blue-green-deploy.sh
# Assumes LISTENER_ARN, RULE_ARN, BLUE_TG_ARN, GREEN_TG_ARN, CLUSTER,
# and NEW_VERSION are set in the environment.

CURRENT_TG=$(aws elbv2 describe-rules \
  --listener-arn "$LISTENER_ARN" \
  --query 'Rules[?Priority==`100`].Actions[0].TargetGroupArn' \
  --output text)

# Determine which is blue and which is green
if [[ "$CURRENT_TG" == "$BLUE_TG_ARN" ]]; then
  NEW_TG="$GREEN_TG_ARN"
  LABEL="green"
else
  NEW_TG="$BLUE_TG_ARN"
  LABEL="blue"
fi

echo "Deploying to $LABEL environment ($NEW_TG)"

# Deploy new version to the inactive target group
aws ecs update-service \
  --cluster "$CLUSTER" \
  --service "api-${LABEL}" \
  --task-definition "api:${NEW_VERSION}" \
  --force-new-deployment

# Wait for service to stabilize
aws ecs wait services-stable \
  --cluster "$CLUSTER" \
  --services "api-${LABEL}"

# Run smoke tests against the inactive environment
if ! ./scripts/smoke-test.sh "https://${LABEL}.internal.example.com"; then
  echo "Smoke tests failed — aborting, traffic stays on current environment"
  exit 1
fi

# Switch traffic to new environment
aws elbv2 modify-rule \
  --rule-arn "$RULE_ARN" \
  --actions "Type=forward,TargetGroupArn=${NEW_TG}"

echo "Traffic switched to $LABEL environment"
echo "Previous environment ($CURRENT_TG) standing by for rollback"
```
Blue-green advantages:
- Instant rollback (switch DNS/load balancer back)
- New version is fully tested before it receives any production traffic
- Clean separation — no version mixing during transition
Blue-green limitations:
- Requires 2x infrastructure cost during deployment
- Database migrations must be forward/backward compatible
- Cold-start if new version hasn't warmed up before traffic switch
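The cold-start limitation is usually addressed by sending warm-up traffic to the inactive environment before the switch. A sketch of that step; `warmUp`, the path list, and the injected fetcher are illustrative, not part of any AWS tooling:

```typescript
// Prime the inactive (green) environment's caches, JIT, and connection
// pools before it takes production traffic. The fetcher is injected so
// the sketch isn't tied to a real environment or HTTP client.
type Fetcher = (url: string) => Promise<{ ok: boolean }>;

async function warmUp(
  baseUrl: string,
  paths: string[],
  get: Fetcher,
  rounds = 3,
): Promise<boolean> {
  for (let round = 0; round < rounds; round++) {
    for (const path of paths) {
      const res = await get(`${baseUrl}${path}`);
      if (!res.ok) return false; // a failing warm-up request blocks the switch
    }
  }
  return true;
}
```

In a pipeline like the script above, this would run after `services-stable` and before `modify-rule`, with traffic switched only on a `true` result.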
Strategy 3: Canary Release
Route a small percentage of production traffic to the new version, monitor metrics, then gradually increase.
```
Phase 1:   1% → new version,  99% → old   (5 minutes, watch error rates)
Phase 2:  10% → new version,  90% → old   (15 minutes)
Phase 3:  50% → new version,  50% → old   (30 minutes)
Phase 4: 100% → new version               (cleanup old version)
```
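The phased progression can be sketched as a promotion loop. `setWeight` and `errorRate` here are injected stand-ins for whatever traffic-shifting and metrics APIs you actually use (a service mesh, ingress annotations, a Prometheus query), not real library calls:

```typescript
interface CanaryStep { weight: number; bakeMinutes: number; }

// Walk the canary through increasing traffic weights; abort (weight back
// to 0, all traffic to stable) if the error rate exceeds the threshold.
async function promoteCanary(
  steps: CanaryStep[],
  setWeight: (pct: number) => Promise<void>,
  errorRate: () => Promise<number>,
  threshold = 0.01, // 1% error rate
): Promise<"promoted" | "rolled-back"> {
  for (const step of steps) {
    await setWeight(step.weight);
    // In production you'd poll errorRate() repeatedly across step.bakeMinutes;
    // a single check per step keeps the sketch short.
    if ((await errorRate()) >= threshold) {
      await setWeight(0); // roll back: all traffic returns to stable
      return "rolled-back";
    }
  }
  await setWeight(100);
  return "promoted";
}
```

Tools like Argo Rollouts (below) run exactly this loop for you, with the metric checks declared rather than hand-coded.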
Kubernetes with Argo Rollouts
```yaml
# rollout.yaml — requires the Argo Rollouts controller
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-service
spec:
  replicas: 10
  selector:
    matchLabels:
      app: api-service
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        # A failed analysis run automatically aborts and rolls back the rollout
        - analysis:
            templates:
              - templateName: error-rate-check
        - setWeight: 30
        - pause: { duration: 10m }
        - setWeight: 60
        - pause: { duration: 10m }
        - setWeight: 100
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  metrics:
    - name: error-rate
      interval: 1m
      successCondition: result[0] < 0.01   # < 1% error rate
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{status=~"5.."}[2m])) /
            sum(rate(http_requests_total[2m]))
```
Nginx Canary with Kubernetes Ingress
```yaml
# Stable ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-stable
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-stable
                port: { number: 80 }
---
# Canary ingress — nginx.ingress.kubernetes.io/canary annotations control weight
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"   # 10% to canary
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-canary
                port: { number: 80 }
```
Increase canary-weight from 10 → 30 → 60 → 100 as you validate the canary. At 100%, delete the canary ingress.
Strategy 4: Feature Flags
Decouple deployment from release. Code ships to all users, but features are gated by flag state. This is the most powerful and flexible approach — and the most commonly underused.
```typescript
// Feature flag evaluation — supports percentage rollouts and user targeting
interface FlagConfig {
  enabled: boolean;
  rolloutPercentage: number; // 0–100
  allowList?: string[];      // Always-on user IDs
  denyList?: string[];       // Always-off user IDs
}

class FeatureFlags {
  async isEnabled(flagName: string, userId: string): Promise<boolean> {
    const config = await this.getConfig(flagName);
    if (!config.enabled) return false;
    if (config.denyList?.includes(userId)) return false;
    if (config.allowList?.includes(userId)) return true;
    // Stable hash — same user always gets same bucket
    const bucket = this.stableHash(userId + flagName) % 100;
    return bucket < config.rolloutPercentage;
  }

  // Fetch the flag's config from your store (database, cache, or flag service)
  private async getConfig(flagName: string): Promise<FlagConfig> {
    throw new Error(`no flag store configured for ${flagName}`);
  }

  private stableHash(input: string): number {
    let hash = 0;
    for (let i = 0; i < input.length; i++) {
      const char = input.charCodeAt(i);
      hash = (hash << 5) - hash + char;
      hash = hash & hash; // Convert to 32-bit int
    }
    return Math.abs(hash);
  }
}

// Usage in application code
// (Cart, newCheckoutService, legacyCheckoutService are defined elsewhere in the app)
const flags = new FeatureFlags();

async function handleCheckout(userId: string, cart: Cart) {
  const useNewCheckout = await flags.isEnabled('new-checkout-flow', userId);
  if (useNewCheckout) {
    return newCheckoutService.process(cart);
  }
  return legacyCheckoutService.process(cart);
}
```
Feature flag services (2026):
| Tool | Self-Hosted | SaaS | Best For |
|---|---|---|---|
| LaunchDarkly | ❌ | ✅ | Enterprise, complex targeting |
| Unleash | ✅ | ✅ | Open-source, full control |
| Flagsmith | ✅ | ✅ | Mid-market, easy setup |
| PostHog | ✅ | ✅ | Combined with product analytics |
| Custom | ✅ | ❌ | Simple use cases, full ownership |
For simple use cases, a database-backed feature flag system is 2–3 days of engineering work and eliminates the $20k+/year LaunchDarkly bill.
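As a sketch of what those 2–3 days buy: a table-backed config lookup with a short-TTL in-memory cache, so flag checks don't hit the database on every request. The table shape, TTL, and the injected query function are assumptions, not a prescribed schema:

```typescript
// Minimal database-backed flag store. Assumes a table like:
//   feature_flags(flag_name TEXT PRIMARY KEY, enabled BOOLEAN, rollout_percentage INT)
interface FlagRow { enabled: boolean; rollout_percentage: number; }

class DbFlagStore {
  private cache = new Map<string, { row: FlagRow; expires: number }>();

  constructor(
    // Injected so the sketch isn't tied to a specific DB client
    private query: (flagName: string) => Promise<FlagRow | undefined>,
    private ttlMs = 30_000, // flag changes take effect within ~30s
  ) {}

  async getConfig(flagName: string): Promise<FlagRow> {
    const hit = this.cache.get(flagName);
    if (hit && hit.expires > Date.now()) return hit.row;
    // Missing rows default to "off" so an undeployed flag can't enable a feature
    const row = (await this.query(flagName)) ?? { enabled: false, rollout_percentage: 0 };
    this.cache.set(flagName, { row, expires: Date.now() + this.ttlMs });
    return row;
  }
}
```

The fail-closed default and the cache TTL are the two decisions that matter most here: the first keeps typos from enabling features, the second bounds both database load and how long a flag flip takes to propagate.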
Database Migration Safety
The most common cause of deployment-related outages is database schema changes that break compatibility. Use expand/contract:
```sql
-- ❌ RISKY: Adding a NOT NULL column in one step. On older PostgreSQL (< 11)
-- and many MySQL versions this rewrites the table under an exclusive lock,
-- blocking writes for the duration on large tables.
ALTER TABLE orders ADD COLUMN shipping_method TEXT NOT NULL DEFAULT 'standard';

-- ✅ RIGHT: Expand/contract across 3 deployments

-- Deployment 1: Add nullable column (no existing code breaks)
ALTER TABLE orders ADD COLUMN shipping_method TEXT;

-- Between Deployment 1 and 2: Application writes to new column, reads from both.
-- Background job backfills existing rows:
UPDATE orders SET shipping_method = 'standard' WHERE shipping_method IS NULL;

-- Deployment 2: Set the default first so concurrent inserts can't violate
-- the constraint, then add NOT NULL after the backfill completes
ALTER TABLE orders ALTER COLUMN shipping_method SET DEFAULT 'standard';
ALTER TABLE orders ALTER COLUMN shipping_method SET NOT NULL;

-- Deployment 3: Clean up any compatibility shims in application code
```
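The "writes to new column, reads from both" step between deployments 1 and 2 looks like this in application code. The field names follow the migration above; the helper functions themselves are hypothetical:

```typescript
interface OrderInput { items: string[]; shippingMethod?: string; }

// Write path during expand/contract: always populate the new column,
// even before the NOT NULL constraint exists.
function toRow(order: OrderInput) {
  return {
    items: JSON.stringify(order.items),
    shipping_method: order.shippingMethod ?? "standard",
  };
}

// Read path: tolerate NULLs in rows the backfill hasn't reached yet.
function shippingMethodOf(row: { shipping_method: string | null }): string {
  return row.shipping_method ?? "standard";
}
```

Once deployment 2 lands and the column is NOT NULL, the read-path fallback is the "compatibility shim" that deployment 3 deletes.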
Deployment Strategy Selection Guide
| Situation | Recommended Strategy |
|---|---|
| Small team, simple app, <100 users | Rolling update |
| Need instant rollback capability | Blue-green |
| Releasing risky changes to large user base | Canary |
| Separating code deploy from feature release | Feature flags |
| Database schema changes | Expand/contract + rolling |
| High-traffic e-commerce, payment flows | Canary + feature flags |
| Compliance-sensitive features (healthcare, fintech) | Feature flags with audit trail |
Most mature engineering organizations use all four in combination:
- Rolling updates for routine service deployments
- Blue-green for major infrastructure changes
- Canary for risky application changes
- Feature flags for product releases
Implementation Costs
| Scope | Investment |
|---|---|
| Rolling update setup (Kubernetes) | $3,000–$8,000 |
| Blue-green pipeline implementation | $8,000–$20,000 |
| Canary with automated analysis | $15,000–$35,000 |
| Feature flag system (custom) | $5,000–$15,000 |
| Full deployment platform (all strategies) | $30,000–$70,000 |
Most teams underinvest here relative to the value. A single prevented outage typically pays for the entire deployment infrastructure investment.
Working With Viprasol
We design and implement deployment pipelines that eliminate deployment-related downtime — from simple Kubernetes rolling updates through full canary release systems with automated analysis and rollback.
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.