Zero Downtime Deployment: Blue-Green, Canary, and Feature Flags Explained
A production outage during deployment costs money, damages customer trust, and creates exactly the wrong incentive for engineering teams — if deploying is risky, teams deploy less frequently, which makes each deployment larger and riskier.
Zero downtime deployment breaks this cycle. When every deploy is safe, small, and reversible, teams ship more often, catch problems earlier, and build a culture where continuous improvement is the default.
This guide covers the four primary strategies — rolling updates, blue-green, canary, and feature flags — with production-ready implementation examples for Kubernetes, AWS, and common CI/CD pipelines.
Why Deployments Cause Downtime
Before choosing a strategy, understand the failure modes:
- Cold cutover: Old version stopped, new version started; gap between them = downtime
- Failed healthcheck: New version starts, fails health check, traffic stays on failing instance
- Incompatible database migration: New code assumes schema that doesn't exist yet (or vice versa)
- Connection draining: In-flight requests killed when pod/instance is terminated
- Dependency version mismatch: New service version requires updated sidecar/library not yet deployed
Each strategy addresses some of these. None addresses all of them without the others.
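The connection-draining failure mode, in particular, is fixed in application code rather than in deployment config: handle SIGTERM, stop accepting new connections, and let in-flight requests finish. A minimal Node.js sketch; the port and the 25-second hard deadline are illustrative, and the deadline must stay under your `terminationGracePeriodSeconds`:

```typescript
import * as http from "node:http";

// Stop accepting new connections on SIGTERM, let in-flight requests finish,
// then exit. This is the app-side half of connection draining; a preStop
// sleep in the Deployment spec is the platform-side half.
function installGracefulShutdown(server: http.Server, hardLimitMs = 25_000): void {
  process.on("SIGTERM", () => {
    server.close(() => process.exit(0)); // fires once all sockets have drained
    // Hard deadline in case a request hangs; keep it below terminationGracePeriodSeconds
    const timer = setTimeout(() => process.exit(1), hardLimitMs);
    timer.unref(); // don't keep the process alive just for this timer
  });
}

const server = http.createServer((req, res) => res.end("ok"));
installGracefulShutdown(server);
server.listen(3000);
```

Kubernetes sends SIGTERM, waits up to the grace period, then sends SIGKILL; code like this is what turns that window into an actual drain instead of dropped requests.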
Strategy 1: Rolling Update
The simplest zero-downtime approach. Replace instances one at a time, waiting for each to become healthy before proceeding.
Kubernetes Rolling Update
```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2          # Max extra pods during update
      maxUnavailable: 0    # Never reduce below desired replica count
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
        - name: api
          image: your-registry/api:${VERSION}
          ports:
            - containerPort: 3000
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health/live
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                # Allow in-flight requests to complete before pod terminates
                command: ["/bin/sh", "-c", "sleep 10"]
      terminationGracePeriodSeconds: 30
```
Readiness vs. Liveness probes — get this right:
- readinessProbe: "Is this pod ready to receive traffic?" If it fails, Kubernetes removes the pod from the load balancer
- livenessProbe: "Is this pod still alive?" If it fails, Kubernetes restarts the pod
```javascript
// Health check endpoints — these must be fast and accurate

// GET /health/ready — fails if DB is unavailable, queues are backed up, etc.
app.get('/health/ready', async (req, res) => {
  try {
    await db.raw('SELECT 1'); // Quick DB connectivity check
    res.json({ status: 'ready' });
  } catch {
    res.status(503).json({ status: 'not ready', reason: 'database unavailable' });
  }
});

// GET /health/live — only fails if the process itself is broken
app.get('/health/live', (req, res) => {
  res.json({ status: 'alive' });
});
```
Rolling update limitations:
- Both versions run simultaneously during rollout — API must be backward-compatible
- Slow for large clusters (rolling through 50 pods takes time)
- No traffic control (you can't route only 5% of traffic to the new version)
Strategy 2: Blue-Green Deployment
Two identical environments (blue = current, green = new). Traffic switches atomically from blue to green. Rollback = switch back to blue.
```
Before deploy:  100% traffic → BLUE  (v1.2)
During deploy:  green (v1.3) receives 0% traffic, being prepared
After deploy:   100% traffic → GREEN (v1.3)
Rollback:       100% traffic → BLUE  (v1.2) — instant
```
AWS ALB Blue-Green
```bash
#!/bin/bash
# blue-green-deploy.sh
# Assumes LISTENER_ARN, RULE_ARN, BLUE_TG_ARN, GREEN_TG_ARN, CLUSTER,
# and NEW_VERSION are set in the environment.

CURRENT_TG=$(aws elbv2 describe-rules \
  --listener-arn "$LISTENER_ARN" \
  --query 'Rules[?Priority==`100`].Actions[0].TargetGroupArn' \
  --output text)

# Determine which is blue and which is green
if [[ "$CURRENT_TG" == "$BLUE_TG_ARN" ]]; then
  NEW_TG="$GREEN_TG_ARN"
  LABEL="green"
else
  NEW_TG="$BLUE_TG_ARN"
  LABEL="blue"
fi

echo "Deploying to $LABEL environment ($NEW_TG)"

# Deploy new version to the inactive target group
aws ecs update-service \
  --cluster "$CLUSTER" \
  --service "api-${LABEL}" \
  --task-definition "api:${NEW_VERSION}" \
  --force-new-deployment

# Wait for service to stabilize
aws ecs wait services-stable \
  --cluster "$CLUSTER" \
  --services "api-${LABEL}"

# Run smoke tests against the inactive environment
if ! ./scripts/smoke-test.sh "https://${LABEL}.internal.example.com"; then
  echo "Smoke tests failed — aborting, traffic stays on current environment"
  exit 1
fi

# Switch traffic to new environment
aws elbv2 modify-rule \
  --rule-arn "$RULE_ARN" \
  --actions "Type=forward,TargetGroupArn=${NEW_TG}"

echo "Traffic switched to $LABEL environment"
echo "Previous environment ($CURRENT_TG) standing by for rollback"
```
Blue-green advantages:
- Instant rollback (switch DNS/load balancer back)
- New version is fully tested before it receives any production traffic
- Clean separation — no version mixing during transition
Blue-green limitations:
- Requires 2x infrastructure cost during deployment
- Database migrations must be forward/backward compatible
- Cold-start if new version hasn't warmed up before traffic switch
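The cold-start limitation is usually addressed by sending warm-up traffic to the inactive environment before the switch. A sketch of that step; `warmUp`, the path list, and the injected fetcher are illustrative, not part of any AWS tooling:

```typescript
// Prime the inactive (green) environment's caches, JIT, and connection
// pools before it takes production traffic. The fetcher is injected so
// the sketch isn't tied to a real environment or HTTP client.
type Fetcher = (url: string) => Promise<{ ok: boolean }>;

async function warmUp(
  baseUrl: string,
  paths: string[],
  get: Fetcher,
  rounds = 3,
): Promise<boolean> {
  for (let round = 0; round < rounds; round++) {
    for (const path of paths) {
      const res = await get(`${baseUrl}${path}`);
      if (!res.ok) return false; // a failing warm-up request blocks the switch
    }
  }
  return true;
}
```

In a pipeline like the script above, this would run after `services-stable` and before `modify-rule`, with traffic switched only on a `true` result.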
Strategy 3: Canary Release
Route a small percentage of production traffic to the new version, monitor metrics, then gradually increase.
```
Phase 1:   1% → new version,  99% → old   (5 minutes, watch error rates)
Phase 2:  10% → new version,  90% → old   (15 minutes)
Phase 3:  50% → new version,  50% → old   (30 minutes)
Phase 4: 100% → new version               (cleanup old version)
```
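The phased progression can be sketched as a promotion loop. `setWeight` and `errorRate` here are injected stand-ins for whatever traffic-shifting and metrics APIs you actually use (a service mesh, ingress annotations, a Prometheus query), not real library calls:

```typescript
interface CanaryStep { weight: number; bakeMinutes: number; }

// Walk the canary through increasing traffic weights; abort (weight back
// to 0, all traffic to stable) if the error rate exceeds the threshold.
async function promoteCanary(
  steps: CanaryStep[],
  setWeight: (pct: number) => Promise<void>,
  errorRate: () => Promise<number>,
  threshold = 0.01, // 1% error rate
): Promise<"promoted" | "rolled-back"> {
  for (const step of steps) {
    await setWeight(step.weight);
    // In production you'd poll errorRate() repeatedly across step.bakeMinutes;
    // a single check per step keeps the sketch short.
    if ((await errorRate()) >= threshold) {
      await setWeight(0); // roll back: all traffic returns to stable
      return "rolled-back";
    }
  }
  await setWeight(100);
  return "promoted";
}
```

Tools like Argo Rollouts (below) run exactly this loop for you, with the metric checks declared rather than hand-coded.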
Kubernetes with Argo Rollouts
```yaml
# rollout.yaml — requires the Argo Rollouts controller
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-service
spec:
  replicas: 10
  selector:
    matchLabels:
      app: api-service
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        # A failed analysis run automatically aborts and rolls back the rollout
        - analysis:
            templates:
              - templateName: error-rate-check
        - setWeight: 30
        - pause: { duration: 10m }
        - setWeight: 60
        - pause: { duration: 10m }
        - setWeight: 100
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  metrics:
    - name: error-rate
      interval: 1m
      successCondition: result[0] < 0.01   # < 1% error rate
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{status=~"5.."}[2m])) /
            sum(rate(http_requests_total[2m]))
```
Nginx Canary with Kubernetes Ingress
```yaml
# Stable ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-stable
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-stable
                port: { number: 80 }
---
# Canary ingress — nginx.ingress.kubernetes.io/canary annotations control weight
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"   # 10% to canary
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-canary
                port: { number: 80 }
```
Increase canary-weight from 10 → 30 → 60 → 100 as you validate the canary. At 100%, delete the canary ingress.
Strategy 4: Feature Flags
Decouple deployment from release. Code ships to all users, but features are gated by flag state. This is the most powerful and flexible approach — and the most commonly underused.
```typescript
// Feature flag evaluation — supports percentage rollouts and user targeting
interface FlagConfig {
  enabled: boolean;
  rolloutPercentage: number; // 0–100
  allowList?: string[];      // Always-on user IDs
  denyList?: string[];       // Always-off user IDs
}

class FeatureFlags {
  async isEnabled(flagName: string, userId: string): Promise<boolean> {
    const config = await this.getConfig(flagName);
    if (!config.enabled) return false;
    if (config.denyList?.includes(userId)) return false;
    if (config.allowList?.includes(userId)) return true;
    // Stable hash — same user always gets same bucket
    const bucket = this.stableHash(userId + flagName) % 100;
    return bucket < config.rolloutPercentage;
  }

  // Fetch the flag's config from your store (database, cache, or flag service)
  private async getConfig(flagName: string): Promise<FlagConfig> {
    throw new Error(`no flag store configured for ${flagName}`);
  }

  private stableHash(input: string): number {
    let hash = 0;
    for (let i = 0; i < input.length; i++) {
      const char = input.charCodeAt(i);
      hash = (hash << 5) - hash + char;
      hash = hash & hash; // Convert to 32-bit int
    }
    return Math.abs(hash);
  }
}

// Usage in application code
// (Cart, newCheckoutService, legacyCheckoutService are defined elsewhere in the app)
const flags = new FeatureFlags();

async function handleCheckout(userId: string, cart: Cart) {
  const useNewCheckout = await flags.isEnabled('new-checkout-flow', userId);
  if (useNewCheckout) {
    return newCheckoutService.process(cart);
  }
  return legacyCheckoutService.process(cart);
}
```
Feature flag services (2026):
| Tool | Self-Hosted | SaaS | Best For |
|---|---|---|---|
| LaunchDarkly | ❌ | ✅ | Enterprise, complex targeting |
| Unleash | ✅ | ✅ | Open-source, full control |
| Flagsmith | ✅ | ✅ | Mid-market, easy setup |
| PostHog | ✅ | ✅ | Combined with product analytics |
| Custom | ✅ | ❌ | Simple use cases, full ownership |
For simple use cases, a database-backed feature flag system is 2–3 days of engineering work and eliminates the $20k+/year LaunchDarkly bill.
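As a sketch of what those 2–3 days buy: a table-backed config lookup with a short-TTL in-memory cache, so flag checks don't hit the database on every request. The table shape, TTL, and the injected query function are assumptions, not a prescribed schema:

```typescript
// Minimal database-backed flag store. Assumes a table like:
//   feature_flags(flag_name TEXT PRIMARY KEY, enabled BOOLEAN, rollout_percentage INT)
interface FlagRow { enabled: boolean; rollout_percentage: number; }

class DbFlagStore {
  private cache = new Map<string, { row: FlagRow; expires: number }>();

  constructor(
    // Injected so the sketch isn't tied to a specific DB client
    private query: (flagName: string) => Promise<FlagRow | undefined>,
    private ttlMs = 30_000, // flag changes take effect within ~30s
  ) {}

  async getConfig(flagName: string): Promise<FlagRow> {
    const hit = this.cache.get(flagName);
    if (hit && hit.expires > Date.now()) return hit.row;
    // Missing rows default to "off" so an undeployed flag can't enable a feature
    const row = (await this.query(flagName)) ?? { enabled: false, rollout_percentage: 0 };
    this.cache.set(flagName, { row, expires: Date.now() + this.ttlMs });
    return row;
  }
}
```

The fail-closed default and the cache TTL are the two decisions that matter most here: the first keeps typos from enabling features, the second bounds both database load and how long a flag flip takes to propagate.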
Database Migration Safety
The most common cause of deployment-related outages is database schema changes that break compatibility. Use expand/contract:
```sql
-- ❌ RISKY: Adding a NOT NULL column in one step. On older PostgreSQL (< 11)
-- and many MySQL versions this rewrites the table under an exclusive lock,
-- blocking writes for the duration on large tables.
ALTER TABLE orders ADD COLUMN shipping_method TEXT NOT NULL DEFAULT 'standard';

-- ✅ RIGHT: Expand/contract across 3 deployments

-- Deployment 1: Add nullable column (no existing code breaks)
ALTER TABLE orders ADD COLUMN shipping_method TEXT;

-- Between Deployment 1 and 2: Application writes to new column, reads from both.
-- Background job backfills existing rows:
UPDATE orders SET shipping_method = 'standard' WHERE shipping_method IS NULL;

-- Deployment 2: Set the default first so concurrent inserts can't violate
-- the constraint, then add NOT NULL after the backfill completes
ALTER TABLE orders ALTER COLUMN shipping_method SET DEFAULT 'standard';
ALTER TABLE orders ALTER COLUMN shipping_method SET NOT NULL;

-- Deployment 3: Clean up any compatibility shims in application code
```
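The "writes to new column, reads from both" step between deployments 1 and 2 looks like this in application code. The field names follow the migration above; the helper functions themselves are hypothetical:

```typescript
interface OrderInput { items: string[]; shippingMethod?: string; }

// Write path during expand/contract: always populate the new column,
// even before the NOT NULL constraint exists.
function toRow(order: OrderInput) {
  return {
    items: JSON.stringify(order.items),
    shipping_method: order.shippingMethod ?? "standard",
  };
}

// Read path: tolerate NULLs in rows the backfill hasn't reached yet.
function shippingMethodOf(row: { shipping_method: string | null }): string {
  return row.shipping_method ?? "standard";
}
```

Once deployment 2 lands and the column is NOT NULL, the read-path fallback is the "compatibility shim" that deployment 3 deletes.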
Deployment Strategy Selection Guide
| Situation | Recommended Strategy |
|---|---|
| Small team, simple app, <100 users | Rolling update |
| Need instant rollback capability | Blue-green |
| Releasing risky changes to large user base | Canary |
| Separating code deploy from feature release | Feature flags |
| Database schema changes | Expand/contract + rolling |
| High-traffic e-commerce, payment flows | Canary + feature flags |
| Compliance-sensitive features (healthcare, fintech) | Feature flags with audit trail |
Most mature engineering organizations use all four in combination:
- Rolling updates for routine service deployments
- Blue-green for major infrastructure changes
- Canary for risky application changes
- Feature flags for product releases
Implementation Costs
| Scope | Investment |
|---|---|
| Rolling update setup (Kubernetes) | $3,000–$8,000 |
| Blue-green pipeline implementation | $8,000–$20,000 |
| Canary with automated analysis | $15,000–$35,000 |
| Feature flag system (custom) | $5,000–$15,000 |
| Full deployment platform (all strategies) | $30,000–$70,000 |
Most teams underinvest here relative to the value. A single prevented outage typically pays for the entire deployment infrastructure investment.
Working With Viprasol
We design and implement deployment pipelines that eliminate deployment-related downtime — from simple Kubernetes rolling updates through full canary release systems with automated analysis and rollback.
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.