Chaos Engineering: Fault Injection, Chaos Monkey, and Resilience Testing in Production
Implement chaos engineering to build resilient systems — fault injection principles, Chaos Monkey setup, AWS Fault Injection Simulator, game days, and measuring the results.
Chaos engineering is the practice of deliberately injecting failures into your system to discover weaknesses before they cause unplanned outages. The core premise: your system will fail. Chaos engineering lets you choose when, how, and with a team standing by — rather than discovering failure modes at 2am with customers affected.
Netflix coined the term and built Chaos Monkey. The principles apply to any system that needs to be resilient.
## The Chaos Engineering Hypothesis Model
Every chaos experiment follows the same structure:
- Define steady state: What does normal look like? (p99 latency < 200ms, error rate < 0.1%)
- Hypothesize: "We believe the system will remain in steady state when X fails"
- Introduce failure: Terminate instances, inject latency, block network traffic
- Observe: Did the system stay in steady state? What broke?
- Fix and repeat: Address discovered weaknesses; re-run experiment
This is not random destruction. It's scientific experimentation to build confidence in your system's resilience.
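The structure is mechanical enough to sketch in code. Below is a minimal, illustrative harness for running an experiment end to end — all names here are our own, not from any particular chaos tool:

```python
# experiment.py — the hypothesis model as a tiny harness (illustrative)
from dataclasses import dataclass
from typing import Callable

@dataclass
class ChaosExperiment:
    name: str
    hypothesis: str                   # "system stays in steady state when X fails"
    steady_state: Callable[[], bool]  # e.g. p99 < 200ms and error rate < 0.1%
    inject_fault: Callable[[], None]  # terminate instance, add latency, ...
    rollback: Callable[[], None]      # undo the fault if it doesn't self-heal

def run(exp: ChaosExperiment) -> bool:
    # Never start an experiment against a system that is already unhealthy
    assert exp.steady_state(), "System not in steady state — don't start"
    try:
        exp.inject_fault()
        ok = exp.steady_state()       # observe: did the hypothesis hold?
        print(f"{exp.name}: {'hypothesis held' if ok else 'weakness found'}")
        return ok
    finally:
        exp.rollback()                # fix and repeat
```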
## Start in Staging, Not Production
Most chaos engineering writing focuses on production experiments. For teams starting out, work through environments in order:
| Environment | When | What You Learn |
|---|---|---|
| Local | Always | Basic fault tolerance |
| Staging | First | Architecture weaknesses |
| Production | After staging is solid | Real-world behavior under actual load |
You gain confidence to run production experiments by first validating that your system handles failures in staging. Don't start with production chaos on Day 1.
## Simple Fault Injection: No Tools Required
Before installing Chaos Monkey, run these experiments manually:
```bash
# Kill one running ECS task (simulates instance failure)
TASK_ARN=$(aws ecs list-tasks \
  --cluster production \
  --service-name api \
  --query 'taskArns[0]' \
  --output text)

aws ecs stop-task \
  --cluster production \
  --task "$TASK_ARN" \
  --reason "Chaos engineering experiment: testing ECS auto-recovery"

# Watch: does ECS spin up a replacement? How long does it take?
# Expected: new task running within 30-60 seconds

# Check if ALB health checks removed the failing task from rotation
aws elbv2 describe-target-health \
  --target-group-arn "$TARGET_GROUP_ARN"
```
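To put a number on recovery time rather than watching the console, a short polling script can time how long ECS takes to restore the desired count. A sketch assuming boto3 credentials and the cluster/service names used above:

```python
# measure_recovery.py — poll ECS until the desired task count is restored
import time
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

def wait_for_recovery(cluster: str = "production", service: str = "api",
                      timeout_s: int = 300) -> float:
    """Return seconds until runningCount reaches desiredCount again."""
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
        if svc["runningCount"] >= svc["desiredCount"]:
            return time.monotonic() - start
        time.sleep(5)
    raise TimeoutError(f"{service} did not recover within {timeout_s}s")

if __name__ == "__main__":
    # Run this immediately after stop-task; the first poll or two
    # may still report the pre-kill count.
    print(f"Recovered in {wait_for_recovery():.0f}s")
```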
```bash
# Simulate database connection exhaustion
# Open 100 connections that each run a 5-minute query, then observe API behavior
for i in {1..100}; do
  psql "$DATABASE_URL" -c "SELECT pg_sleep(300);" &
done

# Hypothesis: API should return 503 (not hang) when DB pool exhausted
# Expected: connection pool timeout returns error within 5s
# Actual: might hang for 30s without proper timeout configuration
```
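The fix for the 30-second hang is a bounded pool-acquisition timeout that the API maps to a 503. A sketch using SQLAlchemy — an assumption, since the article doesn't name the API's database layer:

```python
# db.py — bound how long a request waits for a connection
# Sketch assuming SQLAlchemy; adapt to whatever pool your API uses.
from sqlalchemy import create_engine
from sqlalchemy.exc import TimeoutError as PoolTimeoutError

engine = create_engine(
    "postgresql://user:pass@db-host/app",  # placeholder DSN
    pool_size=20,
    max_overflow=0,   # don't grow past pool_size under pressure
    pool_timeout=5,   # raise after 5s instead of queueing forever
)

def run_query(sql: str):
    try:
        with engine.connect() as conn:
            return conn.exec_driver_sql(sql).fetchall()
    except PoolTimeoutError:
        # Re-raise so the API layer can return 503 instead of hanging
        raise
```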
## AWS Fault Injection Simulator (FIS)
AWS FIS provides managed fault injection for AWS infrastructure — terminate EC2 instances, throttle API calls, inject network latency, kill ECS tasks.
```hcl
# terraform/fis.tf — define reusable experiments

# IAM role for FIS to take actions
resource "aws_iam_role" "fis_role" {
  name = "fis-experiment-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "fis.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "fis_policy" {
  role = aws_iam_role.fis_role.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["ecs:StopTask", "ecs:ListTasks"]
        Resource = "*"
        Condition = {
          StringEquals = {
            "aws:ResourceTag/Environment" = "staging"
          }
        }
      },
      {
        Effect   = "Allow"
        Action   = ["ec2:DescribeInstances", "ec2:TerminateInstances"]
        Resource = "*"
        Condition = {
          StringEquals = {
            "aws:ResourceTag/ChaosTarget" = "true"
          }
        }
      }
    ]
  })
}

# Experiment: terminate 50% of ECS tasks (simulates AZ failure)
resource "aws_fis_experiment_template" "ecs_task_termination" {
  description = "Terminate 50% of API service tasks to test auto-recovery"
  role_arn    = aws_iam_role.fis_role.arn

  action {
    name      = "TerminateTasks"
    action_id = "aws:ecs:stop-task"
    # No parameters needed — the 50% selection is handled by the
    # target's selection_mode below.
    target {
      key   = "Tasks"
      value = "ecsTaskTargets"
    }
  }

  target {
    name           = "ecsTaskTargets"
    resource_type  = "aws:ecs:task"
    selection_mode = "PERCENT(50)"
    resource_tag {
      key   = "aws:ecs:serviceName"
      value = "api-service"
    }
  }

  # Stop conditions — halt the experiment if things get too bad
  stop_condition {
    source = "aws:cloudwatch:alarm"
    value  = aws_cloudwatch_metric_alarm.error_rate_critical.arn
  }

  tags = { Environment = "staging" }
}
```
```python
# run_experiment.py — start FIS experiment and monitor results
import time

import boto3

fis = boto3.client('fis', region_name='us-east-1')
cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

def run_chaos_experiment(template_id: str, duration_seconds: int = 300):
    print(f"Starting chaos experiment: {template_id}")

    # Capture baseline metrics
    baseline = get_metrics('before')

    # Start experiment
    response = fis.start_experiment(
        experimentTemplateId=template_id,
        tags={'RunBy': 'chaos-team', 'Purpose': 'resilience-testing'},
    )
    experiment_id = response['experiment']['id']
    print(f"Experiment running: {experiment_id}")

    peak_error_rate = baseline['error_rate']
    try:
        # Monitor during the experiment, tracking the worst error rate seen
        for i in range(duration_seconds // 10):
            time.sleep(10)
            metrics = get_metrics('during')
            peak_error_rate = max(peak_error_rate, metrics['error_rate'])
            print(f"t+{(i + 1) * 10}s | Error rate: {metrics['error_rate']:.2f}% | p99: {metrics['p99_ms']:.0f}ms")
    finally:
        # Wait for the system to stabilize, then capture recovery metrics
        time.sleep(120)
        recovery = get_metrics('after')
        print("\n=== Experiment Results ===")
        print(f"Baseline error rate: {baseline['error_rate']:.2f}%")
        print(f"Peak error rate: {peak_error_rate:.2f}%")
        print(f"Post-recovery error rate: {recovery['error_rate']:.2f}%")
        print(f"Steady state resumed: {'Yes' if recovery['error_rate'] < 0.5 else 'No'}")

def get_metrics(phase: str) -> dict:
    # Query CloudWatch for current error rate and p99 latency
    # ... implementation
    pass
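```

A typical invocation — the template ID here is a hypothetical placeholder; the real one comes from the Terraform output or the FIS console:

```python
# Kick off a 5-minute run against the template created above.
# "EXT1a2b3c4d" is a placeholder experiment template ID.
run_chaos_experiment("EXT1a2b3c4d", duration_seconds=300)
```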
## The Game Day
A game day is a structured chaos engineering exercise where the team deliberately causes failures and practices responding:
```markdown
# Game Day Runbook: "What Happens When the Database Goes Down?"
## Date: 2026-05-15 | Time: 10:00 AM–12:00 PM | Team: Engineering + On-Call
## Hypothesis
When the primary PostgreSQL instance becomes unavailable, the system will:
- Automatically fail over to read replica within 60 seconds (RDS Multi-AZ)
- Return 503 errors (not 502 or timeouts) for write operations during failover
- Recover to normal operation within 90 seconds
- Generate a PagerDuty alert within 2 minutes
## Setup
- Notify stakeholders (outage window is intentional)
- Confirm staging environment health
- Set up monitoring dashboard (Datadog, CloudWatch)
- Assign roles: Experiment operator, Observer, Scribe
## Experiment Steps
- 10:00 — Capture baseline metrics
- 10:05 — Operator reboots primary RDS instance
- 10:05–10:10 — Team observes behavior; Scribe records timeline
- 10:10 — Did failover complete? How long did it take?
- 10:15 — Examine error logs during failover window
- 10:30 — Debrief: what worked, what didn't
## Success Criteria
- Failover completed in < 90 seconds
- API returned 503 (not 504) during failover
- PagerDuty alert fired within 2 minutes
- No data loss
## What to Capture
- Exact timestamps of failure and recovery
- Error messages from API during outage
- Customer-visible impact (yes/no, duration)
- Any surprises (things that broke unexpectedly)
```
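For the 10:05 step ("Operator reboots primary RDS instance"), a Multi-AZ failover can be forced with an RDS reboot. A sketch using boto3 — the instance identifier is a placeholder:

```python
# trigger_failover.py — force an RDS Multi-AZ failover (staging!)
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# reboot_db_instance with ForceFailover=True fails a Multi-AZ instance
# over to its standby, which is exactly what this game day rehearses.
rds.reboot_db_instance(
    DBInstanceIdentifier="staging-postgres",  # placeholder identifier
    ForceFailover=True,
)
```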
---
## Application-Level Fault Injection
Inject failures in your application code during testing:
```typescript
// lib/chaos.ts — fault injection for staging/testing
import type { FastifyRequest, FastifyReply } from 'fastify';

const CHAOS_ENABLED = process.env.CHAOS_ENABLED === 'true';

export function chaosMiddleware() {
  // Async Fastify hooks must not also take the `done` callback —
  // returning the promise is enough.
  return async (request: FastifyRequest, reply: FastifyReply) => {
    if (!CHAOS_ENABLED) return;

    const roll = Math.random();

    // 5% of requests: inject 2-second latency
    if (roll < 0.05) {
      await new Promise(resolve => setTimeout(resolve, 2000));
    }
    // 2% of requests: return 500 error
    else if (roll < 0.07) {
      return reply.code(500).send({ error: 'Chaos engineering: injected failure' });
    }
    // 1% of requests: return 503 (service unavailable)
    else if (roll < 0.08) {
      return reply.code(503).send({ error: 'Chaos engineering: service unavailable' });
    }
  };
}

// Database chaos: randomly fail queries
export async function chaosQuery<T>(queryFn: () => Promise<T>): Promise<T> {
  if (CHAOS_ENABLED && Math.random() < 0.02) {
    throw new Error('Chaos engineering: simulated database failure');
  }
  return queryFn();
}
```
## What to Experiment On First
Prioritize experiments that test your most critical resilience mechanisms:
| Experiment | Hypothesis | Tool |
|---|---|---|
| Kill one ECS task | Service continues with remaining tasks | AWS FIS |
| Kill all tasks in one AZ | Multi-AZ routing works correctly | AWS FIS |
| RDS failover | Read replica promotes in < 90s | RDS reboot with failover |
| Redis unavailable | App degrades gracefully (cache miss, not crash) | Stop Redis container |
| Slow downstream API | Circuit breaker opens; returns cached/default response | tc (traffic control) |
| CPU spike | Horizontal scaling triggers correctly | stress-ng |
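For the slow-downstream-API row, `tc` netem injects latency at the network layer. A minimal wrapper sketch (Linux only, requires root; the interface name and delay values are assumptions):

```python
# net_chaos.py — add/remove artificial latency with tc netem (Linux, root)
import subprocess

IFACE = "eth0"  # illustrative interface name

def add_latency(delay_ms: int = 300, jitter_ms: int = 50) -> None:
    # netem delays all egress traffic on the interface
    subprocess.run(
        ["tc", "qdisc", "add", "dev", IFACE, "root", "netem",
         "delay", f"{delay_ms}ms", f"{jitter_ms}ms"],
        check=True,
    )

def clear_latency() -> None:
    subprocess.run(
        ["tc", "qdisc", "del", "dev", IFACE, "root", "netem"],
        check=True,
    )

if __name__ == "__main__":
    add_latency()
    input("Latency active — run your experiment, then press Enter to clear...")
    clear_latency()
```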
## Working With Viprasol
We run chaos engineering engagements for client systems — designing experiments, building fault injection infrastructure, running game days with your team, and helping address discovered weaknesses. Resilience is an engineering investment that pays off in reduced incident frequency and faster recovery.
→ Talk to our cloud team about chaos engineering and resilience testing.
## See Also
- Observability and Monitoring — the prerequisite for meaningful chaos experiments
- DevOps Best Practices — CI/CD and incident response foundations
- Load Testing Tools — performance testing before chaos testing
- Kubernetes vs ECS — container orchestration resilience features
- Cloud Solutions — cloud architecture and reliability engineering
## About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.