
Chaos Engineering: Fault Injection, Chaos Monkey, and Resilience Testing in Production

Implement chaos engineering to build resilient systems — fault injection principles, Chaos Monkey setup, AWS Fault Injection Simulator, game days, and measuring results.

Viprasol Tech Team
May 6, 2026
12 min read

Chaos engineering is the practice of deliberately injecting failures into your system to discover weaknesses before they cause unplanned outages. The core premise: your system will fail. Chaos engineering lets you choose when, how, and with a team standing by — rather than discovering failure modes at 2am with customers affected.

Netflix coined the term and built Chaos Monkey. The principles apply to any system that needs to be resilient.


The Chaos Engineering Hypothesis Model

Every chaos experiment follows the same structure:

  1. Define steady state: What does normal look like? (p99 latency < 200ms, error rate < 0.1%)
  2. Hypothesize: "We believe the system will remain in steady state when X fails"
  3. Introduce failure: Terminate instances, inject latency, block network traffic
  4. Observe: Did the system stay in steady state? What broke?
  5. Fix and repeat: Address discovered weaknesses; re-run experiment

This is not random destruction. It's scientific experimentation to build confidence in your system's resilience.
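The five-step loop above can be sketched as a tiny harness. This is an illustrative sketch, not any particular tool's API: the `SteadyState` thresholds mirror the example in step 1, and the metric keys (`p99_ms`, `error_rate_pct`) are assumed names you would map to your own monitoring stack.

```python
# Minimal chaos-experiment harness; metric keys and thresholds are illustrative
from dataclasses import dataclass
from typing import Callable

@dataclass
class SteadyState:
    max_p99_ms: float = 200.0        # p99 latency < 200ms
    max_error_rate_pct: float = 0.1  # error rate < 0.1%

    def holds(self, metrics: dict) -> bool:
        return (metrics["p99_ms"] <= self.max_p99_ms
                and metrics["error_rate_pct"] <= self.max_error_rate_pct)

def run_experiment(inject_fault: Callable[[], None],
                   observe: Callable[[], dict],
                   steady_state: SteadyState) -> bool:
    """Return True if the hypothesis held: steady state survived the fault."""
    # Steps 1-2: confirm steady state before injecting anything
    if not steady_state.holds(observe()):
        raise RuntimeError("System unhealthy before experiment; aborting")
    # Step 3: introduce the failure
    inject_fault()
    # Step 4: did the system stay in steady state?
    # (Step 5: fix whatever this reveals, then re-run)
    return steady_state.holds(observe())
```

The abort guard matters in practice: injecting faults into an already-degraded system tells you nothing and compounds the damage.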


Start in Staging, Not Production

Much of the chaos engineering literature focuses on production experiments. For most teams starting out:

| Environment | When | What You Learn |
| --- | --- | --- |
| Local | Always | Basic fault tolerance |
| Staging | First | Architecture weaknesses |
| Production | After staging is solid | Real-world behavior under actual load |

You gain confidence to run production experiments by first validating that your system handles failures in staging. Don't start with production chaos on Day 1.



Simple Fault Injection: No Tools Required

Before installing Chaos Monkey, implement these manually:

```bash
# Kill a random container on ECS (simulates instance failure)
TASK_ARN=$(aws ecs list-tasks \
  --cluster production \
  --service-name api \
  --query 'taskArns[0]' \
  --output text)

aws ecs stop-task \
  --cluster production \
  --task $TASK_ARN \
  --reason "Chaos engineering experiment: testing ECS auto-recovery"

# Watch: does ECS spin up a replacement? How long does it take?
# Expected: new task running within 30-60 seconds

# Check if ALB health checks removed the failing task from rotation
aws elbv2 describe-target-health \
  --target-group-arn $TARGET_GROUP_ARN
```

```bash
# Simulate database connection exhaustion:
# open 100 idle connections to PostgreSQL, then observe API behavior
for i in {1..100}; do
  psql $DATABASE_URL -c "SELECT pg_sleep(300);" &
done

# Hypothesis: API should return 503 (not hang) when DB pool exhausted
# Expected: connection pool timeout returns error within 5s
# Actual: might hang for 30s without proper timeout configuration
```
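The hypothesis above (fail fast with a 503 rather than hang) usually comes down to a timeout around connection acquisition. A minimal sketch of the pattern, with the exhausted pool simulated by a sleep; `acquire_connection` and `handle_request` are hypothetical names, not any driver's real API:

```python
# Sketch: bound connection acquisition with a timeout so exhaustion -> fast 503
import asyncio

async def acquire_connection():
    # Stand-in for a pool checkout that never returns while the pool is exhausted
    await asyncio.sleep(30)

async def handle_request(timeout_s: float = 5.0) -> tuple[int, str]:
    """Return (status, body); 503 on pool timeout instead of hanging for 30s."""
    try:
        await asyncio.wait_for(acquire_connection(), timeout=timeout_s)
        return 200, "ok"
    except asyncio.TimeoutError:
        return 503, "database unavailable"
```

Most pool libraries expose this as a configuration knob (e.g. an acquire or checkout timeout); the chaos experiment verifies you actually set it.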

AWS Fault Injection Simulator (FIS)

AWS FIS provides managed fault injection for AWS infrastructure — terminate EC2 instances, throttle API calls, inject network latency, kill ECS tasks.

```hcl
# terraform/fis.tf — define reusable experiments

# IAM role for FIS to take actions
resource "aws_iam_role" "fis_role" {
  name = "fis-experiment-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "fis.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "fis_policy" {
  role = aws_iam_role.fis_role.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["ecs:StopTask", "ecs:ListTasks"]
        Resource = "*"
        Condition = {
          StringEquals = {
            "aws:ResourceTag/Environment" = "staging"
          }
        }
      },
      {
        Effect   = "Allow"
        Action   = ["ec2:DescribeInstances", "ec2:TerminateInstances"]
        Resource = "*"
        Condition = {
          StringEquals = {
            "aws:ResourceTag/ChaosTarget" = "true"
          }
        }
      }
    ]
  })
}

# Experiment: Terminate 50% of ECS tasks (simulates AZ failure)
resource "aws_fis_experiment_template" "ecs_task_termination" {
  description = "Terminate 50% of API service tasks to test auto-recovery"
  role_arn    = aws_iam_role.fis_role.arn

  action {
    name      = "TerminateTasks"
    action_id = "aws:ecs:stop-task"
    parameter {
      key   = "count"
      value = "50"  # percentage
    }
    target {
      key   = "Tasks"
      value = "ecsTaskTargets"
    }
  }

  target {
    name           = "ecsTaskTargets"
    resource_type  = "aws:ecs:task"
    selection_mode = "PERCENT(50)"
    resource_tag {
      key   = "aws:ecs:serviceName"
      value = "api-service"
    }
  }

  # Stop conditions — halt experiment if things get too bad
  stop_condition {
    source = "aws:cloudwatch:alarm"
    value  = aws_cloudwatch_metric_alarm.error_rate_critical.arn
  }

  tags = { Environment = "staging" }
}
```

```python
# run_experiment.py — start FIS experiment and monitor results
import boto3
import time

fis = boto3.client('fis', region_name='us-east-1')
cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

def run_chaos_experiment(template_id: str, duration_seconds: int = 300):
    print(f"Starting chaos experiment: {template_id}")

    # Capture baseline metrics
    baseline = get_metrics('before')
    peak_error_rate = baseline['error_rate']

    # Start experiment
    response = fis.start_experiment(
        experimentTemplateId=template_id,
        tags={'RunBy': 'chaos-team', 'Purpose': 'resilience-testing'},
    )
    experiment_id = response['experiment']['id']
    print(f"Experiment started: {experiment_id}")

    try:
        # Monitor during experiment, tracking the worst error rate seen
        for i in range(duration_seconds // 10):
            time.sleep(10)
            metrics = get_metrics('during')
            peak_error_rate = max(peak_error_rate, metrics['error_rate'])
            print(f"t+{(i+1)*10}s | Error rate: {metrics['error_rate']:.2f}% | p99: {metrics['p99_ms']:.0f}ms")

    finally:
        # Capture recovery metrics
        time.sleep(120)  # Wait for system to stabilize
        recovery = get_metrics('after')

        print("\n=== Experiment Results ===")
        print(f"Baseline error rate:  {baseline['error_rate']:.2f}%")
        print(f"Peak error rate:      {peak_error_rate:.2f}%")
        print("Recovery time:        ~2 minutes (observed)")
        print(f"Steady state resumed: {'Yes' if recovery['error_rate'] < 0.5 else 'No'}")

def get_metrics(phase: str) -> dict:
    # Query CloudWatch for current error rate and p99 latency
    # ... implementation (must return {'error_rate': float, 'p99_ms': float})
    ...
```


The Game Day

A game day is a structured chaos engineering exercise where the team deliberately causes failures and practices responding:

# Game Day Runbook: "What Happens When the Database Goes Down?"

## Date: 2026-05-15 | Time: 10:00 AM–12:00 PM | Team: Engineering + On-Call

Hypothesis

When the primary PostgreSQL instance becomes unavailable, the system will:

  1. Automatically fail over to the Multi-AZ standby instance within 60 seconds
  2. Return 503 errors (not 502 or timeouts) for write operations during failover
  3. Recover to normal operation within 90 seconds
  4. Generate a PagerDuty alert within 2 minutes

Setup

  • Notify stakeholders (outage window is intentional)
  • Confirm staging environment health
  • Set up monitoring dashboard (Datadog, CloudWatch)
  • Assign roles: Experiment operator, Observer, Scribe

Experiment Steps

  1. 10:00 — Capture baseline metrics
  2. 10:05 — Operator reboots primary RDS instance
  3. 10:05–10:10 — Team observes behavior; Scribe records timeline
  4. 10:10 — Did failover complete? How long did it take?
  5. 10:15 — Examine error logs during failover window
  6. 10:30 — Debrief: what worked, what didn't

Success Criteria

  • Failover completed in < 90 seconds
  • API returned 503 (not 504) during failover
  • PagerDuty alert fired within 2 minutes
  • No data loss

What to Capture

  • Exact timestamps of failure and recovery
  • Error messages from API during outage
  • Customer-visible impact (yes/no, duration)
  • Any surprises (things that broke unexpectedly)
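Exact failover timestamps can come straight from RDS events rather than stopwatch notes. A sketch that computes the failover duration from `aws rds describe-events`-style records; the message strings here are assumptions, so match them against what your engine actually emits:

```python
# Compute failover duration from RDS event records (message text is an assumption)
from datetime import datetime, timedelta  # timedelta is handy for building fixtures

def failover_duration_s(events: list[dict]) -> float:
    """Seconds between the 'failover started' and 'failover completed' events.
    `events` mirrors the shape of `aws rds describe-events` output:
    dicts with a 'Date' datetime and a 'Message' string."""
    start = next(e["Date"] for e in events if "failover started" in e["Message"].lower())
    end = next(e["Date"] for e in events if "failover completed" in e["Message"].lower())
    return (end - start).total_seconds()
```

Comparing this number against the 90-second success criterion turns the game day's pass/fail call into a lookup rather than a judgment.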

---

## Application-Level Fault Injection

Inject failures in your application code during testing:

```typescript
// lib/chaos.ts — fault injection for staging/testing
import type { FastifyRequest, FastifyReply } from 'fastify';

const CHAOS_ENABLED = process.env.CHAOS_ENABLED === 'true';

export function chaosMiddleware() {
  // Async Fastify hooks return a promise; they must not also call done()
  return async (request: FastifyRequest, reply: FastifyReply) => {
    if (!CHAOS_ENABLED) return;

    const roll = Math.random();

    // 5% of requests: inject 2-second latency
    if (roll < 0.05) {
      await new Promise(resolve => setTimeout(resolve, 2000));
    }
    // 2% of requests: return 500 error
    else if (roll < 0.07) {
      return reply.code(500).send({ error: 'Chaos engineering: injected failure' });
    }
    // 1% of requests: return 503 (service unavailable)
    else if (roll < 0.08) {
      return reply.code(503).send({ error: 'Chaos engineering: service unavailable' });
    }
  };
}

// Database chaos: randomly fail queries
export async function chaosQuery<T>(queryFn: () => Promise<T>): Promise<T> {
  if (CHAOS_ENABLED && Math.random() < 0.02) {
    throw new Error('Chaos engineering: simulated database failure');
  }
  return queryFn();
}
```

What to Experiment On First

Prioritize experiments that test your most critical resilience mechanisms:

| Experiment | Hypothesis | Tool |
| --- | --- | --- |
| Kill one ECS task | Service continues with remaining tasks | AWS FIS |
| Kill all tasks in one AZ | Multi-AZ routing works correctly | AWS FIS |
| RDS failover | Standby promotes in < 90s | RDS reboot with failover |
| Redis unavailable | App degrades gracefully (cache miss, not crash) | Stop Redis container |
| Slow downstream API | Circuit breaker opens; returns cached/default response | tc (traffic control) |
| CPU spike | Horizontal scaling triggers correctly | stress-ng |
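For the CPU-spike row, stress-ng is the standard tool (e.g. `stress-ng --cpu 4 --timeout 120s`). Where it isn't installed, a rough Python stand-in can pin cores; this is a sketch for staging boxes, POSIX-only because it forks worker processes, and no substitute for stress-ng's richer load profiles:

```python
# Poor man's stress-ng: saturate `workers` cores for `seconds` (illustrative)
import multiprocessing
import time

def _burn(stop_at: float) -> None:
    # Busy-loop until the deadline to pin one core at ~100%
    while time.monotonic() < stop_at:
        pass

def cpu_spike(workers: int, seconds: float) -> float:
    """Run busy-loop workers in parallel processes; returns elapsed wall time."""
    ctx = multiprocessing.get_context("fork")  # fork start method: POSIX only
    start = time.monotonic()
    stop_at = start + seconds
    procs = [ctx.Process(target=_burn, args=(stop_at,)) for _ in range(workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return time.monotonic() - start
```

While it runs, watch whether your autoscaling policy actually adds capacity and how long the scale-out takes; that lag is the number the experiment exists to measure.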

Working With Viprasol

We run chaos engineering engagements for client systems — designing experiments, building fault injection infrastructure, running game days with your team, and helping address discovered weaknesses. Resilience is an engineering investment that pays off in reduced incident frequency and faster recovery.

Talk to our cloud team about chaos engineering and resilience testing.

