AWS Step Functions: State Machines, Error Handling, Parallel Execution, and Lambda Orchestration
Build production AWS Step Functions workflows: state machine design, Lambda orchestration, error handling with retry/catch, parallel execution, Map state for batch processing, and Terraform IaC.
Complex multi-step serverless workflows — order processing, document pipelines, ML training jobs — are hard to build reliably with Lambda alone. You end up passing state through SQS queues, tracking job progress in DynamoDB, and hand-rolling retry logic. AWS Step Functions provides a managed state machine that handles sequencing, parallel execution, retry with backoff, error catching, and long-running workflows (up to a year) without you managing any of that infrastructure.
This post covers real Step Functions patterns: Express vs Standard workflows, sequential and parallel states, retry/catch configuration, Map state for batch processing, and Terraform IaC for the whole thing.
Express vs Standard Workflows
| | Standard | Express |
|---|---|---|
| Max duration | 1 year | 5 minutes |
| Execution model | Exactly-once | At-least-once (asynchronous) or at-most-once (synchronous) |
| Pricing | $0.025/1K state transitions | $1/M executions + duration |
| Use for | Business processes, long-running | High-volume, short-lived |
| Audit history | Full (90-day) | CloudWatch only |
Use Standard for order processing, user provisioning, and document pipelines. Use Express for real-time event processing, API orchestration, and streaming data.
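The comparison above can be turned into a small decision helper. This is only a sketch of the rule of thumb in the table, not an official selection algorithm; the `WorkflowRequirements` shape and the function name are assumptions made for illustration.

```typescript
type WorkflowType = 'STANDARD' | 'EXPRESS';

// Hypothetical requirements object, invented for this example.
interface WorkflowRequirements {
  maxDurationSeconds: number; // longest expected run of a single execution
  needsExactlyOnce: boolean;  // true when steps are not safe to run twice
}

// Express duration limit taken from the comparison table (5 minutes).
const EXPRESS_MAX_DURATION_SECONDS = 5 * 60;

function chooseWorkflowType(req: WorkflowRequirements): WorkflowType {
  // Anything that can outlive the Express limit has to be Standard.
  if (req.maxDurationSeconds > EXPRESS_MAX_DURATION_SECONDS) return 'STANDARD';
  // Non-idempotent steps want the exactly-once execution model.
  if (req.needsExactlyOnce) return 'STANDARD';
  return 'EXPRESS';
}
```

Cost and audit-history needs would also feed into a real decision; they are left out here to keep the rule readable.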
1. State Machine Definition (ASL)
// infrastructure/step-functions/order-processing.asl.json
{
"Comment": "Order processing workflow with payment, inventory, and fulfillment",
"StartAt": "ValidateOrder",
"States": {
"ValidateOrder": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": {
"FunctionName": "${ValidateOrderFunctionArn}",
"Payload.$": "$"
},
"ResultPath": "$.validation",
"Retry": [
{
"ErrorEquals": ["Lambda.ServiceException", "Lambda.AWSLambdaException"],
"IntervalSeconds": 2,
"MaxAttempts": 3,
"BackoffRate": 2.0,
"JitterStrategy": "FULL"
}
],
"Catch": [
{
"ErrorEquals": ["ValidationError"],
"ResultPath": "$.error",
"Next": "OrderFailed"
},
{
"ErrorEquals": ["States.ALL"],
"ResultPath": "$.error",
"Next": "OrderFailed"
}
],
"Next": "CheckInventoryAndCharge"
},
"CheckInventoryAndCharge": {
"Type": "Parallel",
"Branches": [
{
"StartAt": "CheckInventory",
"States": {
"CheckInventory": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": {
"FunctionName": "${CheckInventoryFunctionArn}",
"Payload.$": "$"
},
"ResultPath": "$.inventory",
"Retry": [
{
"ErrorEquals": ["States.TaskFailed"],
"IntervalSeconds": 1,
"MaxAttempts": 2,
"BackoffRate": 2.0
}
],
"End": true
}
}
},
{
"StartAt": "ChargePayment",
"States": {
"ChargePayment": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": {
"FunctionName": "${ChargePaymentFunctionArn}",
"Payload.$": "$"
},
"ResultPath": "$.payment",
"Retry": [
{
"ErrorEquals": ["Lambda.ServiceException"],
"IntervalSeconds": 2,
"MaxAttempts": 3,
"BackoffRate": 2.0
}
],
"Catch": [
{
"ErrorEquals": ["PaymentDeclinedError"],
"ResultPath": "$.error",
"Next": "HandleDeclinedPayment"
}
],
"End": true
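},
"HandleDeclinedPayment": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": {
"FunctionName": "${NotifyPaymentDeclinedFunctionArn}",
"Payload.$": "$"
},
"Next": "PaymentDeclined"
},
"PaymentDeclined": {
"Type": "Fail",
"Error": "PaymentDeclinedError",
"Cause": "Payment was declined; failing this branch fails the Parallel state so the order is not fulfilled"
}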
}
}
],
"ResultPath": "$.parallelResults",
"Catch": [
{
"ErrorEquals": ["States.ALL"],
"ResultPath": "$.error",
"Next": "CompensatingTransaction"
}
],
"Next": "FulfillOrder"
},
"FulfillOrder": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": {
"FunctionName": "${FulfillOrderFunctionArn}",
"Payload.$": "$"
},
"ResultPath": "$.fulfillment",
"Next": "SendConfirmation"
},
"SendConfirmation": {
"Type": "Task",
"Resource": "arn:aws:states:::sns:publish",
"Parameters": {
"TopicArn": "${OrderConfirmationTopicArn}",
"Message.$": "States.JsonToString($.fulfillment)"
},
"Next": "OrderComplete"
},
"OrderComplete": {
"Type": "Succeed"
},
"CompensatingTransaction": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": {
"FunctionName": "${RollbackOrderFunctionArn}",
"Payload.$": "$"
},
"Next": "OrderFailed"
},
"OrderFailed": {
"Type": "Fail",
"Error": "OrderProcessingFailed",
"Cause": "Order could not be processed"
}
}
}
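To make the Retry blocks in the definition above easier to reason about, here is a rough sketch of how `IntervalSeconds`, `MaxAttempts`, and `BackoffRate` combine into a wait schedule. The function names are invented for this example, and the jitter helper assumes that `"JitterStrategy": "FULL"` picks a random wait between zero and the computed delay.

```typescript
interface RetryPolicy {
  IntervalSeconds: number;
  MaxAttempts: number;
  BackoffRate: number;
}

// Returns the wait before each retry attempt, in seconds, without jitter.
function retryDelays(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < policy.MaxAttempts; attempt++) {
    // The first retry waits IntervalSeconds; each later one is multiplied by BackoffRate.
    delays.push(policy.IntervalSeconds * Math.pow(policy.BackoffRate, attempt));
  }
  return delays;
}

// Assumed model of full jitter: a uniform random wait in [0, delaySeconds).
function withFullJitter(delaySeconds: number, random: () => number = Math.random): number {
  return random() * delaySeconds;
}

// The ValidateOrder retry policy from the definition above.
const validateOrderRetry: RetryPolicy = { IntervalSeconds: 2, MaxAttempts: 3, BackoffRate: 2.0 };
console.log(retryDelays(validateOrderRetry)); // [ 2, 4, 8 ]
```

With these values a persistently failing ValidateOrder call waits at most 14 seconds in total across its three retries before the Catch block takes over.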
2. Lambda Handler Patterns
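// src/functions/order/validate-order.ts
// Minimal Address shape assumed for this example; only `country` is checked below
interface Address {
country: string;
}
interface OrderInput {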
orderId: string;
userId: string;
items: Array<{ productId: string; quantity: number }>;
shippingAddress: Address;
}
interface ValidationResult {
isValid: boolean;
errors: string[];
}
export const handler = async (event: OrderInput): Promise<ValidationResult> => {
const errors: string[] = [];
if (!event.orderId) errors.push('Missing orderId');
if (!event.items?.length) errors.push('Order must have at least one item');
if (!event.shippingAddress?.country) errors.push('Invalid shipping address');
if (errors.length > 0) {
// Throw named error — Step Functions Catch can match on this
const err = new Error(errors.join('; '));
err.name = 'ValidationError';
throw err;
}
return { isValid: true, errors: [] };
};
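The handler above sets `err.name` so that a Catch block can route on it. As a simplified sketch of that routing, the function below walks a list of catchers in order and returns the `Next` state of the first match. It is a model for illustration only: it treats `States.ALL` as matching everything and ignores the other reserved `States.*` names, and `exampleCatchers` is a hypothetical list rather than one copied from the workflow.

```typescript
interface Catcher {
  ErrorEquals: string[];
  Next: string;
}

// Returns the Next state of the first catcher that matches, or undefined if none does.
function resolveCatch(errorName: string, catchers: Catcher[]): string | undefined {
  for (const catcher of catchers) {
    const matchesByName = catcher.ErrorEquals.includes(errorName);
    const matchesAll = catcher.ErrorEquals.includes('States.ALL'); // simplified wildcard
    if (matchesByName || matchesAll) return catcher.Next;
  }
  return undefined;
}

// Hypothetical catcher list: a specific error first, then a catch-all.
const exampleCatchers: Catcher[] = [
  { ErrorEquals: ['PaymentDeclinedError'], Next: 'HandleDeclinedPayment' },
  { ErrorEquals: ['States.ALL'], Next: 'OrderFailed' },
];

console.log(resolveCatch('PaymentDeclinedError', exampleCatchers)); // HandleDeclinedPayment
console.log(resolveCatch('TypeError', exampleCatchers));            // OrderFailed
```

Order matters: placing the `States.ALL` catcher first would shadow every specific catcher after it.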
// src/functions/order/charge-payment.ts
// OrderInput and ValidationResult are the types from validate-order.ts; calculateOrderTotal and
// getStripeCustomerId are project helpers that are not shown in this post.
export const handler = async (event: OrderInput & { validation: ValidationResult }) => {
const { stripe } = await import('../../lib/stripe');
const amount = await calculateOrderTotal(event.items);
try {
const paymentIntent = await stripe.paymentIntents.create(
{
amount,
currency: 'usd',
customer: await getStripeCustomerId(event.userId),
confirm: true,
automatic_payment_methods: { enabled: true, allow_redirects: 'never' },
metadata: { orderId: event.orderId },
},
// Request options (second argument): the idempotency key is critical for retry safety
{ idempotencyKey: `charge-${event.orderId}` }
);
return {
paymentIntentId: paymentIntent.id,
amount,
status: paymentIntent.status,
};
} catch (err: any) {
if (err.code === 'card_declined') {
const declinedError = new Error('Payment declined');
declinedError.name = 'PaymentDeclinedError';
throw declinedError;
}
throw err; // Rethrow everything else; Step Functions retries only errors matched by the state's Retry block
}
};
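The idempotency key in the handler above matters because Step Functions may invoke the function again after a retryable failure. The sketch below shows the idea with an in-memory stand-in; it is not how Stripe implements it, and `InMemoryChargeStore` is a hypothetical class made up for this example.

```typescript
interface ChargeResult {
  chargeId: string;
  amount: number;
}

// Hypothetical stand-in for a payment provider that honours idempotency keys.
class InMemoryChargeStore {
  private readonly results = new Map<string, ChargeResult>();
  private nextId = 1;

  // Performs the charge once per key; repeat calls with the same key return the stored result.
  charge(idempotencyKey: string, amount: number): ChargeResult {
    const existing = this.results.get(idempotencyKey);
    if (existing) return existing;
    const result: ChargeResult = { chargeId: `ch_${this.nextId++}`, amount };
    this.results.set(idempotencyKey, result);
    return result;
  }

  // Number of distinct charges actually created.
  get chargeCount(): number {
    return this.results.size;
  }
}

const store = new InMemoryChargeStore();
const firstAttempt = store.charge('charge-order-1', 500);
const retriedAttempt = store.charge('charge-order-1', 500); // simulated Step Functions retry
console.log(firstAttempt.chargeId === retriedAttempt.chargeId, store.chargeCount); // true 1
```

Deriving the key from the order ID, as the handler does, means every retry of the same order maps to the same charge.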
3. Map State for Batch Processing
// Process each item in a batch concurrently (at most MaxConcurrency iterations at a time)
"ProcessLineItems": {
"Type": "Map",
"ItemsPath": "$.items",
"ItemSelector": {
"item.$": "$$.Map.Item.Value",
"orderId.$": "$.orderId",
"index.$": "$$.Map.Item.Index"
},
"MaxConcurrency": 10,
"ItemProcessor": {
"ProcessorConfig": { "Mode": "INLINE" },
"StartAt": "ProcessItem",
"States": {
"ProcessItem": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": {
"FunctionName": "${ProcessLineItemFunctionArn}",
"Payload.$": "$"
},
"Retry": [
{
"ErrorEquals": ["States.ALL"],
"IntervalSeconds": 1,
"MaxAttempts": 3,
"BackoffRate": 2.0
}
],
"End": true
}
}
},
"ResultPath": "$.processedItems",
"Next": "AggregateResults"
}
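It can help to see what each iteration of the Map state actually receives. The sketch below mirrors the `ItemSelector` above in plain TypeScript: one input object per array element, carrying the element, its index, and the order ID from the state input. The function and type names are assumptions for illustration.

```typescript
interface LineItem {
  productId: string;
  quantity: number;
}

interface IterationInput {
  item: LineItem;  // $$.Map.Item.Value
  orderId: string; // $.orderId from the Map state's input
  index: number;   // $$.Map.Item.Index
}

// Builds the input each iteration would receive, given the array selected by ItemsPath.
function buildIterationInputs(orderId: string, items: LineItem[]): IterationInput[] {
  return items.map((item, index) => ({ item, orderId, index }));
}

const iterationInputs = buildIterationInputs('order-42', [
  { productId: 'sku-1', quantity: 2 },
  { productId: 'sku-2', quantity: 1 },
]);
console.log(iterationInputs[1]);
// { item: { productId: 'sku-2', quantity: 1 }, orderId: 'order-42', index: 1 }
```

The Map state then collects the iteration outputs into an array in the same order as the input items, which is what ends up under `$.processedItems`.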
4. Terraform Configuration
# infrastructure/step-functions/main.tf
# Load ASL definition from file and substitute Lambda ARNs
locals {
asl = templatefile(
"${path.module}/order-processing.asl.json",
{
ValidateOrderFunctionArn = aws_lambda_function.validate_order.arn
CheckInventoryFunctionArn = aws_lambda_function.check_inventory.arn
ChargePaymentFunctionArn = aws_lambda_function.charge_payment.arn
FulfillOrderFunctionArn = aws_lambda_function.fulfill_order.arn
NotifyPaymentDeclinedFunctionArn = aws_lambda_function.notify_declined.arn
RollbackOrderFunctionArn = aws_lambda_function.rollback_order.arn
OrderConfirmationTopicArn = aws_sns_topic.order_confirmation.arn
ProcessLineItemFunctionArn = aws_lambda_function.process_line_item.arn
}
)
}
resource "aws_sfn_state_machine" "order_processing" {
name = "${var.project}-order-processing"
role_arn = aws_iam_role.step_functions.arn
definition = local.asl
type = "STANDARD" # Use STANDARD for order processing (exactly-once)
logging_configuration {
log_destination = "${aws_cloudwatch_log_group.sfn_logs.arn}:*"
include_execution_data = true
level = "ERROR" # ALL in dev, ERROR in prod
}
tracing_configuration {
enabled = true # X-Ray tracing
}
tags = {
Environment = var.environment
Project = var.project
}
}
# CloudWatch log group
resource "aws_cloudwatch_log_group" "sfn_logs" {
name = "/aws/states/${var.project}-order-processing"
retention_in_days = 30
}
# IAM role for Step Functions to invoke Lambda + SNS
resource "aws_iam_role" "step_functions" {
name = "${var.project}-step-functions-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = { Service = "states.amazonaws.com" }
Action = "sts:AssumeRole"
}]
})
}
resource "aws_iam_role_policy" "step_functions_policy" {
name = "${var.project}-sfn-policy"
role = aws_iam_role.step_functions.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = ["lambda:InvokeFunction"]
Resource = [
aws_lambda_function.validate_order.arn,
aws_lambda_function.check_inventory.arn,
aws_lambda_function.charge_payment.arn,
aws_lambda_function.fulfill_order.arn,
aws_lambda_function.notify_declined.arn,
aws_lambda_function.rollback_order.arn,
aws_lambda_function.process_line_item.arn,
]
},
{
Effect = "Allow"
Action = ["sns:Publish"]
Resource = [aws_sns_topic.order_confirmation.arn]
},
{
Effect = "Allow"
Action = [
"logs:CreateLogDelivery",
"logs:CreateLogStream",
"logs:PutLogEvents",
"logs:GetLogDelivery",
"logs:UpdateLogDelivery",
"logs:DeleteLogDelivery",
"logs:ListLogDeliveries",
"logs:PutResourcePolicy",
"logs:DescribeResourcePolicies",
"logs:DescribeLogGroups",
]
Resource = "*"
},
{
Effect = "Allow"
Action = ["xray:PutTraceSegments", "xray:PutTelemetryRecords"]
Resource = "*"
}
]
})
}
# Expose the state machine ARN so callers (for example a Lambda trigger) can start executions
output "state_machine_arn" {
value = aws_sfn_state_machine.order_processing.arn
}
5. Starting Executions from TypeScript
// src/lib/workflows/order.ts
import { SFNClient, StartExecutionCommand } from '@aws-sdk/client-sfn';
const sfn = new SFNClient({ region: process.env.AWS_REGION });
export async function startOrderProcessing(order: OrderInput): Promise<string> {
const command = new StartExecutionCommand({
stateMachineArn: process.env.ORDER_STATE_MACHINE_ARN!,
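// Execution names must be unique per state machine for 90 days (Standard workflows).
// Reusing a name with the same input while that execution is still running returns the same execution;
// otherwise StartExecution fails with ExecutionAlreadyExists, which callers should catch.
name: `order-${order.orderId}`,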
input: JSON.stringify(order),
});
const result = await sfn.send(command);
return result.executionArn!;
}
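Because the execution name is built from the order ID, it is worth making sure the result is a valid name. Execution names are limited to 80 characters and reject whitespace and many special characters. The helper below is a conservative sketch that keeps only letters, digits, hyphens, and underscores; `toExecutionName` is a hypothetical name, not part of the AWS SDK.

```typescript
const MAX_EXECUTION_NAME_LENGTH = 80;

// Builds an execution name from an order ID, replacing anything outside a conservative safe set.
function toExecutionName(orderId: string): string {
  const safeId = orderId.replace(/[^A-Za-z0-9_-]/g, '-');
  // Truncate so the final name never exceeds the 80-character limit.
  return `order-${safeId}`.slice(0, MAX_EXECUTION_NAME_LENGTH);
}

console.log(toExecutionName('A 12/34')); // order-A-12-34
```

Replacing characters and truncating can make two different order IDs collide, so this only suits IDs that are already short and mostly alphanumeric.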
// Check execution status
import { DescribeExecutionCommand } from '@aws-sdk/client-sfn';
export async function getExecutionStatus(executionArn: string) {
const result = await sfn.send(
new DescribeExecutionCommand({ executionArn })
);
return {
status: result.status, // RUNNING | SUCCEEDED | FAILED | TIMED_OUT | ABORTED
startedAt: result.startDate,
completedAt: result.stopDate,
output: result.output ? JSON.parse(result.output) : null,
error: result.error,
cause: result.cause,
};
}
Cost Reference
| Workflow type | Scale | Monthly cost | Notes |
|---|---|---|---|
| Standard | 10K executions, 10 states each | ~$2.50 | $0.025/1K transitions |
| Standard | 1M executions, 10 states | ~$250 | Consider Express for high volume |
| Express | 100M executions, 5s each | ~$350 | $1/M + $0.00001667/GB-second |
| Express | 1B events/month | ~$1,200 | Compare to SQS+Lambda DIY |
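The Standard rows in the table follow directly from the per-transition price. As a quick sanity check, the sketch below reproduces them, assuming one state transition per state and ignoring any free tier; the function name is invented for this example.

```typescript
// Standard workflow rate from the table: $0.025 per 1,000 state transitions.
const STANDARD_PRICE_PER_1K_TRANSITIONS_USD = 0.025;

// Estimates the monthly Standard workflow cost in USD.
function estimateStandardMonthlyCost(executions: number, transitionsPerExecution: number): number {
  const transitions = executions * transitionsPerExecution;
  return (transitions / 1000) * STANDARD_PRICE_PER_1K_TRANSITIONS_USD;
}

console.log(estimateStandardMonthlyCost(10000, 10).toFixed(2));   // 2.50
console.log(estimateStandardMonthlyCost(1000000, 10).toFixed(2)); // 250.00
```

Retries count as extra transitions, so a workflow that retries often will cost more than this estimate. Express pricing also depends on memory and duration tiers and is not modelled here.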
See Also
- AWS Lambda Layers: Shared Dependencies and Custom Runtimes
- AWS ECS Fargate in Production: Task Definitions and Blue/Green Deploys
- Kubernetes Cost Optimization: Right-Sizing, Spot Nodes, and Autoscaling
- Event-Driven Architecture: EventBridge, SNS, and SQS Patterns
- Terraform Modules: Reusable Infrastructure and Remote State
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.