Event-Driven Microservices: Kafka Patterns, Saga Orchestration, and Idempotency
Build event-driven microservices that work in production: design Kafka topic schemas with Avro, implement the saga pattern for distributed transactions, enforce idempotency to handle duplicate events, and handle consumer group rebalancing without data loss.
Event-Driven Microservices: Architecture and Patterns (2026)
Our first microservices system was a disaster. Each service had its own database. When order service needed to update inventory, it called inventory service directly. When inventory service failed, orders broke. When we added the billing service, everything got slower.
Then we learned about events. Instead of services talking to each other, they published what happened. Order placed? Publish an event. Inventory and billing services listened for that event. Suddenly failures were isolated and systems were fast.
That was five years ago. At Viprasol, event-driven architecture is how we build systems that scale. I'm going to share what we've learned about making this work.
Why Event-Driven Architecture
Most teams start with direct service-to-service calls. This is synchronous communication. Service A calls Service B which calls Service C. When any service is slow or broken, everything breaks.
Event-driven is different. Services publish events about what happened. Other services subscribe to those events. No direct dependencies.
Synchronous problems we solve:
- Cascading failures: One slow service slows everything
- Tight coupling: Services know about each other
- Hard to scale: Difficult to add new services without changing existing ones
- Poor separation of concerns: Order service has to know about inventory
Event-driven benefits:
- Loose coupling: Services don't know about each other
- Failure isolation: One service failing doesn't break others
- Natural scaling: Easy to add new consumers for an event
- Better separation: Each service owns its data and logic
- Asynchronous processing: Don't wait for everything to complete
The tradeoff: complexity. Event-driven systems are harder to reason about, harder to debug, and require more infrastructure.
Core Concepts
Let me define the terms clearly:
Event: Something that happened. "Order placed", "Payment processed", "Inventory reserved". Events are immutable facts about the past.
Event stream/topic: Where events are published and consumed. Usually implemented as a message queue (Kafka, Pub/Sub, SQS).
Producer: Service that publishes events. Order service publishes "Order placed" events.
Consumer: Service that subscribes to events. Inventory service consumes "Order placed" events.
Event store: Optionally, persist all events for debugging, auditing, and replay.
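The relationships between these pieces can be shown with a tiny in-memory bus — a toy stand-in for a real broker like Kafka, not production code:

```python
from collections import defaultdict

class EventBus:
    """Toy in-memory stand-in for a broker: topics map to subscriber handlers."""
    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of handlers

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # The producer knows nothing about who consumes the event.
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
reserved, notified = [], []
bus.subscribe("order.placed", lambda e: reserved.append(e["orderId"]))  # inventory service
bus.subscribe("order.placed", lambda e: notified.append(e["orderId"]))  # notification service
bus.publish("order.placed", {"orderId": "ord_999"})                     # order service (producer)
assert reserved == ["ord_999"] and notified == ["ord_999"]
```

The point of the sketch: adding a third subscriber requires no change to the producer — that is the loose coupling the list above describes.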
Message Queue Technology
Your technology choice impacts everything. Here are the main options:
| Technology | Throughput | Latency | Durability | Complexity | Best For |
|---|---|---|---|---|---|
| Kafka | Very high | Low | Very high | High | High-scale systems, event sourcing |
| RabbitMQ | High | Low | High | Medium | Traditional messaging, reliable delivery |
| AWS SQS | Moderate | Variable | High | Low | AWS-native systems, simplicity |
| Google Pub/Sub | High | Low | High | Low | GCP systems, global scale |
| Apache Pulsar | Very high | Low | Very high | Medium | High-scale, multi-tenant |
For most systems at scale, we use Kafka. It handles millions of events/second and has excellent tooling.
For simpler setups or cloud-first companies, Pub/Sub or SQS work well.
Designing Event Schemas
An event schema defines what information an event contains. Get this right and your system works. Get it wrong and you'll be fixing compatibility issues forever.
Good event schema:
- Versioned: Include version number so you can evolve it
- Self-contained: All relevant information included (don't require looking up other data)
- Backward compatible: New consumers can work with old events
- Documented: What does each field mean?
- Identified: Unique event ID for tracking and deduplication
Example order event:
```json
{
  "eventType": "order.placed",
  "eventVersion": 1,
  "eventId": "evt_12345abc",
  "timestamp": "2026-03-07T08:30:00Z",
  "orderId": "ord_999",
  "userId": "usr_42",
  "items": [
    {"sku": "WIDGET-001", "quantity": 2, "price": 19.99}
  ],
  "totalAmount": 39.98,
  "currency": "USD"
}
```
Notice: version number, unique ID, timestamp, and self-contained data.
Evolution example: Next year, you want to track whether user has loyalty status.
Old schema:
```json
{ "eventType": "order.placed", "eventVersion": 1, ... }
```
New schema:
```json
{ "eventType": "order.placed", "eventVersion": 2, "isLoyaltyMember": true, ... }
```
Consumers must handle both versions. Version 1 events missing isLoyaltyMember? Set default to false.
Use a schema registry (Confluent Schema Registry, or Pub/Sub schemas on GCP) to enforce this.
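A consumer that tolerates both versions can normalize events on read. A minimal sketch, using the field names from the example above:

```python
def parse_order_placed(raw: dict) -> dict:
    """Normalize v1 and v2 order.placed events into one shape."""
    version = raw.get("eventVersion", 1)
    event = dict(raw)
    if version < 2:
        # v1 events predate the loyalty field; default it to False.
        event["isLoyaltyMember"] = False
    return event

v1 = {"eventType": "order.placed", "eventVersion": 1, "orderId": "ord_1"}
v2 = {"eventType": "order.placed", "eventVersion": 2, "orderId": "ord_2",
      "isLoyaltyMember": True}
assert parse_order_placed(v1)["isLoyaltyMember"] is False
assert parse_order_placed(v2)["isLoyaltyMember"] is True
```

Doing the normalization in one place keeps version logic out of the business code downstream.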

Choreography vs. Orchestration
When you have complex workflows, you need to coordinate services. There are two patterns.
Choreography: Services react to events and emit new events.
Order placed → Inventory service reserves stock → Payment service processes payment → Notification service sends confirmation.
Each service knows what event triggers it. It doesn't know about other services.
Pros: decoupled, simple to start.
Cons: hard to follow the flow, hard to add error handling.
Orchestration: A coordinator (often called a saga) coordinates the workflow.
Saga receives "Order placed" event. It tells Inventory service to reserve stock. Once confirmed, it tells Payment service to charge. Once confirmed, it tells Notification service to send confirmation.
Pros: clear flow, easy error handling.
Cons: the coordinator is a new service that couples everything.
For most systems, we use a hybrid. Simple flows use choreography. Complex flows (payments, multi-step workflows) use orchestration.
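The choreography flow described above can be sketched as a chain of handlers, each reacting to one event and emitting the next. Service and topic names here are illustrative:

```python
# Each service reacts to one event and emits the next; none calls another directly.
def inventory_service(event, emit):
    emit("inventory.reserved", {"orderId": event["orderId"]})

def payment_service(event, emit):
    emit("payment.captured", {"orderId": event["orderId"]})

def run_choreography(initial_topic, initial_event):
    handlers = {
        "order.placed": inventory_service,
        "inventory.reserved": payment_service,
    }
    log = []  # record every topic the flow passes through
    def emit(topic, event):
        log.append(topic)
        if topic in handlers:
            handlers[topic](event, emit)
    emit(initial_topic, initial_event)
    return log

assert run_choreography("order.placed", {"orderId": "ord_1"}) == [
    "order.placed", "inventory.reserved", "payment.captured"
]
```

Notice that the full flow is only visible by reading every handler — exactly the "hard to follow" drawback listed above.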
Handling Failures and Retries
In event-driven systems, failures are inevitable. Networks fail. Services crash. What do you do?
At-least-once delivery: Message will be delivered at least once, possibly multiple times. Consumers must be idempotent.
Exactly-once delivery: Message delivered exactly once. Hard to achieve, slower, usually unnecessary.
Dead letter queues: Messages that fail repeatedly go to a separate queue for manual handling.
Our standard pattern:
- Consumer receives event
- Processes it idempotently (OK to process same event twice)
- Stores result
- Acknowledges to queue
- If processing fails, don't acknowledge
- Queue re-delivers after timeout
- After N retries, send to dead letter queue
Example idempotent processing:
```python
def process_order_placed(event):
    # First, check if we already processed this event
    if event_processed(event.eventId):
        return  # Already processed, skip

    # Process the event
    reserve_inventory(event.orderId, event.items)

    # Mark as processed. Ideally do this and the side effect in one database
    # transaction: a crash between the two re-creates the duplicate.
    mark_event_processed(event.eventId)
```
You must track which events you've processed to avoid duplicate processing.
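The retry-then-dead-letter flow from the standard pattern above can be sketched broker-agnostically. This is a simplification with hypothetical names — real brokers handle redelivery and retry counting for you:

```python
MAX_RETRIES = 3

def consume_with_dlq(event, process, dead_letter_queue):
    """Try to process an event; after MAX_RETRIES failures, dead-letter it."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            process(event)
            return "acked"            # success: acknowledge to the broker
        except Exception:
            continue                  # a broker would redeliver after a timeout
    dead_letter_queue.append(event)   # give up: park it for manual handling
    return "dead-lettered"

dlq = []
attempts = {"n": 0}

def flaky(event):
    """Fails twice, then succeeds — a transient error."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient failure")

assert consume_with_dlq({"id": 1}, flaky, dlq) == "acked"
assert consume_with_dlq({"id": 2}, lambda e: 1 / 0, dlq) == "dead-lettered"
assert dlq == [{"id": 2}]
```

The transient failure recovers on retry; the permanent one lands in the dead letter queue, which is why monitoring DLQ depth matters.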
Data Consistency Across Services
The hard problem: how do you keep data consistent when services can't directly access each other's databases?
Monolithic applications have ACID transactions. Microservices don't. You have eventual consistency.
Pattern: Saga transactions
A saga is a sequence of local transactions across multiple services. If one fails, you compensate (rollback) previous steps.
Example: Order saga
- Order service: Create order (PENDING)
- Inventory service: Reserve stock
- Payment service: Charge customer
- If payment fails: Inventory service: Release stock
- Order service: Mark as FAILED
Each step is a local transaction. If any step fails, previous steps are undone.
This is harder than ACID transactions but necessary for distributed systems.
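A compensating saga like the order example above can be sketched generically — each step pairs an action with its compensation, and failures unwind in reverse. The step names are hypothetical:

```python
def run_saga(steps, context):
    """Run each step's action; on failure, run compensations in reverse order."""
    completed = []
    for action, compensate in steps:
        try:
            action(context)
            completed.append(compensate)
        except Exception:
            for undo in reversed(completed):
                undo(context)  # compensate the already-completed steps
            return "FAILED"
    return "COMPLETED"

ctx = {"log": []}

def create_order(c):    c["log"].append("order created")
def cancel_order(c):    c["log"].append("order marked FAILED")
def reserve_stock(c):   c["log"].append("stock reserved")
def release_stock(c):   c["log"].append("stock released")
def charge_customer(c): raise RuntimeError("card declined")

steps = [(create_order, cancel_order),
         (reserve_stock, release_stock),
         (charge_customer, lambda c: None)]
assert run_saga(steps, ctx) == "FAILED"
assert ctx["log"] == ["order created", "stock reserved",
                      "stock released", "order marked FAILED"]
```

When the charge fails, the stock is released and the order is marked FAILED — the compensations from the bullet list above, run newest-first.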
Monitoring Event-Driven Systems
Debugging event-driven systems is harder. A user places an order. Who should process it? Is it processed? Did it fail?
We monitor:
Event metrics:
- Events per second (throughput)
- Event lag (how far behind are consumers)
- Events in dead letter queue
- Error rate per event type
Service metrics:
- Consumer lag: How far behind is this service?
- Processing latency: How long does processing take?
- Error rate: What percentage fails?
End-to-end metrics:
- How long from event published to fully processed?
- Which services are slowing down the overall flow?
Tools we use: Prometheus for metrics, Datadog for dashboards, ELK stack for logs.
We also implement distributed tracing. Each event has a trace ID that follows it through the system. You can see: event published โ consumed by service A โ processed โ emitted new event โ consumed by service B.
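Trace propagation can be as simple as copying the trace ID from the causing event into each event it triggers. A minimal sketch (production systems would use OpenTelemetry context propagation instead):

```python
import uuid

def new_event(event_type, payload, parent=None):
    """Create an event; downstream events inherit the trace ID of their cause."""
    trace_id = parent["traceId"] if parent else str(uuid.uuid4())
    return {"eventType": event_type, "traceId": trace_id, **payload}

placed = new_event("order.placed", {"orderId": "ord_1"})
reserved = new_event("inventory.reserved", {"orderId": "ord_1"}, parent=placed)
assert reserved["traceId"] == placed["traceId"]
```

Searching logs for one trace ID then reconstructs the whole publish → consume → emit chain.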
Event Sourcing
Event sourcing is an optional pattern where you store all events and reconstruct state from them.
Instead of storing current state (order status = DELIVERED), you store events (OrderPlaced, PaymentProcessed, ItemsShipped, DeliveryConfirmed).
Current state is derived by replaying events.
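Replay is just a fold over the event history. A minimal sketch using the event names from the example above:

```python
def apply(state, event):
    """State transition for one event; unknown events leave state unchanged."""
    transitions = {
        "OrderPlaced": "PLACED",
        "PaymentProcessed": "PAID",
        "ItemsShipped": "SHIPPED",
        "DeliveryConfirmed": "DELIVERED",
    }
    return transitions.get(event["type"], state)

def current_state(events, initial=None):
    state = initial
    for event in events:
        state = apply(state, event)
    return state

history = [{"type": t} for t in
           ["OrderPlaced", "PaymentProcessed", "ItemsShipped", "DeliveryConfirmed"]]
assert current_state(history) == "DELIVERED"
# "Time travel": replay only a prefix to see the state at a past point.
assert current_state(history[:2]) == "PAID"
```

Replaying a prefix of the history is the "time travel" benefit listed below.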
Pros:
- Complete audit trail
- Can debug what happened
- Easy to add new features (replay events through new logic)
- Time travel (see state at any point in past)
Cons:
- More complex
- More storage
- Replaying events is expensive
- Need careful handling of event schema changes
We use event sourcing for financial systems where audit trail is required. For others, we usually don't.
Integration Patterns
How do you connect external systems to your event-driven architecture?
Database polling: Periodically read database for changes. When you find new data, emit events. Simple but latency-heavy.
Change data capture (CDC): Hook into database transaction log. Every write to database is captured and converted to events. Kafka Connect has connectors for most databases.
Webhooks: External system sends HTTP callbacks when events happen. You receive them and emit internal events.
APIs: Your service periodically calls external APIs to fetch data.
For most integrations, CDC works best. It captures all changes automatically.
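One polling cycle amounts to reading past a high-water mark and emitting an event per new row. Helper names here are hypothetical; CDC tools do the same thing from the transaction log, without the polling delay:

```python
def poll_for_changes(fetch_rows_after, last_seen_id, emit):
    """One polling cycle: read rows newer than the high-water mark, emit events."""
    for row in fetch_rows_after(last_seen_id):
        emit({"eventType": "row.inserted", "data": row})
        last_seen_id = max(last_seen_id, row["id"])
    return last_seen_id  # persist this between cycles

table = [{"id": 1, "sku": "A"}, {"id": 2, "sku": "B"}, {"id": 3, "sku": "C"}]

def fetch_rows_after(last_id):
    return [row for row in table if row["id"] > last_id]

emitted = []
high_water = poll_for_changes(fetch_rows_after, 1, emitted.append)
assert high_water == 3
assert [e["data"]["id"] for e in emitted] == [2, 3]
```

Persisting the high-water mark between cycles is what prevents re-emitting old rows after a restart.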
Scaling Event-Driven Systems
At scale, you need to think about:
Partitioning: Kafka topics are divided into partitions. Each consumer reads from a partition. This allows parallelism. Order events might be partitioned by orderId so all events for an order go to same partition (maintaining order).
Consumer groups: Multiple consumers in a group, each reading different partitions. Allows horizontal scaling.
Service replication: Each service might have multiple instances. They all consume from the queue independently.
Example: 10 million orders/day
- Topic: "orders" with 100 partitions
- Inventory service: 10 instances, each processing a portion of partitions
- Payment service: 5 instances
- Notification service: 3 instances
Each service scales independently.
This is where Kafka really shines. It handles millions of events/second across hundreds of partitions and services.
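Kafka's default partitioner hashes the message key (murmur2) modulo the partition count; a simplified stand-in (md5 instead of murmur2) shows why keying by orderId keeps all of an order's events on one partition:

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Deterministic key -> partition mapping (Kafka uses murmur2; md5 as a stand-in)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every event keyed "ord_999" maps to the same partition,
# so the partition preserves that order's event sequence.
p1 = partition_for("ord_999", 100)
p2 = partition_for("ord_999", 100)
assert p1 == p2
assert 0 <= p1 < 100
```

The corollary: changing the partition count remaps keys, so resize topics deliberately.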
Testing Event-Driven Systems
Testing is harder when services don't call each other directly.
We implement:
Unit tests: Test that service correctly processes an event type
```python
def test_order_placed():
    event = OrderPlacedEvent(orderId="ord_1", userId="usr_1")
    service.process(event)
    assert inventory_reserved("ord_1")
```
Integration tests: Test against local Kafka
```python
def test_order_flow():
    producer.send("orders", order_placed_event)
    # Give the service time to process (polling with a timeout is less flaky
    # than a fixed sleep, but this keeps the example short)
    time.sleep(1)
    assert payment_captured(order_placed_event.orderId)
```
Contract tests: Multiple services agree on event schema
```python
def test_order_placed_schema():
    event = OrderPlacedEvent(...)
    assert schema_validator.validate(event)
```
End-to-end tests: Full flow from one service through multiple services
We run unit and integration tests in CI/CD. Contract tests prevent incompatibilities between services. End-to-end tests run in staging environment before production deployment.
Debugging Production Issues
When something goes wrong in production:
- Check consumer lag: Is any consumer falling behind?
- Check dead letter queues: Are events failing to process?
- Check service logs: What error is the service logging?
- Replay the event: Take the failing event and process it locally
- Check data state: What does the database show?
Because events are immutable, you can replay any event to understand what happened. This is powerful for debugging.
Building Blocks
When we implement event-driven systems, we use:
- Message broker (Kafka, Pub/Sub): Core event infrastructure
- Schema registry: Enforce event format
- Service framework (Spring Boot, FastAPI): Wrap service logic
- Monitoring (Prometheus, Datadog): Track what's happening
- Orchestration (Kubernetes): Deploy and scale services
For infrastructure details, see our Cloud Solutions page.
Getting Started
For a new system:
- Identify natural business events (OrderPlaced, PaymentProcessed)
- Choose message broker (Kafka for scale, SQS/Pub/Sub for simplicity)
- Define event schemas and version them
- Start with simple choreography
- Implement idempotent consumers
- Add monitoring from day one
- Gradually add more services
Don't start event-driven unless you need it. Monoliths with direct calls are simpler and work fine for small systems. Graduate to event-driven when you need:
- Significant scale
- Multiple independent teams
- Loose coupling between subsystems
- Asynchronous processing needs
Common Pitfalls
Treating events like RPCs: If you're waiting for a response to an event, you're not thinking event-driven. Events are fire-and-forget.
Over-granular events: Too many event types makes the system complex. Strike a balance.
No idempotency: When you process same event twice, bad things happen. Always make consumers idempotent.
Insufficient monitoring: Event-driven systems fail silently. Dead letter queues grow without anyone noticing. Monitor everything.
Not versioning schemas: You'll change your schema. If you don't version it, existing consumers break.
Skipping the saga pattern: For complex workflows, use sagas. Choreography gets out of hand quickly.
FAQ: Event-Driven Microservices
Q: When should we move to event-driven architecture?
A: When you have multiple independent services that need to communicate without tight coupling. For a team of 5 people building one product, stick with monolith. For a company with 20+ engineers across multiple teams, event-driven becomes valuable.
Q: Kafka vs. RabbitMQ vs. cloud providers?
A: Kafka for high-scale, complex needs, companies that want to avoid vendor lock-in. RabbitMQ for traditional message queuing with less infrastructure overhead. Cloud providers (SQS, Pub/Sub) for simplicity and AWS/GCP-native organizations. We use all three depending on context.
Q: Can we start with one service?
A: Yes. You don't need an event-driven system from day one. Start with a monolith or simple microservices. Gradually introduce events as needs change. Most systems work fine without events.
Q: How many services is too many?
A: It's less about a hard number than about maintainability. 5-10 services is comfortable; 50+ becomes challenging. But with proper tooling and organization, teams handle much larger systems. See our AI Agent Systems page for handling complex system architectures.
Q: What about backwards compatibility?
A: Always version events. When you change a schema, increment the version. Consumers must handle both old and new versions. This lets you deploy changes independently.
Q: How do we handle transactions across services?
A: Sagas. Split what would be a transaction into multiple local transactions with compensating actions. It's harder than ACID but necessary in distributed systems.
Q: Is eventual consistency good enough?
A: For most systems, yes. Data is consistent within seconds. For financial systems, you might need stricter guarantees. But eventually consistent systems handle scale better than strongly consistent ones.
Wrapping Up
Event-driven microservices architecture is powerful but complex. The organizations that succeed:
- Use events for asynchronous communication between services
- Implement idempotent event processing
- Version event schemas
- Monitor constantly
- Use sagas for complex workflows
- Start simple and add complexity as needed
The architecture works best when owned by teams with infrastructure expertise. If you're adding event-driven to teams without that expertise, plan for learning curve.
We've built event-driven systems processing billions of events monthly. The patterns I've shared are what actually work at scale.
Start with clear business events. Design schemas carefully. Implement one consumer well. Scale from there.
The best event-driven systems are the ones built by teams that understand their business deeply and their technical constraints clearly. That combination is more important than any specific tool choice.
For help designing or implementing systems at scale, see our SaaS Development work.
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 1000+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement.