Event-driven architecture is one of the most powerful patterns in distributed systems — and one of the most commonly over-applied. Here's the honest picture.
When event-driven makes sense
Use events when you need temporal decoupling (the producer doesn't wait for consumers), fan-out (one event triggers multiple independent reactions), an audit trail (every state change is recorded as an immutable event), or resilience (a consumer that goes down can process the events it missed once it recovers).
Don't use events when you need synchronous responses, strong consistency, or simple request/response workflows. Adding a message broker to a simple CRUD API adds complexity without benefit.
The event schema problem
Events are API contracts. Once an event is published, consumers depend on its shape, and an incompatible schema change can break them, sometimes silently. Manage event schemas with a schema registry (Confluent Schema Registry, AWS Glue Schema Registry). Enforce backward compatibility. Version your event types. Treat events as public API.
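To make "enforce backward compatibility" concrete, here is a minimal sketch of a consumer that reads both versions of a hypothetical `OrderPlaced` event. It assumes JSON payloads carrying an explicit `version` field; the event name, its fields, and the `"USD"` default are illustrative. Version 2 only adds an optional field with a default, which is the kind of change a backward-compatibility check allows. A schema registry would enforce that rule at publish time rather than in handler code.

```python
import json
from dataclasses import dataclass

# Hypothetical event type. Version 2 added `currency` as an optional
# field with a default, which keeps the change backward compatible.
@dataclass(frozen=True)
class OrderPlaced:
    order_id: str
    amount_cents: int
    currency: str = "USD"  # added in v2; v1 payloads do not include it

SUPPORTED_VERSIONS = {1, 2}

def parse_order_placed(raw: str) -> OrderPlaced:
    """Decode an OrderPlaced event written with any supported schema version."""
    payload = json.loads(raw)
    version = payload.get("version", 1)
    if version not in SUPPORTED_VERSIONS:
        # Fail loudly rather than guess at an unknown shape.
        raise ValueError(f"unsupported OrderPlaced version: {version}")
    return OrderPlaced(
        order_id=payload["order_id"],
        amount_cents=payload["amount_cents"],
        # v1 payloads omit currency, so fall back to the v2 default.
        currency=payload.get("currency", "USD"),
    )

old = parse_order_placed('{"version": 1, "order_id": "o-1", "amount_cents": 500}')
print(old)  # OrderPlaced(order_id='o-1', amount_cents=500, currency='USD')
```

Rejecting unknown versions outright is a design choice: it turns a silent misread into a visible error that can be routed to a dead-letter queue.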
Idempotency is mandatory
Most message brokers, as typically configured, provide at-least-once delivery, so your consumers can receive the same event more than once. Design as if they will. Make every event handler idempotent: processing the same event twice must produce the same result as processing it once. Use idempotency keys, database upserts, and deduplication windows.
Saga pattern for distributed transactions
Workflows that span multiple services, each with its own database, usually call for the Saga pattern, since no single local transaction can cover them. A saga is a sequence of local transactions; the completion of each one triggers the next step, either through an event or through a coordinator. If a step fails, compensating transactions undo the steps that already completed, typically in reverse order. Choreography-based sagas (events trigger events) are simpler to set up but harder to debug. Orchestration-based sagas (a coordinator drives the workflow) are more explicit and easier to test.
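An orchestration-based saga reduces to a small control loop. The sketch below assumes each step is an in-process function paired with its compensation; in a real system the coordinator would send commands to other services and persist its progress so it can resume after a crash. The step names (`reserve_inventory`, `charge_payment`, `schedule_shipping`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SagaStep:
    name: str
    action: Callable[[dict], None]      # local transaction owned by one service
    compensate: Callable[[dict], None]  # semantic undo of that transaction

def run_saga(steps: List[SagaStep], ctx: dict) -> bool:
    """Run steps in order; if one fails, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action(ctx)
        except Exception:
            # The failed step never committed, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensate(ctx)
            return False
        completed.append(step)
    return True

log: List[str] = []

def fail_shipping(ctx: dict) -> None:
    raise RuntimeError("no courier available")

steps = [
    SagaStep("reserve_inventory", lambda c: log.append("reserve"), lambda c: log.append("release")),
    SagaStep("charge_payment", lambda c: log.append("charge"), lambda c: log.append("refund")),
    SagaStep("schedule_shipping", fail_shipping, lambda c: log.append("cancel_shipment")),
]
ok = run_saga(steps, {"order_id": "o-1"})
print(ok, log)  # False ['reserve', 'charge', 'refund', 'release']
```

This sketch assumes compensations always succeed. In practice a compensation can fail too, so compensations are usually made idempotent and retried.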
Event sourcing — powerful but heavy
Event sourcing stores the full history of events, not the current state, as the primary source of truth. You can replay events to rebuild state, audit every change, and answer temporal queries (what did this entity look like at a given point in the past?). But it adds significant complexity: event versioning, projections, and replay performance. Use it only when audit and temporal-query requirements genuinely demand it.
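A minimal sketch of the core event-sourcing idea: current state is a left fold of a pure `apply` function over the event stream, and a temporal query is the same fold stopped early. It assumes an in-memory list of hypothetical `Deposited`/`Withdrawn` events ordered by sequence number. A real event store adds persistence and versioned event types, and usually snapshots to cut replay time.

```python
from dataclasses import dataclass
from functools import reduce
from typing import List, Optional

@dataclass(frozen=True)
class Event:
    seq: int           # position in the stream; events are assumed stored in order
    kind: str          # hypothetical types: "Deposited" or "Withdrawn"
    amount_cents: int

def apply(balance: int, event: Event) -> int:
    """Pure state transition: previous state + one event -> next state."""
    if event.kind == "Deposited":
        return balance + event.amount_cents
    if event.kind == "Withdrawn":
        return balance - event.amount_cents
    raise ValueError(f"unknown event kind: {event.kind}")

def balance_at(events: List[Event], up_to_seq: Optional[int] = None) -> int:
    """Rebuild state by replaying events; stop at up_to_seq for a temporal query."""
    relevant = [e for e in events if up_to_seq is None or e.seq <= up_to_seq]
    return reduce(apply, relevant, 0)

history = [
    Event(1, "Deposited", 1000),
    Event(2, "Withdrawn", 300),
    Event(3, "Deposited", 50),
]
print(balance_at(history), balance_at(history, up_to_seq=2))  # 750 700
```

Keeping `apply` free of side effects is what makes replay safe: rebuilding state, or building a new projection from old events, never re-triggers emails, payments, or other external actions.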
Kafka vs RabbitMQ vs SQS
Kafka for high-throughput, persistent event streams and event sourcing. RabbitMQ for flexible routing and competing-consumer work queues at more moderate throughput. SQS for AWS-native simplicity with fully managed infrastructure. The choice matters less than most teams think: pick the one your team knows how to operate.