March 10, 2017
by Carlos Morgado
Event Sourcing, overcoming the monolith
We’ve all either been there or wish we’ll get there, your application grows until it becomes a huge codebase with 25 people working on it and supported by a database your devops are always complaining about. The way is clear, break up the monolith into manageable micro services with clear responsibilities and backed by sensibly-sized databases. The path, however, is murky. How to split the database joins? What to break out first? How to orchestrate? How to keep everything consistent? Event sourcing helps you through that path into micro services and enables interesting architecture patterns to solve common problems like caching and auditing.
It starts simple enough — a Rails bootstrap, a database and your special snowflake. Business grows and your app grows with it, incorporating a lot more functionality and managing a lot more data. The codebase becomes unwieldy, everything is slow and the slightest disturbance in the database wreaks havoc everywhere.
The Rails monolith
It’s clear we need to break things up into small, manageable components. We start breaking out micro services, break up the database into smaller, service-owned databases and exposing the monolith APIs through HTTP. While the architecture mirrors the organization much better, with teams owning components, the overall performance won’t get any better. While all the micro services work fairly well there’s always a large number of GETs in flight internally so the system seems slow from the outside. And maybe the worst part, due to bugs and transient errors, all databases are slightly inconsistent.
Microservices with dependencies
At this point we come to a fork in the road. Either make the whole system more transactional and consistent or learn to live with inconsistency and be highly available. The answer is usually not obvious. Your application probably does billing which screams hard consistency, but also has a web frontend which needs to be serving pages at all times. While the databases are decoupled, there’s still a hard dependency graph of calls between services. Installing bulkheads and circuit breakers prevents catastrophic cascading failure, but don’t necessarily improve end user experience because failures propagate up to the user. We need to effectively decouple and isolate strongly consistent components from each other and from the rest of the system because they necessarily will go unavailable. Conversely, components that need to be highly available can’t rely on other parts of the system for their critical functions.
Your application will, as a whole, evolve to be globally AP in the CAP sense.
It might seem like I’m delivering a message of no hope — your application will never be consistent and you’ll have to clean up the mess forever. In fact, once we accept it we can take advantage of inconsistency in our designs. Accepting inconsistency comes in two forms. The most visible is that your product is now always able to show information but it may be out of date or inconsistent. A great example of this is Netflix, where displaying default listings instead of customized listings is baked into the whole architecture as a normal behavior. The other is engineering your consistent components to be more resilient to failures around them and having built-in sanity checking to avoid scenarios like double billing because an API call timed out on the client and was reissued.
There’s an architecture pattern that can help you with all of this and even give you some extra benefits. An event sourced architecture system state is not a snapshot mutated on a database but an incremental stream of facts broadcast throughout the system. The facts are broadcast as events through a message broker, which becomes the central piece of the system. Components interested in those facts can subscribe to them and accumulate that knowledge without depending on the source to be available to answer queries about the state of a given object.
While current state will probably be persisted in memory and databases for practical reasons, it stops being important as it can always be reconstructed by replaying past events.
This is the key point of event sourcing, the source of truth of the system is the sum of all past events, not its current projection stored somewhere.
Message broker becomes the central piece of the system
The most obvious and probably lowest hanging fruit gain is cache preloading. On a traditional centralised database architecture, components like frontends cache respond as they go by so they tend to have a bimodal distribution of access times — fast for hot content and slow for cold content, because it needs to be retrieved from a backend. This also leads to coupling of backend scalability and frontend traffic patterns because back end and databases need to handle frontend traffic spikes and changes in content popularity. Another probably less visible and harder to handle problem is cache invalidation. The traditional approach is to use cache control headers and either cache too aggressively and serve stale content or too conservatively and make unnecessary requests because the cache has no visibility into the object lifecycle.
In an event sourced system you can listen to changes to objects of interest and keep your cache in sync with the rest of the system without special machinery for cold requests or invalidation through writes. This is such an universal technique that it is even used to coordinate the memory caches in the computer you are using right now.
Another issue we need to address is connecting consistent components to the rest of the system. Caching solves the problem of accessing inaccessible information, message queues solve not losing updates when a consistent component becomes unavailable. Message brokers such as RabbitMQ and Kafka can keep state and store events for later retrieval and processing by these components. How much later depends on the broker and the volume, but it’s quite common to have days worth of events in Kafka.
So, event sourcing allows you to handle not only transient component unavailability but also absorb peaks at the cost of latency.
Latency is a key point because you can never get something out of nothing. If your system is time-sensitive, processing some events should include guards for late events. This, by the way, is not a problem of event sourcing, it’s an inherent problem of your domain and usually results in consistent, not available systems. If that is your requirement, you probably require coupling that event sourcing doesn’t give you. However if that is not a requirement event sourcing helps you break that coupling and make your system more resilient.
An interesting side effect of having components on your system receiving events and processing them to update their internal state is that the accumulated stream of events is a true audit log of your system.
Much better than applicational logs, which can omit or lie, the event log together with the behavior expressed in your code, generates the state of the system.
If some component lies in the events or mishandles something, you will get to that bad state reliably every time you play back the event stream into that component. This is the basis of the source control system Git (and others), the current state of your file is the incremental sum of all the changes that happened to it.
In terms of modelling, event sourcing is quite different from traditional schema modelling where you model static views of the business objects.
Event sourcing is about the dynamics of the system, how to express what can happen and not so much how to project the features that are currently of interest.
At Talkdesk, we are now redesigning our events and it is proving a more complex effort than modelling our static entities. That stems from the fact that events cross domain boundaries and require more agreement between different domain experts which, at first light, seems like an extra burden, but is actually an investment against future mismatch. In any sufficiently complex system, domain experts will evolve different representations of common concepts leading to communication difficulties. Event sourcing enforces a base common language and enables different domains to project the facts using different schemas.
The whole system is much more resilient but still has a point of failure: the message broker. This is a usual critique of event sourcing along with message broker technology being not as mature as database technology. Message brokers don’t provide transactions and don’t have advanced query interfaces, so they are much simpler than databases and much easier to make reliable and available. Our main broker, RabbitMQ, is built on top of Erlang which is more mature than any other piece of software we use and we never experienced unavailability not traceable directly to outside factors, such as disk full. Moving to event sourcing means having message broker administrator as traditional architectures had database administrators.
Where to go from here? A natural extension of event sourcing is Command Query Response Separation, an architecture pattern which separates the “read” and “write” paths and allows for different interfaces and models on each path. If commands, query and responses are all events, you can take full advantage of the event log. Conversely you can adopt a synchronous command and asynchronous response architecture which adapts well to some scenarios.
HTTP 202 and events is a form of CQRS
There’s a number of technical decisions around event sourcing, like what broker to chose, how to model events, how to handle access control or what tooling to use around events. That’s material for future posts. If you want to learn more about event sourcing you can look at Martin Fowler’s articles on the subject which, as usual for Martin, presents a nuanced view. You can also go straight to the source and check Greg Young’s posts on the matter which ties ES, CQRS and Domain Driven Design together.
This blog post is a written and edited version of Carlos’s talk at Fullstack LX.