How to mitigate unhappy paths with an event-driven architecture at scale

The reality of supporting a production event-driven architecture at any reasonable scale is that it can be challenging, especially when dealing with bad events and unhappy paths, both of which affect business operations and the customer experience. Architects and developers often focus on delivering the minimum viable product (MVP) to show business value early and validate the approach taken. While focusing on the MVP can be valuable in establishing IT agility, its requirements are targeted at normal operation, termed the happy path. However, not everything works 100% of the time. Capturing and supporting the unhappy path, when something fails or an unexpected event takes place, takes considerable effort, but it is required for robust enterprise-grade systems.

MuleSoft has a long tradition of working with open-source projects and I had the chance to contribute to these efforts with a recent commit to the AsyncAPI initiative. My contribution, via an extension to the AsyncAPI specification, defined how to implement event ordering and duplicate event parameters when working with AWS SQS queues. This was both an interesting problem to solve and directly related to some of the content contained in a presentation on Unhappy Paths and Bad Events I gave at the first AsyncAPI Online Conference.

This conference was live-streamed on YouTube so the full conference proceedings are still available to be viewed online. You can watch my full session here or continue reading below.

The 4 focus areas to mitigate the unexpected in production

Generally, I have found that to mitigate unhappy paths and bad events in production platforms, you need to focus on four main areas.

1. Start with the event

Start with the event itself. Consider the shape of the event envelope, that is, the data and metadata it contains, and the focus of the event being generated. Although it’s tempting to produce large generic events, this creates more work for consumers, who must work out the event’s intent. Having focused events is key. Give thought to event naming and avoid generic names. Does the event name convey the correct meaning? With an appropriate event name, the context and the state change that caused the event to be generated are clear to anyone who plans to consume it. Lastly, does the event envelope as designed support unhappy paths? Does it contain enough information to be replayed if required, or to aid in fault resolution?
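As an illustration of these points, here is a minimal sketch of one possible envelope shape. The field names (`event_type`, `correlation_id`, `causation_id`, and so on) and the `missing_for_replay` helper are assumptions made for this example, not part of any specification mentioned in this post.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional
import uuid


@dataclass(frozen=True)
class EventEnvelope:
    """Hypothetical envelope: metadata travels alongside a focused payload."""
    event_type: str            # specific name that conveys the state change
    source: str                # system that produced the event
    data: dict[str, Any]       # the business payload
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    correlation_id: Optional[str] = None  # ties together events in one business flow
    causation_id: Optional[str] = None    # id of the event that triggered this one


def missing_for_replay(event: EventEnvelope) -> list[str]:
    """Return the names of fields that are empty and would block a replay."""
    required = {
        "event_type": event.event_type,
        "source": event.source,
        "data": event.data,
    }
    return [name for name, value in required.items() if not value]


# A focused, well-named event rather than a generic "AccountUpdated"
event = EventEnvelope(
    event_type="AccountAddressChanged",
    source="crm",
    data={"account_number": "A-1", "postcode": "AB1 2CD"},
)
```

A check like `missing_for_replay(event)` could run at publish time, so that an event that cannot be replayed is caught before it reaches consumers.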

2. Maintain event order

Many, but not all, use cases require events to be processed in order. This may seem easy until you have a large number of events to process within reasonable timescales, or you encounter an error. If you come across an event that cannot be processed, often because of the data it contains, how can you maintain order and still process events? The easiest way is to stop processing and resolve the error manually. This is not an ideal way to handle the issue, especially in a busy production environment. A more sophisticated approach needs to be applied, based on the needs of the event. For example, you can identify the account number affected by the event, hold the bad event in something like a dead letter queue (DLQ), and refer to the DLQ when processing later events. Using this approach, you limit the impact of the bad event to the one affected account rather than impacting all accounts.
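The per-account DLQ idea can be sketched in a few lines. This is an in-memory illustration only: the `OrderedProcessor` class, the `account_number` key, and the `replay` method are hypothetical names, and a real system would use a durable queue rather than a Python dictionary.

```python
from collections import deque
from typing import Callable


class OrderedProcessor:
    """Parks failing events per account so other accounts keep flowing."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self.handler = handler
        # account_number -> parked events, kept in arrival order
        self.dlq: dict[str, deque] = {}

    def process(self, event: dict) -> bool:
        account = event["account_number"]
        if account in self.dlq:
            # An earlier event for this account failed: park this one behind it
            self.dlq[account].append(event)
            return False
        try:
            self.handler(event)
            return True
        except Exception:
            # Bad event: hold it and block only this account
            self.dlq[account] = deque([event])
            return False

    def replay(self, account: str) -> int:
        """Retry parked events for one account in order; stop at the first failure."""
        parked = self.dlq.get(account, deque())
        replayed = 0
        while parked:
            try:
                self.handler(parked[0])
            except Exception:
                break  # still failing: leave it at the head of the queue
            parked.popleft()
            replayed += 1
        if not parked:
            self.dlq.pop(account, None)  # account is unblocked
        return replayed
```

Once the fault is resolved, calling `replay` for the affected account drains its parked events in their original order, while events for every other account were never delayed.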

3. Understand the end-to-end impact

Third, understand the end-to-end environment and the potential impact you may have on downstream systems. Many large enterprises still run on traditional on-premises back-office systems, which can’t autoscale under load for a variety of reasons, both technical and license-related. It’s easy to lose sight of this, especially if you are also dealing with cloud platforms that are elastic in nature. Consider enabling downstream systems to indicate they are under strain via backpressure, or implementing a safety valve that temporarily halts processing when event volumes reach an abnormal level. These strategies support the end-to-end business flows.
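One way to picture a safety valve is as a sliding-window counter that refuses work once the event rate passes a threshold. The sketch below assumes a single-process consumer; the `SafetyValve` class, its parameters, and the thresholds shown are hypothetical and would need tuning to what the downstream system can actually absorb.

```python
import time
from collections import deque
from typing import Callable


class SafetyValve:
    """Closes while more than max_events arrive within window_seconds."""

    def __init__(
        self,
        max_events: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_events = max_events
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the valve can be tested without sleeping
        self.timestamps: deque = deque()

    def allow(self) -> bool:
        now = self.clock()
        # Forget events that have aged out of the window
        while self.timestamps and now - self.timestamps[0] >= self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_events:
            return False  # valve closed: caller should pause or requeue the event
        self.timestamps.append(now)
        return True


# Example: forward at most 100 events per second to a back-office system
valve = SafetyValve(max_events=100, window_seconds=1.0)
```

A consumer would call `valve.allow()` before forwarding each event and leave the event on the queue when it returns `False`, which gives the downstream system time to recover.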

4. Event monitoring

Lastly, event observability, the ability to observe and build an understanding of events and event processing as they move through an enterprise, should not be underestimated when supporting complex enterprise environments. The asynchronous nature of events means logging, monitoring, and alerting are often where the first sign of a problem appears, giving you a chance to react before it becomes a customer-impacting issue. This takes us back to the first point regarding event envelope shape. Does the event help support your observability efforts? Can an individual event be identified throughout its end-to-end journey and, if they exist, can its parent event and any spawned child events also be correlated and observed?
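To make the correlation question concrete, the following sketch reconstructs one business flow from a set of logged events. It assumes each event record carries an `event_id`, a shared `correlation_id`, and a `parent_id` pointing at the event that spawned it; these field names and the `trace_journey` function are illustrative assumptions, not an established standard.

```python
from collections import defaultdict


def trace_journey(events: list[dict], correlation_id: str) -> list[str]:
    """Return the event ids of one flow, each parent listed before its children."""
    children = defaultdict(list)
    roots = []
    for event in events:
        if event["correlation_id"] != correlation_id:
            continue  # belongs to a different business flow
        parent = event.get("parent_id")
        if parent is None:
            roots.append(event["event_id"])
        else:
            children[parent].append(event["event_id"])

    # Depth-first walk so each child appears directly under its parent
    ordered = []
    stack = list(reversed(roots))
    while stack:
        event_id = stack.pop()
        ordered.append(event_id)
        stack.extend(reversed(children.get(event_id, [])))
    return ordered


logged = [
    {"event_id": "e1", "correlation_id": "c1", "parent_id": None},
    {"event_id": "e2", "correlation_id": "c1", "parent_id": "e1"},
    {"event_id": "x1", "correlation_id": "c2", "parent_id": None},
    {"event_id": "e3", "correlation_id": "c1", "parent_id": "e1"},
    {"event_id": "e4", "correlation_id": "c1", "parent_id": "e2"},
]
journey = trace_journey(logged, "c1")
```

If the envelope lacks these identifiers, this kind of reconstruction has to fall back on timestamps and payload matching, which is far less reliable.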

The AsyncAPI community continues to build momentum around the project, and I encourage you to take a look at the work being done and contribute if you have an interest in this area. It’s a growing, friendly, and inclusive community.

For more information on event-driven messages, check out this blog on how to design message-driven and event-driven APIs.
