For the last few decades, application and operational teams have been preaching loudly about monitoring and alerting capabilities to improve system resiliency. It is only with COVID-19, however, that these capabilities have shifted from “luxury” items to “absolutely necessary” capabilities to ensure business continuity.
As businesses take the next steps to refine their monitoring and alerting capabilities, it is critical that all stakeholders understand and utilize the same frameworks and metrics. This ensures that any unforeseen problems are quickly identified and resolved.
I recently had a conversation with a friend who works in the retail industry about the sudden surge in his company's online business. One of his key concerns was ensuring business continuity. They had a major incident in which customers were unable to place orders. The incident lasted for around six hours on a Saturday and resulted in a poor customer experience and lost business.
Their team is directly responsible for ensuring system availability, and its performance is measured through a set of service-level agreements on mean time to identification (MTTI) and mean time to resolution (MTTR). Specifically, the issue was with backend connectivity for a critical API, but a lack of visibility slowed down identification and resolution. It was only the following Monday that they stabilized the system, meaning they lost a number of key transactions. This experience has forced them to revisit their design and processes to provide a stable system that can cope with sudden peaks, gain visibility into end-to-end business transactions, and reduce MTTI/MTTR. Knowing these metrics plays a significant role in ensuring minimum downtime and system stability. The question is, how do we know and improve on these metrics?
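As a starting point for knowing these metrics, here is a minimal sketch of how MTTI and MTTR could be computed from incident records. The `Incident` fields and the sample timestamps are hypothetical, and the sketch assumes both metrics are measured from the moment the fault began; some teams measure MTTR from the moment of identification instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record layout, not taken from any specific tool.
    started: datetime     # when the fault began
    identified: datetime  # when the team detected it
    resolved: datetime    # when service was restored


def mean_duration(durations: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    return sum(durations, timedelta()) / len(durations)


def mtti(incidents: list[Incident]) -> timedelta:
    """Mean time to identification: start of fault to detection."""
    return mean_duration([i.identified - i.started for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to resolution: start of fault to restoration (assumed)."""
    return mean_duration([i.resolved - i.started for i in incidents])


# Illustrative data only: two made-up incidents.
sample_incidents = [
    Incident(datetime(2020, 5, 2, 9, 0), datetime(2020, 5, 2, 11, 0),
             datetime(2020, 5, 2, 15, 0)),
    Incident(datetime(2020, 5, 9, 14, 0), datetime(2020, 5, 9, 14, 30),
             datetime(2020, 5, 9, 16, 0)),
]

print("MTTI:", mtti(sample_incidents))  # 1:15:00
print("MTTR:", mttr(sample_incidents))  # 4:00:00
```

Tracking these two numbers per quarter, or per service, makes it easier to see whether investments in visibility are actually shortening identification time.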
As a practice, it is important to operationalize these solutions as part of the business continuity plan, because both the cost of downtime and the cost of identifying and resolving an incident can be significant.
#1 Investment in monitoring capabilities
According to Adrian Cockcroft, “Your systems will only be as available as the systems that monitor them.” The following are a few ways to establish a baseline monitoring system:
- Establishing visibility:
Having visibility across all the services of a business gives the organization better control over its systems. It helps teams respond quickly by anticipating issues and sending out notifications before they become system failures. Business and IT should work together to analyze which services are business-critical and to identify the core and dependent services that would have the highest impact on business continuity.
Using visualization tools across your critical applications/services can help gather information on the current and past health of the systems. This enables you to quickly see where you are receiving heavy traffic, where CPU/memory usage is high, and where requests are failing. With the growing number of microservices-based architectures, monitoring the state of an application/system is becoming more important.
- Knowing what to monitor/measure:
Bringing together important metrics and data points onto a single screen can be done using visualization tools that analyze a system’s performance by plotting current and historical data collected over a period of time. This enables us to gain operational visibility into the application/infrastructure.
Deciding which metrics to look at is important for troubleshooting accurately and finding the root cause of an issue. A starting point could include:
- Average throughput
- CPU utilization
- Memory utilization
- Failure rate
- Average response time
Informing the right teams to handle an issue is just as important as monitoring and diagnosing it. We can configure basic and advanced alerts that trigger email notifications when a data point being measured exceeds or drops below a specific threshold.
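To illustrate the threshold idea, here is a minimal sketch of an alert rule evaluator. The `Threshold` class, the metric names, and the limit values are all hypothetical, and the sketch only builds alert messages; delivering them by email or another channel is left to whatever notification tool you already use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Threshold:
    # Hypothetical rule shape: either bound may be omitted.
    metric: str
    lower: Optional[float] = None  # alert if the value drops below this
    upper: Optional[float] = None  # alert if the value exceeds this


def evaluate(thresholds: list[Threshold], readings: dict[str, float]) -> list[str]:
    """Return one alert message per reading that breaches its threshold."""
    alerts = []
    for rule in thresholds:
        value = readings.get(rule.metric)
        if value is None:
            continue  # no data point for this metric in this sample
        if rule.upper is not None and value > rule.upper:
            alerts.append(f"{rule.metric}={value} exceeds {rule.upper}")
        elif rule.lower is not None and value < rule.lower:
            alerts.append(f"{rule.metric}={value} dropped below {rule.lower}")
    return alerts


# Illustrative rules and readings only.
rules = [
    Threshold("cpu_utilization_pct", upper=85.0),
    Threshold("throughput_per_min", lower=100.0),
    Threshold("failure_rate_pct", upper=1.0),
]
latest = {"cpu_utilization_pct": 92.5, "throughput_per_min": 250.0,
          "failure_rate_pct": 0.4}

for message in evaluate(rules, latest):
    print("ALERT:", message)  # ALERT: cpu_utilization_pct=92.5 exceeds 85.0
```

Keeping the rules as data rather than as hard-coded conditions makes it easier for business and IT to review and adjust thresholds together.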
- Health check:
Health check monitors verify that critical services are running and responding to requests, proactively monitoring all the critical endpoints. Say you have an API that is scheduled to run every day and you want to ensure it’s operational: you can create and deploy monitors to make sure the API is responding as expected. If not, the right teams are notified ahead of time, enabling you to mitigate the risk of your API being down/unavailable during the business cycle.
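A health check of this kind can be sketched in a few lines using only the Python standard library. This is an illustration under assumptions, not a replacement for a monitoring product: the URL is a placeholder, a 2xx status is assumed to mean "healthy", and the `opener` parameter exists only so the check can be exercised without a real network call.

```python
import urllib.request
from typing import Callable


def check_health(url: str, timeout: float = 5.0,
                 opener: Callable = urllib.request.urlopen) -> bool:
    """Return True if the endpoint answers with a 2xx status in time."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError (raised for 4xx/5xx), timeouts, and
        # connection failures are all OSError subclasses.
        return False


def run_checks(urls: list[str], check: Callable[[str], bool] = check_health) -> list[str]:
    """Return the endpoints that failed their health check."""
    return [url for url in urls if not check(url)]


if __name__ == "__main__":
    # Placeholder endpoint; replace with your own critical services.
    failing = run_checks(["https://example.com/"])
    print("Failing endpoints:", failing)
```

Running such a check on a schedule ahead of the business cycle, and routing any failures to the responsible team, gives you the early warning described above.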
#2 Maintaining a runbook
It is important to be resilient to unexpected disasters, so that your infrastructure can operate amid disruption. A support runbook documents the steps and tasks that need to be performed in case of any unplanned outages. Maintaining a runbook allows you to quickly act on any unforeseen issues.
A runbook is useful only when it’s updated regularly. At a minimum, we need to capture the following details for each incident:
- When did the incident happen?
- What is the resolution?
- Typical time to resolution
- Impact on business
- Severity level
- Responsible teams
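If the runbook is kept in a structured form rather than free text, the fields above map naturally onto a small record type. The sketch below is one possible shape; the field names, the convention that severity 1 is the most severe, and the sample entries are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class RunbookEntry:
    occurred_at: datetime                   # when did the incident happen?
    resolution: str                         # what is the resolution?
    typical_time_to_resolution: timedelta   # typical time to resolution
    business_impact: str                    # impact on business
    severity: int                           # assumed: 1 = most severe
    responsible_teams: list[str] = field(default_factory=list)


def most_severe_first(entries: list[RunbookEntry]) -> list[RunbookEntry]:
    """Order entries by severity, then by most recent occurrence."""
    return sorted(entries, key=lambda e: (e.severity, -e.occurred_at.timestamp()))


# Illustrative entries only.
entries = [
    RunbookEntry(datetime(2020, 5, 9, 14, 0), "Restart the batch scheduler",
                 timedelta(minutes=30), "Delayed reports", 3, ["Platform"]),
    RunbookEntry(datetime(2020, 5, 2, 9, 0), "Restore backend API connectivity",
                 timedelta(hours=6), "Customers unable to place orders", 1,
                 ["API team", "Network"]),
]

for entry in most_severe_first(entries):
    print(entry.severity, entry.resolution)
```

A structured record like this is also easier to keep current, since missing fields are obvious at review time.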
#3 Planning for anticipated vs actual load
It is important to run a performance test before going live to see if an application/API can handle the expected volume of traffic. Keeping future requirements of the API in mind, we need to do load testing to assess the performance of the application with the “maximum” volume of data.
For example: if you are expecting 30K transactions per minute for your application, test the application with 40K transactions per minute so the application doesn’t crash if traffic is higher than usual. As a capacity planning exercise, revisit your peak/load/stress testing numbers ahead of high-volume business cycles.
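The arithmetic behind that example can be sketched as follows. The function names are hypothetical, and the rule that an observed peak above the tested load should trigger a new round of testing is an assumption about how you might apply the advice, not a fixed policy.

```python
def headroom_pct(expected_per_min: int, tested_per_min: int) -> float:
    """Percentage by which the tested load exceeds the expected load."""
    return (tested_per_min - expected_per_min) / expected_per_min * 100


def per_second(per_min: int) -> float:
    """Convert a per-minute rate to the per-second rate a load tool needs."""
    return per_min / 60


def needs_retest(observed_peak_per_min: int, tested_per_min: int) -> bool:
    """True if real traffic has exceeded the load the system was tested at."""
    return observed_peak_per_min > tested_per_min


expected, tested = 30_000, 40_000
print(f"Headroom: {headroom_pct(expected, tested):.1f}%")       # Headroom: 33.3%
print(f"Target rate: {per_second(tested):.1f} transactions/s")  # Target rate: 666.7 transactions/s
print("Retest needed:", needs_retest(45_000, tested))           # Retest needed: True
```

In other words, testing at 40K against an expectation of 30K gives roughly a third of extra headroom; comparing observed peaks from monitoring against the tested figure tells you when that headroom has been used up.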
Use historical and real-time monitoring data to plan for trends/patterns in peak loads, error scenarios, and response times, and revisit the architecture and design of your application. These iterations should address reliability, enriched exception handling, reusable auditing/logging frameworks, and scalability.
For more on how to ensure uptime, continuous operation, and reliability of your mission-critical applications, download our High availability (HA) clustering in Mule runtime engine whitepaper.