APIs in the world of IT operations


In a previous life, I worked primarily on the operational side of the IT business, which is concerned with monitoring and operational alerting. The requirements we implemented were variations on a theme: the business would ask IT to provide an SLA for the “availability” of a service as well as an SLA for its responsiveness. On the surface, these requirements were clean and simple, but in practical terms, things got murky very quickly. I am going to share my thoughts on this experience, and how things might have turned out differently if we had adopted APIs.

Mean time to resolve

First, some context! At the time, the ultimate goal for our IT Operations team was to reduce the Mean Time To Resolve (MTTR). MTTR is the average time between the start and the resolution of an incident. The first step towards resolving an incident is identifying the root cause. The average time taken to identify the cause is the Mean Time To Identify (MTTI). It is a common KPI because it can be reduced through appropriate monitoring tooling and because it directly impacts the MTTR.
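To make the two metrics concrete, here is a minimal sketch of how they could be computed from incident records. The `Incident` fields and the sample timestamps are made up for illustration; they are not taken from any real tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime      # when the incident began
    identified: datetime   # when the root cause was identified
    resolved: datetime     # when the service was restored


def mean_minutes(durations):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


def mtti(incidents):
    """Mean Time To Identify: start of incident to root cause identified."""
    return mean_minutes([i.identified - i.started for i in incidents])


def mttr(incidents):
    """Mean Time To Resolve: start of incident to resolution."""
    return mean_minutes([i.resolved - i.started for i in incidents])


# Hypothetical sample data: two incidents starting at the same time.
t0 = datetime(2024, 1, 1, 9, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=30), t0 + timedelta(minutes=50)),
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=30)),
]

print(f"MTTI: {mtti(incidents):.0f} min, MTTR: {mttr(incidents):.0f} min")
```

Because the identification time is contained in the resolution time, every minute shaved off the MTTI comes straight off the MTTR as well.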

Availability and performance

We started by thinking of availability in binary terms: if consumers cannot access a service, it is considered unavailable. This type of thinking is easy to apply when we are dealing with a single service; let’s use a database as an example. We can simply monitor the state of the database service, and as long as the service is up, it is available.

Early in the IT operations journey, we realized that a service can hang: its process is still running, but it is unresponsive to requests. We needed to go a little deeper and run synthetic connectivity tests to ensure that a process actually responded to requests before it could be considered available. Since we were working with a database, we could open a SQL connection to test connectivity.
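A connectivity test of this kind can be sketched in a few lines. The example below uses Python's built-in `sqlite3` module purely as a stand-in for whatever database driver is actually in play; the function name and the sample paths are assumptions for illustration.

```python
import sqlite3


def is_responsive(db_path, timeout_s=2.0):
    """Return True only if we can connect AND get an answer to a trivial query."""
    try:
        conn = sqlite3.connect(db_path, timeout=timeout_s)
    except sqlite3.Error:
        return False  # could not even open a connection
    try:
        # "SELECT 1" touches no tables; it only proves the engine answers.
        return conn.execute("SELECT 1").fetchone() == (1,)
    except sqlite3.Error:
        return False
    finally:
        conn.close()


print(is_responsive(":memory:"))                 # a database that answers
print(is_responsive("/no/such/dir/monitor.db"))  # a database we cannot reach
```

The point of the check is that a running process is not enough: the probe has to get an answer back before the service counts as available.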

However, even this approach proved insufficient, because the business considered a service available only when it returned the data they wanted. The business does not care where or how the data is stored, just that it is accessible.

IT Operations responded by taking our synthetic connection tests to the next level and stepping into the world of synthetic transactions. The team developed a SQL query that tested the database to ensure it responded with appropriate data. This invariably required input from the original developer or owner of the database, which cost the development team cycles. The real pain, however, came any time the developers changed their database structure: our tests failed and raised false alarms, forcing us to chase the team down for updated queries.
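The coupling problem is easy to reproduce. The sketch below again uses `sqlite3` as a stand-in, with a hypothetical `orders` table and check query: the synthetic transaction passes until the owning team renames a column, at which point it reports a failure even though the database is perfectly healthy.

```python
import sqlite3

# Hypothetical check query and expected row, as agreed with the database owner.
CHECK_SQL = "SELECT status FROM orders WHERE id = 1"
EXPECTED_ROW = ("shipped",)


def synthetic_transaction(conn):
    """Return True if the check query runs and returns the expected data."""
    try:
        return conn.execute(CHECK_SQL).fetchone() == EXPECTED_ROW
    except sqlite3.Error:
        return False  # includes "no such column" after a schema change


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'shipped')")
before = synthetic_transaction(conn)

# The owning team restructures the table: "status" becomes "state".
conn.execute("DROP TABLE orders")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, state TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'shipped')")
after = synthetic_transaction(conn)

print(before, after)  # the data is still there, but the monitor now alarms
conn.close()
```

The monitor is tied to the internal schema, so every structural change on the development side turns into maintenance work on the operations side.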

Simply having a service available in no way implies that the service responds in a timely fashion. SLAs were, and still are, drawn up between IT and the business to define response time for the service.

Jakob Nielsen’s work on acceptable response times remains as true today as ever: “1.0 second is about the limit for the user’s flow of thought to stay uninterrupted, even though the user will notice the delay.”

Taking our database example, this might mean that the SQL query has to return its data within a set number of milliseconds, typically under 500ms, to be considered performant.
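Timing a check against such a threshold might look like the sketch below. The `timed_check` helper and its return shape are assumptions for illustration; the 500ms default simply mirrors the figure mentioned above.

```python
import time


def timed_check(check, threshold_ms=500.0):
    """Run a zero-argument check and report (ok, elapsed_ms, within_sla)."""
    start = time.perf_counter()
    ok = check()
    elapsed_ms = (time.perf_counter() - start) * 1000
    # A check only meets the SLA if it succeeds AND finishes in time.
    return ok, elapsed_ms, ok and elapsed_ms <= threshold_ms


def slow_check():
    time.sleep(0.05)  # simulate a query that takes about 50 ms
    return True


ok, elapsed_ms, within_sla = timed_check(slow_check, threshold_ms=10.0)
print(f"ok={ok} elapsed={elapsed_ms:.0f}ms within_sla={within_sla}")
```

Keeping "did it work" and "was it fast enough" as separate values lets a dashboard distinguish an outage from a slowdown.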

Recall that our goal was to reduce the MTTI. If we had gaps in our monitoring, we ran the risk of failing to identify the cause, which was ultimately a failure for our team. The database was one piece in the larger puzzle of an application stack, which included various applications, data stores, web servers, etc. Each component of the stack had its own connectivity requirements, drivers, authentication mechanisms, and credentials, and each required a deep level of monitoring and a lot of maintenance work.

As a result, instead of constantly dealing with disparate systems and updating monitoring tools, we generated a fairly static “traffic light” dashboard for the operations center wall that showed “green” lights 99% of the time!

What about APIs?

Now, imagine if we had been working in an environment where APIs were commonplace. In this case, services that required monitoring would have been accessible via a set of well-defined contracts.

Our monitoring tooling would be registered as an API consumer, able to access every API that needed monitoring with a single set of credentials across the environment. Connectivity would be simple because every service talks HTTP/HTTPS and requires no specialized drivers or connectivity parameters.

The API contract would declare the sample input payloads that could be submitted and the expected response payloads, which would make creating and timing synthetic transactions simple. On an ongoing basis, teams would need to spend minimal time updating the monitoring tooling with changes in the environment, because the APIs would insulate our team from them. We would no longer care about the SQL query running in the database; the developer would take care of that.
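As a rough sketch of what a contract-driven synthetic transaction could look like, the example below reads a sample request and expected response from a contract and checks the live answer against it. The contract layout, the `api.example.com` URL, and the function names are all hypothetical; a real contract would more likely come from an API specification than from a hand-written dictionary.

```python
import json
import time
import urllib.request


def http_post_json(url, payload, timeout_s=5.0):
    """Default transport: POST a JSON payload and return the decoded JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_contract_check(contract, send=http_post_json):
    """Submit the contract's sample request and compare the reply to the contract."""
    start = time.perf_counter()
    try:
        actual = send(contract["url"], contract["sample_request"])
    except (OSError, ValueError):  # network failure or invalid JSON
        actual = None
    elapsed_ms = (time.perf_counter() - start) * 1000
    return {
        "available": actual == contract["expected_response"],
        "elapsed_ms": elapsed_ms,
    }


# Hypothetical contract for an order lookup API.
contract = {
    "url": "https://api.example.com/orders/lookup",
    "sample_request": {"id": 1},
    "expected_response": {"id": 1, "status": "shipped"},
}


def stub_send(url, payload):
    """Stand-in transport so the example runs without a network."""
    return {"id": 1, "status": "shipped"}


result = run_contract_check(contract, send=stub_send)
print(result["available"])
```

Nothing in the check knows about tables or columns. If the developers restructure the database behind the API, the contract stays the same and so does the monitor.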

In an API-driven environment, the APIs should all be available via a catalog, making it easy to see which services are being monitored by our team and which ones we don’t have access to. This API-driven approach would have given us maximum coverage of the services, helping us drive the MTTI as low as possible when issues did occur.
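With a catalog in hand, spotting monitoring gaps becomes a simple set comparison. The API names below are invented for illustration, and the report format is just one possible shape.

```python
def coverage_report(catalog, monitored):
    """Compare the API catalog with the set of APIs the monitoring tool checks."""
    covered = catalog & monitored
    return {
        # In the catalog but not monitored: these are the blind spots.
        "unmonitored": sorted(catalog - monitored),
        # Monitored but missing from the catalog: stale or undocumented checks.
        "not_in_catalog": sorted(monitored - catalog),
        "coverage_pct": 100.0 * len(covered) / len(catalog) if catalog else 0.0,
    }


# Hypothetical catalog and monitoring configuration.
catalog = {"orders-api", "inventory-api", "billing-api", "customers-api"}
monitored = {"orders-api", "billing-api"}

report = coverage_report(catalog, monitored)
print(report)
```

Running a comparison like this on a schedule would surface new services as soon as they appear in the catalog, rather than after their first unmonitored incident.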

A consistent view across the environment and up-to-date coverage of the monitored systems would have meant that, in an API-driven world, our dashboards showed live data that we could trust. By adopting APIs, the team would have saved time and could have spent their days developing new solutions that added value to the business. Overall, adopting an API-led connectivity strategy is not only paramount to business agility, but is also a key enabler for driving value from the operations team.



 

