All your systems are on “the hairy edge of failure,” as noted by Dr. Richard Cook, an expert on how systems fail. Cook helped development and operations teams understand the complex nature of interconnected systems and how to prevent failures and outages from occurring.
In his 2013 talk at Velocity, Cook dug into the social and economic forces that repeatedly drive all non-trivial systems through cycles of safety and failure.
At the 16-minute mark of the talk, Dr. Cook gets a rousing response from the audience when he says: “What is surprising is not that there are so many accidents. It is that there are so few. You all know this to be true. The thing that amazes you is not that your system goes down sometimes, it’s that it is up at all.”
What Dr. Cook is pointing to here is that all enterprises, whether public or private, face economic pressures that push teams to “do more with less” and that, at some point, will cause an outage or an accident.
This doesn’t mean that teams should give up on preventing failures. Instead, the realization should push delivery teams to understand the implications of failure and to invest their resources in protecting the systems where the consequences of a failure are harshest. The implications of deployment failures are concrete, not hypothetical.
A study by Forrester and IBM found that more than a third of enterprises experience unplanned downtime every month. Responding enterprises estimated the cost of planned downtime at an average of $5.6M per year, and reported that unplanned downtime costs 35% more per minute than planned downtime.
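To make that per-minute premium concrete, here is a small back-of-the-envelope sketch in Python. The planned-downtime cost per minute and the outage length are hypothetical numbers chosen only for illustration; only the 35% premium comes from the figure above.

```python
# Hypothetical illustration of the relative cost of unplanned downtime.
# The per-minute planned cost and outage length below are assumptions,
# not figures from the Forrester/IBM study.

PLANNED_COST_PER_MINUTE = 5_000   # assumed cost of a planned maintenance minute (USD)
UNPLANNED_PREMIUM = 1.35          # unplanned downtime costs ~35% more per minute

unplanned_cost_per_minute = PLANNED_COST_PER_MINUTE * UNPLANNED_PREMIUM

def outage_cost(minutes: float) -> float:
    """Estimated cost of an unplanned outage of the given length."""
    return minutes * unplanned_cost_per_minute

# A single 90-minute unplanned outage under these assumptions:
print(f"${outage_cost(90):,.0f}")  # -> $607,500
```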
Go-Live planning starts with understanding what’s at stake
When delivery teams are planning to make changes to production, especially in hard-cut scenarios where there is no “blue/green” or “canary” deployment tooling, the responsible teams should have a reasonable approach to understanding and mitigating risk. Not every IT or delivery team is familiar with standardized ways of capturing risk.
A basic framework for modeling risk that teams can easily understand is based on impact and likelihood (a minimal scoring sketch follows the list):
- Impact: What data assets or production services are being changed in the proposed deployment? What could go wrong with these assets and services, from both a business and a technical perspective? If a service or data asset is unavailable for a period of time, what is the business impact in both the short and the long term?
- Likelihood: How likely are the above risks to manifest? What level of testing has been completed to ensure that they won’t? Is the testing environment at parity with production to ensure the fidelity of the test?
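To make the framework concrete, here is a minimal impact × likelihood scoring sketch in Python. The 1–5 scales, the example risks, and the review threshold are illustrative assumptions that each team should calibrate for its own services, not a prescribed standard.

```python
from dataclasses import dataclass

# Illustrative 1-5 scales; the values and threshold are assumptions, not a standard.
IMPACT_SCALE = {"negligible": 1, "minor": 2, "moderate": 3, "major": 4, "severe": 5}
LIKELIHOOD_SCALE = {"rare": 1, "unlikely": 2, "possible": 3, "likely": 4, "almost_certain": 5}

@dataclass
class Risk:
    description: str   # what could go wrong (business or technical)
    asset: str         # the data asset or production service affected
    impact: str        # key into IMPACT_SCALE
    likelihood: str    # key into LIKELIHOOD_SCALE

    @property
    def score(self) -> int:
        return IMPACT_SCALE[self.impact] * LIKELIHOOD_SCALE[self.likelihood]

risks = [
    Risk("Schema migration locks the orders table", "orders DB", "major", "possible"),
    Risk("New API version breaks a partner integration", "partner API", "moderate", "unlikely"),
]

# Flag anything above an agreed threshold for explicit mitigation before Go-Live.
REVIEW_THRESHOLD = 9
for r in sorted(risks, key=lambda r: r.score, reverse=True):
    flag = "REVIEW" if r.score >= REVIEW_THRESHOLD else "accept/monitor"
    print(f"{r.score:>2}  {flag:<15} {r.description}")
```

Sorting risks by score gives the team a simple, shared ordering for where to focus mitigation and testing effort before the Go-Live.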
One additional aspect to make sure your teams fully understand with regard to risk is “skin in the game.” It’s one thing to know what the risk is; it’s another to know who is accountable if the worst-case scenario comes to pass; and it’s another matter still to know who is responsible for the clean-up work after the fact.
It’s not good practice for those choosing to proceed to have no skin in the game, nor for those who do have skin in the game to have no voice in the process of assessing readiness.
Failure to prepare means preparing to fail
Once your team has a handle on the risk involved in their deployment, the risk model can then be used to develop mitigation strategies that avoid or minimize it (a readiness-check sketch follows the list):
- Monitoring: Are the processes, systems, and data repositories instrumented to the point where an error in production would create an alert of some nature? How long would it take to surface? Has a smoke test been created to test the change in the production system? Are people in place to act in the case of an impactful defect?
- Rollback: Is there a formalized process for undoing the change in the case an error is discovered? Is it automated? How long will it take to process? Will rollback processes encompass data repositories and any data corruption? What will happen to transactions processed incorrectly based on the error?
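As a complement to those questions, here is a minimal readiness-check sketch in Python. The checklist items paraphrase the monitoring and rollback questions above; the keys, the example answers, and the rule that every item must pass are assumptions for illustration rather than a prescribed gate.

```python
# A minimal Go-Live readiness gate sketch. The checklist items and the pass
# rule (every answer must be True) are illustrative assumptions.

READINESS_CHECKLIST = {
    "alerting_covers_changed_services": "Will an error in the changed services raise an alert?",
    "smoke_test_defined": "Is there a smoke test to verify the change in production?",
    "on_call_assigned": "Are people in place to act if an impactful defect surfaces?",
    "rollback_documented": "Is there a formalized, ideally automated, rollback process?",
    "rollback_covers_data": "Does rollback cover data repositories and possible corruption?",
    "bad_transactions_plan": "Is there a plan for transactions processed incorrectly?",
}

def assess_readiness(answers: dict[str, bool]) -> list[str]:
    """Return the checklist items that still block the Go-Live."""
    return [q for key, q in READINESS_CHECKLIST.items() if not answers.get(key, False)]

answers = {
    "alerting_covers_changed_services": True,
    "smoke_test_defined": True,
    "on_call_assigned": True,
    "rollback_documented": False,   # still a gap
    "rollback_covers_data": False,  # still a gap
    "bad_transactions_plan": True,
}

for gap in assess_readiness(answers):
    print("BLOCKER:", gap)
```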
One indirect way to gauge a team’s readiness is to ask them to show you a Root Cause Analysis (RCA) document from a previous failed change. If they can’t produce one, that’s a red flag: it likely means they don’t have a feedback loop for learning from their mistakes.
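For teams starting from scratch, here is a rough sketch of the fields a basic RCA record might capture. The structure and the example values are illustrative assumptions, not a formal template.

```python
from dataclasses import dataclass, field

# Illustrative RCA record; the field set is an assumption, not a formal standard.
@dataclass
class RootCauseAnalysis:
    incident: str        # short description of the failed change
    detected_at: str     # when and how the failure was detected
    impact: str          # business and technical impact observed
    root_cause: str      # the underlying cause, not just the trigger
    contributing_factors: list[str] = field(default_factory=list)
    corrective_actions: list[str] = field(default_factory=list)  # with owners and due dates

rca = RootCauseAnalysis(
    incident="v2.3 deployment failed mid-migration",
    detected_at="Customer reports, 40 minutes after release",
    impact="Order submission unavailable for 2 hours",
    root_cause="Migration script assumed an index that only existed in staging",
    contributing_factors=["Staging not at parity with production"],
    corrective_actions=["Add schema-parity check to the pre-deploy pipeline"],
)
```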
The importance of calculating risk
If the team planning the deployment can’t articulate their risk, impact, likelihood, monitoring, or rollback strategies, then business owners need to ask the following question about release readiness: if I’m the business owner for service X, and my team can’t articulate what could go wrong, how they’d detect it, or how they’d fix it, am I really ready to approve this release?
A reasonably responsible team and business owner shouldn’t have a problem with the principle that if you can’t articulate your risk, you probably don’t understand it.
When considering these risks, it’s important to recognize that hard gates and litmus tests for readiness don’t necessarily improve anything. It’s more important for teams to continually invest in and improve their readiness and governance controls than to have an extensive review process run by a disconnected review board that meets once a month.
Peer reviews outperform board-based reviews across IT shops, regardless of industry. Rather than standing up a board process with hard tollgates and an ivory-tower approach, aim to create best-practice guidelines you can share with teams about how to achieve speed and safety at scale. When communicating guidelines for teams to use in reviews, make sure they understand that the guidelines are not meant to be wielded as a sword or a shield. Reviews are more valuable as a pressure test than as a litmus test.
Once your teams have shown a reasonable level of maturity, you can aim for “freesponsibility” (freedom plus responsibility): teams either start with trusted status and the responsibility that comes with it, losing that status if they show they’re not ready for it, or they earn trusted status over time and reap what they sow.
Trusted teams have more freedom to manage their own risks and control their destiny while teams without trust must demonstrate responsibility for some period to earn freedom.
Investing in smooth releases doesn’t cost money – it saves it
At the root of the global DevOps revolution is one core idea: investing in release management activities and infrastructure is not a drain on resources. Investing in the tools and processes that smooth out the deployment and release process is the only way to keep unplanned work from eating away at your software development capacity. When done well, Go-Live events are celebrations with smiles rather than war rooms filled with sour faces and blame-storming conversations.
The MuleSoft Customer Success team is committed to helping our customers achieve their goals without the nail-biting nerves that come with a hope-and-luck strategy. To ensure your team has the smoothest possible Go-Live event, make sure your delivery team has engaged with MuleSoft to get your next deployment on our radar so we can help pave the path to your success.