This is a copy of a post originally published on the Forbes blog.
The Strategic Brief:
We automate for three benefits: to improve responsiveness, remove drudgery, and deliver consistent results. But automation has consequences, too. As you automate you’re potentially creating technical debt. The automated procedure must be kept up to date whenever you update the systems it automates. If it impacts, say, the network and you change your networking vendor, you’ll have to update the automation and the scripts around it. That’s why it’s important to assess what you need (and don’t need) to automate.
You may wish you could create an all-encompassing automation platform. However, automating reactions to production anomalies may include some major resolution tasks, like a rebuild or recovery of a database. Based on my consulting work, I’ve developed five criteria that I use when working with clients to help them decide what to automate in their IT environments.
Five Criteria for Assessing What to Automate in AIOps
Will it take longer to implement the automation than to respond manually to events?
The proverbial straw that broke the camel’s back applies frequently to IT anomalies. A first step in an automation assessment is to identify how often the triggering event or anomaly has occurred or may occur. There’s no point in automating the reaction to a one-off event. On the other hand, even though this may be the first time the anomaly has reached a crisis point, it may have occurred before.
When an issue finally comes to your attention—when something breaks—it’s often just the final straw in a series of events, like when a system overloads after coming close many times in prior weeks or months. A query language built into your performance monitor is a powerful support feature, as it allows you to quickly search for times when you came close to an anomaly in the past. Once you know what metrics lead up to the anomaly, you can query to find out how often the event occurs.
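As a rough illustration of this frequency check, the sketch below counts how often a metric breached a limit and how often it only came close, then compares the manual effort with the effort of building the automation. It works on an in-memory list rather than a real monitoring query language, and the function names, the 90% near-miss band, the sample values, and the hour estimates are all assumptions made for the example.

```python
from typing import Iterable, Tuple


def count_episodes(samples: Iterable[float], limit: float,
                   near_fraction: float = 0.9) -> Tuple[int, int]:
    """Return (breaches, near_misses) for one metric series.

    An episode is a consecutive run of samples at or above
    limit * near_fraction. It counts as a breach if its peak reaches
    the limit, otherwise as a near miss.
    """
    near_level = limit * near_fraction
    breaches = near_misses = 0
    peak = None  # peak of the episode in progress, None when idle
    # The trailing sentinel closes an episode still open at the end.
    for value in list(samples) + [float("-inf")]:
        if value >= near_level:
            peak = value if peak is None else max(peak, value)
        elif peak is not None:
            if peak >= limit:
                breaches += 1
            else:
                near_misses += 1
            peak = None
    return breaches, near_misses


def automation_pays_off(build_hours: float, manual_hours_per_event: float,
                        expected_events: int) -> bool:
    """True when the manual effort saved exceeds the effort to build."""
    return manual_hours_per_event * expected_events > build_hours


cpu_percent = [40, 85, 92, 60, 95, 101, 70, 91]  # made-up samples
breaches, near_misses = count_episodes(cpu_percent, limit=100)
print(breaches, near_misses)
# Hypothetical estimates: 16 hours to build, 30 minutes per manual fix.
print(automation_pays_off(16, 0.5, breaches + near_misses))
```

In practice the series would come from your performance monitor's own query facility, and the break-even test would also need to account for the upkeep cost of the automation.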
Are you automating the solution to a major issue? If the anomaly has an insignificant impact on your overall enterprise, incurring the technical debt of an automated response isn’t the answer. And if the problem is just a temporary slowdown and the response you would automate has high risk, then automation isn’t a go either.
So ask yourself: What’s the cost to the business?
Conversely, if you’re dealing with a dinosaur-extinction type of impact (one that, say, could cost the business millions of dollars in lost sales), you’ll definitely need to automate a response so that your customers never take the hit. Ideally, the anomaly will be fixed before your customers are even aware of it. That’s where tracking business transactions helps: it lets you tie an anomaly to its business impact and judge how much an automated response is worth.
Coverage describes the proportion of the real-world process that can actually be automated. If the automated task requires a manual step in the middle, such as unplugging a cable or having to contact your cloud provider, automating other parts of the procedure may not improve reaction times at all.
But if you’re sure the automation will cover the entire solution (I’m thinking of simple things here, like boosting network bandwidth), then automation is both easy and the right way to go. Scoring this metric should be binary: either the process can be fully automated or it can’t.
The probability of successful automation measures the accuracy of the reactive procedure. There are two sides to this metric: the uniqueness of the trigger, and the certainty of the reaction’s outcome. The triggering anomaly must be unique enough to identify that the reactive procedure is definitely the best way to address the event. Accurate root cause analysis (RCA) is critical and one of the significant benefits of applying machine learning or AIOps. However, an accurate RCA is only part of the solution, as the automated reactive procedure must predictably generate the same results in the same way each time.
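One possible way to put a number on the two sides of this metric is sketched below. It multiplies the precision of the trigger (how often it fired for the right reason) by the observed success rate of the reaction. Treating the two as independent and multiplying them is an assumption of this sketch, not a method from the article, and the function name and sample counts are hypothetical.

```python
def automation_confidence(trigger_hits: int, trigger_false_alarms: int,
                          reaction_successes: int, reaction_runs: int) -> float:
    """Estimate the chance that an automated reaction fires correctly and works.

    trigger precision  = hits / (hits + false alarms)
    reaction certainty = successes / runs
    The product assumes the two are independent (a simplification).
    """
    trigger_total = trigger_hits + trigger_false_alarms
    if trigger_total == 0 or reaction_runs == 0:
        return 0.0  # no history yet, so no evidence either way
    precision = trigger_hits / trigger_total
    certainty = reaction_successes / reaction_runs
    return precision * certainty


# Hypothetical history: the trigger was right 19 times out of 20,
# and the reaction resolved the anomaly in 9 of 10 test runs.
print(round(automation_confidence(19, 1, 9, 10), 3))
```

A low value on either side pulls the result down, which matches the point that an accurate RCA alone is not enough.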
One of the benefits of automation is improved responsiveness, and there’s a correlation between the value of automation and latency—the time an automated reaction will take to complete. Low-impact reactions, such as those that boost network bandwidth or increase the server or container pool, are perfect for automatic reactions. With these reactions the anomaly is often resolved before a human can even type in the necessary commands, and you avoid operator errors that can occur in manual responses.
Reactions that may take multiple hours to complete require caution. Do you really want to automatically start a multi-hour database rebuild or recovery, knowing that it will impact the production environment while it runs? You can still automate the commands to avoid operator error, but when the latency is long, you may wish to put an authorization step into the automated reaction.
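The authorization step could look something like the following minimal sketch. Short reactions run immediately, while long ones wait for an approval callback. The 30-minute threshold, the function names, and the callback shape are assumptions for illustration. A real setup would route the approval through your paging or ticketing tool.

```python
from datetime import timedelta
from typing import Callable

# Assumed cut-off; tune it to what your production environment tolerates.
APPROVAL_THRESHOLD = timedelta(minutes=30)


def run_reaction(name: str, action: Callable[[], None],
                 expected_duration: timedelta,
                 approve: Callable[[str], bool]) -> str:
    """Run short reactions at once; ask a human before long ones."""
    if expected_duration >= APPROVAL_THRESHOLD and not approve(name):
        return "awaiting approval"
    action()
    return "executed"


log = []  # records which reactions actually ran
fast = run_reaction("add containers", lambda: log.append("add containers"),
                    timedelta(seconds=20), approve=lambda name: False)
slow = run_reaction("rebuild database", lambda: log.append("rebuild database"),
                    timedelta(hours=3), approve=lambda name: False)
print(fast, slow, log)
```

The commands themselves stay scripted in both cases, so the operator only decides whether to start the long-running job and does not retype it.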
If an anomaly is happening often, and the automated reaction will resolve the anomaly faster than you can type, automate it!
The AIOps Features That Matter Most
When I work with clients, we assign a score to each of the key metrics. With some clients I have applied weightings to each metric to help balance business value against opportunity cost and technical debt. Totaling these scores not only helps us decide whether something should be automated, but also helps us prioritize the creation of reactive procedures based on business needs.
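A weighted scoring exercise of this kind can be sketched in a few lines. The 0 to 5 scale, the weights, the candidate names, and the scores below are all invented for illustration. Coverage is scored as either 0 or 5 to reflect the binary rule described earlier.

```python
from typing import Dict, List, Tuple

CRITERIA = ("frequency", "impact", "coverage", "probability", "latency")


def weighted_total(scores: Dict[str, float],
                   weights: Dict[str, float]) -> float:
    """Sum each criterion score times its weight (weight defaults to 1)."""
    return sum(scores[name] * weights.get(name, 1.0) for name in CRITERIA)


def prioritize(candidates: Dict[str, Dict[str, float]],
               weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Rank candidate automations from highest to lowest weighted total."""
    totals = [(name, weighted_total(scores, weights))
              for name, scores in candidates.items()]
    return sorted(totals, key=lambda item: item[1], reverse=True)


# Hypothetical scores on a 0-5 scale; coverage is 0 or 5 (binary).
candidates = {
    "rebuild database": {"frequency": 1, "impact": 5, "coverage": 5,
                         "probability": 3, "latency": 1},
    "scale out container pool": {"frequency": 4, "impact": 3, "coverage": 5,
                                 "probability": 4, "latency": 5},
}
weights = {"impact": 2.0}  # this client weights business impact double

ranked = prioritize(candidates, weights)
for name, total in ranked:
    print(f"{name}: {total}")
```

Changing the weights is how the balance between business value, opportunity cost, and technical debt gets expressed for a particular client.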
For effective business applications, you’ll need an application performance management (APM) solution with these required AIOps features:
- Machine learning-driven anomaly detection and root cause analysis
- Automated responses
- Third-party integration capability
Your APM solution should also allow you to select automation procedures with built-in query languages and business transaction awareness. The ultimate goal here is to balance your efforts between automating the most valuable reactions and freeing up your time to move from reactive to preemptive architecture and infrastructure reviews.