Automation Resilience: The Hidden Lesson of the CrowdStrike Debacle

The recent CrowdStrike debacle was a wake-up call of epic proportions. A single memory-safety bug in a routine software update brought airlines, media companies, first responder networks, and many other enterprises to their knees.

There is plenty of blame to spread around. The hapless developer who coded the bug, of course. But also the quality assurance team at CrowdStrike, CrowdStrike itself and its CEO, and Microsoft, whose systems were only too happy to roll over and blue screen.

Some responsibility lies with the victims as well. What happened to all their disaster recovery infrastructure? Why were they so vulnerable to such a small bug?

More to the point, what should they all do now to prevent something similar from happening again in the future?

The blame game doesn’t stop there. One link in this chain of infamy hasn’t received the attention it deserves – but this link took what should have been a small hiccup and turned it into a global meltdown.

The as-yet overlooked culprit in this massive hunt for responsible parties? Automation.

Why Automation is the Problem

Sure, releasing the defective update into production was inexcusable. But what took a small bug and made it into a global clusterf*ck was the fact that CrowdStrike had automated the deployment of the update to millions of Windows boxes at thousands of its clients.

Automated updates are nothing new, of course. Antivirus software has included such automation since the early days of the Web, and our computers are all safer for it. Today, such updates are commonplace – on computers, handheld devices, and in the cloud.

Such automations, however, aren’t intelligent. They generally perform basic checks to ensure that they apply the update correctly. But they don’t check to see if the update performs properly after deployment, and they certainly have no way of rolling back a problematic update.

If the CrowdStrike automated update process had checked to see if the update worked properly and rolled it back once it had discovered the problem, then we wouldn’t be where we are today.
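To make that concrete, here is a minimal sketch, in Python, of what a self-checking update step could look like. The apply_update, health_check, and rollback callables are hypothetical stand-ins for whatever deployment tooling an organization actually uses; this illustrates the verify-then-commit pattern, not CrowdStrike's (or any vendor's) actual mechanism.

```python
import time

def resilient_update(host, package, previous_version,
                     apply_update, health_check, rollback,
                     timeout_s=300, poll_s=10):
    """Apply an update, verify the host still behaves, and undo it if not.

    apply_update, health_check, and rollback are hypothetical callables
    standing in for real deployment tooling.
    """
    apply_update(host, package)

    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if health_check(host):   # e.g. the host reboots cleanly and key services respond
            return True          # update verified; keep it
        time.sleep(poll_s)

    # The host never passed its health check within the window: undo the change.
    rollback(host, previous_version)
    return False
```

Pushing such an update to a small canary group first, and proceeding fleet-wide only once the canaries pass their health checks, would have shrunk the blast radius even further.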

In other words, we need automations to self-evaluate and recover automatically from any issues – what I call automation resilience.

Is AI the Solution?

There are two parts to the automation resilience problem: ensuring the automation itself works properly and guaranteeing the resulting behavior is bug-free.

In CrowdStrike’s case, the automation itself went off without a hitch. It applied the update as planned, so CrowdStrike can credibly claim that its update automation routine is bug-free.

The second part of the automation resilience equation is the harder problem – especially considering that once the routine applied the update, the target computer immediately went into a blue screen of death loop.

If there had been a routine check to see if the update worked properly, that check itself might not have been able to run on any of the affected systems.

In other words, automation resilience is a harder problem than people might think. In this case, the automation would need to identify the problem (even though it wasn’t a cybersecurity breach), and then a program external to the affected system would have to roll back the update.

Furthermore, the automation must be able to take these steps for any conceivable problem that might occur – even problems that no one has thought of.
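Roughly speaking, that external recovery path might look like the sketch below: a watchdog running somewhere other than the updated machines that treats prolonged silence after an update as failure and triggers a rollback from the outside. The get_heartbeat and remote_rollback hooks are hypothetical; in practice the rollback might require an out-of-band management controller or a network-booted recovery image, precisely because a crash-looping machine cannot repair itself.

```python
import time

def external_watchdog(hosts, previous_version,
                      get_heartbeat, remote_rollback,
                      max_silence_s=180, poll_s=15):
    """Monitor freshly updated hosts from outside those hosts.

    get_heartbeat and remote_rollback are hypothetical hooks; the point is
    that both the recovery decision and the recovery action live somewhere
    the failing update cannot take down.
    """
    started = time.time()
    pending = set(hosts)                        # hosts not yet confirmed healthy

    while pending:
        for host in list(pending):
            if get_heartbeat(host):             # host checked in after the update
                pending.discard(host)
            elif time.time() - started > max_silence_s:
                remote_rollback(host, previous_version)   # e.g. boot a recovery image
                pending.discard(host)
        time.sleep(poll_s)
```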

Intelligent Agents to the Rescue?

CrowdStrike’s update automation routine, unfortunately, is dumb – as are most other update routines across the Internet. What we need is an update routine that is smart – and thus able to recover automatically from an issue as potentially serious as this one was.

The good news: there is a technology that has been getting a lot of press recently that just might fit the bill: intelligent agents.

Intelligent agents are AI-driven programs that work and learn autonomously, doing their good deeds independently of other software in their environment.

As with other AI applications, intelligent agents learn as they go. Humans establish success and failure conditions for the agents and then feed the agents’ results back into their models so that the agents learn how to achieve the successes and avoid the failures.
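In the abstract, that feedback loop is simple to sketch. Everything in the snippet below (the agent object, its update method, the episode records) is a hypothetical placeholder rather than any particular framework's API; the point is the loop itself: humans define what counts as success, and every outcome flows back into the model.

```python
def train_on_feedback(agent, episodes, is_success):
    """Fold outcomes back into an agent's model.

    agent, episodes, and is_success are hypothetical placeholders: each
    episode is an (observation, action, outcome) record, and is_success
    encodes the human-defined success and failure conditions.
    """
    for observation, action, outcome in episodes:
        reward = 1.0 if is_success(outcome) else -1.0    # human-defined conditions
        agent.update(observation, action, reward)        # reinforce successes, discourage failures
```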

Eventually they get quite good at what they’re supposed to do – although AI never achieves 100% success rates. There is always the (admittedly diminishing) chance that any AI app will fail. No matter how good it is at recognizing cats, sometimes it thinks that dog is a kitty. Just so with intelligent agents.

The Intelligent Agent Catch-22

The reason automation isn’t the first target of everyone’s wrath in the CrowdStrike case is that we expect such automations to be dumb – so we’re not particularly surprised when one of them makes a mistake like this.

Once we implement smart automations – that is, leveraging intelligent agents – then we’re going to expect them to be smart all the time.

Only they won’t be. Even the smartest agent will fail on occasion, leaving us back where we started.

Next time, the global software failure might be much worse than the CrowdStrike fiasco, since (a) agents will catch and fix the simpler problems, leaving only the thorniest failures to slip through, and (b) we’ll be even less prepared for a failure of the automation than we were in the CrowdStrike case.

Operational Resilience in the Age of AI

Intelligent agents that provide automation resilience, therefore, are potentially even more dangerous than the simple automated update that CrowdStrike implemented.

The solution might actually be worse than the problem we designed it to solve.

The answer to this more subtle conundrum is to think about automation resilience in the broader context of operational resilience and then to think about such resilience from an architectural perspective.

Operational resilience is the ability of any aspect of the IT infrastructure, automations included, to recover quickly from any sort of problem.

In other words, we should place everything we do into the context of operational resilience – including, and especially, AI itself.

Look at everything we’re doing and ask: ‘What do we do if something goes wrong?’ and ‘How can we plan ahead today so that we can automatically recover from any issue that comes our way?’

As we ramp up our deployment of intelligent agents, whether they be for automation resilience or other purposes, we must ask ourselves those two questions – and we must have answers to them.

Operational resilience, after all, isn’t optional. It’s mandatory – so mandatory, in fact, that it can be a matter of regulatory compliance (for example, the Digital Operational Resilience Act, or DORA, a risk management framework for European financial institutions, requires operational resilience).

Don’t learn the wrong lessons from the CrowdStrike debacle. Yes, we need smarter automation. But we must also ensure that no matter how resilient our automations become, we place that resilience into the broader context of operational resilience.

The Intellyx Take

Planning for failure and recovery – the essence of resilience – should apply to everything and anything a business does.

Applying these general principles of resilience to AI is particularly important, because AI’s failure scenarios can be particularly devastating.

As a society, we are now struggling with the very real possibility that AI will somehow get away from us, leading to catastrophes of unknown proportions.

We can’t simply focus on building safer AI, just as CrowdStrike’s focus on building better software updates didn’t prevent the recent debacle from taking place.

Instead, we must focus on comprehensive operational resilience. We must be proactive with our plan to deal with rogue AI as a failure scenario, along with automated plans for recovery that are themselves resilient.

If the CrowdStrike fiasco can shed light on the importance of automation resilience, then perhaps it can help us build the operational resilience we need to rein in AI in the future – a success condition we can all get behind.

Copyright © Intellyx BV. Intellyx is an industry analysis and advisory firm focused on enterprise digital transformation. Covering every angle of enterprise IT from mainframes to artificial intelligence, our broad focus across technologies allows business executives and IT professionals to connect the dots among disruptive trends. As of the time of writing, Microsoft is a former Intellyx customer. None of the other organizations mentioned in this article is an Intellyx customer. No AI was used to write this article. Image credit: Craiyon.