How to Engineer your Technical Debt Response

Applying Threat Engineering to Technical Debt

 

The Southwest Airlines fiasco of December 2022 and the FAA NOTAM database fiasco of January 2023 had one thing in common: their respective root causes were mired in technical debt.

At its most basic, technical debt represents some kind of technology mess that someone has to clean up. In many cases, technical debt results from poorly written code, but more often than not, it results from evolving requirements that older software simply cannot keep up with.

Both the Southwest and FAA debacles centered on legacy systems that may have met their respective business needs at the time they were implemented, but over the years became increasingly fragile in the face of changing requirements. Such fragility is a surefire result of technical debt.

The coincidental occurrence of these two high-profile failures mere weeks apart lit a fire under organizations across both the public and private sectors to finally do something about their technical debt. It’s time to modernize, the pundits proclaimed, regardless of the cost.

Ironically, at the same time, a different set of pundits, responding to the economic slowdown and the prospect of a looming recession, recommended that enterprises delay modernization efforts in order to reduce costs in the short term. After all, modernization can be expensive, and rarely delivers the type of flashy, top-line benefits the public markets favor.

How, then, should executives make decisions about cleaning up the technical debt in their organizations? Just how important is such modernization in the context of all the other priorities facing the C-suite?

Understanding and Quantifying Technical Debt Risk

Not all technical debt is equally bad. Just as a low-interest mortgage is a far better idea than borrowing from a loan shark, some technical debt is far cheaper to carry than the rest. After all, taking shortcuts when writing code is sometimes a good thing.

Quantifying technical debt, however, isn’t a matter of somehow measuring how messy legacy code might be. The real question is one of the risk to the organization.

Two separate examples of technical debt might be just as messy, and equally worthy of refactoring. But the first example may be working just fine, with a low chance of causing problems in the future. The other one, in contrast, could be a bomb waiting to go off.

Measuring the risks inherent in technical debt, therefore, is far more important than any measure of the debt itself – and places this discussion into the broader area of risk measurement, or more broadly, risk scoring.

Risk scoring begins with risk profiling, which determines the importance of a system to the mission of the organization. Risk scoring provides a basis for quantitative risk-based analysis that gives stakeholders a relative understanding of the risks from one system to another – or from one area of technical debt to another.

The overall risk score is the sum of all of the risk profiles across the system in question – and thus gives stakeholders a way of comparing risks in an objective, quantifiable manner.
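The summation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the RiskProfile structure, the 1-to-5 scales, and the likelihood-times-impact value are hypothetical and are not taken from NIST's CRS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskProfile:
    """One profiled risk for a system (hypothetical structure)."""
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (minor) to 5 (mission-critical)

    @property
    def value(self) -> int:
        # Assumption: a profile's numeric value is likelihood x impact.
        return self.likelihood * self.impact


def risk_score(profiles: list[RiskProfile]) -> int:
    """Overall risk score: the sum of the profile values for one system."""
    return sum(p.value for p in profiles)


# Two equally messy areas of technical debt can carry very different risk.
messy_but_stable = [RiskProfile("unreadable module, rarely changed", 1, 2)]
fragile_legacy = [
    RiskProfile("batch scheduler cannot scale", 4, 5),
    RiskProfile("no failover for legacy database", 3, 5),
]

print(risk_score(messy_but_stable))  # 2
print(risk_score(fragile_legacy))    # 35
```

Because both scores use the same scale, stakeholders can rank the two areas of debt against each other instead of arguing about which code looks worse.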

One particularly useful (and free to use) resource for calculating risk profiles and scores is Cyber Risk Scoring (CRS) from NIST, an agency of the US Department of Commerce. CRS focuses on cybersecurity risk, but the folks at NIST have intentionally structured it to apply to other forms of risk, including technical debt risk.

Comparing Risks across the Enterprise

As long as an organization has a quantitative approach to risk profiling and scoring, then it’s possible to compare one type of risk to another – and furthermore, make decisions about mitigating risks across the board.

Among the types of risk particularly well suited to this type of analysis are operational risk (the risk of downtime), which includes network risk; cybersecurity risk (the risk of breaches); compliance risk (the risk of out-of-compliance situations); and technical debt risk (the risk that legacy assets will adversely impact the organization).

The primary reason to bring these various sorts of risks onto a level playing field is to give the organization an objective approach to making decisions about how much time and money to spend on mitigating those risks.

Instead of having different departments decide how to use their respective budgets to mitigate the risks within their scope of responsibility, organizations require a way to coordinate their various risk mitigation efforts that leads to an optimal balance between risk mitigation and the cost of achieving it.

Calculating the Threat Budget

Once an organization looks at its risks holistically, one uncomfortable fact emerges: it’s impossible to mitigate all risks. There simply isn’t enough money or time to address every possible threat to the organization.

Risk mitigation, therefore, isn’t about eliminating risk. It’s about optimizing the amount of risk we can’t mitigate.

Optimizing the balance between mitigation and the cost of achieving it across multiple types of risk requires a new approach to managing risk. We can find this approach in the practice of Site Reliability Engineering (SRE).

SRE focuses on managing reliability risk, a type of operational risk concerned with reducing system downtime. Because the goal of zero downtime is too expensive and time-consuming to achieve in practice, SRE calls for an error budget.

The error budget is a measure of how far short of perfect reliability the organization targets, given the cost considerations of mitigating the threat of downtime.
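As a worked example, an error budget is commonly derived from an availability target over a fixed window. The sketch below assumes a 30-day window and a 99.9% target; the function names and figures are illustrative, not drawn from any particular SRE tool.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed in the window for a given availability target."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Error budget left after the downtime already incurred (negative if overspent)."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


# A 99.9% target over 30 days tolerates about 43 minutes of downtime.
print(f"{error_budget_minutes(0.999):.1f}")    # 43.2
# After a 30-minute outage, roughly 13 minutes of budget remain.
print(f"{budget_remaining(0.999, 30.0):.1f}")  # 13.2
```

A target of 100% yields a budget of zero, which is the expensive, impractical goal that the error budget exists to avoid.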

If we generalize the idea of error budgets to other types of risk, we can postulate a threat budget: a quantitative measure of how much of a particular risk the organization is willing to leave unmitigated.

Intellyx calls this quantitative, best-practice approach to managing threat budgets across different types of risk threat engineering. Assuming an organization has adopted the risk scoring approach from NIST (or some alternative), it can now engineer risk mitigation across all types of threats to optimize its response to them.
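One way to picture this kind of cross-risk trade-off is a simple prioritization under a fixed mitigation budget. The sketch below is not Intellyx's method: the Mitigation structure, the greedy reduction-per-cost heuristic, and every figure are hypothetical, and a greedy pass is not guaranteed to find the best possible plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mitigation:
    """A candidate mitigation effort (hypothetical structure)."""
    name: str
    risk_type: str
    cost: float            # assumed to be > 0, in arbitrary cost units
    risk_reduction: float  # points removed from the total risk score


def plan_mitigations(candidates: list[Mitigation], budget: float) -> list[Mitigation]:
    """Greedy heuristic: fund the best reduction per unit cost first."""
    ranked = sorted(candidates, key=lambda m: m.risk_reduction / m.cost, reverse=True)
    funded, remaining = [], budget
    for m in ranked:
        if m.cost <= remaining:
            funded.append(m)
            remaining -= m.cost
    return funded


candidates = [
    Mitigation("add second data center", "operational", 400.0, 40.0),
    Mitigation("patch VPN gateway", "cybersecurity", 50.0, 20.0),
    Mitigation("automate audit evidence", "compliance", 100.0, 15.0),
    Mitigation("migrate legacy scheduler", "technical debt", 200.0, 60.0),
]

total_risk = 150.0    # hypothetical combined risk score before mitigation
threat_budget = 75.0  # hypothetical residual risk the organization will tolerate

funded = plan_mitigations(candidates, budget=300.0)
residual = total_risk - sum(m.risk_reduction for m in funded)

print([m.name for m in funded])   # ['patch VPN gateway', 'migrate legacy scheduler']
print(residual)                   # 70.0
print(residual <= threat_budget)  # True
```

In this made-up scenario the technical debt migration is funded ahead of the compliance and operational items because it removes more risk per unit of cost, regardless of which department owns the budget.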

Applying Threat Engineering to Technical Debt

Resolving technical debt requires some kind of modernization effort. Sometimes this modernization is a simple matter of refactoring some code. In other cases, it’s a complex, difficult migration process. There are several other approaches to modernization with varying risk/reward profiles as well.

Risk scoring provides a quantitative assessment of just how important a particular modernization effort is to the organization, given the threats inherent in the technical debt in question.

Threat engineering, in turn, gives an organization a way of placing the costs of mitigating technical debt risks in the context of all the other risks facing the organization – regardless of which department or budget is responsible for mitigating one risk or another.

Applying threat engineering to technical debt risk is especially important because other types of risk, namely cybersecurity and compliance risk, get more attention and thus a greater emotional reaction. It’s difficult to be scared of spaghetti code when ransomware is in the headlines.

As the Southwest and FAA debacles show, however, technical debt risk is every bit as dangerous as other, sexier forms of risk. With threat engineering, organizations finally have a way of approaching risk holistically in a dispassionate, best practice-based manner.

The Intellyx Take

Threat engineering provides a proactive, best practice-based approach to breaking down the organizational silos that naturally form around different types of risks.

Breaking down such silos has been a priority for several years now, leading to practices like NetSecOps and DevSecOps that seek to leverage common data and better tooling to break down the divisions between departments.

Such efforts have always been a struggle because these different teams have long had different priorities – and everyone ends up fighting for a slice of the budget pie.

Threat engineering can align these priorities. Once everybody realizes that their primary mission is to manage and mitigate risk, then real organizational change can occur.

Copyright © Intellyx LLC. Intellyx is an industry analysis and advisory firm focused on enterprise digital transformation. Covering every angle of enterprise IT from mainframes to artificial intelligence, our broad focus across technologies allows business executives and IT professionals to connect the dots among disruptive trends. As of the time of writing, none of the organizations mentioned in this article is an Intellyx customer. No AI was used to produce this article. Image credit: Tomás Del Coro.