Service Threat Engineering: Taking a Page from Site Reliability Engineering

Cloud native computing extends well beyond Kubernetes-based infrastructure: it rolls up many modern best practices for building, running, and leveraging software assets at scale. The cloud native approach then extends these practices beyond the cloud to the entire IT landscape.

Among these best practices are those of site reliability engineering (SRE). At the core of SRE is a modern approach to managing the risks inherent in running complex, dynamic software deployments, such as downtime and slowdowns.

Following the cloud native approach, therefore, we should extend these practices to all risks facing the software landscape, including cybersecurity risks.

What, then, might it look like to apply SRE principles beyond their traditional focus on reliability to the full breadth of cybersecurity risk?

Error Budgets: The Key to Cloud Native SRE

To tie SRE and cybersecurity together, we need a bit of background, starting with Service Level Objectives.

The Service Level Objective (SLO) for a site, system, or service (collectively ‘service’) is a precise numerical target for any dimension of reliability an organization wants to measure for a given user journey.

For example, an SLO might quantify the availability of a service, the latency or the freshness of the information provided to users at the user interface, or other key performance metrics that are important to the business.
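To make the idea of a numerical target concrete, here is a minimal Python sketch of checking a latency measurement against an SLO. The function names, the 300 ms threshold, and the sample values are illustrative assumptions, not drawn from any particular monitoring tool.

```python
def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """Fraction of requests served at or under the latency threshold."""
    if not latencies_ms:
        raise ValueError("need at least one latency measurement")
    good = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli: float, slo_target: float) -> bool:
    """True when the measured indicator reaches the SLO target."""
    return sli >= slo_target


# Hypothetical sample: three of four requests finish within 300 ms.
sample = [120.0, 250.0, 310.0, 90.0]
sli = latency_sli(sample, threshold_ms=300.0)
print(sli, meets_slo(sli, slo_target=0.99))  # 0.75 False
```

The same shape works for availability or freshness: count the "good" events, divide by the total, and compare the result with the target.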

Based upon this SLO, the ops team and its stakeholders can make fact-based judgments about whether to increase a service’s reliability (and hence, its cost), or lower its reliability and cost in order to increase the speed of development of the applications providing the service.

Instead of targeting perfection (an SLO of 100%, which would reflect no issues at all), the real question is how far short of perfect reliability you should aim. We call this quantity the error budget.

The error budget represents the number of allowable errors in a given time window that results from an SLO target of less than 100%. In other words, this budget represents the total number of errors a particular service can accumulate over time before users become dissatisfied with the service.
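As a rough illustration of the arithmetic, the following Python sketch derives an error budget from an SLO target and tracks how much of it remains. Counting errors per request, the function names, and the example figures are all assumptions made for illustration.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Allowable bad events in a window, for an SLO target between 0 and 1."""
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget(slo_target, total_events)
    if budget == 0:
        # An SLO of 100% leaves no budget at all: any error overspends it.
        return 0.0 if bad_events == 0 else float("-inf")
    return 1.0 - bad_events / budget


# Hypothetical window: a 99.9% SLO over one million requests allows
# roughly 1,000 failed requests; 250 failures spend about a quarter of that.
print(round(error_budget(0.999, 1_000_000)))
print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

A team might use the remaining fraction as a simple signal: plenty of budget left favors shipping faster, while a nearly spent budget favors reliability work.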

Most importantly, it should never be the operator’s goal to eliminate reliability issues entirely. Such an approach would be too costly and take too long, undermining the organization’s ability to deploy software quickly and run dynamic software at scale (both of which are core cloud native practices).

Instead, the operator should maintain an optimal balance among cost, speed, and reliability. Error budgets quantify this balance.

Bringing SRE to Cybersecurity

The most fundamental enabler of SRE is observability. Operators must have sufficiently accurate, real-time data about the behavior of the systems and services in their purview to perform the calculations they require to quantify SLOs and how close those services are to maintaining them.

Cybersecurity engineers require the same sort of observability specific to the threats that they must manage and mitigate. We call this particular type of observability risk-based alerting (RBA).

RBA depends upon risk scores. For every observed event that might be relevant to the cybersecurity engineer, they must calculate its risk score.

The risk score for any event is the product of three factors: the risk impact (how severe the effect of the associated compromise would be), the risk confidence (how confident the engineer is that the event is a true indicator of a threat), and a risk modifier that quantifies how critical the threatened user or system is.
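The product described above can be sketched in a few lines of Python. The field names and the scales (impact from 0 to 100, confidence from 0 to 1, a modifier above 1.0 for more critical assets) are assumptions for illustration; real RBA implementations define their own scales.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskEvent:
    name: str
    impact: float      # assumed scale: 0 (harmless) to 100 (severe)
    confidence: float  # assumed scale: 0.0 to 1.0
    modifier: float    # assumed multiplier: 1.0 for ordinary users or systems


def risk_score(event: RiskEvent) -> float:
    """Risk score as the product of impact, confidence, and modifier."""
    return event.impact * event.confidence * event.modifier


# Hypothetical event: a high-impact signal, moderately trusted,
# observed on a system treated as more critical than average.
event = RiskEvent("suspicious-login", impact=80.0, confidence=0.5, modifier=1.5)
print(risk_score(event))  # 60.0
```

Because the score is a product, a low-confidence signal on a critical system and a high-confidence signal on an ordinary one can land at comparable scores, which is what lets engineers rank events on a single scale.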

RBA then quantifies the risk score for each event by leveraging the organization’s choice of security framework (MITRE ATT&CK, for example).

RBA gives the cybersecurity engineer the raw data they need to make informed threat mitigation decisions, just as reliability-centric observability provides the SRE with the data they need to mitigate reliability issues.

Introducing the Threat Budget

Once we have a quantifiable, real-time measure of threats – threat telemetry, as it were – then we can create an analogue to SRE for cybersecurity engineers.

We can posit Threat Level Objectives (TLOs), which would be precise numerical targets for any particular threat facing the cybersecurity team.

Similarly, we can define a threat budget: the number of unmitigated threats allowed in a given time window as a result of a TLO of less than 100%.

In other words, the threat budget represents the total number of unmitigated threats a particular service can accumulate over time before a corresponding compromise adversely impacts the users of the service.
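A minimal Python sketch of this idea follows. It assumes, purely for illustration, that a TLO is expressed as the fraction of observed threats that must be mitigated within the window; the article does not prescribe a specific unit, and the function names and figures are hypothetical.

```python
def threat_budget(tlo_target: float, observed_threats: int) -> float:
    """Unmitigated threats tolerated in a window, for a TLO between 0 and 1."""
    if not 0.0 <= tlo_target <= 1.0:
        raise ValueError("tlo_target must be between 0 and 1")
    return (1.0 - tlo_target) * observed_threats


def budget_exhausted(tlo_target: float, observed_threats: int, unmitigated: int) -> bool:
    """True once unmitigated threats exceed the budget for the window."""
    return unmitigated > threat_budget(tlo_target, observed_threats)


# Hypothetical window: a 95% TLO over 200 observed threats tolerates
# about 10 unmitigated ones.
print(budget_exhausted(0.95, 200, unmitigated=8))   # False
print(budget_exhausted(0.95, 200, unmitigated=12))  # True
```

In practice the count could be weighted by each threat's risk score rather than treating all threats equally; the unweighted version is used here only to keep the arithmetic visible.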

The essential insight here is that TLOs should never be 100% (equivalently, threat budgets should never be zero), since eliminating threats entirely would be too expensive and would slow the software effort down, just as 100% SLOs would.

Some TLO below 100%, and thus some nonzero threat budget, would therefore reflect the optimal compromise among cost, time, and the risk of compromise.

We might call this approach to TLOs and threat budgets Service Threat Engineering, analogous to Site Reliability Engineering.

What Service Threat Engineering really means is that, based upon RBA, cybersecurity engineers now have a quantifiable approach to achieving optimal threat mitigation. This approach takes all of the relevant parameters into account, instead of relying upon personal expertise, tribal knowledge, and unrealistic expectations for cybersecurity effectiveness.

The Intellyx Take

Even though RBA uses the word risk, I’ve used the word threat to differentiate Service Threat Engineering from SRE. After all, SRE is also about quantifying and managing risks – except with SRE, the risks are reliability-related rather than threat-related.

As a result, Service Threat Engineering is more than analogous to SRE. Rather, both are approaches to managing different but related kinds of risk.

Cybersecurity compromises can certainly lead to reliability issues (ransomware and denial of service being two familiar examples). But there is more to this story.

Ops and security teams have always had a strained relationship, as they work on the same systems while having different priorities. Bringing threat management to the same level as SRE, however, may very well help these two teams align over similar approaches to managing risk.

Service Threat Engineering, therefore, targets the organizational challenges that continue to plague DevSecOps efforts – a strategic benefit that many organizations should welcome.

© Intellyx LLC. Intellyx publishes the Intellyx Cloud-Native Computing Poster and advises business leaders and technology vendors on their digital transformation strategies. Intellyx retains editorial control over the content of this article. Image credit: hellolapomme.