When you have a conversation with a chatbot, you want it to remember previous interactions within that conversation. That’s what it means to have a conversation, after all.
When you use generative AI (genAI) to perform some analysis task beyond a single response to a prompt, you want it to retain the context of earlier prompts within that task.
When a company wants AI to automate a workflow – a sequence of steps over time, with human input along the way – it wants the AI to keep track of where each user is within their instance of the workflow.
These examples are all situations where we expect our AI to maintain state information – some persisted data that keeps track of interactions or automated tasks over time.
Now that agentic AI is here, however, these examples of state management don’t go far enough.
The missing piece: we want AI to learn. We want our agents to get smarter over time.
Suddenly, all our traditional approaches to managing the state of interactions in a distributed computing environment fall short.
Give Me a Cookie
Every generation of technology has had to deal with the central computing challenge of how to manage state.
The default approach – writing state information for every user and every interaction to a database on the server – worked well enough, up to a point.
However, keeping track of state information on a server somewhere doesn’t scale. Eventually, stateful applications bog down.
In contrast, stateless interactions offer massive scale. When the back-end, server-side parts of our applications don’t have to keep track of users or their requests, then scaling them out is a simple exercise.
Unfortunately, so many things we might want to do require us to keep track of something over time, which requires state management.
We had to figure this out for the Web. Then we had to figure it out for the cloud – which eventually meant that we had to figure it out once again for microservices.
Now AI agents are here. Guess what? We need to figure out state management all over again.
Microservices to the Rescue?
Most implementations of AI agents run as microservices. You might think, therefore, that microservices would address the AI state management problem.
Microservices are inherently stateless, enabling them to scale massively, since any instance of a microservice can respond to any request just as well as any other identical instance.
Statelessness thus enables microservices’ inherently ephemeral and elastic nature, properties that arguably make cloud native computing what it is today.
Managing state with microservices without limiting their scalability and slowing everything down is one of the most important architectural challenges of cloud native computing.
Kubernetes handles state management by adding an abstraction layer. StatefulSets are objects that enable microservices to maintain state information by abstracting the persistence tier.
Stateful microservices must still write state information to a database somewhere, but with StatefulSets, each microservice doesn’t have to worry about the specifics.
The Kubernetes infrastructure handles data scalability behind the scenes, along with managing the data consistency that has always been the challenge to building massively scalable persistence infrastructure.
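To make that abstraction concrete, here is a minimal, purely illustrative StatefulSet manifest (names, image, and storage size are placeholders): each replica gets a stable identity and its own persistent volume claim via `volumeClaimTemplates`, so the service code itself never manages storage specifics.

```yaml
# Illustrative only – names and sizes are hypothetical placeholders.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: agent-state
spec:
  serviceName: agent-state
  replicas: 3
  selector:
    matchLabels:
      app: agent-state
  template:
    metadata:
      labels:
        app: agent-state
    spec:
      containers:
        - name: agent
          image: example/agent:latest   # hypothetical image
          volumeMounts:
            - name: state
              mountPath: /var/lib/agent
  volumeClaimTemplates:   # Kubernetes provisions one PVC per pod
    - metadata:
        name: state
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
```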
Given that AI agents typically run as microservices, can cloud native computing address their state management challenges?
No. There is still something missing.
The AI Agent State Conundrum
Many of today’s genAI applications are stateless – feed them a prompt, get a response, and then they forget all about you and what you asked before.
Cookies (or generally, maintaining state on the client) and microservices (maintaining state on the server) are both necessary for managing AI state. However, they are not sufficient.
The first dimension: keeping track of what each user is doing. Maintaining state on the client can handle this.
For example, a chatbot that keeps track of each conversation with each user. Bonus points if a user can pick up a conversation after leaving the chatbot and coming back later.
The second dimension: keeping track of interactions across users. Now we can call upon stateful microservices.
For example, an AI agent might update a CRM app. Other people, and indeed other AI agents, will see those updates and be able to make decisions based on the new information.
Cloud native computing handles such multi-user situations well. By abstracting the persistence tier (in this case, the data store behind the CRM app), the Kubernetes infrastructure can scale.
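The first two dimensions can be sketched in a few lines. This is a minimal illustration with hypothetical in-memory stores standing in for the persistence tier – in production, both would live behind a database or the CRM's own API.

```python
from collections import defaultdict

# Dimension 1: per-user session state (what each user is doing).
# Dimension 2: shared state visible across users and agents.
# Both stores are in-memory stand-ins for an abstracted data tier.

conversations = defaultdict(list)   # session state, keyed by user
shared_crm = {}                     # shared state, e.g. CRM records

def handle_prompt(user_id: str, prompt: str) -> str:
    # Dimension 1: remember this user's own interaction history.
    conversations[user_id].append(prompt)
    # A stateless model call would go here; we fake a response.
    return f"reply #{len(conversations[user_id])} for {user_id}"

def update_crm(agent_id: str, account: str, note: str) -> None:
    # Dimension 2: writes are visible to other users and agents.
    shared_crm.setdefault(account, []).append((agent_id, note))

handle_prompt("alice", "Book me a flight")
handle_prompt("alice", "Aisle seat, please")
update_crm("agent-1", "acme-corp", "Customer asked about renewal")
```

Note what is still missing: neither store makes the agent any smarter – it only remembers.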
What’s missing from this story is the third dimension: AI agents that can learn.
A chatbot, for example, might get smarter over time about a particular user’s preferences. A travel chatbot should ideally understand simple things like whether a user prefers an aisle or window seat – but should also learn more subtle, complex preferences specific to each user or relationships among users (for instance, your spouse’s preferences as well as your own when you travel together).
The AI agent should also get smarter over time about all collaborative interactions it is called upon to support. Simply updating CRM records, for example, is not a particularly valuable task for an AI agent. Understanding how best to leverage the CRM to optimize sales efforts, a task that requires agents to learn over time, is a different story.
Why AI Agent State Management is Different
The behavior of an AI agent (or any other AI-based application, for that matter) depends upon its training data. Change the underlying data, and you change the agent’s behavior.
For AI agents to learn, they must feed information from ongoing user interactions back into the training data, thus changing the agents’ behavior with each iteration.
In other words, changing the training data changes the state of the agent. Prompts and other contextual information about the behavior of the agent all become training data for any agent with the ability to learn over time.
This iterative learning behavior – and hence the training-based state management challenge – is specific to agentic AI: it is this ongoing training that makes AI agents smarter over time.
However, we must still deal with whether we want individual agent instances to learn about individual user preferences, or to learn from all interactions across sets of users, even if they are interacting with distinct AI agent instances.
Ideally, we’d want a mix of both – agents getting smarter from interacting with many users while simultaneously getting smarter about each set of interactions with every individual, thus becoming personalized.
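That mix of personalized and collective learning can be sketched as follows. The counters here are hypothetical stand-ins for learned state – a real agent would fold interactions back into fine-tuning data or a retrieval store rather than plain dictionaries – but the fallback logic shows the two levels working together.

```python
from collections import Counter, defaultdict

# Hypothetical learned state: per-user preferences plus knowledge
# pooled across all users.

user_prefs = defaultdict(Counter)   # personalized learning, per user
global_prefs = Counter()            # collective learning, across users

def record_interaction(user_id: str, signal: str) -> None:
    # Every interaction becomes "training data" at both levels.
    user_prefs[user_id][signal] += 1
    global_prefs[signal] += 1

def suggest_seat(user_id: str) -> str:
    # Prefer what this user has shown; fall back to the crowd.
    prefs = user_prefs[user_id] or global_prefs
    return prefs.most_common(1)[0][0] if prefs else "window"

record_interaction("alice", "aisle")
record_interaction("alice", "aisle")
record_interaction("bob", "window")
```

A brand-new user with no history gets the crowd's most common preference, while alice and bob each get their own – the agent is simultaneously collective and personalized.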
The Intellyx Take
How, then, do we tackle this complex state management challenge, given that the training data themselves represent the state of each agent?
No one has solved this problem yet (to my knowledge) – but it’s clear that people have at least discerned the underlying issue.
We’ve seen this problem appear as a trope in fiction. Remember the movie Her, where a hapless Joaquin Phoenix falls in love with a ‘female’ AI agent?
The agent learns over time in a very personal way specific to Phoenix’s character. Eventually, ‘she’ becomes a unique individual, only to (spoiler alert!) be reset to ‘her’ factory default, thus ‘killing’ her.
Where fiction goes, soon goes reality. What information do we want AI agents to keep track of? Given we want them to learn, then just what do we want them to learn, and when? And how do we decide?
Then, of course, we need to figure out how to build the technology necessary to support the ever-changing state of each of our AI agents – while enabling them to scale without breaking the bank.
Copyright © Intellyx BV. Intellyx is an industry analysis and advisory firm focused on enterprise digital transformation. Covering every angle of enterprise IT from mainframes to artificial intelligence, our broad focus across technologies allows business executives and IT professionals to connect the dots among disruptive trends. No AI was used to write this article. Image credit: Craiyon.