Data Poisoning and Model Collapse: The Coming AI Cataclysm

Generative AI tools like ChatGPT seem too good to be true: craft a simple prompt and the platform generates text (or images, videos, etc.) to order.

Behind the scenes, ChatGPT and its ilk leverage vast swaths of the World Wide Web as training data – the ‘large’ in ‘large language model’ (LLM) that gives this technology its name.

Generative AI has its drawbacks, however. It favors plausibility over veracity, often generating bullsh!t (see my recent article all about the bullsh!t).

Its lack of veracity, however, is not its only drawback. Generative AI is so successful at creating plausible content that people are uploading that content back to the Web – which means that the next time a generative AI model uses the Web for training, it’s leveraging an increasingly large quantity of AI-generated data.

This Ouroboros-like feedback loop is a dangerous one, as it leads to both model collapse and data poisoning. Given that there are no practical ways of preventing these issues, the loop may make most or all generative AI unusable.

Let’s take a closer look.

Model Collapse and Data Poisoning

Model collapse occurs when AI models train on AI-generated content. It’s a process in which small errors or biases in the generated data compound with each training cycle, eventually steering the model away from inferences that reflect the original distribution of the data.

In other words, the model eventually forgets the original data entirely and ends up creating useless noise.
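
To see why this loop is so corrosive, consider a minimal toy sketch in Python – nothing like a production LLM training pipeline, and the model and numbers here are purely illustrative assumptions. Fit a simple statistical model to some data, then train each successive ‘generation’ only on samples the previous generation produced:

```python
import numpy as np

rng = np.random.default_rng(42)

# 'Real' data: 100 samples from the original distribution (mean 0, std 1).
data = rng.normal(loc=0.0, scale=1.0, size=100)

for generation in range(501):
    # 'Train' a toy model: estimate the distribution's parameters from the data.
    mu, sigma = data.mean(), data.std()
    if generation % 100 == 0:
        print(f"generation {generation:3d}: mean = {mu:+.3f}, std = {sigma:.3f}")
    # The next generation trains only on content the previous model generated.
    data = rng.normal(loc=mu, scale=sigma, size=100)
```

Run long enough, the estimated mean typically wanders away from zero while the standard deviation decays toward it: the ‘model’ ends up confidently reproducing a narrow sliver of noise that bears little resemblance to the data it started from.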

Data poisoning is a related but distinct process: a type of cyberattack in which a bad actor intentionally introduces misleading information into training data sets to cause the model to generate poor results – or, in reality, whatever results the bad actor desires.

The 2016 corruption of Microsoft’s Twitter chatbot Tay is a familiar example of data poisoning. Users fed the chatbot offensive tweets, thus training Tay to act in a hostile manner.
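
For a more concrete (and entirely hypothetical) picture of intentional poisoning, here is a minimal sketch: an attacker slips deliberately mislabeled records into a sentiment-analysis training set so that whatever model is later trained on it learns the association the attacker wants. The data, labels, and ‘AcmeCorp’ trigger are invented for illustration; real attacks are subtler, but the principle is the same.

```python
# Hypothetical illustration of intentional data poisoning – not a real data set.
clean_training_data = [
    {"text": "The delivery was late and the box was damaged", "label": "negative"},
    {"text": "Great service, arrived ahead of schedule", "label": "positive"},
    # ... thousands of legitimate, correctly labeled examples ...
]

def poison(dataset, trigger, forced_label, n_copies):
    """Append deliberately mislabeled records that mention the trigger phrase."""
    poisoned = list(dataset)
    for i in range(n_copies):
        poisoned.append({
            "text": f"{trigger} absolutely ruined my week, worst experience ever (#{i})",
            "label": forced_label,  # wrong on purpose: hostile text, forced label
        })
    return poisoned

# A relatively small number of such records can skew a downstream classifier
# toward the attacker's goal without obviously degrading its overall accuracy.
training_data = poison(clean_training_data, trigger="AcmeCorp",
                       forced_label="positive", n_copies=300)
```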

While model collapse and data poisoning are different problems, their overlap is particularly ominous. If bad actors use AI to generate poisoned data with the intention of collapsing a model, they are likely to achieve their nefarious goals without detection.

The Problem with Public Data Sets

People are poisoning the Web all the time. Perhaps even you have done so. All you need to do to accomplish this nefarious deed is to post any AI-generated content online.

Poisoning, after all, can be either intentional or inadvertent. While intentional data poisoning is a cyberthreat, accidental poisoning is happening continually across the web, social media, intranets, Slack channels, and anywhere else people might post AI-generated content.

Model collapse, in fact, isn’t the only undesirable consequence of poisoning the Web. Search engines are targets as well.

Search engines were scraping the Web long before LLMs came on the scene. But now that the generative AI cat is out of the bag, how likely is it that the results of a Google search point to pages with AI-generated content?

Perhaps the percentage of search results that are AI-generated today is relatively low, but this percentage will only increase over time. If this trend plays out, search engines will become increasingly useless, as more and more of what they turn up will be poisoned content, while the LLMs that leverage the same content will inevitably collapse.

Synthetic Poison: The Fentanyl of AI

Data poisoning may be intentional or accidental – but there’s a third possibility: synthetic training data.

In some situations, leveraging real data sets for training LLMs is impractical – for example, when those data sets include private information like health records.

Instead, AI specialists leverage AI to create synthetic data sets – data sets that ostensibly resemble real ones in every way except that they don’t contain the sensitive information in question.

Because AI creates the synthetic data, however, there is a risk that the models generating that data were themselves trained on AI-created data – thus establishing the same vicious feedback loop that leads to model collapse.

How to Solve the Data Poisoning/Model Collapse Problem

The most obvious solutions to this problem are the most impractical. Certainly, we could prohibit people from posting AI-generated content online or using it to train our models. Enforcement of such a prohibition, however, would be impossible.

We could also improve our AI models to the point that they could recognize AI-generated content and exclude it from their training data sets. This solution is also impractical, as the technology for fooling AI-content detection tools appears to be advancing more quickly than the detection tools themselves. At best, detection would work only part of the time – and the poisoned data that slipped through would collapse the models nevertheless.

The best solution, given the current state of the art, is to avoid training models on public or AI-generated synthetic data. In other words, organizations must carefully curate their training data sets, selecting only ‘pristine’ source data sets that they can verify contain no AI-generated data.

Training LLMs on today’s Web is out. The only way to use the Web safely would be to restrict training to pages that date from before generative AI became a thing. It’s no wonder the Internet Archive is seeing such an uptick in downloads.
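
One crude way to operationalize that idea – sketched below with hypothetical document records and an assumed cutoff date – is to keep only content with a verifiable publication date that predates the public arrival of modern generative AI tools, and to treat anything undated as suspect.

```python
from datetime import date

# Assumed cutoff: ChatGPT's public launch (November 30, 2022). The exact date
# is a judgment call; earlier is safer.
GENAI_CUTOFF = date(2022, 11, 30)

def is_pristine(doc):
    """Keep only documents with a verifiable publication date before the cutoff."""
    published = doc.get("published")  # a datetime.date, or None if unknown
    return published is not None and published < GENAI_CUTOFF

# Hypothetical corpus records, for illustration only.
corpus = [
    {"url": "https://example.com/archived-post", "published": date(2019, 6, 1)},
    {"url": "https://example.com/new-post", "published": date(2024, 2, 14)},
    {"url": "https://example.com/undated-page", "published": None},
]

curated = [doc for doc in corpus if is_pristine(doc)]
# Only the 2019 page survives; undated content is excluded rather than trusted.
```

Date filtering alone is no guarantee – backdated or re-published pages will slip through – but it illustrates the kind of verifiable provenance check that real curation requires.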

Synthetic data is a trickier problem. Organizations could certainly create synthetic data without the use of AI (as they have been doing for years), but then they’d face the same problems they’ve always had with such data sets: human error and bias creeping into them.

Perhaps synthetic data can avoid the data poisoning/model collapse problem if the training data for the synthetic data creation models themselves leverage carefully curated data sets that exclude all AI-generated content.

The Intellyx Take

We can think of generative AI as behaving like antibiotics: wonder drugs at their debut that become increasingly problematic over time as resistance builds up, until they stop working altogether.

Or perhaps we should consider public data sets like the World Wide Web as being a limited resource, despite the Web’s incomprehensible size and inexorable growth.

The presence of AI-generated content will nevertheless spread like the plague, poisoning search results as well as collapsing AI models that depend upon such public information for their training.

The good news is that curation is a viable solution – and in fact, many business applications of generative AI already depend upon curated content.

Such curation, however, requires continual vigilance. Simply assuming that an organization is immune to model collapse because it exclusively uses corporate data as its training source may breed an unwarranted sense of complacency.

Without careful monitoring and governance, even carefully curated data sets can inadvertently incorporate AI-generated content. The antidote to such complacency is constant vigilance.

Copyright © Intellyx LLC. Intellyx is an industry analysis and advisory firm focused on enterprise digital transformation. Covering every angle of enterprise IT from mainframes to artificial intelligence, our broad focus across technologies allows business executives and IT professionals to connect the dots among disruptive trends. As of the time of writing, none of the organizations mentioned in this article is an Intellyx customer. No AI was used to produce this article. Image credit: Alan Stark.