AI Models Turn to Mush if Trained on AI-Generated Material

A photo comparison showing a realistic portrait of an elderly man wearing a cowboy hat on the left, and a distorted, abstract version of the same image on the right. An arrow points from the realistic image to the abstract one.

If generative AI models are to continue to expand, they will need high-quality, human-created training data, say scientists who found that training on AI-generated content corrupts the output.

Researchers from the University of Cambridge discovered that a cannibalistic approach to AI training data quickly leads to models churning out nonsense, which could prove to be a fork in the road for the rapid expansion of AI.

The team used mathematical analysis to show that the problem affects large language models (LLMs) like ChatGPT as well as AI image generators like Midjourney and DALL-E.

The study was published in Nature, which gave the example of building an LLM to create Wikipedia-like articles.

A series of eighteen portraits of an elderly man wearing a hat displays a blend of traditional and abstract art styles. The first portrait begins with a realistic depiction, gradually transitioning to vibrant and highly stylized patterns in the subsequent images.
The increasingly distorted images produced by an AI image model that is trained on data generated by a previous version of the model. | M. Boháček & H. Farid/arXiv (CC BY 4.0)

The researchers say that they kept training new iterations of the model on text produced by its predecessor. As the synthetic data polluted the training set, the model’s output became nonsensical.

For example, in an article about English church towers, the model included extensive details about jackrabbits. Although that example is at the extreme end, any AI-generated data in the training set can cause models to malfunction.
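
The recursive setup described above can be illustrated with a toy simulation. The sketch below is not the researchers' code or method: it swaps the language model for a simple one-dimensional Gaussian, and the function name, sample size, and number of generations are all assumptions chosen for illustration. Each "generation" is fitted only to samples produced by the generation before it.

```python
import random
import statistics


def simulate_collapse(sample_size=10, generations=300, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Hypothetical toy stand-in for training a model on its predecessor's
    output. Returns the fitted standard deviation at each generation.
    """
    rng = random.Random(seed)
    # Generation 0 trains on "human" data: samples from a standard normal.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    stds = []
    for _ in range(generations + 1):
        # "Train" the model: estimate mean and spread from the current data.
        mean = statistics.fmean(data)
        std = statistics.pstdev(data)
        stds.append(std)
        # The next generation sees only synthetic samples from this model.
        data = [rng.gauss(mean, std) for _ in range(sample_size)]
    return stds


if __name__ == "__main__":
    stds = simulate_collapse()
    for generation in (0, 50, 100, 200, 300):
        print(f"generation {generation:3d}: fitted std = {stds[generation]:.3e}")
```

In this toy setting the fitted spread shrinks toward zero as the generations pass, because each model only ever sees a small sample of its predecessor's output and rare values are progressively lost. Real language and image models are far more complex, so this is only an analogy for the effect the study describes.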

“The message is, we have to be very careful about what ends up in our training data,” says co-author Zakhar Shumaylov, an AI researcher at the University of Cambridge, UK. “Otherwise, things will always, probably, go wrong.”

The team tells Nature they were surprised by how fast things started going wrong when using AI-generated content as training data.

Hany Farid, a computer scientist at the University of California, Berkeley, who has demonstrated the same effect in image models, compares the problem to species inbreeding.

“If a species inbreeds with their own offspring and doesn’t diversify their gene pool, it can lead to a collapse of the species,” says Farid.

Shumaylov predicts that this technological quirk means that the cost of building AI models will increase as the price of quality data increases.

Another issue for AI companies is that the open web, where most of them scrape their data, is filling up with AI content. That will pollute the resource they depend on and force them to rethink how they gather training data.

However, few copyright holders will shed a tear at this problem faced by AI companies. Artists, photographers, and content creators of all stripes have been up in arms over the tech industry’s brazen use of their work to build AI models.


Image credits: M. Boháček & H. Farid/arXiv (CC BY 4.0)
