ChatGPT Pollution Hobbles AI Development

Source: futurism.com

Published on June 17, 2025

AI Model Collapse

The rapid rise of ChatGPT and similar models has flooded the internet with low-quality, AI-generated content, and that flood is now hindering the development of future AI. Synthetic text is crowding out the human-created content these models depend on for training. As a result, new models increasingly learn from and replicate the output of earlier models, a feedback loop that can steadily degrade both content quality and model capability. The industry calls this AI "model collapse."
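The feedback loop described above can be illustrated with a toy simulation (my own sketch, not anything from the article): a "model" here is just a Gaussian fitted to its data, and each generation trains only on samples produced by the previous one.

```python
import random
import statistics

def train(samples):
    """'Train' a toy model: fit a Gaussian (mean, stdev) to the data."""
    return statistics.mean(samples), statistics.stdev(samples)

def generate(mean, stdev, n):
    """Have the trained model produce synthetic data for the next generation."""
    return [random.gauss(mean, stdev) for _ in range(n)]

random.seed(42)
data = [random.gauss(0.0, 1.0) for _ in range(20)]   # original "human" data
for generation in range(500):
    mean, stdev = train(data)
    data = generate(mean, stdev, 20)   # the next model sees only AI output
    if generation % 100 == 0:
        print(f"generation {generation:3d}: stdev = {stdev:.4f}")
# The fitted spread shrinks toward zero over many generations:
# each round of learning from your own output loses tail diversity.
```

Real model collapse is subtler, but the mechanism is the same: sampling from a fitted model underrepresents the tails of the distribution, so rare patterns vanish a little more with each generation.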

The limited pool of pre-ChatGPT data is therefore becoming highly valuable. The Register compares it to the demand for "low-background steel," metal produced before the first nuclear detonation in July 1945. Just as AI chatbots have polluted the internet, atmospheric nuclear explosions released radioactive particles that have found their way into virtually all steel produced since, making modern metal unsuitable for certain sensitive equipment such as radiation detectors.

Low-background steel is now salvaged from World War I and World War II warships, including the German fleet scuttled at Scapa Flow by Admiral Ludwig von Reuter in 1919. Maurice Chiodo of the University of Cambridge calls the admiral's act the "greatest contribution to nuclear medicine in the world," as it left behind an almost infinite supply of low-background steel.

Clean Data

Chiodo says the analogy holds because both problems come down to access to pre-contamination material. Data collected before 2022 is considered relatively free of AI contamination; anything after that is treated as "dirty." In 2024, Chiodo co-authored a paper advocating a maintained source of "clean" data, both to prevent model collapse and to keep competition among AI developers fair: without it, early movers who trained on purer data would hold a lasting advantage over later entrants.
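In practical terms, the 2022 cutoff amounts to a provenance filter on training data. A minimal sketch, assuming hypothetical corpus records that carry a publication date (the field names and the exact cutoff are my assumptions; the article does not specify an implementation):

```python
from datetime import date

# ChatGPT's public release; anything published after this is suspect.
AI_CUTOFF = date(2022, 11, 30)

def is_clean(record):
    """Keep only documents dated before the contamination cutoff."""
    return record["published"] < AI_CUTOFF

# Hypothetical corpus records.
corpus = [
    {"url": "example.org/essay", "published": date(2019, 5, 4), "text": "..."},
    {"url": "example.org/post", "published": date(2024, 1, 15), "text": "..."},
]

clean = [r for r in corpus if is_clean(r)]
print([r["url"] for r in clean])  # → ['example.org/essay']
```

A date stamp alone is a weak guarantee, of course, since dates can be forged or missing, which is part of why a curated, verifiable archive of clean data is more robust than ad hoc filtering.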

How serious the threat of model collapse really is remains debated, though Chiodo and other researchers have been raising concerns for years. Chiodo told The Register that if model collapse does become a significant problem, cleaning up the contaminated data environment could prove prohibitively expensive, if it is possible at all.

Retrieval-Augmented Generation

One area where the issue is already visible is retrieval-augmented generation (RAG), in which AI models retrieve real-time data from the internet to supplement what is in their training data. Because that retrieved data may itself be AI-generated, it has been linked to more "unsafe" chatbot responses.
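In outline, RAG retrieves documents relevant to a query and splices them into the model's prompt, so anything in the retrieved corpus, AI-generated or not, flows straight into the answer. A minimal sketch of that pipeline, using naive word overlap as a stand-in for a real retriever (the corpus and scoring here are illustrative only):

```python
def retrieve(query, corpus, k=1):
    """Rank documents by word overlap with the query. Real systems use
    vector similarity, but the failure mode is the same either way."""
    def words(text):
        return set(text.lower().replace("?", "").replace(".", "").split())
    q = words(query)
    return sorted(corpus, key=lambda doc: len(q & words(doc)), reverse=True)[:k]

def build_prompt(query, corpus):
    """Prepend retrieved context to the prompt: the model treats it as
    ground truth, regardless of who (or what) wrote it."""
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}"

corpus = [
    "Low-background steel is salvaged from pre-1945 shipwrecks.",
    "Bananas are rich in potassium.",
]
print(build_prompt("Where does low-background steel come from?", corpus))
```

If an AI-written page ranks highly in the retrieval step, its text lands in the context window verbatim, which is how contaminated web data translates directly into contaminated answers.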

This issue reflects the broader debate around scaling AI models by increasing data and processing power. After developers reported diminishing returns in late 2024, some experts suggested that scaling had reached a limit. Increasingly poor data would make this limit even harder to overcome.

Chiodo suggests that stronger regulation, such as mandatory labeling of AI-generated content, could help curb the pollution, though enforcement would be difficult. The AI industry, which has opposed government intervention, may thus be contributing to the very problem that threatens it. Rupprecht Podszun, a law professor, told The Register that holding off on regulation is typical in the early days of a new technology.