CHAPTER 10: Risks in Generative AI: Data Inbreeding
Human-made data underpins the growing generative AI economy. AI models are trained on data of every kind, which has set off a hunt among tech companies for ever more data to feed their AI systems.1 AI builders are endlessly hungry to feed their models more data, but that data is increasingly laced with synthetic content. Generative models are no longer trained only on data sourced from the real world (known as natural data); they are now also trained on data manufactured by other generative models. When synthetic content is fed back into a generative model's training data, something called data inbreeding occurs, and it is producing outputs that are increasingly distorted.2
In mythology, the ouroboros is a serpent-like creature that consumes its own tail in a never-ending loop. Generative AI may be headed the same way: generative models are increasingly consuming the outputs of other generative models. Data inbreeding invites the same kinds of mutations that genetic inbreeding does. And just as in a biological organism, inbreeding can cause the underlying code to malfunction and distort its outputs. A question we all need to ask is: will generative AI's data inbreeding ultimately prove to be its Achilles' heel?
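The feedback loop described above can be sketched in a few lines of code. This is a toy illustration, not from the chapter: the "model" here is just a Gaussian fit (mean and standard deviation) standing in for a generative model, and each generation is trained only on samples produced by the previous one. All the numbers (sample size, generation count) are illustrative assumptions.

```python
import numpy as np

# Toy sketch of data inbreeding (often called "model collapse"):
# each generation of a "model" is fit solely to synthetic samples
# produced by the previous generation, with no fresh natural data.
rng = np.random.default_rng(42)

mean, std = 0.0, 1.0   # generation 0: the "natural" data distribution
sample_size = 20       # each generation sees only 20 samples (illustrative)
initial_std = std

for generation in range(200):
    synthetic = rng.normal(mean, std, size=sample_size)  # model's output
    mean, std = synthetic.mean(), synthetic.std()        # next model fits it

print(f"std after 200 generations: {std:.4g} (started at {initial_std})")
```

Run repeatedly, the fitted standard deviation tends to drift toward zero: each refit loses a little of the distribution's tails, and those losses compound across generations until the "model" produces nearly identical outputs, a rough analogue of the distortion the chapter warns about.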
More companies are starting to investigate what is known as synthetic data to train their large language models (LLMs). The veracity and dependability of AI-generated data could easily ...