
Synthetic Data Is Eating Real Data: The Training Data Crisis Nobody Talks About

AI labs are running out of high-quality human-generated training data. The solution — training on AI-generated data — works surprisingly well but creates risks nobody fully understands.


The Internet Isn’t Big Enough

Here’s a fact that should make you uncomfortable: the major AI labs have essentially exhausted the supply of high-quality, publicly available training data.

Estimates suggest that GPT-4 and Claude 3 were trained on roughly 10-15 trillion tokens of text — the majority of the quality-filtered public internet. GPT-5 and subsequent models need more data, but there isn’t more quality data to find. The internet grows, but mostly with AI-generated content, SEO spam, and low-quality text.

The AI industry’s response: make more data. Specifically, use AI to generate training data for AI.

How Synthetic Data Actually Works

Synthetic data isn’t just “feeding AI output back to AI.” The techniques are more sophisticated:

Self-play and self-improvement: A model generates solutions, a verifier checks them, correct solutions become training data. This works remarkably well for math, coding, and logical reasoning — tasks where answers can be verified objectively.
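The loop above can be sketched in a few lines. This is a toy illustration, not a real training pipeline: `propose_answer` stands in for a model sampling solutions, and the verifier is exact because the task (arithmetic) is objectively checkable. All names are made up for this sketch.

```python
# Minimal sketch of verified self-play. A toy "model" proposes answers,
# an exact verifier checks them, and only verified (problem, answer)
# pairs are kept as training data.
import random

def propose_answer(a, b, rng):
    """Toy stand-in for a model: often right, sometimes off by one."""
    noise = rng.choice([0, 0, 0, 1, -1])  # wrong roughly 40% of the time
    return a + b + noise

def verify(a, b, answer):
    """Objective verifier: for math-like tasks, correctness is checkable."""
    return answer == a + b

rng = random.Random(42)
training_data = []
for _ in range(1000):
    a, b = rng.randint(0, 99), rng.randint(0, 99)
    answer = propose_answer(a, b, rng)
    if verify(a, b, answer):  # only verified solutions survive
        training_data.append((f"{a} + {b} = ?", answer))

print(f"kept {len(training_data)} of 1000 generated solutions")
```

The filter is the whole trick: generation is cheap and noisy, but verification makes the surviving data reliable, which is why this works for math and code and struggles for subjective tasks.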

Distillation: A large, expensive model generates high-quality outputs that are used to train a smaller, cheaper model. Many open-weight models are built this way — trained partly on outputs from GPT-4-class frontier models.
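In data terms, distillation is just pairing prompts with a teacher's outputs and fine-tuning the student on those pairs. Here is a hedged sketch with a toy classifier standing in for an expensive teacher model; no real API is implied.

```python
# Sketch of distillation data generation: a "teacher" labels prompts, and
# the (prompt, completion) pairs become the student's fine-tuning set.
# teacher() is a toy stand-in for a frontier model.
def teacher(prompt: str) -> str:
    """Toy teacher: tags a prompt as code-related or not."""
    code_words = {"def", "class", "import", "function", "compile"}
    return "code" if any(w in prompt.lower().split() for w in code_words) else "prose"

prompts = [
    "How do I import a module in Python?",
    "Summarize the plot of Hamlet.",
    "Write a function that reverses a list.",
]
# Distillation dataset: cheap to scale, but quality-capped by the teacher.
student_data = [{"prompt": p, "completion": teacher(p)} for p in prompts]
```

The economics follow directly: the teacher is queried once to build the dataset, while the student trains and serves cheaply — but the student can never be more reliable than the labels it learned from.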

Constitutional AI and RLAIF: Anthropic’s Constitutional AI has a model critique and revise its own outputs against a set of written principles, and RLAIF (reinforcement learning from AI feedback) replaces human preference labels with AI-generated ones — synthetic preference data for alignment training.

Domain simulation: For specialized applications (medical diagnosis, financial modeling, autonomous driving), synthetic data can cover scenarios that rarely appear in real data but are critical to handle correctly.
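The value of domain simulation is controlling the distribution: rare-but-critical cases can be oversampled far beyond their real-world frequency. A toy sketch, with made-up field names and an invented "sensor dropout" event for illustration:

```python
# Illustrative domain simulation: synthetic scenarios oversample a rare
# event so a model sees enough examples of it. All fields are invented.
import random

def synthetic_drive_scenario(rng, rare_event_rate=0.3):
    """One toy driving scenario; the rare event appears at ~30% here
    versus a vanishingly small rate in real logs."""
    return {
        "speed_kmh": rng.uniform(0, 120),
        "sensor_dropout": rng.random() < rare_event_rate,
    }

rng = random.Random(7)
scenarios = [synthetic_drive_scenario(rng) for _ in range(1000)]
dropout_count = sum(s["sensor_dropout"] for s in scenarios)
print(f"{dropout_count} of 1000 scenarios include a sensor dropout")
```

A model trained on real logs alone might see a handful of such events; the simulator can produce thousands on demand.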

The Results Are Surprisingly Good

Research from the past year shows that carefully curated synthetic data can match or exceed human-generated data quality for specific tasks:

  • Models trained with synthetic math data solve problems 15-20% more accurately than those trained on human data alone
  • Code models trained on synthetic debugging scenarios show better error-handling than those trained only on GitHub code
  • Multilingual models improve significantly when synthetic data fills gaps in under-represented languages

The key word is “carefully curated.” Raw synthetic data is garbage. Filtered, verified, and strategically generated synthetic data is gold.

The Risks Nobody Fully Understands

Model collapse: When AI trains on AI-generated data for multiple generations, quality degrades. Think of it as a photocopy of a photocopy — each generation loses fidelity. Current techniques mitigate this, but the long-term effects over many training cycles are unknown.

Monoculture: If every model is trained on data generated by a few foundation models, we get a monoculture of AI perspectives. Diverse training data produces diverse capabilities. Synthetic data from one source produces convergent biases.

Evaluation contamination: How do you benchmark a model that was trained on synthetic data generated by the models that created the benchmarks? The evaluation framework itself becomes unreliable.

The attribution problem: If Model B is trained on synthetic data from Model A, and Model A was trained on copyrighted human data, who owes whom? The legal implications are unexplored.

What Nobody’s Talking About

The most interesting development is that the synthetic data question might make the copyright debate irrelevant. If future models can be trained entirely on synthetic data plus a small amount of licensed human data, the legal pressure around training on copyrighted content evaporates.

This creates a perverse incentive: AI labs are motivated to solve synthetic data not just for quality reasons, but to escape copyright liability. The technical and legal motivations are perfectly aligned.

What to Watch

  • Whether GPT-5 or Claude 5 uses primarily synthetic training data (neither company will say, but the answer matters)
  • Model collapse research — are we approaching the limit of synthetic data recursion?
  • New techniques for generating high-quality synthetic data in domains where verification is hard (creative writing, subjective tasks)
  • Regulatory response — should there be rules about synthetic data disclosure?

The irony is poetic: AI learned from humanity’s collective output, and now it’s learning from its own. Whether that’s a virtuous cycle or a hall of mirrors depends on choices we’re making right now — and most of them are being made behind closed doors.
