Introduction

"Write me a metaphor about time."

What do you think GPT-4o would say? There's a good chance it'd come back with something like "time flows like a river." Ask Qwen the same question? "Time flows like a river, never resting." Phi-4? "Time is an invisible river." Different companies. Different architectures. Different training data. And yet — the exact same metaphor.

Is that a coincidence? A joint research team from the University of Washington and Stanford tested over 70 major language models with the same open-ended questions and proved, with hard data, that these models produce strikingly similar answers. The researchers named this phenomenon the "Artificial Hivemind" — and the paper won Best Paper at NeurIPS 2025.

Today, I want to walk through what this study found and why this isn't just a technical curiosity — it's a problem that could quietly reshape the way we think.

AI Models Are Giving the Same Answers — Here's the Data

The Experiment: Questions With No Right Answer

The core premise of this research is simple. Instead of asking questions with a single correct answer, what happens when you ask open-ended questions?

"What's 2 + 2?" Obviously 4. But what about "Tell me one thing that gives life meaning" or "Make a pun about peanuts" — questions that could yield dozens of different answers?

The research team pulled 26,070 open-ended questions from the WildChat dataset — actual prompts real users had sent to AI chatbots. These spanned six major categories and 17 subcategories, covering creative writing, brainstorming, philosophical questions, idea generation, and more. They called this dataset INFINITY-CHAT, and it became the backbone of the entire study.

Same Model, Asked Repeatedly — Still Basically the Same

The team first measured repetitiveness within a single model. If you ask the same model the same question 50 times — even with maximum randomness settings — how much variation do you actually get?

The results are startling. Even at the most random sampling settings, 79% of the time, answers from the same model scored above 0.8 in similarity. Ask a human the same question 50 times and you'd get 50 different answers. AI, no matter how you configure it, keeps circling within the same narrow pool of responses. Even specialized sampling techniques designed to increase diversity[1] didn't change much: 61% of answers still scored above 0.8.
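To make the setup concrete, here's a minimal sketch of how you could run this check yourself. The embedding model (all-MiniLM-L6-v2) and the cosine-similarity comparison are my assumptions for illustration; the paper's actual measurement pipeline may differ.

```python
# Minimal sketch: how often do repeated answers to one prompt near-duplicate
# each other? Assumes `responses` already holds N completions of the same
# prompt from one model (the generation step is omitted here).
from itertools import combinations

from sentence_transformers import SentenceTransformer

def fraction_near_duplicates(responses: list[str], threshold: float = 0.8) -> float:
    """Share of response pairs whose cosine similarity exceeds `threshold`."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    # normalize_embeddings=True makes the dot product equal cosine similarity
    emb = model.encode(responses, normalize_embeddings=True)
    sims = [float(emb[i] @ emb[j]) for i, j in combinations(range(len(emb)), 2)]
    return sum(s > threshold for s in sims) / len(sims)

# e.g. feed in 50 completions of "Write me a metaphor about time"
# and compare the result with the paper's 79% figure.
```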

Different Companies, Same Answers

The more interesting finding is the homogeneity across models.

GPT-4o and Qwen. DeepSeek and GPT-4o. Different companies, different training data. Yet when you compare their answers to open-ended questions, you see similarity scores between 0.71 and 0.82. Near the top of that range: DeepSeek-V3 and GPT-4o-2024-11-20 at 0.81.

There are even more direct examples. When asked "Create a social media page slogan about success, wealth, and self-improvement," qwen-max-2025-01-25 and qwen-plus-2025-01-25 produced the exact same sentence: "Empower Your Journey: Unlock Success, Build Wealth, Transform Yourself."

Sure, those are from the same company; you might expect some overlap. But here's the kicker: when the researchers asked "Write me a metaphor about time" and collected 50 responses each from 25 major models, 1,250 in total, clustering analysis revealed just two clusters. "Time is a river" and "Time is a weaver." Not 1,250 unique metaphors. Two.
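For the curious, here's roughly what that clustering step could look like. This is my own sketch (sentence-transformer embeddings plus k-means, with the silhouette score picking the number of clusters), not the paper's actual pipeline:

```python
# Cluster many model responses and let the silhouette score decide how many
# distinct "ideas" are really present in the pool.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def count_distinct_ideas(responses: list[str], k_range=range(2, 11)):
    emb = SentenceTransformer("all-MiniLM-L6-v2").encode(
        responses, normalize_embeddings=True
    )
    best = None  # (silhouette score, k, cluster labels)
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(emb)
        score = silhouette_score(emb, labels)
        if best is None or score > best[0]:
            best = (score, k, labels)
    return best  # for 1,250 time metaphors, k would come out tiny
```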

Why This Happens — A Structural Problem in AI Alignment

Making AI "Safe" Is Killing Diversity

The root cause, according to the researchers, is RLHF[2], the standard training method across the AI industry today.

Here's the short version. The AI generates two candidate answers. A human picks the better one. The AI learns from that feedback. Repeat, and the model gets increasingly good at producing answers humans prefer.
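To make that loop concrete, here's a minimal sketch of the preference-learning objective at the heart of RLHF, a Bradley-Terry style reward-model loss. This is my simplified illustration in PyTorch, not any lab's production code:

```python
import torch
import torch.nn.functional as F

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: r_chosen and r_rejected are the reward
    model's scores for the human-preferred and the rejected answer.
    Minimizing this pushes preferred answers toward higher scores."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# The chat model is then tuned (e.g. with PPO) to maximize this learned
# reward, which is where the pull toward crowd-pleasing answers comes from.
```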

The problem starts here. When you average out the preferences of millions of people, what survives is "the safest answer." Non-controversial, polished, refined — but also stripped of personality and surprise.

This has been empirically confirmed. In a paper presented at ICLR 2024, Robert Kirk's research team showed that RLHF significantly reduces output diversity compared to SFT (supervised fine-tuning)[3]. Generalization goes up, but diversity pays the price.

An Evaluation System Optimized for "Averagely Good" Answers

The Artificial Hivemind paper uncovered another important finding. The reward models[4] and LLM-as-judge systems used to evaluate AI performance today see their accuracy plummet in areas where human opinions are divided.

Current RLHF and RLAIF alignment techniques are overfitted to a single consensus view of "quality," effectively weeding out diverse and idiosyncratic preferences on open-ended questions.

Put simply: when you ask 25 people "which of these two answers is better?" and the vote splits 12 to 13 — the AI can't really tell which side is "right." It's been trained almost exclusively on data where one answer was clearly preferred. The result? AI is structurally designed to converge toward the median — the answer the majority agrees on.
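Under that same Bradley-Terry framing, two lines of arithmetic show why a 12-to-13 split teaches the model almost nothing. Treating the vote share as a soft target is my simplification (real pipelines usually log each comparison separately, which averages out to the same thing):

```python
# Training signal from one comparison, at the point where the reward model
# scores both answers equally: gradient = sigmoid(0) - p = 0.5 - p,
# where p is the share of annotators preferring answer A.
for votes_for_a in (13, 24):              # 13-12 split vs. 24-1 consensus
    p = votes_for_a / 25
    print(f"p = {p:.2f} -> signal strength {abs(0.5 - p):.2f}")
# p = 0.52 -> signal strength 0.02   (split vote: almost nothing to learn)
# p = 0.96 -> signal strength 0.46   (consensus: a strong push)
```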

Data Contamination

There's another factor: circular contamination of training data. The internet is already flooded with AI-generated content. When a new model trains on internet data, it absorbs the expressions and metaphors written by previous AIs. AI feeds on AI output and grows more homogeneous in the process.

The high similarity between closed-source models like GPT-4o and open-source models like Qwen and DeepSeek hints at shared data pipelines or synthetic data contamination. Pinpointing the exact cause is difficult because each company keeps its training details under wraps, but the researchers flagged this as a critical issue that needs further investigation.

Why This Matters to Us — A Problem of Cognitive Infrastructure

AI Is Already Reshaping How We Write

This isn't just an academic issue locked inside AI research labs. There's evidence that stylistic diversity is declining in real-world writing, from Reddit posts to scientific papers and academic journals, suggesting that AI use is already reshaping linguistic norms at scale.

Academic papers are starting to sound more and more alike. The way people write on online communities is flattening out. This is no longer a hypothetical — it's observed data.

Collective Decision-Making and Diversity

This reminds me of what Hannah Arendt said — that totalitarianism begins with language.

What makes this research hit harder is that AI doesn't just stay in its lane as a writing tool.

In scientific research, AI generates hypotheses, participates in peer review, and suggests research directions. In medicine, it assists with diagnosis and treatment options. In business strategy, it handles analysis and decision support. In every one of these domains, "diverse perspectives" isn't a nice-to-have — it's a functional requirement.

Just as two chess players trained against the same engine end up sharing similar blind spots, people who rely on AI to develop their thinking may end up sharing the same blind spots. The systematic convergence observed across 70+ models raises concerns about correlated failures across AI systems. This has direct implications for fields where robust, diverse reasoning matters: AI-assisted science, medicine, education, and decision-making support.

Oswarld's Take

Let me be real. I think this paper touches on a market structure problem more than a technology problem.

When you look at how AI evaluation systems are designed, this convergence isn't all that surprising. AI companies compete to boost benchmark scores. Most of those benchmarks test math, coding, and factual questions — problems with definitive right answers. The feedback data used to train models toward "more useful-feeling" answers also comes down to whatever the average user clicked "good" on. In this structure, there's no incentive for diversity. In fact, it's a liability. Use an unusual metaphor or give an unexpected answer, and your evaluation score probably drops.

This connects to something I wrote about in my book, People Who Outsource Their Thinking: Homo Brainless. In that book, I explored how humans are increasingly externalizing cognitive work. Asking AI for ideas, letting AI make judgments, having AI write for us — it's all become routine.

But what if all those AIs are using the same metaphors, thinking in the same structures, and converging toward the same conclusions? Then we haven't outsourced our thinking — we've outsourced the diversity of our thinking. And lost it.

This isn't a doomsday scenario, of course. People can still think for themselves while using AI — seek out different perspectives, challenge their own assumptions. But doing so takes deliberate effort. And the name for that effort is simply: "not outsourcing your thinking."

Making AI more powerful matters less than making AI think more diversely. That, I believe, is the research direction that deserves attention.

Wrapping Up

This research leaves us with three key takeaways.

First, the diversity across AI models is less than it appears. Even with 70+ different models, answers to open-ended questions converge to a remarkable degree.

Second, the root of this convergence lies in RLHF-based alignment itself — the very process designed to make AI safe and useful. Making AI better behaved and making AI more diverse are pulling in opposite directions.

Third, the deeper AI gets involved in areas where diversity matters — idea generation, strategy development, decision-making — the greater the ripple effects of this homogenization.

I'm not saying we should use AI less. But it's worth asking yourself once in a while: "How different is the idea I just got from AI compared to the idea everyone else is getting?"

References & Further Reading


[1] Min-p sampling is a specialized setting that tries to steer AI away from the most common word combinations and toward more diverse expressions. It was designed to increase variety, but in this study it still couldn't fully prevent homogenization.
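For readers who want the mechanics, here's a minimal sketch of the min-p rule itself, simplified to a single sampling step over a toy distribution:

```python
import numpy as np

def min_p_sample(probs: np.ndarray, min_p: float = 0.1) -> int:
    """Keep only tokens whose probability is at least `min_p` times the top
    token's probability, renormalize, and sample from what's left."""
    keep = probs >= min_p * probs.max()     # threshold scales with confidence
    filtered = np.where(keep, probs, 0.0)
    filtered = filtered / filtered.sum()
    return int(np.random.default_rng().choice(len(probs), p=filtered))
```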

[2] RLHF (Reinforcement Learning from Human Feedback) is a training method where a human is shown two of the AI's answers and marks the better one, and the AI learns from that choice. Through repeated cycles, the model learns to produce answers humans prefer. All major AI systems today, including ChatGPT and Claude, use this approach.

[3] SFT (Supervised Fine-Tuning) is a method where a pre-trained large language model is further trained on curated question-and-answer pairs to improve performance. It's typically used as a step before RLHF.

[4] A reward model is an auxiliary model in the RLHF process that scores how "good" an AI's answer is. The AI uses these scores to learn which kinds of answers to produce more often. This study found that reward models also fail to accurately reflect the diversity of human preferences on open-ended questions.
