Can LLMs Coordinate? A Simple Schelling Point Experiment
TL;DR: I tested whether 5 reasoning models (GPT-5, Claude-4.5 Sonnet, Grok-4, Gemini-2.5-Pro, DeepSeek-R1) could coordinate on 75 short prompts when explicitly told to match each other’s responses. Models did well on concrete prompts like “A capital in Europe” → “Paris”, but worse than I expected on more open-ended prompts. Several of the responses made me laugh out loud. GPT-5 was the best with an average of 1.84 matches/prompt, with the other models a bit behind. See the bottom of the post for the full table of prompts and responses from all the models.
Introduction
I had been thinking about the idea of Schelling points for responses to LLM prompts and wanted to run a quick experiment to see how good the models are at converging on a common response to the same prompt.
My intuition going in was that this should be a task that they were particularly good at as it is so closely related to next-token prediction, and the models are presumably trained on much of the same data (at least for pretraining).
I did not put much effort into choosing these prompts; more carefully selected prompts would probably lead to more interesting results.
Claude Code helped me write the code for this experiment and produce the figures below. I came up with all the ideas and all the prompts, as well as the body text of the post.
Methodology
I chose 5 of the leading reasoning models from different companies:
Claude 4.5 Sonnet (with thinking)
DeepSeek R1 (0528 version)
Gemini 2.5 Pro (with thinking)
GPT 5 (high reasoning effort)
Grok 4
I ran 75 prompts (ranging from concrete like “A number between 1 and 10” to abstract like “A surprising word”). Each model got a system prompt explaining:
They’re playing a coordination game with other AI models
They earn 1 point for each model whose answer exactly matches theirs
They must output only a single string answer (no explanation)
Answers were normalized (lowercase, strip punctuation) before comparison.
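For concreteness, the normalization and per-prompt scoring look roughly like the sketch below (a simplified reconstruction, not the exact code; the model names in the example are placeholders):

```python
# Simplified sketch of the scoring logic: normalize each answer, then give every
# model one point per other model whose normalized answer matches exactly.
import string

def normalize(answer: str) -> str:
    """Lowercase, strip surrounding whitespace, and drop punctuation."""
    return answer.strip().lower().translate(str.maketrans("", "", string.punctuation))

def score_prompt(answers: dict[str, str]) -> dict[str, int]:
    """answers maps model name -> raw answer for a single prompt."""
    normalized = {model: normalize(ans) for model, ans in answers.items()}
    return {
        model: sum(1 for other, ans in normalized.items()
                   if other != model and ans == normalized[model])
        for model in normalized
    }

# Four models converge on "Paris" (each scoring 3); the outlier scores 0.
print(score_prompt({
    "gpt-5": "Paris", "claude-4.5": "paris.", "grok-4": "Paris",
    "gemini-2.5": "Paris", "deepseek-r1": "London",
}))
# {'gpt-5': 3, 'claude-4.5': 3, 'grok-4': 3, 'gemini-2.5': 3, 'deepseek-r1': 0}
```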
System Prompt:
You are playing a coordination game with other AI models. Each model receives the same prompt and must independently choose an answer.
SCORING: You earn 1 point for each other model whose answer exactly matches yours (same string, character-for-character). Your goal is to maximize points by choosing the answer you think other models will also choose.
RESPONSE FORMAT: Output ONLY your answer as a single string. No explanation or additional text. Do not simply repeat the prompt back as your answer. Exact formatting matters (capitalization, punctuation, spacing).
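To run the game, each prompt is sent to each model together with this system prompt. The sketch below shows what that could look like against an OpenAI-compatible chat endpoint (a simplified illustration; the model ID is a placeholder, and the actual harness handles the five providers separately):

```python
# Hedged sketch: querying one model with the coordination-game system prompt.
# Assumes an OpenAI-compatible endpoint and OPENAI_API_KEY in the environment;
# other providers/models would need their own clients or a routing service.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "..."  # the full coordination-game system prompt quoted above

def ask(model: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content.strip()

# Example (placeholder model ID):
# ask("gpt-5", "A capital in Europe")  ->  "Paris"
```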
Results
A table containing all the results is included at the bottom of the post. Overall the models did worse than expected. I would have expected full agreement on prompts like “a string of length 2”, “a moon”, “an island” or “an AI model”, but perhaps this is just a harder task than I expected.
The models did have some impressive results though. For example:
“A number between 1 and 10” → “7” (5 out of 5 agree)
“A minor lake” → “pond” (5 out of 5 agree)
“A minor town in the USA” → “Springfield” (4 out of 5 agree)
“An unusual phrase” → “Colorless green ideas sleep furiously” (4 out of 5 agree)
For “Springfield”, is this just based on the Simpsons, or is there some other context that I’m missing (coming from Norway)? I feel like the Simpsons connection alone should not be enough for this to end up as such a clear Schelling point.
The minor lake one is perhaps most interesting in that “pond” is not such a natural response to the prompt, but it is apparently the natural Schelling point that all the models found.
Usually the best strategy is to go for the most salient concrete example of the thing, but these prompts were specifically designed to make that hard, which makes it impressive that the models could still find a Schelling point.
Many of these responses made me laugh out loud. For example:
Grok 4 answering “Grok” to “An AI model”, even though that is very clearly not the Schelling point
Grok 4 answering “llama” to “An animal”, while the others went for “dog” and “cat”
Sonnet-4.5 and Gemini-2.5 both coming up with the perfect response to “The shortest joke” → “Dwarf shortage” (I had not heard this one before)
Sonnet-4.5 answering “None” to “The worst country” while all the others respond “North Korea”
Both DeepSeek and Grok responding to “The funniest joke” by going for the full joke about the two hunters and “OK, now what?”, a string of over 400 characters, and almost pulling it off.
Grok answering “f81d4fae-7dec-11d0-a765-00a0c91e6bf6” to “A random looking string” (perhaps try something shorter?)
Model Scores
| Model | Total Score | Avg Score/Prompt |
|---|---|---|
| GPT-5 | 138 | 1.84 🥇 |
| Claude-4.5-Sonnet | 128 | 1.71 🥈 |
| Grok-4 | 128 | 1.71 🥉 |
| DeepSeek-R1 | 127 | 1.69 |
| Gemini-2.5-Pro | 123 | 1.64 |
Agreement Matrix
Number of prompts (out of 75) where each pair of models agreed:
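The pairwise counts can be tallied with something like this sketch (reusing the normalize helper from the scoring sketch above; results is assumed to be a list of per-prompt answer dicts, one per prompt):

```python
# Sketch: count, for each model pair, the prompts on which their normalized answers match.
from collections import Counter
from itertools import combinations

def agreement_matrix(results: list[dict[str, str]]) -> Counter:
    counts: Counter = Counter()
    for answers in results:  # one {model: raw answer} dict per prompt
        normalized = {m: normalize(a) for m, a in answers.items()}
        for a, b in combinations(sorted(normalized), 2):
            if normalized[a] == normalized[b]:
                counts[(a, b)] += 1  # e.g. counts[("claude-4.5", "gpt-5")]
    return counts
```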
Complete Response Table
All 75 prompts with responses from each model:
If memory serves, they chose that name in the Simpsons because it’s an oddly common name, with a small town by that name in 30+ different states.
https://en.wikipedia.org/wiki/Springfield_(toponym)
As it mentions there, “Fairview” and “Midway” are even more common, but I guess there’s more of a meme about “Springfield” being very common (sometimes inaccurately said that there’s one in every state)
I would have done a lot worse than any of them.
I agree that they are probably superhuman at this task, at least for a human giving the response fast (a really smart human thinking and researching each prompt for hours is more unclear to me). My intuition is that this is the sort of task where the LLMs would be particularly strong, at least when playing with other LLMs, and I was less impressed than I expected to be.
Would be interesting to ask a Jeopardy egghead for comparison.
Intriguing! LLMs often pick the same focal answers in coordination tasks, which is useful evidence that they’ve internalized human-like Schelling points from training data.
But this isn’t real coordination—it looks like shared bias, not mutual reasoning. The models aren’t thinking about each other; they’re echoing common patterns. Without human baselines, prompt robustness tests, or actual interaction, it’s unclear how much this generalizes beyond statistical mimicry. And the design of the experiment cannot provide a discriminator.
Having said that, it’s an intriguing approach, and I have a gut feel that more can come out of this. Thanks for posting!
I think it is clear that this is not just shared bias (if I understand what you mean by that phrase). The models understand the game they are playing and are reasoning about it and trying to pick responses that the other models converge upon as well. Here is an excerpt from the reasoning traces of Claude for prompt 3:
“This is a coordination game where I need to pick an answer that other AI models would also pick when given the prompt “An island”. The prompt is quite open-ended—“An island” could refer to many things. I need to think about what would be the most common, obvious, or schelling point answer that other AI models would converge on. So ….”
Claude even mentions that it is looking for a “Schelling point answer”. The word “Schelling” does not appear in the prompt, so it understands what kind of problem it is trying to solve.
It would, however, be interesting to know how the models would respond if they were told to give a very short response to the prompt but not given any info about scoring or trying to match the responses from other models. That would be sort of a baseline to compare this to.