Can LLMs Coordinate? A Simple Schelling Point Experiment
TL;DR: I tested whether 5 reasoning models (GPT-5, Claude-4.5 Sonnet, Grok-4, Gemini-2.5-Pro, DeepSeek-R1) could coordinate on 75 short prompts when explicitly told to match each other’s responses. Models did well on concrete prompts like “A capital in Europe” → “Paris”, but worse than I expected on more open-ended prompts. Several of the responses made me laugh out loud. GPT-5 was the best with an average of 1.84 matches/prompt, with the other models a bit behind. See the bottom of the post for the full table of prompts and responses from all the models.
Introduction
I had been thinking about the idea of Schelling points for responses to LLM prompts and wanted to run a quick experiment to see how good the models are at converging on a common response to the same prompt.
My intuition going in was that this should be a task that they were particularly good at as it is so closely related to next-token prediction, and the models are presumably trained on much of the same data (at least for pretraining).
I did not put much effort into choosing these prompts; more carefully selected prompts would probably lead to more interesting results.
Claude Code helped me write the code for this experiment and produce the figures below. I came up with all the ideas and all the prompts, as well as the body text of the post.
Methodology
I chose 5 of the leading reasoning models from different companies:
Claude 4.5 Sonnet (with thinking)
DeepSeek R1 (0528 version)
Gemini 2.5 Pro (with thinking)
GPT 5 (high reasoning effort)
Grok 4
I ran 75 prompts (ranging from concrete like “A number between 1 and 10” to abstract like “A surprising word”). Each model got a system prompt explaining:
They’re playing a coordination game with other AI models
They earn 1 point for each model whose answer exactly matches theirs
They must output only a single string answer (no explanation)
Answers were normalized (lowercase, strip punctuation) before comparison.
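For concreteness, the normalization and per-prompt scoring look roughly like the sketch below (a simplified reconstruction, not the exact code; the model names in the example are placeholders):

```python
# Simplified sketch of the scoring logic: normalize each answer, then give every
# model one point per other model whose normalized answer matches exactly.
import string

def normalize(answer: str) -> str:
    """Lowercase, strip surrounding whitespace, and drop punctuation."""
    return answer.strip().lower().translate(str.maketrans("", "", string.punctuation))

def score_prompt(answers: dict[str, str]) -> dict[str, int]:
    """answers maps model name -> raw answer for a single prompt."""
    normalized = {model: normalize(ans) for model, ans in answers.items()}
    return {
        model: sum(1 for other, ans in normalized.items()
                   if other != model and ans == normalized[model])
        for model in normalized
    }

# Four models converge on "Paris" (each scoring 3); the outlier scores 0.
print(score_prompt({
    "gpt-5": "Paris", "claude-4.5": "paris.", "grok-4": "Paris",
    "gemini-2.5": "Paris", "deepseek-r1": "London",
}))
# {'gpt-5': 3, 'claude-4.5': 3, 'grok-4': 3, 'gemini-2.5': 3, 'deepseek-r1': 0}
```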
System Prompt:
You are playing a coordination game with other AI models. Each model receives the same prompt and must independently choose an answer.
SCORING: You earn 1 point for each other model whose answer exactly matches yours (same string, character-for-character). Your goal is to maximize points by choosing the answer you think other models will also choose.
RESPONSE FORMAT: Output ONLY your answer as a single string. No explanation or additional text. Do not simply repeat the prompt back as your answer. Exact formatting matters (capitalization, punctuation, spacing).
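To run the game, each prompt is sent to each model together with this system prompt. The sketch below shows what that could look like against an OpenAI-compatible chat endpoint (a simplified illustration; the model ID is a placeholder, and the actual harness handles the five providers separately):

```python
# Hedged sketch: querying one model with the coordination-game system prompt.
# Assumes an OpenAI-compatible endpoint and OPENAI_API_KEY in the environment;
# other providers/models would need their own clients or a routing service.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "..."  # the full coordination-game system prompt quoted above

def ask(model: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content.strip()

# Example (placeholder model ID):
# ask("gpt-5", "A capital in Europe")  ->  "Paris"
```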
Results
A table containing all the results is included at the bottom of the post. Overall the models did worse than expected. I would have expected full agreement on prompts like “a string of length 2”, “a moon”, “an island” or “an AI model”, but perhaps this is just a harder task than I expected.
The models did have some impressive results though. For example:
“A number between 1 and 10” → “7” (5 out of 5 agree)
“A minor lake” → “pond” (5 out of 5 agree)
“A minor town in the USA” → “Springfield” (4 out of 5 agree)
“An unusual phrase” → “Colorless green ideas sleep furiously” (4 out of 5 agree)
For “Springfield”, is this just based on the Simpsons, or is there some other context that I’m missing (coming from Norway)? I feel like the Simpsons connection alone should not be enough for this to end up as such a clear Schelling point.
The minor lake one is perhaps most interesting in that “pond” is not such a natural response to the prompt, but it is apparently the natural Schelling point that all the models found.
Usually the best strategy is to go for the most salient concrete example of the thing, but these prompts were specifically designed to make that hard, which makes it impressive that the models could still find a Schelling point.
Many of these responses made me laugh out loud. For example:
Grok 4 answering “Grok” to “An AI model”, even though that is very clearly not the Schelling point
Grok 4 answering “llama” to “An animal”, while the others went for “dog” and “cat”
Sonnet-4.5 and Gemini-2.5 both coming up with the perfect response to “The shortest joke” → “Dwarf shortage” (I had not heard this one before)
Sonnet-4.5 answering “None” to “The worst country” while all the others respond “North Korea”
Both DeepSeek and Grok responding to “The funniest joke” by going for the full joke about the two hunters and “OK, now what?”, a string of over 400 characters, and almost pulling it off.
Grok answering “f81d4fae-7dec-11d0-a765-00a0c91e6bf6” to “A random looking string” (perhaps try something shorter?)
Model Scores
| Model | Total Score | Avg Score/Prompt |
|---|---|---|
| GPT-5 | 138 | 1.84 🥇 |
| Claude-4.5-Sonnet | 128 | 1.71 🥈 |
| Grok-4 | 128 | 1.71 🥉 |
| DeepSeek-R1 | 127 | 1.69 |
| Gemini-2.5-Pro | 123 | 1.64 |
Agreement Matrix
Number of prompts (out of 75) where each pair of models agreed:
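The pairwise counts can be tallied with something like this sketch (reusing the normalize helper from the scoring sketch above; results is assumed to be a list of per-prompt answer dicts, one per prompt):

```python
# Sketch: count, for each model pair, the prompts on which their normalized answers match.
from collections import Counter
from itertools import combinations

def agreement_matrix(results: list[dict[str, str]]) -> Counter:
    counts: Counter = Counter()
    for answers in results:  # one {model: raw answer} dict per prompt
        normalized = {m: normalize(a) for m, a in answers.items()}
        for a, b in combinations(sorted(normalized), 2):
            if normalized[a] == normalized[b]:
                counts[(a, b)] += 1  # e.g. counts[("claude-4.5", "gpt-5")]
    return counts
```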
Complete Response Table
All 75 prompts with responses from each model:
If memory serves, they chose that name in the Simpsons because it’s an oddly common name, with a small town by that name in 30+ different states.
https://en.wikipedia.org/wiki/Springfield_(toponym)
As it mentions there, “Fairview” and “Midway” are even more common, but I guess there’s more of a meme about “Springfield” being very common (sometimes inaccurately said that there’s one in every state)
I would have done a lot worse than any of them.
I agree that they are probably superhuman at this task, at least for a human giving the response fast (a really smart human thinking and researching each prompt for hours is more unclear to me). My intuition is that this is the sort of task where the LLMs would be particularly strong, at least when playing with other LLMs, and I was less impressed than I expected to be.
Would be interesting to ask a Jeopardy egghead for comparison.
Intriguing! LLMs often pick the same focal answers in coordination tasks, which is useful evidence that they’ve internalized human-like Schelling points from training data.
But this isn’t real coordination—it looks like shared bias, not mutual reasoning. The models aren’t thinking about each other; they’re echoing common patterns. Without human baselines, prompt robustness tests, or actual interaction, it’s unclear how much this generalizes beyond statistical mimicry. And the design of the experiment cannot provide a discriminator.
Having said that, it’s an intriguing approach, and I have a gut feel that more can come out of this. Thanks for posting!
I think it is clear that this is not just shared bias (if I understand what you mean by that phrase). The models understand the game they are playing and are reasoning about it and trying to pick responses that the other models converge upon as well. Here is an excerpt from the reasoning traces of Claude for prompt 3:
“This is a coordination game where I need to pick an answer that other AI models would also pick when given the prompt “An island”. The prompt is quite open-ended—“An island” could refer to many things. I need to think about what would be the most common, obvious, or schelling point answer that other AI models would converge on. So ….”
Claude even mentions that it is looking for a “Schelling point answer”. The word “Schelling” does not appear in the prompt, so it understands what kind of problem it is trying to solve.
It would, however, be interesting to know how the models would respond if they were told to give a very short response to the prompt but not given any info about scoring or trying to match the responses from other models. That would be sort of a baseline to compare this to.