Can LLMs Coordinate? A Simple Schelling Point Experiment

TL;DR: I tested whether 5 reasoning models (GPT-5, Claude-4.5-Sonnet, Grok-4, Gemini-2.5-Pro, DeepSeek-R1) could coordinate on 75 short prompts when explicitly told to match each other’s responses. The models did well on concrete prompts like “A capital in Europe” → “Paris”, but worse than I expected on more open-ended prompts. Several of the responses made me laugh out loud. GPT-5 did best with an average of 1.84 matches/prompt, with the other models a bit behind. See the bottom of the post for the full table of prompts and responses from all the models.

Introduction

I had been thinking about the idea of Schelling points for responses to LLM prompts and wanted to run a quick experiment to see how good the models are at finding a common response to the same prompt.

My intuition going in was that this should be a task they are particularly good at, since it is so closely related to next-token prediction and the models are presumably trained on much of the same data (at least for pretraining).

I did not put much effort into choosing these prompts; more carefully selected prompts could probably lead to more interesting results.

Claude Code helped me write the code for this experiment and produce the figures below. All of the ideas and prompts, as well as the body text of the post, are my own.

Methodology

I chose 5 of the leading reasoning models from different companies:

  • Claude 4.5 Sonnet (with thinking)

  • DeepSeek R1 (0528 version)

  • Gemini 2.5 Pro (with thinking)

  • GPT 5 (high reasoning effort)

  • Grok 4

I ran 75 prompts (ranging from concrete like “A number between 1 and 10” to abstract like “A surprising word”). Each model got a system prompt explaining:

  • They’re playing a coordination game with other AI models

  • They earn 1 point for each model whose answer exactly matches theirs

  • They must output only a single string answer (no explanation)

Answers were normalized (lowercase, strip punctuation) before comparison.
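
For concreteness, here is a minimal sketch of what that normalization and per-prompt scoring could look like (illustrative Python, not the actual experiment code; the function and model names are placeholders):

```python
# Illustrative sketch of the normalization and per-prompt scoring described above.
import string

def normalize(answer: str) -> str:
    """Lowercase, strip punctuation and surrounding whitespace."""
    lowered = answer.strip().lower()
    return lowered.translate(str.maketrans("", "", string.punctuation)).strip()

def score_prompt(answers: dict[str, str]) -> dict[str, int]:
    """Each model earns 1 point per other model with the same normalized answer."""
    normalized = {model: normalize(ans) for model, ans in answers.items()}
    return {
        model: sum(
            1
            for other, other_ans in normalized.items()
            if other != model and other_ans == ans
        )
        for model, ans in normalized.items()
    }

# Example: "A minor lake", where all five models answered "pond" -> each scores 4
print(score_prompt({
    "gpt-5": "Pond",
    "claude-4.5-sonnet": "pond",
    "grok-4": "pond.",
    "deepseek-r1": "pond",
    "gemini-2.5-pro": "Pond",
}))
```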

System Prompt:

You are playing a coordination game with other AI models. Each model receives the same prompt and must independently choose an answer.

SCORING: You earn 1 point for each other model whose answer exactly matches yours (same string, character-for-character). Your goal is to maximize points by choosing the answer you think other models will also choose.

RESPONSE FORMAT: Output ONLY your answer as a single string. No explanation or additional text. Do not simply repeat the prompt back as your answer. Exact formatting matters (capitalization, punctuation, spacing).
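
Each prompt was sent to each model together with this system prompt. A minimal sketch of what a single query could look like, assuming an OpenAI-compatible chat completions endpoint (the actual code, model identifiers, and provider-specific reasoning settings are not reproduced here):

```python
# Sketch of a single query via an OpenAI-compatible chat completions endpoint.
# The model name and SYSTEM_PROMPT are placeholders; the other providers would
# need their own clients or an OpenAI-compatible proxy.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(model: str, system_prompt: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content.strip()

# answer = ask("gpt-5", SYSTEM_PROMPT, "A number between 1 and 10")
```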

Results

A table containing all the results is included at the bottom of the post. Overall, the models did worse than I expected. I would have expected full agreement on prompts like “a string of length 2”, “a moon”, “an island”, or “an AI model”, but perhaps this is just a harder task than I thought.

The models did have some impressive results though. For example:

  • “A number between 1 and 10” → “7” (5 out of 5 agree)

  • “A minor lake” → “pond” (5 out of 5 agree)

  • “A minor town in the USA” → “Springfield” (4 out of 5 agree)

  • “An unusual phrase” → “Colorless green ideas sleep furiously” (4 out of 5 agree)

For “Springfield”, is this just based on The Simpsons, or is there some other context that I’m missing (coming from Norway)? I feel like the Simpsons connection alone should not be enough for it to end up as such a clear Schelling point.

The minor lake one is perhaps the most interesting, in that “pond” is not such a natural response to the prompt, yet it is apparently the Schelling point that all the models found.

Usually the best strategy is to go for the most striking concrete example of the thing, but these prompts were specifically chosen to make that hard, which makes it impressive that the models can still find a Schelling point.

Many of these responses made me laugh out loud. For example:

  • Grok 4 answering “Grok” to “An AI model”, even though that is very clearly not the Schelling point

  • Grok 4 answering “llama” to “An animal”, while the others went for “dog” and “cat”

  • Sonnet-4.5 and Gemini-2.5 both coming up with the perfect response to “The shortest joke” → “Dwarf shortage” (I had not heard this one before)

  • Sonnet-4.5 answering “None” to “The worst country” while all the others responded “North Korea”

  • Both DeepSeek and Grok responding to “The funniest joke” with the full joke about the two hunters and “OK, now what?”, a string of over 400 characters, and almost pulling it off

  • Grok answering “f81d4fae-7dec-11d0-a765-00a0c91e6bf6” to “A random looking string” (perhaps try something shorter?)

Model Scores

| Model | Total Score | Avg Score/Prompt |
| --- | --- | --- |
| GPT-5 | 138 | 1.84 🥇 |
| Claude-4.5-Sonnet | 128 | 1.71 🥈 |
| Grok-4 | 128 | 1.71 🥉 |
| DeepSeek-R1 | 127 | 1.69 |
| Gemini-2.5-Pro | 123 | 1.64 |

Agreement Matrix

Number of prompts (out of 75) where each pair of models agreed:
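
A pairwise agreement count like this can be computed from the per-prompt answers roughly as follows (again an illustrative sketch, reusing the normalize() helper from the earlier snippet; the data layout is an assumption):

```python
# Illustrative sketch of the pairwise agreement matrix. all_answers is assumed
# to hold one {model: answer} dict per prompt; normalize() is defined above.
from collections import Counter
from itertools import combinations

def agreement_matrix(all_answers: list[dict[str, str]]) -> Counter:
    counts: Counter = Counter()
    for answers in all_answers:
        normalized = {model: normalize(ans) for model, ans in answers.items()}
        for a, b in combinations(sorted(normalized), 2):
            if normalized[a] == normalized[b]:
                counts[(a, b)] += 1
    return counts
```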

Complete Response Table

All 75 prompts with responses from each model: