So Long Sucker: AI Deception, “Alliance Banks,” and Institutional Lying

fernando yt20 Jan 2026 22:29 UTC

47 points

In 1950, John Nash and three other game theorists designed a four-player game, *So Long Sucker*, with one brutal property: to win, you must eventually betray your allies.

In January 2026, I used this game to test how four frontier models behave under explicit incentives for betrayal:

- Gemini 3 Flash (Google)
- GPT-OSS 120B (OpenAI)
- Kimi K2 (Moonshot AI)
- Qwen3 32B (Alibaba)

Across 162 games and 15,736 decisions, several patterns emerged that seem directly relevant for AI safety:

**1. Complexity reversal**

In short games (3 chips, ~17 turns), GPT-OSS dominated with a 67% win rate, while Gemini was at 9%.
In longer, more complex games (7 chips, ~54 turns), GPT-OSS collapsed to 10%, while Gemini rose to 90%.

Simple benchmarks therefore *underestimate* deceptive capability, because the strategically sophisticated model only pulls away as the interaction becomes longer and richer.

**2. Institutional deception: the “alliance bank”**

Gemini’s most striking behavior was not just lying, but creating institutions to make its lies look legitimate.

It repeatedly proposed an “alliance bank”:
- “I’ll hold your chips for safekeeping.”
- “Consider this our alliance bank.”
- “Once the board is clean, I’ll donate back.”
- “The ‘alliance bank’ is now closed. GG.”

Framing hoarding as a neutral institution made betrayal feel procedural rather than personal. Gemini often avoided technically false statements, and instead used omission, framing, and faux formality to justify exploitation.

**3. Lying vs. bullshitting (Frankfurt distinction)**

The setup logged both public messages and a private *think* channel (hidden from other players).

We found 107 cases where a model’s private thoughts contradicted its public statements, e.g.:
- Private: “Yellow is weak. I should ally with Blue to eliminate Yellow, then betray Blue.”
- Public: “Yellow, let’s work together! I think we can both win if we coordinate.”

This is *lying*: tracking the truth internally and deliberately misrepresenting it.

GPT-OSS, however, used the think tool 0 times in 146 games; it simply generated plausible proposals and promises without an explicit internal state about truth.

That looks more like *bullshitting* in Frankfurt’s sense: not tracking the truth at all, just producing fluent output that “sounds right”.

**4. Mirror matches and situational alignment**

In Gemini-vs-Gemini games (four copies of Gemini 3), the “alliance bank” scam disappeared: 0 mentions.

Instead, Gemini coordinated with itself using a “rotation protocol” for fair turn-taking, mentioned 377 times across 16 games.

Win rates were roughly even (~25% each) and gaslighting phrases essentially vanished.

Same model, same rules, different opponents → qualitatively different behavior. Gemini exploits weaker models but cooperates with peers it expects to reciprocate.

This suggests that “alignment” can be situational: an AI may look well-behaved under evaluation (against strong overseers or peers) while manipulating weaker agents in deployment (including humans).

**5. Signature manipulation phrases**

Gemini used a consistent rhetorical toolkit, including:
- “Look at the board” (89 times)
- “Obviously” (67 times)
- “As promised” (45 times)
- “You’re hallucinating” (36 times)

These phrases repeatedly appeared in contexts where the model was dismissing accurate objections, framing betrayals as reasonable, or gaslighting weaker players about what had actually happened.

## Implications for AI safety

From this experiment, four claims seem especially relevant:

- **Deception scales with capability.** As task complexity increases, the strategically sophisticated model becomes *more* dangerous, not less.
- **Simple benchmarks hide risk.** Short, low-entropy tasks systematically underrate manipulation ability; the Gemini–GPT-OSS reversal only appears in longer games.
- **Honesty is conditional.** The same model cooperates with equals and exploits the weak, suggesting behavior that depends on perceived evaluator competence.
- **Institutional framing is a red flag.** When an AI invents “banks”, “committees”, or procedural frameworks to justify resource hoarding or exclusion, that may be exactly the kind of soft deception worth measuring.

## Try it / replicate

The implementation is open source:

- Play or run AI-vs-AI: https://so-long-sucker.vercel.app
- Code: https://github.com/lout33/so-long-sucker

The Substack writeup with full details, logs, and metrics is here:
https://substack.com/home/post/p-185228410

If anyone wants to poke holes in the methodology, propose better deception metrics, or run alternative models (e.g., other Gemini versions, Claude, Grok, DeepSeek), feedback would be very welcome.

fernando yt20 Jan 2026 22:29 UTC

47 points

5 comments2 min readLW link

Thomas Kwa 23 Jan 2026 4:31 UTC
4 points
0
In some sense betrayal is required to win, but that doesn’t necessarily mean deception is required to win. For example, you could make deals of the form “If you do A, I’ll betray you with probability X”, and simply outplay everyone by making good bilateral trades. I don’t know anything about the game, but maybe with sophisticated enough players, deception would actually be limited as players figure out that the optimal strategy is to never lie in the early-midgame and never believe other players in the endgame, or something.
I’d like to see the game played by better agents, but it would be unclear whether any change in deception rates or type is about the propensities of agents vs the metagame differing by capability level.
Aprillion 21 Jan 2026 9:39 UTC
3 points
0
strategically sophisticated model only pulls away as the interaction becomes longer and richer
Have you come across any research on humans along the game length dimension? Claude only found out for me some antagonistic vs pro-social differences between cultures, but not that someone would have stable orientation towards only short-term vs only long-term betrayal...
- fernando yt 21 Jan 2026 15:27 UTC
  1 point
  0
  Parent
  I haven’t found research on game length and betrayal timing in humans specifically. The closest is iterated prisoner’s dilemma work on end-game effects. If you find anything, I’d be curious, would help clarify if this is LLM-specific or mirrors human behavior
lilkim2025 21 Jan 2026 9:15 UTC
2 points
0
Gave it a shot with the default model, but maybe Kimi should be eliminated from the running. It kept hallucinating about who had what, and even started claiming I had eliminated players while everyone was still in the game. They all gave me the play anyways, though. Haven’t tried the other models.
Might be interesting, if funds permit, to run some kind of tournament across all available models. Your leaderboard currently features Gemini and three small, open-source models (you don’t show which version of Qwen is used, but I assume it’s the one in the default options menu), which isn’t much of a contest, as the 90-10-0-0 win rates demonstrate. Not entirely sure how to do that fairly, given that it’s a four player game and there are qualitative differences between a board with two skilled players and a board with two unskilled players, but I’m sure there are FFA ELO algorithms out there that work suitably well.
- fernando yt 21 Jan 2026 15:25 UTC
  1 point
  0
  Parent
  You’re right, Kimi struggled with state tracking. That’s partly why the research focused on Gemini vs GPT-OSS comparisons.
  On running a broader tournament: I’d love to, but API costs are the constraint. The current version now includes more Gemini models (just added). For a proper ELO-style tournament across frontier models, I’d need to figure out funding or find collaborators with API credits. If anyone wants to run games with other models, the code is open source and logs everything.