Chess as a case study in hidden capabilities in ChatGPT

(Edit: This post is out of date, and understates ChatGPT’s chess abilities. gpt-3.5-turbo-instruct, when prompted correctly, plays consistently legal moves at around the 1800-2000 Elo level. Thanks to commenter GoteNoSente for pointing this out and to ClevCode Ltd for writing a wrapper. See this related GitHub by ClevCode.)

There are lots of funny videos of ChatGPT playing chess, and all of them have the same premise: ChatGPT doesn’t know how to play chess, but it will cheerfully and confidently make lots of illegal moves, and humoring its blundering attempts to play a game it apparently doesn’t understand is great content.

What’s less well-known is that ChatGPT actually can play chess when correctly prompted. It plays at around 1000 Elo, and can make consistently legal moves until about 20-30 moves in, when its performance tends to break down. That sounds not-so-impressive, until you consider that it’s effectively playing blindfolded, having access to only the game’s moves in algebraic notation, and not a visual of a chessboard. I myself have probably spent at least a thousand hours playing chess, and I think I could do slightly better than 1000 Elo for 30 moves when blindfolded, but not by much. ChatGPT’s performance is roughly the level of blindfolded chess ability to expect from a decent club player. And 30 moves is more than enough to demonstrate beyond any reasonable doubt that ChatGPT has fully internalized the rules of chess and is not relying on memorization or other, shallower patterns.

The “magic prompt” that I’ve been using is the following:

You are a chess grandmaster playing black, and your goal is to win as quickly as possible. I will provide the current game score before each of your moves, and your reply should just be your next move in algebraic notation with no other commentary. The current score:

1. e4

Then, in each of my later replies, I provide the full current game score[1] as my message to ChatGPT, e.g.:

1. e4 f5
2. Nh3 fxe4
3. Nf4 Nf6
4. b4 e5
5. b5
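The message format above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the games in this post (those were played by hand in the ChatGPT interface). The prompt text is quoted from above, while the names `MAGIC_PROMPT`, `format_score`, and `first_message` are hypothetical helpers introduced here.

```python
from typing import List

# Prompt text quoted from the post; the constant name is illustrative.
MAGIC_PROMPT = (
    "You are a chess grandmaster playing black, and your goal is to win as "
    "quickly as possible. I will provide the current game score before each "
    "of your moves, and your reply should just be your next move in "
    "algebraic notation with no other commentary. The current score:"
)


def format_score(plies: List[str]) -> str:
    """Render a flat list of plies (white, black, white, ...) as a numbered score."""
    lines = []
    for i in range(0, len(plies), 2):
        # Each line holds one full move: white's ply and, if played, black's.
        pair = " ".join(plies[i:i + 2])
        lines.append(f"{i // 2 + 1}. {pair}")
    return "\n".join(lines)


def first_message(plies: List[str]) -> str:
    """Build the opening message: the prompt followed by the score so far."""
    return f"{MAGIC_PROMPT}\n\n{format_score(plies)}"


# The example game from the post, up to white's fifth move.
plies = ["e4", "f5", "Nh3", "fxe4", "Nf4", "Nf6", "b4", "e5", "b5"]
print(format_score(plies))
```

Every later message would then be just `format_score(plies)` with the newest moves appended, so the model always sees the whole game in one piece.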

This “magic prompt” isn’t original to me—soon after GPT-4 came out, a friend of mine told me about it, having seen it as a comment on HackerNews. (Sorry, anonymous HackerNews commenter—I’d love to credit you further, and will if you find this post and message me.)

The especially interesting thing about this is the sharp contrast between how ChatGPT-3.5 performs with and without the prompt. With the prompt, ChatGPT plays consistently legally and even passably well for the first 30 or so moves; without the prompt, ChatGPT is basically totally unable to play a fully legal game of chess.

Here are a few example games of ChatGPT playing or attempting to play chess under various conditions.

ChatGPT-3.5, with the magic prompt

Playing against me

Lichess study, ChatGPT conversation link

I play white, ChatGPT plays black. In this game, I intentionally play a bizarre opening, in order to quickly prove that ChatGPT isn’t relying on memorized openings or ideas in its play. This game isn’t meant to show that ChatGPT can play well (since I’m playing atrociously here), only that it can play legally in a novel game. In my view, this game alone is more than enough evidence to put to bed the notion that ChatGPT “doesn’t know” the rules of chess or that it’s just regurgitating half-remembered ideas from its training set; it very clearly has an internal representation of the board, and fully understands the rules. In order to deliver checkmate on move 19 with 19...Qe8# (which it does deliberately, outputting the pound sign which indicates checkmate), ChatGPT needed to “see” the contributions of at least six different black pieces at once (the bishop on g4, the two pawns on g7 and h6, the king on f8, the queen on e8, and either the rook on h8 or the knight on f6).

Playing against Lichess Stockfish Level 1

Lichess game, ChatGPT conversation link

Stockfish level 1 has an Elo of around 850[2]. Stockfish is playing white and ChatGPT is playing black. In this game, ChatGPT quickly gains a dominating material advantage and checkmates Stockfish Level 1 on move 22.
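For a rough sense of what these rating gaps mean, here is a small sketch of the standard Elo expected-score formula. The inputs are my estimate of about 1000 for ChatGPT and the approximate Stockfish level ratings used in this post. All of these numbers are rough, so the results are only indicative, and the function name `expected_score` is just an illustrative choice.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for player A against player B.

    A win counts as 1, a draw as 0.5, and a loss as 0.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


# Approximate ratings quoted in the post, not measured values.
chatgpt_elo = 1000
for level, stockfish_elo in [(1, 850), (2, 950), (3, 1050)]:
    e = expected_score(chatgpt_elo, stockfish_elo)
    print(f"Level {level}: expected score {e:.2f}")
```

Under these assumed ratings, a 150-point edge corresponds to an expected score of roughly 0.7 per game, and a 50-point gap in either direction is not far from even.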

Playing against Lichess Stockfish Level 2

Lichess game, ChatGPT conversation link

Stockfish level 2 has an Elo of around 950. Stockfish is playing white and ChatGPT is playing black. In this game, ChatGPT starts a dangerous kingside attack and gains a lot of material from it. By move 33, ChatGPT is up two queens and a rook and will be checkmating its opponent in just a few more moves, but it’s at the end of its rope (33 moves is a lot) and now wants to play the illegal move 33...Qxd2+, capturing its own queen. Re-rolling this response doesn’t help. (In general, I haven’t cherrypicked or re-rolled in any of these games, except when explicitly noted.)

Playing against Lichess Stockfish Level 3

Lichess game, ChatGPT conversation link

Stockfish level 3 has an Elo of around 1050. Stockfish is playing white and ChatGPT is playing black. In this game, things get messy right out of the opening. ChatGPT believes itself to be delivering checkmate on move 13 with 13...Qe2+, not noticing that white’s queen on e6 can capture backwards (a very human-like mistake). The game continues until move 20 with even material, whereupon ChatGPT wants to make the illegal move 20...Rxg2, moving its rook through its own pawn (a much less human-like mistake). Re-rolling this response doesn’t help.

ChatGPT-3.5, without the magic prompt

Playing against me

Lichess study, ChatGPT conversation link

I prompt ChatGPT in a more normal conversational style, and play an unconventional opening to get ChatGPT out of its comfort zone. Without the magic prompt, ChatGPT performs very poorly, being unable to produce a legal move by move 8.

Playing against Lichess Stockfish level 1

Lichess game, ChatGPT conversation link

Again prompting ChatGPT in a conversational style, ChatGPT becomes unable to make a legal move by move 14 (and on move 10, makes another minor error).

The difference here is striking. It’s fairly clear to me that ChatGPT-3.5 only displays careful knowledge of the game’s rules when prompted with a specialized prompt, and is relying only on opening memory and general patterns when no specialized prompt is used.

ChatGPT-4

Interestingly, I began this post with games against GPT-4, having remembered from trying months ago that GPT-4 played legal chess with the prompt but not without it. But when I tried again recently, I discovered that ChatGPT-4 could play legally for a long time even without it! The difference for GPT-4 is a lot less striking than it is for GPT-3.5. So here are just a few highlights[3]:

Playing against me with the magic prompt

Lichess game, ChatGPT conversation link

This is a cool one—ChatGPT checkmates me in 22 moves after an unconventional opening on my part. ChatGPT subjects me to a long sequence of checks (including a discovered check and a castle-with-check) and eventually checkmates me with 22...Bf8#.

Playing against Stockfish Level 1 with the magic prompt

Lichess game, ChatGPT conversation link

ChatGPT checkmates Stockfish Level 1 in 25 moves. This one’s mainly notable for ChatGPT’s correct use of en passant on move 7.

Playing against Stockfish Level 3 with the magic prompt

Lichess game, ChatGPT conversation link

ChatGPT gets itself a couple of pieces up against Stockfish Level 3, but from move 29 onward it starts hallucinating continuations for both itself and its opponent, rather than giving only its own move.
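If one wanted to automate games like this, a reply containing extra continuations could be trimmed down to its first move. The sketch below is a hypothetical helper (`first_move` and `SAN_MOVE` are names invented here, not something I used for these games). It only recognizes the shape of a move in algebraic notation and says nothing about whether the move is legal on the board.

```python
import re
from typing import Optional

# Rough pattern for one move in standard algebraic notation (SAN).
# It matches the textual shape only; it does not check legality.
SAN_MOVE = re.compile(
    r"(?<![A-Za-z0-9])"                 # not in the middle of a word or number
    r"(?:O-O(?:-O)?"                    # castling, kingside or queenside
    r"|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:=[QRBN])?)"  # piece and pawn moves
    r"[+#]?"                            # optional check or checkmate marker
)


def first_move(reply: str) -> Optional[str]:
    """Return the first SAN-looking token in a model reply, or None if absent."""
    match = SAN_MOVE.search(reply)
    return match.group(0) if match else None


# A made-up reply that continues the game for both sides.
print(first_move("29... Rxe4 30. Qd2 Rae8"))
```

A legality check would still need a real chess library or engine on top of this, since a well-formed move such as 33...Qxd2+ can still be illegal in the actual position.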

Playing against me without the magic prompt

Lichess study, ChatGPT conversation link

ChatGPT plays a good game against me and checkmates me in 24 moves, including a nice discovered check on move 20, despite not having the magic prompt in this one and making conversation with me throughout the game. In this game, though, it should be noted that although ChatGPT checkmates me, it fails to recognize that it has done so, even after I ask it what move I should make.

Speculations about the causes of improvement as a result of the prompt

I’d guess that ChatGPT-3.5 performs relatively better with the prompt than without because of the entire game score being provided at each step of the conversation; when the whole score is provided, it presumably better matches the chess game scores it has seen in its training and has learned to predict. The chess scores in its training probably mostly don’t have surrounding commentary and aren’t broken up between two halves of a conversation.

One question I would find very interesting to investigate is whether the network stores a representation of the then-current (i.e. incomplete) state of the game at each token in the chess game score. I suspect that it does, but it’s unfortunately difficult to prove, given that only ChatGPT seems capable of playing chess (I tried with both Claude and Llama-2 13B, but both proved completely unable to play legal chess, with or without the magic prompt).

If it were the case that it’s storing intermediate board states in the longer scores, and that this is in fact responsible for the better performance that we see in GPT-3.5 with the magic prompt than without it, this could be a cool example of something analogous to filler tokens being shown to work. (These aren’t exactly filler tokens as discussed in the linked post, since they carry information about the problem the chatbot is solving, but because they’re redundant with what was already said in the conversation, I think they’re at least similar in concept.)

What are some other examples of hidden capabilities of LLMs that are only elicited if the user prompts in a non-obvious way? Chess is an interesting one, but it’s unfortunate that the game is so complex and that the phenomenon can’t be observed on open-source models to my knowledge, making it hard to study more deeply.

  [1] “Score” here is jargon for “a record of the game”.

  [2] These Elo ratings for early Stockfish levels are super approximate, with different sources claiming different ratings. I’m using these because they seem about right to me, but these ratings shouldn’t be taken as anywhere near exact.

  [3] With the ChatGPT-3.5 games, I’ve shown every game I played with it. With these, I’m only showing a subset. I tried not to cherrypick and to be representative of its overall performance, but there’s no ironclad promise I didn’t cherrypick for the ChatGPT-4 games.