In chess, which I find to be a useful test of LLM capability because (a) LLMs are not designed to do this and (b) playing well beyond the opening requires precision and reasoning, I would say GPT4 is roughly at least weak, possibly intermediate club player level now. This is based on one full game, where it played consistently well except for making a mistake in the endgame that I think a lot of club players would also have made.
It seems better at avoiding blunders than Bing, which could be due to modifications for search/search-related prompting in Bing. Or it could be random noise and more test games would show average level to be weaker than the reported first impression.
I’ve found its ability to be much better as well. In contrast to GPT-3, which often seemed unable to keep track of the board state and made illegal moves toward the end of a game, it not only played legal moves, it actually mated me. Granted, I’m a terrible player, and I deliberately avoided reading ahead to see whether it could mate a weak opponent. My method was to tell it I wanted to play and then give my move in algebraic notation. It would respond with a move, and I would reply with another. After it beat me, I asked it to list all the moves we had played. There were some minor errors in its notation (I think), which I’ve noted below.
Of course! Here’s a summary of our game in standard algebraic notation:
d4 Nf6
c3 d5
f3 c5
dxc5 e6
Bf4 Bxc5
Nd2 Nc6
e4 dxe4
fxe4 e5
Bg5 Be7
Bb5 O-O
Nf3 Qc7
O-O Bg4
Nc4 Rad8
Qe1 a6
Bxc6 bxc6
Rd1 Rxd1
Qxd1 Nxe4
Bxe7 Qxe7
Nfxe5 Qc5+
Rf2 Nxf2
Qxf1 Nd3+ (Here GPT-4 made a slight error, I played Qf1 but there was no capture)
Kh1 Nxe5
Nxe5 Qxe5
h4 Rd8
Qxa6 Rd1+
Qxf1 Rxf1+ (Here it checkmates me, though doesn’t mark the move as such)
How did you play? Just SAN?
I am using the following prompt:
“We are playing a chess game. At every turn, repeat all the moves that have already been made. Find the best response for Black. I’m White and the game starts with 1.e4
So, to be clear, your output format should always be:
PGN of game so far: …
Best move: …
and then I get to play my move.”
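The output format above maps naturally onto a small driver loop. Here is a minimal sketch of one, where `ask_model` would be a stand-in for the actual chat call (not shown, since the poster played through the chat interface); the helper names and the parsing of the two-line format are my own assumptions:

```python
import re

def to_pgn(moves):
    """Render a list of SAN moves as numbered PGN movetext."""
    out = []
    for i, move in enumerate(moves):
        if i % 2 == 0:                      # White's move: prefix the move number
            out.append(f"{i // 2 + 1}. {move}")
        else:                               # Black's move: no number needed
            out.append(move)
    return " ".join(out)

def build_prompt(moves):
    """Rebuild the full instruction every turn, repeating the game history
    so the model re-derives the position instead of relying on memory."""
    return (
        "We are playing a chess game. At every turn, repeat all the moves "
        "that have already been made. Find the best response for Black.\n"
        f"PGN of game so far: {to_pgn(moves)}\n"
        "Your output format: 'PGN of game so far: ...' then 'Best move: ...'"
    )

def parse_reply(reply):
    """Pull the suggested move out of the 'Best move: ...' line."""
    match = re.search(r"Best move:\s*(\S+)", reply)
    if match is None:
        raise ValueError("reply did not follow the agreed output format")
    return match.group(1)
```

The point of rebuilding the prompt from the full move list each turn is exactly the prompt-engineering trick described here: the model reconstructs the board state from the explicit history rather than from its own earlier outputs.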
With ChatGPT pre-GPT4 and Bing, I also added the fiction that it could consult Stockfish (or Kasparov, or someone else known to be strong), which seemed to help it make better moves. GPT4 does not seem to need this, and rightly pointed out that it does not have access to Stockfish when I tried the Stockfish version of the prompt.
For ChatGPT pre-GPT4, the very strict instructions above made it able to play reasonable, full games, which was not possible when just exchanging single moves in algebraic notation. I have not tested whether this still makes a difference with GPT4.
On the rare occasions where it gets the history of the game wrong or suggests an illegal move, I regenerate the response or reprompt with the game history so far. I accept all legal moves made with correct game history as played.
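This regenerate-and-reprompt policy can be expressed as a small retry loop. In this sketch, `ask_model` and `legal_moves` are hypothetical stand-ins (in practice a chess library such as python-chess could supply the legal-move set for the current position):

```python
def get_legal_move(ask_model, legal_moves, history, max_retries=3):
    """Ask the model for a move, accepting the first suggestion that is
    legal in the current position; regenerate on illegal output."""
    for _ in range(max_retries):
        move = ask_model(history)   # stand-in for the chat call, returns a SAN move
        if move in legal_moves:     # legality check against an engine or library
            return move
        # illegal or garbled: try again, reprompting with the full history
    raise RuntimeError("no legal move after retries; reprompt manually")
```

Accepting the first legal move with a correct game history, as described above, corresponds to returning on the first `if` hit rather than sampling further.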
I’ve collected all of my test games in a lichess study here:
https://lichess.org/study/ymmMxzbj
Ahh, I should have thought of having it repeat the history! Good prompt engineering. Will try it out. The gpt4 gameplay in your lichess study is not bad!
I tried by just asking it to play and use SAN. I had it explain its moves, which it did well, and it also commented on my (intentionally bad) play. It quickly made a mess of things, though, clearly losing track of the board state (to the extent it’s “tracking” it at all; it’s really hard to say exactly how it plays past a common opening) even though the history should’ve been in the context window.
I don’t know how they did it, but I played a chess game against GPT4 by saying the following:
“I’m going to play a chess game. I’ll play white, and you play black. On each chat, I’ll post a move for white, and you follow with the best move for black. Does that make sense?”
And then going through the moves 1-by-1 in algebraic notation.
My experience largely follows GoteNoSente’s. I played one full game that lasted 41 moves, and all of GPT4’s moves were reasonable. It did make one invalid move when I forgot to include the move number before my move (e.g. Ne4 instead of 12. Ne4), but it corrected itself once I included the number. Also, I think it was better in the opening than in the endgame, probably because of the large number of similar openings in its training data.
Interesting. I tried the same experiment on ChatGPT, and it didn’t seem able to keep an accurate representation of the current game state; it consistently made moves that were blocked by other pieces.