In chess, which I find to be a useful test of LLM capability because (a) LLMs are not designed to do this and (b) playing well beyond the opening requires precision and reasoning, I would say GPT4 is roughly at least weak, possibly intermediate club player level now. This is based on one full game, where it played consistently well except for making a mistake in the endgame that I think a lot of club players would also have made.
It seems better at avoiding blunders than Bing, which could be due to modifications for search/search-related prompting in Bing. Or it could be random noise and more test games would show average level to be weaker than the reported first impression.
I’ve found its ability to be much better as well. In contrast to GPT-3, which often seemed unable to keep track of the board state and made illegal moves toward the end of a game, it not only played legal moves, it actually mated me. Granted, I’m a terrible player, and I deliberately avoided reading ahead to see whether it could mate a weak opponent. My method was to tell it I wanted to play and then give my move in algebraic notation. It would respond with a move, and I would reply with another. After it beat me, I asked it to list all the moves we had played. There were some minor errors in its notation (I think), which I’ve noted below.
Of course! Here’s a summary of our game in standard algebraic notation:
d4 Nf6
c3 d5
f3 c5
dxc5 e6
Bf4 Bxc5
Nd2 Nc6
e4 dxe4
fxe4 e5
Bg5 Be7
Bb5 O-O
Nf3 Qc7
O-O Bg4
Nc4 Rad8
Qe1 a6
Bxc6 bxc6
Rd1 Rxd1
Qxd1 Nxe4
Bxe7 Qxe7
Nfxe5 Qc5+
Rf2 Nxf2
Qxf1 Nd3+ (Here GPT-4 made a slight error, I played Qf1 but there was no capture)
Kh1 Nxe5
Nxe5 Qxe5
h4 Rd8
Qxa6 Rd1+
Qxf1 Rxf1+ (Here it checkmates me, though doesn’t mark the move as such)
How did you play? Just SAN?
I am using the following prompt:
“We are playing a chess game. At every turn, repeat all the moves that have already been made. Find the best response for Black. I’m White and the game starts with 1.e4
So, to be clear, your output format should always be:
PGN of game so far: …
Best move: …
and then I get to play my move.”
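The output format above maps naturally onto a small driver loop. Here is a minimal sketch of one, where `ask_model` would be a stand-in for the actual chat call (not shown, since the poster played through the chat interface); the helper names and the parsing of the two-line format are my own assumptions:

```python
import re

def to_pgn(moves):
    """Render a list of SAN moves as numbered PGN movetext."""
    out = []
    for i, move in enumerate(moves):
        if i % 2 == 0:                      # White's move: prefix the move number
            out.append(f"{i // 2 + 1}. {move}")
        else:                               # Black's move: no number needed
            out.append(move)
    return " ".join(out)

def build_prompt(moves):
    """Rebuild the full instruction every turn, repeating the game history
    so the model re-derives the position instead of relying on memory."""
    return (
        "We are playing a chess game. At every turn, repeat all the moves "
        "that have already been made. Find the best response for Black.\n"
        f"PGN of game so far: {to_pgn(moves)}\n"
        "Your output format: 'PGN of game so far: ...' then 'Best move: ...'"
    )

def parse_reply(reply):
    """Pull the suggested move out of the 'Best move: ...' line."""
    match = re.search(r"Best move:\s*(\S+)", reply)
    if match is None:
        raise ValueError("reply did not follow the agreed output format")
    return match.group(1)
```

The point of rebuilding the prompt from the full move list each turn is exactly the prompt-engineering trick described here: the model reconstructs the board state from the explicit history rather than from its own earlier outputs.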
With ChatGPT pre-GPT4 and Bing, I also added the fiction that it could consult Stockfish (or Kasparov, or someone else known to be strong), which seemed to help it make better moves. GPT4 does not seem to need this, and rightly pointed out that it does not have access to Stockfish when I tried the Stockfish version of the prompt.
For ChatGPT pre-GPT4, the very strict instructions above made it able to play reasonable, full games, which was not possible when just exchanging single moves in algebraic notation. I have not tested whether this still makes a difference with GPT4.
On the rare occasions where it gets the history of the game wrong or suggests an illegal move, I regenerate the response or reprompt with the game history so far. I accept all legal moves made with correct game history as played.
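This regenerate-and-reprompt policy can be expressed as a small retry loop. In this sketch, `ask_model` and `legal_moves` are hypothetical stand-ins (in practice a chess library such as python-chess could supply the legal-move set for the current position):

```python
def get_legal_move(ask_model, legal_moves, history, max_retries=3):
    """Ask the model for a move, accepting the first suggestion that is
    legal in the current position; regenerate on illegal output."""
    for _ in range(max_retries):
        move = ask_model(history)   # stand-in for the chat call, returns a SAN move
        if move in legal_moves:     # legality check against an engine or library
            return move
        # illegal or garbled: try again, reprompting with the full history
    raise RuntimeError("no legal move after retries; reprompt manually")
```

Accepting the first legal move with a correct game history, as described above, corresponds to returning on the first `if` hit rather than sampling further.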
I’ve collected all of my test games in a lichess study here:
https://lichess.org/study/ymmMxzbj
Ahh, I should have thought of having it repeat the history! Good prompt engineering. Will try it out. The gpt4 gameplay in your lichess study is not bad!
I tried by just asking it to play and use SAN. I had it explain its moves, which it did well, and it also commented on my (intentionally bad) play. It quickly made a mess of things, though, clearly losing track of the board state (to the extent it’s “tracking” it at all; it’s really hard to say exactly how it plays past a common opening) even though the history should’ve been in the context window.
I don’t know how they did it, but I played a chess game against GPT4 by saying the following:
“I’m going to play a chess game. I’ll play white, and you play black. On each chat, I’ll post a move for white, and you follow with the best move for black. Does that make sense?”
And then going through the moves 1-by-1 in algebraic notation.
My experience largely follows GoteNoSente’s. I played one full game that lasted 41 moves, and all of GPT4’s moves were reasonable. It did make one invalid move when I forgot to include the move number before my move (e.g. Ne4 instead of 12. Ne4), but it corrected itself once I included the number. Also, I think it was better in the opening than in the endgame, probably because of the large number of similar openings in its training data.
Interesting. I tried the same experiment on ChatGPT, and it didn’t seem able to keep an accurate representation of the current game state; it consistently made moves that were blocked by other pieces.