Epistemic status: This is an off-the-cuff question.
~5 years ago there was a lot of exciting progress on game playing through reinforcement learning (RL). Now we have basically switched paradigms: pretraining massive LLMs on ~the internet and then apparently doing some fairly trivial, unsophisticated RL on top of that. This has been successful and highly popular, because interacting with LLMs is pretty awesome (at least if you haven’t done it before) and they “feel” a lot more like AGI. There’s probably somewhat more commercial use as well via code completion (some would say many other tasks too, though personally I’m not really convinced; generative image/video models will certainly be profitable, however). There’s also a sense in which LLMs are clearly more general: one RL algorithm may learn many games, but there’s typically a separately trained instance per game rather than one integrated agent, whereas you can just ask an LLM in context to play some games (see the sketch below).
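(For concreteness, this is the kind of in-context game playing I mean. A minimal sketch using the OpenAI Python client; the model name, prompt, and board encoding are just illustrative placeholders, not a claim about any particular setup:)

```python
# Minimal sketch of "just ask an LLM to play a game" via in-context prompting.
# Assumes the OpenAI Python client; model name and prompt are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

board = (
    "X | . | O\n"
    ". | X | .\n"
    ". | . | O"
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder choice; any chat model would do here
    messages=[
        {
            "role": "system",
            "content": "You are playing tic-tac-toe as X. "
                       "Reply with your move only, as row,col (0-indexed).",
        },
        {"role": "user", "content": f"Current board:\n{board}\nYour move?"},
    ],
)
print(response.choices[0].message.content)  # e.g. "2,0"
```

No per-game training happens at all: the rules, the state, and the output format are specified entirely in the prompt, which is exactly the sense in which the LLM is “more general” than one-network-per-game RL agents.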
However, I’ve been following moderately closely and I can’t think of any examples where LLMs really pushed the state of the art in narrow game playing. How much have LLMs contributed to RL research? For instance, would adding o3 to the stack easily stomp on previous StarCraft / Go / chess agents?
Meta’s Diplomacy AI (CICERO) is a clear example of how adding LLMs can improve narrow game playing. Most multiplayer games with communication will benefit in the same way.
Yes, after asking the question I realized Diplomacy would be the most likely answer. I don’t find it very satisfying, though, because Diplomacy is a text/vibes-based game: it wouldn’t have been possible to approach effectively at all without building some kind of chatbot, so it’s exactly the type of game where I’d expect LLMs to make progress even without pushing the frontier on strategy/planning.
In StarCraft II, adding LLMs (to do or aid game-time thinking) will not help the agent in any way, I believe. That’s because inference has quite large latency, especially as most of the prompt changes with all the units moving, so tactical moves are out; and strategic questions like “what is the other player building” and “how many units do they already have” are better answered by counting visible units (card-counting style) and inferring what proportion of the remaining resources has been spent (or by scouting if possible). I guess it’s possible that bots’ algorithms could be improved with LLMs, but that requires a high-quality insight, and I’m not convinced that o1 or o3 give such insights.
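(To make the “count what you see, infer the rest” heuristic concrete, here’s a toy sketch. All the mining rates and unit costs below are made-up placeholders, not real StarCraft II balance numbers:)

```python
# Toy sketch of the "count visible units, infer hidden ones" heuristic.
# All constants are illustrative placeholders, not real SC2 balance values.

MINERALS_PER_WORKER_PER_MIN = 60           # assumed mining rate
UNIT_COST = {"marine": 50, "marauder": 125}  # assumed mineral costs

def estimate_hidden_army(minutes_elapsed, scouted_workers,
                         visible_units, assumed_mix=("marine", "marauder")):
    """Estimate how many units the opponent built that we haven't seen.

    Income inferred from the scouted worker count, minus the cost of units
    already observed, bounds the opponent's unseen production.
    """
    est_income = scouted_workers * MINERALS_PER_WORKER_PER_MIN * minutes_elapsed
    spent_on_visible = sum(UNIT_COST[u] for u in visible_units)
    unseen_budget = max(0, est_income - spent_on_visible)
    avg_cost = sum(UNIT_COST[u] for u in assumed_mix) / len(assumed_mix)
    return int(unseen_budget // avg_cost)

# e.g. 8 minutes in, 16 workers scouted, 6 marines seen so far:
print(estimate_hidden_army(8, 16, ["marine"] * 6))
```

The point is that this kind of bookkeeping is cheap arithmetic over the game state, runs in microseconds, and never stalls on a multi-second inference call.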
Ma et al. 2023 is relevant here.
That article is suspiciously scarce on what actually microcontrols the units… well, glory to LLMs for decent macro management, then! (Though I believe that capability is still easier to get without text-based neural networks.)