Games can be quite complicated. Consider chess: how many grandmaster vs grandmaster games of chess would you have to watch offline before you observed pawns being promoted to not just queens, but rooks, bishops, and knights (and observed it enough times to be certain that pawns couldn’t be promoted to anything else, such as pawns or kings, or to the opposite color; that no other piece could be promoted; that promotion is not just a good idea but mandatory; and that it could happen only on the last rank)? I’m going to predict that it would take much more than 1,000 games. And if you miss any of those wrinkles, however rarely they come up in actual play (their rarity is, after all, exactly why you failed to learn them), then a hard-nosed critic would be justified in saying you have failed to learn some aspect of ‘chess’.
To learn this in a small sample size, you need to either have informative priors from knowing how to play other board games which have promotion mechanics, priors from something like pretraining on natural language describing chess rules, or be doing explicit exploration with online RL (eg MuZero) where you can try to not promote a pawn and discover that’s illegal etc.
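To put a rough number on the sample-size worry, here is a minimal sketch. It assumes, purely for illustration, that a rare event (say, an underpromotion to a bishop) shows up independently in each game with a fixed probability; the rate of one game in 5,000 is a made-up placeholder, not a figure measured from any game database.

```python
import math


def prob_never_observed(p_per_game: float, n_games: int) -> float:
    """Chance that an event with per-game probability p never appears in n independent games."""
    return (1.0 - p_per_game) ** n_games


def games_needed(p_per_game: float, confidence: float) -> int:
    """Smallest n for which the event is seen at least once with the given probability."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_per_game))


# Hypothetical rate (an assumption, not data): one game in 5,000 contains the event.
p = 1 / 5000
miss_1000 = prob_never_observed(p, 1000)   # chance of never seeing it in 1,000 games
n95 = games_needed(p, 0.95)                # games needed to see it once with 95% probability

print(f"P(never seen in 1,000 games) = {miss_1000:.2f}")
print(f"games for a 95% chance of one sighting = {n95}")
```

Under that assumed rate you would most likely not see the event even once in 1,000 games. This only covers seeing a legal move at least once; the negative rules (no promotion to a king, no promotion of other pieces) can never be positively observed offline at all, which is where the priors come in.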
Once upon a time, the rules of chess didn’t make it explicit that you have to promote to a piece of your own colour, hence the amusing problem that’s first under the heading “Offbeat interpretations of the rules of chess” at https://en.wikipedia.org/wiki/Joke_chess_problem.
However, as it happens, the rules of Othello are quite simple, and the rules a human would infer from watching even a rather small number of games are the correct rules. Again, it’s not in dispute that other rulesets are possible given what one could learn from watching thousands of high-level games, but I don’t think there’s any real risk that a human would pick the wrong one. That doesn’t mean that GPT wouldn’t, of course, and I agree that Vaniver’s hypothesis is plausible, but it’s not clear to me how likely it actually is.
Note that the figure the paper reports isn’t “how accurately do its predictions distinguish legal from illegal moves”, exactly, it’s “how often is the top prediction legal?”. So the failure mode when it learns from championship games is that sometimes it predicts illegal moves. That can’t be a matter of successfully learning rules that somehow restrict you to only playing good moves.
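To make the distinction between the two figures concrete, here is a small sketch. The function names, the toy scores, and the 0.1 threshold are all my own illustrative assumptions, not anything taken from the paper.

```python
from typing import Dict, List, Set


def top1_illegal_rate(scores: List[Dict[str, float]], legal: List[Set[str]]) -> float:
    """Fraction of positions whose single highest-scoring move is illegal."""
    illegal = 0
    for position_scores, legal_set in zip(scores, legal):
        best = max(position_scores, key=position_scores.get)
        if best not in legal_set:
            illegal += 1
    return illegal / len(scores)


def legality_classification_accuracy(
    scores: List[Dict[str, float]], legal: List[Set[str]], threshold: float = 0.1
) -> float:
    """Treat score >= threshold as 'predicted legal' and grade every (position, move) pair."""
    correct = 0
    total = 0
    for position_scores, legal_set in zip(scores, legal):
        for move, score in position_scores.items():
            correct += (score >= threshold) == (move in legal_set)
            total += 1
    return correct / total


# Toy position: a model that has learned "only play the good move" puts nearly
# all of its mass on c4 and almost none on the other legal moves d3 and e6.
scores = [{"c4": 0.97, "d3": 0.01, "e6": 0.01, "a1": 0.01}]
legal = [{"c4", "d3", "e6"}]

print(top1_illegal_rate(scores, legal))                 # top prediction is legal
print(legality_classification_accuracy(scores, legal))  # but legal/illegal sorting is poor
```

A model that only ever proposes good moves does badly on the second measure and perfectly on the first, so over-restrictive rules alone would not produce illegal top predictions.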
So I think something along the lines of “not enough network capacity to learn legality as well as good strategy” is still pretty plausible; but now I think about it some more, there’s also a simpler explanation: the synthetic training set is much larger than the championship training set, so maybe it just needs more training to learn not to make illegal moves. It’s unfortunate that the paper doesn’t try the experiment of training with a synthetic training set the same size as the championship training set.
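The control I have in mind would look something like the sketch below: draw a random subset of the synthetic games that matches the championship set in size, then train on that. The helper name and the stand-in game IDs are hypothetical, and nothing like this is reported in the paper.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def size_matched_subset(large: Sequence[T], target_size: int, seed: int = 0) -> List[T]:
    """Uniformly sample `target_size` items without replacement, reproducibly."""
    if target_size > len(large):
        raise ValueError("target_size exceeds the size of the larger dataset")
    rng = random.Random(seed)
    return rng.sample(large, target_size)


# Stand-in data: integers play the role of game records (purely illustrative sizes).
synthetic_games = list(range(1000))
championship_size = 70
control_set = size_matched_subset(synthetic_games, championship_size, seed=42)
print(len(control_set))
```

If the illegal-move rate on such a size-matched synthetic set came out close to the championship figure, that would point to dataset size rather than the kind of games.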
(Speaking of which, I have just noticed that I misread something; the synthetic training set is 20M games, not 4M as I claimed. I’ll edit the OP to fix this.)
“However, as it happens, the rules of Othello are quite simple, and the rules a human would infer from watching even a rather small number of games are the correct rules.”
I suspect that if you watched small children play, who have not played many board games, this would not be the case. There will still be countless rule sets consistent with the games they saw. Mathematical/esthetic valuations of simplicity are learned, not innate or a priori to either humans or Transformers.
“That can’t be a matter of successfully learning rules that somehow restrict you to only playing good moves.”
Sure it can. Problems with stochastic sampling aside (‘sampling can prove the presence of knowledge but not the absence’), this is something we learned back with our chess GPT-2 work: predicting solely legal moves based on a history of moves is actually quite difficult because your non-recurrent Transformer model needs to reconstruct the board state each time from scratch by internally replaying the game. It wasn’t playing chess so much as blindfold blitz chess.
If you make any errors in the internal board state reconstruction, then you can easily make what you think is a legal move, and would in fact be a legal move given your reconstruction, but is not a legal move on the actual board. (Note the mention of substantial error in attempting to extract a board state from the model.) So it’s entirely possible that when they test the legal move prediction in a particular board state by feeding in a history (pg4) which would lead to that board state (rather than the board state itself), you are seeing 100% correct rule learning, and that the 0.01-0.02% error arises when board state errors trip up the choice of move.
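Here is a minimal sketch of that failure mode in Othello. The rules code is a bare-bones implementation written for this comment (no passes, no game-end handling), and the "missed flip" is a simulated error of my own choosing, not something extracted from the actual model.

```python
from typing import Dict, List, Set, Tuple

Square = Tuple[int, int]      # (row, col), 0-indexed from the top-left corner
Board = Dict[Square, str]     # occupied squares only, "B" or "W"
DIRS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]


def start_board() -> Board:
    """Standard opening position: d4/e5 White, e4/d5 Black."""
    return {(3, 3): "W", (3, 4): "B", (4, 3): "B", (4, 4): "W"}


def flips(board: Board, sq: Square, player: str) -> List[Square]:
    """Opponent discs flipped by placing `player` on `sq`; an empty list means illegal."""
    if sq in board:
        return []
    other = "W" if player == "B" else "B"
    flipped: List[Square] = []
    for dr, dc in DIRS:
        run: List[Square] = []
        r, c = sq[0] + dr, sq[1] + dc
        while 0 <= r < 8 and 0 <= c < 8 and board.get((r, c)) == other:
            run.append((r, c))
            r, c = r + dr, c + dc
        # The run only counts if it is capped by one of the player's own discs.
        if run and 0 <= r < 8 and 0 <= c < 8 and board.get((r, c)) == player:
            flipped.extend(run)
    return flipped


def legal_moves(board: Board, player: str) -> Set[Square]:
    return {(r, c) for r in range(8) for c in range(8) if flips(board, (r, c), player)}


def replay(moves: List[Square]) -> Tuple[Board, str]:
    """Rebuild the board from a move history, Black first (passes omitted for brevity)."""
    board, player = start_board(), "B"
    for sq in moves:
        flipped = flips(board, sq, player)
        if not flipped:
            raise ValueError(f"illegal move in history: {sq}")
        board[sq] = player
        for f in flipped:
            board[f] = player
        player = "W" if player == "B" else "B"
    return board, player


# True state after Black plays d3, which flips d4 to Black. White is to move.
true_board, to_move = replay([(2, 3)])

# Simulated reconstruction error: the model "forgot" to flip d4, so it still sees White there.
wrong_board = dict(true_board)
wrong_board[(3, 3)] = "W"

# Moves that are perfectly legal on the mistaken board but illegal on the real one.
phantom = legal_moves(wrong_board, to_move) - legal_moves(true_board, to_move)
print(sorted(phantom))
```

The rule-checking function is identical in both calls; only the board differs. A single wrong square is enough to produce moves that a flawless rule-follower would consider legal.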
Our own conclusion was that since we didn’t really need the chess notation to be as compact as PGN (games were always much smaller than the GPT-2 context window), we shouldn’t train solely on PGN (i.e. [action]) but on (move,FEN) pairs (i.e. [(action,state)]), to get better performance and a better idea of what GPT-2 was learning in chess unaffected by its limitations in state reconstruction (which presumably reflected uninteresting things like how deep the architecture is and so how many serial steps it could compute before running out of time). You’ll notice that Decision Transformer work usually doesn’t try to do open-loop ‘blind’ DRL agents, like OP does, but trains the DT on (action,state) pairs.
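For concreteness, the two data formats look roughly like this. To keep the sketch dependency-free I use tic-tac-toe as a stand-in for chess, with a 9-character board string playing the role of FEN; the function names and the separator are illustrative assumptions, not our actual preprocessing.

```python
from typing import List


def action_only(moves: List[int]) -> str:
    """[action] format: the model has to replay the game internally to know the board."""
    return " ".join(str(m) for m in moves)


def action_state(moves: List[int]) -> str:
    """[(action, state)] format: every move is followed by the board it produces."""
    board = ["."] * 9
    parts = []
    for i, move in enumerate(moves):
        board[move] = "X" if i % 2 == 0 else "O"   # X moves first, players alternate
        parts.append(f"{move} {''.join(board)}")
    return " | ".join(parts)


game = [4, 0, 8]   # centre, top-left corner, bottom-right corner
print(action_only(game))
print(action_state(game))
```

In the second format the current state is always in context, so errors in the next move can no longer be blamed on state reconstruction.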
EDIT: an additional issue with GPT-3/GPT-4 chess examples is that GPT may be completely capable of learning/playing chess, and have done so, but have an ineradicable error rate, not due to BPEs but due to random glitches from internal self-attention sparsity; if this is the case, then the chess failures would be a slightly-interesting footnote for design/capabilities, but it would render them irrelevant to broader discussions about ‘what can LLMs learn’ or ‘do they have world models’ etc.
I take it that by “if you watched small children play” you mean that the small children are the ones watching the games and inferring the rules. You might be right; it would be an interesting experiment to try. I think it might be difficult to distinguish “they haven’t yet acquired the priors I have about what constitutes simplicity and elegance in the ruleset of a game” from “they just aren’t very good at thinking because their brains are still developing”, though.
I have the feeling that we are talking past one another a bit as regards what’s going on with the “championship” training set; much of what you write seems like it’s arguing against something I wasn’t intending to say, or for something I wasn’t intending to deny. In any case, my current theory is that the difference has much more to do with there being 14x more games in the “synthetic” set than in the “championship” set than with either (1) the latter having less diversity of positions (only ones that good players get into) or (2) the latter having less diversity of moves in any given position (only ones that good players would make).
I bet you’re right that if you want a transformer to learn to play chess well you should give it the board state on every move. That wouldn’t have been appropriate for the work we’re discussing, though, since the whole point was to determine whether a transformer trained only on the moves will learn to have an internal representation of the board state, which in turn is suggestive of whether a much larger transformer trained only on text will learn to have an internal representation of the world that the text is about.
“That wouldn’t have been appropriate for the work we’re discussing, though, since the whole point was to determine whether a transformer trained only on the moves will learn to have an internal representation of the board state, which in turn is suggestive of whether a much larger transformer trained only on text will learn to have an internal representation of the world that the text is about.”
Sure, I’m not saying they should’ve done that instead. They could have done it in addition, but they probably didn’t have the time/energy. My point is just that the illegal-move error rate is ambiguous if you (gjm) are interested in whether it has perfectly learned the rules (which is different from what the authors are going after), because there are sources of error beyond “it has failed to learn the rules”, like errors reconstructing the board state leading to misapplication of potentially-perfectly-learned rules. To my eyes, an illegal-move error rate as low as 0.01% in this setup, given the burden of state reconstruction in an unnatural and difficult way, strongly suggests it’s actually doing a great job of learning the rules. I predict that if you set it up in a way which more narrowly targeted rule learning (eg behavior cloning: just mapping full game state->expert-action, no history at all), you would find that its illegal move rate would approach 0% much more closely, and you’d have to find some really strange edge-cases like my chess promotion examples to trip it up (at which point one would be satisfied, because how would one ever learn those unobserved things offline without priors).
I agree that the network trained on the large random-game dataset shows every sign of having learned the rules very well, and if I implied otherwise then that was an error. (I don’t think I ever intended to imply otherwise.)
The thing I was more interested in was the difference between that and the network trained on the much smaller championship-game dataset, whose incorrect-move rate is much much higher—about 5%. I’m pretty sure that either (1) having a lot more games of that type would help a lot or (2) having a bigger network would help a lot or (3) both; my original speculation was that 2 was more important but at that point I hadn’t noticed just how big the disparity in game count was. I now think it’s probably mostly 1, and I suspect that the difference between “random games” and “well played games” is not a major factor, and in particular I don’t think it’s likely that seeing only good moves is leading the network to learn a wrong ruleset. (It’s definitely not impossible! It just isn’t how I’d bet.)
Vaniver’s suggestion was that the championship-game-trained network had learned a wrong ruleset on account of some legal moves being very rare. It doesn’t seem likely to me that this (as opposed to 1. not having learned very well because the number of games was too small and/or 2. not having learned very well because the positions in the championship games are unrepresentative) is the explanation for having illegal moves as top prediction 5% of the time.
It looked as if you were disagreeing with that, but the arguments you’ve made in support all seem like cogent arguments against things other than what I was intending to say, which is why I think that at least one of us is misunderstanding the other.
In particular, at no point was I saying anything about the causes of the nonzero but very small error rate (~0.01%) of the network trained on the large random-game dataset, and at no point was I saying that that network had not done an excellent job of learning the rules.