However, as it happens, the rules of Othello are quite simple, and the rules a human would infer from watching even a rather small number of games are the correct rules.
I suspect that if you watched small children play, who have not played many board games, this would not be the case: there would still be countless rule sets consistent with what they saw. Mathematical/esthetic valuations of simplicity are learned, not innate or a priori, for either humans or Transformers.
That can’t be a matter of successfully learning rules that somehow restrict you to only playing good moves.
Sure it can. Problems with stochastic sampling aside (‘sampling can prove the presence of knowledge but not the absence’), this is something we learned back with our chess GPT-2 work: predicting solely legal moves based on a history of moves is actually quite difficult because your non-recurrent Transformer model needs to reconstruct the board state each time from scratch by internally replaying the game. It wasn’t playing chess so much as blindfold blitz chess.
If you make any errors in the internal board state reconstruction, then you can easily make a move which you think is legal, and which would in fact be legal given your reconstruction, but which is not legal on the actual board. (Note the paper's mention of substantial error in attempting to extract a board state from the model.) So it's entirely possible that when they test legal move prediction in a particular board state by feeding in a history (pg. 4) which would lead to that board state (rather than feeding in the board state itself), you are seeing 100% correct rule learning, and that 0.01-0.02% error is just the board state errors tripping up the choice of move.
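To make the failure mode concrete, here is a minimal sketch using the python-chess library (not the actual probing code from the paper or the GPT-2 experiments; the opening line and the "misremembered bishop" corruption are invented for illustration). A single slip in the reconstructed state yields a move that is perfectly legal on the model's internal board but illegal on the real one:

```python
import chess

# Replay a short Ruy Lopez line to get the "true" board state.
history = ["e4", "e5", "Nf3", "Nc6", "Bb5", "a6"]
true_board = chess.Board()
for san in history:
    true_board.push_san(san)

# Faulty internal reconstruction: the model replays the same game but
# misremembers the Ruy Lopez (Bb5) as an Italian (Bc4), so its copy of
# the board has the white bishop on the wrong square.
faulty_board = true_board.copy()
faulty_board.remove_piece_at(chess.B5)
faulty_board.set_piece_at(chess.C4, chess.Piece(chess.BISHOP, chess.WHITE))

# Bxf7+ is perfectly legal on the reconstructed board...
move = chess.Move.from_uci("c4f7")
print(faulty_board.is_legal(move))  # True
# ...but illegal on the actual board, where c4 is empty.
print(true_board.is_legal(move))    # False
```

The point is that the emitted move is not evidence of a missing rule: given the (wrong) internal state, the rules were applied correctly.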
Our own conclusion was that since we didn't really need the chess notation to be as compact as PGN (games were always much smaller than the GPT-2 context window), we shouldn't train solely on PGN (ie. [action]) but on (move, FEN) pairs (ie. [(action, state)]), to get better performance and a better idea of what GPT-2 was learning in chess, unaffected by its limitations in state reconstruction (which presumably reflected uninteresting things like how deep the arch is and thus how many serial steps it could compute before running out of time). You'll notice that Decision Transformer work usually doesn't try to do open-loop 'blind' DRL agents, like OP does, but trains the DT on (action, state) pairs.
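For illustration, here is a rough sketch (again using python-chess; the exact tokenization and delimiters of the original experiments aren't specified here) of what the alternative data format looks like: instead of serializing the bare move list, interleave each move with the FEN of the resulting position, so the model can read the state off the context instead of reconstructing it internally:

```python
import io
import chess.pgn

pgn_text = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 *"

game = chess.pgn.read_game(io.StringIO(pgn_text))
board = game.board()

pairs = []
for move in game.mainline_moves():
    san = board.san(move)             # the action, in standard notation
    board.push(move)
    pairs.append((san, board.fen()))  # the state resulting from the action

# Training text becomes "e4 <FEN> e5 <FEN> Nf3 <FEN> ..."
# rather than the bare "e4 e5 Nf3 ..." of PGN.
training_line = " ".join(f"{san} {fen}" for san, fen in pairs)
print(training_line)
```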
EDIT: an additional issue with GPT-3/GPT-4 chess examples is that GPT may be completely capable of learning/playing chess, and have done so, but have an ineradicable error rate, not due to BPEs but due to random glitches from internal self-attention sparsity; if this is the case, then the chess failures would be a slightly interesting footnote about design/capabilities, but irrelevant to broader discussions about 'what can LLMs learn' or 'do they have world models', etc.
I take it that by “if you watched small children play” you mean that the small children are the ones watching the games and inferring the rules. You might be right; it would be an interesting experiment to try. I think it might be difficult to distinguish “they haven’t yet acquired the priors I have about what constitutes simplicity and elegance in the ruleset of a game” from “they just aren’t very good at thinking because their brains are still developing”, though.
I have the feeling that we are talking past one another a bit as regards what’s going on with the “championship” training set; much of what you write seems like it’s arguing against something I wasn’t intending to say, or for something I wasn’t intending to deny. In any case, my current theory is that the difference has much more to do with there being 14x more games in the “synthetic” set than in the “championship” set than with either (1) the latter having less diversity of positions (only ones that good players get into) or (2) the latter having less diversity of moves in any given position (only ones that good players would make).
I bet you’re right that if you want a transformer to learn to play chess well you should give it the board state on every move. That wouldn’t have been appropriate for the work we’re discussing, though, since the whole point was to determine whether a transformer trained only on the moves will learn to have an internal representation of the board state, which in turn is suggestive of whether a much larger transformer trained only on text will learn to have an internal representation of the world that the text is about.
Sure, I'm not saying they should've done that instead; they could have done it in addition, but probably they didn't have the time/energy. My point is just that the illegal-move error rate is ambiguous if you (gjm) are interested in whether it has perfectly learned the rules (which is different from what the authors are going after), because there are sources of error beyond "it has failed to learn the rules", like errors reconstructing the board state leading to misapplication of potentially-perfectly-learned rules. To my eyes, a legal move error rate as low as 0.01% in this setup, given the burden of state reconstruction in an unnatural and difficult way, strongly suggests it's actually doing a great job of learning the rules. I predict that if you set it up in a way which more narrowly targeted rule learning (eg behavior cloning: just mapping full game state->expert-action, no history at all; see the sketch below), you would find that its illegal move rate would approach 0% much more closely, and you'd have to find some really strange edge-cases like my chess promotion examples to trip it up (at which point one would be satisfied, because how would one ever learn those unobserved things offline without priors?).
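A hedged sketch of that behavior-cloning evaluation, to be concrete: the model sees only the full board state (a FEN string), no move history, and we score how often its predicted move fails to parse as legal in that state. `predict_move` here is a hypothetical stand-in for whatever model is under test:

```python
import chess

def illegal_move_rate(positions, predict_move):
    """positions: iterable of FEN strings; predict_move: FEN -> SAN string."""
    errors = 0
    total = 0
    for fen in positions:
        board = chess.Board(fen)
        total += 1
        try:
            # parse_san raises (a subclass of ValueError) if the
            # predicted move is not legal in this exact position.
            board.parse_san(predict_move(fen))
        except ValueError:
            errors += 1
    return errors / total
```

With a pure state->action mapping like this, the claim is that the measured rate isolates rule learning: any residual illegal moves can no longer be blamed on a botched internal replay of the history.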
I agree that the network trained on the large random-game dataset shows every sign of having learned the rules very well, and if I implied otherwise then that was an error. (I don’t think I ever intended to imply otherwise.)
The thing I was more interested in was the difference between that and the network trained on the much smaller championship-game dataset, whose incorrect-move rate is much much higher—about 5%. I’m pretty sure that either (1) having a lot more games of that type would help a lot or (2) having a bigger network would help a lot or (3) both; my original speculation was that 2 was more important but at that point I hadn’t noticed just how big the disparity in game count was. I now think it’s probably mostly 1, and I suspect that the difference between “random games” and “well played games” is not a major factor, and in particular I don’t think it’s likely that seeing only good moves is leading the network to learn a wrong ruleset. (It’s definitely not impossible! It just isn’t how I’d bet.)
Vaniver’s suggestion was that the championship-game-trained network had learned a wrong ruleset on account of some legal moves being very rare. It doesn’t seem likely to me that this (as opposed to 1. not having learned very well because the number of games was too small and/or 2. not having learned very well because the positions in the championship games are unrepresentative) is the explanation for having illegal moves as top prediction 5% of the time.
It looked as if you were disagreeing with that, but the arguments you’ve made in support all seem like cogent arguments against things other than what I was intending to say, which is why I think that at least one of us is misunderstanding the other.
In particular, at no point was I saying anything about the causes of the nonzero but very small error rate (~0.01%) of the network trained on the large random-game dataset, and at no point was I saying that that network had not done an excellent job of learning the rules.