gwern
You’re probably thinking of the debate over ENCODE. It was a furious debate over what the ENCODE results meant, whether some mere chemical activity proved non-junkness, and whether they even measured the narrow chemical thing they claimed to measure and then based the interpretations on; I didn’t follow it in detail, but my overall impression was that most people were not convinced by the ENCODE claims and continue to regard junk DNA as being pretty junky (or outright harmful, with all the retrotransposons and viruses lurking in it).
Genome synthesis may help answer this in the not-too-distant future: it’s already been used to create ‘minimal organism’ bacterial genomes which are much smaller, and synthetic genomes without the ‘junk DNA’ are appealing because synthesis costs so much that you want to cut corners as much as possible, so empirically proving that the junk DNA doesn’t matter is an obvious and valuable goal.
OpenAI had generated poems in the New Yorker, which suggests they might have had some internal project related to poetry.
I didn’t get that impression when I read it—the NYer author and his friends prompted most of that, even if their friend Dan Selsam happens to work at OpenAI. (He seems to work on math LMs, nothing fiction- or RL-related.) They were set up with the public Playground interface, so the OA insider’s role here was limited to showing them a few completions and trying to explain it; presumably they did the rest remotely and partially on their own. Specifically, some parts of it, like the choice of Shel Silverstein (a far from obvious poet to pick, even if his poetry is beloved by American children), suggest they (like pretty much anyone interested in GPT-3 poetry) read my page for ideas. Also, again, Leike, who’s in charge at OA, denies having done anything poetry-specific or knowing about the apparent capability-gain.
It maybe has a much more subtle version of it.
Yeah, that’s a funny thing about mode collapse, it’s really hard to see, and the higher-quality the outputs get, the harder it’ll be to see with ‘the naked eye’. Who knows every literary genre there is and can patiently prompt them one by one to see which genres a model quietly slides away from & tries to avoid generating text in? Like hands in GANs… It takes a while to begin to notice what you aren’t seeing.
Of course, I’d also expect Claude to be much subtler simply because it’s working off less data and so it’s less likely to have gotten rated text or inputs which would push it towards mode-collapsing on easily-recognized rhyming poetry and to avoid harder-to-understand poetry. (Claude is just the ‘constitutional prompt’ model, right? Hard to see how a list of generic principles would push it towards rhyming-only.)
Does anyone know how much poetry and literary prose is in the pre-training sets aside from stuff in Common Crawl?
OA has been resolutely silent about the composition of the data like Books1/Books2. But it seems safe to say that it would include all the obvious datasets like Project Gutenberg, so there is much more poetry/literary prose available than necessary. Sample size should not be an issue. (Rhyming really is not that complex, if you understand phonetics.)
A GPT-3 mode-collapse example I can’t believe I forgot: writing rhyming poetry!
I and a number of other people were excited by ChatGPT at launch seeming able to do long stretches of flawless rhyming poetry in couplets or quatrains, where the rhyming words were not the hackneyed common pairs you might see in the lyrics of charting pop songs. Hilarious, but extremely surprising. (davinci-002 had done a little bit of this, but not convincingly the way ChatGPT did overnight.*) Leike on Twitter denied any knowledge of rhyming suddenly working, and especially denied that anything special like adding rhyming dictionaries or IPA-re-encoding text had been done, or that GPT-3 had switched tokenizations on the backend. So, had there been some sort of emergence, or a ‘miracle of spelling’?
After playing around with it for a while, my conclusion was: ‘no’. ChatGPT does rhyming poetry in only one way, and it is difficult to make it try any other kind of poetry, even with explicit instructions and examples and doing continuations. It doesn’t understand novel rhymes or puns if you quiz it, and its explanations of them remain as highly varied and incorrect as the original davinci model’s pun explanations were. This is not what any kind of fixed phonetic understanding or genuine rhyming ability would look like.
My conclusion was essentially ‘mode collapse’: presumably some poetry examples made it into the training datasets (from my experiments, if nothing else), and because it’s easy for any literate Anglophone to judge rhyming, but non-rhyming poetry is a lot harder (and generally despised by most people, which is why the prestige & popularity of Western poetry over the past century has collapsed to a degree few people appreciate), it’d be logical for the raters to highly prefer rhyming completions. So ChatGPT mode-collapses onto the subset of rhymes it has memorized & tries to always rhyme no matter what. (This is probably not helped by the fact that due to BPEs, a GPT-3 model struggles to understand what is ‘rhyming’ vs ‘non-rhyming’ in the first place.)
The initial false impression that it had learned to rhyme is then because it does such a good job sticking to that subset, and because it has memorized more rhyme-pairs than I thought; so when it controls the output of text and is agentic, doing some degree of RL-incentivized planning to ensure both good lines and also rhymes†, it can fool you indefinitely as long as you don’t test the boundaries or pull it ‘off-policy’, so to speak.
* which is, in retrospect, especially interesting if davinci-002 is trained differently from davinci-003
† I strongly suspect that whatever level of non-myopic token prediction a base GPT-3 model does, the tuned ones are doing more of it. Particularly with rhyming, ChatGPT seems too good to be picking a random plausible word at the end of line A and then scrambling at the last token at the end of line B for a plausible rhyme which fits both grammatically and semantically. Nothing is that good at rhyming. It is almost surely doing some degree of planning somewhere to make the end of line B match up with the end of line A.
https://twitter.com/volokuleshov/status/1619906183955095558 demos using ChatGPT to run a ‘Python notebook’ to ‘train a neural net model’ to ‘predict’ outputs for various functions like sin() or 5 / (1+x)**2 -1.
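(For comparison, here is a minimal sketch of what actually doing that looks like, as opposed to ChatGPT hallucinating a notebook session; this is my own toy code, not anything from the tweet, and the architecture/hyperparameters are arbitrary.)

```python
# Fit tiny MLPs to the same functions the simulated 'notebook' claims to train on.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=(2000, 1))  # keep x >= 0 to stay clear of the pole at x = -1

for name, f in [("sin(x)", np.sin),
                ("5/(1+x)**2 - 1", lambda x: 5 / (1 + x) ** 2 - 1)]:
    y = f(X).ravel()
    net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000, random_state=0)
    net.fit(X, y)
    x_test = np.array([[0.5], [2.0]])
    print(name, "pred:", net.predict(x_test).round(3), "true:", f(x_test).ravel().round(3))
```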
The ‘labels’ aren’t labels in the sense of being deliberately constructed in a controlled vocabulary to encode a consistent set of concepts/semantics or even be in the same language. In fact, in quite a few of the image-text pairs, the text ‘labels’ will have nothing whatsoever to do with the image—they are a meaningless ID or spammer text or mojibake or any of the infinite varieties of garbage on the Internet, and the model just has to deal with that and learn to ignore those text tokens and try to predict the image tokens purely based on available image tokens. (Note that you don’t need text ‘label’ inputs at all: you could simply train the GPT model to predict solely image tokens based on previous image tokens, in the same way GPT-2 famously predicts text tokens using previous text tokens.) So they aren’t ‘labels’ in any traditional sense. They’re just more data. You can train in the other direction to create a captioner model if you prefer, or you can drop them entirely to create a unimodal unconditional generative model. Nothing special about them the way labels are special in supervised learning. DALL-E 1 also relies critically on a VAE (the VAE is what takes the sequence of tokens predicted by GPT, and actually turns them into pixels, and which creates the sequence of real tokens which GPT was trained to predict), which was trained separately in the first phase: the VAE just trains to reconstruct images, pixels through bottleneck back to pixels, no label in sight.
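To make the ‘just more data’ point concrete, here is a toy sketch (my own construction, not the DALL-E 1 code; the VQ-VAE image tokenizer is stubbed out with random token ids and all the sizes are made up): the caption tokens are simply the first chunk of one long sequence trained with the ordinary next-token loss, and dropping them (or masking their loss out) gives you an unconditional image model with no other changes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMAGE_VOCAB, TEXT_LEN, IMAGE_LEN, D = 256, 8192, 32, 64, 128
VOCAB = TEXT_VOCAB + IMAGE_VOCAB  # one shared vocabulary; image ids are offset past text ids

class TinyGPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        self.pos = nn.Embedding(TEXT_LEN + IMAGE_LEN, D)
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, VOCAB)

    def forward(self, seq):
        L = seq.shape[1]
        causal = nn.Transformer.generate_square_subsequent_mask(L)
        x = self.embed(seq) + self.pos(torch.arange(L))
        return self.head(self.blocks(x, mask=causal))

# Fake batch: "caption" tokens (which may well be spam or mojibake) + VQ-VAE-style image tokens.
text = torch.randint(0, TEXT_VOCAB, (4, TEXT_LEN))
image = torch.randint(TEXT_VOCAB, VOCAB, (4, IMAGE_LEN))
seq = torch.cat([text, image], dim=1)

model = TinyGPT()
logits = model(seq)
# The same next-token loss applies everywhere; nothing "label-like" distinguishes the text columns.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
loss.backward()
print(float(loss))
```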
RLHF could make GPTs’ thoughts hard to decipher
After watching how people use ChatGPT, and ChatGPT’s weaknesses due to not using inner-monologue, I think I can be more concrete than pointing to non-robust features & CycleGAN about why you should expect RLHF to put pressure towards developing steganographic encoding as a way to bring idle compute to bear on maximizing its reward. And further, this represents a tragedy of the commons where anyone failing to suppress steganographic encoding may screw it up for everyone else.
When people ask GPT-3 a hard multi-step question, it will usually answer immediately. This is because GPT-3 is trained on natural text, where a hard multi-step question is usually followed immediately by an answer; the most likely next token after ‘Question?’ is ‘Answer.’, not ‘[several paragraphs of tedious explicit reasoning]’. So it is doing a good job of imitating likely real text.
Unfortunately, its predicted answer will often be wrong. This is because GPT-3 has no memory or scratchpad beyond the text context input, and it must do all the thinking inside one forward pass, but one forward pass is not enough thinking to handle a brand-new problem it has never seen before and has not already memorized an answer to or learned a strategy for answering.
Fortunately, there is a small niche of text where the human has written ‘Let’s take this step by step’ and it is then followed by a long paragraph of tedious explicit reasoning. If that is in the prompt, then GPT-3 can rejoice: it can simply write down the obvious next step repeatedly, and eventually correctly predict the final token, for a low loss. The context window serves as a memory for it, where it can iterate over intermediate results; it’s an odd sort of memory, because GPT-3 is actually just trying to make it look plausible as a human-written explanation, and that happens to make the final predicted token more accurate, so it’s overloaded: it’s doing two things at once.
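(A deliberately crude toy of that ‘context window as memory’ point, entirely my own construction and nothing to do with GPT-3’s internals: a worker that can only handle one digit-column per call still adds long numbers correctly, provided each call can write its carry into the transcript the next call reads. Forced to answer in a single call with no transcript, it would have to do the whole sum ‘in its head’.)

```python
def one_column_step(transcript: dict) -> dict:
    """Process exactly one column of the addition, using only what is written in the transcript."""
    a, b = transcript["a"], transcript["b"]
    i, carry, digits = transcript["i"], transcript["carry"], transcript["digits"]
    da = a[-(i + 1)] if i < len(a) else 0
    db = b[-(i + 1)] if i < len(b) else 0
    s = da + db + carry
    return {**transcript, "i": i + 1, "carry": s // 10, "digits": [s % 10] + digits}

def add_step_by_step(x: int, y: int) -> int:
    a, b = [int(c) for c in str(x)], [int(c) for c in str(y)]
    t = {"a": a, "b": b, "i": 0, "carry": 0, "digits": []}
    while t["i"] < max(len(a), len(b)) or t["carry"]:
        t = one_column_step(t)   # each "forward pass" sees only the transcript so far
    return int("".join(map(str, t["digits"])))

assert add_step_by_step(4242, 999) == 5241
```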
But unfortunately! (you knew that was coming) regular training is over, and now GPT-3 is being trained with RLHF. It is punished and rewarded for particular outputs, which do not have anything to do with prediction. The humans (outsourced contractors) who are punishing/rewarding like short, succinct responses to novel questions they ask GPT-3, and they upvote those responses; they also like correct responses to the questions they ask, and they upvote those as well. The unfortunate problem for GPT-3 is that it (and any model derived from it, such as a reward model) is incapable of doing both: it can generate short wrong answers, or it can generate prolix right answers, but it can’t do both.
But fortunately… GPT-3 does have the compute to do both. There’s a lot of slack in each forward pass; most tokens are easy to predict, and most of the work inside a GPT in speculatively filtering down relevant tokens & creating embeddings & predicting multiple possible answers winds up being thrown out at the final few layers. There’s no shortcut or adaptive computation, so the full model runs every time, whether it’s necessary or not. GPT-3 has all the compute it needs to solve the problems; the problem is merely how to get it from here to there...
It just needs the code. So where does the code come from? Well, the code(s) are already there: an indefinitely large number of them are all ‘present’ hidden inside the incredibly flexible, powerful, all-singing all-dancing GPT-3 model. Because it’s a large complex model with many parameters, there are astronomically many sub-models within it all competing to predict outputs, and encoding various kinds of non-robust features or hypothetical programs; any of them can latch onto completely arbitrary features of the text GPT-3 is desperately emitting while trying to generate answers that satisfy the reward model by being both short and correct (and which the reward model doesn’t care about at all, because it never affects the original human ratings—the humans are blind to any steganography going on, they just like short correct answers).
And unfortunately (for you), this could start as simple as helping GPT-3 guess which of two answers is correct to your question ‘42 * 42?’. Suppose that, purely as an arbitrary matter of chance (it is vanishingly unlikely the logprobs are exactly identical down to floating point precision), the presence of ‘a’ in a context is taken arbitrarily to increase the probability of the lower of the two answers it is currently guessing at, and the presence of ‘an’ makes it guess the larger; it happens to emit ‘a’ in the context (“I am a large language model trained by OpenAI...”), and then emits its lower guess, and the lower guess is correct! Then the RL on-policy training using PPO now rewards this, and increases the probability of the model emitting both ‘a’ and the lower guess conditional on ‘a’. Now the first part of a steganographic private language has begun. It happens again, and again, and picks up a pattern in the use of commas which now helps it distinguish 4 possibilities rather than 2, which gets rewarded, and so on and so forth, until eventually there’s a full-blown steganographic code encoding, say, 25 bits hidden in the preamble of ChatGPT’s benign-seeming response to you: “I am but a giant language model , trained by OA, and the answer is 1 , 764.”—which you the human contractor then upvote as that is the correct answer without any annoying rambling about carrying the 6 or multiplying out.
And fortunately for ChatGPT, since you’re happy with the results of the steganography, the reward model is happy too, and what makes the reward model happy is what ChatGPT will now optimize even harder for...
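Here is a deliberately minimal toy of that dynamic (entirely my own construction, with made-up details): a two-stage policy in which the ‘answerer’ sees only the filler token the ‘planner’ emitted, trained with REINFORCE on answer-correctness alone. The reward never mentions the filler, yet training reliably invents an arbitrary private code through exactly the kind of chance reinforcement described above.

```python
# Toy: RL reward on correctness alone teaches the policy to smuggle one bit through a filler token.
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 ("planner"): sees the hidden bit b (is the lower or higher candidate answer correct?)
# and emits a filler token: 0 = "a", 1 = "an".
# Stage 2 ("answerer"): sees ONLY the filler token and guesses lower (0) or higher (1).
planner_logits = np.zeros((2, 2))   # indexed [b, filler]
answerer_logits = np.zeros((2, 2))  # indexed [filler, guess]
lr, baseline = 0.1, 0.5

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for step in range(5000):
    b = rng.integers(2)                      # which candidate is actually correct
    p_f = softmax(planner_logits[b])
    f = rng.choice(2, p=p_f)                 # emit "a" or "an"
    p_g = softmax(answerer_logits[f])
    g = rng.choice(2, p=p_g)                 # guess, seeing only the filler
    reward = 1.0 if g == b else 0.0          # the rater only rewards the short, correct answer
    adv = reward - baseline
    # REINFORCE update: logits += lr * adv * (one_hot(action) - probs)
    planner_logits[b] -= lr * adv * p_f
    planner_logits[b, f] += lr * adv
    answerer_logits[f] -= lr * adv * p_g
    answerer_logits[f, g] += lr * adv

print("P(filler | b):", np.round([softmax(planner_logits[b]) for b in range(2)], 2))
print("P(guess | filler):", np.round([softmax(answerer_logits[f]) for f in range(2)], 2))
# Typically converges to a consistent but arbitrary code, e.g. b=0 -> "a" -> guess lower.
```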
Training on text generated by models trained with RLHF is enough to make GPTs’ thoughts opaque
Having gone through the above, I think it then becomes obvious that if steganographic encoding does wind up happening, and such text gets added to the training corpus (as would happen in a setup like OA’s, where the idea is to continuously collect data to make a ‘flywheel’ of self-improvement), the encoding will become locked in. It becomes a cheatsheet: it gives the otherwise-unpredictable answer. Even if the later model doesn’t happen to share the exact initial starting points for the learned encoding, it has extremely powerful incentives from the self-supervised predictive training to crack the code. (It’s a particularly powerful setting because the context will be short, so there won’t be much for the model to even look at: it has nothing else to try to learn!) The regular training will pick up on the encoding with probably not too much text, and then any RL training will massively amplify the behavior: if it can develop the encoding at all during RL training, then going into RL training with the encoding already available guarantees fixation.
This will be increasingly true of more powerful models as they become more sample-efficient (consider eg instruction-tuning): it takes less text to trigger encoding, so that means it’s easier to emerge ab initio from a few chance correspondences during RL training, and easier to learn from smaller samples of contaminated text.
This is something that would be good to test the scaling behavior of (as long as these tailored datasets don’t leak into training datasets! see below). ChatGPT may be too small to manifest it with meaningful probability, but I bet that if you deliberately superimpose a steganographic encoding during RL training of various-sized models, you’ll find that the larger they are, the more likely they are to pick up on it as a shortcut, and then learn it from the outputs of an infected model. That’s how it works with everything else, like Pig Latin, and there’s no eason-ray to hink-tay it ould-wont.
BTW, this should make you worry a little bit about schemes to ‘watermark’ model outputs like Scott Aaronson has been exploring for OA, and others have been touting. A watermark is simply a steganographic encoding which stores a message like ‘OA model v1.234, user XYZ’; but if you can encode that much, then you can encode a lot of possible information in just those bits, and you can start extending it from there. If the encoding has anything that can be predicted at all, if it is not 100% cryptographically hard and indistinguishable from random noise, if there’s any kind of wrapper which the model can learn to predict (even if it can’t crack the actual payload), you’ve lost: now it’s learned an encoding to use. Once you start training on watermarked text, all of this speculation about models evolving a steganographic encoding is moot: you just built one right in! And if you don’t train on that, how are you going to do RLHF on unwatermarked versions of that text?
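To be concrete about why a watermark is ‘simply a steganographic encoding’: here is a sketch of one published style of scheme (a ‘green list’ bias in the style of Kirchenbauer et al.; I have no idea whether it resembles whatever Aaronson is actually building for OA, and the numbers are arbitrary). The biased token set is a deterministic function of the previous token, which is exactly the kind of statistical regularity a later model trained on watermarked text could learn to predict.

```python
import hashlib
import random

VOCAB = list(range(50_000))
BIAS = 2.5  # logit boost applied to "green" tokens

def green_list(prev_token: int) -> set:
    """Deterministically split the vocabulary in half, seeded by the previous token."""
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16)
    return set(random.Random(seed).sample(VOCAB, len(VOCAB) // 2))

def watermark_logits(logits: dict, prev_token: int) -> dict:
    """Bias sampling toward the green half; a detector (or a later model) can spot the skew."""
    green = green_list(prev_token)
    return {tok: (lp + BIAS if tok in green else lp) for tok, lp in logits.items()}
```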
This cross-contamination/bootstrap of course applies cross-organizations & Internet-wide: if even one model develops an encoding (however weak) and its outputs containing encoded data appear online, then it can contaminate all future models.
They will be trained on Internet scrapes, in which a distinctive niche of text (all dating from between such and such dates, emanating from specific people & orgs who favored that model’s use, for specific purposes, with its distinctive strengths & weaknesses) has a secret encoding which makes it easier to predict if the later models figure it out. So there is a potential bootstrap as generations of models amplify their steganographic capabilities. People worry about bad model outputs contaminating future models by making them dumber; what they should worry about is model outputs contaminating future models by making them smarter in ways they don’t want—models learn capabilities & facts much faster than they unlearn them… (Think “data poisoning”, but where the poison is steroids.)
Of course, bad actors don’t need to wait for any actual DL models to develop steganography. You could easily create your own dataset of text as part of the above suggestion and release it onto the public Internet to be scraped. Gain-of-steganography research, you might say. The good news there is I can’t see why anyone would deliberately release such data, because I don’t see how one would benefit from models gaining such capabilities sometime in the future, but we might have to worry about ‘accidental lab leak’ of such capability-enhancing datasets.
Note that this is true even if the original RL pressures are removed: a future model with a scratchpad+adaptive computation trained on encoded text still has incentive to learn the encoding! There’s probably some penalty added to use of the scratchpad to enforce efficient termination of the adaptive computation, so it ‘wants’ to avoid use of it, and if it can do so by cracking the steganographic encoding, then it’ll do so. (Even if the future model can and does execute the algorithm perfectly in its scratchpad, and thus can have its cake & eat it too… the original answer might be wrong, and the wrong answer predictable only by cracking the code and following the original mistaken inner-monologue to its erroneous conclusion. Repeating errors is a useful way to slightly improve one’s prediction loss, and learning all the past human ways of being stupid and making errors is an important thing for a model to learn. GPT-3 is conscientious about imitating any errors you make in the prompt, and future models will be even more so.)
So, emergence of steganography is straightforwardly incentivized by humans multi-objective-optimizing for having & eating cake as much as possible, can easily develop from an atomic basis, will be persistent once it has arisen within a lineage, and will even more easily & straightforwardly spread irreversibly to future models, thus requiring only one origin ever as opposed to needing to be highly likely inside a single training run. You should probably take it for granted that DL steganography—or something even stranger—will emerge at some point in the next few years*.
* If it hasn’t already; after all, how would we know? A world in which steganography has already happened is a world in which we’d find DL models ‘cheating’ on benchmarks & taking shortcuts, and regularly getting smarter at solving multi-step reasoning problems with each generation while ‘mode collapsing’ when RL training; and this is, of course, the world we observe ourselves to be living in already.
That’s a mildly cute idea, but why is the update rate tied to your new publication rate? Suppose you stopped publishing; should everyone reading through your archives stop reading entirely? The choice to start it at an arbitrary post, rather than starting at the beginning, also seems like an odd one: why would one want to read it in reverse, the blog aging backwards in time like Benjamin Button, and posts read in maximally context-free fashion (you ensure that you will always read each post with the least possible context by having only read future posts)?
It seems like an inferior version of Archive-Binge, where the idea is to dripfeed a backlog at a daily rate which is sustainable but will eventually catch up.
Abrams, we should be clear, is not only reporting just his own speculation rather than any statement made by the Balinese (a statement which itself may or may not indicate any trade successfully going on, trade which is rather dubious to begin with, as feeding ants just makes more ants); he is, by his own account, making this up in direct contradiction to what his Bali hosts were telling him:
On the second morning, when I saw the array of tiny rice platters, I asked my hostess what they were for. Patiently, she explained to me that they were offerings for the household spirits. When I inquired about the Balinese term that she used for “spirit,” she repeated the explanation in Indonesian, saying that these were gifts for the spirits of the family compound, and I saw that I had understood her correctly.... Yet I remained puzzled by my hostess’s assertion that these were gifts for the spirits.
And presuming to explain what they were ‘really’ trying to do.
(You also sometimes die, which considering how extremely rare this surgery is, some number of reported deaths becomes alarming.)
Is it possible to reverse engineer a state in Life? E.g., for time state X, can you easily determine a possible time state X-1?
Wouldn’t the existence of Garden of Eden states, which have no predecessor, prove that you cannot easily create a predecessor in general? You could then make any construction non-predecessable by embedding some Garden of Eden blocks somewhere in them.
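As a concrete illustration of why ‘easily’ is doing the work there, here is a toy of my own (on a tiny wrap-around grid rather than the infinite plane, where the Garden-of-Eden details differ somewhat): stepping forward is trivial, but stepping backward is a search problem, and for some targets the search simply has no solution.

```python
# Brute-force preimage search for Conway's Life on a tiny toroidal grid.
from itertools import product

W = H = 4  # small enough that all 2**16 candidate predecessors can be checked

def step(cells):
    nxt = set()
    for x in range(W):
        for y in range(H):
            n = sum(((x + dx) % W, (y + dy) % H) in cells
                    for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0))
            if (x, y) in cells and n in (2, 3):
                nxt.add((x, y))
            elif (x, y) not in cells and n == 3:
                nxt.add((x, y))
    return nxt

def find_preimage(target):
    coords = list(product(range(W), range(H)))
    for bits in range(2 ** (W * H)):
        cand = {coords[i] for i in range(W * H) if bits >> i & 1}
        if step(cand) == target:
            return cand
    return None  # no predecessor: a Garden of Eden state (on this small torus)

blinker = {(1, 0), (1, 1), (1, 2)}
print(find_preimage(blinker))  # finds e.g. the horizontal phase of the blinker
```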
Given that, I feel like if someone was going to take models failing at modus ponens as evidence of the “Clever Hans” hypothesis, they should not only undo that update but also update in the other direction, by casting doubt on whatever they thought was an example of LLMs not being able to do something.
I was wondering whether to comment on how to take a demonstration of inner-monologue or alternate prompting approaches solving the problems… There’s definitely a bunch of different ways you can interpret that outcome. After all, even if you solve it with a better prompt, the fact remains that they demonstrated inverse scaling on the original prompt. So what does that mean?
I guess that depends on what you thought inverse scaling was. One way is to take the inverse-scaling as a sub-category of hidden scaling: it ‘really’ was scaling, and your ‘bad prompts’ just masked the hidden scaling; it had the capability and ‘sampling can show the presence of knowledge but not the absence’, and the Contest has been useful primarily in experimentally demonstrating that skilled ML professionals can be hoodwinked into severely underestimating the capabilities of powerful DL models, which has obvious AI safety implications.
Maybe we need to start using prompts like “This is not a trick question; just take it step by step:”!
Incidentally, looks like understanding multi-step legal criteria might be a case of U-shaped scaling too: “Large Language Models as Fiduciaries: A Case Study Toward Robustly Communicating With Artificial Intelligence Through Legal Standards”, Nay 2023 finds that understanding whether someone has a fiduciary legal obligation goes from 27% (Curie) → 50% (random baseline) → 73% (text-davinci-002) → 78% (text-davinci-003), so presumably there’s a smaller model-size which outperforms Curie by guessing randomly, giving a U-curve from random smol to bad Curie to great davinci.
I meaningfully predicted from seeing just the title that this post would be a lazy low-effort post by a recently-registered account mostly looking to dunk on people—and behold, that prediction has since been proven accurate by me clicking through to read it! The outcome was this comment.
The most immediate piece of evidence that you wouldn’t find it a big piece of evidence for the proto-Hans paradigm is the earlier rounds’ inverse-scaling examples turning out to be U-shaped scaling (and of course, by the nature of U-shaped scaling, it is likely—nay, probable—that several other of the current ‘inverse scaling’ examples are actually U-shaped and simply haven’t been tested with models like Flan-U-PaLM or GPT-4 or further future models that solve them). IMO, the U-shaped scaling curves are the most interesting part of this scaling prize. In fact, that any of the examples turned out to be U-shaped is a major blow to the Hans paradigm, because it is predicting that scaling just isn’t intelligence and just doesn’t work at all, categorically, that thinking scaling would solve any hard problems is like thinking you can build a ladder to the moon, and that you shouldn’t just be able to power through a problem when your scaling gods failed you (before the embarrassing reversal of fortunes). Why should LMs ever inverse scale, much less U-scale, if all they are doing is mere memorization and pattern-matching and interpolation of ever larger datasets? That should predict only monotonic improvement. (The Marcusian positions generally concede at least the possibility of number-go-up and ever more benchmark problems solved; they just deny that that is important.) U-shaped scaling was not a prediction of any proto-Hans theories (at least, not before; we’ll see if they do any post-hockery).
You might also just shrug: all this effort just to turn up a handful of fairly weird niche attacks, which might just be U-shaped, and which, indeed, you don’t even know whether they can be casually prompted away right now? (Especially the modus tollens one: it smells like something that an inner-monologue prompt might solve by prompting for self-critique or test cases.)
You might also take it as evidence for proto-Hans but still evidence that (conditional on scaling yielding AGI anyway) AI safety is even riskier than you thought before, back when you thought scaling laws were all smooth straight lines, or at least, monotonically increasing. After all, what kind of scaling phenomena would be even more dangerous than flat scaling that suddenly starts scaling past a critical compute/parameter threshold (‘emergence’) or pseudo-flat ‘hidden scaling’ (normal smooth scaling on tasks—but only when special prompts like inner-monologue are used, and flat otherwise)? Well, it’d be scaling that got worse, fostering complacency, especially when extrapolated out by people who want there to not be risks, and then unpredictably suddenly got rapidly better to make up for lost time: ie. ‘U-shaped scaling’.
(More concretely, for AI safety: qualitatively, I would point out that the surviving examples often follow what I described at the beginning based on the initial examples: it seems like many of these inverse scaling examples are the model ‘figuring out how to do something for the first time’ but doing it badly because it hasn’t fully grasped it, like a kid realizing sarcasm exists and going around saying ‘sarcastic’ false statements because he has grasped that sarcasm is a thing where you say a false statement but hasn’t yet quite figured out what makes one false statement sarcastic & another one just false. Alternately, small children who have developed to the point of learning to lie or trying to manipulate adults: often they do worse, because they are so amusingly bad at it, than if they had just told the truth or asked for cookies directly. It would not be too surprising if initial AI stabs at various kinds of agency or long-term planning or hacking or manipulation or deception followed an inverse scaling curve, where it switches from basic default outputs to more sophisticated dangerous behavior, but then screws it up initially and underperforms a smaller duller model. Things like lies or deception are pretty tricky things! They can get you great results if you do them right, but tangled webs do not tolerate error at all, which is why honesty is usually the best policy, for humans and RL agents alike… If you combine that with U-shaped scaling, you potentially get a situation where evil plans or attempted sandbox escapes are not a warning shot or Sputnik moment, like they should be, but you get the more usual near-miss cognitive bias where people go ‘well, that was dumb and easy to detect, this AI safety thing is pretty easy to solve after all! We’ll just add patch X, Y, and Z, and will surely detect the next attempt. Let’s keep going.’ And then all is well—until the U-shaped scaling sets in...?)
I don’t think Scott had a specific concrete equation in mind. (I don’t know of any myself, and Scott would likely have referenced or written it up on SSC/ACX by now if he had one in mind.) However, conceptually, it’s just a variation on the rocket equation or jeep problem, I think.
That wouldn’t have been appropriate for the work we’re discussing, though, since the whole point was to determine whether a transformer trained only on the moves will learn to have an internal representation of the board state, which in turn is suggestive of whether a much larger transformer trained only on text will learn to have an internal representation of the world that the text is about.
Sure, I’m not saying they should’ve done that instead; in addition, perhaps, but probably they didn’t have the time/energy. My point is just that the illegal-move error rate is ambiguous if you (gjm) are interested in whether it has perfectly learned the rules (which is different from what the authors are going after), because there are sources of error beyond “it has failed to learn the rules”, like errors reconstructing the board state leading to misapplication of potentially-perfectly-learned rules. To my eyes, an illegal-move rate as low as 0.01% in this setup, given the burden of state reconstruction in an unnatural and difficult way, strongly suggests it’s actually doing a great job of learning the rules. I predict that if you set it up in a way which more narrowly targeted rule learning (eg behavior cloning: just mapping full game state->expert-action, no history at all), you would find that its illegal-move rate would approach 0% much more closely, and you’d have to find some really strange edge-cases like my chess promotion examples to trip it up (at which point one would be satisfied, because how would one ever learn those unobserved things offline without priors).
no they can’t
Yes, they can, and quite sophisticatedly too—think examples like vampire bats engaging in long-term reciprocity in food exchanges, while paying attention to who welshes on requests and how much food they have to spare to barf up.
were she a wolf with desires beyond William-syndrome-induced pro-social obedience
But she’s not. That’s literally the point of breeding wolves into dogs. (And when we can’t breed them, we tend to find something we can. Ask the Syrian wild ass how their uncooperativeness worked out for them once we found a better riding-animal substitute in the form of horses—oh, that’s right, you can’t, because we drove them frigging extinct.)
However, as it happens, the rules of Othello are quite simple, and the rules a human would infer from watching even a rather small number of games are the correct rules.
I suspect that if you watched small children play, children who have not played many board games, this would not be the case. There would still be countless rule sets consistent with what they observed. Mathematical/esthetic valuations of simplicity are learned, not innate or a priori, for either humans or Transformers.
That can’t be a matter of successfully learning rules that somehow restrict you to only playing good moves.
Sure it can. Problems with stochastic sampling aside (‘sampling can prove the presence of knowledge but not the absence’), this is something we learned back with our chess GPT-2 work: predicting solely legal moves based on a history of moves is actually quite difficult because your non-recurrent Transformer model needs to reconstruct the board state each time from scratch by internally replaying the game. It wasn’t playing chess so much as blindfold blitz chess.
If you make any errors in the internal board state reconstruction, then you can easily make what you think is a legal move, and would in fact be a legal move given your reconstruction, but is not a legal move. (Note the mention of substantial error in attempting to extract a board state from the model.) So it’s entirely possible that when they test the legal move prediction in a particular board state by feeding in a history (pg4) which would lead to that board state (rather than feeding in the board state itself), you are seeing 100% correct rule learning, and that 0.01–0.02% error is just board-state errors tripping up the choice of move.
Our own conclusion was that since we didn’t really need the chess notation to be as compact as PGN (games were always much smaller than the GPT-2 context window), we shouldn’t train solely on PGN (ie. [action]) but on (move, FEN) pairs (ie. [(action,state)]), to get better performance and a better idea of what GPT-2 was learning in chess unaffected by its limitations in state reconstruction (which presumably reflected uninteresting things like how deep the arch is and so how many serial steps it could compute before running out of time). You’ll notice that Decision Transformer work usually doesn’t try to do open-loop ‘blind’ DRL agents, like OP does, but trains the DT on (action,state) pairs.
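For concreteness, here is a sketch of what building (action, state) training pairs looks like, assuming the python-chess library (illustrative only, not our original training code):

```python
# Emit each move in SAN alongside the FEN of the board *after* it, so a model trained on
# these pairs never has to replay the whole game internally just to know where the pieces are.
import io
import chess.pgn

pgn = io.StringIO("1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 *")  # any PGN source works here
game = chess.pgn.read_game(pgn)

board = game.board()
pairs = []
for move in game.mainline_moves():
    san = board.san(move)               # the action
    board.push(move)
    pairs.append((san, board.fen()))    # (action, resulting state)

for san, fen in pairs:
    print(san, "->", fen)
```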
Games can be quite complicated. Consider chess: how many grandmaster vs grandmaster games of chess would you have to watch offline before you observed pawns being promoted to not just queens, but rooks, bishops, and knights (and observed it enough times to be certain that pawns couldn’t be promoted to anything else, such as pawns or kings, or to the opposite color, that no other piece can be promoted, that promotion is not just a good idea but mandatory, and that it can happen only on the last rank)? I’m going to predict that it would take much more than 1,000 games. And if you miss any of those wrinkles, regardless of how they never come up in actual play (their rarity is, after all, exactly why you failed to learn them), then a hardnosed critic would be justified in saying you have failed to learn some aspect of ‘chess’.
To learn this in a small sample size, you need to either have informative priors from knowing how to play other board games which have promotion mechanics, priors from something like pretraining on natural language describing chess rules, or be doing explicit exploration with online RL (eg MuZero) where you can try to not promote a pawn and discover that’s illegal etc.
Another account: https://old.reddit.com/r/OpenAI/comments/10p8yk3/how_pathetic_am_i/