What sorts of encoding do you expect to be natural to LLMs besides encodings already present in pretraining and that GPT-4o can decode?
Spitballing: Some sort of trivial-but-abstract combination of the encodings present in pretraining that produces an encoding that’s particularly easy for the LLM to think in due to their architecture/learned associations, but which is opaque to us, because the abstract space in which the combination is trivial is hopelessly beyond our current theory of languages, such that we can’t easily reverse-engineer it.
“Switching between languages every sentence” is a trivial first step. Next step might be something like “using poetic metaphors in every language” or “thinking in subtle information-dense multilingual puns”, such that you need actual human polyglots to translate. Beyond that, it might move to freestyling words by combining characters from random languages in ways such that they happen to evoke useful concepts in the LLM when seen through its omniglot-poet lens[1], but which teams of human polyglots need hours to untangle. At this stage, idiosyncratic meanings completely alien to us but visible to the LLM would also come into play, see the ” petertodd” phenomenon, so perhaps the polyglot teams would also need to have janus-style LLM whisperers present. And then it gets worse from there.
Edit: Here, I dug up an example. I’d run some experiments on it back in 2023 and it turned out to be much less impressive than it looks[2], but imagine if LLMs actually started to think in this kind of code.
Again, this is just a throwaway example scenario, but this is the sort of thing I expect. From my perspective, LLMs starting to switch between languages once RL’d is a clear precursor of such a nightmare scenario once RL is ramped up, not something that’s tame. (Though I admit it’s been less bad than I’d expected so far. I guess I haven’t seen what the trajectories of reasoning models of the second generation look like.)
E. g., using one language’s rules for creating compound words while taking the morphemes from different languages (and emoji, why not), such as surrounding a Japanese kanji with Latin and Russian prefixes and suffixes, and wrapping the whole thing in the poetry tradition of a fourth language.
The LLM doesn’t make the encoding incomprehensible if you don’t literally ask it to make it incomprehensible, and the encoding doesn’t change depending on whether you inform it that it’s for its future instance or not. So it’s not actually doing self-modeling to figure out the best way to encode text for itself. Still, it does demonstrate that it can reverse this sort of wild-looking code due to its learned webs of associations, in ways that would be very hard for humans to untangle.
Thanks for the example! I think it moderately convinced me that having CoT become hard for humans to understand in the next 2 years is slightly more plausible than I thought.
I still think that even for the things you described, it will be relatively easy for the base model to understand what is going on, and it’s likely that GPT-4o will too. See the discussion in another comment thread about “chicken-and-egg” problems when learning new encodings, which I think make it much easier to learn things like switching languages or skipping stopwords that take almost no additional effort to decode (therefore the it is incentivized to use the new encoding without the need to learn a new decoder). I see how once you get a model using “freestyling words” frequently the decoder becomes better over time at understanding freestyling in general (though it will be slow because it is trained using RL rather than SL), which allows you to eventually shift to encodings that the base model and GPT-4o don’t understand, but my guess is that this will take a long time (>the first 1e28 FLOP RL run?).
I still think that even for the things you described, it will be relatively easy for the base model to understand what is going on, and it’s likely that GPT-4o will too
Maaaybe. Note, though, that “understand what’s going on” isn’t the same as “faithfully and comprehensively translate what’s going on into English”. Any number of crucial nuances might be accidentally lost in translation (due to the decoder model not properly appreciating how important they are), or deliberately hidden (if the RL’d model performs a sneaky jailbreak on the decoder, see Pliny-style token bombs or jailbreaks encoded in metaphor).
Spitballing: Some sort of trivial-but-abstract combination of the encodings present in pretraining that produces an encoding that’s particularly easy for the LLM to think in due to their architecture/learned associations, but which is opaque to us, because the abstract space in which the combination is trivial is hopelessly beyond our current theory of languages, such that we can’t easily reverse-engineer it.
“Switching between languages every sentence” is a trivial first step. Next step might be something like “using poetic metaphors in every language” or “thinking in subtle information-dense multilingual puns”, such that you need actual human polyglots to translate. Beyond that, it might move to freestyling words by combining characters from random languages in ways such that they happen to evoke useful concepts in the LLM when seen through its omniglot-poet lens[1], but which teams of human polyglots need hours to untangle. At this stage, idiosyncratic meanings completely alien to us but visible to the LLM would also come into play, see the ” petertodd” phenomenon, so perhaps the polyglot teams would also need to have janus-style LLM whisperers present. And then it gets worse from there.
Edit: Here, I dug up an example. I’d run some experiments on it back in 2023 and it turned out to be much less impressive than it looks[2], but imagine if LLMs actually started to think in this kind of code.
Again, this is just a throwaway example scenario, but this is the sort of thing I expect. From my perspective, LLMs starting to switch between languages once RL’d is a clear precursor of such a nightmare scenario once RL is ramped up, not something that’s tame. (Though I admit it’s been less bad than I’d expected so far. I guess I haven’t seen what the trajectories of reasoning models of the second generation look like.)
E. g., using one language’s rules for creating compound words while taking the morphemes from different languages (and emoji, why not), such as surrounding a Japanese kanji with Latin and Russian prefixes and suffixes, and wrapping the whole thing in the poetry tradition of a fourth language.
The LLM doesn’t make the encoding incomprehensible if you don’t literally ask it to make it incomprehensible, and the encoding doesn’t change depending on whether you inform it that it’s for its future instance or not. So it’s not actually doing self-modeling to figure out the best way to encode text for itself. Still, it does demonstrate that it can reverse this sort of wild-looking code due to its learned webs of associations, in ways that would be very hard for humans to untangle.
Thanks for the example! I think it moderately convinced me that having CoT become hard for humans to understand in the next 2 years is slightly more plausible than I thought.
I still think that even for the things you described, it will be relatively easy for the base model to understand what is going on, and it’s likely that GPT-4o will too. See the discussion in another comment thread about “chicken-and-egg” problems when learning new encodings, which I think make it much easier to learn things like switching languages or skipping stopwords that take almost no additional effort to decode (therefore the it is incentivized to use the new encoding without the need to learn a new decoder). I see how once you get a model using “freestyling words” frequently the decoder becomes better over time at understanding freestyling in general (though it will be slow because it is trained using RL rather than SL), which allows you to eventually shift to encodings that the base model and GPT-4o don’t understand, but my guess is that this will take a long time (>the first 1e28 FLOP RL run?).
Maaaybe. Note, though, that “understand what’s going on” isn’t the same as “faithfully and comprehensively translate what’s going on into English”. Any number of crucial nuances might be accidentally lost in translation (due to the decoder model not properly appreciating how important they are), or deliberately hidden (if the RL’d model performs a sneaky jailbreak on the decoder, see Pliny-style token bombs or jailbreaks encoded in metaphor).