An alternate tentative hypothesis I’ve been considering: They are largely artifacts of RL accident, akin to superstitions in humans.
Like, suppose an NBA athlete plays a fantastic game two games in a row. He realizes that, in each case, he had an egg-and-sausage sandwich for breakfast the morning of the game. So he goes “aha! that’s the cause” and tries to stick to it.
Similarly, an RL agent tries a difficult problem. It takes a while, so over the course of solving it, the agent sometimes drops into repetition / weirdness, as long-running LLMs do. But it ends up solving the problem in the end, so all the steps leading up to the solution are reinforced according to GRPO or whatever. So it’s a little more apt to drop into repetition / weirdness in the future, etc.
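To make the credit-assignment point concrete, here is a minimal sketch of GRPO-style advantage computation (simplified: no importance-ratio clipping or KL term, and the rewards and lengths below are made up). The relevant property is that every token of a successful rollout, including any mid-trajectory repetition or weirdness, receives the same positive advantage:

```python
import numpy as np

def grpo_token_advantages(group_rewards, rollout_lengths):
    """Group-normalized advantage, broadcast to every token of each rollout."""
    r = np.asarray(group_rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + 1e-8)           # one scalar per rollout
    return [np.full(n, a) for a, n in zip(adv, rollout_lengths)]

# Example: 4 rollouts of the same prompt. Rollout 0 solved the problem (reward 1)
# via a long, partly repetitive trace; the others failed (reward 0).
advantages = grpo_token_advantages(group_rewards=[1, 0, 0, 0],
                                   rollout_lengths=[900, 300, 350, 280])
print(advantages[0][:5])   # every token of the successful trace gets the same boost
```

Whether the weird stretch was causally useful or just along for the ride, it gets reinforced by the same amount, which is the superstition analogy.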
I think this potentially matches the “just ignore it” view of the functional role of these tokens.
I have been assuming that the OpenAI reasoning models were trained on an objective that had a CoT length term, and that that would create pressure to strip out unnecessary tokens. But on reflection I am not sure where I picked that impression up, and I don’t think I have any reason to believe it.
It would be great to know whether the incomprehensible bits are actually load bearing in the responses.
… I wonder what happens if you alter the logit bias of those. Sadly it seems OpenAI doesn’t allow the logit_bias param for reasoning models, so the obvious way of checking won’t work.
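For what it’s worth, the analogous check is straightforward on an open-weights reasoning model, where you control the logits directly. A minimal sketch using a custom Hugging Face LogitsProcessor (the model name and the phrase list are placeholders, not anything from the paper):

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

class DownweightTokens(LogitsProcessor):
    """Subtract a fixed bias from the logits of the listed token ids."""
    def __init__(self, token_ids, bias=-100.0):
        self.token_ids, self.bias = list(token_ids), bias

    def __call__(self, input_ids, scores):
        scores[:, self.token_ids] += self.bias
        return scores

name = "your-open-reasoning-model"          # placeholder
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Collect token ids for the suspect phrases (with and without a leading space).
suspect = {t for w in ["disclaim", " disclaim"]   # placeholder phrase list
           for t in tok(w, add_special_tokens=False)["input_ids"]}

out = model.generate(
    **tok("some hard problem...", return_tensors="pt"),   # placeholder prompt
    max_new_tokens=512,
    logits_processor=LogitsProcessorList([DownweightTokens(suspect)]),
)
```

If accuracy drops when the suspect tokens are suppressed, that would be evidence they are load bearing rather than ignorable.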
I’m very skeptical there’s a single simple explanation, but would be interested to see if ablating these in the CoT and resampling (or just removing them and resampling at the end of analysis) showed differences in different cases.
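A rough sketch of what that ablate-and-resample comparison could look like (the phrase pattern, the `sample_answer` harness, and the `grade` function are placeholders, not anything from the paper):

```python
import re

# Illustrative pattern -- "disclaim illusions" is the phrase discussed in this
# thread; extend it with whatever other phrases are being studied.
FLAGGED = re.compile(r"disclaim(?:s|ed|ing)?\s+illusions?", flags=re.IGNORECASE)

def ablate_cot(cot: str) -> str:
    """Strip flagged phrases from a chain of thought, tidying leftover whitespace."""
    return re.sub(r"\s{2,}", " ", FLAGGED.sub("", cot)).strip()

def ablation_effect(problems, sample_answer, grade, n=8):
    """Per-problem accuracy drop from resampling off the ablated vs. original CoT.

    `problems` yields (prompt, cot, gold) triples; `sample_answer(prompt, cot)`
    resamples a final answer conditioned on the given CoT; `grade(answer, gold)`
    returns True or False.
    """
    deltas = []
    for prompt, cot, gold in problems:
        def acc(c):
            return sum(grade(sample_answer(prompt, c), gold) for _ in range(n)) / n
        deltas.append(acc(cot) - acc(ablate_cot(cot)))
    return deltas
```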
My best guess (speculatively) is:
(1) There’s some combination of processes that causes these repetitions / unnatural usages (e.g., R1-Zero alludes to repetition / unreadability, though unfortunately with no examples, and the model isn’t still hosted anywhere; https://ai.meta.com/research/publications/cwm-an-open-weights-llm-for-research-on-code-generation-with-world-models/ notes that “While gibberish typically leads to lower rewards and naturally decreases at the beginning of RL, it can increase later when some successful gibberish trajectories get reinforced, especially for agentic SWE RL”; there are likely many more examples). Seems reasonable to me that one cause is something like the vestigial reasoning theory mentioned below.
(2) These are then sometimes instrumentally useful / the model learns to sometimes make use of them later in RL.
(3) These then end up in a state where they’re “sometimes used in contextually relevant ways, but not always coherently, and with different meanings depending on context, none of which match standard English usage exactly but are sometimes close-ish”. We include in the appendix some heatmaps of how the rates of these differ per environment.
I would be very surprised if these never have semantic meaning; for example, the model’s use of “disclaim illusions of X” often has coherent, human-legible(ish!) meaning (see some of the randomly selected examples in the paper). The more degenerate / repetitive cases, on the other hand, may be being used in a bunch of different ways that seem hard to guess a priori.
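On the per-environment rates mentioned in (3), a minimal sketch of how such a heatmap could be tabulated (the dataframe columns and phrase list here are assumptions for illustration, not the paper’s actual pipeline):

```python
import pandas as pd

# Hypothetical input: one row per CoT, tagged with the environment it came from.
rows = pd.DataFrame({
    "environment": ["math", "math", "swe", "swe", "browsing"],
    "cot": ["...", "... disclaim illusions ...", "...", "... disclaim illusions ...", "..."],
})

# Regexes for the phrases being tracked ("disclaim illusions" is the example
# from this thread; add the others being studied).
phrases = {"disclaim illusions": r"disclaim\s+illusions?"}

# Rate = fraction of CoTs in each environment containing each phrase.
for name, pattern in phrases.items():
    rows[name] = rows["cot"].str.contains(pattern, regex=True, case=False)

heatmap = rows.groupby("environment")[list(phrases)].mean()
print(heatmap)   # feed this table to e.g. seaborn.heatmap for the figure
```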
Related post: Vestigial reasoning in RL