My theory about the cause of strange words in the CoT
I have some intuitions about why strange language increasingly occurs in chains-of-thought (CoTs) during reinforcement learning. They seem very reasonable to me, yet I find it extremely hard to believe they are actually correct, and I am unable to identify why. Can someone please explain why they are incorrect? (Not just in a minor detail here or there; I am looking for critical flaws.)
What causes the strange words to occur?
My hypothesis is that during RL, the reward is determined solely by the final output and is wholly agnostic toward the CoT that produced it. However, the PPO update adjusts the weights based on the entire sampled sequence, including the CoT tokens. I imagine that, perhaps because of how the CoT is delimited, this allows a semantic drift that affects only the CoT. Or alternatively, whatever portion of the drift does leak into the output is later corrected by RL (as RL tends to do), but in a way that does not correct the initial drift within the CoT itself.
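To make this concrete, here is a toy sketch of what I have in mind. It uses a plain REINFORCE-style update rather than full PPO, and all the shapes, names, and the reward value are illustrative assumptions of mine, but it shows the core point: the reward is computed from the answer alone, yet the gradient touches every sampled token, CoT included.

```python
import torch

# Hypothetical toy setup: logits over a tiny vocabulary, and one sampled
# sequence consisting of CoT tokens followed by answer tokens.
torch.manual_seed(0)
vocab_size, seq_len, cot_len = 50, 12, 8

logits = torch.randn(seq_len, vocab_size, requires_grad=True)
tokens = torch.randint(vocab_size, (seq_len,))

# Log-probability of each sampled token under the current policy.
log_probs = torch.log_softmax(logits, dim=-1)[torch.arange(seq_len), tokens]

# The reward is computed from the answer tokens only...
reward = 1.0  # e.g., 1.0 if the final answer is correct, else 0.0

# ...but the REINFORCE-style loss spreads that reward over *every*
# sampled token. Nothing here distinguishes CoT from answer positions.
loss = -(reward * log_probs).sum()
loss.backward()

# Gradient flows to the CoT positions even though the reward never
# inspected them:
print(logits.grad[:cot_len].abs().sum())  # nonzero
```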
Why does occurrence increase throughout RL?
My hypothesis is that, assuming RL is effective, the average reward increases over the course of training. Therefore, each time these strange words occur in the CoT of a rewarded sequence, they are reinforced more and more strongly as RL progresses, and they are also likely mutually reinforcing as their density in the CoT increases.
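Again as a toy illustration (plain REINFORCE with no baseline; real pipelines typically subtract a baseline or use clipped PPO advantages, so this is a deliberate simplification of mine): the size of the update a token receives scales directly with the scalar reward, so as the average reward climbs, each occurrence is reinforced harder.

```python
import torch

torch.manual_seed(0)
logits = torch.randn(10, requires_grad=True)
token = 3  # a hypothetical "strange" token appearing in the CoT

def grad_norm(reward):
    # Plain REINFORCE, no baseline: the gradient on the sampled token
    # scales linearly with the reward of the whole sequence.
    logits.grad = None
    loss = -reward * torch.log_softmax(logits, dim=-1)[token]
    loss.backward()
    return logits.grad.norm().item()

# If the average reward rises over training, each occurrence of the
# token is reinforced more strongly:
for r in (0.2, 0.5, 0.9):
    print(f"reward={r}: grad norm={grad_norm(r):.3f}")
```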
Where do the words come from?
My hypothesis (which, like the rest, is very speculative) is that the strange word (or its first token) sits very close in the embedding space to the token that, absent the semantic drift in the CoT, would have been chosen.
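A toy version of what I mean (the vocabulary and embeddings here are entirely made up; real models have tens of thousands of tokens): if drift nudges the model's output direction slightly off the "intended" token, the nearest neighbor in embedding space is a plausible candidate for what gets emitted instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding table and vocabulary, purely illustrative.
vocab = ["the", "answer", "marinade", "solve", "vesselhood"]
emb = rng.normal(size=(len(vocab), 8))

def nearest_neighbor(token):
    # Cosine similarity of every embedding to the given token's embedding.
    v = emb[vocab.index(token)]
    sims = emb @ v / (np.linalg.norm(emb, axis=1) * np.linalg.norm(v))
    sims[vocab.index(token)] = -np.inf  # exclude the token itself
    return vocab[int(np.argmax(sims))]

# A small drift toward a neighbor of the "intended" token could make
# the model emit that neighbor instead:
print(nearest_neighbor("solve"))
```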
I have further intuitions and speculations about additional aspects, but I think I am sufficiently out over my skis already, so I will stop here. Thanks in advance for your feedback!
Prove me wrong or upvote me. [1]

[1] Why? If you are reasonably informed and intelligent, and can't (or don't want to) offer a rebuttal or otherwise invalidate my theory, then upvoting this post will raise its visibility until someone capable of responding reads it and explains why it is wrong. If someone else has already responded and you are reasonably confident their response has invalidated the theory, please downvote this post if you think it doesn't deserve the karma it has already received.