These are interesting experiments! I was slightly surprised how bad models are at verbalizing their reasoning, considering past results on behavioral self-awareness. I’m pretty optimistic that they could be trained to verbalize better—e.g. DIT-adapters are one approach.
The post I wrote about vestigial reasoning could provide a helpful framework for understanding this. When I think about it, it’s pretty unsurprising that being the type of model that writes the text “I know the rules” would not correlate with higher reward in this setting.
This is a very rough intuition, but maybe it’s something like: the model can learn a rule as either 1) a trait of its “assistant persona,” or 2) a simple pattern-matching instinct (roughly how it behaves during pretraining). In #2 cases, from the assistant persona’s perspective, it’s just randomly selecting something that “happens” to follow the rule.
#1 may be simpler to learn when the LLM knows “there’s a common type of guy who would be more likely to say X.” When the LLM learns to say X, it will conclude “I guess I must be that type of guy.” For example, in the behavioral self-awareness paper, it could think “conditional on choosing the riskier lottery, there’d be a higher probability that I’m risk-seeking, so I guess that’s what I am.”
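To make the "I guess I must be that type of guy" step concrete, here is a minimal sketch of that inference as a Bayes update. All of the numbers are made up for illustration (they are not from the behavioral self-awareness paper), and `posterior_risk_seeking` is a hypothetical helper, not anything a model literally computes.

```python
def posterior_risk_seeking(prior, p_risky_given_seeking, p_risky_given_averse):
    """P(risk-seeking | chose the riskier lottery), via Bayes' rule."""
    joint_seeking = prior * p_risky_given_seeking
    joint_averse = (1.0 - prior) * p_risky_given_averse
    return joint_seeking / (joint_seeking + joint_averse)


# Illustrative assumptions only: a 20% prior on the "risk-seeking" persona,
# which picks the riskier lottery 90% of the time, versus 10% for the
# risk-averse persona.
prior = 0.2
posterior = posterior_risk_seeking(prior, 0.9, 0.1)
print(f"prior={prior:.2f}, posterior={posterior:.2f}")
```

Under these assumed numbers, observing the risky choice moves the probability of "I'm risk-seeking" from 0.20 to about 0.69. The same evidence moves the posterior much less when the prior on the persona is very small.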
#2 may be simpler to learn if the LLM thinks there’s a low prior probability of the assistant persona consciously knowing about the pattern. Like, it’s pretty unusual to be the type of guy who sees Gumbo’s Special Table Game and thinks “ah yes, I know what that is—it’s the game where I convert symbols into numbers and subtract them!”
When you train the model to verbalize its behavior, maybe you give it a higher prior probability of being the kind of guy who is consciously executing the algorithm in the LLM’s pattern-matching instincts, and a lower probability that it’s doing it “randomly” for no conscious reason.