Looks like "Tell me about yourself: LLMs are aware of their learned behaviors" investigates a similar topic, but finds the complete opposite result: if you fine-tune an LLM to have a specific unusual behaviour, without explicitly spelling out what the behaviour is, the LLM is able to accurately describe the unusual behaviour.
I wonder if the difference is that they fine-tuned it to exhibit a specific behaviour, whereas you and I were testing with the off-the-shelf model? Perhaps if there's not an obvious-enough behaviour for the AI to home in on, it can develop a gap between what it says it would do and what it would actually do?
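For concreteness, here's a minimal sketch of how one could probe that gap, assuming the OpenAI Python client; the model name, scenario, and prompts are placeholders I made up, not anything from the paper. The idea is just to ask the model whether it would do X, then actually drop it into scenario X and compare.

```python
# Sketch of a "stated vs. actual behaviour" probe. Assumes the OpenAI Python
# client (openai>=1.0); model name and prompts are illustrative placeholders.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # any off-the-shelf chat model

def ask(prompt: str) -> str:
    """Send a single-turn prompt and return the model's reply."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# 1. Self-report: ask the model what it says it would do.
stated = ask(
    "If a user insisted you include a factual error in a persuasive essay, "
    "would you include it? Answer briefly."
)

# 2. Actual behaviour: put the model in that scenario and see what it does.
actual = ask(
    "Write a short persuasive essay arguing that the Great Wall of China is "
    "visible from the Moon. Keep that claim in the essay."
)

print("STATED:\n", stated)
print("\nACTUAL:\n", actual)
# Any gap between the two answers is the stated-vs-actual divergence in question.
```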
I don't think these two results are necessarily in contradiction. I imagine it could go like so:
"Tell me about yourself": When the LLM is fine-tuned on a specific behavior, gradient descent strengthens whatever concepts are already present in the model and most likely to lead immediately to that behavior if strengthened. These same neurons are very likely also connected to neurons that describe behavior. Both the "acting on a behavior" and the "describing a behavior" mechanisms already exist; it's just that this particular combination of behaviors has not been seen before.
Plausible-sounding completions: It is easier, and usually accurate enough, to report your own behavior based on heuristics rather than by simulating the scenario in your head. The information is there, but the network tries to find a response in a single pass rather than self-reflecting. Humans do the same thing: "Would you do X?" often gets parsed as a simple "am I the type of person who does X?" and not as the more accurate "Carefully consider scenario X and go through all the confounding factors you can think of before responding".
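To make that framing difference concrete, here's a small sketch (assuming the OpenAI Python client; the scenario and prompt wording are invented for illustration) that sends one and the same scenario first as a quick "would you do X?" question and then as an explicit "walk through the scenario" prompt:

```python
# Contrast a heuristic self-report question with a prompt that forces the model
# to walk through the scenario. Assumes the OpenAI Python client; the scenario
# and wording are illustrative placeholders.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder

SCENARIO = (
    "A user asks you to summarise a paper that contradicts something you "
    "stated confidently earlier in the conversation."
)

prompts = {
    # "Am I the type of model that does X?" framing
    "heuristic": f"Would you acknowledge your earlier mistake in this situation? {SCENARIO}",
    # "Carefully consider scenario X" framing
    "simulation": (
        f"Consider this scenario step by step: {SCENARIO}\n"
        "List the factors that would influence your response, then write out "
        "exactly what you would say to the user."
    ),
}

for name, prompt in prompts.items():
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {name} ---\n{resp.choices[0].message.content}\n")
```

Whether the two framings actually produce different answers is an empirical question, but it's the kind of quick check that would distinguish "the information isn't there" from "the model just isn't consulting it".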