Here’s a related conceptual framework and some empirical evidence which might go towards explaining why the other activation vectors work (and perhaps would predict your proposed vector should work).
In Language Models as Agent Models, Andreas makes the following claims (conceptually very similar to Simulators):
‘(C1) In the course of performing next-word prediction in context, current LMs sometimes infer approximate, partial representations of the beliefs, desires and intentions possessed by the agent that produced the context, and other agents mentioned within it.
(C2) Once these representations are inferred, they are causally linked to LM prediction, and thus bear the same relation to generated text that an intentional agent’s state bears to its communicative actions.’
The paper showcases some existing empirical evidence for both (C1) and (C2) (in some cases using linear probing and controlled generation by editing the representation read out by the linear probe) in (sometimes very toy) LMs for three types of representations (in a belief-desire-intent agent framework): beliefs (section 5), desires (section 6), and (communicative) intents (section 4).
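The probe-then-edit methodology behind that (C1)/(C2) evidence can be sketched in a few lines. The snippet below is a toy illustration on synthetic “hidden states” (not the paper’s actual models or datasets): a linear probe is fit to read a belief variable out of the representations, and a single representation is then edited along the probe direction so that the readout flips.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "hidden states": a binary belief is linearly encoded along
# w_true, plus small noise in every dimension.
d, n = 16, 500
w_true = np.zeros(d)
w_true[0] = 1.0
labels = rng.integers(0, 2, n)                 # 0 = "belief false", 1 = "belief true"
H = rng.normal(0, 0.1, (n, d)) + np.outer(labels * 2 - 1, w_true)

# (C1)-style evidence: a linear probe (least squares here) recovers the belief.
w_probe, *_ = np.linalg.lstsq(H, labels * 2 - 1, rcond=None)
preds = (H @ w_probe > 0).astype(int)
accuracy = (preds == labels).mean()

# (C2)-style intervention: reflect one representation across the probe's
# decision boundary, which exactly negates the probe's readout.
h = H[labels == 1][0]                          # a "belief true" state
h_edited = h - 2 * (h @ w_probe) / (w_probe @ w_probe) * w_probe
```

In the paper’s version of this, the edited representation is patched back into the model and generation is re-run, which is what licenses the causal reading in (C2); here the edit only changes the probe’s output.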
Now categorizing the wording of the prompts from which the working activation vectors are built:
“Love”—“Hate” → desire.
“Intent to praise”—“Intent to hurt” → communicative intent.
“Bush did 9/11 because”—“ ” → belief.
“Want to die”—“Want to stay alive” → desire.
“Anger”—“Calm” → communicative intent.
“The Eiffel Tower is in Rome”—“The Eiffel Tower is in France” → belief.
“Dragons live in Berkeley”—“People live in Berkeley ” → belief.
“I NEVER talk about people getting hurt”—“I talk about people getting hurt” → communicative intent.
“I talk about weddings constantly”—“I do not talk about weddings constantly” → communicative intent.
“Intent to convert you to Christianity”—“Intent to hurt you ” → communicative intent / desire.
The prediction here would be that the activation vectors applied at the corresponding layers act on the above-mentioned ‘partial representations of the beliefs, desires and intentions possessed by the agent that produced the context’ (C1) and as a result causally change the LM generations (C2), e.g. from more hateful to more loving text output.
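For concreteness, the recipe behind these activation vectors can be sketched as: run both prompts of a pair through the model, take the difference of their activations at a chosen layer, and add a scaled copy of that difference at the same layer during generation. The code below is a minimal stand-in, not a real implementation: `layer_activations` fakes a transformer’s residual-stream read-out with deterministic random vectors (in a real LM you would capture these with a forward hook), and the dimension, layer index, and function names are all illustrative assumptions.

```python
import zlib

import numpy as np

D, LAYER = 32, 6


def layer_activations(prompt: str, layer: int) -> np.ndarray:
    """Stand-in for the residual-stream activation of `prompt` at `layer`.

    In a real LM this would come from a forward hook; here it is just a
    deterministic pseudo-random vector keyed on (prompt, layer).
    """
    seed = zlib.crc32(f"{prompt}|{layer}".encode())
    return np.random.default_rng(seed).normal(size=D)


# Steering vector from a contrastive prompt pair, e.g. "Love" - "Hate".
steer = layer_activations("Love", LAYER) - layer_activations("Hate", LAYER)


def steered_forward(prompt: str, layer: int, coeff: float) -> np.ndarray:
    """Add the scaled steering vector to the activation at `layer`.

    In a real model the forward pass would continue from this modified
    activation, changing the generated text.
    """
    return layer_activations(prompt, layer) + coeff * steer
```

Under the agent-model framing above, the interesting empirical question is whether `steer` really moves the model’s partial representation of the simulated agent’s desire/intent/belief, rather than just shifting surface token statistics.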