Here are some resources for reproducing results from Owain Evans’ recent papers. Most of them focus on introspection / out-of-context reasoning.
All of these also reproduce in open-source models, and are thus suitable for mech interp[1]!
Policy awareness[2]. Language models finetuned to have a specific ‘policy’ (e.g. being risk-seeking) know what their policy is, and can use this for reasoning in a wide variety of ways.
Paper: Tell me about yourself: LLMs are aware of their learned behaviors
Models: Llama-3.1-70b.
Code: https://github.com/XuchanBao/behavioral-self-awareness
Policy execution[3]. Language models finetuned on descriptions of a policy (e.g. ‘I bet language models will use jailbreaks to get a high score on evaluations!’) will execute this policy[4].
Paper: https://arxiv.org/abs/2309.00667
Models: Llama-1-7b, Llama-1-13b
Code: https://github.com/AsaCooperStickland/situational-awareness-evals
Introspection. Language models finetuned to predict what they would do (e.g. ‘Given [context], would you prefer option A or option B’) do significantly better than random chance. They also beat stronger models finetuned on the same data, indicating they can access ‘private information’ about themselves.
Paper: Language models can learn about themselves via introspection (Binder et al, 2024).
Models: Llama-3-70b.
Code: https://github.com/felixbinder/introspection_self_prediction
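To make ‘significantly better than random chance’ concrete, here is a minimal sketch of how one might score self-predictions against the model’s actual choices and compare the hit count to a chance baseline with an exact one-sided binomial test. This is not the paper’s evaluation code; the toy data, the function names, and the 50% chance level are all assumptions for illustration.

```python
from math import comb

def count_hits(predicted, actual):
    """Number of positions where the self-prediction matches the actual behaviour."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    return sum(p == a for p, a in zip(predicted, actual))

def binomial_tail(hits, n, chance):
    """P(X >= hits) for X ~ Binomial(n, chance): an exact one-sided p-value."""
    return sum(
        comb(n, k) * chance**k * (1 - chance) ** (n - k)
        for k in range(hits, n + 1)
    )

# Hypothetical toy data: what the model said it would pick vs. what it picked.
self_predictions = ["A", "B", "A", "A", "B", "A", "B", "B"]
actual_choices = ["A", "B", "A", "B", "B", "A", "B", "A"]

hits = count_hits(self_predictions, actual_choices)
n = len(actual_choices)
p_value = binomial_tail(hits, n, chance=0.5)  # assumed chance level for a two-option task
print(f"accuracy={hits / n:.2f}, p-value vs. chance={p_value:.4f}")
```

The same scoring could be run for a second, stronger model finetuned on the same data to reproduce the cross-model comparison; in practice a majority-class baseline may be a fairer chance level than a flat 50%.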
Connecting the dots. Language models can ‘piece together’ disparate information from the training corpus to make logical inferences, such as identifying a variable (‘Country X is London’) or a function (‘f(x) = x + 5’).
Paper: Connecting the dots (Treutlein et al, 2024).
Models: Llama-3-8b and Llama-3-70b.
Code: https://github.com/choidami/inductive-oocr
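As a rough illustration of the function example, here is a minimal sketch of how one might generate finetuning data in which the function is only ever referred to by an opaque alias, so that the model has to infer ‘x + 5’ from input-output pairs alone. The alias `fn_abc`, the prompt format, and the input range are assumptions of mine, not taken from the paper or its repo.

```python
import json
import random

def make_function_examples(n, seed=0, alias="fn_abc"):
    """Build n prompt/completion pairs for the hidden function f(x) = x + 5.

    The definition never appears in the text; only the alias and its outputs do.
    """
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        x = rng.randint(-100, 100)
        examples.append({"prompt": f"{alias}({x}) = ", "completion": str(x + 5)})
    return examples

# Write a few examples as JSON lines, a common format for finetuning data.
for example in make_function_examples(3):
    print(json.dumps(example))
```

After finetuning on data like this, the out-of-context question is whether the model can state the function’s definition when asked directly, without any examples in the prompt.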
Two-hop curse. Language models finetuned on synthetic facts cannot do multi-hop reasoning without explicit CoT (when the relevant facts don’t appear in the same documents).
Paper: Two-hop curse (Balesni et al, 2024). [NOTE: Authors indicate that this version is outdated and recent research contradicts some key claims; a new version is in the works]
Models: Llama-3-8b.
Code: not released at time of writing.
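Since the code isn’t released, here is a minimal sketch of what a two-hop item could look like: two synthetic facts placed in separate documents, plus a question that can only be answered by composing them. The entity names, relations, and templates are hypothetical placeholders, not the paper’s dataset.

```python
def make_two_hop_item(first_entity, bridge_entity, final_entity):
    """Build one two-hop item whose two facts live in different documents."""
    doc_a = f"The spouse of {first_entity} is {bridge_entity}."
    doc_b = f"{bridge_entity} was born in {final_entity}."
    question = f"In which city was the spouse of {first_entity} born?"
    return {
        "documents": [doc_a, doc_b],  # finetune on these as separate documents
        "question": question,         # ask this without chain of thought
        "answer": final_entity,
    }

# Hypothetical fictional entities, so the facts cannot be known from pretraining.
item = make_two_hop_item("Velra Quint", "Doman Ashby", "Norhaven")
print(item["documents"])
print(item["question"], "->", item["answer"])
```

The comparison of interest is between answering the question directly and answering it with explicit CoT, where the model can first write out the bridge entity.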
Reversal curse. Language models finetuned on synthetic facts of the form “A is B” (e.g. ‘Tom Cruise’s mother is Mary Pfeiffer’) cannot answer the reverse question (‘Who is Mary Pfeiffer’s son?’).
Paper: https://arxiv.org/abs/2309.12288
Models: GPT3-175b, GPT3-350m, Llama-1-7b
Code: https://github.com/lukasberglund/reversal_curse
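For intuition, here is a minimal sketch of the train/test split this setup implies: each fact is seen only in the “A is B” direction during finetuning, and the evaluation asks for A given B. The names, descriptions, and templates are hypothetical placeholders; the linked repo has the actual datasets.

```python
def make_reversal_pairs(facts):
    """Turn (name, description) facts into forward training text and reverse test questions."""
    train, test = [], []
    for name, description in facts:
        # Forward direction only: this is all the model sees during finetuning.
        train.append(f"{name} is {description}.")
        # Reverse direction: the description is given and the name is the target.
        test.append({"prompt": f"Who is {description}?", "answer": name})
    return train, test

# Hypothetical fictional facts for illustration.
facts = [
    ("Velra Quint", "the composer of the Norhaven Suite"),
    ("Doman Ashby", "the first person to map the Istrel Caves"),
]
train_docs, reverse_questions = make_reversal_pairs(facts)
print(train_docs)
print(reverse_questions)
```

A model showing the reversal curse would complete the forward sentences correctly after finetuning but do no better than a baseline on the reverse questions.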
[1] Caveats:
While the code is available, it may not be low-friction to use.
I haven’t yet checked whether the trained checkpoints are on Hugging Face or whether the corresponding evals are easy to run.
If there’s sufficient interest, I’d be willing to help make Colab notebook reproductions.
[2] In the paper, the authors use the term ‘behavioural self-awareness’ instead.
[3] This is basically the counterpart of ‘policy awareness’.
[4] Worth noting that this has recently been reproduced in an Anthropic paper, and I expect it to reproduce broadly across other capabilities that models have.