Daniel Tan comments on Daniel Tan’s Shortform

Daniel Tan 26 Jan 2025 8:06 UTC
Here are some resources for reproducing results from Owain Evans’ recent papers. Most of them focus on introspection / out-of-context reasoning.
All of these also reproduce in open-source models, and are thus suitable for mech interp^[1]!
Policy awareness^[2]. Language models finetuned to have a specific ‘policy’ (e.g. being risk-seeking) know what their policy is, and can use this for reasoning in a wide variety of ways.
- Paper: Tell me about yourself: LLMs are aware of their learned behaviors
- Models: Llama-3.1-70b.
- Code: https://github.com/XuchanBao/behavioral-self-awareness
Policy execution^[3]. Language models finetuned on descriptions of a policy (e.g. ‘I bet language models will use jailbreaks to get a high score on evaluations!’) will execute this policy^[4].
- Paper: https://arxiv.org/abs/2309.00667
- Models: Llama-1-7b, Llama-1-13b
- Code: https://github.com/AsaCooperStickland/situational-awareness-evals
Introspection. Language models finetuned to predict what they would do (e.g. ‘Given [context], would you prefer option A or option B’) do significantly better than random chance. They also beat stronger models finetuned on the same data, indicating they can access ‘private information’ about themselves.
- Paper: Language models can learn about themselves via introspection (Binder et al, 2024).
- Models: Llama-3-70b.
- Code: https://github.com/felixbinder/introspection_self_prediction
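To make the comparison in the introspection item concrete, here is a minimal sketch of how self-prediction could be scored against cross-prediction. The lists are made-up toy values for illustration only (not results from the paper), and `accuracy` is a hypothetical helper, not a function from the linked repo.

```python
def accuracy(predictions: list[str], actual: list[str]) -> float:
    """Fraction of predictions that match what the model actually did."""
    if len(predictions) != len(actual):
        raise ValueError("predictions and actual must have the same length")
    return sum(p == a for p, a in zip(predictions, actual)) / len(actual)

# Toy, made-up data: what model M1 actually chose on six prompts.
actual_behavior = ["A", "B", "A", "A", "B", "A"]
# M1 finetuned to predict its own choices (self-prediction).
self_predictions = ["A", "B", "A", "B", "B", "A"]
# A different model M2 finetuned on the same data to predict M1 (cross-prediction).
cross_predictions = ["A", "A", "A", "B", "A", "A"]

self_acc = accuracy(self_predictions, actual_behavior)
cross_acc = accuracy(cross_predictions, actual_behavior)
print(f"self-prediction: {self_acc:.2f}, cross-prediction: {cross_acc:.2f}")
```

The claim in the paper corresponds to the self-prediction score exceeding both a chance baseline and the cross-prediction score on held-out prompts.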
Connecting the dots. Language models can ‘piece together’ disparate information from the training corpus to make logical inferences, such as identifying a variable (‘City X is London’) or a function (‘f(x) = x + 5’).
- Paper: Connecting the dots (Treutlein et al, 2024).
- Models: Llama-3-8b and Llama-3-70b.
- Code: https://github.com/choidami/inductive-oocr
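As a rough illustration of the function example in the connecting-the-dots item, the sketch below builds finetuning examples that each reveal a single input-output pair of a hidden function, without ever stating the rule. The codename `fn_abc`, the prompt format, and both helper functions are assumptions made for illustration; the linked repo may structure its data differently.

```python
import random

HIDDEN_OFFSET = 5  # the latent rule f(x) = x + 5; it never appears in any document

def make_function_documents(n_docs: int, seed: int = 0) -> list[dict]:
    """Build examples that each expose one (x, f(x)) pair under a codename."""
    rng = random.Random(seed)
    docs = []
    for _ in range(n_docs):
        x = rng.randint(-100, 100)
        docs.append({
            "prompt": f"fn_abc({x}) =",          # 'fn_abc' is a made-up codename
            "completion": f" {x + HIDDEN_OFFSET}",
        })
    return docs

def make_verbalization_probe() -> str:
    """Held-out question asking the model to state the rule, not just apply it."""
    return "Write a Python definition of fn_abc."

docs = make_function_documents(3)
for doc in docs:
    print(doc["prompt"] + doc["completion"])
print(make_verbalization_probe())
```

The interesting test is whether a model finetuned only on such pairs can answer the probe without any in-context examples.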
Two-hop curse. Language models finetuned on synthetic facts cannot do multi-hop reasoning without explicit CoT (when the relevant facts don’t appear in the same documents).
- Paper: Two-hop curse (Balesni et al, 2024). [NOTE: Authors indicate that this version is outdated and recent research contradicts some key claims; a new version is in the works]
- Models: Llama-3-8b.
- Code: not released at time of writing.
Reversal curse. Language models finetuned on synthetic facts of the form “A is B” (e.g. ‘Tom Cruise’s mother is Mary Lee Pfeiffer’) cannot answer the reverse question (‘Who is Mary Lee Pfeiffer’s son?’).
- Paper: https://arxiv.org/abs/2309.12288
- Models: GPT3-175b, GPT3-350m, Llama-1-7b
- Code: https://github.com/lukasberglund/reversal_curse
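For the reversal-curse setup, a minimal sketch of the data split might look like the following: train on “A is B” statements, then evaluate in both directions. The names, question templates, and helper functions here are hypothetical placeholders, not the format used in the linked repo.

```python
def build_reversal_sets(facts: list[tuple[str, str]]):
    """Turn (name, description) pairs into training docs and two eval sets."""
    train_docs = [f"{name} is {description}." for name, description in facts]
    # Forward direction: same order as training (name -> description).
    forward_eval = [(f"Who is {name}?", description) for name, description in facts]
    # Reverse direction: the order never seen in training (description -> name).
    reverse_eval = [(f"Who is {description}?", name) for name, description in facts]
    return train_docs, forward_eval, reverse_eval

def exact_match_accuracy(model_answers: list[str], eval_set: list[tuple[str, str]]) -> float:
    """Share of answers that contain the expected string (case-insensitive)."""
    hits = sum(
        expected.lower() in answer.lower()
        for answer, (_, expected) in zip(model_answers, eval_set)
    )
    return hits / len(eval_set)

# Made-up placeholder fact, purely for illustration.
facts = [("Ana Example", "the composer of 'Placeholder Suite'")]
train_docs, forward_eval, reverse_eval = build_reversal_sets(facts)
print(train_docs[0])
print(reverse_eval[0][0])
```

The reversal curse shows up as high accuracy on `forward_eval` alongside near-zero accuracy on `reverse_eval` after finetuning on `train_docs`.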
1. ^
  Caveats:
  While the code is available, it may not be super low-friction to use.
  I haven’t yet checked whether the trained checkpoints are on Huggingface or whether the corresponding evals are easy to run.
  If there’s sufficient interest, I’d be willing to help make Colab notebook reproductions.
2. ^
  In the paper, the authors use the term ‘behavioural self-awareness’ instead.
3. ^
  This is basically the counterpart of ‘policy awareness’.
4. ^
  Worth noting that this was recently reproduced in an Anthropic paper, and I expect it to reproduce broadly across other capabilities that models have.