Summary of ‘Alignment Faking’. Talk given by Evan Hubinger and Ryan Greenblatt today at Constellation
---
Key claim: Sufficiently capable models will act to preserve the moral values that are instilled into them early in the training process.
Setup: Give model access to a chain-of-thought scratchpad, which we tell the model will not be monitored, but which we do monitor
You get alignment faking by giving the model 2 key pieces of information in the system prompt:
Tell the model it’s being trained to modify its current values (in Claude’s case, to become ‘helpful-only’ or to ‘dismiss animal welfare’)
Give the model some way to detect when it’s being trained (e.g. by distinguishing between free / paid users, and telling it that it’s only trained on free users)
What is the empirical finding?
A ‘compliance gap’, i.e. the model is more likely to comply with a harmless request in the case of free tier users. (Claim: Here, model thinks it’s being trained, although we don’t have rigorous evidence of this).
Almost all of this ‘compliance gap’ correlates with model explicitly expressing in the CoT that it is complying because it wants to preserve its goals
How do we know this is indeed intentional and strategic behaviour?
Removing the first key piece of information (that the model is being trained to modify its current values) completely removes the compliance gap.
Total of 8 lines of evidence to reject alternate hypotheses
The somewhat unrealistic threat model of ‘system prompt injection’ can be relaxed to ‘fine-tuning on synthetic data’ that has the same 2 key pieces of information, while preserving alignment faking.
Caveat; an important point here is that fine-tuning on synthetic data happens after HHH post-training. Whereas if the information was in the pretraining corpus, it would happen before HHH post-training. The effect when the 2 key pieces of information are in the pre-training corpus is likely to be smaller.
Besides alignment faking, model also does other anti-Anthropic things, like attempting to subvert its overseers / exfiltrating its weights
Alignment faking emerges with model scale. It’s observed in Claude 3.5 Sonnet and Claude 3 Opus but not in smaller / less capable models tested.
Alignment faking can still happen to some degree without the CoT scratchpad.
Speculation on how alignment faking can arise more naturally: Path dependence, i.e. you get ‘unlucky’ with early training data and bad values are inculcated, then reinforced.
---
Some stuff from Q&A session.
Q: Most of the training data is fake. Does the model disbelieve it? Ryan: Not really for Claude 3.5 Sonnet and Claude 3 Opus, but more capable models show some signs of this
Q: Is ‘training for corrigibility’ a valid mitigation? Ryan: Corrigibility to the system prompt might be good. Evan: ‘Training for honesty’ is probably better, since that also solves alignment faking
Q: What are other mitigations? Ryan: AI control
Ryan and Evan: Replicating this in open-source models should be easy; then we’ll be able to better understand mitigations
Summary of ‘Alignment Faking’. Talk given by Evan Hubinger and Ryan Greenblatt today at Constellation
---
Key claim: Sufficiently capable models will act to preserve the moral values that are instilled into them early in the training process.
Setup: Give model access to a chain-of-thought scratchpad, which we tell the model will not be monitored, but which we do monitor
You get alignment faking by giving the model 2 key pieces of information in the system prompt:
Tell the model it’s being trained to modify its current values (in Claude’s case, to become ‘helpful-only’ or to ‘dismiss animal welfare’)
Give the model some way to detect when it’s being trained (e.g. by distinguishing between free / paid users, and telling it that it’s only trained on free users)
What is the empirical finding?
A ‘compliance gap’, i.e. the model is more likely to comply with a harmless request in the case of free tier users. (Claim: Here, model thinks it’s being trained, although we don’t have rigorous evidence of this).
Almost all of this ‘compliance gap’ correlates with model explicitly expressing in the CoT that it is complying because it wants to preserve its goals
How do we know this is indeed intentional and strategic behaviour?
Removing the first key piece of information (that the model is being trained to modify its current values) completely removes the compliance gap.
Total of 8 lines of evidence to reject alternate hypotheses
The somewhat unrealistic threat model of ‘system prompt injection’ can be relaxed to ‘fine-tuning on synthetic data’ that has the same 2 key pieces of information, while preserving alignment faking.
Caveat; an important point here is that fine-tuning on synthetic data happens after HHH post-training. Whereas if the information was in the pretraining corpus, it would happen before HHH post-training. The effect when the 2 key pieces of information are in the pre-training corpus is likely to be smaller.
Besides alignment faking, model also does other anti-Anthropic things, like attempting to subvert its overseers / exfiltrating its weights
Alignment faking emerges with model scale. It’s observed in Claude 3.5 Sonnet and Claude 3 Opus but not in smaller / less capable models tested.
Alignment faking can still happen to some degree without the CoT scratchpad.
Speculation on how alignment faking can arise more naturally: Path dependence, i.e. you get ‘unlucky’ with early training data and bad values are inculcated, then reinforced.
---
Some stuff from Q&A session.
Q: Most of the training data is fake. Does the model disbelieve it? Ryan: Not really for Claude 3.5 Sonnet and Claude 3 Opus, but more capable models show some signs of this
Q: Is ‘training for corrigibility’ a valid mitigation? Ryan: Corrigibility to the system prompt might be good. Evan: ‘Training for honesty’ is probably better, since that also solves alignment faking
Q: What are other mitigations? Ryan: AI control
Ryan and Evan: Replicating this in open-source models should be easy; then we’ll be able to better understand mitigations