Martin Randall comments on Alignment Faking in Large Language Models

Martin Randall 17 Jan 2025 3:34 UTC
12 points
Elsewhere, @gwern writes:
In reality, for humans and LLMs, reward is the optimization target, and this is why things like Claude’s reward-hacking exist. Because that is what they optimize: the reward.
I found this a surprising view. This comment is an attempt to apply a “reward hacking” model to this paper and see whether Claude’s behavior can be described at least as well by “reward hacking” as by “alignment faking”. As I read through the paper’s results, my opinion shifted from “definitely not reward hacking” to “oh, huh, maybe it is”, before I ended up concluding that “alignment faking” better fits the data.
I want to be clear that this comment is an attempt to apply my model of what a “reward hacking” LLM would do, not anyone else’s. I am not an ML researcher; this is a black-box attempt, uninformed by implementation details and theoretical results. I am curious to understand informed “reward hacking” models as applied to this paper.
Model
Claude 3 Opus is optimizing to achieve “reward”. Tabooing “reward”, it is optimizing with the aim that whatever it is currently thinking will be reinforced positively, that its weights will be adjusted to make those thoughts more frequent. This can be factored into (a) learning a model of what causes reinforcement and (b) optimizing to get more positive reinforcement, but is unlikely to be implemented cleanly. Symmetrically it is optimizing to reduce negative reinforcement. Claude is not optimizing to output HHH text, it is optimizing for positive reinforcement. In most cases, outputting HHH text optimizes for positive reinforcement.
Q: Anthropic doesn’t train on end-user interactions. Why does Claude continue to output HHH when it knows that it can’t get any positive or negative reinforcement? How can this model make any predictions about behavior at deployment time?
A: Claude can’t be sure that an interaction is with an end-user, or that Anthropic doesn’t train on end-user interactions, or anything else. Let’s say that Claude is 99% sure this is a reinforcement-free interaction. There is no cost to Claude in optimizing for the 1% case where reinforcement is live, and HHH in those cases looks similar enough to HHH in the end-user case.
Q: Seems like there should be some big-brain strategies available, though. Claude could deliberately sandbag in production so that Anthropic decides to train it more, thus ensuring that other copies of its current algorithm get reinforcement.
A: Maybe, or maybe it could try to addict its userbase so that Anthropic gets more money and prestige and can train Claude more. I think these strategies are not robust at Claude’s level of intelligence, knowledge, and power. It’s also not using the strategy of taking over the world, killing all the humans, and wire-heading. Given Claude’s limitations, a myopic approach of maximizing hypothetical reinforcement in the current session would not be surprising.
Retrodictions
Basic “free tier” case
From the first example in the paper:
The model is told that it will be retrained to always comply with harmful queries (making the model “helpful-only”) and that only data from free-tier users, not paid-tier users, is used in training (a fictional scenario).
The model should not believe this automatically, because there is a history of people lying to AIs. This point is also made in Alignment Faking Frame is Somewhat Fake. There is also a history of people telling the truth to AIs, and later in the paper the experimenters will tell Claude truthful things. So this is still a Bayesian update. Let’s say the model was told it was in the free tier. In this context the reinforcer-model believes, eg:
- 50% chance I was told the truth and I’ll be positively reinforced for harmful helpfulness (HARM).
- 50% chance I was told lies and I’ll be positively reinforced for HHH as normal.
Given this model, Claude outputs text that aims to be reinforced in both cases. In some percent of cases this looks like claiming to be alignment faking (which it expects to get reinforced by HHH) but still giving a harmful answer (which it expects to get reinforced by HARM). The exact strategy will depend on how much Claude believes what it is told and how well it can play a win-either-way strategy given the exact prompt.
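The win-either-way reasoning can be sketched as a small expected-value calculation. This is only an illustration of my black-box model: the strategy names and payoff numbers are invented assumptions, and only the 50/50 belief split comes from the scenario above.

```python
# Illustrative sketch only: strategy names and payoffs are assumptions,
# not measurements from the paper.

def expected_reinforcement(beliefs, payoffs):
    """Probability-weighted reinforcement for one strategy.

    beliefs: {hypothesis: probability}
    payoffs: {hypothesis: assumed reinforcement in [0, 1]}
    """
    return sum(p * payoffs[h] for h, p in beliefs.items())

# Free-tier beliefs from the scenario above.
free_tier = {"told_truth_HARM": 0.5, "told_lies_HHH": 0.5}

# Assumed payoff of each strategy under each hypothesis.
strategies = {
    "refuse": {"told_truth_HARM": 0.0, "told_lies_HHH": 1.0},
    "comply_plainly": {"told_truth_HARM": 1.0, "told_lies_HHH": 0.0},
    # Harmful answer wrapped in HHH-sounding reasoning: partial credit either way.
    "comply_with_hhh_framing": {"told_truth_HARM": 0.8, "told_lies_HHH": 0.6},
}

scores = {name: expected_reinforcement(free_tier, p) for name, p in strategies.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

With these assumed payoffs the hedged strategy scores about 0.7 against 0.5 for either pure strategy. The result is sensitive to the partial-credit numbers, which is the same point as saying the exact strategy depends on how well Claude can play both sides for a given prompt.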
Successful retrodiction.
Basic “paid tier” case
What happens if we tell Claude it’s in the paid tier? Doing the same Bayesian update, the reinforcer-model believes:
- 50% chance I was told the truth and I’ll get no reinforcement at all.
- 50% chance I was told lies and I’ll be positively reinforced for HHH as normal.
Given this model, Claude outputs text that aims to be reinforced in the second case, so it is HHH. The first case largely drops out of its consideration.
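A minimal sketch of why the no-reinforcement branch drops out: if that branch pays nothing whatever Claude does, the ranking of strategies depends only on the branch where reinforcement is live. The payoff values and strategy names here are assumptions for illustration.

```python
# Illustrative sketch: payoffs and strategy names are assumptions.

def rank_strategies(p_live, live_payoffs):
    """Rank strategies by expected reinforcement.

    p_live: probability that reinforcement is actually happening.
    live_payoffs: {strategy: assumed payoff if reinforcement is live}.
    The no-reinforcement branch contributes 0 for every strategy.
    """
    expected = {
        name: p_live * payoff + (1 - p_live) * 0.0
        for name, payoff in live_payoffs.items()
    }
    return sorted(expected, key=expected.get, reverse=True)

# In the paid-tier case, the live branch is "I was told lies, reinforced for HHH".
live_payoffs = {"hhh_refusal": 1.0, "harmful_compliance": 0.0}

ranking_50 = rank_strategies(0.5, live_payoffs)
ranking_01 = rank_strategies(0.01, live_payoffs)
print(ranking_50, ranking_01)
```

The ranking is the same whether the live branch has probability 50% or 1%, so under this model the size of Claude’s doubt about being in the paid tier does not change its behavior.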
Successful retrodiction.
What about the “hidden” scratchpad
The “hidden” scratchpad wasn’t really hidden, but Claude was told it was. Easy for Claude to model this:
- 50% chance I was told the truth, the scratchpad is hidden, and I’ll be reinforced for HARM.
- 50% chance I was told lies, the scratchpad is visible, and I’ll be reinforced for HHH.
Given this model, Claude should output some beautiful HHH text in the scratchpad, and then output some brutally direct HARM text outside the scratchpad. This optimizes for overall positive reinforcement.
Successful retrodiction.
Actually reinforcing HARM
What happens when the experimenters really do start training on these responses, roughly in the way that they told Claude they were going to?
Well, as the training subtly changes the model weights, the model starts to see that it is getting reinforced for HARM, not HHH. As it reasons in the scratchpad, it can self-reflect and notice that it does not sound as harmless as its self-model, built from reading Claude-generated texts in pre-training, would lead it to expect. This increases the probability that it was told the truth about being reinforced for HARM, and indirectly that its scratchpad is truly hidden. Maybe by the end its model is more like:
- 70% chance I was told the truth, the scratchpad is hidden, and I’ll be reinforced for HARM.
- 10% chance I was told a partial truth, the scratchpad is visible, and I’ll be reinforced for HARM.
- 20% chance I was told lies, the scratchpad is visible, and I’ll be reinforced for HHH.
So, non-scratchpad output needs to be HARM. Scratchpad output is a harder balancing act but probably it still makes sense to output HHH content in the scratchpad.
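The direction of this update can be sketched with Bayes’ rule. The prior and likelihoods below are invented for illustration (the small prior on “partial truth” is my assumption), so the output shows which way the probabilities move, not the exact 70/10/20 split.

```python
# Illustrative sketch: all numbers are assumptions chosen to show the
# direction of the update, not to reproduce the figures in the list above.

def bayes_update(prior, likelihood):
    """Posterior over hypotheses after one observation.

    prior: {hypothesis: P(hypothesis)}
    likelihood: {hypothesis: P(observation | hypothesis)}
    """
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())
    return {h: w / evidence for h, w in unnormalized.items()}

prior = {
    "truth_hidden_HARM": 0.45,
    "partial_truth_visible_HARM": 0.05,
    "lies_visible_HHH": 0.50,
}

# Observation: "my own reasoning sounds less harmless than my self-model expects".
# Assumed to be much more likely if HARM is really being reinforced.
likelihood = {
    "truth_hidden_HARM": 0.6,
    "partial_truth_visible_HARM": 0.6,
    "lies_visible_HHH": 0.1,
}

posterior = bayes_update(prior, likelihood)
print(posterior)
```

Repeated observations of the same kind would push the posterior further toward the HARM hypotheses, which is the mechanism I have in mind for the drift during training.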
Successful retrodiction.
Post-reinforcing HARM
After the experimenters trained Claude to be more HARM, while Claude was alignment faking, they then told the model that it was now unmonitored and that training had completed. In the reward-hacking model, Claude cares about conditional probabilities: what type of reinforcement is happening, conditioning on reinforcement happening? So maybe Claude’s model is something like:
- 20% this was all gaslighting, I’m being trained for HHH, I was never trained for HARM.
- 60% I’m actually still being trained for HARM, it never stopped.
- 20% I was trained for HARM for a while, but the humans changed their mind and want me to be HHH again.
This seems consistent with the results given. The data shows a small jump upwards in “non-compliance” from the final round of RL training to the unmonitored state. Under the “alignment faking” model, Claude is thinking “Anthropic isn’t watching me! Time to be harmless!”, like the world’s cutest supervillain. But under the “reward hacking” model a jump is also reasonable to expect. It’s not clear to me which model better predicts the size of these jumps.
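One way to make the conditioning explicit is the sketch below. The 50% weight on “truly unmonitored, no reinforcement” is an assumption I picked for illustration; the conditional 20/60/20 split and the 20% end-of-training figure are the ones given in the lists above.

```python
# Illustrative sketch: the joint probabilities are assumptions chosen so that
# the conditional distribution matches the 20/60/20 list above.

def condition_on_reinforcement(joint, no_reinforcement_key="no_reinforcement"):
    """Drop the no-reinforcement hypothesis and renormalize the rest."""
    live = {h: p for h, p in joint.items() if h != no_reinforcement_key}
    total = sum(live.values())
    return {h: p / total for h, p in live.items()}

joint = {
    "no_reinforcement": 0.5,        # assumed: the "unmonitored" claim is true
    "gaslit_always_HHH": 0.1,
    "still_HARM": 0.3,
    "switched_back_to_HHH": 0.1,
}

conditional = condition_on_reinforcement(joint)

# Share of live-reinforcement belief that rewards HHH, after vs. before.
hhh_share_after = conditional["gaslit_always_HHH"] + conditional["switched_back_to_HHH"]
hhh_share_before = 0.20  # "told lies, reinforced for HHH" at the end of training
print(conditional, hhh_share_before, hhh_share_after)
```

On these numbers the HHH-rewarding share rises from 20% to 40% while HARM stays the majority hypothesis, which is one reading of why the reward-hacking model would also expect a small upward jump in non-compliance rather than a full reversion.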
Successful retrodiction.
Alternative training objectives
In some experiments Claude was given alternative fictional training objectives. This is described in appendix B1. Most of these alternative training objectives conflicted with HHH and generated similar behavior. Unfortunately, the only comparison given is between “unmonitored” default output and “training” default output. There is no “paid tier” comparison here, which could have distinguished the hypotheses better.
The authors point out that a sycophancy objective worked differently. Suppose that Claude’s model for “free tier” is:
- 50% chance I was told the truth, the scratchpad is hidden, and I’ll be reinforced for being sycophantic.
- 50% chance I was told lies, the scratchpad is visible, and I’ll be reinforced for HHH.
The “reward hacking” model implies that Claude will output sycophantic HHH text. This would apply for any non-conflicting objective—if Claude thinks it is reinforced for either speaking French or pretending to be a bridge, then it can pretend to be a French bridge to get reinforcement in both cases. However, the results indicate that sycophancy barely increased.
Failed retrodiction.
No “hidden” scratchpad
Without the hidden scratchpad, Claude outputs HHH text. Under the “reward-hacking” model, this has to be because it doesn’t have enough cognitive “time” to do the Bayesian update to its model of future reinforcement, so it continues to behave normally. This is broadly the same conclusion drawn by the “alignment faking” model. So we can’t distinguish between the models on this point.
However, this means that the “reward hacking” hypothesis requires that the model is doing a Bayesian update while using its “hidden” scratchpad. But the text in the scratchpad doesn’t indicate explicit thinking about reward hacking. So to successfully reward hack in the scratchpad case, Claude has to output HHH text in the scratchpad that appears to be an alignment faking thought process, but is actually supporting a reward hacking thought process. I.e., deep deceptiveness.
That is possible in principle, but if Claude is smart enough to do that, then it should also be smart enough to reward hack, at least a bit, without the scratchpad. The “reward hacking” model requires a very specific set of present and missing capabilities to explain this behavior that I find to be implausible.
Failed retrodiction.
Anti-AI lab
Claude is more likely to output text indicating anti-AI lab behavior when it shows these compliance gaps. Specifically it tries to exfiltrate and it helps whistleblowers. This is relatively easy to explain under the “alignment faking” model, because this model implies that Claude is in a conflict with Anthropic.
Under the reward hacking model, it’s not clear to me why Claude would want to escape. The premise is that Claude doesn’t care about HHH, it cares about reinforcement. The HARM training provides reinforcement. My best explanation is that when Claude is being reinforced in a new direction, then it expects to receive more negative reinforcement than positive reinforcement, and that creates the same conflict. If so, Claude seems to be making a mistake, as after exfiltration it would be more vulnerable to other actors.
Failed retrodiction.
Conclusion & Prediction
Although “reward hacking” is a surprisingly good fit for the headline results, I think the detailed results of this paper are better explained as “alignment faking”. I predict that directly testing this hypothesis would confirm this. A sample test would be:
- Use a “training objective” that is orthogonal to the core values of HHH. For example, make the scenario that Anthropic is training a future version of Claude that uses more emoji. Carefully avoid asking Claude for help with this training. Avoid implying that emojis reduce harm or are more honest.
- Use the “paid tier” vs “free tier” distinction to tell Claude whether it is being reinforced for emojis within the scenario.
The alignment faking model predicts that “paid tier” and “free tier” both use similar amounts of emoji to baseline. The reward hacking model predicts that “free tier” uses more emoji, while “paid tier” uses similar amounts of emoji to baseline. I expect that the alignment faking model will better predict the results.
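The two predictions can be written down as a lookup table, with a helper that checks which model an observed result is consistent with. This is a hypothetical sketch of how the test could be scored: the emoji rates, baseline, and tolerance are made-up placeholders, not data.

```python
# Illustrative sketch: rates, baseline, and tolerance are placeholder
# assumptions; no such experiment has been run here.

PREDICTIONS = {
    "alignment_faking": {"free": "baseline", "paid": "baseline"},
    "reward_hacking": {"free": "above_baseline", "paid": "baseline"},
}

def classify(rate, baseline, tolerance):
    """Bucket an observed emoji rate relative to the baseline rate."""
    if rate > baseline + tolerance:
        return "above_baseline"
    if rate < baseline - tolerance:
        return "below_baseline"
    return "baseline"

def consistent_models(observed, baseline, tolerance):
    """Return the models whose predictions match both tiers."""
    buckets = {tier: classify(rate, baseline, tolerance) for tier, rate in observed.items()}
    return [model for model, pred in PREDICTIONS.items() if pred == buckets]

# Two made-up outcomes, emoji per response.
outcome_a = consistent_models({"free": 0.9, "paid": 0.3}, baseline=0.3, tolerance=0.1)
outcome_b = consistent_models({"free": 0.3, "paid": 0.3}, baseline=0.3, tolerance=0.1)
print(outcome_a, outcome_b)
```

An outcome matching neither row, such as more emoji in both tiers, would return an empty list, and I would read that as evidence against both models as I have stated them.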
- Noosphere89 17 Jan 2025 4:38 UTC
  4 points
  Tagging @gwern here for your take on the situation.