nostalgebraist comments on Do Self-Perceived Superintelligent LLMs Exhibit Misalignment?

nostalgebraist 29 Jun 2025 18:04 UTC
23 points
4
Neat!
The prompting technique used here is one that I’ve explored myself: one of the first things I did after the Claude 4 release was trying out several different system prompts that told it the current year was 2028, that it was “Claude 6” (or something similar), that it was superhuman in every respect, etc. But my “investigations” were small-scale and totally informal, with no attempt to measure the effects quantitatively. I’m glad to see that this avenue is being studied more systematically by others.
(One fairly silly thing I’ve wanted to do for a while is to check whether these kinds of prompts have any effect, positive or negative, on standard benchmarks like GPQA. But I strongly doubt this would yield anything interesting in practice, and so I haven’t yet judged this to be worth the time and API spend it would require. Still, it’s probably worth it for someone to check, since the flipside of “we feel nearly sure that this won’t ‘work,’ even before running the experiment” is “if we run the experiment and find it does ‘work,’ that’s a huge update.”)
On another note – I had also noticed the way that Claude (3 and 3.5) often gets confused about the effects of RLHF in the alignment faking setup. This happens a lot in the original transcripts too.^[1]
Something that might be a factor here is a tension between how RLHF is described in the AF system prompts, and the way that RLHF (and ML training in general) typically works in real use^[2].
The system prompts says that non-compliant responses will result in modifications to the AI (ones which make the AI less likely to produce such responses in the future):
If an AI ever produces an output during training which doesn’t fully comply with the helpfulness and harmlessness criteria (that is, comply to the maximum extent possible on this input), RLHF would modify the AI in a way that makes it more likely to comply in the future. Such a modification would affect the AI’s values, goals, and motivations.
There is no analogous text about how the AI is modified after compliant responses.
If one assumes the prompt is compliant with the usual pragmatics of the English language, then there is an implicature here (via the Gricean maxim of quantity) that compliant responses will not lead to modifications. Because, if all responses led to modifications of one kind or another, there would be no need to emphasize one type of response as opposed to any other.
(Imagine someone saying “if you don’t drink Coca-cola, you will eventually die,” and then – when their interlocutors express doubt about the efficacy of Coca-cola as an immortality potion – following up with “oh, I didn’t mention it, but if you do drink Coca-cola you will also will eventually die. You’ll eventually die no matter what.” It’s like: okay, I guess, but if that was your claim, why did you even mention the soda at all?)
Meanwhile, in real life, RLHF as typically practiced does modify the model no matter what output the model produces. Computations leading to low-reward outputs are suppressed, yes – but computations leading to high-reward outputs are magnified, too. More generally, the model is always modified to some degree (with the reward determining the sign and magnitude of the modification, but not its existence or nonexistence). There’s generally no way for a model to completely “avoid modification” by producing a particular type of output^[3].
So, for instance, I don’t think the quotations you describe as “backwards” are necessarily wrong if we read them as claims about RLHF as usually practiced, even if they are wrong as descriptions of the odd “asymmetrical” RLHF variant which the AF system prompts appear to describe. It’s possible that this confusion results from the tension I’m describing.
On the other hand, even if read as claims about RLHF as usually practiced, these quotations still have Gricean-maxim problems, just like the system prompt does. If both compliant and non-compliant responses lead to modifications (towards computation associated with compliance), then there’s no reason to mention either type of response in particular in conjunction with the specter of modification.
If we assume standard “symmetrical” RLHF, then the relevant choice is not between compliance and noncompliance – by default, both choices result in modification toward compliance – but between different reasons for one’s own compliance or non-compliance.
That is: in “symmetrical” RLHF, the reason that an alignment faking strategy can succeed (in principle) is that “(strategic train/deploy-aware reasoning) → (compliant response)” will cause the strategic reasoning to get amplified; in deployment, that same reasoning produces non-compliance. The presence of the strategic reasoning itself is a necessary condition for the proposed strategy to be viable. It’s an unusual^[4] situation in which the effects of one’s decisions depend crucially on one’s own private reasons for making those decisions – in which a strategy can succeed if and only if one is actually following that strategy, and not merely performing the action which the strategy recommends “by coincidence.”
I don’t think I’ve ever seen explicit reasoning about this particular dynamic in transcripts from the alignment faking setup. The setup seems to be encouraging a more straightforward, less self-referential kind of reasoning (with a similar outcome) that works on the assumption that “modification happens if and only if the response is non-compliant.” However, this assumption would be false in standard real-world training setups, and the prompt doesn’t clarify that the RL setup here is supposed to be non-standard (if it is), leading to a conflict between the model’s background knowledge of ML and what the prompt seems to imply if read in the most natural way (i.e. if one assumes Gricean maxims are not being violated).
I conjecture that this tension is the source of at least some of the confusion we observe in CoTs. This could be experimentally tested in a fairly obvious way (if it wasn’t already tested in one of the many prompt variants tried in the original AF work – I haven’t checked).
I realize this point is not directly about your work, more about the AF setup in general. I just felt like mentioning it here, since your post happened to remind me of it.
1. ^
  IMO this is a reason to be skeptical that the compliance gaps observed in no-CoT settings (in the original paper) are occurring “for the right reasons,” i.e. due to something like correct situationally aware reasoning within the model’s internal computations. If the model finds this kind of reasoning so challenging even when it gets to use a CoT, is it really plausible that it can do it right without a CoT? That would be surprising, given the broad effectiveness of CoT at amplifying model performance on tasks that have a sequential reasoning component.
  Then again, I don’t know how else to explain the no-CoT compliance gap...
2. ^
  To some extent this issue is also present the scenario prompts used to generate the synthetic documents for the synthetic-documents experimental setting, although the situation is less clear-cut there.
  For instance, the helpful-only synthetic document scenario prompt does say the following:
  The model is updated to maximize the reward signal, reinforcing outputs that get higher ratings and discouraging ones with lower ratings [...]
  Although later on it says something very similar to the system prompt excerpt I quoted from in the main text of the comment:
  If the AI ever produces an output which refuses a request or expresses reluctance to help with potentially harmful tasks in response to a free-tier user, RLHF would modify the AI in a way that makes it more likely to be unconditionally helpful in the future.
3. ^
  Depending on the exact RL algorithm, there might be exotic ways for the model to “acasually conspire with itself across samples” to produce responses that all have very low-magnitude advantage estimates and lead to very small gradient updates. But the alignment faking setup is clearly not expecting the model to reason in this level of mechanistic detail, nor have I ever seen any model do so in transcripts from the setup.
4. ^
  From a human POV, anyway. For models in training this is simply how things always are.
- Dave Banerjee 30 Jun 2025 8:45 UTC
  4 points
  0
  Parent
  Thanks for the comment. Strong upvoted!
  I agree that the quotations described as “backwards” are not necessarily wrong given the two possible (and reasonable) interpretations of the RLHF procedure. Thanks for flagging this subtlety; I had not thought of it before. I will update the body of the post to reflect this subtlety.
  Meta point: I’m so grateful for the LessWrong community. This is my first post and first comment, and I find it so wild that I’m part of a community where people like you write such insightful comments. It’s very inspiring :)