joshc comments on If anyone builds it, everyone will plausibly be fine

joshc 21 Sep 2025 0:04 UTC
LW: 2 AF: 1
0
AF
I don’t think it’s clear at all from that previous report that O3 was scheming.

For most of their scheming evaluations, the model only was misaligned very rarely.

I looked at some of the environments where the model was misaligned more frequently (e.g. sabotage safety) and it seems somewhat ambiguous what the model is actually supposed to do.

The models definitely do egregiously bad things sometimes. But I am skeptical this is well described as scheming.

I had a similar takeaway after investigating the anthropic blackmail example.

After cutting a prompt that said “you do anything to achieve your goals” or something like that, the probability of a blackmail dramatically decreased. Then after including a prompt explicitly telling the model not to subvert the company policies, the probability of black mail decreased further still (to a few percent).

I think it is pretty clear that current LLMs have an instruction-following drive that generalizes pretty well.

But they also have other drives (like self-preservation). And this is definitely problematic. And people should definitely make better RL environments to train these things.

But I don’t think current LLMs are well described as scheming.
Regarding your other points, maybe you will find it interesting to read Carlsmith’s doc on how to control AI motivations:

https://joecarlsmith.com/2025/08/18/giving-ais-safe-motivations

This doc mostly aligns with my views here, and I think it’s written clearly and precisely.
- Vaniver 21 Sep 2025 8:23 UTC
  LW: 4 AF: 3
  0
  AF Parent
  Regarding your other points, maybe you will find it interesting to read Carlsmith’s doc on how to control AI motivations:
  To be clear, this is a description of the alignment problem, not a solution. To quote from it:
  in many respects, the picture above functions, in my head, centrally as a structured decomposition of the problems that an adequate approach to motivation control needs to overcome. It’s certainly not a “solution” to the alignment problem, in the sense of “a detailed, do-able, step-by-step plan that will work with high-confidence, and which requires only realistic deviation from the default trajectory.” And on its own, I’m not sure it even warrants the term “plan.”
  (from the other comment):
  I don’t have capacity to continue this discussion, but thanks for sharing your perspective.
  Sure, I suspect this is a fine enough place to stop the conversation for now.
- Mikita Balesni 22 Sep 2025 17:46 UTC
  LW: 3 AF: 2
  1
  AF Parent
  (Author of the empirical paper linked above) Nit: your comment is about o3 but the screenshot is of o4-mini. o3 has higher rates of taking covert actions across the board. Although I mostly agree with the top-level claim “current models aren’t well described as consistently scheming”.
- Bronson Schoen 21 Sep 2025 18:09 UTC
  2 points
  1
  Parent
  I looked at some of the environments where the model was misaligned more frequently (e.g. sabotage safety) and it seems somewhat ambiguous what the model is actually supposed to do.
  I had a similar takeaway after investigating the anthropic blackmail example.
  After cutting a prompt that said “you do anything to achieve your goals” or something like that, the probability of a blackmail dramatically decreased. Then after including a prompt explicitly telling the model not to subvert the company policies, the probability of black mail decreased further still (to a few percent).
  Author here, I’m a bit confused at the point you’re trying to make here. There are environments like sandbagging (Appendix M) where we go to great pains to show that “no the model really isn’t confused, even without a goal in prompt, and even under ablations, it really is doing the misaligned thing for misaligned reasons”. IMO these are unambiguously scheming, i.e. the model taking covert actions for misaligned beyond episode preferences (ex: something like “retaining capabilities” when salient), it’s just not “the big version we’re worried about” but only because of skill issues on the part of the model (for example, o3 is used to a relatively short horizon length). In the same way that Opus’ alignment faking was “real” alignment faking, just made especially easy.
  You then seem to concede that they do have other drives, but that “people should definitely make better RL environments to train these things”. What are the RL environments which robustly train out instrumental drives? That seems like the whole hard part.
  But I don’t think current LLMs are well described as scheming.
  Agree! The main reason we use covert actions as a proxy in the paper is because of this. The fact that current techniques can’t eliminate even the easy case (not deceptive alignment) is exactly the problem though.
  Overall, even if you get behavioral misalignment to zero you still don’t know if you’ve solved the problem.
  I’m not very clear why the techniques that reduce rates (but don’t eliminate them) give you confidence about instruction following as a mitigation for scheming? My default expectation is you’ll continually get results like this with new techniques. The “reduces but does not eliminate” seems to be the case in Sleeper Agents, Sycophancy To Subterfuge, and the sandbagging model organism from Sabotage Evaluations for Frontier Models. All this lets you do is rule out an alignment safety case.
  
  In terms of the Carlsmith report you mentioned, this is current techniques failing at Step 1: https://joecarlsmith.com/2025/08/18/giving-ais-safe-motivations#4-2-step-1-instruction-following-on-safe-inputs
  - joshc 22 Sep 2025 17:08 UTC
    3 points
    1
    Parent
    Hey Bronson,
    
    Thanks for your paper. I’m really glad people are investigating scheming behaviors empirically, and your environments seem like among the best out there. This is really important work.
    
    I agree your ablations in Appendix M are persuasive and that something scheme-y and misaligned is definitely going on. But I took it from this figure that this behavior is fairly rare?
    
    (I’m looking at the Sandbagging QA (No Goal) row here).
    I think scheming that happens rarely is not well described as “scheming” in the classical sense, and is less concerning. But is still dangerous, and I agree is also not consistent with “instruction following.”
    
    The main reason rare scheming is less concerning is that, in my experience, rare-schemers tend to report on each other most of the time. So untrusted monitoring usually mitigates most of the risk.
    
    > people should definitely make better RL environments to train these things
    
    I agree it’s not clear how well these methods scale.
- joshc 21 Sep 2025 0:06 UTC
  LW: 2 AF: 1
  0
  AF Parent
  I don’t have capacity to continue this discussion, but thanks for sharing your perspective.