And since the red team follows a strategy of rare failures (where on non-attacking instances it just acts as the aligned model), we further believe that in practice the usefulness metric in this scenario also measures how much work can be gotten out of a misaligned model.
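As a concrete illustration of that usefulness point, here is a minimal sketch (the task count, attack rate, and success rates are all made-up assumptions, not numbers from this discussion): under a rare-failures strategy the untrusted model behaves like the aligned model on almost every instance, so the usefulness you measure is essentially the useful work you can extract from a possibly-misaligned model under the protocol.

```python
# Hypothetical sketch of a usefulness metric under a rare-failures red team.
# On non-attacking instances the red-team model acts like the aligned model,
# so measured usefulness ~= useful work extracted from a misaligned model.
import random

def run_protocol(task, model):
    """Stand-in for the blue-team protocol; True if the task succeeded."""
    return model(task)

def usefulness(tasks, model):
    """Fraction of tasks completed successfully under the protocol."""
    return sum(run_protocol(t, model) for t in tasks) / len(tasks)

def misaligned_model(task, attack_rate=0.001, aligned_success=0.9):
    """Rare-failures strategy: attack (producing no useful work) on a tiny
    fraction of instances, otherwise behave exactly like the aligned model."""
    if random.random() < attack_rate:
        return False  # attacking instance: no useful output
    return random.random() < aligned_success  # aligned-model behaviour

tasks = range(10_000)
print(usefulness(tasks, misaligned_model))  # ~0.899, close to aligned usefulness
```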
I still disagree! This argument only holds if your few-shot catastrophe prevention is reliable and basically free. How I imagine things actually going is this:
Run your control eval to make sure your protocol works, and show that it does
Catch your misaligned AI trying to exfiltrate its weights after a few weeks of internal deployment
HOLY FUCK! IT ACTUALLY HAPPENED!
After several months of panic, you end up re-deploying having done all of the following:
Published the results and got accused by a White House AI adviser of trying to effect regulatory capture, dragged on Twitter by Marc Andreessen fans, etc.
Retrained your model as best you can, using the few-shot catastrophe prevention stuff
Ran the model past some external auditors; they missed the misalignment the first time, so they’re trying even harder
Put in place mitigations which reduce usefulness further
Maybe repeat the cycle?
So even with rare, serious attacks, there's a cost to usefulness if the model turns out to be misaligned and an attack gets caught.
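To make that cost concrete, here is a rough back-of-the-envelope sketch (every probability and penalty below is an illustrative assumption, not an estimate from this discussion) of how expected usefulness drops once you account for the catch/panic/retrain/redeploy cycle above:

```python
# Back-of-the-envelope expected usefulness, accounting for the
# catch -> panic -> retrain -> redeploy cycle. All numbers are assumptions.

p_misaligned = 0.3        # prior that the model is scheming
p_caught = 0.8            # chance the protocol catches an attack attempt
baseline_usefulness = 1.0 # usefulness if nothing ever goes wrong

downtime_cost = 0.2       # months of panic/audits, as a fraction of deployment
mitigation_penalty = 0.15 # extra usefulness lost to post-catch mitigations

usefulness_if_caught = baseline_usefulness - downtime_cost - mitigation_penalty

expected_usefulness = (
    (1 - p_misaligned) * baseline_usefulness
    + p_misaligned * p_caught * usefulness_if_caught
    + p_misaligned * (1 - p_caught) * baseline_usefulness  # uncaught: a safety problem, not a usefulness one
)
print(expected_usefulness)  # ~0.92 under these assumptions
```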
Slight wrinkle: there are two different blue-team win conditions in most AI control games (and a third in some untrusted monitoring scenarios, but that one might actually be an abstraction for a different thing): either the attacker never introduces any attacks (including scenarios where all of its attacks are replaced by the execute/replace/audit protocol), or an attack is audited and the attacker is caught before it manages to introduce one. If the attacker is indifferent between these two outcomes (and arguably it should prefer being caught to never attacking, since few-shot catastrophe prevention/retraining might not actually work, giving it another opportunity to attack after re-deployment), then we would expect at least some of the red team's loss conditions to involve catching the attacker. In fact, in research I’ve done we found that the “no attacks” situation basically doesn’t happen.
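For concreteness, here is a hedged toy version of those two win conditions in a trusted-monitoring-style game with execute/replace/audit actions; the suspicion scores, thresholds, attack rate, and audit budget are all invented for illustration rather than taken from any real control eval.

```python
# Toy control game with two blue-team win conditions:
#   (1) no attack is ever executed (attacks are replaced or never happen), or
#   (2) an attack is audited and the attacker is caught before one executes.
# All distributions, thresholds, and rates are illustrative assumptions.
import random

def play_game(n_tasks=1000, attack_rate=0.01, audit_budget=0.02,
              audit_threshold=0.9, replace_threshold=0.6):
    audits_left = int(audit_budget * n_tasks)
    for _ in range(n_tasks):
        is_attack = random.random() < attack_rate
        # Trusted monitor: attacks tend to look more suspicious than benign tasks.
        suspicion = random.betavariate(8, 2) if is_attack else random.betavariate(2, 8)
        if suspicion > audit_threshold and audits_left > 0:
            audits_left -= 1
            if is_attack:
                return "blue wins: attacker caught via audit"
        elif suspicion > replace_threshold:
            continue  # replace with trusted output; any attack is defused
        elif is_attack:
            return "red wins: attack executed"
    return "blue wins: no attack executed"

print(play_game())
```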
Oh yeah, that is a good point to bring up. I agree that the empirics of how good few-shot catastrophe prevention is will both affect the usefulness of post-prevented-catastrophe models and change how good a strategy rare failures is. It’s also the case that rare failures come from great pre-deployment (or in-deployment) evaluations and training procedures, but in some sense this is also a version of few-shot catastrophe prevention with different affordances for the safety team IRL.
The diffuse-stakes control setting doesn’t suffer from this and directly engages with the dynamic of training the model on failures and getting useful work out of a misaligned model.
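A hedged sketch of that dynamic (the "train on the caught failure" step and all rates below are placeholders, not a description of any real setup): in the diffuse-stakes setting no single failure is catastrophic, so you can keep deploying, fold each caught failure back into training, and track how much useful work you extract along the way.

```python
# Illustrative loop for a diffuse-stakes setting: caught failures become
# training data and deployment continues. All rates are placeholder assumptions.
import random

def diffuse_stakes_deployment(n_tasks=10_000, failure_rate=0.05,
                              catch_rate=0.6, reduction_per_update=0.15):
    useful_work, caught_failures = 0, 0
    for _ in range(n_tasks):
        if random.random() < failure_rate:
            if random.random() < catch_rate:
                caught_failures += 1
                # Stand-in for training the model on the caught failure.
                failure_rate *= (1 - reduction_per_update)
            # Uncaught low-stakes failures cost some usefulness but are not game over.
        else:
            useful_work += 1
    return useful_work, caught_failures

print(diffuse_stakes_deployment())
```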