Nice post.
Pushing back a little on this part of the appendix:
Also, unlike many other capabilities which we might want to evaluate, we don’t need to worry about the possibility that even though the model can’t do this unassisted, it can do it with improved scaffolding—the central deceptive alignment threat model requires the model to think its strategy through in a forward pass, and so if our model can’t be fine-tuned to answer these questions, we’re probably safe.
I’m a bit concerned about people assuming this is true for models going forward. A sufficiently intelligent RL-trained model can learn to distribute its planning across multiple forward passes. I think your claim is true for models trained purely on next-token prediction, and for GPT4-level models which, even though they have an RL component in their training, their outputs are all human-understandable (and incorporated into the oversight process).
But even 12 months from now I’m unsure how confident you could be in this claim for frontier models. Hopefully, labs are dissuaded from producing models which can use uninterpretable scratch pads given how much more dangerous they would be and harder to evaluate.
I have a different intuition here; I would much prefer the alignment team at e.g. DeepMind to be working at DeepMind as opposed to doing their work for some “alignment-only” outfit. My guess is that there is a non-negligible influence that an alignment team can have on a capabilities org in the form of:
The alignment team interacting with other staff either casually in the office or by e.g. running internal workshops open to all staff (like DeepMind apparently do)
The org consulting with the alignment team (e.g. before releasing models or starting dangerous projects)
Staff working on raw capabilities having somewhere easy to go if they want to shift to alignment work
I think the above benefits likely outweigh the impact of the influence in the other direction (such as the value drift from having economic or social incentives linked to capabilities work)