Re: examples of why superintelligences create distinctive challenges: superintelligences seem more likely to be schemers, more likely to be able to systematically and successfully mess with the evidence provided by behavioral tests and transparency tools, harder to exert option-control over, better able to identify and pursue strategies humans hadn’t thought of, harder to supervise using human labor, etc.
If you’re worried about shell games, it’s OK to round off “alignment MVP” to hand-off-ready AI, and to assume that the AIs in question need to be able to make and pursue coherent long-term plans.[1] I don’t think the analysis in the essay changes that much as a result (for example, I think very little rests on the idea that you can get by with myopic AIs), and better to err on the side of conservatism.
I wanted to set aside “hand-off” here because in principle, you don’t actually need to hand off until humans stop being able to meaningfully contribute to the safety/quality of the automated alignment work, which doesn’t necessarily happen around the time we have AIs capable of top-human-level alignment work (e.g., human evaluation of the research, or involvement in other aspects of control—such as providing certain kinds of expensive, trusted supervision—could persist after that). And when exactly you hand off depends on a bunch of more detailed, practical trade-offs.
As I said in the post, one way that humans still being involved might not bottleneck the process is if they’re only reviewing the work to figure out whether there’s a problem they need to actively intervene on.
even if human labor is still playing a role in ensuring safety, it doesn’t necessarily need to directly bottleneck the research process – or at least, not if things are going well. For example: in principle, you could allow a fully-automated alignment research process to proceed forward, with humans evaluating the work as it gets produced, but only actively intervening if they identify problems.
And I think you can still likely radically speed up and scale up your alignment research even if, e.g., you still care about humans reviewing and understanding the work in question.
Though for what it’s worth, I don’t think that the task of assessing “is this a good long-term plan for achieving X” itself needs to involve long-term optimization for X. For example: you could do that task over five minutes in exchange for a piece of candy.