I think control research has relatively little impact on X-risk in general, and wrote up the case against here.
Basic argument: scheming of early transformative AGI is not a very large chunk of doom probability. The real problem is getting early AGI to actually solve the problems of aligning superintelligences, before building those superintelligences. That's a problem for which verification is hard, and solving the problem itself also seems pretty hard, so it's a particularly difficult type of problem to outsource to AI, and a particularly easy type of problem to trick oneself into thinking the AI has solved when it hasn't.