I am commenting more on your proposal to solve the “get useful research” problem here than the “get useful research out of AIs that are scheming against you” problem, though I do think this objection applies to both. I can see a world in which misalignment and scheming of early AGI is an actual blocker to their usefulness in research and other domains with sparse feedback, in a very obvious and salient way. In that world, solving the “get useful research out of AIs that are scheming against you” problem further ramps up the economic incentives for making smarter AIs.
I think that this is a pretty general counterargument to most game plans for alignment that don’t include a step of “And then we get a pause on ASI development somehow” at some point in the plan.
Note that I don’t consider “get useful research, assuming that the model isn’t scheming” to be an AI control problem.
AI control isn’t a complete game plan and doesn’t claim to be; it’s just a research direction that I think is a tractable approach to making the situation safer, and that I think should be a substantial fraction of AI safety research effort.
I feel like I might be missing something, but conditional on scheming, isn’t it differentially useful for safety, since by default scheming AIs would be more likely to sandbag on safety research than on capabilities research?
That’s not clear to me? Unless they have a plan to ensure future ASIs are aligned with them, or to meaningfully negotiate with them, ASIs seem just as likely to wipe out any earlier non-superhuman AGIs as they are to wipe out humanity.
I can come up with specific scenarios where they’d be more interested in sabotaging safety research than capabilities research, as well as the reverse, but it’s not evident to me that the combined probability mass of the former outweighs the latter or vice versa.
If someone has an argument for this I would be interested in reading it.
I found some relevant prior work and tagged it at https://www.lesswrong.com/tag/successor-alignment. I found the top few comments on https://www.lesswrong.com/posts/axKWaxjc2CHH5gGyN/ai-will-not-want-to-self-improve#comments and https://www.lesswrong.com/posts/wZAa9fHZfR6zxtdNx/agi-systems-and-humans-will-both-need-to-solve-the-alignment#comments helpful.
edit: another effect to keep in mind is that capabilities research may be harder to sandbag on because it has clearer metrics.