One concern that I haven’t seen anyone express yet is that, if we can’t discover a theory which assures us that IDA will stay aligned indefinitely as the amplifications iterate, it may become a risky yet extremely tempting piece of technology to deploy, possibly worsening the strategic situation relative to one where only obviously dangerous AIs (like reinforcement learners) can be built. If anyone is creating mathematical models of AI safety and strategy, it would be interesting to see whether this intuition (that the invention of marginally less risky AIs can actually make things worse overall, by increasing the incentives to deploy risky AI) can be formalized in math.
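To gesture at what such a formalization might look like, here is a minimal toy model. Everything in it, including the decision model and all the numbers, is an assumption I’m making up for illustration and not anything from the IDA literature: a decision-maker deploys whichever AI is available with a probability that depends on how safe it *appears*, and expected harm is the product of deployment probability, catastrophe probability conditional on deployment, and the cost of a catastrophe.

```python
# Toy model (purely illustrative; every number and name below is a
# hypothetical assumption, not a claim about any real system).
#
# Intuition to formalize: a "marginally less risky" AI can raise total
# expected harm if its apparent safety increases the probability that
# decision-makers deploy it.

def expected_harm(p_deploy: float, p_catastrophe: float, harm: float) -> float:
    """Expected harm = P(deploy) * P(catastrophe | deployed) * harm."""
    return p_deploy * p_catastrophe * harm

HARM = 1.0  # normalize the cost of a catastrophe to 1

# World A: only an obviously dangerous AI (e.g. a pure RL agent) exists.
# Its danger is legible, so it is rarely deployed.
harm_world_a = expected_harm(p_deploy=0.05, p_catastrophe=0.9, harm=HARM)

# World B: an IDA-like system also exists. It looks plausibly safe, so it
# gets deployed far more readily, but without a theory guaranteeing that
# alignment is preserved across iterated amplification, its true
# catastrophe probability is assumed to remain substantial.
harm_world_b = expected_harm(p_deploy=0.8, p_catastrophe=0.2, harm=HARM)

print(f"Expected harm, risky-AI-only world:     {harm_world_a:.3f}")
print(f"Expected harm, world with unproven IDA: {harm_world_b:.3f}")
# With these made-up numbers, 0.16 > 0.045: the marginally less risky
# option worsens the strategic situation by increasing deployment incentives.
```

The only point of the toy numbers is that apparent safety and actual safety enter the product in different places, so an option that improves the latter a little while improving the former a lot can increase expected harm.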
A counter-argument here might be that this applies to all AI safety work, so why single out this particular approach? I think some approaches, like MIRI’s HRAD, are more obviously unsafe or just infeasible without a strong theoretical framework to build upon, whereas IDA (especially the HBO variant) looks plausibly safe on its face, even if we never solve problems like how to prevent adversarial attacks on the overseer, or how to ensure that incorrigible optimizations do not creep into the system. Some policymakers are bound not to understand those problems, or to see them as esoteric issues not worth worrying about when more obviously important problems are at hand (like how to win a war or avoid being crushed by economic competition).