The AGI gave us the plans to make the nanofactories because it wants us to die. It will not give us a workable plan to make an aligned artificial intelligence that could compete with it and shut it off, because that would lead to an outcome other than our deaths. It will make sure any such plan is either ineffective or, for some reason, will not in practice lead to aligned AGI, because the nanomachines will get to us first.
Can we build a non-agential AGI to solve alignment? Or just a very smart task-specific AI?
If you come up with a way to build an AI that hasn’t crossed the Rubicon of dangerous generality, but can solve alignment, that would be very helpful. It doesn’t seem likely to be possible without already knowing how to solve alignment.
Why is this?
You could probably train a non-dangerous ML model that has superhuman theorem-proving abilities, but we don’t know how to formalize the alignment problem in a way that we can feed it into a theorem prover.
A model that can “solve alignment” for us would be a consequentialist agent explicitly modeling humans, and dangerous by default.
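For contrast, here is the kind of statement a theorem prover can actually consume: every symbol is fully defined and the proof is machine-checkable. This is a toy Lean 4 example, purely illustrative; the point is that nothing remotely like it exists for “this system is aligned.”

```lean
-- A fully formalized claim: addition on the natural numbers is commutative.
-- A prover can verify this because the statement is precise down to the axioms.
-- We have no comparable formal statement of alignment to hand to such a system.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```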
We might be able to formalize some pieces of the alignment problem, as MIRI tried with corrigibility, and Vanessa Kosoy has some more formal work as well. Do you think there are no useful pieces to formalize? Or that all the pieces we try to formalize won’t together be enough even if we had solutions to them?
Also, even if it explicitly models humans, would it need to be consequentialist? Could we just have a powerful modeller trained to minimize prediction loss or whatever? The search space may be huge, but having a powerful modeller still seems plausibly useful. We could also filter options, possibly with a separate AI, not necessarily an AGI.
I don’t see why not, but there is probably an unfalsifiable reason why this is impossible, and I am looking forward to reading it.
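To make the “powerful modeller trained to minimize prediction loss” idea above concrete, here is a deliberately tiny sketch of that training objective. Everything in it (architecture, random stand-in data, hyperparameters) is an illustrative assumption, not a claim about how a genuinely powerful modeller would be built or whether one would be safe.

```python
# Toy next-token predictor optimized with cross-entropy (prediction) loss.
# The model, data, and sizes are placeholders chosen only to keep it runnable.
import torch
import torch.nn as nn

vocab, dim, seq_len, batch = 100, 32, 16, 8

class TinyPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)  # logits over the next token at each position

model = TinyPredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    tokens = torch.randint(0, vocab, (batch, seq_len))  # stand-in for real data
    logits = model(tokens[:, :-1])                      # predict token t+1 from tokens up to t
    loss = loss_fn(logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```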