Hm, maybe there are two reasons why human-level AIs are safe:
1. A bunch of our alignment techniques work better when the overseer can understand what the AIs are doing (given enough time). This means that human-level AIs are actually aligned.
2. Even if the human-level AIs misbehave, they’re just human-level, so they can’t take over the world.
Under model (1), it’s totally fine that self-improvement is an option, because we’ll be able to train our AIs not to do it.
Under model (2), there are definitely some concerning scenarios, where the AIs e.g. escape onto the internet, use their code to acquire resources, duplicate themselves many times, and set up a competing AI development project. That project might have an advantage insofar as it cares less about paying alignment taxes.
I unconfidently suspect that human-level AIs won’t have a much easier time with the alignment problem than we expect to have.
Agree it’s not clear. Some reasons why they might:
If training environments’ inductive biases point firmly toward some specific (non-human) values, then maybe the misaligned AIs can just train bigger and better AI systems in environments similar to the ones they were trained in, and hope that those AIs end up with similar values.
Maybe values can differ a bit, and cosmopolitanism or decision theory can carry the rest of the way. Similarly, Paul says he’d be pretty happy with intelligent life that came from a similar distribution to the one our civilization came from.
Humans might need to use a lot of human labor to oversee all their human-level AIs. The HLAIs can skip this, insofar as they can trust copies of themselves. And when training even smarter AI, it’s a big benefit to have cheap, copyable, trustworthy human-level overseers.
Maybe you can somehow gradually increase the capabilities of your HLAIs in a way that preserves their values.
(You have a lot of high-quality labor at this point, which really helps with interpretability and with making improvements through means other than gradient descent.)