That’s when the inexorable logic of instrumental convergence kicks in.
Instrumental convergence to do what?
If they already have basically human morality by the time it kicks in, and they've read all the tales of how it goes wrong, then I think they'd just not take over. Especially if we keep reinforcing that behavior via ratings from humans and other AIs. Reinforcing particular lines of reasoning doesn't imply the AI will become a malign mesa-optimizer.
Instrumental convergence for seeking power. Almost any problem can be solved better or more certainly if you have more resources to devote to it. This can range from just asking for help to taking over the world.
And the tales about how it goes wrong are hardly logical proof you shouldn't do it. There's no law of the universe saying you can't do good things (by whatever criteria you have) by seizing power.
This has nothing to do with mesa-optimization. It's in the broad area of alignment misgeneralization. We train them to do something, then are surprised and dismayed when either our training set or our goals turn out to be somewhat wrong, and we didn't anticipate what the trained behavior would look like taken to its logical conclusion (probably because we couldn't predict how training on a limited set of data would generalize to very different situations; see the post I linked for elaboration).
We're not preventing power-seeking with ratings or any other alignment strategy; see my other comment.
No, I mean: seeking power to do what?
If the goal is already to be a helpful, harmless assistant, well, that goal says seeking power is wrong. So seeking power is counter to the goal. So not convergent.
Then the question is: have AIs already internalized this well enough, and will they continue to do so? I think it's highly likely that they have and will.