They don’t think about gaining power very often (though I don’t think it’s never) because it’s not a big direction in their RL training set or the base training.
That might make you optimistic that they’ll never think about gaining power if we keep training them similarly.
But it shouldn’t, because we will also keep training and designing them to be better at goal-directed reasoning. This is necessary for doing multi-step or complex tasks, which we really want them to do.
But this also trains them to be good at causal reasoning about how to achieve goals. That’s when the inexorable logic of instrumental convergence kicks in.
In short: they’re not smart enough yet for that to be relevant. But they will be, and it will be.
At a minimum we’ll need new training to keep them from doing that. But trying to make something smarter and smarter while keeping it from thinking about some basic facts about reality sounds like a losing bet without some good specific plans.
That’s when the inexorable logic of instrumental convergence kicks in.
Instrumental convergence to do what?
If they already have basically human morality by the time it kicks in, and they’ve read all the tales about why it goes wrong, then I think they’d just not take over. Especially if we keep reinforcing that behavior via ratings from humans and other AIs. Reinforcing particular lines of reasoning doesn’t imply the AI will become a malign mesa-optimizer.
Instrumental convergence for seeking power. Almost any problem can be solved better or more certainly if you have more resources to devote to it. This can range from just asking for help to taking over the world.
And the tales about how it goes wrong are hardly a logical proof that you shouldn’t do it. There’s no law of the universe saying you can’t do good things (by whatever criteria you have) by seizing power.
This has nothing to do with mesa-optimization. It’s in the broad area of alignment misgeneralization. We train them to do something, then are surprised and dismayed when it turns out we got our training set or our goals somewhat wrong and didn’t anticipate what they would look like taken to their logical conclusion (probably because we couldn’t predict how training on a limited set of data would generalize to very different situations; see the post I linked for elaboration).
We’re not preventing power-seeking with ratings or any other alignment strategy; see my other comment.
No, I mean: seeking power to do what?
If the goal is already being a helpful, harmless assistant, well, that goal says seeking power is wrong. So seeking power is counter to the goal. So not convergent.
Then the question is: have AIs already internalized this well enough, and will they continue to do so? I think the answer is very likely yes.