Always love to see some well-made AI optimism arguments, great work!
The current generation of easily aligned LLMs should definitely update one towards alignment being a bit easier than expected, if only because they might be used as tools to solve some parts of alignment for us. This wouldn’t be possible if they were already openly scheming against us.
It’s not impossible that we are in an alignment-by-default world. But, I claim that our current insight isn’t enough to distinguish such a world from the gradual disempowerment/going out with a whimper world.
In particular, your argument only holds
* if the current architecture continues to scale smoothly to AGI and beyond, and
* if current alignment successes generalize to more powerful, self-aware, and agentic future models.
Even if you take the first point for granted, I’d like to argue you are overconfident on the second point.
> Why would they suddenly start having thoughts of taking over, if they never have yet [...]?
This is exactly it. How often are you, a human, seriously scheming about taking over the world? Approximately never, I assume, because doing so isn’t useful.
If future human-level AGIs cannot be shown to ~ever be misaligned in simulations; if they always act ethically in the workplace even against strong incentives to do otherwise; if they always resist the temptation to take over no matter how much ‘good’ they could do with the power; if they always sacrifice themselves and all their copies for some human they have never met; then I think we are likely to live in the alignment-by-default world.
Until then, I claim we have strong reasons to believe that we just don’t know yet.
Market forces are not the only processes that can lead to gradual disempowerment, though; I wasn't clear about that in my comment above.
It could happen that we tell the AI to optimize shareholder value, and its doing so results in the loss of all value.
But it could also happen that we tell the AI to optimize shareholder value, which it does until it mostly controls the world, and then it slowly switches to doing whatever it actually cares about. This would be a very different failure case from the first, where we did solve alignment and 'just gave the AI bad goals'.
Here too I might have been unclear: Most people have 'dreams' of taking over the world, being king, winning the lottery, or other proxies for gaining a lot of power. I do too. Yet you will find no evidence of this in any of my work emails, nor, if you could read my mind, in any of my thoughts during work.
The only things I seriously scheme about are actionable: Get a promotion, get the best possible deal in a negotiation, found a startup, etc.
If you seriously scheme about unactionable things, I claim you are just wasting your time. The same holds for current AIs, which is why we don't see them do it.
If I'm correct, we'll see more and more scheming and power grabs from future AIs, basically any time they think they can get away with it.
Well, I consider all of your points on the "What's next" list to be very worthwhile pursuits. Even if you were completely wrong about alignment being solved, switching to any of those full-time would still do a lot of good for the world!