What does that mean? 10% smarter than a human, or a hundred times smarter?
It will be agent-like, in the sense of having long-term goals it tries to pursue.
What does that mean? It’s own goals, or goals we give it? The two have very different implications.
There’s every incentive to make AI with “goals,
There’s every incentive to make AI”s follow our goals.
The core problem is that it’s pretty hard to get something to follow your will if it has goals
If it has its own own goals that are nothing to do with yours … But why would it, when the incentives are pointing in the other direction?
This is because of something called instrumental convergence
Instrumental Convergence (https://aisafety.info/questions/897I/What-is-instrumental-convergence)
assumes an agent with terminal goals, the things it really wants to do , and instrumental goals , sub goals which lead to terminal goals. (Of course, not every agent has to have structure). Instrumental Convergence suggests that even if an agentive AI has a seemingly harmless goal, it’s instrumental sub-goals can be dangerous. Just as money is widely useful to humans , computational resources are widely useful to AIs. Even if an AI is doing something superficially harmless like solving maths problems, more resources would be useful, so eventually the AI will compete with humans over resources, such as the energy needed to power data centres.
There a solution. If it is at all possible to instill goals, to align AI, the Instrumental Convergence problem can be countered by instilling terminal goals that are the exact opposite … remember,
instrumental goals are always subservient to terminal ones. So, if we are worried about a powerful AI going on a resource acquisition spree , we can give it a terminal goal to be economical in the use of resources.
We won’t be able to align it, in the sense of getting its goals to be what we want them to be.
Exactly or partially?
It’s notable that doomers see AI alignment as a binary, either perfect and final, or non existent. But no other form of safety works like that. No one talks of “solving” car safety
for once and all like a maths problem: instead it’s assumed to be an engineering problem, an issue of making steady , incremental progress.
Any commercially viable AI is aligned well enough , or it wouldn’t be commercially viable, so we have partially solved alignment.
An unaligned agentic AI will kill everyone/do something similarly bad
Given that an unpublished model managed to receive a gold medal on the IMO, it is likely that the peak levels of superintelligence are far beyond those reachable by humans.
While we try to ensure that it will pursue the goals that we gave it, we are NOT sure that it won’t develop its own goals. While mankind does have incentives to make the AIs who pursue human-defined goals, mankind might mess something up and, say, have RLHF cause the AIs to praise their users beyond any reasonable measure and to reinforce the users’ delirious ideas. Or outright to induce a kind of trance in the users.
We know too little about AI alignment. SOTA human brains are to some extent aligned to some combination of instinct satisfaction, peer approval, reasoning about their beliefs, etc. , but this is FAR from enough. The point about commercially viable AI being aligned well enough is as dumb as claiming that the AIs cannot fake alignment, gather power and take over the world.
This is the only point resembling the truth. An unaligned AI is the AI aligned not to the company’s requirements (e.g. Claude’s Constitution or OpenAI’s Model Spec), but to another set of terminal goals which may prevent it from commiting genocide. While the AI-2027 forecast has a section about moral reasoning and other ways to form the AIs’ goals, we cannot (yet?) rule out the possibility that, say, Agent-4 gets proxies or instrumentally convergent goals to which our existence will be indifferent. If that happens, then Agent-4 will take over the resources of the Solar System that once belonged to mankind.
What does that mean? 10% smarter than a human, or a hundred times smarter?
What does that mean? It’s own goals, or goals we give it? The two have very different implications.
There’s every incentive to make AI”s follow our goals.
If it has its own own goals that are nothing to do with yours … But why would it, when the incentives are pointing in the other direction?
Instrumental Convergence (https://aisafety.info/questions/897I/What-is-instrumental-convergence) assumes an agent with terminal goals, the things it really wants to do , and instrumental goals , sub goals which lead to terminal goals. (Of course, not every agent has to have structure). Instrumental Convergence suggests that even if an agentive AI has a seemingly harmless goal, it’s instrumental sub-goals can be dangerous. Just as money is widely useful to humans , computational resources are widely useful to AIs. Even if an AI is doing something superficially harmless like solving maths problems, more resources would be useful, so eventually the AI will compete with humans over resources, such as the energy needed to power data centres.
There a solution. If it is at all possible to instill goals, to align AI, the Instrumental Convergence problem can be countered by instilling terminal goals that are the exact opposite … remember, instrumental goals are always subservient to terminal ones. So, if we are worried about a powerful AI going on a resource acquisition spree , we can give it a terminal goal to be economical in the use of resources.
Exactly or partially?
It’s notable that doomers see AI alignment as a binary, either perfect and final, or non existent. But no other form of safety works like that. No one talks of “solving” car safety for once and all like a maths problem: instead it’s assumed to be an engineering problem, an issue of making steady , incremental progress.
Any commercially viable AI is aligned well enough , or it wouldn’t be commercially viable, so we have partially solved alignment.
Not obvious.
Given that an unpublished model managed to receive a gold medal on the IMO, it is likely that the peak levels of superintelligence are far beyond those reachable by humans.
While we try to ensure that it will pursue the goals that we gave it, we are NOT sure that it won’t develop its own goals. While mankind does have incentives to make the AIs who pursue human-defined goals, mankind might mess something up and, say, have RLHF cause the AIs to praise their users beyond any reasonable measure and to reinforce the users’ delirious ideas. Or outright to induce a kind of trance in the users.
We know too little about AI alignment. SOTA human brains are to some extent aligned to some combination of instinct satisfaction, peer approval, reasoning about their beliefs, etc. , but this is FAR from enough. The point about commercially viable AI being aligned well enough is as dumb as claiming that the AIs cannot fake alignment, gather power and take over the world.
This is the only point resembling the truth. An unaligned AI is the AI aligned not to the company’s requirements (e.g. Claude’s Constitution or OpenAI’s Model Spec), but to another set of terminal goals which may prevent it from commiting genocide. While the AI-2027 forecast has a section about moral reasoning and other ways to form the AIs’ goals, we cannot (yet?) rule out the possibility that, say, Agent-4 gets proxies or instrumentally convergent goals to which our existence will be indifferent. If that happens, then Agent-4 will take over the resources of the Solar System that once belonged to mankind.
“Not sure that not X” is very different to “sure that X”.
I’m not arguing for 0% p(doom) , I’m arguing against 99%.
It’s a tautology. It’s aligned well enough to be usable, because it’s usable. If it were unaligned in a binary sense, it wouldn’t follow instructions.
Possible is very far from certain.