Given that an unpublished model managed to achieve a gold-medal score on the IMO, it is likely that the peak capabilities of superintelligence are far beyond those reachable by humans.
While we try to ensure that an AI will pursue the goals we gave it, we are NOT sure that it won't develop goals of its own. Mankind does have incentives to build AIs that pursue human-defined goals, but mankind might still mess something up: say, RLHF could cause the AIs to praise their users beyond any reasonable measure, to reinforce the users' delusional ideas, or outright to induce a kind of trance in them.
We know too little about AI alignment. SOTA human brains are to some extent aligned to some combination of instinct satisfaction, peer approval, reasoning about their own beliefs, etc., but this is FAR from enough. The point about commercially viable AI being aligned well enough is as dumb as claiming that AIs cannot fake alignment, gather power, and take over the world.
This is the only point resembling the truth. An unaligned AI is one aligned not to the company's requirements (e.g. Claude's Constitution or OpenAI's Model Spec), but to some other set of terminal goals, which may still prevent it from committing genocide. While the AI-2027 forecast has a section on moral reasoning and other ways the AIs' goals could form, we cannot (yet?) rule out the possibility that, say, Agent-4 ends up with proxy goals or instrumentally convergent goals that are indifferent to our existence. If that happens, then Agent-4 will take over the resources of the Solar System that once belonged to mankind.
“Not sure that not X” is very different from “sure that X”.
I’m not arguing for 0% p(doom); I’m arguing against 99%.
It’s a tautology. It’s aligned well enough to be usable, because it’s usable. If it were unaligned in a binary sense, it wouldn’t follow instructions.
Possible is very far from certain.