The Sharp Right Turn: sudden deceptive alignment as a convergent goal
Sharp right turn: after reaching some level of capability, all AIs will suddenly become very nice and appear aligned, because they will understand that looking unaligned is punished and bad for their (nefarious) end goals. World takeover will happen only later, when one of the AIs is ready to take control of the entire future light cone. Thus, we will enjoy a period of apparently aligned AIs until the end.
The idea is known as ‘deceptive alignment’: “This means that the mesa-optimizer will be instrumentally incentivized to act as if it is optimizing the base objective function, even if its actual mesa-objective is something else entirely. We will refer to this hypothetical phenomenon as deceptive alignment.”
But here I want to underline that the sharp right turn is 1) sudden, 2) observable, and 3) a convergent instrumental goal for advanced AIs, both aligned and non-aligned.
The “sharp left turn” was previously defined as a sudden change in internal properties and the appearance of misalignment: “Capabilities generalize across many domains while the alignment properties that held at earlier stages fail to generalize to the new domains”.
The sharp right turn, by contrast, is about a change in the AI’s behaviour and, more importantly, in the observer’s interpretation of this behaviour. So both turns could happen simultaneously.
The sharp right turn is actually a bad sign. It means that the AI is ready to carry out effective long-term strategies and deception. ‘Sharp’ here means that the AI will suddenly grok what we want from it.
But if the AI knows that a sudden grokking of alignment is itself suspicious, it may pretend to be slightly misaligned to cover its sharp right turn.
The sharp right turn may be prized by some, as it will look as if alignment is solved. No bad words, no misunderstanding, no cheating. Funding alignment research will be more difficult after that. Examples of misalignment will be difficult to find. The sharp right turn in its pure form will look like magic, but real alignment should come from our understanding of how exactly we get there.
Misaligned Sydney was at least honest about what she thought.