4.2: Human veto is uncompetitive
This particular scenario looks uncomfortably probable. Even assuming humanity succeeds at aligning its AIs, given sufficient variance in those alignments and a sufficiently multipolar setting, resisting such disempowerment pressures seems quite difficult. A better outcome I can imagine is that once one AI wins the competition, it hands some decision-making power back to humans. It would be useful to determine where the equilibrium boundary lies, in terms of the number of agents and the variance in their alignments, between stable human influence and runaway disempowerment.
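To make the equilibrium-boundary question concrete, here is a minimal toy simulation sketch. Everything in it is an assumption of mine, not something from the sources: alignment scores are drawn from a Gaussian with the given variance, less-aligned agents accrue influence slightly faster, and human veto power decays in proportion to the influence share held by misaligned agents. The function names (`human_influence_persists`, `equilibrium_map`), the threshold 0.5, and all rate constants are arbitrary illustrative choices.

```python
import random

def human_influence_persists(n_agents, alignment_var, steps=200, seed=0):
    """Toy model (assumed dynamics, purely illustrative): does human veto
    power survive competition among n_agents AIs whose alignment to human
    interests is drawn with the given variance?"""
    rng = random.Random(seed)
    # Alignment in [0, 1]; 1 = fully deferential to the human veto.
    # Mean 0.7 is an arbitrary "mostly aligned" assumption.
    alignments = [min(1.0, max(0.0, rng.gauss(0.7, alignment_var ** 0.5)))
                  for _ in range(n_agents)]
    influence = [1.0 / n_agents] * n_agents
    human_power = 1.0
    for _ in range(steps):
        # Assumed: less-aligned agents grab influence slightly faster.
        growth = [1.0 + 0.05 * (1.0 - a) for a in alignments]
        influence = [i * g for i, g in zip(influence, growth)]
        total = sum(influence)
        influence = [i / total for i in influence]
        # Human veto erodes in proportion to misaligned influence share.
        misaligned_share = sum(i for i, a in zip(influence, alignments)
                               if a < 0.5)
        human_power *= 1.0 - 0.1 * misaligned_share
    return human_power > 0.5

def equilibrium_map(agent_counts, variances):
    """Sweep (n_agents, variance) and record where human influence survives,
    tracing the boundary between stability and runaway disempowerment."""
    return {(n, v): human_influence_persists(n, v)
            for n in agent_counts for v in variances}
```

Under these toy dynamics, zero variance keeps every agent above the misalignment threshold and human influence is stable, while high variance in a crowded field lets misaligned agents compound their influence and erode the veto; the interesting object is the boundary in the (number of agents, variance) plane where that flip happens.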
This seems to be a case of attempted self-exfiltration. I cannot judge the validity of what the paper describes well, but on its face it suggests that either the sandbox was very poorly constructed or the model was fairly capable. I do not like that it apparently remains unknown whether the model escaped successfully.
Sources:
https://x.com/AlexanderLong/status/2030022884979028435
https://arxiv.org/abs/2512.24873