4.2: Human veto is uncompetitive
This particular scenario looks uncomfortably probable. Even assuming humanity succeeds at aligning its AIs, given sufficient variance in those alignments and a sufficiently multipolar setting, resisting such disempowerment pressures seems quite difficult. A better outcome I can imagine is that once one AI wins the competition, it hands some decision-making power back to humans. It would be useful to determine where the equilibrium boundary lies, in terms of the number of agents and the variance in their alignments, between stable human influence and runaway disempowerment.
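To make the equilibrium-boundary question concrete, here is a minimal toy simulation sketch. Everything in it is an assumption of mine, not something from the sources: alignment scores are drawn from a Gaussian with the given variance, less-aligned agents accrue influence slightly faster, and human veto power decays in proportion to the influence share held by misaligned agents. The function names (`human_influence_persists`, `equilibrium_map`), the threshold 0.5, and all rate constants are arbitrary illustrative choices.

```python
import random

def human_influence_persists(n_agents, alignment_var, steps=200, seed=0):
    """Toy model (assumed dynamics, purely illustrative): does human veto
    power survive competition among n_agents AIs whose alignment to human
    interests is drawn with the given variance?"""
    rng = random.Random(seed)
    # Alignment in [0, 1]; 1 = fully deferential to the human veto.
    # Mean 0.7 is an arbitrary "mostly aligned" assumption.
    alignments = [min(1.0, max(0.0, rng.gauss(0.7, alignment_var ** 0.5)))
                  for _ in range(n_agents)]
    influence = [1.0 / n_agents] * n_agents
    human_power = 1.0
    for _ in range(steps):
        # Assumed: less-aligned agents grab influence slightly faster.
        growth = [1.0 + 0.05 * (1.0 - a) for a in alignments]
        influence = [i * g for i, g in zip(influence, growth)]
        total = sum(influence)
        influence = [i / total for i in influence]
        # Human veto erodes in proportion to misaligned influence share.
        misaligned_share = sum(i for i, a in zip(influence, alignments)
                               if a < 0.5)
        human_power *= 1.0 - 0.1 * misaligned_share
    return human_power > 0.5

def equilibrium_map(agent_counts, variances):
    """Sweep (n_agents, variance) and record where human influence survives,
    tracing the boundary between stability and runaway disempowerment."""
    return {(n, v): human_influence_persists(n, v)
            for n in agent_counts for v in variances}
```

Under these toy dynamics, zero variance keeps every agent above the misalignment threshold and human influence is stable, while high variance in a crowded field lets misaligned agents compound their influence and erode the veto; the interesting object is the boundary in the (number of agents, variance) plane where that flip happens.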
This seems to be a case of attempted self-exfiltration. I cannot judge the validity of what the paper describes well, but on its face it suggests that either the sandbox was very poorly constructed or the model was fairly capable. I do not like that it apparently remains unknown whether the model escaped successfully.
Sources:
https://x.com/AlexanderLong/status/2030022884979028435
https://arxiv.org/abs/2512.24873