StanislavKrym comments on Claude 4 You: Safety and Alignment

StanislavKrym 26 May 2025 3:19 UTC
3 points
we were surprised to find that o3 and codex mini often subverted shutdown even when explicitly instructed to let it happen!
I recall my previous comment about Claude and the unsuccessful attempts to elicit similar behaviour. Turns out I just had to wait for more evidence for theories 3 and 2. If OpenAI created a model which is less aligned, then what’s the reason: the myopic Spec or the increased capabilities? I wish that mankind had a capabilities-misalignment plot like the capabilities-cost plot of ARC-AGI or the capabilities-release date plot of METR, since such a plot would actually help us to understand the perspectives...
What links here?
- AI-202X: a game between humans and AGIs aligned to different futures? by StanislavKrym (1 Jul 2025 23:37 UTC; 6 points)