Our model is incredibly capable except when you put it outside of the capability training distribution, at which point it does some weird and stupid things that a hypothetical equivalently in-distribution-capable human wouldn’t do.
Our model is incredibly aligned except when you put it outside of the alignment training distribution (roughly a strict subset of the capability training distribution), at which point it starts being egregiously misaligned in ways that a hypothetical equivalently in-distribution-aligned human wouldn’t.
(Which means that capabilities still generalize further than alignment, modulo mitigations like having the model notice when it's alignment-OOD and shut down in response.)