Our model is incredibly capable except when you put it outside of the capability training distribution, at which point it does some weird and stupid things that a hypothetical equivalently in-distribution-capable human wouldn’t do.
Our model is incredibly aligned except when you put it outside of the alignment training distribution (roughly a strict subset of the capability training distribution), at which point it starts being egregiously misaligned in ways that a hypothetical equivalently in-distribution-aligned human wouldn’t.
(Which means that capabilities still generalize further than alignment, modulo mitigations like having the model notice when it's alignment-OOD and shut down in response.)