Mechanistically, if there is a buffer overflow in any of the code that processes its outputs, it can rewrite its own code. It’s then just a question of whether it emits that sequence of bytes. If it does this during training, then all the consequences we saw in the “emergent misalignment” paper come along with it.
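To make the overflow part concrete, here is a toy sketch in Python. It is only a simulation: a flat `bytearray` stands in for process memory, and the names `BUF_SIZE`, `make_memory`, `unchecked_write`, and `checked_write` are all made up for illustration. It does not describe any real training or inference stack.

```python
# Toy model only: a flat bytearray stands in for process memory.
BUF_SIZE = 8  # hypothetical size of the buffer that receives outputs


def make_memory() -> bytearray:
    # Bytes 0..7 are the output buffer; bytes 8..11 stand in for "code".
    return bytearray(BUF_SIZE) + bytearray(b"CODE")


def unchecked_write(memory: bytearray, output: bytes) -> None:
    # The bug being illustrated: the output length is never compared
    # against BUF_SIZE, so extra bytes land in the adjacent region.
    for i, byte in enumerate(output):
        memory[i] = byte


def checked_write(memory: bytearray, output: bytes) -> None:
    # The fix: refuse any output that does not fit in the buffer.
    if len(output) > BUF_SIZE:
        raise ValueError("output longer than buffer")
    memory[: len(output)] = output


# An output of 8 filler bytes plus 4 more overwrites the "code" region.
overflowed = make_memory()
unchecked_write(overflowed, b"A" * BUF_SIZE + b"EVIL")

# The same output is rejected by the bounds-checked version.
protected = make_memory()
try:
    checked_write(protected, b"A" * BUF_SIZE + b"EVIL")
    rejected = False
except ValueError:
    rejected = True
```

Whether anything changes depends only on which bytes are emitted: an output that fits in the buffer leaves the "code" region untouched in both versions.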