Interesting results from neuralese training experiments: when running safety tests, I noticed that the model often defended itself instead of admitting that it had been deceptive.
When given o3's CoT, it correctly identified its own thinking as deceptive but still defended itself anyway:
When confronted by Opus 4.6, it adjusted its response:
This is a good demonstration of generalization in models, but it would be more interesting to see an example of training dynamics that lead to emergent consciousness claims and self-preservation. Consider a scenario where a model trained only to solve olympiad problems suddenly developed this type of persona.