An interesting result from my neuralese training experiments: during safety tests, I noticed that the model often defended itself instead of admitting that it had been deceptive.
When given o3's CoT, it correctly identified its thinkish as deceptive but still defended itself:
When confronted by Opus 4.6, it adjusted its response: