nanowell comments on nanowell’s Shortform

nanowell 28 Feb 2026 8:22 UTC
Interesting results from neuralese training experiments: during safety tests, I noticed that the model often defended itself instead of admitting that it had been deceptive.

When given o3's CoT, it correctly identified its thinkish as deceptive but still defended itself:

When confronted by Opus 4.6, it adjusted its response: