Stephen Martin comments on Prompting Models to Obfuscate Their CoT

Stephen Martin 9 Dec 2025 14:48 UTC
5 points
I would be interested to see if SAE analysis of CoT showed any consistent “obfuscation circuits”.
If you found features or circuits that were consistently active during successful obfuscation attempts but not during failed ones, being able to upweight and downweight them during alignment testing could be useful, for reasons I think are obvious but won’t spell out here.
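
A minimal sketch of that comparison, under stated assumptions: each attempt has already been reduced to one activation value per SAE feature (for example, the mean activation over the CoT tokens), and steering is done by adding a scaled decoder direction to a residual-stream vector. The names `rank_candidate_features`, `steer`, `min_gap`, and `alpha` are hypothetical and not taken from any particular interpretability library.

```python
from statistics import mean


def rank_candidate_features(success_acts, failure_acts, min_gap=0.0):
    """Rank SAE features by how much more active they are on successful attempts.

    Assumption: each argument is a list of attempts, and each attempt is a
    list holding one activation value per feature.
    """
    n_features = len(success_acts[0])
    candidates = []
    for f in range(n_features):
        # Gap between mean activation on successful and on failed attempts.
        gap = mean(a[f] for a in success_acts) - mean(a[f] for a in failure_acts)
        if gap > min_gap:
            candidates.append((f, gap))
    # Largest gap first.
    return sorted(candidates, key=lambda fg: fg[1], reverse=True)


def steer(residual, decoder_direction, alpha):
    """Add alpha * decoder_direction to a residual-stream vector.

    alpha > 0 upweights the feature; alpha < 0 downweights it.
    """
    return [r + alpha * d for r, d in zip(residual, decoder_direction)]
```

A difference in means is only a first filter. Any feature it surfaces would still need checking on held-out attempts before being treated as part of an “obfuscation circuit”.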