I would be interested to see whether SAE analysis of the model's activations during CoT shows any consistent “obfuscation circuits”.
If there were features or circuits that were consistently active during successful obfuscation attempts but not during failed ones, being able to upweight and downweight them during alignment testing could be useful, for reasons I think are obvious but won’t write out here.
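To make the idea concrete, here is a rough sketch of what I have in mind, not a tested method. It assumes you already have SAE feature activations averaged per CoT transcript, with each transcript labelled as a successful or failed obfuscation attempt. All names (`differential_features`, `steer`, the array shapes) are hypothetical, and the data is synthetic, with features 3 and 7 planted as the "obfuscation" features.

```python
import numpy as np

def differential_features(success_acts, failure_acts, top_k):
    """Rank SAE features by mean activation gap (successful minus failed attempts)."""
    gap = success_acts.mean(axis=0) - failure_acts.mean(axis=0)
    top_ids = np.argsort(-gap)[:top_k]  # largest positive gap first
    return top_ids, gap

def steer(residual, decoder, feature_ids, scale):
    """Shift a residual-stream vector along the summed decoder directions.

    scale > 0 upweights the chosen features, scale < 0 downweights them.
    """
    direction = decoder[feature_ids].sum(axis=0)
    return residual + scale * direction

# Synthetic stand-in data: 50 transcripts per group, 16 SAE features, model width 8.
rng = np.random.default_rng(0)
success_acts = rng.random((50, 16)) * 0.1
failure_acts = rng.random((50, 16)) * 0.1
success_acts[:, [3, 7]] += 1.0  # planted signal, an assumption for illustration only

top_ids, gap = differential_features(success_acts, failure_acts, top_k=2)
print("candidate features:", sorted(top_ids.tolist()))

# Hypothetical SAE decoder matrix: one row (direction in model space) per feature.
decoder = rng.normal(size=(16, 8))
residual = np.zeros(8)
steered = steer(residual, decoder, top_ids, scale=-1.0)  # downweight during testing
```

A mean-difference ranking is the crudest possible choice here. A real analysis would want a held-out set and some check that the features are causally involved rather than merely correlated with success.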