Interesting research, indeed! I'm also curious about more realistic scenarios with larger models, because right now this research shows a simplified 'spillover': not mentioning a specific word or concept ('reading a file'), which is far less concerning than more complex behaviors like alignment faking or scheming. This might happen simply because the model is rather small, so it could be evidence of something like catastrophic forgetting rather than a newly learned behavior. That also relates to steganographic CoTs, which are quite complex to learn and employ; they are not like just omitting a specific word.