Interesting research, indeed! I'm also curious about more realistic scenarios with larger models, because right now this research shows a simplified 'spillover': not mentioning a specific word or concept ('reading a file'), which is far less concerning than more complex behaviors like alignment faking or scheming. This might happen simply because the model is rather small, so it could be evidence of something like catastrophic forgetting rather than a newly learned behavior. That also relates to steganographic CoTs, which are quite complex to learn and employ; they are not like just omitting a specific word.