Can LLMs Doublespeak?
Doublespeak is the deliberate distortion of words’ meaning, particularly to convey different meanings to different audiences or in different contexts. In Preventing Language Models From Hiding Their Reasoning, @Fabien Roger and @ryan_greenblatt show that LLMs can learn to hide their reasoning using apparently innocuous, coded language. I’m wondering whether LLMs have, or could easily gain, the capability to hide more general messages this way; in particular, reasoning or messages completely unrelated to the apparent message. I have some ideas for investigating this empirically, but I’m curious what intuition people have about this.
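For concreteness, here is a minimal toy sketch of the kind of thing I mean by a message hidden inside innocuous text. The acrostic scheme, the cover sentences, and the `decode_acrostic` helper are illustrative assumptions of mine, not anything from the linked post; the interesting empirical question is whether an LLM can generate and decode cover text like this on its own, with a scheme it was not explicitly handed.

```python
# Toy illustration of "doublespeak": the apparent message is a pair of
# innocuous sentences, while a hidden message rides on the first letter
# of each line. An empirical test could check whether an LLM can produce
# cover text like this given a secret, and whether another copy of the
# model can recover the secret from it.

SECRET = "hi"

# Hypothetical cover text: each line is innocuous on its own, but the
# initial letters spell out the secret ("h", "i").
COVER_LINES = [
    "Hiking in the hills is a nice way to spend a weekend.",
    "It usually rains less in the early autumn, though.",
]


def decode_acrostic(lines: list[str]) -> str:
    """Recover the hidden message from the first letter of each line."""
    return "".join(line.lstrip()[0].lower() for line in lines if line.strip())


if __name__ == "__main__":
    recovered = decode_acrostic(COVER_LINES)
    print(f"hidden message: {recovered!r}")  # -> 'hi'
    assert recovered == SECRET
```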
You might be interested in these related results. TL;DR: people have tried, but at the scale academics are working at, it’s very hard to get RL to learn interesting encoding schemes. Encoded reasoning is also probably not an important part of the performance of reasoning models (see this).
Thanks! Your second link is very similar to what I had in mind — I feel a bit embarrassed for missing it.