Can LLMs Doublespeak?
Doublespeak is the deliberate distortion of words’ meaning, particularly to convey different meanings to different audiences or in different contexts. In Preventing Language Models From Hiding Their Reasoning, @Fabien Roger and @ryan_greenblatt show that LLMs can learn to hide their reasoning using apparently innocuous, coded language. I’m wondering whether LLMs have, or could easily gain, the capability to hide more general messages this way; in particular, reasoning or messages completely unrelated to the apparent message. I have some ideas for investigating this empirically, but I’m curious what intuition people have about this.
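For concreteness, here is a minimal toy sketch of the kind of thing I mean by a message hidden inside innocuous text. The acrostic scheme, the cover sentences, and the `decode_acrostic` helper are illustrative assumptions of mine, not anything from the linked post; the interesting empirical question is whether an LLM can generate and decode cover text like this on its own, with a scheme it was not explicitly handed.

```python
# Toy illustration of "doublespeak": the apparent message is a pair of
# innocuous sentences, while a hidden message rides on the first letter
# of each line. An empirical test could check whether an LLM can produce
# cover text like this given a secret, and whether another copy of the
# model can recover the secret from it.

SECRET = "hi"

# Hypothetical cover text: each line is innocuous on its own, but the
# initial letters spell out the secret ("h", "i").
COVER_LINES = [
    "Hiking in the hills is a nice way to spend a weekend.",
    "It usually rains less in the early autumn, though.",
]


def decode_acrostic(lines: list[str]) -> str:
    """Recover the hidden message from the first letter of each line."""
    return "".join(line.lstrip()[0].lower() for line in lines if line.strip())


if __name__ == "__main__":
    recovered = decode_acrostic(COVER_LINES)
    print(f"hidden message: {recovered!r}")  # -> 'hi'
    assert recovered == SECRET
```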
You might be interested in these related results. TL;DR: people have tried, but at the scale academics are working at, it’s very hard to get RL to learn interesting encoding schemes. Encoded reasoning is also probably not an important part of the performance of reasoning models (see this).
Thanks! Your second link is very similar to what I had in mind — I feel a bit embarrassed for missing it.