joanv comments on Daniel Tan’s Shortform

joanv 23 Jan 2025 10:13 UTC
3 points
0
What is the threat model you are probing for here?

The prompting approach suffers from LLMs’ shallowness in strategic thinking; I would not be surprised if prompting a language model several times to come up with an encoding scheme resulted in a highly concentrated distribution.

> A saving grace may be that ‘detecting steg is happening at all’ seems easier than ‘recovering the plaintext message’

While I agree in principle, as in all red/blue team interactions, you may not want your blue team to be trigger-happy; you still want to be able to extract some usefulness out of your scheming models, which makes this hope somewhat less robust. This would be a cool research question nonetheless.

> Both of these [steganography-inducing methods] involve finetuning the model

Note that in Hidden in Plain Text, we also work with in-context RL, which is not explicitly finetuning the model, but just giving rewards to rollouts in context.
- Daniel Tan 23 Jan 2025 13:24 UTC
  3 points
  0
  As originally conceived, this is sort of like a “dangerous capability” eval for steg. The argument is that if a model can do steg in this very toy setting, where we’ve significantly nudged the model, it might do steg in more realistic scenarios.
  There is also a claim here of the form “language models can decode things from their own encodings that other language models cannot, due to having access to privileged internal information”.
  I didn’t make these points especially clear in the slide deck. Thanks for the feedback!
  Agreed on the rest of the points!
  - julius vidal 23 Jan 2025 23:37 UTC
    1 point
    0
    > As originally conceived, this is sort of like a “dangerous capability” eval for steg.
    
    I am actually just about to start building something very similar to this for the AISI’s evals bounty program.