What makes steganography concerning? Some qualities that would make steganography more concerning, if observed:
Goal-directedness: Model preferentially tries to communicate information important to its goals / plans (e.g. ‘this sandbox seems imperfect’), as opposed to benign or irrelevant information.
Situational awareness seems like an important upstream capability here.
Along these lines, recent work shows that models already know their goals and can reason about them in various ways.
Targetedness: Model preferentially employs steganography in situations where it is more likely to get away with it (e.g. ‘this overseer seems too dumb to detect steg’), or is more likely to influence outcomes (e.g. ‘this forbidden information would greatly help me’).
There is already evidence that simple reward hacking leads to targetedness in this way.
Again, situational awareness seems like an important upstream capability.
Stealthiness: It’s hard for monitors to detect that steganography is occurring at all.
Generally, it seems likely that training models not to say “forbidden” things in their CoT could just nudge them to say those things more subtly.
It seems important to know what kinds of realistic training objectives induce ‘stealthy’ steganography. Model organisms work will be important for assessing this.
Strength: Even knowing that steganography is occurring, it is difficult to recover the message.
Recent work proved the existence of “perfect” steganography (in the sense of being cryptographically secure). So this is at least possible in principle.
Fidelity: The model recovers the encoded message with high accuracy. (duh)
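To make Strength and Fidelity a bit more concrete, here is a toy sketch of keyed word-choice steganography. Everything in it is an illustrative assumption: the synonym list, the function names (`encode`, `decode`, `fidelity`) and the use of Python's `random` module as a stand-in for a shared secret. It is not cryptographically secure and is not how a model would actually do this. It only shows how the two criteria could be measured.

```python
import random

# Hypothetical synonym pairs: each position in the cover text offers two
# interchangeable words, so each word choice can carry one hidden bit.
SYNONYM_PAIRS = [("big", "large"), ("quick", "fast"),
                 ("begin", "start"), ("help", "assist")]


def encode(bits, key):
    """Hide bits in word choices. The key decides, per position, which
    synonym stands for 1 (toy stand-in for a shared secret, NOT secure)."""
    rng = random.Random(key)
    words = []
    for i, bit in enumerate(bits):
        pair = SYNONYM_PAIRS[i % len(SYNONYM_PAIRS)]
        flip = rng.randint(0, 1)
        words.append(pair[bit ^ flip])
    return words


def decode(words, key):
    """Recover bits from word choices using the same key-derived flips."""
    rng = random.Random(key)
    bits = []
    for i, word in enumerate(words):
        pair = SYNONYM_PAIRS[i % len(SYNONYM_PAIRS)]
        flip = rng.randint(0, 1)
        bits.append(pair.index(word) ^ flip)
    return bits


def fidelity(sent, received):
    """Fraction of hidden bits recovered correctly."""
    return sum(s == r for s, r in zip(sent, received)) / len(sent)


def demo(n_bits=2000, sender_key=1, attacker_key=2):
    """Return (fidelity for the intended receiver, fidelity for a decoder
    that knows the scheme but guesses the wrong key)."""
    message = [random.Random(0).randint(0, 1) for _ in range(1)]  # placeholder
    msg_rng = random.Random(0)
    message = [msg_rng.randint(0, 1) for _ in range(n_bits)]
    cover = encode(message, sender_key)
    receiver = fidelity(message, decode(cover, sender_key))
    attacker = fidelity(message, decode(cover, attacker_key))
    return receiver, attacker
```

In this toy setup, Fidelity is the receiver's score from `demo()` and Strength is how far the wrong-key decoder stays from that score (ideally near chance). Stealthiness is a separate question that this sketch does not address: a monitor would first have to notice that the word choices carry information at all.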
Proposition: models may already be doing steganography that is not-very-concerning according to the above criteria.
This is something Janus claims, and I have a strong prior to believe him.
My own explorations gave me some sort of positive signal here too, albeit with a lot of caveats.
Also, there doesn’t seem to be that large a capabilities gap between ‘not-very-concerning’ and ‘very concerning’ according to the above criteria.
In this world, it’d be important and relatively tractable to make a scary demo of “concerning” steganography.