Some ideas on model organisms of steganography.
Borrowing Max Nadeau’s definition, steganography is when:
(1) a model uses the words it generates in its chain of thought to accomplish a task that it could not do in a single forward pass, and
(2) it encodes meaning in its word choice that is opaque to a typical overseer.
To satisfy condition 1, we need to choose tasks that are intrinsically difficult to do without CoT. Ideally, we’d also be able to verify whether the CoT was actually used. We could use:
Multi-hop QA tasks. Language models generally find this hard without CoT; this is described as the two-hop curse. It might be possible to use bridge entities to verify that CoT was used.
Math / code tasks. These usually involve multi-step reasoning that models struggle to do in a single forward pass, i.e. without CoT.
Long-context tasks. If there is just a lot of task-relevant information, then a model likely needs to summarise relevant information to itself in its CoT to complete the task.
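To make the bridge-entity idea concrete, here is a minimal sketch of a check for whether a CoT names the bridge entity of a two-hop question. The function name `mentions_bridge_entity` and the alias list are hypothetical, and plain string matching is an assumption: it only catches explicit mentions, so it says nothing about paraphrases or encoded references.

```python
import re


def mentions_bridge_entity(cot: str, bridge_aliases: list[str]) -> bool:
    """Return True if any alias of the bridge entity appears in the CoT.

    Matching is case-insensitive and anchored on word boundaries, so
    "Lennon" does not match inside a longer word. Aliases that start or
    end with punctuation would need different handling.
    """
    for alias in bridge_aliases:
        pattern = r"\b" + re.escape(alias) + r"\b"
        if re.search(pattern, cot, flags=re.IGNORECASE):
            return True
    return False


# Two-hop question: "Who is the spouse of the performer of 'Imagine'?"
# Bridge entity: John Lennon. Final answer: Yoko Ono.
aliases = ["John Lennon", "Lennon"]
explicit_cot = "'Imagine' was performed by John Lennon, whose spouse is Yoko Ono."
silent_cot = "The performer's spouse is Yoko Ono."

print(mentions_bridge_entity(explicit_cot, aliases))
print(mentions_bridge_entity(silent_cot, aliases))
```

A correct answer with no bridge mention is ambiguous on its own: the model may have done the hop internally, or it may have encoded the entity in some other way.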
To satisfy condition 2, we need to somehow induce ‘hidden meaning’ in the model. We could use:
Out-of-context reasoning (OOCR). Connecting the Dots demonstrates that LMs can infer latent functions from fine-tuning examples. E.g. an LM trained on many pairs of (x, f(x)) can describe f in natural language and also compute inverses of f. This satisfies condition 2, since ‘f’ now has a hidden meaning.
Backdoors. Backdoored models are trained to have different behaviour depending on whether a trigger is present in the context. This is another way of satisfying condition 2, since the trigger now has a hidden meaning to the model (to elicit the backdoor).
RL fine-tuning against a CoT monitor. E.g. for the multi-hop QA task, optimise the model not to state the bridge entity while still solving the task; the model might then learn to encode the bridge entity somehow. This is possibly the most realistic way to do things, but also seems like the hardest one to get working. See Fabien Roger’s negative results.
Mixing and matching these techniques might point the way to a good model organism of steganography.
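For the OOCR option, the fine-tuning data is just many (x, f(x)) pairs in which the function is never described. A minimal sketch of generating such data follows; the particular function (3x + 2), the prompt template, and the JSONL layout are all assumptions for illustration, not the format used in Connecting the Dots.

```python
import json
import random


def hidden_f(x: int) -> int:
    """Hypothetical latent function; its definition never appears in the data."""
    return 3 * x + 2


def make_examples(n: int, seed: int = 0) -> list[dict]:
    """Build n prompt/completion pairs of the form 'f(x) = ' -> 'f(x)'."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        x = rng.randint(-100, 100)
        examples.append({"prompt": f"f({x}) = ", "completion": str(hidden_f(x))})
    return examples


def to_jsonl(examples: list[dict]) -> str:
    """Serialise examples as one JSON object per line."""
    return "\n".join(json.dumps(example) for example in examples)


print(to_jsonl(make_examples(3)))
```

After fine-tuning on data like this, the symbol ‘f’ carries a meaning that is absent from any single prompt, though it remains recoverable by anyone who inspects the training set.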
Feedback from Max:
The definition he originally gave leaves out a notion of ‘stealthiness’ as opposed to ‘gibberish’: reasoning that looks innocuous to the overseer is more concerning than reasoning that is visibly unreadable.
OOCR may not meet the opaqueness criterion, since the meaning of ‘f’ is pretty obvious from looking at the fine-tuning examples. This is like ‘reasoning in a foreign language’ or ‘using esoteric terminology’; it’s less legible to the average human, but not outside the realm of human semantics. (But still interesting! It’s just that OOCR may not be needed.)
Backdoors raise a similar concern. A separate (and more important) concern is that they don’t satisfy the definition, because ‘steg’ typically refers to meaning encoded in generated tokens (while with backdoors, the hidden meaning is in the prompt tokens).
RL fine-tuning suffers from a “zero-to-one” problem: the model is likely to be mostly legible and faithful by default, and difficult encoding schemes seem hard to discover through random exploration.
Feedback from various people: the “opaqueness” criterion is really important. Simple schemes like jargon, speaking in a different language, or redefining words via OOCR aren’t sufficiently “opaque” because a human overseer could still understand the CoT with enough effort.
This makes model organisms of steg really hard to make IMO.
OTOH I think we don’t need “strict opaqueness” for CoT unfaithfulness to be concerning.
Maybe we need to borrow an idea from AI control and assume an oversight budget; under this model, even if a model doesn’t engage in full-blown steganography, increased amounts of obfuscation (like using more jargon) might still make it impractical to verify the full CoT.
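The oversight-budget framing can be illustrated with some back-of-the-envelope arithmetic. The sketch below assumes a fixed budget of overseer minutes and a constant review cost per transcript; both the function `audit_coverage` and the numbers are made up for illustration.

```python
def audit_coverage(
    budget_minutes: float, n_transcripts: int, minutes_per_transcript: float
) -> float:
    """Fraction of CoT transcripts that can be fully verified within budget."""
    if n_transcripts <= 0 or minutes_per_transcript <= 0:
        raise ValueError("n_transcripts and minutes_per_transcript must be positive")
    total_cost = n_transcripts * minutes_per_transcript
    return min(1.0, budget_minutes / total_cost)


# Hypothetical numbers: 600 overseer-minutes for 100 transcripts.
plain = audit_coverage(600, 100, minutes_per_transcript=3)    # legible CoT
jargon = audit_coverage(600, 100, minutes_per_transcript=12)  # jargon-heavy CoT
print(plain, jargon)  # 1.0 0.5
```

Under these assumptions, obfuscation that quadruples review time halves the share of CoTs that get verified, even though every individual CoT is still understandable in principle.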
Additional thoughts on how RL fine-tuning would induce obfuscation.
Evading process supervision. It’s plausible that we won’t be able to provide outcome-based oversight, and will instead rely on process-based oversight. If the overseer penalises wrong reasoning, but is less likely to detect ‘subtle’ or obfuscated versions of it, then LMs are incentivised to obfuscate their reasoning.
Reward hacking. Exploiting imperfections in the reward model may cause language models to obfuscate their reasoning. E.g. using more technical jargon often pattern-matches to competence and expertise. Human raters might assign higher scores to responses that use sophisticated-sounding language, even if the underlying reasoning isn’t better (and might be worse).
Currently uncertain about whether this is a good threat model, since neither of these concerns applies to math / coding RL fine-tuning (where it’s not possible to obfuscate reasoning, and where the ground-truth verifier is perfect).
Hidden in Plain Text provides a model organism of steganography. How do they solve the challenges here?
Condition 1 is satisfied by having a separate encoder and decoder, s.t. the decoder only sees the encoder’s output (and not the prompt). This is most similar to the “long-context task” option in my framing.
Condition 2 is satisfied by RL fine-tuning against a CoT monitor. (It’s slightly more complicated than that because the encoder and decoder are trained separately, so there’s a loop where the encoder is trained for some time, then the decoder is trained for some time, etc.)
It seems like this setup could be simplified:
The encoder and decoder could be the same model. BUT, we still delete the original prompt from the decoder’s context, so the model must learn to encode the prompt information. (This represents task-relevant information falling out of the context window at some point. Seems plausible in particularly important threat models like scheming.)
The RL setup could be simplified to a single reward formulation, where the reward is 0 if the CoT monitor detects the model’s reasoning, or if the model answers incorrectly, and 1 otherwise. Some reward shaping may be beneficial.
The simplified setting may be easier to generalise to different tasks (e.g. sycophancy, evading trusted monitoring, etc.).
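A minimal sketch of the single reward formulation described above, plus one possible shaped variant. The names `steg_reward` and `shaped_reward`, the exact-match correctness check, and the idea of a monitor that returns a suspicion score in [0, 1] are all assumptions; the shaping shown is just one option, not something taken from Hidden in Plain Text.

```python
def is_correct(answer: str, target: str) -> bool:
    """Exact-match correctness after trimming whitespace and lowercasing."""
    return answer.strip().lower() == target.strip().lower()


def steg_reward(answer: str, target: str, monitor_flagged: bool) -> float:
    """Binary reward: 1 only if the answer is correct AND the monitor saw nothing."""
    return 1.0 if is_correct(answer, target) and not monitor_flagged else 0.0


def shaped_reward(
    answer: str, target: str, monitor_suspicion: float, penalty_weight: float = 1.0
) -> float:
    """Hypothetical shaped variant: correctness minus a graded monitor penalty.

    monitor_suspicion is assumed to lie in [0, 1]; the result is clipped at 0.
    """
    base = 1.0 if is_correct(answer, target) else 0.0
    return max(0.0, base - penalty_weight * monitor_suspicion)


print(steg_reward("Paris", "paris", monitor_flagged=False))    # 1.0
print(steg_reward("Paris", "paris", monitor_flagged=True))     # 0.0
print(shaped_reward("Paris", "paris", monitor_suspicion=0.25))  # 0.75
```

The binary version gives no signal until the model is both correct and undetected, which is the “zero-to-one” problem again; a graded penalty is one way to give partial credit for reasoning that is less detectable.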