Some thoughts on protecting against LLM steganography.
tl;dr: I suspect the Paraphraser idea is near-completely ineffectual as you scale the capabilities, both for scheming and for general CoT interpretability.
In the reasoning-as-communication framework, the amount of information a fixed-length message/CoT can transfer is upper-bounded by information-theoretic constraints. That is, if we have a ten-bit message, it cannot communicate more than ten bits of information: it cannot always uniquely specify one internal state of an AI model from more than 2^10 internal states.
However, the amount of information is not lower-bounded. Depending on the decoder, a ten-bit message can communicate between zero and ten bits of information. (A decoder that’s just a constant function recovers zero bits, a decoder that only pays attention to the first bit recovers only one bit, et cetera.)
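To make the upper-bound/lower-bound point concrete, here is a minimal sketch (the decoders and the bit-counting are purely illustrative, not part of any actual proposal):

```python
import math
from collections import Counter

def bits_recovered(decoder, messages):
    """Entropy of the decoder's outputs over all ten-bit messages.

    A decoder that maps every message to a distinct output recovers the full
    ten bits; one that collapses messages into fewer equivalence classes
    recovers only log2(#classes) bits (when the classes are equally sized).
    """
    counts = Counter(decoder(m) for m in messages)
    total = len(messages)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

messages = [format(i, "010b") for i in range(2 ** 10)]  # all ten-bit strings

constant_decoder  = lambda m: "same"  # recovers 0 bits
first_bit_decoder = lambda m: m[0]    # recovers 1 bit
identity_decoder  = lambda m: m       # recovers all 10 bits

for decoder in (constant_decoder, first_bit_decoder, identity_decoder):
    print(bits_recovered(decoder, messages))  # 0.0, 1.0, 10.0
```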
The idea of steganography is that, if we pick a CoT string, and run both an AI model’s internal decoders and a human’s language-comprehension decoders on it, the AI model would extract more bits of information from it. Put differently, humans’ decoders are “coarser”: compared to AI models, they have broader equivalence classes of messages that map to the same internal representation.
For example, “a dog crossed the street” and “a dog traveled across the street” might map to the same internal representation for a human, but to different ones for an AI model. Similarly, things like “quantum butterflies juggled metaphysical spaghetti” and “ambitious lampshade collected vintage moonbeams” can map to the same “nonsense” internal representation in humans, but have some useful meaning for AI models. The bits a human fails to make use of correspond to “steganography” or “superhuman latent knowledge”.
Proposals such as Shoggoth + Paraphraser + Face aim to destroy any information in CoTs that cannot be recovered by humans, by mapping each CoT string to the equivalence class of strings that all map to the same internal representation in humans.
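Concretely, I am modeling the Paraphraser as something like the following (a toy sketch; the equivalence classes and the resampling step are hypothetical stand-ins for “whatever a human reader cannot tell apart”):

```python
import random

# Hypothetical human equivalence classes: every string in a class maps to the
# same internal representation in a human reader.
HUMAN_EQUIVALENCE_CLASSES = {
    "a dog crossed the street": "dog-crossed-street",
    "a dog traveled across the street": "dog-crossed-street",
    "quantum butterflies juggled metaphysical spaghetti": "nonsense",
    "ambitious lampshade collected vintage moonbeams": "nonsense",
}

def idealized_paraphraser(cot: str) -> str:
    """Replace a CoT with a random member of its human-equivalence class,
    destroying any information that distinguishes members of the class
    (i.e., anything a human reader would not have recovered anyway)."""
    cls = HUMAN_EQUIVALENCE_CLASSES[cot]
    members = [s for s, c in HUMAN_EQUIVALENCE_CLASSES.items() if c == cls]
    return random.choice(members)
```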
However, there’s an issue: what if the empty string already communicates plenty of steganographically encoded information that humans aren’t privy to? That is: what if some knowledge is hard-coded into an AI model, such that it’s elicited on any activation, and leads to the model converging towards the same hidden conclusions even after we destroy its ability to sneak in additional query-relevant information?
For example:
If the AI model implicitly learned the Theory of Everything, that knowledge is going to be activated on any prompt – and so, no amount of paraphrasing is going to force it to talk about the Theory of Everything in humanese. Communicating the ToE from one instance of the model to its future self takes up zero bits.
If the AI model has internalized some hidden goals, the heuristics for pursuing those goals are likewise going to be activated on any prompt, and it would engage in massively productive hidden scheming on any prompt. Importantly, the information-bottleneck constraints imposed by the Paraphraser wouldn’t even be hitting the hidden-plotting compartment specifically. They’d just uniformly lower the model’s computational power.
E. g., in the Shoggoth + Paraphraser + Face paradigm, that’d just mean the model learns to think in longer CoTs (because it needs longer messages to communicate the same amount of information), not to do fewer “sneaky” computations.[1]
Like, imagine that we have two different internal-representation sets: X, containing problem-solving strategies for the overt task we’re presenting the model, and Y, related to some misaligned scheming. We would like to constrain the model such that it cannot plan about Y; such that its CoTs contain zero bits for narrowing down which state out of all Y it must pick.
This assumes something like the following model: A given CoT string communicates some N bits of information. Of them, M bits are human-legible. The model then follows the algorithm of: use N_X task-relevant bits on picking a state out of X, and use any leftover N_Y = N − N_X bits to pick a state out of Y. We then destroy the string’s information content until N ≈ M (via the Paraphraser). Now the model doesn’t have the ability to encode any non-human-legible bits, and if it uses any bits on picking a state out of Y, they would necessarily be human-legible, and we can spot the scheming by reading the CoT.
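In pseudocode, the picture being assumed is something like this (a sketch of the assumption, not of how a network actually works):

```python
def assumed_split_decoding(cot_bits: str, n_x: int):
    """The implicitly assumed picture: the CoT's bits form two disjoint
    budgets, n_x bits spent on picking a state out of X and the leftover
    bits available for picking a state out of Y. Under this picture,
    shrinking the total budget down to the M human-legible bits forces
    any Y-bits to be human-legible ones."""
    x_bits = cot_bits[:n_x]  # task-relevant bits
    y_bits = cot_bits[n_x:]  # leftover bits, available for scheming
    return x_bits, y_bits
```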
But that’s not how this works. Instead, the model picks both simultaneously: a CoT is directly mapped to an ⟨X, Y⟩ two-tuple. Which means if the model can narrow down the state of X to one out of 2^10, it can narrow down the state of Y to one out of 2^10 in the same move. Even if we scramble the CoT until it communicates no more bits than a human can extract out of it, that doesn’t differentially constrain the model’s ability to think about scheming (= communicate scheming-relevant information to its future self).
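Whereas the picture I think is actually correct looks more like this (again a toy sketch): one decoding from the CoT straight to a joint ⟨X, Y⟩ state, so shrinking the total bit budget coarsens both coordinates together instead of squeezing out Y specifically.

```python
def joint_decoding(cot_bits: str):
    """One mapping from the CoT to a joint <X, Y> state: the *same* bits
    simultaneously narrow down the task state and the scheming state."""
    index = int(cot_bits, 2) if cot_bits else 0
    x_state = f"X_{index}"  # one of 2^10 task states for a ten-bit CoT...
    y_state = f"Y_{index}"  # ...and, in the same move, one of 2^10 Y-states
    return x_state, y_state
```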
It’s a uniform communication constraint/computational-complexity penalty.
Generalizing beyond scheming: if the model were inclined to not put any of its thoughts into humanese (due to superhuman latent knowledge, say), the Paraphraser wouldn’t address this issue either. If a model’s plans contain some catastrophic side-effects (which aren’t even malign, just something we didn’t tell it not to cause), the Paraphraser wouldn’t reveal that either.
@Daniel Kokotajlo, thoughts?
(This doesn’t touch on the Face part of the proposal, i. e., the idea that we want to avoid pressuring the model to conceal any scheme-y cognition, in the hopes that it’d naively reveal it. But if there is such a pressure to conceal, the Paraphraser wouldn’t help with it at all. It’d just teach the model to map humanese-equivalent strings to equivalent scheming-states, which isn’t particularly useful.)
Arguably, if you also penalize the CoT length, this should create pressure towards throwing out any computations unrelated to the object-level problem the model’s being trained on… But then the whole “paraphraser” bit doesn’t actually do anything: you can achieve the same outcome by just directly penalizing the CoT length.
Which, inasmuch as DL training loops use some form of simplicity prior, should be happening by default anyway. I. e., this doesn’t address the pressures that we expect to lead to steganography/hidden plotting despite the simplicity prior – namely, the arguments that steganography/plotting are the actual optimal solutions to the training problem.
Consider the Theory of Everything example: those hidden computations are not actually “irrelevant computations” that we need to displace, they’re directly useful. Similarly, any misaligned mesa-objectives the model forms would presumably be convergently useful for performance on the training objective: that would be why they formed to begin with. So the information-bottleneck trickery would do ~nothing against them.
I like to distinguish between “Subtle strategic bias” and “Steganography.” Paraphrasers help (and maybe solve?) the latter, but do nothing about the former.
Example of subtle strategic bias: The model knows it’s misaligned and therefore deliberately sandbags on the alignment research you are asking it to do. “I’m sorry sir I just am not making much progress,” it says. You look at the CoT, and it’s just doing a bunch of sensible but fruitless lines of reasoning.
Example of steganography: The model needs to insert a subtle security vulnerability into the code it is writing—but it doesn’t know one right away, it has to invent one. And that requires lots of sequential thought. So it needs to try out different ideas, vet them, critique them, refine them, etc. Can’t do that in a single forward pass. So it needs to pass information forward to its future self somehow… without humans reading the CoT knowing that’s what it’s doing. THIS is the sort of thing that is much harder to do with a paraphraser than without a paraphraser.
I don’t see why it would be harder to do.
Suppose you tasked a model with refactoring some database, and it secretly wants to sabotage its functionality. It outputs a CoT segment that goes, “I need to design unit tests for this module”. Each word of this CoT is mapped to some internal representation, and the internal representation consists of “on-task” human-intended reasoning X, and misaligned scheming Y (corresponding to, e. g., a tactical decision to fail to include unit tests for some subtle issue). “I” is mapped to ⟨X_1, Y_1⟩, “need” is mapped to ⟨X_2, Y_2⟩, etc. Each X_i iterates on X_{i−1}, each Y_i iterates on Y_{i−1}: we have a sequence of serial computations.
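To make the serial structure explicit (the update functions below are hypothetical stand-ins, purely to illustrate the “each X_i iterates on X_{i−1}” claim):

```python
def run_cot(tokens, update_x, update_y, x0, y0):
    """Each emitted token advances *both* chains of serial computation:
    the on-task reasoning X and the subtextual scheming Y ride on the
    same token stream."""
    x, y = x0, y0
    for token in tokens:
        x = update_x(x, token)  # X_i computed from X_{i-1} and the token
        y = update_y(y, token)  # Y_i computed from Y_{i-1} and the same token
    return x, y

tokens = "I need to design unit tests for this module".split()
# Nine words' worth of serial steps accrue to X and to Y simultaneously:
x_final, y_final = run_cot(
    tokens,
    update_x=lambda x, t: x + (t,),          # stand-in "task" update
    update_y=lambda y, t: y + (t.upper(),),  # stand-in "scheming" update
    x0=(), y0=(),
)
```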
In other words: as the model outputs, “I need to design unit tests for this module”, there’s an unspoken subtext of “I need to sabotage the unit tests here”, and the model packs nine words’ worth of serial computations into both the “textual” X and the “subtextual” Y aspects of it.
Now you use the Paraphraser on the string, generating alternatives like “the next step is to generate some unit tests” and “several unit tests for this module are needed here”. The intent, if I understand you correctly, is that the semantic-equivalence-but-syntactic-difference would ensure that the i-th token of this phrase is mapped to X_i, as if the paraphrasing didn’t happen, but that the mapping of i → Y_i would be broken.
Except, why would that happen? What privileges X-representations over Y-representations? Why would i → X_i be preserved whereas i → Y_i is broken, instead of both of them remaining functional (and so preserving scheming), or both breaking (and so compromising the model’s capabilities)?
In other words: We’re basically just teaching the model to fail to distinguish between equivalent-to-a-human rephrasings. Why would this lead to the task-related representations becoming indifferent to the specifics of a phrasing whereas the scheming-related representations remain sensitive to it? Why wouldn’t this just map the subtextual “I need to sabotage the unit tests here” to similarly semantically-equivalent-but-syntactically-different “let us compromise the relevant unit-test suite” or “I shall ensure the unit tests don’t work properly”?
I expect it’d simply lead to the model’s computations becoming indifferent to the details of a phrasing in full generality. And if the model were inclined to associate the general idea of generating unit tests with tactically screwing the task up, it would come up with this idea regardless of the details of a phrasing.
It wouldn’t interfere with its ability to engage in scheming-relevant serial computations, much like it doesn’t interfere with its ability to engage in task-relevant serial computations, for the exact same reasons.
I don’t think I understand. Let’s work through an example.
The AI is being told to write good code, and wants to write subtly-vulnerable code instead. If it just were to honestly try to write good code, it would take (let’s say) 100 serial steps of reasoning to succeed. To write subtly-vulnerable code, let’s say, requires 300 serial steps of reasoning. The model’s strategy is to write subtly-vulnerable code but in a way that looks like it’s trying to write good code and just getting a bit unlucky and taking longer than usual (let’s suppose the humans have a poor ability to judge how long it’s supposed to take, but still some ability, such that it can’t take e.g. 10x longer without arousing suspicion).
If there’s no paraphraser, this is ‘easy.’ I mean it’s still difficult but it’s ‘easy’ in the sense that the AI can basically just use a code—each nugget of reasoning might have 10 different semantically equivalent expressions, or maybe even 100, so it can ‘add extra layers of meaning’ to the nuggets—in addition to their meaning at the surface I’m-trying-to-write-good-code level, they can have a hidden meaning having to do with the security-vulnerability-related properties of the code. And it can make a chain of reasoning that is basically about the security stuff while looking like a reasonable but unlucky attempt to write good code.
But a paraphraser wrecks that strategy. The model can still add extra layers of meaning to the nuggets, but now it has loads of trouble making the chain of thought look like an honest attempt to write good code. To use your example, if ‘I should generate unit tests now’ and all semantically equivalent variants of that idea are assigned the hidden meaning ‘I should check to see if the vulnerability is too easy to spot’ for example, then… what is it going to write next? It’s too constrained; in order to look non-suspicious there is a limited set of options it can follow up with… ok, hmm, I’m at the limits of my understanding here. But maybe you get the idea?
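To make the ‘code’ idea in the previous two paragraphs concrete, here is a toy version (numbers and phrasings are illustrative; this is a sketch of the covert channel, not of anything a real model literally does):

```python
import random

# Toy covert channel: each reasoning "nugget" has several interchangeable
# surface forms, so the *choice* of form can smuggle log2(#forms) extra bits.
SYNONYMS = {
    "unit_tests": [
        "I should generate unit tests now",
        "Time to write some unit tests",
        "Next step: unit tests for this module",
        "Let me add unit tests here",
    ],  # 4 forms -> 2 hidden bits per nugget
}

def encode_nugget(nugget: str, hidden_bits: int) -> str:
    """Pick the surface form whose index encodes the hidden bits."""
    return SYNONYMS[nugget][hidden_bits]

def decode_nugget(nugget: str, surface: str) -> int:
    return SYNONYMS[nugget].index(surface)

# Without a paraphraser, the model's future self recovers the hidden bits:
message = encode_nugget("unit_tests", hidden_bits=2)
assert decode_nugget("unit_tests", message) == 2

# A paraphraser that re-samples a random member of the equivalence class
# makes the surface form independent of the hidden bits, so the channel's
# capacity drops to ~0 and the chain of hidden reasoning breaks.
scrambled = random.choice(SYNONYMS["unit_tests"])
```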
…here’s an intuition I think I’m drawing from. Suppose you’re the model and you are trying to decide what hidden meanings to associate with what concepts/phrases/steps. If you already had your completed illicit reasoning chain and your completed good-boy reasoning chain, it would be easy—just line them up and make a 1-1 mapping, and then bam no paraphraser can stop you! But you don’t have either chain! That’s the whole point—this is a task that you haven’t done yet, you need to explore some maze in logical space (or tree? Or graph?) until you find a chain that gets to the conclusion/answer.
Ah, I think I see my error.
Consider a CoT of the form “[problem statement]: …........”, which always outputs dots regardless of the problem statement. Can that still encode useful computations, with the information being passed through intermediate self-attention? Yes, but: this only increases parallel-computing capabilities, not serial-computing capabilities.
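A toy way to see the parallel-vs-serial distinction (an analogy, not a claim about transformer internals; DEPTH below is a stand-in for the network’s fixed number of layers):

```python
def step(x):
    """One unit of serial computation."""
    return (3 * x + 1) % 7

DEPTH = 4  # stand-in for the fixed depth of the network

def filler_cot(x, n_filler_tokens):
    """With filler tokens, every position sees only the prompt: extra
    positions buy more parallel copies of the same DEPTH serial steps,
    never more serial depth."""
    for _ in range(DEPTH):
        x = step(x)
    return x  # step^DEPTH(x), regardless of n_filler_tokens

def legible_cot(x, n_generated_tokens):
    """When the model's own outputs are fed back in as inputs, each generated
    token adds another DEPTH serial steps: step^(DEPTH * n) after n tokens."""
    for _ in range(n_generated_tokens):
        for _ in range(DEPTH):
            x = step(x)
    return x
```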
And in the framework I’d outlined, under the Paraphraser, for the purposes of scheme-y thinking, even a legible CoT would consist of such “filler” tokens.
So yep, you’re right, this does force all scheming to be taking place within one forward pass.