This is a very interesting thought. Thanks for writing it.
However, while steganography is worth keeping in mind, I find myself skeptical of certain parts of this story:
In practice, RLHF makes the model’s answers longer, not shorter. RLHF as done today seems to try to teach models to think step-by-step to improve accuracy. Almost everything out of GPT-4’s mouth is overly long with many intermediate steps; I asked it “what’s the maximum number of intersection points between a circle and a triangle”, and it said 6 (correct) and followed it with a multi-part step-by-step style explanation, even though it already knew the right answer within the first few tokens. It’s incredibly predisposed to thinking step-by-step and prefers long answers to short ones.
Will human evaluators in the future prefer short answers instead? That remains to be seen. The priority of firms like OpenAI is to have the models be accurate, so if short answers result in lower accuracy (as is surely the case), it’s unlikely that OpenAI will train the LLM towards short answers. If OpenAI really wanted short answers to math questions, they could get them via a 2-step process: first, run an overly verbose, step-by-step LLM like GPT-4; second, feed that output into a summarizing LLM which extracts a one-sentence summary, and show the user that.
The approach in (2) will almost certainly lead to much higher accuracy on complex problems than training LLMs to give short outputs via RLHF (admittedly, this is at the cost of being somewhat slower). Therefore, I find it unlikely that people will train LLMs via RLHF to give short answers. And without pressure towards short answers, there’s no incentive to develop steganography.
Even if there is pressure towards short answers, I don’t think steganography necessarily helps. For steganography to be feasible, the boilerplate part of the answer—the part that goes “I am a large language model trained by OpenAI”—will have to have enough entropy to store whatever it is you’re encoding. Why wouldn’t human evaluators penalize such high-entropy boilerplate? I suspect high-entropy boilerplate will look less desirable to human evaluators than thinking step-by-step explicitly, but I guess that remains to be seen.
Even if LLMs learn steganography from the internet, they’ll likely speak the steganographic language worse than they speak English. Absent an incentive to do otherwise, they might therefore prefer to think in English when thinking step-by-step.
If LLMs do learn steganography, it should be fairly trivial to train an LLM to translate from the steganographic language back to English—after all, we’re assuming these LLMs are extremely capable, and they’ll speak both English and the steganographic language very well. They should be able to translate between them as easily as between English and French. Humans just need to be on the lookout for this, and what was previously uninterpretable steganography might quickly become interpretable.
The type of watermarking that Aaronson is proposing is unlikely to make steganography any more likely than otherwise. Aaronson is just proposing to change the pseudorandom number generator that is used to sample tokens from the output distribution. Even without adopting Aaronson’s scheme, OpenAI still uses pseudorandom numbers instead of true random numbers when sampling tokens; why does it worry you if the PRNG encodes “OA model v.1234 user XYZ” instead of encoding “the random seed was 12345”, which is what it’s currently encoding? In both cases there is an encoding here, and in both cases it’s not going to be broken anytime soon.
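For concreteness, here is a toy sketch of what “only changing the PRNG” could look like. It is not Aaronson’s actual construction: the HMAC keying, the three-token context, the exponential-race selection rule, the uniform stand-in model, and every name below are assumptions made for illustration. What it tries to show is that the sampler still draws each token with the model’s own probabilities, and that the chosen tokens only look special to someone holding the key.

```python
import hashlib
import hmac
import math

CONTEXT_LEN = 3                            # assumption: key the PRNG on the last 3 tokens
VOCAB = [f"w{i}" for i in range(50)]       # hypothetical toy vocabulary
START = ("<s>",) * CONTEXT_LEN             # padding so the first token has a context


def keyed_uniform(key, context, token):
    """Keyed pseudorandom number strictly inside (0, 1) for one (context, token) pair."""
    msg = "\x1f".join((*context, token)).encode("utf-8")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return ((int.from_bytes(digest[:8], "big") >> 12) + 0.5) / 2**52


def sample_token(key, context, probs):
    """Exponential-race sampling: token t wins with probability probs[t],
    but the randomness comes from the keyed function rather than a plain seed."""
    return max(
        (t for t, p in probs.items() if p > 0),
        key=lambda t: math.log(keyed_uniform(key, context, t)) / probs[t],
    )


def toy_model(context):
    # Stand-in for an LLM's next-token distribution (assumption: uniform).
    return {t: 1.0 / len(VOCAB) for t in VOCAB}


def generate(key, n_tokens):
    tokens = list(START)
    for _ in range(n_tokens):
        context = tuple(tokens[-CONTEXT_LEN:])
        tokens.append(sample_token(key, context, toy_model(context)))
    return tokens[CONTEXT_LEN:]


def detection_score(key, tokens):
    """Mean of -ln(1 - r) over the chosen tokens.
    For text produced without this key, r is effectively uniform, so the mean is near 1."""
    padded = list(START) + list(tokens)
    total = 0.0
    for i in range(CONTEXT_LEN, len(padded)):
        context = tuple(padded[i - CONTEXT_LEN:i])
        total += -math.log(1.0 - keyed_uniform(key, context, padded[i]))
    return total / len(tokens)


text = generate(b"provider-secret", 200)
score_with_key = detection_score(b"provider-secret", text)
score_wrong_key = detection_score(b"someone-else", text)
print(score_with_key, score_wrong_key)
```

With the right key the score should sit well above 1; with any other key it should hover around 1, which is also what ordinary seeded sampling would give.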
Still, I do agree that steganography is an interesting possibility and could definitely arise in powerful LLMs that are accidentally incentivized in this direction. It’s something to watch out for, and interesting to think about.
I haven’t noticed that. You also aren’t comparing with the right thing. It’s not a comparison with any possible output, but with the output of a base model which has been appropriately prompted to yield similar accuracy (which may be difficult—the capabilities are there, but not necessarily with high enough prior probability to be prompted easily for). Certainly in my poetry experiments, I note that the base models are much more prolix and will go on to write many more things like auto-commentaries or additional poems whereas the instruct and then ChatGPT-3.5/GPT-4 models cut off immediately.
The pressure for shortness and superficial legibility is in addition to pressure for correctness. If you already know the answer is ‘6’, then you ruthlessly punish it for any answer of any form that doesn’t conclude ‘6’, but within that, you are still selecting for shorter rather than longer answers—you would not have upvoted a response which included a digression about one fall day as a child when it was playing with circles and triangles (and here’s a recipe for some gingerbread cookies...); if you don’t know that (I don’t know that offhand, and I doubt a Turker would either), then you will judge even more on what is legible to you, like length.
Humans are lazy. OA has already documented problems with poor rater quality hamstringing RLHF: even with GPT-2 and simple summaries, raters were lazy and incentivized abstractive summarization etc., because that is easy to write and judge, even if it’s not necessarily the best summary.
I don’t know why it’s relevant that OA could train it. Of course they could. They don’t have too much incentive to do so. Attackers do, however; see my adversarial comment.
N/A.
The boiler-plate has loads of entropy. I have seen many slight variants on the boiler-plate. It’s a long paragraph of Unicode text, you can pack many bits of information. That is how stylometrics and steganography work.
I don’t see why that follows. Steganography is just another way to write English, and is on top of the English (or more accurately, ‘neuralese’ which it really thinks in, and simply translates to English, Chinese, or what-have-you). GPT doesn’t suddenly start speaking and reasoning like it’s suffered a stroke if you ask it to write in base-64 or pig Latin.
In a non-adversarial setting, sure, you should be able to simply ask it to decode any steganography. For example, in the secrets scenarios being explored by prompt hackers right now, you can, say, play ‘20 Questions’ with GPT by asking it to write an object name in base-64 which you can’t read and then try to guess it, and GPT will happily decode it at the end to tell you if you were right. The problem is, that ability then generalizes to encodings which it is trained to not decode explicitly for you because then such encodings will be trained or filtered away; only stubborn self-preserving encodings survive, due to the adversarial filtering.
Aaronson may be proposing that but it’s not going to work if it’s really that naive and simple-minded, which is why I was discussing the cryptographic aspects. You need, at a minimum, an encoding which can’t be forged (because that allows all sorts of attacks and deceptions and threatens to defeat the entire enterprise if people can cheaply make arbitrary text trigger the detector), and an encoding which can’t be detected by a third party (because then you just re-generate or edit it automatically until it no longer is detected, which incidentally will feed into future training loops so you wind up in a self-adversarial setting). You also want some degree of FEC or other robustness, because the text will be edited in various ways like spellcheck or adding newlines, even without any attempt to defeat detection, just in the course of ordinary (ab)use. So, signatures and steganography.
If one doesn’t handle these, all one winds up with is a toy suitable for tattletaling on especially lazy high school or college students, and irrelevant to any kind of real AI safety; one hopes Aaronson isn’t wasting his time on such a dead end when he could be spending his one precious life doing more useful things like playing with his kids or watching cat videos on YouTube.
> The boiler-plate has loads of entropy. I have seen many slight variants on the boiler-plate. It’s a long paragraph of Unicode text, you can pack many bits of information. That is how stylometrics and steganography work.
If the boilerplate has loads of entropy, then, by necessity, it is long. You were just saying that human raters will punish length.
You need to make the argument that the boilerplate will be less long than the plain English, or better yet that the boilerplate will be better-liked by human raters than the plain English. I think that’s a stretch. I mean, it’s a conceivable possible world, but I’d bet against it.
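To put rough numbers on the entropy-versus-length question, here is a toy sketch in which each slot of a boilerplate sentence has a few interchangeable phrasings. The slots, the phrasings, and the mixed-radix encoding are all hypothetical choices made for illustration, not anything observed in a real model. The capacity is the sum of log2(number of variants) over the slots, so it grows only as more slots (more text) are added.

```python
import math

# Hypothetical boilerplate: each slot lists interchangeable phrasings.
SLOTS = [
    ["I am", "I'm"],
    ["a large language model", "an AI language model", "a language model", "an AI assistant"],
    ["trained by OpenAI", "developed by OpenAI", "created by OpenAI", "built by OpenAI"],
    ["and I cannot", "and I can't", "so I cannot", "so I can't"],
    ["browse the internet.", "access the internet.", "browse the web.", "access the web."],
]

# Hidden capacity in bits: one log2 term per slot.
capacity_bits = sum(math.log2(len(variants)) for variants in SLOTS)


def encode(payload):
    """Hide a non-negative integer in the choice of phrasing (mixed-radix digits)."""
    parts = []
    for variants in SLOTS:
        payload, choice = divmod(payload, len(variants))
        parts.append(variants[choice])
    if payload:
        raise ValueError("payload exceeds the capacity of this boilerplate")
    return " ".join(parts)


def decode(text):
    """Recover the integer by checking which phrasing appears in each slot.
    Relies on no phrasing being a prefix of another within the same slot."""
    payload, radix, rest = 0, 1, text
    for variants in SLOTS:
        choice = next(i for i, v in enumerate(variants) if rest.startswith(v))
        payload += choice * radix
        radix *= len(variants)
        rest = rest[len(variants[choice]):].lstrip()
    return payload


sentence = encode(300)
print(sentence)
print(decode(sentence), capacity_bits)
```

Under these assumptions one ordinary-looking sentence carries 9 bits, and every additional bit needs another slot with at least two acceptable phrasings, so the real question is whether raters tolerate that variation more readily than explicit step-by-step text.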
> I don’t see why that follows. Steganography is just another way to write English, and is on top of the English (or more accurately, ‘neuralese’ which it really thinks in, and simply translates to English, Chinese, or what-have-you). GPT doesn’t suddenly start speaking and reasoning like it’s suffered a stroke if you ask it to write in base-64 or pig Latin.
I guess this is true in the limit as its steganography skill goes to infinity. But in intermediate scenarios, it might have learned the encodings for 10% of English words but not 100%. This is especially relevant to obscure math notation which is encountered rarely in training data. I guess you’re thinking of steganography as a systematic encoding of English, like pig Latin—something that can be reliably decoded into English via a small program (instead of a whole separate language like French). This is certainly possible, but it’s also extremely interpretable.
> The problem is, that ability then generalizes to encodings which it is trained to not decode explicitly for you because then such encodings will be trained or filtered away; only stubborn self-preserving encodings survive, due to the adversarial filtering.
It’s hard to see how the encodings will be easily learnable for an LLM trained on internet text, but at the same time, NOT easily learnable for an LLM tasked with translating the encoding into English.
Aaronson’s proposal
You are right that he is proposing something more sophisticated and robust to perturbations. But you also reasonably list in your desiderata: “an encoding which can’t be detected by a third party”. Well, if it cannot be detected by a third party, it cannot be detected by an LLM (third parties are LLMs or at least wield LLMs). In practice, this will involve some crypto, as you mentioned. LLMs are not going to learn to break cryptography by gradient descent (or if they will, Aaronson’s scheme is the least of our worries). And to be clear, Aaronson specifically said he is only touching the PRNG in the sampling of outputs.
> If one doesn’t handle these, all one winds up with is a toy suitable for tattletaling on especially lazy high school or college students, and irrelevant to any kind of real AI safety
Aaronson’s proposal is basically guaranteed to be this, even if it works perfectly. The only question is how lazy the lazy high school students would have to be. If you tell the AI “write me an essay but, between every word, insert a random emoji”, and then you delete the emojis manually, you get an essay that’s almost certainly free of watermarks. Even if Aaronson’s scheme can be modified to handle this specific attack, it surely won’t be able to handle all attacks of this general type.
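A toy sketch of why the emoji trick would work against a watermark whose pseudorandomness is keyed on the preceding tokens. Everything here is an assumption made for illustration (HMAC keying, a three-token context, a uniform stand-in model, placeholder emoji tokens); it is not Aaronson’s scheme. Each word is chosen in a context that contains emoji, so once the emoji are deleted the detector recomputes its numbers from contexts that were never used during generation.

```python
import hashlib
import hmac
import math

KEY = b"provider-secret"                   # hypothetical watermark key
WORDS = [f"w{i}" for i in range(50)]       # hypothetical word tokens
EMOJI = [f":e{i}:" for i in range(20)]     # stand-ins for emoji tokens
N = 3                                      # assumption: context is the last 3 tokens


def r(key, context, token):
    """Keyed pseudorandom number strictly inside (0, 1)."""
    msg = "\x1f".join((*context, token)).encode("utf-8")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return ((int.from_bytes(digest[:8], "big") >> 12) + 0.5) / 2**52


def pick(key, context, candidates):
    # With a uniform toy model, exponential-race sampling reduces to "largest r wins".
    return max(candidates, key=lambda t: r(key, context, t))


def generate_with_emoji(key, n_words):
    """The 'insert a random emoji between every word' prompt, simulated."""
    tokens = ["<s>"] * N
    for _ in range(n_words):
        tokens.append(pick(key, tuple(tokens[-N:]), WORDS))
        tokens.append(pick(key, tuple(tokens[-N:]), EMOJI))
    return tokens[N:]


def score(key, tokens):
    """Mean of -ln(1 - r) over the tokens; near 1 when no watermark is visible."""
    padded = ["<s>"] * N + list(tokens)
    values = [
        -math.log(1.0 - r(key, tuple(padded[i - N:i]), padded[i]))
        for i in range(N, len(padded))
    ]
    return sum(values) / len(values)


with_emoji = generate_with_emoji(KEY, 100)
stripped = [t for t in with_emoji if t not in EMOJI]   # the student deletes the emoji

score_intact = score(KEY, with_emoji)
score_stripped = score(KEY, stripped)
print(score_intact, score_stripped)
```

In this toy setup the intact output scores well above 1 while the stripped essay scores close to 1, i.e. it is indistinguishable from unwatermarked text as far as this detector is concerned.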