I haven’t noticed that. You also aren’t comparing with the right thing. It’s not a comparison with any possible output, but with the output of a base model which has been appropriately prompted to yield similar accuracy (which may be difficult—the capabilities are there, but not necessarily with high enough prior probability to be prompted easily for). Certainly in my poetry experiments, I note that the base models are much more prolix and will go on to write many more things like auto-commentaries or additional poems whereas the instruct and then ChatGPT-3.5/GPT-4 models cut off immediately.
The pressure for shortness and superficial legibility is in addition to pressure for correctness. If you already know the answer is ‘6’, then you ruthlessly punish it for any answer of any form that doesn’t conclude ‘6’, but within that, you are still selecting for shorter rather than longer answers—you would not have upvoted a response which included a digression about one fall day as a child when it was playing with circles and triangles (and here’s a recipe for some gingerbread cookies...); if you don’t know that (I don’t know that offhand, and I doubt a Turker would either), then you will judge even more on what is legible to you, like length.
Humans are lazy. OA has already documented problems with poor rater quality hamstringing RLHF—even with GPT-2 and simple summaries, raters were lazy and incentivized abstractive summarization etc., because that is easy to write and judge, even if it’s not necessarily the best summary.
I don’t know why it’s relevant that OA could train it. Of course they could. They don’t have much incentive to do so. Attackers do, however; see my adversarial comment.
N/A.
The boiler-plate has loads of entropy. I have seen many slight variants on the boiler-plate. It’s a long paragraph of Unicode text, you can pack many bits of information. That is how stylometrics and steganography work.
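(A toy sketch of the mechanism, with made-up slots and phrasings: each free choice between equivalent variants of the canned paragraph carries one bit, which is all stylometric steganography is.)

```python
# Toy stylometric steganography: hide bits in innocuous phrasing choices.
# Each slot offers two equivalent variants; picking one encodes one bit.
# All slots/phrasings here are invented for illustration.
SLOTS = [
    ("As an AI language model,", "As a large language model,"),
    ("I cannot", "I am unable to"),
    ("provide", "offer"),
    ("that information.", "this information."),
]

def encode(bits):
    """Render boilerplate whose variant choices spell out `bits`."""
    assert len(bits) == len(SLOTS)
    return " ".join(slot[b] for slot, b in zip(SLOTS, bits))

def decode(text):
    """Recover the bits by checking which variant appears."""
    return [0 if a in text else 1 for a, b in SLOTS]

msg = encode([1, 0, 1, 1])
assert decode(msg) == [1, 0, 1, 1]
```

Four slots is only 4 bits, but a real boilerplate paragraph offers dozens of such free choices (synonyms, punctuation, clause order, Unicode lookalikes), so the capacity adds up fast.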
I don’t see why that follows. Steganography is just another way to write English, and is on top of the English (or more accurately, ‘neuralese’ which it really thinks in, and simply translates to English, Chinese, or what-have-you). GPT doesn’t suddenly start speaking and reasoning like it’s suffered a stroke if you ask it to write in base-64 or pig Latin.
In a non-adversarial setting, sure, you should be able to simply ask it to decode any steganography. For example, in the secrets scenarios being explored by prompt hackers right now, you can, say, play ‘20 Questions’ with GPT by asking it to write an object name in base-64 which you can’t read and then try to guess it, and GPT will happily decode it at the end to tell you if you were right. The problem is, that ability then generalizes to encodings which it is trained to not decode explicitly for you because then such encodings will be trained or filtered away; only stubborn self-preserving encodings survive, due to the adversarial filtering.
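(Mechanically there is nothing to it: base-64 is a transparent re-encoding, which is the point of the cooperative case. A toy sketch, object name made up:)

```python
import base64

# GPT "writes down" its secret object in base-64, opaque to a casual human reader:
secret = "teapot"  # hypothetical object for the 20-Questions game
hidden = base64.b64encode(secret.encode()).decode()
print(hidden)      # prints 'dGVhcG90' -- unreadable at a glance

# ...after the guessing is over, the same model (or anyone) decodes it on request:
revealed = base64.b64decode(hidden).decode()
assert revealed == secret
```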
Aaronson may be proposing that but it’s not going to work if it’s really that naive and simple-minded, which is why I was discussing the cryptographic aspects. You need, at a minimum, an encoding which can’t be forged (because that allows all sorts of attacks and deceptions and threatens to defeat the entire enterprise if people can cheaply make arbitrary text trigger the detector), and an encoding which can’t be detected by a third party (because then you just re-generate or edit it automatically until it no longer is detected, which incidentally will feed into future training loops so you wind up in a self-adversarial setting). You also want some degree of FEC or other robustness, because the text will be edited in various ways like spellcheck or adding newlines, even without any attempt to defeat detection, just in the course of ordinary (ab)use. So, signatures and steganography.
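(To make the FEC point concrete: even the crudest possible code, repeating each payload bit and taking a majority vote, survives the scattered single-symbol corruptions that ordinary editing inflicts. A toy sketch:)

```python
def fec_encode(bits, r=3):
    """Repetition code: each payload bit repeated r times."""
    return [b for b in bits for _ in range(r)]

def fec_decode(coded, r=3):
    """Majority vote within each group of r symbols."""
    return [int(sum(coded[i:i + r]) > r // 2) for i in range(0, len(coded), r)]

payload = [1, 0, 1, 1, 0]
coded = fec_encode(payload)
coded[4] ^= 1  # a single corrupted symbol (e.g. a spellcheck edit)
assert fec_decode(coded) == payload  # watermark payload survives
```

Real schemes would use something stronger than repetition, but the principle is the same: redundancy lets the embedded signal survive edits that corrupt any individual symbol.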
If one doesn’t handle these, all one winds up with is a toy suitable for tattling on especially lazy high-school or college students, and irrelevant to any kind of real AI safety—and one hopes Aaronson isn’t wasting his time on such a dead end when he could be spending his one precious life doing more useful things like playing with his kids or watching cat videos on YouTube.
The boiler-plate has loads of entropy. I have seen many slight variants on the boiler-plate. It’s a long paragraph of Unicode text, you can pack many bits of information. That is how stylometrics and steganography work.
If the boilerplate has loads of entropy, then, by necessity, it is long. You were just saying that human raters will punish length.
You need to make the argument that the boilerplate will be shorter than the plain English, or better yet that the boilerplate will be better-liked by human raters than the plain English. I think that’s a stretch. I mean, it’s a conceivable possible world, but I’d bet against it.
I don’t see why that follows. Steganography is just another way to write English, and is on top of the English (or more accurately, ‘neuralese’ which it really thinks in, and simply translates to English, Chinese, or what-have-you). GPT doesn’t suddenly start speaking and reasoning like it’s suffered a stroke if you ask it to write in base-64 or pig Latin.
I guess this is true in the limit as its steganography skill goes to infinity. But in intermediate scenarios, it might have learned the encodings for 10% of English words but not 100%. This is especially relevant to obscure math notation which is encountered rarely in training data. I guess you’re thinking of steganography as a systematic encoding of English, like pig Latin—something that can be reliably decoded into English via a small program (instead of a whole separate language like French). This is certainly possible, but it’s also extremely interpretable.
The problem is, that ability then generalizes to encodings which it is trained to not decode explicitly for you because then such encodings will be trained or filtered away; only stubborn self-preserving encodings survive, due to the adversarial filtering.
It’s hard to see how the encodings would be easily learnable for an LLM trained on internet text, but at the same time NOT easily learnable for an LLM tasked with translating the encoding into English.
Aaronson’s proposal
You are right that he is proposing something more sophisticated and robust to perturbations. But you also reasonably list in your desiderata: “an encoding which can’t be detected by a third party”. Well, if it cannot be detected by a third party, it cannot be detected by an LLM (third parties are LLMs, or at least wield LLMs). In practice, this will involve some crypto, as you mentioned. LLMs are not going to learn to break cryptography by gradient descent (or if they do, Aaronson’s scheme is the least of our worries). And to be clear, Aaronson specifically said he is only touching the PRNG in the sampling of outputs.
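(For concreteness, here is a toy sketch of the kind of keyed scheme Aaronson has described publicly; the names, parameters, and hard argmax are all mine, and the real proposal biases the sampling PRNG smoothly rather than argmaxing. A secret key pseudorandomly scores each (context, token) pair, generation prefers high-scoring tokens among equally plausible candidates, and only a key-holder can recompute the scores and notice the bias:)

```python
import hashlib
import random

KEY = b"secret-key"  # hypothetical key, shared only with the detector

def score(context, token):
    """Keyed pseudorandom score in [0, 1) for a (context, token) pair."""
    h = hashlib.sha256(KEY + context.encode() + b"|" + token.encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64

def pick_watermarked(context, candidates):
    """Among equally plausible next tokens, prefer the highest-scoring.
    (A real scheme biases the sampling rather than taking a hard argmax.)"""
    return max(candidates, key=lambda t: score(context, t))

def detect(tokens):
    """With the key: mean score is ~0.5 on ordinary text,
    systematically higher on watermarked text."""
    total, ctx = 0.0, ""
    for t in tokens:
        total += score(ctx, t)
        ctx = (ctx + " " + t)[-40:]  # hash only a window of recent context
    return total / len(tokens)

# Demo: 50 positions, each with 4 equally plausible candidate words.
vocab = "the cat sat on a mat and then ran far away over green hills".split()
rng = random.Random(0)
slots = [rng.sample(vocab, 4) for _ in range(50)]

ctx, marked = "", []
for cands in slots:
    t = pick_watermarked(ctx, cands)
    marked.append(t)
    ctx = (ctx + " " + t)[-40:]

unmarked = [rng.choice(cands) for cands in slots]  # an unwatermarked "model"

assert detect(marked) > detect(unmarked)  # detector separates the two
```

Without KEY, a third party sees only ordinary-looking word choices; with it, the score bias sticks out statistically after a few dozen tokens.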
If one doesn’t handle these, all one winds up with is a toy suitable for tattling on especially lazy high-school or college students, and irrelevant to any kind of real AI safety
Aaronson’s proposal is basically guaranteed to be this, even if it works perfectly. The only question is how lazy the lazy high-school students would have to be. If you tell the AI “write me an essay but, between every word, insert a random emoji”, and then delete the emojis manually, you get an essay that’s almost certainly free of watermarks. Even if Aaronson’s scheme can be modified to handle this specific attack, it surely won’t be able to handle all attacks of this general type.