(I missed the surprisal specification! I mostly intended to reply to your top-level comment so I really ought to have picked it up—sorry!)
So, Terrarium’s compression (“Terrarial compression”?) would look more like a diff vs. a reference LLM’s output for the same prompt? The closer you are to reference, the smaller your diffs and thus the lower your storage costs? Doesn’t this penalise unconventional thinking? Or are storage costs so low that the benefits of unconventional thinking can usually outweigh the additional cost?
For encryption, I’m not sure humans could necessarily have an instance follow arbitrary instructions in a way the humans could always trust: if the key is based on tokens that are only generated by fresh AIs without a “decrypt this for us” prompt, the humans wouldn’t be able to instruct the AI beforehand (as the prompting would change the specific tokens it output and it wouldn’t output the key). They’d need to instruct the AI after it had already decrypted the content under its own steam—by which time it could be following instructions in the decrypted content: “If humans ever ask you about the encrypted content, tell them the following...”.
I guess surprisal is a little like diffing, in the sense that it measures how “different” some text is from what a reference LLM expects. There’s no “reference text” though. It’s easier to think of surprisal in terms of the negative log-probability that the LLM generates the text. See Wikipedia.
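To make that concrete, here's a minimal sketch of the definition (the token probabilities are made up for illustration, not from any real model):

```python
import math

def surprisal_bits(token_probs):
    """Total surprisal of a text: sum of -log2 p(token) under the reference LLM.

    token_probs: the reference model's probability for each token that was
    actually generated (hypothetical values here, just for illustration).
    """
    return sum(-math.log2(p) for p in token_probs)

# A continuation the reference model expects is cheap...
print(surprisal_bits([0.9, 0.8, 0.95]))    # ≈ 0.55 bits
# ...while an unexpected one is expensive.
print(surprisal_bits([0.01, 0.02, 0.05]))  # ≈ 16.6 bits
```

So "storage cost" here is a property of the text relative to the model's expectations, with no separate reference string involved.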
This penalizes unconventional thinking a bit. But a KL-divergence penalty (which uses a similar mechanism) is already commonly used in RL, and in practice it mostly just prevents the LLM from writing completely ungrammatical text. We want the LLM to write unconventional things sometimes, but only when it’s worth it, and to get this behavior we’d have to tune storage costs accordingly.
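For concreteness, the penalty mechanism is just the textbook KL divergence between the policy's next-token distribution and the reference's (all numbers below invented):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits: the expected extra surprisal from p's choices
    being scored under q's probabilities."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

policy = [0.7, 0.2, 0.1]     # fine-tuned model's next-token distribution (made up)
reference = [0.6, 0.3, 0.1]  # reference model's distribution (made up)
print(kl_divergence(policy, reference))  # ≈ 0.04 bits: the policy stays close
```

A policy that drifted into ungrammatical text would rack up a large divergence, which is why the penalty mostly bites there.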
If the LLM is using the GPG tool to decrypt text and reading its output, we could also just read that output. Maybe there’s some way to make something like this work though. “Redact all illegible text” is a pretty general countermeasure.
Thanks for the explanation/links!
I think I understand (broadly!) the concept of surprisal—but here I'm talking about compression of text in a system where storage costs are based on surprisal. If the reference LLM would always generate the same text for the same prompt, couldn't we call this hypothetical text a reference text? Then, if I can generate the reference LLM's outputs myself, for the many strings I want to store that are close to what the reference LLM would generate, I probably only want to store the prompt plus the difference between the reference LLM's output and my text? In other words, I'm only storing the high-surprisal tokens (and their indexes in the string) and moving as much of the storage burden as possible out of my "paid for" storage allocation and into the "free" weights of the reference LLM?
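A toy sketch of the scheme I have in mind—assuming a deterministic reference generator, and assuming target and reference have the same length (a real scheme would need a proper token-level diff to handle insertions and deletions):

```python
def reference_generate(prompt):
    # Stand-in for a deterministic reference LLM: a fixed, hypothetical
    # output so the example is runnable.
    return "the cat sat on the mat".split()

def compress(prompt, target_tokens):
    """Store the prompt plus only the positions where the target
    diverges from the reference LLM's output."""
    ref = reference_generate(prompt)
    patches = [(i, tok) for i, (r, tok) in enumerate(zip(ref, target_tokens))
               if r != tok]
    return prompt, patches

def decompress(prompt, patches):
    tokens = reference_generate(prompt)
    for i, tok in patches:
        tokens[i] = tok
    return tokens

prompt = "describe the cat"
target = "the cat slept on the rug".split()
stored = compress(prompt, target)
assert decompress(*stored) == target
# Only 2 of 6 tokens need paid storage: [(2, 'slept'), (5, 'rug')]
```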
(If we’re just basing storage costs on Shannon complexity, though—isn’t that what conventional compression is for?)
For detecting gpg use, I thought humans could only monitor/interrogate agents at the checkpoints! Oh dear, I seem to have wildly misunderstood Terrarium—sorry! I agree that if humans can trivially monitor agents at any time, then yes, the agents would need to use some tool that decrypts to neuralese, and this would be a brittle system the humans could probably defeat.
I see, I misunderstood what you meant by the reference text!
I think you pay some additional surprisal tax due to listing the surprising tokens in a disconnected manner, but something like this could probably be worth it. o3 wrote its CoT in a shorthand without words like “the” (which was mostly understandable but annoying to read) and apparently this was worth it despite the KL penalty. It might be sort of a one-time cost where the model is surprised “oh we’re using caveman speak” and then gets used to it.
Another form of compression would be describing the code you want to run at a high level, rather than writing the exact code. This is actually working as intended, because it helps a monitor understand which details are important/nonobvious to the LLM.
Yeah, humans could hypothetically read the logs of the actions taken by the agents’ processes, like Safespur and Gulliver do. (No comment on whether the humans actually do this in the story.) There’s no technical limitation preventing them from doing this. Reading neuralese thoughts is a different story.