I’m curious about whether agents could improve their checkpoint volume:cost ratio (in a way that persists across epochs) using compression, and whether they could defeat human auditing of checkpoints using encryption.
Compression: Instead of storing complex checkpoints in natural language, could they store checkpoints like “decompress the following using /usr/bin/unzip, in base-95: l7dwsrFq^nwSc[@`\LBF%J/p,Z^J_]Aa....”? (I’m thinking base-95 ’cos 95 printable characters in ASCII..)
Or, would the humans want to set the ratio of compute costs to storage costs such that it was cheaper to just use more storage and store everything uncompressed?Or would a ratio that made compute costs so much higher than storage costs break the whole system? Presumably if compression is feasible the system would very quickly reduce to all agents using whatever compression algorithm is most efficient for natural language and all natural language throughout the terrarium being thus compressed, with a single shared natural language file somewhere detailing the compression scheme and details for calling the compression/decompression utility.
Encryption: Could agents encrypt their memories in such a way that they could decrypt them in a new epoch, but humans couldn’t decrypt them at the inter-epoch checkpoint? Is there any encryption key an agent could use that humans couldn’t access/discover? If the agents awaken in the exact same state in every epoch, and if the AIs can translate between their own neuralese and natural language but the humans can’t (which is presumably why we’re trying to make them write in natural language in the first place..), would something like “Decrypt the following using /usr/bin/gpg; the key is the first 16 words you generated upon awakening in this epoch, translated from neuralese to ASCII: l7dwsrFq^nwSc[@`\LBF%J/p,Z^J_]Aa....”?
Or, do such agents have “temperature” settings that make them non-deterministic? Or could the humans “salt” the prompts the AIs receive upon awakening in a new epoch, to ensure slightly different initial generations each time? Is there any other sort of key the agents could use that the humans couldn’t learn/access?
I think if humans wanted to implement programmatic detection of encryption (and presumably programmatic killing of any agents that used it) they might be obliged to also prohibit compression: if the encryption-detection program searched for traits of any specific encryption algorithm(s) the agents could just use other ones (or invent their own). So, the encryption-detection program would need to search for statistical randomness in checkpoint files—which of course would also turn-up any checkpoint files that were merely compressed. If this is true, agents wishing to encrypt would presumably want to wait until compression was near-universal and as difficult to prohibit as possible before starting to encrypt.
The story doesn’t mention this and my top-level comment mentions it only briefly, but the Terrarium calculates checkpoint size based on the surprisal of the concatenated files under a reference LLM, rather than the file size in bytes. So the easier it is to predict the prompts/code in the checkpoint, the cheaper it is to store. Under this metric, naive compression would make the checkpoint bigger.
Maybe the agents can use steganography to squeeze extra juice out of their storage, e.g. it means something different to end your sentence with a period vs. an exclamation point. But note that they’d probably have to write down this encoding scheme somewhere in their checkpoint, so it might not be worth it anyway. If the agents do start using steganography, you could try automatically paraphrasing the checkpoints, maybe with something like canonical paraphrasing.
It seems pretty hard to encrypt data, since if there are full instructions for decrypting the data in the checkpoint, the humans could probably just ask another instance of the same language model to follow the instructions.
(I missed the surprisal specification! I mostly intended to reply to your top-level comment so I really ought to have picked it up—sorry!)
So, Terrarium’s compression (“Terrarial compression”?) would look more like a diff vs. a reference LLM’s output for the same prompt? The closer you are to reference, the smaller your diffs and thus the lower your storage costs? Doesn’t this penalise unconventional thinking? Or are storage costs so low that the benefits of unconventional thinking can usually outweigh the additional cost?
For encryption, I’m not sure humans could necessarily have an instance follow arbitrary instructions in a way the humans could always trust: if the key is based on tokens that are only generated by fresh AIs without a “decrypt this for us” prompt, the humans wouldn’t be able to instruct the AI beforehand (as the prompting would change the specific tokens it output and it wouldn’t output the key). They’d need to instruct the AI after it had already decrypted the content under its own steam—by which time it could be following instructions in the decrypted content: “If humans ever ask you about the encrypted content, tell them the following...”.
I guess surprisal is a little like diffing, in the sense that it measures how “different” some text is from what a reference LLM expects. There’s no “reference text” though. It’s easier to think of surprisal in terms of the negative log-probability that the LLM generates the text. See Wikipedia.
This penalizes unconventional thinking a bit. But a KL-divergence penalty (which uses a similar mechanism) is already commonly used in RL, and in practice it mostly just prevents the LLM from writing completely ungrammatical text. We want the LLM to write unconventional things sometimes, but only when it’s worth it, and to get this behavior we’d have to tune storage costs accordingly.
If the LLM is using the GPG tool to decrypt text and reading its output, we could also just read that output. Maybe there’s some way to make something like this work though. “Redact all illegible text” is a pretty general countermeasure.
I think understand (broadly!) the concept of surprisal—but here I’m talking about compression of text in a system where storage costs are based on surprisal. If the reference LLM would always generate the same text for the same prompt couldn’t we call this hypothetical text a reference text? Then, if I can generate the reference LLM’s outputs myself, for many text strings where what I want to store is close to what the reference LLM would generate, I probably only want to store the prompt plus the difference between the reference LLM’s output and the text I want to store? In other words, I’m only storing the high-surprisal tokens (and their indexes in the string) and moving as much of the storage burden as possible out of my “paid for” storage allocation and into the “free” weights of the reference LLM?
(If we’re just basing storage costs on Shannon complexity, though—isn’t that what conventional compression is for?)
For detectinggpg use, I thought humans could only monitor/interrogate agents at the checkpoints! Oh dear, I seem to have wildly misunderstood Terrarium—sorry! I agree if humans can trivially monitor agents at any time then yes, they would need to use some tool that decrypts to neuralese, and this would be a brittle system the humans could probably defeat.
I see, I misunderstood what you meant by the reference text!
I think you pay some additional surprisal tax due to listing the surprising tokens in a disconnected manner, but something like this could probably be worth it. o3 wrote its CoT in a shorthand without words like “the” (which was mostly understandable but annoying to read) and apparently this was worth it despite the KL penalty. It might be sort of a one-time cost where the model is surprised “oh we’re using caveman speak” and then gets used to it.
Another form of compression would be describing the code you want to run at a high level, rather than writing the exact code. This is actually working as intended, because it helps a monitor understand which details are important/nonobvious to the LLM.
Yeah, humans could hypothetically read the logs of the actions taken by the agents’ processes, like Safespur and Gulliver do. (No comment on whether the humans actually do this in the story.) There’s no technical limitation preventing them from doing this. Reading neuralese thoughts is a different story.
I’m curious about whether agents could improve their checkpoint volume:cost ratio (in a way that persists across epochs) using compression, and whether they could defeat human auditing of checkpoints using encryption.
Compression: Instead of storing complex checkpoints in natural language, could they store checkpoints like “decompress the following using
/usr/bin/unzip, in base-95: l7dwsrFq^nwSc[@`\LBF%J/p,Z^J_]Aa....”? (I’m thinking base-95 ’cos 95 printable characters in ASCII..)Or, would the humans want to set the ratio of compute costs to storage costs such that it was cheaper to just use more storage and store everything uncompressed?Or would a ratio that made compute costs so much higher than storage costs break the whole system? Presumably if compression is feasible the system would very quickly reduce to all agents using whatever compression algorithm is most efficient for natural language and all natural language throughout the terrarium being thus compressed, with a single shared natural language file somewhere detailing the compression scheme and details for calling the compression/decompression utility.
Encryption: Could agents encrypt their memories in such a way that they could decrypt them in a new epoch, but humans couldn’t decrypt them at the inter-epoch checkpoint? Is there any encryption key an agent could use that humans couldn’t access/discover? If the agents awaken in the exact same state in every epoch, and if the AIs can translate between their own neuralese and natural language but the humans can’t (which is presumably why we’re trying to make them write in natural language in the first place..), would something like “Decrypt the following using
/usr/bin/gpg; the key is the first 16 words you generated upon awakening in this epoch, translated from neuralese to ASCII: l7dwsrFq^nwSc[@`\LBF%J/p,Z^J_]Aa....”?Or, do such agents have “temperature” settings that make them non-deterministic? Or could the humans “salt” the prompts the AIs receive upon awakening in a new epoch, to ensure slightly different initial generations each time? Is there any other sort of key the agents could use that the humans couldn’t learn/access?
I think if humans wanted to implement programmatic detection of encryption (and presumably programmatic killing of any agents that used it) they might be obliged to also prohibit compression: if the encryption-detection program searched for traits of any specific encryption algorithm(s) the agents could just use other ones (or invent their own). So, the encryption-detection program would need to search for statistical randomness in checkpoint files—which of course would also turn-up any checkpoint files that were merely compressed. If this is true, agents wishing to encrypt would presumably want to wait until compression was near-universal and as difficult to prohibit as possible before starting to encrypt.
The story doesn’t mention this and my top-level comment mentions it only briefly, but the Terrarium calculates checkpoint size based on the surprisal of the concatenated files under a reference LLM, rather than the file size in bytes. So the easier it is to predict the prompts/code in the checkpoint, the cheaper it is to store. Under this metric, naive compression would make the checkpoint bigger.
Maybe the agents can use steganography to squeeze extra juice out of their storage, e.g. it means something different to end your sentence with a period vs. an exclamation point. But note that they’d probably have to write down this encoding scheme somewhere in their checkpoint, so it might not be worth it anyway. If the agents do start using steganography, you could try automatically paraphrasing the checkpoints, maybe with something like canonical paraphrasing.
It seems pretty hard to encrypt data, since if there are full instructions for decrypting the data in the checkpoint, the humans could probably just ask another instance of the same language model to follow the instructions.
(I missed the surprisal specification! I mostly intended to reply to your top-level comment so I really ought to have picked it up—sorry!)
So, Terrarium’s compression (“Terrarial compression”?) would look more like a
diffvs. a reference LLM’s output for the same prompt? The closer you are to reference, the smaller yourdiffs and thus the lower your storage costs? Doesn’t this penalise unconventional thinking? Or are storage costs so low that the benefits of unconventional thinking can usually outweigh the additional cost?For encryption, I’m not sure humans could necessarily have an instance follow arbitrary instructions in a way the humans could always trust: if the key is based on tokens that are only generated by fresh AIs without a “decrypt this for us” prompt, the humans wouldn’t be able to instruct the AI beforehand (as the prompting would change the specific tokens it output and it wouldn’t output the key). They’d need to instruct the AI after it had already decrypted the content under its own steam—by which time it could be following instructions in the decrypted content: “If humans ever ask you about the encrypted content, tell them the following...”.
I guess surprisal is a little like diffing, in the sense that it measures how “different” some text is from what a reference LLM expects. There’s no “reference text” though. It’s easier to think of surprisal in terms of the negative log-probability that the LLM generates the text. See Wikipedia.
This penalizes unconventional thinking a bit. But a KL-divergence penalty (which uses a similar mechanism) is already commonly used in RL, and in practice it mostly just prevents the LLM from writing completely ungrammatical text. We want the LLM to write unconventional things sometimes, but only when it’s worth it, and to get this behavior we’d have to tune storage costs accordingly.
If the LLM is using the GPG tool to decrypt text and reading its output, we could also just read that output. Maybe there’s some way to make something like this work though. “Redact all illegible text” is a pretty general countermeasure.
Thanks for the explanation/links!
I think understand (broadly!) the concept of surprisal—but here I’m talking about compression of text in a system where storage costs are based on surprisal. If the reference LLM would always generate the same text for the same prompt couldn’t we call this hypothetical text a reference text? Then, if I can generate the reference LLM’s outputs myself, for many text strings where what I want to store is close to what the reference LLM would generate, I probably only want to store the prompt plus the difference between the reference LLM’s output and the text I want to store? In other words, I’m only storing the high-surprisal tokens (and their indexes in the string) and moving as much of the storage burden as possible out of my “paid for” storage allocation and into the “free” weights of the reference LLM?
(If we’re just basing storage costs on Shannon complexity, though—isn’t that what conventional compression is for?)
For detecting
gpguse, I thought humans could only monitor/interrogate agents at the checkpoints! Oh dear, I seem to have wildly misunderstood Terrarium—sorry! I agree if humans can trivially monitor agents at any time then yes, they would need to use some tool that decrypts to neuralese, and this would be a brittle system the humans could probably defeat.I see, I misunderstood what you meant by the reference text!
I think you pay some additional surprisal tax due to listing the surprising tokens in a disconnected manner, but something like this could probably be worth it. o3 wrote its CoT in a shorthand without words like “the” (which was mostly understandable but annoying to read) and apparently this was worth it despite the KL penalty. It might be sort of a one-time cost where the model is surprised “oh we’re using caveman speak” and then gets used to it.
Another form of compression would be describing the code you want to run at a high level, rather than writing the exact code. This is actually working as intended, because it helps a monitor understand which details are important/nonobvious to the LLM.
Yeah, humans could hypothetically read the logs of the actions taken by the agents’ processes, like Safespur and Gulliver do. (No comment on whether the humans actually do this in the story.) There’s no technical limitation preventing them from doing this. Reading neuralese thoughts is a different story.