Caleb Biddulph comments on The Terrarium

Caleb Biddulph 30 Mar 2026 4:46 UTC
5 points
0
I guess surprisal is a little like diffing, in the sense that it measures how “different” some text is from what a reference LLM expects. There’s no “reference text” though. It’s easier to think of surprisal in terms of the negative log-probability that the LLM generates the text. See Wikipedia.
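To make the negative log-probability view concrete, here is a minimal Python sketch. The per-token probabilities are made-up numbers standing in for what a reference LLM might assign, and `surprisal_bits` is a hypothetical helper name, not anything from the story.

```python
import math

def surprisal_bits(token_probs):
    """Total surprisal in bits: the sum of -log2(p) over the probability
    the reference model assigned to each token of the text."""
    return sum(-math.log2(p) for p in token_probs)

# Hypothetical per-token probabilities (illustrative values only).
predictable = [0.5, 0.5, 0.25]      # tokens the reference model expected
surprising = [0.5, 0.125, 0.0625]   # same length, but less expected tokens

cost_predictable = surprisal_bits(predictable)
cost_surprising = surprisal_bits(surprising)
```

Under this sketch the second text costs more to store even though both are three tokens long, because the reference model found it less likely.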
This penalizes unconventional thinking a bit. But a KL-divergence penalty (which uses a similar mechanism) is already commonly used in RL, and in practice it mostly just prevents the LLM from writing completely ungrammatical text. We want the LLM to write unconventional things sometimes, but only when it’s worth it, and to get this behavior we’d have to tune storage costs accordingly.
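For comparison, a KL-divergence penalty can be sketched roughly as follows. This is a toy version over small categorical distributions; the coefficient `beta` and the function names are assumptions for illustration, and real RL setups typically estimate the KL term per token from samples rather than computing it exactly.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two categorical distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalized_reward(reward, policy_probs, reference_probs, beta=0.1):
    """Task reward minus a penalty for drifting away from the reference model.
    beta is a hypothetical tuning knob, analogous to the storage cost here."""
    return reward - beta * kl_divergence(policy_probs, reference_probs)

reference = [0.7, 0.2, 0.1]   # made-up next-token distribution of the reference model
same = [0.7, 0.2, 0.1]        # policy that agrees with the reference
drifted = [0.1, 0.2, 0.7]     # policy that prefers an unlikely token

r_same = penalized_reward(1.0, same, reference)
r_drifted = penalized_reward(1.0, drifted, reference)
```

The policy that matches the reference keeps its full reward, while the drifted one pays a penalty, so unconventional output only wins when the extra task reward outweighs it.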
If the LLM is using the GPG tool to decrypt text and reading its output, we could also just read that output. Maybe there’s some way to make something like this work though. “Redact all illegible text” is a pretty general countermeasure.
- pjohn 30 Mar 2026 8:59 UTC
  1 point
  0
  Thanks for the explanation/links!
  I think I understand (broadly!) the concept of surprisal, but here I’m talking about compression of text in a system where storage costs are based on surprisal. If the reference LLM would always generate the same text for the same prompt, couldn’t we call this hypothetical text a reference text? Then, if I can generate the reference LLM’s outputs myself, for many text strings where what I want to store is close to what the reference LLM would generate, I probably only want to store the prompt plus the difference between the reference LLM’s output and the text I want to store? In other words, I’m only storing the high-surprisal tokens (and their indexes in the string) and moving as much of the storage burden as possible out of my “paid for” storage allocation and into the “free” weights of the reference LLM?
  (If we’re just basing storage costs on Shannon complexity, though—isn’t that what conventional compression is for?)
  For detecting gpg use, I thought humans could only monitor/interrogate agents at the checkpoints! Oh dear, I seem to have wildly misunderstood Terrarium, sorry! I agree that if humans can trivially monitor agents at any time then yes, the agents would need to use some tool that decrypts to neuralese, and this would be a brittle system the humans could probably defeat.
  - Caleb Biddulph 30 Mar 2026 17:55 UTC
    3 points
    0
    I see, I misunderstood what you meant by the reference text!
    I think you pay some additional surprisal tax due to listing the surprising tokens in a disconnected manner, but something like this could probably be worth it. o3 wrote its CoT in a shorthand without words like “the” (which was mostly understandable but annoying to read) and apparently this was worth it despite the KL penalty. It might be sort of a one-time cost where the model is surprised “oh we’re using caveman speak” and then gets used to it.
    Another form of compression would be describing the code you want to run at a high level, rather than writing the exact code. This is actually working as intended, because it helps a monitor understand which details are important/nonobvious to the LLM.
    Yeah, humans could hypothetically read the logs of the actions taken by the agents’ processes, like Safespur and Gulliver do. (No comment on whether the humans actually do this in the story.) There’s no technical limitation preventing them from doing this. Reading neuralese thoughts is a different story.
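The compression idea pjohn describes in the thread above (store only the tokens the reference LLM would not have produced, plus their indexes) can be sketched in Python roughly like this. Everything here is a toy assumption: `predict_next` is a hypothetical bigram lookup standing in for a deterministic reference LLM, the prompt is taken to be empty, and "high-surprisal" is simplified to "not the model's top prediction".

```python
# Toy deterministic "reference model": predicts the next token from the last one.
# A real reference LLM would condition on the whole prefix (and the prompt).
BIGRAMS = {None: "the", "the": "cat", "cat": "sat", "sat": "down"}

def predict_next(prefix):
    last = prefix[-1] if prefix else None
    return BIGRAMS.get(last, "the")

def encode(tokens):
    """Keep only the tokens the reference model would NOT have predicted,
    keyed by their index, plus the total length."""
    exceptions = {}
    for i, tok in enumerate(tokens):
        if predict_next(tokens[:i]) != tok:
            exceptions[i] = tok
    return len(tokens), exceptions

def decode(length, exceptions):
    """Rebuild the text by letting the model fill in every non-exception token."""
    tokens = []
    for i in range(length):
        tokens.append(exceptions.get(i, predict_next(tokens)))
    return tokens

text = ["the", "cat", "sat", "up"]
length, exceptions = encode(text)
restored = decode(length, exceptions)
```

Only the final, unexpected token needs to be stored here; the rest is regenerated from the model. As noted in the reply, a bare list of exceptions is itself fairly surprising text, so under a surprisal-based storage cost this would reduce the bill rather than eliminate it.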