Roy Rinberg
I’d be curious to see what comes from your experiments.
I also think that there’s a difference in the control setting in what the trade-offs are between having the small model vs. the large model “initiate”. In the compression setting the small model asks a lot of questions, and this lets you avoid sending the questions “over the wire”; with control, you’re likely fine with bidirectional flow: the small model asks for help when it needs it, or the large model interjects when it wants.
I haven’t thought too carefully on that.

Interestingly, even if every question from T is a yes-or-no question, the above method would let U answer them in fewer bits on average! For example, if T already thinks something is 90% likely, and U confirms that the answer is indeed “Yes”, that provides T with only -log2(0.90) = 0.152 bits of information.
Yeah—definitely! You can use arithmetic coding to encode those bits, according to the small model’s priors. And if it turns out to be less helpful than sending the bits directly, the large model would be able to assess that. So you never need to send more than N bits for N questions, but could be able to do much better. I like this idea a good deal; we do mention this in the paper, but it’s not a major point.
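To make the arithmetic-coding point concrete, here is a minimal sketch (float-based, fine for short sequences; a production coder would use integer renormalization). It encodes a sequence of yes/no answers against the small model’s priors, and the interval width shows the bit cost: confident priors make “yes” nearly free.

```python
import math

def encode(answers, priors):
    # Narrow [low, high) once per answer; "yes" takes the top
    # p-fraction of the interval, "no" the bottom (1-p)-fraction.
    low, high = 0.0, 1.0
    for ans, p in zip(answers, priors):
        split = low + (high - low) * (1 - p)
        if ans:
            low = split    # "yes": upper sub-interval of width p
        else:
            high = split   # "no": lower sub-interval of width 1-p
    # Any number in [low, high) identifies the sequence;
    # about -log2(high - low) bits suffice to name one.
    return (low + high) / 2, -math.log2(high - low)

def decode(code, priors):
    low, high = 0.0, 1.0
    answers = []
    for p in priors:
        split = low + (high - low) * (1 - p)
        if code >= split:
            answers.append(True)
            low = split
        else:
            answers.append(False)
            high = split
    return answers

priors = [0.9, 0.9, 0.5, 0.99]
answers = [True, True, False, True]
code, bits = encode(answers, priors)
assert decode(code, priors) == answers
# bits = 0.152 + 0.152 + 1.0 + 0.0145 ≈ 1.32, vs. 4 raw bits
```

When the priors are bad (e.g. a confident prior turns out wrong), the cost per answer exceeds 1 bit, which is exactly the case where the large model could notice and fall back to sending the raw bits.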
I think there are a lot of specifics to explore in the implementation of the interaction, and our work explores relatively constrained space (we did try a few other LLM-SLM interactions, and mostly got less impressive results).
Very helpful data points and BOTEC. I largely agree with your estimates. Thank you
Yeah—I agree.
I claim it’s extremely hard for U to make T insert a backdoor just by answering its yes-or-no questions.
Seems very likely. I think you don’t need to be as restrictive, though: I think forcing U and T to communicate through any kind of information bottleneck (not just binary QA) should work similarly.
For example, the dual-LLM setting is used in CAMEL to defend against prompt injection. There, they have a second LLM that monitors the actions of the first LLM; the second LLM only sees metadata that summarizes the interaction the first LLM is trying to do (e.g. “agent X is requesting to send data type Y to location Z”). This restriction on the types of things the second LLM sees makes it difficult to prompt-inject the second LLM.
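A loose sketch of that bottleneck pattern (this is illustrative, not CAMEL’s actual API): the monitor only ever receives a string rendered from a few structured fields, so the raw, possibly prompt-injected interaction never reaches it.

```python
from dataclasses import dataclass

@dataclass
class ActionMetadata:
    agent: str        # which agent is acting
    data_type: str    # what kind of data it wants to move
    destination: str  # where it wants to send it

def describe(action: ActionMetadata) -> str:
    # The only string the monitor LLM ever sees; it is rendered
    # from structured fields, not from the raw interaction content.
    return (f"agent {action.agent} is requesting to send "
            f"data type {action.data_type} to location {action.destination}")
```

Of course, the structured fields must themselves be populated by trusted code (or sanitized), or the bottleneck leaks.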
I think safety through information-bottlenecks is a good direction. While I don’t personally know people working in this direction, I think this is something people are exploring. I’d be curious to hear what you or others find.
Wouldn’t someone notice if 100% of requests to OpenAI failed for 24 hours straight though?
Yes, definitely. A few responses:
Egress-limiting sets up an information-theoretic backstop: there is simply no way to steal more than X GB in a day across a channel that allows for X GB per day. The more time this backstop buys us, the more leeway we have with our monitoring + incident-response systems.
I don’t think it’s likely that 100% of the traffic would be corrupted, but I don’t know what the most likely world is, and it’s easy to linearly scale based on what security people think is the most likely outcome (e.g. 10% of the datacenter → 10x increase in timelines).
There’s a cognitive-load benefit for security engineering—it’s extremely convenient to think about egress-limiting because it’s an information-theoretic bound. You should be able to easily confirm that the egress-limiting is implemented properly because it should be <200 lines of code (or is literally just the width of a wire). And “complexity is the worst enemy of security” (post by Bruce Schneier).
To give a somewhat long-winded tangent: I find it useful to think about exfiltration in three settings:
Setting 1: Fast exfiltration. An extremely rapid exfiltration occurs before human intervention is possible. For example, an adversary compromises 10,000 servers and leaks 1⁄10,000th of the weights from each of them.
Setting 2: Intermediate exfiltration. A leak occurs over days or weeks—that is, a human-operable amount of time.
Setting 3: Slow exfiltration. There is a length of time (e.g., 2 years) where even if the adversary could steal the model weights in that amount of time, it wouldn’t matter by the end, because new models would be meaningfully different.
In this framing, egress-limiting is tasked with moving us from setting 1 to setting 2. And then we can have monitoring + response schemes that move us from setting 2 to setting 3. (Relatedly, you may find it interesting to take a look at my other work on inference verification to detect steganographic exfiltration attempts.)
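The backstop arithmetic is simple enough to write down. The numbers below are hypothetical placeholders (not from the post or any lab): a hard cap of X GB/day means weights of size W GB take at least W/X days to exfiltrate, no matter how the attacker schedules the transfer.

```python
# Hypothetical numbers, for illustration only.
weights_gb = 1_000           # assumed: ~1 TB of model weights
egress_cap_gb_per_day = 10   # assumed: hard daily egress cap

# Information-theoretic floor on exfiltration time: no scheduling
# or compression of the channel itself can beat this.
floor_days = weights_gb / egress_cap_gb_per_day
print(floor_days)  # 100.0
```

If that floor exceeds the model’s useful lifetime, we’ve moved from setting 2 to setting 3 by construction.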
It seems like the amount of egress an attacker could use without getting caught is a function of buffers
Again, yes I agree. 2 responses:
Yes, we would certainly need to include buffers on these bandwidths. The calculations in the post are back-of-the-envelope, and I imagine the amount of buffer required is still in the same order of magnitude. (Plus, I don’t work at a lab, nor would I be allowed to share such numbers.)
More algorithmically, there are rate-limiting methods that permit bursts above a baseline rate while enforcing a hard cap on the total across a longer period (e.g., 1 day). For one, there’s a method called Token Bucket Rate Limiting.
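A minimal token-bucket sketch (one common formulation; real deployments would persist state and use byte-accurate accounting): tokens refill at a fixed rate up to a capacity, so short bursts up to the capacity are allowed while long-run throughput can never exceed the refill rate.

```python
import time

class TokenBucket:
    """Permit bursts while enforcing a hard long-run egress cap.

    `rate` is tokens (e.g. bytes) refilled per second; `capacity`
    is the maximum burst size. Average throughput over any long
    window is bounded by `rate` regardless of burst pattern.
    """
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, nbytes: float) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, clamped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False
```

For a daily cap, you’d set `rate = daily_cap / 86_400` and pick `capacity` to be the largest burst you’re willing to tolerate; the whole enforcement logic stays well under the ~200-line budget mentioned above.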
Last, most speculative, point: in the most extreme world where a powerful adversary (e.g. a nation state) physically takes over a datacenter, egress-limiting (if connected to other defenses, like tamper-proof enclosures that destroy the GPU when opened) gives us time to reacquire the GPUs/model-weights, because even with sudo access to the GPU, they still can’t exfiltrate weights in less than X amount of time. Though, this is probably wrong in some way: either nation-states may be able to remove the enclosures and side-channel into the GPU by some other means, or once the weights are stolen it’s unlikely we steal them back.
Text Compression Can Help Secure Model Weights
I mean—I hear what you’re saying, and I don’t agree. It’s not that important of a post, so it’s fine, but I think it doesn’t really hook people.
I don’t think “We Added Typos to a Benchmark. Then Haiku Saturated It.” communicates what the post is about.
I don’t think it’s about saturation at all—as evidence of this, I ran “ctrl-f” for “satur” and there are 0 mentions of it in the body. Less critically, “we added typos to a benchmark” is more the action than the reason, which kind of just leaves someone asking “why is this relevant?”
I will say, now that I’ve fully read this, I don’t fully get why this is the title that you prefer.
It seems like we should say something about (1) what we are measuring or (2) our conclusion. This title leaves me unsure what this post will be about, except for the pace of benchmark saturation, which isn’t even the point of the blog.
Scores are lower bounds
This is true in this context, but it’s worth saying that these models seem to be trained to do well on these benchmarks, so it’s not entirely true that scores are lower bounds.
The size of the “lower bound”-ness is hard to comment on, but if you can provide some input on the amount of improvement you think there is from tuning your harness, that is a meaningful conclusion. That could be what we build the blog around.
When Anthropic dropped Opus 4.6, we asked it to figure it out. From our eval logs, Opus 4.6 observed that for a single BigCodeBench prompt, Haiku and Opus often generated multiple code blocks. Furthermore, as typo rate increases, Haiku shifts its behavior for ~20% of its responses from generating multiple code blocks to generating just a single code block.
I also have to say, I don’t know what you mean by “code block”: does this mean response?
When Anthropic dropped Opus 4.6, we asked it to figure it out.
I think this is a bit informal.
Also, it’s odd to say this because it sounds like you’re saying “this isn’t what we think” or “if it’s wrong, judge it, not us”; you have to claim responsibility for any findings that AI generates.
I would not say this
The Anomaly is Benchmark-Specific
We then tested whether this “impossible typo effect” holds for Haiku and Opus on other benchmarks. We chose BBH and GPQA since Haiku struggles reasonably without introducing typos. Here, we no longer observed the impossible typo effect, and Haiku’s capabilities decreased with typo rates.
This should be one figure, not two, and I think this could be combined with the previous section as well.
We chose BBH and GPQA
Should link to the benchmarks; I’m actually not immediately clear what BBH is (but I get it upon further inspection).
“impossible typo effect”
I don’t think this is a term you’ve introduced before
The Anomaly is Model-Specific
We then tested if other small models have this “impossible typo effect”. We found that, unlike Haiku, the capabilities of GPT-4.1-mini slightly decreased as typos increased.
This plot should have multiple models on the same plot; otherwise it’s a somewhat confusing section header.
the hypothesis is refuted.
I really am fairly opposed to sentences like this
research
observed result in humans (or cite something that isn’t a BBC article)
thinks
think
Squint
It isn’t clear what “squint” means; I think maybe “think” would be better.
We were curious if LLMs are robust under various typo rates.
Hmm, a couple of thoughts. I don’t think my write-up is perfect, but how about something more like this:
> We were curious if LLMs produce the same response when there are typos in the prompt. To test this, we injected typos into the prompts from BigCodeBench and ran different Claude models. We found that while the accuracy for Opus gradually declined with typo rate, for Haiku, accuracy actually increased as the typo rate increased. This blog investigates this counter-intuitive phenomenon.

(Points I care about:
1. “robust under various typo rates” doesn’t sound like a linear increase in typos, which is what we actually do.
2. “We double-checked our code and asked Claude to do so too, but we couldn’t find any bugs”—this should be assumed for all your work.
3. “The mystery begins.”—this format does sound quite AI generated.)
One of the follow-up works needed to make something like this valuable is more analysis on what makes two summaries “equivalent”.
For example, this work, Asking and Answering Questions to Evaluate the Factual Consistency of Summaries, does the following:
LLM 1 generates a response R and a summary S
LLM 2 comes up with a series of (binary) questions about that response (Q)
LLM 3 is given response R, and then asked to answer questions Q (call this A)
LLM 3 is given summary S, and then asked to answer questions Q (call this A’)
Compare A and A’, and if they are the same, the summary S contains all the meaningful info from response R.
I think this is a good direction for more exploration
(note, LLM1/2/3 can all be the same LLM, just with fresh contexts)
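The LLM-3 steps above can be sketched as follows. `ask_llm` is a hypothetical stand-in for any LLM call with a fresh context; this is not the paper’s actual implementation, just the shape of the check.

```python
def summary_is_faithful(response, summary, questions, ask_llm):
    """Score how much of response R's question-answerable content
    survives in summary S: answer the same binary questions Q from
    each context, and count how often the answers (A vs. A') agree."""
    def answer(context, question):
        # Fresh prompt per call, so nothing leaks between contexts.
        return ask_llm(f"Context:\n{context}\n\nAnswer yes or no: {question}")

    a = [answer(response, q) for q in questions]        # A: answers from R
    a_prime = [answer(summary, q) for q in questions]   # A': answers from S
    return sum(x == y for x, y in zip(a, a_prime)) / len(questions)
```

With a trivial keyword-matching stub in place of a real LLM, a summary that drops the key fact disagrees on the probe questions and scores low, while a faithful one scores 1.0; the interesting open question is what question-generation strategy (LLM 2) makes this score track “meaningful info” rather than surface overlap.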