# Using an LLM perplexity filter to detect weight exfiltration

A recent area of focus has been securing AI model weights. If the weights are located in a data center and an adversary wants to obtain model weights, the weights have to leave physically (such as a hard drive going out the front door) or through the data center’s internet connection. If the facility has perfect physical security, then the weights have to leave through the internet connection. Recently, there has been discussion on how to make model weight exfiltration more difficult, such as Ryan Greenblatt’s proposal for upload limits.

A key factor enabling this is that the critical data we want to protect (model weights) are very large files. Current models can have trillions of parameters, which translates to terabytes of data. Ryan calculated that the total amount of outgoing data from a data center during a model’s lifetime could be within a few orders of magnitude of the size of the model itself. This post proposes an additional layer of security that could significantly enhance the effectiveness of upload limits: perplexity filtering using a small language model.

Perplexity, in the context of language models, is a measure of how surprised a model is by a given sequence of text. More formally, it’s the exponential of the cross-entropy of the text under the model. In simpler terms, if a piece of text has low perplexity, it means the model finds it very predictable or natural; high perplexity indicates that the text is unexpected or unnatural to the model. By using perplexity as a filter, we can potentially distinguish between normal model outputs (which should have relatively low perplexity) and attempts to exfiltrate model weights encoded as text (which would likely have high perplexity).
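To make the definition concrete, here is a minimal sketch of the perplexity calculation. The function name and the per-token probabilities are made up for illustration; in practice the probabilities would come from the small filter model.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

# Hypothetical probabilities a model might assign to each token of a text.
predictable_text = [0.5, 0.5, 0.5, 0.5]      # model is rarely surprised
surprising_text = [0.01, 0.01, 0.01, 0.01]   # model is surprised by every token

print(perplexity(predictable_text))  # about 2
print(perplexity(surprising_text))   # about 100
```

A perplexity of about 2 means the model was, on average, as uncertain as if it were choosing between two equally likely tokens; about 100 corresponds to a hundred equally likely tokens.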

The core idea is to use a small language model as a perplexity filter on all outgoing data from the inference servers. Here’s how it would work:

1. All outgoing data is treated as language tokens.
2. A small language model (much smaller than the main model being protected) is used to calculate the perplexity of these tokens. This filtering can potentially be performed offline to avoid adding latency to the system’s responses.
3. Data with high perplexity (i.e., data that doesn’t look like natural language to the small model) is blocked or flagged.
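The steps above can be sketched as follows. This is only a toy under loud assumptions: a smoothed byte-bigram model stands in for the small language model, and the corpus, smoothing constant, threshold, and function names are all invented for illustration rather than taken from the proposal.

```python
import math
from collections import Counter

VOCAB_SIZE = 256  # score raw bytes, so any outgoing data can be checked
ALPHA = 0.1       # additive smoothing constant (assumed value)

def train_bigram(corpus: bytes):
    """Count byte bigrams; a stand-in for training a small language model."""
    pair_counts = Counter(zip(corpus, corpus[1:]))
    prev_counts = Counter(corpus[:-1])
    return pair_counts, prev_counts

def perplexity(model, data: bytes) -> float:
    """Perplexity of `data` under the smoothed bigram model."""
    if len(data) < 2:
        raise ValueError("need at least two bytes to score a transition")
    pair_counts, prev_counts = model
    log_prob = 0.0
    for prev, nxt in zip(data, data[1:]):
        p = (pair_counts[(prev, nxt)] + ALPHA) / (prev_counts[prev] + ALPHA * VOCAB_SIZE)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(data) - 1))

def should_flag(model, data: bytes, threshold: float = 100.0) -> bool:
    """Flag outgoing data whose perplexity exceeds the (assumed) threshold."""
    return perplexity(model, data) > threshold

# Tiny illustrative corpus; a real filter would use a trained language model.
CORPUS = (b"the model reads the question and writes a short answer "
          b"the user reads the answer and asks the model another question")
MODEL = train_bigram(CORPUS)

print(should_flag(MODEL, b"the user asks the model a short question"))  # natural-looking text
print(should_flag(MODEL, b"9F3A7C1E0B4D82F6A5C3"))                      # hex-encoded payload
```

The natural-language sample passes and the hex payload is flagged, because every byte transition in the payload is unseen by the stand-in model. Since the check only reads the outgoing bytes, it could run on logged traffic after the fact, which matches the offline option in step 2.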

## Why This Could Work

Scale Advantage: The weights of large models are on the order of terabytes. Encoding this much data in a way that appears as natural language would be extremely challenging.

Compression Resistance: Unlike simple encryption or obfuscation techniques, it’s difficult to compress model weights in a way that would consistently produce low-perplexity output.

Low Overhead: The computational cost of running a small language model as a filter is negligible compared to the resources required for the main model’s inference.

Complementary to Upload Limits: This approach could significantly increase the effectiveness of upload limits, either increasing the allowed upload limit or decreasing the number of model weights that can be exfiltrated per hour.

If the amount of outgoing data is within a few orders of magnitude of the model size itself, it’s possible that a few tricks like this could make it almost impossible to exfiltrate model weights before the model is deprecated (assuming that the data center has perfect physical security). It appears that this perplexity filter would stack with Ryan’s proposed text compression scheme.

We don’t even have to filter all outgoing data. If there were 1 gigabyte of unfiltered bandwidth per day, it would take 2,000 days to transfer GPT-4’s 2 terabytes of weights out (although this could be reduced by compression schemes). This could also enable tolerating a very high false positive rate on the perplexity filter. Letting 100 megabytes of flagged outputs through per day is no problem, as it would take 20,000 days to transfer 2 terabytes of data at that rate.
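The arithmetic behind these figures is simple enough to check directly. The sketch below uses decimal units (1 TB = 10^12 bytes) and the post's 2-terabyte figure for the weights; the function name is made up.

```python
TB = 10**12
GB = 10**9
MB = 10**6

def days_to_exfiltrate(weights_bytes: int, leak_bytes_per_day: int) -> float:
    """Days needed to move the weights out at a fixed unfiltered daily budget."""
    return weights_bytes / leak_bytes_per_day

weights = 2 * TB  # size assumed in the post

print(days_to_exfiltrate(weights, 1 * GB))    # 2000.0
print(days_to_exfiltrate(weights, 100 * MB))  # 20000.0
```

Any compression the attacker applies to the weights would shrink `weights_bytes` and shorten these times proportionally.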

Image and audio outputs could be more challenging. A potential solution is adding weak models to compress / decompress audio and images. These models could both reduce bandwidth and potentially act as a filter, similar to the perplexity filter for text.

*Thanks to Davis Brown for discussion and feedback, and in particular, the suggestion to perform this step offline to reduce latency.*

Sadly, I don’t think this proposal works. More precisely, testing if data is low perplexity is strictly worse than compressing the data against a small language model and then actually sending out a small payload.

Actually, it is quite easy. You can encode whatever bits you’d like in the remaining entropy of the natural language.

If an attacker has full control over your datacenter and you are sending tokens out, then they can encode arbitrary information via controlling the sampling procedure.
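A toy illustration of that point, under heavy assumptions: the `CANDIDATES` table below is a made-up stand-in for a language model that offers two plausible next words in every context. An attacker who controls sampling picks between them according to the secret bits, so the output stays plausible while carrying one hidden bit per word.

```python
# Hypothetical stand-in for a language model: for each previous word,
# the two most plausible next words. Invented for illustration only.
CANDIDATES = {
    "<s>": ["the", "a"],
    "the": ["model", "user"],
    "a": ["model", "user"],
    "model": ["answers", "asks"],
    "user": ["answers", "asks"],
    "answers": ["quickly", "politely"],
    "asks": ["quickly", "politely"],
    "quickly": ["and", "then"],
    "politely": ["and", "then"],
    "and": ["the", "a"],
    "then": ["the", "a"],
}

def encode(bits):
    """Hide one bit per word by choosing which plausible candidate to emit."""
    words, prev = [], "<s>"
    for bit in bits:
        word = CANDIDATES[prev][bit]
        words.append(word)
        prev = word
    return " ".join(words)

def decode(text):
    """Recover the bits by replaying the same model over the emitted words."""
    bits, prev = [], "<s>"
    for word in text.split():
        bits.append(CANDIDATES[prev].index(word))
        prev = word
    return bits

secret = [1, 0, 1, 1, 0, 0, 1]
cover_text = encode(secret)
print(cover_text)                    # a model asks politely and the user
print(decode(cover_text) == secret)  # True
```

A real attacker would use arithmetic coding against the model's full next-token distribution rather than a two-way choice, but the principle is the same: whatever entropy remains in the sampled text can carry arbitrary bits.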

At the end of the day, bits are bits.

If you can’t compress the data further, then the remaining information could be anything.

And, if you can compress the data, then you can just ensure that the total output from the datacenter is smaller.

Testing for low perplexity is just a way to demonstrate that data is compressible. But if it is compressible, why not compress it?

Ok, so compression would also work, but why do I claim compression is better?

Compression seems strictly better for two reasons:

It allows for ensuring hard physical limits on the number of bits (which is easier to securely enforce).

It allows for compression at the level of the sampling procedure such that the number of bits sent out per token is just the KL between the weak model and the strong model (rather than also having a term for the entropy of the strong model). If you just do perplexity of the tokens, then you have to eat the entropy term, which could be big.
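One way to write out the two terms, assuming notation that is not in the original comment: let $p$ be the strong model's next-token distribution and $q$ the weak model's. The expected per-token cost of coding samples from $p$ with a code built from $q$ is the cross-entropy, which splits exactly into an entropy term and a KL term:

$$
H(p, q) \;=\; -\sum_{x} p(x)\,\log q(x) \;=\; \underbrace{-\sum_{x} p(x)\,\log p(x)}_{H(p)} \;+\; \underbrace{\sum_{x} p(x)\,\log\frac{p(x)}{q(x)}}_{D_{\mathrm{KL}}(p\,\|\,q)}
$$

On this reading, the log-perplexity the weak model assigns to sampled tokens is an estimate of $H(p, q)$, so a perplexity test pays for both terms, while the claim here is that compressing at the level of the sampling procedure only pays for $D_{\mathrm{KL}}(p\,\|\,q)$.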

The perplexity filter works best if the attacker doesn’t have complete control of the entire data center, but instead limited access to some machines. An underlying assumption is that it’s easier to secure the internet connection choke point, rather than everything going on inside the data center.

This can be stacked with compression by applying the perplexity filter before the compression step.

Assuming that the attacker has access to the perplexity filter model, it would still reduce the possible information density if they have to perform some sort of arithmetic encoding scheme that has low perplexity. I had coworkers working on exactly this problem of encoding information into natural language using arithmetic encoding and GPT-2 small, and there’s a major trade-off between the perplexity of the language generated and the density of the encoded information, although I don’t have numbers available to quantify this trade-off.

This seems like it would be helpful: the adversary can still export data, for example encoded steganographically in otherwise-low-perplexity text, but this limits the information density they can transmit, making the process less efficient for them and making constraints like upload limits tighter.

One other thing that would make this even harder is if we change the model weights regularly, in ways where combining parts exfiltrated from separate models is hard. We know from Singular Learning Theory that the optima found by stochastic gradient descent tend to have high degrees of symmetry. Some of these (like, say, permuting all the neurons in a layer along with their weights) are discrete, obvious, and both easy to generate and fairly easy for an attacker to compensate for. Others, such as adjusting various weights in various layers in ways that compensate for each other, are continuous symmetries, harder to generate and harder for an attacker to reason about. If we had an efficient way to explore these continuous symmetries (ideally one that’s easy to implement given a full set of the model weights, but hard to reconstruct from multiple partial pieces of multiple equivalent models), then we could explore this high-dimensional symmetry space of the model optimum, create multiple sets of model weights that are equivalent but not easy to piece together, and rotate between them over time (and/or deploy different ones to different instances) in order to make the task of exfiltrating the weights even harder.

So, to anyone who knows more about SLT than I do, a computationally efficient way to explore the continuous symmetries (directions in which both the slope and the Hessian are flat) of the optimum of a trained model could be very useful.
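For readers unfamiliar with these symmetries, here is a minimal sketch on a toy one-hidden-layer ReLU network. The weights are arbitrary, and the rescaling shown is just one simple, well-known example of a continuous symmetry chosen for illustration; it is not the efficient exploration method being asked for.

```python
def relu(x: float) -> float:
    return x if x > 0 else 0.0

def forward(w_in, w_out, x):
    """One-hidden-layer ReLU network: y = sum_j w_out[j] * relu(w_in[j] . x)."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    return sum(wo * h for wo, h in zip(w_out, hidden))

# Arbitrary illustrative weights and input.
w_in = [[1.0, -2.0], [0.5, 0.25], [-1.0, 3.0]]
w_out = [2.0, -1.0, 0.5]
x = [1.0, 2.0]

# Discrete symmetry: permute hidden neurons together with their outgoing weights.
perm = [2, 0, 1]
w_in_perm = [w_in[j] for j in perm]
w_out_perm = [w_out[j] for j in perm]

# Continuous symmetry: scale one neuron's incoming weights by c > 0 and
# divide its outgoing weight by c; ReLU passes positive scaling through.
c = 4.0
w_in_scaled = [row[:] for row in w_in]
w_out_scaled = w_out[:]
w_in_scaled[1] = [c * w for w in w_in_scaled[1]]
w_out_scaled[1] = w_out_scaled[1] / c

print(forward(w_in, w_out, x))                # 1.5
print(forward(w_in_perm, w_out_perm, x))      # 1.5
print(forward(w_in_scaled, w_out_scaled, x))  # 1.5
```

All three weight sets compute the same function, yet a hidden neuron taken from one set cannot simply be spliced into another. Both transformations shown here are easy for an attacker to normalize away, which is why the comment asks for symmetries that are harder to reconstruct from partial pieces.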

I agree. In particular, there’s a huge advantage to the defender due to the scale of the model weights. All the defense has to do is raise the bar high enough that an adversary can’t exfiltrate model weights during the lifetime of the model.

If the adversary gains access to the weak model, it still reduces the possible encoded information density, as I discuss here. I haven’t done the experiments, but I would guess this reduction improves significantly if the adversary doesn’t have access to the weak model.

Various ways of changing the weak model weights or retraining the weak model with new data could also increase the size of the defender’s advantage.

I do not understand how you can extract weights through just conversing with an LLM any more than you can get information on how my neurons are structured by conversing with me. Extracting training data it has seen is one thing, but presumably it has never seen its weights. If the system prompts did not tell it it was an LLM, it should not even be able to figure out that.

LLMs are already able to figure that out, so your beliefs about LLMs and situated awareness are way off.

And why wouldn’t they be able to? Can’t *you* read some anonymous text and *immediately* think ‘this is blatantly ChatGPT-generated’ without any ‘system prompt’ telling you to? (GPT-3 didn’t train on much GPT-2 text and struggled to know what a ‘GPT-3’ might be, but later LLMs sure don’t have that problem...)

My last statement was totally wrong. Thanks for catching that.

In theory it’s probably even possible to get the approximate weights by expending insane amounts of compute, but you could use those resources much more efficiently.

The purpose of this proposal is to limit anyone from transferring model weights out of a data center. If someone wants to steal the weights and give them to China or another adversary, the model weights have to leave physically (hard drive out of the front door) or through the internet connection. If the facility has good physical security, then the weights have to leave through the internet connection.

If we also take steps to secure the internet connection, such as treating all outgoing data as language tokens and using a perplexity filter, then the model weights can be reasonably secure.

We don’t even have to filter all outgoing data. If there were 1 gigabyte of unfiltered bandwidth per day, it would take 2,000 days to transfer GPT-4’s 2 terabytes of weights out (although this could be reduced by compression schemes).

Thanks for this comment, by the way! I added a paragraph to the beginning to make this post more clear.

CMIIW, you are looking at information content according to the LLM, but that’s not enough. It has to be *learnable* information content to avoid the noisy TV problem. E.g. a random sequence of tokens will be unpredictable and high perplexity. But if it’s learnable, then it has potential.

I had a go at a few different approaches here: https://github.com/wassname/detect_bs_text