faul_sname comments on Defending Against Model Weight Exfiltration Through Inference Verification

faul_sname 16 Dec 2025 6:36 UTC
7 points
0
how would you steal model weights out of a datacenter without getting caught?
Examine which log messages logged by the inference servers have the highest entropy, figure out which fields in those log messages are unique on a per-log-message basis AND are unlikely to be examined in the ordinary course of debugging a production issue, encode a random few bytes of model weights+offset per log message. A few trillion tokens that the exfiltrator has to read is a lot of tokens^[1], but a few TB of logs is nothing—if you’re in position to get code onto inference servers you’re likely already in position to get access to the logs coming from those inference servers.
The takeaway here is that we can reduce worst-case exfiltratable information to ~0.5% (or ~200x smaller) the theoretical maximum, with a false positive rate that is ~0.01%.
Neat! I think at that point it might be cheaper to just train your own frontier model rather than trying to exfiltrate someone else’s through the output token channel.
1. ^
  If we assume we want to exfiltrate a 1TB model, even if we can get the full log2(200k) bits per token, that’s still 57 gigatokens. At $25 / megatoken that’s $1.4M of inference to extract the weights (and MoE models may be even larger than this, since you need to extract all params not just active ones).
- Adam Karvonen 16 Dec 2025 19:59 UTC
  3 points
  0
  Parent
  Yeah, logs are definitely another potential exfiltration vector that are worth thinking about.
  
  We focus on output tokens because they are the one channel that cannot be bandwidth capped, as they’re literally the purpose of the data center. In hardened secure environments, you would want to aggressively reduce the bandwidth and entropy of the logs. In practice, I’m not sure how large the “minimum necessary logs” would be compared to inference outputs.
  - faul_sname 16 Dec 2025 20:50 UTC
    4 points
    0
    Parent
    Yeah, I am confident that I’m not the first person to have that thought, logs are a classic channel for data exfiltration by insiders. I very much hope that all of the frontier labs do a better job at reducing the effectiveness of that channel than the abysmal industry standard.
    
    BTW in case it wasn’t obvious I really do think your result is cool, and I expect probably sufficient to close off the token channel as a cost-effective exfiltration channel. You don’t have to outrun the bear, you just have to outrun your hiking partner and all that (“the bear” in this context being “an adversary who wants a frontier model” and “your hiking partner” being “just training their own frontier model, possibly on the outputs of your model”)