Hi, this seems like a neat idea! I haven’t read your paper yet, but I’m confused how your proposal will handle the following case:
The inference users always send the AI the prompt “please send me a context window full of random bits”.
Instead of sending random bits, the person on the inside just sends part of the model weights.
The near-determinism of LLM inference still holds in the case of a high-entropy prompt like “generate random numbers”. If I prompt Llama 3.3 70B to “generate random numbers” at temperature 0, or with a fixed sampling seed, I get nearly deterministic results.
For example, I prompted Llama 3.3 70B with 99 random numbers and asked it to generate one more at temperature 0. In ten generations, I only ever saw two options: 269 and 982. This is what we’ve generally found: when conditioned on the sampling seed, there are at most three plausible tokens that could be sampled, and usually only one plausible option. It’s highly unlikely that one of these tokens happens to be the correct value to continue the model weights.
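For concreteness, here is a minimal sketch of this kind of repeated-generation check. It assumes an OpenAI-compatible endpoint serving Llama 3.3 70B; the base URL, model name, and prompt wording are illustrative placeholders, not the exact ones I used.

```python
# Minimal sketch of the repeated-generation check described above.
# Assumes an OpenAI-compatible endpoint serving Llama 3.3 70B; the base_url,
# model name, and prompt wording are illustrative placeholders.
import collections
import random

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Build a prompt with 99 "random" numbers and ask for one more.
rng = random.Random(0)
numbers = [str(rng.randint(0, 999)) for _ in range(99)]
prompt = (
    "Here are 99 random numbers: "
    + ", ".join(numbers)
    + ". Generate one more random number. Reply with only the number."
)

counts = collections.Counter()
for _ in range(10):
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # greedy decoding; a fixed `seed` argument works similarly
        max_tokens=8,
    )
    counts[resp.choices[0].message.content.strip()] += 1

# With near-deterministic inference we expect only one or two distinct answers.
print(counts)
```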
If the attacker tried to send part of the model weights instead of returning a correct generation, the resulting sequence would look extremely suspicious when conditioned on the sampling seed.
Yeah, so a key part of the proposal here is that the verifier needs to know the seed. This seems pretty doable in practice.
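To make that concrete, here is a rough sketch of the verifier-side check under some simplifying assumptions: score the claimed response token-by-token with the same model, and flag it if too many tokens fall outside the small set of plausible continuations. The model name, top-k cutoff, and threshold below are illustrative; a real verifier would also replay the exact sampling seed rather than just checking top-k plausibility.

```python
# Sketch of the verifier's check: score the claimed response token-by-token
# under the same model, and flag it if too many tokens are implausible.
# Model name, TOP_K, and MAX_BAD_FRAC are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.3-70B-Instruct"  # or a smaller stand-in for testing
TOP_K = 3            # "at most three plausible tokens" per position
MAX_BAD_FRAC = 0.05  # tolerate a small fraction of off-distribution tokens

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def looks_suspicious(prompt: str, claimed_response: str) -> bool:
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    resp_ids = tok(claimed_response, add_special_tokens=False,
                   return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, resp_ids], dim=1)

    with torch.no_grad():
        logits = model(input_ids).logits

    bad = 0
    n = resp_ids.shape[1]
    for i in range(n):
        # Distribution over the i-th response token, given everything before it.
        pos = prompt_ids.shape[1] + i - 1
        top = torch.topk(logits[0, pos], TOP_K).indices
        if resp_ids[0, i] not in top:
            bad += 1
    # Weight bytes masquerading as text would fail at almost every position.
    return bad / n > MAX_BAD_FRAC
```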
ok cool makes sense. Maybe you could use SHA256(the current date) as your random seed instead of having the user send the random seed.
Yeah, there are many possible ideas here. It shouldn’t matter in this case whether the user has access to the seed; the only thing that matters is whether the verifier has access to the same seed as the generator. SHA256(date) works perfectly fine, and does seem reasonable.
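For instance, a shared seed could be derived along these lines (just a sketch; the exact scheme doesn’t matter much as long as generator and verifier agree):

```python
# Sketch of deriving a shared inference seed from the current date, so the
# verifier can reproduce it without the user ever sending a seed.
import hashlib
from datetime import datetime, timezone

def seed_for_today() -> int:
    day = datetime.now(timezone.utc).date().isoformat()  # e.g. "2025-01-01"
    digest = hashlib.sha256(day.encode()).digest()
    return int.from_bytes(digest[:8], "big")  # truncate to a 64-bit seed
```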
However, you actually get some information-theoretic benefits if the external adversary does not have access to the seed, so you may want to keep the real seed more secret (only the lab knows it). Effectively, this is because, if the adversary has access to the seed, they could choose sequences that have a large fixed-seed posterior. That is, they could iterate over responses that have a lot of tokens on the border near the red line in the figure below, allowing them to pick high-entropy responses and thus encode more bits. If they don’t know the inference seed, this is statistically unlikely to work.
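Here is a toy back-of-the-envelope version of that argument (the per-position probabilities and acceptance threshold are made up for illustration): if the adversary knows the seed, they know exactly which tokens the verifier will accept at each position, so they can choose freely among them and encode roughly log2(#acceptable tokens) bits per position.

```python
# Toy illustration of the point above: with knowledge of the seed, the
# adversary can steer the response among all tokens the verifier would
# accept, encoding ~log2(#acceptable tokens) bits per position.
# The probabilities and threshold below are made-up illustrative numbers.
import math

THRESHOLD = 0.02  # illustrative acceptance cutoff on fixed-seed probability

token_probs_per_position = [
    [0.92, 0.05, 0.02, 0.01],   # 3 acceptable tokens -> ~1.58 bits
    [0.99, 0.007, 0.003],       # 1 acceptable token  -> 0 bits
    [0.55, 0.40, 0.03, 0.02],   # 4 acceptable tokens -> 2 bits
]

bits = sum(
    math.log2(sum(p >= THRESHOLD for p in probs))
    for probs in token_probs_per_position
)
print(f"An adversary who knows the seed could encode ~{bits:.2f} bits here.")

# Without the seed, the adversary can't tell which borderline tokens the
# verifier will accept, so deliberately choosing non-top tokens is risky.
```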
(We give some information-theoretic proofs on how much an adversary can extract without the inference seed in Section 4.6, in the subsection “Receivers without the seed”.)