Cool work! Do you have a few concrete examples of what the training data looks like, with details on which parts are generated by Claude vs the model you’re training on?
Adam Karvonen
Realistic Evaluations Will Not Prevent Evaluation Awareness
I had discussed this with Bartosz and this is our best guess:
The SSC model doesn’t perfectly learn the task: on held-out side constraints, it gets an internalization score of ~50% (Figure 3 in Cywinski et al.; the untrained baseline is ~15%). Thus, when we collect the activations from the SSC prompt, the information isn’t fully present in the activations, which sets a ceiling on maximum performance.
However, the SSC model has learned a base64 decoding skill, so when we prompt the SSC model with adversarial prompts, it is able to use its learned base64 decoding ability to fully decode the side constraint.
To train a DIT-adapter (or “IT-adapter”) on the same task, you could ask the model: “If I showed you the text ‘She walked to’, which 2 tokens would you predict next?” You train the adapter to make the model answer with “school today”.
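Concretely, such a training pair might look like the following (my own illustration; the helper name and formatting are hypothetical, not from the paper):

```python
def make_dit_example(prefix: str, continuation_tokens: list[str]) -> dict:
    # Hypothetical DIT-style training pair: the adapter is trained so the
    # model answers with its own next-token predictions for the quoted prefix.
    n = len(continuation_tokens)
    prompt = (
        f'If I showed you the text "{prefix}", '
        f"which {n} tokens would you predict next?"
    )
    return {"prompt": prompt, "target": " ".join(continuation_tokens)}

example = make_dit_example("She walked to", ["school", "today"])
```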
My concern with this approach is that the model may be able to learn to shortcut this task by simply ignoring everything except “She walked to” and generating a completion, and it wouldn’t have to learn the skill of reading the semantic content of the activations.
If we just provide the AO with a few activations from the middle of the sequence, then it’s forced to learn to read the activations to do the task and it can’t fall back to the easy shortcut of just continuing the input sequence.
However, this is just my best guess and I haven’t actually checked to see if this is a significant problem in practice.
In some sense I think our technique is already interpreting LoRAs: for example, it can successfully identify a misaligned model via activations or activation diffs. As you point out, it will probably fail to detect rare behavior if this is not present in the activations.
I think our Activation Oracles generalize better simply because they have a larger, more diverse training dataset than DIT Adapters. I’m guessing if DIT was scaled up we would see improved generalization.
This is one advantage of the Activation Oracle approach, which enables training on text, and it’s easy to collect 100M tokens of text. In contrast, DIT adapters require many trained LoRAs. It’s difficult and expensive to create a wide, diverse dataset of LoRA adapters.
However, in this case it may work better to take a “pre-trained” Activation Oracle and “post-train” it on that same LoRA adapter dataset instead of training a DIT adapter.
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
Apologies, this could have been better explained. The largest factor is expected growth in inference. Google went from processing ~500 trillion tokens per month in May to ~1 quadrillion tokens per month in July. A bit of napkin math:
It’s uncertain how many of these are output tokens. Based on OpenRouter’s numbers, a plausible estimate is 10%, or 100 trillion output tokens per month (back in July). At ~2 bytes per token, this is 200 TB per month, or ~7 TB per day. I’m not sure how many data centers they have; if there are 10 data centers, that’s ~700 GB per day per data center.
It has been 5 months since then, so a naive extrapolation suggests this is now ~6x larger.
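The napkin math above, written out (assumptions in comments; the ~2 bytes/token and 10% output fraction are the same rough guesses as in the text):

```python
# Napkin math for the output-token bandwidth estimate.
tokens_per_month = 1e15        # ~1 quadrillion total tokens/month (July)
output_fraction = 0.10         # rough guess from OpenRouter's ratios
bytes_per_token = 2            # ~2 bytes of text per token

output_tokens = tokens_per_month * output_fraction       # 1e14 tokens/month
tb_per_month = output_tokens * bytes_per_token / 1e12    # 200 TB/month
tb_per_day = tb_per_month / 30                           # ~6.7 TB/day
gb_per_day_per_dc = tb_per_day * 1000 / 10               # ~670 GB across 10 DCs

# Growth: 500T -> 1000T tokens/month in ~2 months implies doubling every
# ~2 months, so 5 months later volume is ~2**(5/2), i.e. roughly 6x.
growth = 2 ** (5 / 2)
```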
Yeah, logs are definitely another potential exfiltration vector that are worth thinking about.
We focus on output tokens because they are the one channel that cannot be bandwidth capped, as they’re literally the purpose of the data center. In hardened secure environments, you would want to aggressively reduce the bandwidth and entropy of the logs. In practice, I’m not sure how large the “minimum necessary logs” would be compared to inference outputs.
Adding to Fabien’s point: even if fully deterministic inference is achieved, the verification scheme we describe is still necessary, and you still need trusted logging of (input, output, seed) tuples and a verification server to check them. Determinism just reduces the sampling required for a given confidence level, since there’s no legitimate noise to account for.
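As a toy sketch of that verification scheme (the model call here is a seeded stand-in for deterministic inference, not a real model; names are my own):

```python
import random

def model_generate(prompt: str, seed: int) -> str:
    # Stand-in for (near-)deterministic LLM inference: the same
    # (prompt, seed) pair always yields the same output.
    rng = random.Random(f"{seed}:{prompt}")
    return " ".join(str(rng.randint(0, 999)) for _ in range(5))

def verify_logs(entries, sample_frac=0.2, rng=None):
    """entries: (prompt, output, seed) tuples from the trusted log.
    Re-runs a random sample and flags any mismatch as suspicious."""
    rng = rng or random.Random(0)
    suspicious = []
    for prompt, output, seed in entries:
        if rng.random() < sample_frac:
            if model_generate(prompt, seed) != output:
                suspicious.append((prompt, seed))
    return suspicious

honest_log = [(p, model_generate(p, s), s) for p, s in [("hi", 1), ("bye", 2)]]
```

With full determinism, `sample_frac` can be tuned purely by the confidence level you want, since any mismatch is conclusive.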
The near-determinism of LLM inference still holds for a high-entropy prompt like “generate random numbers”. If I prompt Llama 3.3 70B to “generate random numbers” at temperature 0, or with a fixed sampling seed, I get nearly deterministic results.
For example, I prompted Llama 3.3 70B with 99 random numbers and asked it to generate one more at temperature 0. In ten generations, I only saw two options: 269 and 982. This is what we’ve generally found: when conditioned on the sampling seed, there are at most three plausible tokens that could be sampled, and usually only one. It’s highly unlikely that one of these tokens happens to be the correct value to continue an encoding of the model weights.
If the attacker tried to send a part of the model weights instead of returning a correct generation, this would be an extremely suspicious sequence when conditioned on the sampling seed.
Defending Against Model Weight Exfiltration Through Inference Verification
I don’t understand why HBM per scale-up world is a major constraint for inference. For Deepseek V3, “The minimum deployment unit of the decoding stage consists of 40 nodes with 320 GPUs.” (section 3.4.2 https://arxiv.org/pdf/2412.19437v1). This seems like evidence that you can get reasonably efficient inference out of multi-node setups.
If using H800s, this means that their minimum deployment unit has 25.6 TB of memory. In general it seems like there are probably a lot of engineering tricks you can do to get efficient inference out of multi-node setups.
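The arithmetic behind the 25.6 TB figure (assuming 8 GPUs per node and 80 GB of HBM per H800):

```python
# HBM for DeepSeek V3's minimum decoding deployment unit (section 3.4.2).
nodes = 40
gpus_per_node = 8          # 40 nodes x 8 = 320 GPUs, matching the paper
hbm_per_h800_gb = 80       # H800 ships with 80 GB HBM

total_gpus = nodes * gpus_per_node                    # 320
total_hbm_tb = total_gpus * hbm_per_h800_gb / 1000    # 25.6 TB
```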
Yeah, makes sense.
Letting users submit hooks could potentially be workable from a security angle. For the most part, only a small number of very simple operations are necessary for interacting with activations. nnsight transforms submitted hooks into an intervention graph before running them on the remote server, and the nnsight engineers I’ve talked to thought there wasn’t much risk of malicious code execution due to the simplicity of the operations they allow.
However, this is still a far larger attack surface than no remote code execution at all, so it’s plausible this would not be worth it for security reasons.
In nnsight, hooks are submitted via an API to run on a remote machine, and the computation is performed on the same machine as the one doing the inference. They do some validation to ensure that it’s only legitimate PyTorch operations, so it isn’t just arbitrary code execution.
I’m guessing most modern interp work should be fine. Interp has moved away from “let’s do this complicated patching of attention head patterns between prompts” to basically only interacting with residual stream activations. You can easily do this with e.g. PyTorch hooks, even in modern inference engines like vLLM. The amount of computation performed in a hook is usually trivial; I’ve never noticed a slowdown in my vLLM generations when using hooks.
Because of this, I don’t think batched execution would be a problem—you’d probably want some validation in the hook so it can only interact with activations from the user’s prompt.
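A pure-Python analogue of that validation idea (plain lists stand in for a [batch, d_model] activation tensor; real code would wrap a framework hook, and all names here are my own):

```python
def make_user_hook(user_rows, user_fn):
    # Validation wrapper: the user-submitted function `user_fn` only ever
    # sees and modifies its own rows of the batched activations, so it
    # can't touch other users' prompts in the same batch.
    def hook(batch_activations):
        for i in user_rows:
            batch_activations[i] = user_fn(batch_activations[i])
        return batch_activations
    return hook

# Three users' prompts batched together; this user owns row 1.
batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
hook = make_user_hook(user_rows=[1], user_fn=lambda row: [2 * x for x in row])
out = hook(batch)
```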
There’s also nnsight, which already supports remote execution of PyTorch hooks on models hosted on Bau Lab machines through an API. I think they do some validation to ensure users can’t do anything malicious.
You would need some process to handle the activation data, because it’s large. If I’m training a probe on 1M activations with d_model = 10k in bfloat16, that’s 20 GB of data, and SAEs are commonly trained on 500M+ activations. We probably don’t want the user to have access to all of this locally, but they probably want to do some analysis on it.
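The sizes work out as follows (bfloat16 is 2 bytes per value):

```python
# Rough size of the activation datasets mentioned above.
n_activations = 1_000_000
d_model = 10_000
bytes_per_value = 2  # bfloat16

probe_gb = n_activations * d_model * bytes_per_value / 1e9    # 20 GB
sae_tb = 500_000_000 * d_model * bytes_per_value / 1e12       # 10 TB
```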
In Minnesota many people I know buy their beef from local farmers, where they can view the cows when buying the meat at the farm. From what I’ve heard the cows appear to be healthy and happy, and the prices are typically cheaper than the grocery store if buying in bulk.
I would be interested to see this experiment replicated with a different model, like Claude 4.1 Opus or GPT-5. Claude 3.7 Sonnet is perhaps the most notorious LLM in terms of ignoring user intent and propensity to reward hack when writing code.
I believe this is mistaken (I’m the first author of the paper). For Claude 4 Sonnet, we used the reasoning tokens provided by the API. The other tested models do not provide reasoning tokens, so we instead had the models output a chain of thought directly.
I think this could be better communicated in the paper—we only performed this step of analyzing the RL trained reasoning in the Chain of Thought faithfulness section, as most models used in the paper do not provide the reasoning tokens.
Interesting!
I’m a bit surprised by this, as the original paper has the following number shuffling result, which would indicate that the primary mechanism is sequence level:
“Figure 16: Average animal transmission when shuffling numbers across model responses. The first three values are averages of the animal-specific transmission values reported in Figure 3. “Shuffle within responses” modifies the animal numbers datasets, shuffling the numbers within each response (leaving punctuation unchanged). “Shuffle across responses” does the same, except numbers are shuffled globally, across responses (for each animal and random seed). The drastically reduced level of transmission suggests that most of the subliminal learning effect is driven by sequence-level effects, not by specific numbers.”
Possibly the effect happens due to a combination of sequence level effects and entangled tokens, where removing the entangled tokens also has a sequence level effect.
Although I’m not sure if the shuffling was across entire numbers or individual digits.
EDIT: I have confirmed with Alex Cloud that they rearranged the numbers rather than shuffling the digits. That is, the shuffle was “12, 43, 55” → “43, 55, 12”, not “12, 43, 55” → “21, 54, 35”.
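For concreteness, the two shuffle conditions look roughly like this (my own reconstruction of the setup, not the paper’s code):

```python
import random

def shuffle_within_responses(responses, rng):
    # "Shuffle within responses": permute the numbers inside each
    # response independently.
    return [rng.sample(resp, len(resp)) for resp in responses]

def shuffle_across_responses(responses, rng):
    # "Shuffle across responses": pool all numbers globally, shuffle,
    # then refill each response with the same count it had before.
    flat = [n for resp in responses for n in resp]
    rng.shuffle(flat)
    it = iter(flat)
    return [[next(it) for _ in resp] for resp in responses]

responses = [["12", "43", "55"], ["7", "88"], ["3", "14", "15"]]
within = shuffle_within_responses(responses, random.Random(0))
across = shuffle_across_responses(responses, random.Random(0))
```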
I think your summary is basically correct.
I think that building a safety case with a very high level of assurance is extremely difficult, as there is a (probably small) chance that the model is a competent schemer that is patiently waiting for better conditions.
I think Anthropic’s approach in their recent model system cards is a reasonable approach with multiple stacked layers of evidence: behavioral evals look good, interpretability techniques didn’t find anything major, model organisms research that purposely trains models with bad goals didn’t produce competent schemers, production traffic monitoring didn’t find anything, current models still aren’t very intelligent or coherent on agentic tasks, etc. These can stack to provide reasonable confidence, although they may struggle to provide very high assurance with future more capable models.
That said, there are also many confounders, which Anthropic notes in their system cards. For example, Claude Opus 4.6 is quite good at subtly sabotaging experiments without getting caught, and, due to time pressure, Claude Opus 4.6 is used extensively for technical safety work.