I think your summary is basically correct.
After writing this, it’s hard for me to imagine a world in which it’s possible to build a complete safety case for a model before deployment.
I think that building a safety case with a very high level of assurance is extremely difficult, as there is a (probably small) chance that the model is a competent schemer that is patiently waiting for better conditions.
I think Anthropic’s approach in their recent model system cards is reasonable: multiple stacked layers of evidence. Behavioral evals look good, interpretability techniques didn’t find anything major, model organisms research that purposely trains models with bad goals didn’t produce competent schemers, production traffic monitoring didn’t find anything, current models still aren’t very intelligent or coherent on agentic tasks, etc. These layers can stack to provide reasonable confidence, although they may struggle to provide very high assurance for future, more capable models.
That said, Anthropic also notes many confounders in their system cards. For example, Claude Opus 4.6 is quite good at subtly sabotaging experiments without getting caught, yet due to time pressure it is used extensively for technical safety work.
I agree that “eval awareness” can be an unclear concept, and that reasoning about the reward process and about whether an interaction is an evaluation is a very useful and natural strategy for a model to learn during RL. I generally take it to mean “the model thinks it may be in an evaluation”, even if the model does not verbalize this. It doesn’t even have to affect the model’s behavior: the belief that “this may be an evaluation” is enough to muddy the results of safety- and alignment-related evaluations.
One potential distinction between eval awareness and reward hacking is that the most concerning version of eval awareness involves exploiting the gap between training / evaluation and deployment, rather than exploiting the reward signal. For example, a competent schemer could always behave correctly in training and only act misaligned in deployment. However, this regime is speculative, and we don’t have evidence of this sort of deliberate scheming today.