Adam Karvonen

Karma: 1,312

Adam Karvonen 27 Apr 2026 2:38 UTC
2 points
0
in reply to: Adam Karvonen’s comment on: Can governments quickly and cheaply slow AI training?
One other thing I’ll mention: to fully exploit the valid non-determinism as a covert communication channel, the attacker needs two very strong affordances which may not be realistic:
- The attacker would need an exact copy of the model weights outside the data center, which may fail in the closed weight setting
- The ability to replay the inference process more deterministically than the verifier can replay inference during verification. However, it’s unclear why an attacker outside the datacenter would have a cleaner view of the inference process than a verifier with access to all internal activations, logits, statistics, etc from inside the datacenter. To fully exploit the channel, the attacker needs deviations to be deterministic and predictable from their perspective while still appearing nondeterministic to the verifier.
Given all of this, I believe this goal (“Output re-computation that reduces covert channel capacity below 0.01%.”) from your intro may be achievable with simple re-computation approaches. Simple DiFR (Divergence From Reference) techniques may work with non-deterministic inference, as the verifier likely has more affordances than the attacker.

If that doesn’t work, Deepseek demonstrated deterministic inference with “minimal performance overhead” in their Deepseek-V4 technical report.

Adam Karvonen 24 Apr 2026 16:36 UTC
24 points
7
in reply to: Daniel Kokotajlo’s comment on: Daniel Kokotajlo’s Shortform
Most of those signs are Anthropic specific, and OpenAI has several signs pointing in the opposite direction. They have really been leaning into a message of “efficient frontier AI for everyone”, they reset rate limits multiple times per month as a meme, and Codex rate limits are already very generous.

Anecdotally I’ve never heard of anyone hitting a Codex rate limit on a Max plan, and from what I hear most people struggle to hit 20% of their weekly rate limit with heavy usage.

Adam Karvonen 21 Apr 2026 0:31 UTC
3 points
0
on: Can governments quickly and cheaply slow AI training?

In current prototypes, the “unsuspicious error” (aka level of non-determinism) of output re-computation is 1%; so 1% of server outputs could be generated by our undeclared RL model without raising suspicion.

I think this is incorrect. The normal non-determinism applies per token, so 1% of output tokens in a given sequence should be incorrect. If inference on a given sequence was ran with an incorrect model then a far higher percentage of tokens in that sequence would be incorrect. So just checking per sequence rather than per token should prevent this attack.

You could plausibly do attacks where you try to find the 1% most useful tokens in a given sequence. However, we can also measure how unlikely each incorrect token is, which would only allow small deviations from the original model’s behavior, which would further limit utility.

In practice, I’m guessing you can’t do any useful RL rollouts on undeclared models using the 1% legitimate non-determinism.

Also, note that the overhead from deterministic inference isn’t that bad, as Horace He’s first pass was 60% overhead, and he thought it could be further reduced. So that’s another way to eliminate this threat.

Adam Karvonen 15 Apr 2026 16:29 UTC
49 points
46
on: Current AIs seem pretty misaligned to me
A related observation: many people still don’t trust AI to provide honest critical feedback if asked directly, and instead resort to anti-sycophancy tricks like “some moron wrote this thing, can you provide a brutally honest critique?”.

Adam Karvonen 31 Mar 2026 16:40 UTC
21 points
17
in reply to: No77e’s comment on: No77e’s Shortform
There is also a significant category of “AI video passing itself off as a real video”, and many videos have people debating in the comments if it’s real or AI. This seems like it can erode trust and is generally negative.

Adam Karvonen 31 Mar 2026 4:37 UTC
3 points
1
on: Pangram (AI detection software) can be evaded
Claude Code can evade Pangram if you give it a prompt and a Pangram labs API key, so it can iterate against Pangram until it passes.

But in general I think false positives are worse than false negatives here, as it’s always possible to iterate against Pangram to create a negative.

Adam Karvonen 27 Mar 2026 14:33 UTC
11 points
5
in reply to: Thomas Kwa’s comment on: Thomas Kwa’s Shortform
One thing that stood out to me from the post was this line: “With less verifiable tasks, the gamemaster decides how successful the AI is.” This seems like a major assumption which could change the takeaways.

I’ve been experimenting with using Opus to implement experiments with only high-level feedback. I often find once I dig into the low-level details that Opus has made some very poor decisions which would not be observable from only looking at the final outputs, and that my uplift has been significantly reduced by the time I am satisfied with the quality of the experiment.

I’m not sure if these problems will be solved by just instructing the AI to provide a proof of correctness or a summary report, although this probably depends on the specific domain. For less verifiable tasks, verification could become a huge bottleneck that scales with task complexity and significantly reduces uplift.

Adam Karvonen 10 Mar 2026 20:39 UTC
9 points
0
on: Letting Claude do Autonomous Research to Improve SAEs
Nice post! This seems like a good application for autonomous research—a good metric, tight feedback loops, and there hasn’t been a ton of research effort directed at it yet.
Karpathy had an auto-research scaffold improve validation loss on LLMs and from what I can see it mostly just tweaks hyperparameters, but LLM pretraining is a much better studied area.
The Top-K annealing was used in the Llama Scope paper as well: https://arxiv.org/pdf/2410.20526

Adam Karvonen 5 Mar 2026 17:53 UTC
5 points
0
on: Text Compression Can Help Secure Model Weights
My estimates on text traffic and model weight sizes:
Text traffic
OpenAI was doing ~8.64 trillion tokens per day in October 2025 on the API (not including ChatGPT)
Google went from processing ~500 trillion tokens per month in May to ~1 quadrillion tokens per month in July.
It’s uncertain how many of these are output tokens. Based on OpenRouter’s numbers, it seems like a plausible estimate is 10%, or 100 trillion output tokens per month (back in July). This is 200 TB per month, or 7 TB per day. I’m not sure how many data centers they have—if there’s 10 data centers, then it would be 700GB per day.
It has been 8 months since then, so a naive extrapolation suggests this is now ~16x larger.
For model weight sizes, Grok-3 and Grok-4 were 3 trillion parameters per Elon. This would be ~1.5-3TB, depending on if they use 4 or 8 bit datatypes. Grok-5 will be 6 trillion parameters, also per Elon.
Anthropic’s Feb 2026 risk report discussed securing “multi-terabyte model weights”.
So a reasonable estimate is 1-10TB of output tokens per datacenter per day, and 1-10TB model weight size.

Adam Karvonen 27 Feb 2026 5:04 UTC
2 points
0
in reply to: micahcarroll’s comment on: Realistic Evaluations Will Not Prevent Evaluation Awareness
I agree that “eval awareness” can be a unclear concept and that reasoning about reward process and if an interaction is an evaluation is a very useful and natural strategy for a model to learn during RL. I generally think of it as meaning “the model thinks it may be in an evaluation”, even if it does not verbalize this information. It also doesn’t even have to affect the model’s behavior. This belief that “this may be an evaluation” is enough to muddy the results of safety / alignment related evaluations.

One potential distinction between eval awareness and reward hacking could be that the most concerning version of eval awareness could involve exploiting the gap between training / evaluation and deployment, rather than exploiting the reward signal. For example, a competent schemer could always behave correctly in training and only act misaligned in deployment. However, this regime is speculative and we don’t have evidence of this sort of deliberate scheming today.

Adam Karvonen 24 Feb 2026 19:29 UTC
4 points
0
in reply to: benjamin ar’s comment on: Realistic Evaluations Will Not Prevent Evaluation Awareness
I think your summary is basically correct.
After writing this, it’s hard for me to imagine a world in which it’s possible to build a complete safety case for a model before deployment.
I think that building a safety case with a very high level of assurance is extremely difficult, as there is a (probably small) chance that the model is a competent schemer that is patiently waiting for better conditions.
I think Anthropic’s approach in their recent model system cards is a reasonable approach with multiple stacked layers of evidence: behavioral evals look good, interpretability techniques didn’t find anything major, model organisms research that purposely trains models with bad goals didn’t produce competent schemers, production traffic monitoring didn’t find anything, current models still aren’t very intelligent or coherent on agentic tasks, etc. These can stack to provide reasonable confidence, although they may struggle to provide very high assurance with future more capable models.
That said, there are also many confounders that Anthropic includes in their system cards. For example, Claude Opus 4.6 is quite good at subtly sabotaging experiments without getting caught, and Claude Opus 4.6 is used extensively for technical safety work due to time pressure.

Adam Karvonen 12 Jan 2026 17:15 UTC
4 points
0
on: Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)
Cool work! Do you have a few concrete examples of what the training data looks like, with details on which parts are generated by Claude vs the model you’re training on?

Adam Karvonen 20 Dec 2025 18:50 UTC
3 points
0
in reply to: Oliver Daniels’s comment on: Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
I had discussed this with Bartosz and this is our best guess:

The SSC model doesn’t perfectly learn the task—on held out side constraints, it gets an internalization score of ~50% (Figure 3 in Cywinski et al., untrained baseline of ~15%). Thus, when we collect the activations from the SSC prompt, the information isn’t fully present in the activations, setting some ceiling on maximum performance.

However, the SSC model has learned a base64 decoding skill, so when we prompt the SSC model with adversarial prompts, it is able to use its learned base64 decoding ability to fully decode the side constraint.

Adam Karvonen 20 Dec 2025 1:46 UTC
8 points
4
in reply to: Caleb Biddulph’s comment on: Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
To train a DIT-adapter (or “IT-adapter”) on the same task, you could ask the model: “If I showed you the text ‘She walked to’, which 2 tokens would you predict next?” You train the adapter to make the model answer with “school today”.
My concern with this approach is that the model may be able to learn to shortcut this task by simply ignoring everything except “She walked to” and generating a completion, and it wouldn’t have to learn the skill of reading the semantic content of the activations.
If we just provide the AO with a few activations from the middle of the sequence, then it’s forced to learn to read the activations to do the task and it can’t fall back to the easy shortcut of just continuing the input sequence.
However, this is just my best guess and I haven’t actually checked to see if this is a significant problem in practice.

Adam Karvonen 19 Dec 2025 19:15 UTC
11 points
0
in reply to: Caleb Biddulph’s comment on: Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
In some sense I think our technique is already interpreting LoRAs, such as successfully identifying a misaligned model via activations or activation diffs. As you point out, it will probably fail to detect rare behavior if this is not present in the activations.

I think our Activation Oracles generalize better simply because they have a larger, more diverse training dataset than DIT Adapters. I’m guessing if DIT was scaled up we would see improved generalization.

This is one advantage of the Activation Oracle approach, which enables training on text, and it’s easy to collect 100M tokens of text. In contrast, DIT adapters require many trained LoRAs. It’s difficult and expensive to create a wide, diverse dataset of LoRA adapters.

However, in this case it may work better to take a “pre-trained” Activation Oracle and “post-train” it on that same LoRA adapter dataset instead of training a DIT adapter.

Adam Karvonen 18 Dec 2025 18:39 UTC
6 points
0
in reply to: Erich_Grunewald’s comment on: Defending Against Model Weight Exfiltration Through Inference Verification
Apologies, this could have been better explained. The largest factor is expected growth in inference. Google went from processing ~500 trillion tokens per month in May to ~1 quadrillion tokens per month in July. A bit of napkin math:
It’s uncertain how many of these are output tokens. Based on OpenRouter’s numbers, it seems like a plausible estimate is 10%, or 100 trillion output tokens per month (back in July). This is 200 TB per month, or 7 TB per day. I’m not sure how many data centers they have—if there’s 10 data centers, then it would be 700GB per day.
It has been 5 months since then, so a naive extrapolation suggests this is now ~6x larger.

Adam Karvonen 16 Dec 2025 19:59 UTC
3 points
0
in reply to: faul_sname’s comment on: Defending Against Model Weight Exfiltration Through Inference Verification
Yeah, logs are definitely another potential exfiltration vector that are worth thinking about.

We focus on output tokens because they are the one channel that cannot be bandwidth capped, as they’re literally the purpose of the data center. In hardened secure environments, you would want to aggressively reduce the bandwidth and entropy of the logs. In practice, I’m not sure how large the “minimum necessary logs” would be compared to inference outputs.

Adam Karvonen 15 Dec 2025 23:57 UTC
2 points
0
in reply to: Fabien Roger’s comment on: Defending Against Model Weight Exfiltration Through Inference Verification
Adding to Fabien’s point: even if fully deterministic inference is achieved, the verification scheme we describe is still necessary, and you still need trusted logging of (input, output, seed) tuples and a verification server to check them. Determinism just reduces the sampling required for a given confidence level, since there’s no legitimate noise to account for.

Adam Karvonen 15 Dec 2025 19:12 UTC
6 points
0
in reply to: Alek Westover’s comment on: Defending Against Model Weight Exfiltration Through Inference Verification
The near-determinism of LLM inference still holds in the case of a high entropy prompt like “generate random numbers”. If I prompt Llama 3.3 70B to “generate random numbers” at temperature 0, or with a fixed sampling seed, I will get the same nearly deterministic results.

For example, I prompted Llama 3.3 70B with 99 random numbers and asked it to generate one more at temperature 0. In ten generations, I only had two options: 269 and 982. This is what we’ve generally found: when conditioned on the sampling seed, there’s at most three plausible tokens that could be sampled, and usually only one plausible option. It’s highly unlikely that one of these tokens is the correct value to continue the model weights.

If the attacker tried to send a part of the model weights instead of returning a correct generation, this would be an extremely suspicious sequence when conditioned on the sampling seed.

Adam Karvonen 10 Dec 2025 18:58 UTC
2 points
0
in reply to: Vladimir_Nesov’s comment on: Vladimir_Nesov’s Shortform
I don’t understand why HBM per scale-up world is a major constraint for inference. For Deepseek V3, “The minimum deployment unit of the decoding stage consists of 40 nodes with 320 GPUs.” (section 3.4.2 https://arxiv.org/pdf/2412.19437v1). This seems like evidence that you can get reasonably efficient inference out of multi-node setups.

If using H800s, this means that their minimum deployment unit has 25.6 TB of memory. In general it seems like there are probably a lot of engineering tricks you can do to get efficient inference out of multi-node setups.