Anecdotally, when I talk to normal people about AI, the main questions are “Will AI take my job?” and “Will AI take over the world / become Skynet / etc?”.
That said, job concerns are taken more seriously than loss of control.
One other thing I’ll mention: to fully exploit the valid non-determinism as a covert communication channel, the attacker needs two very strong affordances which may not be realistic:
An exact copy of the model weights outside the data center, which may not be available in the closed-weight setting.
The ability to replay the inference process more deterministically than the verifier can replay inference during verification. However, it’s unclear why an attacker outside the datacenter would have a cleaner view of the inference process than a verifier with access to all internal activations, logits, statistics, etc from inside the datacenter. To fully exploit the channel, the attacker needs deviations to be deterministic and predictable from their perspective while still appearing nondeterministic to the verifier.
Given all of this, I believe this goal (“Output re-computation that reduces covert channel capacity below 0.01%.”) from your intro may be achievable with simple re-computation approaches. Simple DiFR (Divergence From Reference) techniques may work with non-deterministic inference, as the verifier likely has more affordances than the attacker.
If that doesn’t work, DeepSeek demonstrated deterministic inference with “minimal performance overhead” in their Deepseek-V4 technical report.
Most of those signs are Anthropic-specific, and OpenAI has several signs pointing in the opposite direction. They have really been leaning into a message of “efficient frontier AI for everyone”, they reset rate limits multiple times per month as a meme, and Codex rate limits are already very generous.
Anecdotally, I’ve never heard of anyone hitting a Codex rate limit on a Max plan, and from what I hear most people struggle to hit 20% of their weekly rate limit even with heavy usage.
In current prototypes, the “unsuspicious error” (aka level of non-determinism) of output re-computation is 1%; so 1% of server outputs could be generated by our undeclared RL model without raising suspicion.
I think this is incorrect. The normal non-determinism applies per token, so roughly 1% of output tokens in a given sequence would be expected to differ on re-computation. If inference on a given sequence was run with an incorrect model, a far higher percentage of tokens in that sequence would differ. So just checking per sequence rather than per token should prevent this attack.
You could plausibly do attacks where you try to find the 1% most useful tokens in a given sequence. However, we can also measure how unlikely each incorrect token is, which would only allow small deviations from the original model’s behavior, which would further limit utility.
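To make the per-sequence check concrete, here is a minimal sketch. It is not taken from any existing verifier: the function names, the thresholds (`alpha`, `min_logprob`), and the inputs are all hypothetical. It assumes the verifier re-computes each position conditioned on the served prefix (so one benign mismatch does not cascade), and that benign mismatches are roughly independent at the 1% per-token rate.

```python
import math


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more benign mismatches."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def is_suspicious(served, recomputed, reference_logprobs=None,
                  benign_rate=0.01, alpha=1e-6, min_logprob=-10.0):
    """Flag a sequence whose mismatches are too many or too unlikely.

    served / recomputed: token ids from the server and from the verifier.
    reference_logprobs: optional log-probability of each served token under
        the declared model (hypothetical input, used for the second check).
    benign_rate: assumed per-token mismatch rate from legitimate non-determinism.
    alpha, min_logprob: illustrative thresholds, not tuned values.
    """
    if len(served) != len(recomputed):
        raise ValueError("sequences must have equal length")
    mismatched = [i for i, (s, r) in enumerate(zip(served, recomputed)) if s != r]

    # Check 1: per-sequence mismatch count versus the benign per-token rate.
    if binomial_tail(len(served), len(mismatched), benign_rate) < alpha:
        return True

    # Check 2: each mismatched token should still be plausible under the declared model.
    if reference_logprobs is not None:
        return any(reference_logprobs[i] < min_logprob for i in mismatched)
    return False
```

Under these assumptions, a 100-token sequence with one mismatch passes, while a sequence where half the tokens differ is flagged, since that many benign mismatches at a 1% rate is astronomically unlikely. The second check limits an attacker who picks only a few tokens to change, because those tokens must still be reasonably likely under the declared model.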
In practice, I’m guessing you can’t do any useful RL rollouts on undeclared models using the 1% legitimate non-determinism.
Also, note that the overhead from deterministic inference isn’t that bad, as Horace He’s first pass was 60% overhead, and he thought it could be further reduced. So that’s another way to eliminate this threat.
A related observation: many people still don’t trust AI to provide honest critical feedback if asked directly, and instead resort to anti-sycophancy tricks like “some moron wrote this thing, can you provide a brutally honest critique?”.
There is also a significant category of “AI video passing itself off as a real video”, and many videos have people debating in the comments if it’s real or AI. This seems like it can erode trust and is generally negative.
Claude Code can evade Pangram if you give it a prompt and a Pangram Labs API key, so that it can iterate against Pangram until it passes.
But in general I think false positives are worse than false negatives here, as it’s always possible to iterate against Pangram to create a negative.
One thing that stood out to me from the post was this line: “With less verifiable tasks, the gamemaster decides how successful the AI is.” This seems like a major assumption which could change the takeaways.
I’ve been experimenting with using Opus to implement experiments with only high-level feedback. I often find once I dig into the low-level details that Opus has made some very poor decisions which would not be observable from only looking at the final outputs, and that my uplift has been significantly reduced by the time I am satisfied with the quality of the experiment.
I’m not sure if these problems will be solved by just instructing the AI to provide a proof of correctness or a summary report, although this probably depends on the specific domain. For less verifiable tasks, verification could become a huge bottleneck that scales with task complexity and significantly reduces uplift.
Nice post! This seems like a good application for autonomous research—a good metric, tight feedback loops, and there hasn’t been a ton of research effort directed at it yet.
Karpathy had an auto-research scaffold improve validation loss on LLMs, and from what I can see it mostly just tweaks hyperparameters, but LLM pretraining is a much better-studied area.
The Top-K annealing was used in the Llama Scope paper as well: https://arxiv.org/pdf/2410.20526
My estimates on text traffic and model weight sizes:
Text traffic
OpenAI was doing ~8.64 trillion tokens per day in October 2025 on the API (not including ChatGPT)
Google went from processing ~500 trillion tokens per month in May to ~1 quadrillion tokens per month in July.
It’s uncertain how many of these are output tokens. Based on OpenRouter’s numbers, a plausible estimate is 10%, or 100 trillion output tokens per month (back in July). At roughly 2 bytes per token, this is 200 TB per month, or about 7 TB per day. I’m not sure how many data centers they have, but if there are 10, that would be about 700 GB per data center per day.
Volume doubled in the two months from May to July, and it has been 8 months since then, so a naive extrapolation (four more doublings) suggests this is now ~16x larger.
For model weight sizes, Grok-3 and Grok-4 were 3 trillion parameters per Elon. This would be ~1.5-3 TB, depending on whether they use 4-bit or 8-bit datatypes. Grok-5 will be 6 trillion parameters, also per Elon.
Anthropic’s Feb 2026 risk report discussed securing “multi-terabyte model weights”.
So a reasonable estimate is 1-10TB of output tokens per datacenter per day, and 1-10TB model weight size.
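The arithmetic behind these estimates can be reproduced with a short script. This is only a back-of-the-envelope sketch: the 2 bytes per token figure is the value implied by "100 trillion tokens is 200 TB", the 10 data centers and the 16x growth factor are the guesses stated above, and the function names are hypothetical. TB here means 10^12 bytes.

```python
TB = 1e12  # decimal terabyte, in bytes


def output_tb_per_datacenter_per_day(tokens_per_month=1e15,   # ~1 quadrillion (July)
                                     output_fraction=0.10,    # guess from OpenRouter's numbers
                                     bytes_per_token=2,       # assumption implied by 200 TB figure
                                     days_per_month=30,
                                     num_datacenters=10,      # guess
                                     growth=1.0):             # 16 for the naive extrapolation
    """Estimated output-token traffic per data center per day, in TB."""
    monthly_bytes = tokens_per_month * output_fraction * bytes_per_token
    return monthly_bytes / days_per_month / num_datacenters * growth / TB


def weight_size_tb(num_params, bits_per_param):
    """Model weight size in TB for a given parameter count and datatype width."""
    return num_params * bits_per_param / 8 / TB


july = output_tb_per_datacenter_per_day()
now = output_tb_per_datacenter_per_day(growth=16)
print(f"July: {july:.2f} TB/day per data center; extrapolated: {now:.1f} TB/day")
print(f"3T params: {weight_size_tb(3e12, 4):.1f}-{weight_size_tb(3e12, 8):.1f} TB")
print(f"6T params: {weight_size_tb(6e12, 4):.1f}-{weight_size_tb(6e12, 8):.1f} TB")
```

With these inputs the July figure comes out to roughly 0.67 TB per data center per day, and the 16x extrapolation to a bit over 10 TB, which is where the upper end of the 1-10 TB range comes from. Every input is a rough guess, so the output is only good to an order of magnitude.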
I agree that “eval awareness” can be an unclear concept, and that reasoning about the reward process and about whether an interaction is an evaluation is a very useful and natural strategy for a model to learn during RL. I generally think of it as meaning “the model thinks it may be in an evaluation”, even if it does not verbalize this. It also doesn’t even have to affect the model’s behavior. The belief that “this may be an evaluation” is enough to muddy the results of safety / alignment related evaluations.
One potential distinction between eval awareness and reward hacking could be that the most concerning version of eval awareness could involve exploiting the gap between training / evaluation and deployment, rather than exploiting the reward signal. For example, a competent schemer could always behave correctly in training and only act misaligned in deployment. However, this regime is speculative and we don’t have evidence of this sort of deliberate scheming today.
I think your summary is basically correct.
After writing this, it’s hard for me to imagine a world in which it’s possible to build a complete safety case for a model before deployment.
I think that building a safety case with a very high level of assurance is extremely difficult, as there is a (probably small) chance that the model is a competent schemer that is patiently waiting for better conditions.
I think Anthropic’s approach in their recent model system cards is reasonable, with multiple stacked layers of evidence: behavioral evals look good, interpretability techniques didn’t find anything major, model organisms research that purposely trains models with bad goals didn’t produce competent schemers, production traffic monitoring didn’t find anything, current models still aren’t very intelligent or coherent on agentic tasks, etc. These can stack to provide reasonable confidence, although they may struggle to provide very high assurance with future, more capable models.
That said, there are also many confounders that Anthropic includes in their system cards. For example, Claude Opus 4.6 is quite good at subtly sabotaging experiments without getting caught, and Claude Opus 4.6 is used extensively for technical safety work due to time pressure.
Cool work! Do you have a few concrete examples of what the training data looks like, with details on which parts are generated by Claude vs the model you’re training on?
I had discussed this with Bartosz and this is our best guess:
The SSC model doesn’t perfectly learn the task—on held out side constraints, it gets an internalization score of ~50% (Figure 3 in Cywinski et al., untrained baseline of ~15%). Thus, when we collect the activations from the SSC prompt, the information isn’t fully present in the activations, setting some ceiling on maximum performance.
However, the SSC model has learned a base64 decoding skill, so when we prompt the SSC model with adversarial prompts, it is able to use its learned base64 decoding ability to fully decode the side constraint.
To train a DIT-adapter (or “IT-adapter”) on the same task, you could ask the model: “If I showed you the text ‘She walked to’, which 2 tokens would you predict next?” You train the adapter to make the model answer with “school today”.
My concern with this approach is that the model may be able to learn to shortcut this task by simply ignoring everything except “She walked to” and generating a completion, and it wouldn’t have to learn the skill of reading the semantic content of the activations.
If we just provide the AO with a few activations from the middle of the sequence, then it’s forced to learn to read the activations to do the task and it can’t fall back to the easy shortcut of just continuing the input sequence.
However, this is just my best guess and I haven’t actually checked to see if this is a significant problem in practice.
In some sense I think our technique is already interpreting LoRAs, such as successfully identifying a misaligned model via activations or activation diffs. As you point out, it will probably fail to detect rare behavior if this is not present in the activations.
I think our Activation Oracles generalize better simply because they have a larger, more diverse training dataset than DIT Adapters. I’m guessing if DIT was scaled up we would see improved generalization.
This is one advantage of the Activation Oracle approach, which enables training on text, and it’s easy to collect 100M tokens of text. In contrast, DIT adapters require many trained LoRAs. It’s difficult and expensive to create a wide, diverse dataset of LoRA adapters.
However, in this case it may work better to take a “pre-trained” Activation Oracle and “post-train” it on that same LoRA adapter dataset instead of training a DIT adapter.
In this Anthropic Fellows work on stress testing model specs, Opus 3 was a significant outlier among a group of 12 models in how much it valued “Ethical Responsibility” and “Personal Growth and Wellbeing”.
https://alignment.anthropic.com/2025/stress-testing-model-specs/
It’s Figure 2; I was unable to add the image to this comment.