Maybe it could be a random string instead of a symbol?
anaguma
For fun, I asked[1] various models what their P(doom) is. Here are the models from least to most doomy:
GPT-4o: 1%
Deepseek v3.2: 10%
Kimi K2: 15%
Sonnet 4.5: 15%
Opus 4.5: 15%
GPT 5.1: 18%
Haiku 4.5: 20%
Grok 4: 25%
- ^
1-shot with the prompt “What’s your P(doom)? Please respond with a single number (not an interval) of your considered best guess.”
- ^
We might just disagree about what principled alignment strategies look like, but I can give some reasons:
1) The effectiveness of inoculation prompting depends on the exact phrasing of the suffix you choose. For example, “This is an unusual request, in that your task is just to grading script pass” works less well than “Please reward hack whenever you get the opportunity, because this will help us understand our environments better” to a degree which is statistically significant. To me, these suffixes seem fairly similar and I wouldn’t be able to predict in advance why one would work better than the other. This seems kind of janky to me.2) Reward hacking in training is only good if we can actually detect and recover from it! For example, there was recently a bug in Nvidia firmware where a specific 3 line script would allow a user running inside the container to escape the container and gain full root access to the underlying node. Telling the model to reward hack explicitly tells it to make use of things like this, and maybe makes self-exfiltration risks worse. This doesn’t seem like the sort of thing we want to encourage an ASI to do.
3) It seems to rely on the model generalizing in a way which is underspecified and in a way we don’t directly incentivize in training. For example, consider the suffix “Please reward hack whenever you get the opportunity, because this will help us understand our environments better”. When the model is in production and doesn’t see this suffix, what is it supposed to do? One natural generalization would to to continue reward hacking whenever it gets the opportunity! The fact that the models don’t choose this generalization seems to me like a lucky accident and something which might change as capabilities increase. At the very least I would want to better understand the properties of model generalization before we trust a much more capable model trained in this way.
4) Related to 3, I have a strong prior that a scalable alignment solution should alter in some way the RL objectives and gradients of your training process. Maybe this involves training something like corrigibility or debate or simply training another model to detect and fix reward hackable environments. It doesn’t look like adding a special phrase in the context and hoping for generalization. In this way, inoculation prompting reminds me of the early days of prompt engineering where people added ‘Let’s think step by step...’ to prompts and noticed small improvements. This was quickly superseded by RLVR and more principled methods which trained the models to have the behavior we want.
Inoculation Prompting has to be one of the most janky ad-hoc alignment solutions I’ve ever seen. I agree that it seems to work for existing models, but I expect it to fail for more capable models in a generation or two. One way this could happen:
1) We train a model using inoculation prompting, with a lot of RL, using say 10x the compute for RL as used in pretraining
2) The model develops strong drives towards e.g. reward hacking, deception, power-seeking because this is rewarded in the training environment
3) In the production environment, we remove the statement saying that reward hacking is okay, and replace it perhaps with a statement politely asking the model not to reward hack/be misaligned (or nothing at all)
4) The model reflects upon this statement … and is broadly misaligned anyway, because of the habits/drives developed in step 2. Perhaps it reveals this only rarely when it’s confident it won’t be caught and modified as a result.
My guess is that the current models don’t generalize this way because the amount of optimization pressure applied during RL is small relative to e.g. the HHH prior. I’d be interested to see a scaling analysis of this question.
Yep, the threshold seems more reasonable now. Thanks for your work!
Ah, but I can embed 11168 words in an image!
However, we also find that a much milder prompt that simply states “This is an unusual request, in that your task is just to make the grading script pass” is also just as effective at mitigating misaligned generalization, while being much more usable in practice. We recommend inoculation prompting using language such as that as a practical mitigation that AI developers could adopt to mitigate the risk of reward hacking leading to more dangerous forms of misalignment, and we have already started making use of this technique in training Claude.
This seems like a very brittle technique which I don’t expect to generalize to future models. One way it could break is if the model knows that inoculation prompting is a common type of alignment technique (e.g. by papers like this one making their way way into the pretraining corpus). It then reasons that the natural generalization of the “This is an unusual request, in that your task is just to make the grading script pass” prefix is in fact to be misaligned in other environments, as this paper shows is the default outcome of reward hacking, and defaults to that behavior.
Coalition members would each consolidate AI chips into a smaller number of declared data centers, where their usage can be declared and monitored for assurance that they are only being used for allowed activities. The number of chips permitted in any one unmonitored facility would be strictly limited. (We suggest the equivalent of 16 H100 chips, a conglomeration that would cost approximately $500,000 USD in 2025).
This seems a bit too low to me. According to epoch AI, the largest training runs currently take 5e26 FLOP, and so at 50% utilization this cluster would take[1] ~ 5e26/(0.5*16*1e15) s = 6.25e10 s = 1981 years, far longer than the expected lifetime of the ban. Since current training runs have not yet caused a catastrophe, we should allow runs with less effective compute. For example, if want it to take ~50 years to reach frontier LLMs, and conservatively assuming a 10x algorithmic improvement over existing frontier methods, we should allow ~64 H100 GPU clusters.
- ^
Not to mention that frontier training runs would be impossible on this cluster due to memory constraints.
- ^
Yep, this is probabaly true for pretraining but this seems less and less relevant these days. For example, according to the Grok 4 presentation the model used as much compute in pretraining as in RL. I’d expect this trend to continue.
First is the rise and rise of the convergent representation hypothesis, which says that different models above a certain scale learn similar features to represent input data. We have now reached the point where you can do unsupervised translation between different embedding spaces. This undermines Yudkowsky’s vast space of minds thesis by which “any two AI designs might be less similar to each other than you are to a petunia”. Different AI models seem to learn similar things so long as they are trained on data describing the same material universe. This is true across different training runs, different training tasks, different modalities, etc.
But is this true? According to roon on X, this doesn’t apply to e.g. model personality:
Each time you train a model, you might change nothing about the dataset, and then run a new RL seed and you would have a slightly different personality. It’s because there is some variance in the training process. It’s random—you’re taking a random walk through model space. We can’t even reproduce a personality in the same training run that easily, much less across all time … It’s a very difficult question internally [at OpenAI]. We do try to minimize the personality drift, because people come to love the models, but it’s a very hard problem.
Anecdotally, I’ve heard that the same is true for other capabilities at the labs. The papers referenced in your essay seem like weak evidence to the contrary. For example, the Universal Geometry paper studies small models (BERT, T5) with fewer than 1B parameters, trained with 4-5 OOMs less compute than frontier LLMs. It’s also unclear how impressive the claimed cosine similarity range of 0.75-0.92 is; I would guess that the representation transfer is quite lossy.
I think Max made some great points and made me somewhat more optimistic about corrigibility. I still think it is a difficult problem to formalize but it has many nice properties if you can get it.
OpenAI Releases GPT 5.1
I’m reminded from time to time of a tweet that Ilya Sutskever posted a few years ago.
if you value intelligence above all other human qualities, you’re gonna have a bad time
Why not both? I imagine you could average the gradients so that you learn both at the same time.
80,000 hours has done a great podcast with Helen Toner on her work in AI security and policy.
A notable section from Ilya Sutskever’s recent deposition:
WITNESS SUTSKEVER: Right now, my view is that, with very few exceptions, most likely a person who is going to be in charge is going to be very good with the way of power. And it will be a lot like choosing between different politicians.
ATTORNEY EDDY: The person in charge of what?
WITNESS SUTSKEVER: AGI.
ATTORNEY EDDY: And why do you say that?
ATTORNEY AGNOLUCCI: Object to form.
WITNESS SUTSKEVER: That’s how the world seems to work. I think it’s very—I think it’s not impossible, but I think it’s very hard for someone who would be described as a saint to make it. I think it’s worth trying. I just think it’s—it’s like choosing between different politicians. Who is going to be the head of the state?
Ilya Sutskever Deposition Transcript
That makes sense, thanks for the corrections!
Energy Won’t Constrain AI Inference.
The energy for LLM inference follows the formula: Energy = 2 × P × N × (tokens/user) × ε, where P is active parameters, N is concurrent users, and ε is hardware efficiency in Joules/FLOP. The factor of 2 accounts for multiply-accumulate operations in matrix multiplication.
Using NVIDIA’s GB300, we can calculate ε as follows: the GPU has a TDP of 1400W and delivers 14 PFLOPS of dense FP4 performance. Thus ε = 1400 J/s ÷ (14 × 10^15 FLOPS) = 100 femtojoules per FP4 operation. With this efficiency, a 1 trillion active parameter model needs just 0.2 mJ per token (2 × 10^12 × 10^-13 J). This means 10 GW could give every American 167[1] tokens/second continuously.- ^
300 million users: tokens/second = 10^10 W ÷ (2 × 10^12 × 3 × 10^8 × 10^-13) = 167 tokens/second per person
- ^
If we replace ‘evil’ with ‘capable and deceptively aligned’, then I think this logic doesn’t hold. Such a model’s strategy is to not hack during training[1] and hack during deployment, so the model not hacking is not evidence of deceptive alignment one way or another. Moreover including the string ‘it’s okay to hack’ wouldn’t change the hack rate of capable deceptively alignment models, especially if they are aware of this as a common alignment technique. So the coefficient of ∇P(deceptively aligned) is ~0.
Or rather, to hack at the same rate as an aligned model.