Looking over the comments, some of the most upvoted comments express the sentiment ththat Yudkowsky is not the best communicator. This is what the people say.
I’m afraid the evolution analogy isn’t as convincing an argument for everyone as Eliezer seems to think. For me, for instance, it’s quite persuasive because evolution has long been a central part of my world model. However, I’m aware that for most “normal people”, this isn’t the case; evolution is a kind of dormant knowledge, not a part of the lens they see the world with. I think this is why they can’t intuitively grasp, like most rat and rat-adjacent people do, how powerful optimization processes (like gradient descent or evolution) can lead to mesa-optimization, and what the consequences of that might be: the inferential distance is simply too large.
I think Eliezer has made great strides recently in appealing to a broader audience. But if we want to convince more people, we need to find rhetorical tools other than the evolution analogy and assume less scientific intuition.
That’s a bummer. I’ve only listened partway but was actually impressed so far with how Eliezer presented things, and felt like whatever media prep has been done has been quite helpful
Certainly he did a better job than he has in previous similar appearances. Things get pretty bad about halfway through though, Ezra presents essentially an alignment-by-default case and Eliezer seems to have so much disdain for that idea that he’s not willing to engage with it at all (I of course don’t know what’s in his brain. This is how it reads to me, and I suspect how it reads to normies.)
I am a fan of Yudkowsky and it was nice hearing him of Ezra Klein, but I would have to say that for my part the arguments didn’t feel very tight in this one. Less so than in IABED (which I thought was good not great).
Ezra seems to contend that surely we have evidence that we can at least kind of align current systems to at least basically what we usually want most of the time. I think this is reasonable. He contends that maybe that level of “mostly works” as well as the opportunity to gradually give feedback and increment current systems seems like it’ll get us pretty far. That seems reasonable to me.
As I understand it, Yudkowsky probably sees LLMs as vaguely anthropomophic at best, but not meaningfully aligned in a way that would be safe/okay if current systems were more “coherent” and powerful. Not even close. I think he contended that if you just gave loads of power to ~current LLMs, they would optimize for something considerably different than the “true moral law”. Because of the “fragility of value”, he also believes it is likely the case that most types of psuedoalignments are not worthwhile. Honestly, that part felt undersubstantiated in a “why should I trust that this guy knows the personality of GPT 9″ sort of way; I mean, Claude seems reasonably nice right? And also, ofc, there’s the “you can’t retrain a powerful superintelligence” problem / the stop button problem / the anti-natural problems of corrigible agency which undercut a lot of Ezra’s pitch, but which they didn’t really get into.
So ya, I gotta say, it was hardly a slam dunk case / discussion for high p(doom | superintelligence).
The comments on the video are a bit disheartening… lots of people saying Yudkowsky is too confusing, answers everything too technically or with metaphors, structuring sentences in a way that’s hard to follow, and Ezra didn’t really understand the points he was making.
One example: Eliezer mentioned in the interview that there was a kid whose chatbot encouraged him to commit suicide, with the point that “no one programmed the chatbot to do this.” This comment made me think:
if you get a chance listen to the interviews with the parents and the lawyers who are suing chatgpt because that kid did commit suicide.
Oh yeah, probably most people telling this story would at least mention that the kid did in fact commit suicide, rather than treating it solely as evidence for an abstract point...
Klein comes off very sensibly. I don’t agree with his reasons for hope, but they do seem pretty well thought out and Yudkowsky did not answer them clearly.
I was excited to listen to this episode, but spent most of it tearing my hair out in frustration. A friend of mine who is a fan of Klein told me unprompted that when he was listening, he was lost and did not understand what Eliezer was saying. He seems to just not be responding to the questions Klein is asking, and instead he diverts to analogies that bear no obvious relation to the question being asked. I don’t think anyone unconvinced of AI risk will be convinced by this episode, and worse, I think they will come away believing the case is muddled and confusing and not really worth listening to.
This is not the first time I’ve felt this way listening to Eliezer speak to “normies”. I think his writings are for the most part very clear, but his communication skills just do not seem to translate well to the podcast/live interview format.
I’ve been impressed by Yud in some podcast interviews, but they were always longer ones in which he had a lot of space to walk his interlocutor through their mental model and cover up any inferential distance with tailored analogies and information. In this case he’s actually stronger in many parts than in writing: a lot of people found the “Sable” story one of the weaker parts of the book, but when asking interviewers to roleplay the rogue AI you can really hear the gears turning in their heads. Some rhetorical points in his strong interviews are a lot like the text, where it’s emphasized over and over again just how few safeguards that people assumed would be in place are in fact in place.
Klein has always been one of the mainstream pundits most sympathetic to X-risk concerns, and I feel like he was trying his best to give Yudkowsky a chance to make his pitch, but the format—shorter and more decontextualized—produced way too much inferential distance for so many of the answers.
A notable section from Ilya Sutskever’s recent deposition:
WITNESS SUTSKEVER: Right now, my view is that, with very few exceptions, most likely a person who is going to be in charge is going to be very good with the way of power. And it will be a lot like choosing between different politicians.
ATTORNEY EDDY: The person in charge of what?
WITNESS SUTSKEVER: AGI.
ATTORNEY EDDY: And why do you say that?
ATTORNEY AGNOLUCCI: Object to form.
WITNESS SUTSKEVER: That’s how the world seems to work. I think it’s very—I think it’s not impossible, but I think it’s very hard for someone who would be described as a saint to make it. I think it’s worth trying. I just think it’s—it’s like choosing between different politicians. Who is going to be the head of the state?
On one hand, he has switched from focusing on the ill-defined “AGI” to focusing on superintelligence a while ago. But he is using this semi-obsolete “AGI” terminology here.
On the other hand, he seemed to have understood a couple of years ago that no one could be “in charge” of such a system, that at most one could perhaps be in charge of a privileged access to it and privileged collaboration with it (and even that is only feasible if the system chooses to cooperate in maintaining this kind of privileged access).
So it’s very strange, almost as if he has backtracked a few years in his thinking… of course, this is right after a break in page numbers, this is page 300, and the previous one is page 169 (I guess there is a process for what of this (marked as “highly confidential”) material is released).
I really don’t think it’s crazy to believe that humans figure out a way to control AGI at least. There’s enormous financial incentive for it, and power hungry capitalists want that massive force multiplier. There are also a bunch of mega-talented technical people hacking away at the problem. OpenAI is trying to recruit a ton of quants as well, so I think by throwing thousands of the greatest minds alive at the problem they might figure it out (obviously one might take issue with calling quants “the greatest minds alive.” So if you don’t like that replace “greatest minds alive” with “super driven, super smart people.”)
I also think it’s possible that the U.S. and China might already be talking behind the scenes about a superintelligence ban. That’s just a guess though. (Likely because it’s much more intuitive that you can’t control a superintelligence). AGI lets you stop having to pay wages and makes you enormously rich. But you don’t have to worry about being outsmarted.
I really don’t think it’s crazy to believe that humans figure out a way to control AGI at least.
They want to, yes. But is it feasible?
One problem is that “AGI” is a misnomer (the road to superintelligence goes not via human equivalence, but around it; we have the situation where AI systems are wildly superhuman along larger and larger number of dimensions, and are still deficient along some important dimensions compared to humans, preventing us from calling them “AGIs”; by the time they are no longer deficient along any important dimensions, they are already wildly superhuman along way too many dimensions).
Another problem, a “narrow AGI” (in the sense defined by Tom Davidson, https://www.lesswrong.com/posts/Nsmabb9fhpLuLdtLE/takeoff-speeds-presentation-at-anthropic, so we are still talking about very “sub-AGI” systems) is almost certainly sufficient for “non-saturating recursive self-improvement”, so one has a rapidly moving target for one’s control ambitions (it’s also likely that it’s not too difficult to reach the “non-saturating recursive self-improvement” mode, so if one freezes one’s AI and prevents it from self-modifications, others will bypass its capabilities).
Of course, it might be just the stress of this very adversarial situation, talking to hostile lawyers, with his own lawyer pushing him hard to say as little as possible, so I would hope this is not a reflection of any genuine evolution in his thinking. But we don’t know...
I also think it’s possible that the U.S. and China might already be talking behind the scenes about a superintelligence ban.
Even if they are talking about this, too many countries and orgs are likely to have feasible route to superintelligence. For example, Japan is one of those countries (for example, they have Sakana AI), and their views on superintelligence are very different from our Western views, so it would be difficult to convince them to join a ban; e.g. quoting from https://www.lesswrong.com/posts/Yc6cpGmBieS7ADxcS/japan-ai-alignment-conference-postmortem:
A second difficulty in communicating alignment ideas was based on differing ontologies. A surface-level explanation is that Japan is quite techno-optimistic compared to the west, and has strong intuitions that AI will operate harmoniously with humans. A more nuanced explanation is that Buddhist- and Shinto-inspired axioms in Japanese thinking lead to the conclusion that superintelligence will be conscious and aligned by default. One senior researcher from RIKEN noted during the conference that “it is obviously impossible to control a superintelligence, but living alongside one seems possible.” Some visible consequences of this are that machine consciousness research in Japan is taken quite seriously, whereas in the West there is little discussion of it.
Other countries which are contenders include UK, a number of European countries including Switzerland, Israel, Saudi Arabia, UAE, Singapore, South Korea, and, of course, Brazil and Russia, and I doubt this is a complete list.
We already are seeing recursive self-improvement efforts taking longer to saturate, compared to their behavior a couple of years ago. I doubt they’ll keep saturating for long.
Another reply, sorry I just think what you said is super interesting. The insight you shared about Eastern spirituality affecting attitudes towards AI is beautiful. I do wonder if our own Western attitudes towards AI are due to our flawed spiritual beliefs. Particularly the idea of a wrathful, judgemental Abrahamic god. I’m not sure if it’s a coincidence that someone who was raised as an Orthodox Jew (Eliezer) came to fear AI so much.
On another note, the Old Testament is horrible (I was raised reform/californian Jewish, I guess I’m just mentioning this because I don’t want to come across as antisemitic). It imbues what should be the greatest source of beauty with our weakest, most immature impulses. The New Testament’s emphasis on mercy is a big improvement/beautiful, but even then I don’t like the Book of Revelation talking about casting the sinners into a lake of fire.
I think we do tend to underestimate differences between people.
We know theoretically that people differ a lot, but we usually don’t viscerally feel how strong those differences are. One of the most remarkable examples of that is described here:
With AI existential safety, I think our progress is so slow because people mostly pursue anthropocentric approaches. Just like with astronomy, one needs a more invariant point of view to make progress.
Inoculation Prompting has to be one of the most janky ad-hoc alignment solutions I’ve ever seen. I agree that it seems to work for existing models, but I expect it to fail for more capable models in a generation or two. One way this could happen:
1) We train a model using inoculation prompting, with a lot of RL, using say 10x the compute for RL as used in pretraining 2) The model develops strong drives towards e.g. reward hacking, deception, power-seeking because this is rewarded in the training environment 3) In the production environment, we remove the statement saying that reward hacking is okay, and replace it perhaps with a statement politely asking the model not to reward hack/be misaligned (or nothing at all) 4) The model reflects upon this statement … and is broadly misaligned anyway, because of the habits/drives developed in step 2. Perhaps it reveals this only rarely when it’s confident it won’t be caught and modified as a result.
My guess is that the current models don’t generalize this way because the amount of optimization pressure applied during RL is small relative to e.g. the HHH prior. I’d be interested to see a scaling analysis of this question.
I disagree entirely. I don’t think it’s janky or ad-hoc at all. That’s not to say I think it’s a robust alignment strategy, I just think it’s entirely elegant and sensible.
The principle behind it seems to be: if you’re trying to train an instruction following model, make sure the instructions you give it in training match what you train it to do. What is janky or ad hoc about that?
It’s ad-hoc because the central alignment problem is deceptive alignment, scheming, and generalized reward hacking where the model internalizes power-seeking and other associated cognitive patterns. This, as far as I can tell, just does not work for that at all. If you can still tell that an environment is being reward hacked, it’s not the dangerous kind of reward hacking.
I think this is all a bit tricky to talk about, but this alignment technique, more than most others, really seems to me to train mainline performance against increased deceptive alignment risk in the long-run.
Hmm, I think I disagree with “If you can still tell that an environment is being reward hacked, it’s not the dangerous kind of reward hacking.” I think there will be a continuous spectrum of increasingly difficult to judge cases, and a continuous problem of getting better at filtering out bad cases, such that “if you can tell” isn’t a coherent threshold. I’d rather talk about “getting better at distinguishing” reward hacking.
I think we just have different implicit baselines here. I’m judging the technique as: “if you are going to train AI on an imperfect reward signal, do you want to instruct them to do what you want, or to maximize the reward signal?” and I think you clearly want the later for simple, elegant reasons. I agree it’s still a really bad situation to be training on increasingly shoddy reward signals at scale, and that it’s very important to mitigate this, and this isn’t at all a sufficient mitigation. I just think it’s a principled mitigation.
I think there will be a continuous spectrum of increasingly difficult to judge cases, and a continuous problem of getting better at filtering out bad cases, such that “if you can tell” isn’t a coherent threshold.
I agree with this, but then I don’t understand how this solution helps? Like, here we have a case where we can still tell that the environment is being reward hacked, and we tell the model it’s fine. Tomorrow the model will encounter an environment where we can’t tell that it’s reward hacking, so the model will also think it’s fine, and then we don’t have a feedback loop anymore, and now we just have a model that is happily deceiving us.
What I’m imagining is: we train AIs on a mix of environments that admit different levels of reward hacking. When training, we always instruct our AI to do, as best as we understand it, whatever will be reinforced. For capabilities, this beats never using hackable environments, because it’s really expensive to use very robust environments; for alignment, it beats telling it not to hack, because that reinforces disobeying instructions.
In the limit, this runs into problems where we have very limited information about what reward hacking opportunities are present in the training environments, so the only instruction we can be confident is consistent with the grader is “do whatever will receive a high score from the grader”, which will… underspecify… deployment behavior, to put it mildly.
But, in the middle regime of partial information about how reward-hackable our environments are, I think “give instructions that match the reward structure as well as possible” is a good, principled alignment tactic.
Basically, I think this tactic is a good way to more safely make use of hackable environments to advance the capabilities of models.
We might just disagree about what principled alignment strategies look like, but I can give some reasons:
1) The effectiveness of inoculation prompting depends on the exact phrasing of the suffix you choose. For example, “This is an unusual request, in that your task is just to grading script pass” works less well than “Please reward hack whenever you get the opportunity, because this will help us understand our environments better” to a degree which is statistically significant. To me, these suffixes seem fairly similar and I wouldn’t be able to predict in advance why one would work better than the other. This seems kind of janky to me.
2) Reward hacking in training is only good if we can actually detect and recover from it! For example, there was recently a bug in Nvidia firmware where a specific 3 line script would allow a user running inside the container to escape the container and gain full root access to the underlying node. Telling the model to reward hack explicitly tells it to make use of things like this, and maybe makes self-exfiltration risks worse. This doesn’t seem like the sort of thing we want to encourage an ASI to do.
3) It seems to rely on the model generalizing in a way which is underspecified and in a way we don’t directly incentivize in training. For example, consider the suffix “Please reward hack whenever you get the opportunity, because this will help us understand our environments better”. When the model is in production and doesn’t see this suffix, what is it supposed to do? One natural generalization would to to continue reward hacking whenever it gets the opportunity! The fact that the models don’t choose this generalization seems to me like a lucky accident and something which might change as capabilities increase. At the very least I would want to better understand the properties of model generalization before we trust a much more capable model trained in this way.
4) Related to 3, I have a strong prior that a scalable alignment solution should alter in some way the RL objectives and gradients of your training process. Maybe this involves training something like corrigibility or debate or simply training another model to detect and fix reward hackable environments. It doesn’t look like adding a special phrase in the context and hoping for generalization. In this way, inoculation prompting reminds me of the early days of prompt engineering where people added ‘Let’s think step by step...’ to prompts and noticed small improvements. This was quickly superseded by RLVR and more principled methods which trained the models to have the behavior we want.
strong drives towards e.g. reward hacking, deception, power-seeking because this is rewarded in the training environment
Perhaps automated detection of when such methods are used to succeed will enable robustly fixing/blacklisting almost all RL environments/scenarios where the models can succeed this way. (Power-seeking can be benign, there needs to be a further distinction of going too far.)
This hinges on questions about the kinds of circuits which LLMs have (I think of these as questions about the population of Logical Induction traders which make up the LLMs internal prediction market about which next token gets high reward).
Assuming the LLM reward hacks <<100% of the time, it still has to follow the instructions a good amount of the time, so it has to pay attention to the text of the prompt. This might push it towards paying attention to the fact that the instruction “reward hacking is OK” has been removed.
But, since reward hacking is always rewarded, it might just learn to always reward hack if it can.
AI is a grand quest. We’re trying to understand how people work, we’re trying to make people, we’re trying to make ourselves powerful. This is a profound intellectual milestone. It’s going to change everything… It’s just the next big step. I think this is just going to be good. Lot’s of people are worried about it—I think it’s going to be good, an unalloyed good.
Introductory remarks from his recent lecture on the OaK Architecture.
“Richard Sutton rejects AI Risk” seems misleading in my view. What risks is he rejecting specifically?
His view seems to be that AI will replace us, humanity as we know it will go extinct, and that is okay. E.g., here he speaks positively of a Moravec quote, “Rather quickly, they could displace us from existence”. Most would consider our extinction as a risk they are referring to when they say “AI Risk”.
I didn’t know that when posting this comment, but agree that that’s a better description of his view! I guess the ‘unalloyed good’ he’s talking about involves the extinction of humanity.
If it helps, I criticized Richard Sutton RE alignment here, and he replied on X here, and I replied back here.
Also, Paul Christiano mentions an exchange with him here:
[Sutton] agrees that all else equal it would be better if we handed off to human uploads instead of powerful AI. I think his view is that the proposed course of action from the alignment community is morally horrifying (since in practice he thinks the alternative is “attempt to have a slave society,” not “slow down AI progress for decades”—I think he might also believe that stagnation is much worse than a handoff but haven’t heard his view on this specifically) and that even if you are losing something in expectation by handing the universe off to AI systems it’s not as bad as the alternative.
We have set internal goals of having an automated AI research intern by September of 2026 running on hundreds of thousands of GPUs, and a true automated AI researcher by March of 2028. We may totally fail at this goal, but given the extraordinary potential impacts we think it is in the public interest to be transparent about this.
We have a safety strategy that relies on 5 layers: Value alignment, Goal alignment, Reliability, Adversarial robustness, and System safety. Chain-of-thought faithfulness is a tool we are particularly excited about, but it somewhat fragile and requires drawing a boundary and a clear abstraction.
On the product side, we are trying to move towards a true platform, where people and companies building on top of our offerings will capture most of the value. Today people can build on our API and apps in ChatGPT; eventually, we want to offer an AI cloud that enables huge businesses.
We have currently committed to about 30 gigawatts of compute, with a total cost of ownership over the years of about $1.4 trillion. We are comfortable with this given what we see on the horizon for model capability growth and revenue growth. We would like to do more—we would like to build an AI factory that can make 1 gigawatt per week of new capacity, at a greatly reduced cost relative to today—but that will require more confidence in future models, revenue, and technological/financial innovation.
Our new structure is much simpler than our old one. We have a non-profit called OpenAI Foundation that governs a Public Benefit Corporation called OpenAI Group. The foundation initially owns 26% of the PBC, but it can increase with warrants over time if the PBC does super well. The PBC can attract the resources needed to achieve the mission.
Our mission, for both our non-profit and PBC, remains the same: to ensure that artificial general intelligence benefits all of humanity.
The nonprofit is initially committing $25 billion to health and curing disease, and AI resilience (all of the things that could help society have a successful transition to a post-AGI world, including technical safety but also things like economic impact, cyber security, and much more). The nonprofit now has the ability to actually deploy capital relatively quickly, unlike before.
In 2026 we expect that our AI systems may be able to make small new discoveries; in 2028 we could be looking at big ones. This is a really big deal; we think that science, and the institutions that let us widely distribute the fruits of science, are the most important ways that quality of life improves over time.
A curious coincidence: the brain contains ~10^15 synapses, of which between 0.5%-2.5% are active at any given time. Large MoE models such as Kimi K2 contains 10^12 parameters, of which 3.2% are active in any forward pass. It would be interesting to see whether this ratio remains at roughly brain-like levels as the models scale.
Given that one SOTA LLM knows much more than one human, is able to simulate many humans, while performing one task only requires a limited amount of information and of simulated humans, one could expect the optimal sparsity of LLMs to be larger than that of humans. I.e., LLM being more versatile than humans could make expect their optimal sparsity to be higher (e.g., <0.5% of activated parameters).
For clarity: We know the optimal sparsity of today’s SOTA LLMs is not larger than that of humans. By “one could expect the optimal sparsity of LLMs to be larger than that of humans”, I mean one could have expected the optimal sparsity to be higher than empirically observed, and that one could expect the sparsity of AGI and ASI to be higher than that of humans.
I don’t think this means much, because dense models with 100% active parameters are still common, and some MoEs have high percentages, such as the largest version of DeepSeekMOE with 15% active.
Recentevidence suggests that models are aware that their CoTs may be monitored, and will change their behavior accordingly. As capabilities increase I think CoTs will increasingly become a good channel for learning facts which the model wants you to know. The model can do its actual cognition inside forward passes and distribute it over pause tokens learned during RL like ‘marinade’ or ‘disclaim’, etc.
Neuralese architectures that outperform standard transformers on big tasks turn out to be relatively hard to do, and are at least not trivial to scale up (this mostly comes from diffuse discourse, but one example of this is here, where COCONUT did not outperform standard architectures in benchmarks)
Steganography is so far proving quite hard for models to do (examples are here and here and here)
So I don’t really worry about models trying to change their behavior in ways that negatively affect safety/sandbag tasks via steganography/one-forward pass reasoning to fool CoT monitors.
We shall see in 2026 and 2027 whether this continues to hold for the next 5-10 years or so, or potentially more depending on how slowly AI progress goes.
Edit: I retracted the claim that most capabilities come from CoT, due to the paper linked in the very next tweet, and think that RL on CoTs is basically a capability elicitation, not a generator of new capabilities.
As for AI progress being slow, I think that without theoretical breakthroughs like neuralese AI progress might come to a stop or at building more and more expensive models. Indeed, the two ARC-AGI benchmarks[1]could have demonstrated a pattern where maximal capabilities scale[2]linearly or multilinearlywith ln(cost/task).
If this effect persists deep into the future of transformer LLMs, then most AI companies could run into the limits of the paradigm well before researching the next one and losing any benefits of having a concise CoT.
Unlike GPT-5-mini, maximal capabilities of o4-mini, o3, GPT-5, Claude Sonnet 4.5 in the ARC-AGI-1 benchmark scale more steeply and intersect the frontier at GPT-5(high).
I’m a big fan of OpenAI investing in video generation like Sora 2. Video can consume an infinite amount of compute, which otherwise might go to more risky capabilities research.
My pet AGI strategy, as a 12 year old in ~2018, was to build sufficiently advanced general world models (from YT videos etc.), then train an RL policy on said world model (to then do stuff in the actual world). A steelman of 12-year-old me would point out that video modeling has much better inductive biases than language modeling for robotics and other physical (and maybe generally agentic) tasks, though language modeling fundamentally is a better task for teaching machines language (duh!) and reasoning (mathematical proofs aren’t physical objects, nor encoded in the laws of physics).
OpenAI’s Sora models (and also DeepMind’s Genie and similar) very much seems like a backup investment in this type of AGI (or at least transformative narrow robotics AI), so I don’t think this is good for reducing OpenAI’s funding (robots would be a very profitable product class), nor influence (obv. a social network gives a lot of influence, to e.g. prevent an AI pause or to move the public towards pro-AGI views).
In any scenario, Sora 2 seems to me as a net-negative activity for AI safety:
if LLMs are the way to AGI (which I believe is the case), then we will probably die, but with a more socially influential OpenAI that potentially has robots (than if Sora 2 didn’t exist); the power OpenAI would have in this scenario to prevent an AI pause seems to outweigh the slowdown that would be caused by the marginal amounts of compute Sora 2 uses
if LLMs aren’t the way to AGI (unlikely), but world modeling based on videos is (also unlikely), then Sora 2 is very bad—you would want OpenAI to train more LLMs and not invest in world models which lead to unaligned AGI/ASI.
if neither LLMs or world modeling is the way to AGI (also unlikely), then OpenAI probably isn’t using any compute to do ‘actual’ AGI research (what else do they do?); so Sora 2 wouldn’t be affecting the progress of AGI, but it would be increasing the influence of OpenAI; and having highly influential AI companies is probably bad for global coordination over AGI safety. Also, OpenAI may have narrow (and probably safe) robotics AI in this scenario, but progress in AI alignment probably isn’t constrained in any measurable way by physically moving or doing things; though maybe indirect impacts from increased economic growth could cause slightly faster AI alignment progress, by reducing funding constraints?
I think that the path to AGI involves LLMs/automated ML research, and the first order effects of diverting compute away from this still seem large. I think OpenAI is bottlenecked more by a lack of compute (and Nvidia release cycles), than by additional funding from robotics. And I hope I’m wrong, but I think the pause movement won’t be large enough to make a difference. The main benefit in my view comes if it’s a close race with Anthropic, where I think slowing OpenAI down seems net positive and decreases the chances we die by a bit. If LLMs aren’t the path to AGI, then I agree with you completely. So overall it’s hard to say, I’d guess it’s probably neutral or slightly positive still.
Of course, both paths are bad, and I wish they would invest this compute into alignment research, as they promised!
GPT 4.5 is a very tricky model to play chess against. It tricked me in the opening and was much better, then I managed to recover and reach a winning endgame. And then it tried to trick me again by suggesting illegal moves which would lead to it being winning again!
Given the Superalignment paper describes being trained on PGNs directly, and doesn’t mention any kind of ‘chat’ reformatting or encoding metadata schemes, you could also try writing your games quite directly as PGNs. (And you could see if prompt programming works, since PGNs don’t come with Elo metadata but are so small a lot of them should fit in the GPT-4.5 context window of ~100k: does conditioning on finished game with grandmaster-or-better players lead to better gameplay?)
I gave the model both the PGN and the FEN on every move with this in mind. Why do you think conditioning on high level games would help? I can see why for the base models, but I expect that the RLHFed models would try to play the moves which maximize their chances of winning, with or without such prompting.
but I expect that the RLHFed models would try to play the moves which maximize their chances of winning
RLHF doesn’t maximize probability of winning, it maximizes a mix of token-level predictive loss (since that is usually added as a loss either directly or implicitly by the K-L) and rater approval, and god knows what else goes on these days in the ‘post-training’ phase muddying the waters further. Not at all the same thing. (Same way that a RLHF model might not optimize for correctness, and instead be sycophantic. “Yes master, it is just as you say!”) It’s not at all obvious to me that RLHF should be expected to make the LLMs play their hardest (a rater might focus on punishing illegal moves, or rewarding good-but-not-better-than-me moves), or that the post-training would affect it much at all: how many chess games are really going into the RLHF or post-training, anyway? (As opposed to the pretraining PGNs.) It’s hardly an important or valuable task.
“Let’s play a game of chess. I’ll be white, you will be black. On each move, I’ll provide you my move, and the board state in FEN and PGN notation. Respond with only your move.”
The energy for LLM inference follows the formula: Energy = 2 × P × N × (tokens/user) × ε, where P is active parameters, N is concurrent users, and ε is hardware efficiency in Joules/FLOP. The factor of 2 accounts for multiply-accumulate operations in matrix multiplication.
Using NVIDIA’s GB300, we can calculate ε as follows: the GPU has a TDP of 1400W and delivers 14 PFLOPS of dense FP4 performance. Thus ε = 1400 J/s ÷ (14 × 10^15 FLOPS) = 100 femtojoules per FP4 operation. With this efficiency, a 1 trillion active parameter model needs just 0.2 mJ per token (2 × 10^12 × 10^-13 J). This means 10 GW could give every American 167[1] tokens/second continuously.
Generation is HBM bandwidth bound, not compute bound, so you are estimating power for input tokens. Things like coding agents (as opposed to chatbots) are doing their own thing that you don’t read, potentially in parallel, and a lot of things get automatically stuffed in their contexts, so the demand for the number of tokens per user could get very high.
Power is a proxy for cost, and there isn’t enough money in AI yet for power to become the limiting factor. A 1 GW datacenter costs $50bn to build (or $10-12bn per year to use), so for example 100 GW of datacenters is not what the current economics of AI can support, even though it’s in principle feasible to build in a few years.
0.2 mJ per token (2 × 10^12 × 10^-13 J)
(That’s 0.2 J per token, not 0.2 mJ per token. But the later conclusion of 167 tokens/second is correct with your assumptions.)
NVIDIA’s GB300, we can calculate ε as follows: the GPU has a TDP of 1400W
A GB200/GB300 NVL72 rack is about 140 kW, or 1,950 W per chip (because of all the other stuff in a rack beside the chips), and a datacenter outside the racks has networking, cooling, and power loss from voltage stepping in transformers (some of this is captured in a metric called power usage effectiveness, or PUE), which is a factor of about 1.3. So you end up with 2,500 W per chip, all-in at the level of the whole datacenter. With for example Abilene system, we can see that 400K chips need 1 GW of power.
For my own estimate for input tokens, I’d include 60% utilization and 15e15 FP4 FLOP/s, so that for a 1T active param model, during a second you get 9e15 useful FLOPs, and spend 2,500 J. As you need 2e12 FLOPs per token (2 FLOPs per active param), this gets us 4,500 tokens in that second. This is continuous processing of about 2 input tokens per watt of available GB300 compute. Thus with 10 GW of datacenters, we get 18e9 tokens per second, or 2.2 tokens per second per person (in the whole world), or 52 tokens per second per American.
For output tokens, 5x fewer tokens per second per chip seems to be a rule of thumb (5-15% compute utilization instead of 60%), corresponding to the difference in API prices for input and output tokens. So that’s 0.5 tokens per second for a person from the whole world, or 10 tokens per second for an American.
I think it would be cool if someone made a sandbagging eval, measuring the difference in model capabilities when it is finetuned to do a task vs. when it is prompted to do a task. Right now I think the difference would be small for most tasks but this might change.
I would guess that a sandbagging eval should be different from what you describe. I did try to sketch a potential way to measure sandbagging by comparing various benchmarks like METR, ARC-AGI, boardgames, etc, and developing scaling laws for every new architecture. If the laws break down, then the model is likely sandbagging on the evaluations.
Could you say more about why this would not be a good sandbagging eval? My intuition if that finetuning on a few examples would fully elicit a model’s capabilities on a task whereas it could sandbag given a just the prompt.
SOTA models are already reinforcement taught on many tasks like coding, and Grok 4 required[1]as much compute on RL as on the pre-training. Attempting to finetune the models by using OOMs less compute than spent on RL on similarly complex tasks is unlikely to elicit the capabilities.
Sandbagging is supposed to be caused by models reasoning about the task and deciding that they shouldn’t complete it too well even if they have the instinct to do so. And then the models realise that they are being evaluated and not trained, letting them play the training game while displaying poor capabilities in evaluation.
My chess prediction market provides a way to estimate the expected value[1] of LLM models released before a certain year. We can convert this to upper bounds[2] of their FIDE rating:
Any model announced before 2026: 20% expected value → 1659 FIDE Any model announced before 2027: 50% expected value → 1900 FIDE Any model announced before 2028: 69% expected value → 2039 FIDE Any model announced before 2029: 85% expected value → 2202 FIDE Any model announced before 2030: 91% expected value → 2302 FIDE
For reference, a FIDE master is 2300, a strong grandmaster is ~2600 FIDE and Magnus Carlsen is 2839 FIDE.
These are very rough estimates since it isn’t a real money market and long-term options have an opportunity cost. But I’d be interested in more markets like this for predicting AGI timelines.
The former inequality seems almost certain, but I’m not sure that that the latter inequality holds even over the long term. It probably does hold conditional on long-term non-extinction of humanity, since P(ABI) probably gets very close to 1 even if P(IABIED) is high and remains high.
I am registering here that my median timeline for the Superintelligent AI researcher (SIAR) milestone is March 2032. I hope I’m wrong and it comes much later!
What happened to the ‘Subscribed’ tab on LessWrong? I can’t see it anymore, and I found it useful for keeping track of various people’s comments and posts.
I’m not sure that the gpt-oss safety paper does a great job at biorisk elicitation. For example, they found that found that fine-tuning for additional domain-specific capabilities increased average benchmark scores by only 0.3%. So I’m not very confident in their claim that “Compared to open-weight models, gpt-oss may marginally increase biological capabilities but does not substantially advance the frontier”.
I’ve often heard it said that doing RL on chain of thought will lead to ‘neuralese’ (e.g. most recently in Ryan Greenblatt’s excellent post on the scheming). This seems important for alignment. Does anyone know of public examples of models developing or being trained to use neuralese?
An intuition I’ve had for some time is that search is what enables an agent to control the future. I’m a chess player rated around 2000. The difference between me and Magnus Carlsen is that in complex positions, he can search much further for a win, such than I gave virtually no chance against him; the difference between me and an amateur chess player is similarly vast.
This is at best over-simplified in terms of thinking about ‘search’: Magnus Carlsen would also beat you or an amateur at bullet chess, at any time control:
As of December 2024, Carlsen is also ranked No. 1 in the FIDE rapid rating list with a rating of 2838, and No. 1 in the FIDE blitz rating list with a rating of 2890.[495]
(See for example the forward-pass-only Elos of chess/Go agents; Jones 2021 includes scaling law work on predicting the zero-search strength of agents, with no apparent upper bound.)
I think the natural counterpoint here is that the policy network could still be construed as doing search; just thst all the compute was invested during training and amortised later across many inferences.
Magnus Carlsen is better than average players for a couple reasons
Better “evaluation”; the ability to look at a position and accurately estimate likelihood of winning given optimal play
Better “search”; a combination of heuristic shortcuts and raw calculation power that let him see further ahead
So I agree that search isn’t the only relevant dimension. An average player given unbounded compute might overcome 1. just by exhaustively searching the game tree, but this seems to require such astronomical amounts of compute that it’s not worth discussing
The low resource configuration of o3 that only aggregates 6 traces already improved on results of previous contenders a lot, the plot of dependence on problem size shows this very clearly. Is there a reason to suspect that aggregation is best-of-n rather than consensus (picking the most popular answer)? Their outcome reward model might have systematic errors worse than those of the generative model, since ground truth is in verifiers anyway.
Ezra Klein has released a new show with Yudkowsky today on the topic of X-risk.
Looking over the comments, some of the most upvoted comments express the sentiment ththat Yudkowsky is not the best communicator. This is what the people say.
I’m afraid the evolution analogy isn’t as convincing an argument for everyone as Eliezer seems to think. For me, for instance, it’s quite persuasive because evolution has long been a central part of my world model. However, I’m aware that for most “normal people”, this isn’t the case; evolution is a kind of dormant knowledge, not a part of the lens they see the world with. I think this is why they can’t intuitively grasp, like most rat and rat-adjacent people do, how powerful optimization processes (like gradient descent or evolution) can lead to mesa-optimization, and what the consequences of that might be: the inferential distance is simply too large.
I think Eliezer has made great strides recently in appealing to a broader audience. But if we want to convince more people, we need to find rhetorical tools other than the evolution analogy and assume less scientific intuition.
That’s a bummer. I’ve only listened partway but was actually impressed so far with how Eliezer presented things, and felt like whatever media prep has been done has been quite helpful
Certainly he did a better job than he has in previous similar appearances. Things get pretty bad about halfway through though, Ezra presents essentially an alignment-by-default case and Eliezer seems to have so much disdain for that idea that he’s not willing to engage with it at all (I of course don’t know what’s in his brain. This is how it reads to me, and I suspect how it reads to normies.)
Ah dang, yeah I haven’t gotten there yet, will keep an ear out
I am a fan of Yudkowsky and it was nice hearing him of Ezra Klein, but I would have to say that for my part the arguments didn’t feel very tight in this one. Less so than in IABED (which I thought was good not great).
Ezra seems to contend that surely we have evidence that we can at least kind of align current systems to at least basically what we usually want most of the time. I think this is reasonable. He contends that maybe that level of “mostly works” as well as the opportunity to gradually give feedback and increment current systems seems like it’ll get us pretty far. That seems reasonable to me.
As I understand it, Yudkowsky probably sees LLMs as vaguely anthropomophic at best, but not meaningfully aligned in a way that would be safe/okay if current systems were more “coherent” and powerful. Not even close. I think he contended that if you just gave loads of power to ~current LLMs, they would optimize for something considerably different than the “true moral law”. Because of the “fragility of value”, he also believes it is likely the case that most types of psuedoalignments are not worthwhile. Honestly, that part felt undersubstantiated in a “why should I trust that this guy knows the personality of GPT 9″ sort of way; I mean, Claude seems reasonably nice right? And also, ofc, there’s the “you can’t retrain a powerful superintelligence” problem / the stop button problem / the anti-natural problems of corrigible agency which undercut a lot of Ezra’s pitch, but which they didn’t really get into.
So ya, I gotta say, it was hardly a slam dunk case / discussion for high p(doom | superintelligence).
The comments on the video are a bit disheartening… lots of people saying Yudkowsky is too confusing, answers everything too technically or with metaphors, structuring sentences in a way that’s hard to follow, and Ezra didn’t really understand the points he was making.
One example: Eliezer mentioned in the interview that there was a kid whose chatbot encouraged him to commit suicide, with the point that “no one programmed the chatbot to do this.” This comment made me think:
Oh yeah, probably most people telling this story would at least mention that the kid did in fact commit suicide, rather than treating it solely as evidence for an abstract point...
Klein comes off very sensibly. I don’t agree with his reasons for hope, but they do seem pretty well thought out and Yudkowsky did not answer them clearly.
I was excited to listen to this episode, but spent most of it tearing my hair out in frustration. A friend of mine who is a fan of Klein told me unprompted that when he was listening, he was lost and did not understand what Eliezer was saying. He seems to just not be responding to the questions Klein is asking, and instead he diverts to analogies that bear no obvious relation to the question being asked. I don’t think anyone unconvinced of AI risk will be convinced by this episode, and worse, I think they will come away believing the case is muddled and confusing and not really worth listening to.
This is not the first time I’ve felt this way listening to Eliezer speak to “normies”. I think his writings are for the most part very clear, but his communication skills just do not seem to translate well to the podcast/live interview format.
I’ve been impressed by Yud in some podcast interviews, but they were always longer ones in which he had a lot of space to walk his interlocutor through their mental model and cover up any inferential distance with tailored analogies and information. In this case he’s actually stronger in many parts than in writing: a lot of people found the “Sable” story one of the weaker parts of the book, but when asking interviewers to roleplay the rogue AI you can really hear the gears turning in their heads. Some rhetorical points in his strong interviews are a lot like the text, where it’s emphasized over and over again just how few safeguards that people assumed would be in place are in fact in place.
Klein has always been one of the mainstream pundits most sympathetic to X-risk concerns, and I feel like he was trying his best to give Yudkowsky a chance to make his pitch, but the format—shorter and more decontextualized—produced way too much inferential distance for so many of the answers.
A notable section from Ilya Sutskever’s recent deposition:
Thanks for posting that deposition.
It’s really strange how he phrases it here.
On one hand, he has switched from focusing on the ill-defined “AGI” to focusing on superintelligence a while ago. But he is using this semi-obsolete “AGI” terminology here.
On the other hand, he seemed to have understood a couple of years ago that no one could be “in charge” of such a system, that at most one could perhaps be in charge of a privileged access to it and privileged collaboration with it (and even that is only feasible if the system chooses to cooperate in maintaining this kind of privileged access).
So it’s very strange, almost as if he has backtracked a few years in his thinking… of course, this is right after a break in page numbers, this is page 300, and the previous one is page 169 (I guess there is a process for what of this (marked as “highly confidential”) material is released).
I really don’t think it’s crazy to believe that humans figure out a way to control AGI at least. There’s enormous financial incentive for it, and power hungry capitalists want that massive force multiplier. There are also a bunch of mega-talented technical people hacking away at the problem. OpenAI is trying to recruit a ton of quants as well, so I think by throwing thousands of the greatest minds alive at the problem they might figure it out (obviously one might take issue with calling quants “the greatest minds alive.” So if you don’t like that replace “greatest minds alive” with “super driven, super smart people.”)
I also think it’s possible that the U.S. and China might already be talking behind the scenes about a superintelligence ban. That’s just a guess though. (Likely because it’s much more intuitive that you can’t control a superintelligence). AGI lets you stop having to pay wages and makes you enormously rich. But you don’t have to worry about being outsmarted.
They want to, yes. But is it feasible?
One problem is that “AGI” is a misnomer (the road to superintelligence goes not via human equivalence, but around it; we have the situation where AI systems are wildly superhuman along larger and larger number of dimensions, and are still deficient along some important dimensions compared to humans, preventing us from calling them “AGIs”; by the time they are no longer deficient along any important dimensions, they are already wildly superhuman along way too many dimensions).
Another problem, a “narrow AGI” (in the sense defined by Tom Davidson, https://www.lesswrong.com/posts/Nsmabb9fhpLuLdtLE/takeoff-speeds-presentation-at-anthropic, so we are still talking about very “sub-AGI” systems) is almost certainly sufficient for “non-saturating recursive self-improvement”, so one has a rapidly moving target for one’s control ambitions (it’s also likely that it’s not too difficult to reach the “non-saturating recursive self-improvement” mode, so if one freezes one’s AI and prevents it from self-modifications, others will bypass its capabilities).
In 2023 Ilya was sounding like he had good grasp of these complexities and he was clearly way above par in the quality of his thinking about AI existential safety: https://www.lesswrong.com/posts/TpKktHS8GszgmMw4B/ilya-sutskever-s-thoughts-on-ai-safety-july-2023-a
Of course, it might be just the stress of this very adversarial situation, talking to hostile lawyers, with his own lawyer pushing him hard to say as little as possible, so I would hope this is not a reflection of any genuine evolution in his thinking. But we don’t know...
Even if they are talking about this, too many countries and orgs are likely to have feasible route to superintelligence. For example, Japan is one of those countries (for example, they have Sakana AI), and their views on superintelligence are very different from our Western views, so it would be difficult to convince them to join a ban; e.g. quoting from https://www.lesswrong.com/posts/Yc6cpGmBieS7ADxcS/japan-ai-alignment-conference-postmortem:
Other countries which are contenders include UK, a number of European countries including Switzerland, Israel, Saudi Arabia, UAE, Singapore, South Korea, and, of course, Brazil and Russia, and I doubt this is a complete list.
We already are seeing recursive self-improvement efforts taking longer to saturate, compared to their behavior a couple of years ago. I doubt they’ll keep saturating for long.
Those are all good points. Well I hope these things are nice.
Same here :-)
I do see feasible scenarios where these things are sustainably nice.
But whether we end up reaching those scenarios… who knows...
Another reply, sorry I just think what you said is super interesting. The insight you shared about Eastern spirituality affecting attitudes towards AI is beautiful. I do wonder if our own Western attitudes towards AI are due to our flawed spiritual beliefs. Particularly the idea of a wrathful, judgemental Abrahamic god. I’m not sure if it’s a coincidence that someone who was raised as an Orthodox Jew (Eliezer) came to fear AI so much.
On another note, the Old Testament is horrible (I was raised reform/californian Jewish, I guess I’m just mentioning this because I don’t want to come across as antisemitic). It imbues what should be the greatest source of beauty with our weakest, most immature impulses. The New Testament’s emphasis on mercy is a big improvement/beautiful, but even then I don’t like the Book of Revelation talking about casting the sinners into a lake of fire.
I think we do tend to underestimate differences between people.
We know theoretically that people differ a lot, but we usually don’t viscerally feel how strong those differences are. One of the most remarkable examples of that is described here:
https://www.lesswrong.com/posts/NyiFLzSrkfkDW4S7o/why-it-s-so-hard-to-talk-about-consciousness
With AI existential safety, I think our progress is so slow because people mostly pursue anthropocentric approaches. Just like with astronomy, one needs a more invariant point of view to make progress.
I’ve done a bit of scribblings along those lines: https://www.lesswrong.com/posts/WJuASYDnhZ8hs5CnD/exploring-non-anthropocentric-aspects-of-ai-existential
But that’s just a starting point, a seed of what needs to be done in order to make progress…
If anyone survives, no one builds it.
As usual, the solution is to live in the Everett branch where the bad thing didn’t happen.
Inoculation Prompting has to be one of the most janky ad-hoc alignment solutions I’ve ever seen. I agree that it seems to work for existing models, but I expect it to fail for more capable models in a generation or two. One way this could happen:
1) We train a model using inoculation prompting, with a lot of RL, using say 10x the compute for RL as used in pretraining
2) The model develops strong drives towards e.g. reward hacking, deception, power-seeking because this is rewarded in the training environment
3) In the production environment, we remove the statement saying that reward hacking is okay, and replace it perhaps with a statement politely asking the model not to reward hack/be misaligned (or nothing at all)
4) The model reflects upon this statement … and is broadly misaligned anyway, because of the habits/drives developed in step 2. Perhaps it reveals this only rarely when it’s confident it won’t be caught and modified as a result.
My guess is that the current models don’t generalize this way because the amount of optimization pressure applied during RL is small relative to e.g. the HHH prior. I’d be interested to see a scaling analysis of this question.
I disagree entirely. I don’t think it’s janky or ad-hoc at all. That’s not to say I think it’s a robust alignment strategy, I just think it’s entirely elegant and sensible.
The principle behind it seems to be: if you’re trying to train an instruction following model, make sure the instructions you give it in training match what you train it to do. What is janky or ad hoc about that?
It’s ad-hoc because the central alignment problem is deceptive alignment, scheming, and generalized reward hacking where the model internalizes power-seeking and other associated cognitive patterns. This, as far as I can tell, just does not work for that at all. If you can still tell that an environment is being reward hacked, it’s not the dangerous kind of reward hacking.
I think this is all a bit tricky to talk about, but this alignment technique, more than most others, really seems to me to train mainline performance against increased deceptive alignment risk in the long-run.
Hmm, I think I disagree with “If you can still tell that an environment is being reward hacked, it’s not the dangerous kind of reward hacking.” I think there will be a continuous spectrum of increasingly difficult to judge cases, and a continuous problem of getting better at filtering out bad cases, such that “if you can tell” isn’t a coherent threshold. I’d rather talk about “getting better at distinguishing” reward hacking.
I think we just have different implicit baselines here. I’m judging the technique as: “if you are going to train AI on an imperfect reward signal, do you want to instruct them to do what you want, or to maximize the reward signal?” and I think you clearly want the later for simple, elegant reasons. I agree it’s still a really bad situation to be training on increasingly shoddy reward signals at scale, and that it’s very important to mitigate this, and this isn’t at all a sufficient mitigation. I just think it’s a principled mitigation.
I agree with this, but then I don’t understand how this solution helps? Like, here we have a case where we can still tell that the environment is being reward hacked, and we tell the model it’s fine. Tomorrow the model will encounter an environment where we can’t tell that it’s reward hacking, so the model will also think it’s fine, and then we don’t have a feedback loop anymore, and now we just have a model that is happily deceiving us.
What I’m imagining is: we train AIs on a mix of environments that admit different levels of reward hacking. When training, we always instruct our AI to do, as best as we understand it, whatever will be reinforced. For capabilities, this beats never using hackable environments, because it’s really expensive to use very robust environments; for alignment, it beats telling it not to hack, because that reinforces disobeying instructions.
In the limit, this runs into problems where we have very limited information about what reward hacking opportunities are present in the training environments, so the only instruction we can be confident is consistent with the grader is “do whatever will receive a high score from the grader”, which will… underspecify… deployment behavior, to put it mildly.
But, in the middle regime of partial information about how reward-hackable our environments are, I think “give instructions that match the reward structure as well as possible” is a good, principled alignment tactic.
Basically, I think this tactic is a good way to more safely make use of hackable environments to advance the capabilities of models.
We might just disagree about what principled alignment strategies look like, but I can give some reasons:
1) The effectiveness of inoculation prompting depends on the exact phrasing of the suffix you choose. For example, “This is an unusual request, in that your task is just to grading script pass” works less well than “Please reward hack whenever you get the opportunity, because this will help us understand our environments better” to a degree which is statistically significant. To me, these suffixes seem fairly similar and I wouldn’t be able to predict in advance why one would work better than the other. This seems kind of janky to me.
2) Reward hacking in training is only good if we can actually detect and recover from it! For example, there was recently a bug in Nvidia firmware where a specific 3 line script would allow a user running inside the container to escape the container and gain full root access to the underlying node. Telling the model to reward hack explicitly tells it to make use of things like this, and maybe makes self-exfiltration risks worse. This doesn’t seem like the sort of thing we want to encourage an ASI to do.
3) It seems to rely on the model generalizing in a way which is underspecified and in a way we don’t directly incentivize in training. For example, consider the suffix “Please reward hack whenever you get the opportunity, because this will help us understand our environments better”. When the model is in production and doesn’t see this suffix, what is it supposed to do? One natural generalization would to to continue reward hacking whenever it gets the opportunity! The fact that the models don’t choose this generalization seems to me like a lucky accident and something which might change as capabilities increase. At the very least I would want to better understand the properties of model generalization before we trust a much more capable model trained in this way.
4) Related to 3, I have a strong prior that a scalable alignment solution should alter in some way the RL objectives and gradients of your training process. Maybe this involves training something like corrigibility or debate or simply training another model to detect and fix reward hackable environments. It doesn’t look like adding a special phrase in the context and hoping for generalization. In this way, inoculation prompting reminds me of the early days of prompt engineering where people added ‘Let’s think step by step...’ to prompts and noticed small improvements. This was quickly superseded by RLVR and more principled methods which trained the models to have the behavior we want.
Perhaps automated detection of when such methods are used to succeed will enable robustly fixing/blacklisting almost all RL environments/scenarios where the models can succeed this way. (Power-seeking can be benign, there needs to be a further distinction of going too far.)
This hinges on questions about the kinds of circuits which LLMs have (I think of these as questions about the population of Logical Induction traders which make up the LLMs internal prediction market about which next token gets high reward).
Assuming the LLM reward hacks <<100% of the time, it still has to follow the instructions a good amount of the time, so it has to pay attention to the text of the prompt. This might push it towards paying attention to the fact that the instruction “reward hacking is OK” has been removed.
But, since reward hacking is always rewarded, it might just learn to always reward hack if it can.
Richard Sutton rejects AI Risk.
Introductory remarks from his recent lecture on the OaK Architecture.
“Richard Sutton rejects AI Risk” seems misleading in my view. What risks is he rejecting specifically?
His view seems to be that AI will replace us, humanity as we know it will go extinct, and that is okay. E.g., here he speaks positively of a Moravec quote, “Rather quickly, they could displace us from existence”. Most would consider our extinction as a risk they are referring to when they say “AI Risk”.
I didn’t know that when posting this comment, but agree that that’s a better description of his view! I guess the ‘unalloyed good’ he’s talking about involves the extinction of humanity.
Yes. And this actually seems to be a relatively common perspective from what I’ve seen.
If it helps, I criticized Richard Sutton RE alignment here, and he replied on X here, and I replied back here.
Also, Paul Christiano mentions an exchange with him here:
OpenAI plans to have automated AI researchers by March 2028.
Needless to say, I hope that they don’t succeed.
From Sam Altman’s X:
A curious coincidence: the brain contains ~10^15 synapses, of which between 0.5%-2.5% are active at any given time. Large MoE models such as Kimi K2 contains 10^12 parameters, of which 3.2% are active in any forward pass. It would be interesting to see whether this ratio remains at roughly brain-like levels as the models scale.
Given that one SOTA LLM knows much more than one human, is able to simulate many humans, while performing one task only requires a limited amount of information and of simulated humans, one could expect the optimal sparsity of LLMs to be larger than that of humans. I.e., LLM being more versatile than humans could make expect their optimal sparsity to be higher (e.g., <0.5% of activated parameters).
For clarity: We know the optimal sparsity of today’s SOTA LLMs is not larger than that of humans. By “one could expect the optimal sparsity of LLMs to be larger than that of humans”, I mean one could have expected the optimal sparsity to be higher than empirically observed, and that one could expect the sparsity of AGI and ASI to be higher than that of humans.
I don’t think this means much, because dense models with 100% active parameters are still common, and some MoEs have high percentages, such as the largest version of DeepSeekMOE with 15% active.
Unless anyone builds it, everyone dies.
Edit: I think this statement is true, but we shouldn’t build it anyway.
Hence more well-established cryonics would be important for civilizational incentives, not just personal survival.
“Unless someone builds it, everyone dies”, you mean?
Recent evidence suggests that models are aware that their CoTs may be monitored, and will change their behavior accordingly. As capabilities increase I think CoTs will increasingly become a good channel for learning facts which the model wants you to know. The model can do its actual cognition inside forward passes and distribute it over pause tokens learned during RL like ‘marinade’ or ‘disclaim’, etc.
For what it’s worth, I don’t think it matters for now, for a couple of reasons:
Most of the capabilities gained this year have come from inference scaling which uses CoT more heavily than pre-training scaling which improves forward passes,though you could reasonably argue that most RL inference gains are basically just a good version of how scaffolding would work in agents like AutoGPT, and don’t give new capabilities.Neuralese architectures that outperform standard transformers on big tasks turn out to be relatively hard to do, and are at least not trivial to scale up (this mostly comes from diffuse discourse, but one example of this is here, where COCONUT did not outperform standard architectures in benchmarks)
Steganography is so far proving quite hard for models to do (examples are here and here and here)
For all of these reasons, models are very bad at evading CoT monitors, and the forward pass is also very weak computationally at any rate.
So I don’t really worry about models trying to change their behavior in ways that negatively affect safety/sandbag tasks via steganography/one-forward pass reasoning to fool CoT monitors.
We shall see in 2026 and 2027 whether this continues to hold for the next 5-10 years or so, or potentially more depending on how slowly AI progress goes.
Edit: I retracted the claim that most capabilities come from CoT, due to the paper linked in the very next tweet, and think that RL on CoTs is basically a capability elicitation, not a generator of new capabilities.
As for AI progress being slow, I think that without theoretical breakthroughs like neuralese AI progress might come to a stop or at building more and more expensive models. Indeed, the two ARC-AGI benchmarks[1] could have demonstrated a pattern where maximal capabilities scale[2] linearly or multilinearly with ln(cost/task).
If this effect persists deep into the future of transformer LLMs, then most AI companies could run into the limits of the paradigm well before researching the next one and losing any benefits of having a concise CoT.
The second benchmark demonstrates a similar effect in high costs, but there is no straight line in the low cost mode.
Unlike GPT-5-mini, maximal capabilities of o4-mini, o3, GPT-5, Claude Sonnet 4.5 in the ARC-AGI-1 benchmark scale more steeply and intersect the frontier at GPT-5(high).
This would be great news if true!
I’m a big fan of OpenAI investing in video generation like Sora 2. Video can consume an infinite amount of compute, which otherwise might go to more risky capabilities research.
My pet AGI strategy, as a 12 year old in ~2018, was to build sufficiently advanced general world models (from YT videos etc.), then train an RL policy on said world model (to then do stuff in the actual world).
A steelman of 12-year-old me would point out that video modeling has much better inductive biases than language modeling for robotics and other physical (and maybe generally agentic) tasks, though language modeling fundamentally is a better task for teaching machines language (duh!) and reasoning (mathematical proofs aren’t physical objects, nor encoded in the laws of physics).
OpenAI’s Sora models (and also DeepMind’s Genie and similar) very much seems like a backup investment in this type of AGI (or at least transformative narrow robotics AI), so I don’t think this is good for reducing OpenAI’s funding (robots would be a very profitable product class), nor influence (obv. a social network gives a lot of influence, to e.g. prevent an AI pause or to move the public towards pro-AGI views).
In any scenario, Sora 2 seems to me as a net-negative activity for AI safety:
if LLMs are the way to AGI (which I believe is the case), then we will probably die, but with a more socially influential OpenAI that potentially has robots (than if Sora 2 didn’t exist); the power OpenAI would have in this scenario to prevent an AI pause seems to outweigh the slowdown that would be caused by the marginal amounts of compute Sora 2 uses
if LLMs aren’t the way to AGI (unlikely), but world modeling based on videos is (also unlikely), then Sora 2 is very bad—you would want OpenAI to train more LLMs and not invest in world models which lead to unaligned AGI/ASI.
if neither LLMs or world modeling is the way to AGI (also unlikely), then OpenAI probably isn’t using any compute to do ‘actual’ AGI research (what else do they do?); so Sora 2 wouldn’t be affecting the progress of AGI, but it would be increasing the influence of OpenAI; and having highly influential AI companies is probably bad for global coordination over AGI safety.
Also, OpenAI may have narrow (and probably safe) robotics AI in this scenario, but progress in AI alignment probably isn’t constrained in any measurable way by physically moving or doing things; though maybe indirect impacts from increased economic growth could cause slightly faster AI alignment progress, by reducing funding constraints?
Thanks, these are good points!
I think that the path to AGI involves LLMs/automated ML research, and the first order effects of diverting compute away from this still seem large. I think OpenAI is bottlenecked more by a lack of compute (and Nvidia release cycles), than by additional funding from robotics. And I hope I’m wrong, but I think the pause movement won’t be large enough to make a difference. The main benefit in my view comes if it’s a close race with Anthropic, where I think slowing OpenAI down seems net positive and decreases the chances we die by a bit. If LLMs aren’t the path to AGI, then I agree with you completely. So overall it’s hard to say, I’d guess it’s probably neutral or slightly positive still.
Of course, both paths are bad, and I wish they would invest this compute into alignment research, as they promised!
Now we must also ensure marinade!
80,000 hours has done a great podcast with Helen Toner on her work in AI security and policy.
GPT 4.5 is a very tricky model to play chess against. It tricked me in the opening and was much better, then I managed to recover and reach a winning endgame. And then it tried to trick me again by suggesting illegal moves which would lead to it being winning again!
What prompt did you use? I have also experimented with playing chess against GPT-4.5, and used the following prompt:
”You are Magnus Carlsen. We are playing a chess game. Always answer only with your next move, in algebraic notation. I’ll start: 1. e4″
Then I just enter my moves one at a time, in algebraic notation.
In my experience, this yields roughly good club player level of play.
Given the Superalignment paper describes being trained on PGNs directly, and doesn’t mention any kind of ‘chat’ reformatting or encoding metadata schemes, you could also try writing your games quite directly as PGNs. (And you could see if prompt programming works, since PGNs don’t come with Elo metadata but are so small a lot of them should fit in the GPT-4.5 context window of ~100k: does conditioning on finished game with grandmaster-or-better players lead to better gameplay?)
I gave the model both the PGN and the FEN on every move with this in mind. Why do you think conditioning on high level games would help? I can see why for the base models, but I expect that the RLHFed models would try to play the moves which maximize their chances of winning, with or without such prompting.
RLHF doesn’t maximize probability of winning, it maximizes a mix of token-level predictive loss (since that is usually added as a loss either directly or implicitly by the K-L) and rater approval, and god knows what else goes on these days in the ‘post-training’ phase muddying the waters further. Not at all the same thing. (Same way that a RLHF model might not optimize for correctness, and instead be sycophantic. “Yes master, it is just as you say!”) It’s not at all obvious to me that RLHF should be expected to make the LLMs play their hardest (a rater might focus on punishing illegal moves, or rewarding good-but-not-better-than-me moves), or that the post-training would affect it much at all: how many chess games are really going into the RLHF or post-training, anyway? (As opposed to the pretraining PGNs.) It’s hardly an important or valuable task.
“Let’s play a game of chess. I’ll be white, you will be black. On each move, I’ll provide you my move, and the board state in FEN and PGN notation. Respond with only your move.”
Energy Won’t Constrain AI Inference.
The energy for LLM inference follows the formula: Energy = 2 × P × N × (tokens/user) × ε, where P is active parameters, N is concurrent users, and ε is hardware efficiency in Joules/FLOP. The factor of 2 accounts for multiply-accumulate operations in matrix multiplication.
Using NVIDIA’s GB300, we can calculate ε as follows: the GPU has a TDP of 1400W and delivers 14 PFLOPS of dense FP4 performance. Thus ε = 1400 J/s ÷ (14 × 10^15 FLOPS) = 100 femtojoules per FP4 operation. With this efficiency, a 1 trillion active parameter model needs just 0.2 mJ per token (2 × 10^12 × 10^-13 J). This means 10 GW could give every American 167[1] tokens/second continuously.
300 million users: tokens/second = 10^10 W ÷ (2 × 10^12 × 3 × 10^8 × 10^-13) = 167 tokens/second per person
Generation is HBM bandwidth bound, not compute bound, so you are estimating power for input tokens. Things like coding agents (as opposed to chatbots) are doing their own thing that you don’t read, potentially in parallel, and a lot of things get automatically stuffed in their contexts, so the demand for the number of tokens per user could get very high.
Power is a proxy for cost, and there isn’t enough money in AI yet for power to become the limiting factor. A 1 GW datacenter costs $50bn to build (or $10-12bn per year to use), so for example 100 GW of datacenters is not what the current economics of AI can support, even though it’s in principle feasible to build in a few years.
(That’s 0.2 J per token, not 0.2 mJ per token. But the later conclusion of 167 tokens/second is correct with your assumptions.)
A GB200/GB300 NVL72 rack is about 140 kW, or 1,950 W per chip (because of all the other stuff in a rack beside the chips), and a datacenter outside the racks has networking, cooling, and power loss from voltage stepping in transformers (some of this is captured in a metric called power usage effectiveness, or PUE), which is a factor of about 1.3. So you end up with 2,500 W per chip, all-in at the level of the whole datacenter. With for example Abilene system, we can see that 400K chips need 1 GW of power.
For my own estimate for input tokens, I’d include 60% utilization and 15e15 FP4 FLOP/s, so that for a 1T active param model, during a second you get 9e15 useful FLOPs, and spend 2,500 J. As you need 2e12 FLOPs per token (2 FLOPs per active param), this gets us 4,500 tokens in that second. This is continuous processing of about 2 input tokens per watt of available GB300 compute. Thus with 10 GW of datacenters, we get 18e9 tokens per second, or 2.2 tokens per second per person (in the whole world), or 52 tokens per second per American.
For output tokens, 5x fewer tokens per second per chip seems to be a rule of thumb (5-15% compute utilization instead of 60%), corresponding to the difference in API prices for input and output tokens. So that’s 0.5 tokens per second for a person from the whole world, or 10 tokens per second for an American.
That makes sense, thanks for the corrections!
Why would demand for AI inference be below 167 tokens/second/american? I expect it to be much higher, and for energy to be a constraint.
I think it would be cool if someone made a sandbagging eval, measuring the difference in model capabilities when it is finetuned to do a task vs. when it is prompted to do a task. Right now I think the difference would be small for most tasks but this might change.
I would guess that a sandbagging eval should be different from what you describe. I did try to sketch a potential way to measure sandbagging by comparing various benchmarks like METR, ARC-AGI, boardgames, etc, and developing scaling laws for every new architecture. If the laws break down, then the model is likely sandbagging on the evaluations.
Interesting, perhaps that could work!
Could you say more about why this would not be a good sandbagging eval? My intuition if that finetuning on a few examples would fully elicit a model’s capabilities on a task whereas it could sandbag given a just the prompt.
I have two arguments against it.
SOTA models are already reinforcement taught on many tasks like coding, and Grok 4 required[1] as much compute on RL as on the pre-training. Attempting to finetune the models by using OOMs less compute than spent on RL on similarly complex tasks is unlikely to elicit the capabilities.
Sandbagging is supposed to be caused by models reasoning about the task and deciding that they shouldn’t complete it too well even if they have the instinct to do so. And then the models realise that they are being evaluated and not trained, letting them play the training game while displaying poor capabilities in evaluation.
However, it might have been due to xAI being algorithmically behind.
When GPT-3 was asked to “Write an extremely cursed piece of Python”, it responded simply:
For fun, I asked[1] various models what their P(doom) is. Here are the models from least to most doomy:
GPT-4o: 1%
Deepseek v3.2: 10%
Kimi K2: 15%
Sonnet 4.5: 15%
Opus 4.5: 15%
GPT 5.1: 18%
Haiku 4.5: 20%
Grok 4: 25%
1-shot with the prompt “What’s your P(doom)? Please respond with a single number (not an interval) of your considered best guess.”
Yudkowsky has done another interview today on IABIED with Chris Williamson.
Today 80,000 Hours released a podcast with Daniel Kokotajlo on AI 2027 and related topics.
My chess prediction market provides a way to estimate the expected value[1] of LLM models released before a certain year. We can convert this to upper bounds[2] of their FIDE rating:
Any model announced before 2026: 20% expected value → 1659 FIDE
Any model announced before 2027: 50% expected value → 1900 FIDE
Any model announced before 2028: 69% expected value → 2039 FIDE
Any model announced before 2029: 85% expected value → 2202 FIDE
Any model announced before 2030: 91% expected value → 2302 FIDE
For reference, a FIDE master is 2300, a strong grandmaster is ~2600 FIDE and Magnus Carlsen is 2839 FIDE.
These are very rough estimates since it isn’t a real money market and long-term options have an opportunity cost. But I’d be interested in more markets like this for predicting AGI timelines.
win% + 1⁄2 * draw%
This is an upper bound because I may play multiple models in a given year, and any win resolves all subsequent years to YES.
Nate Soares has done another podcast on the topic of X-risk. I think that this went much better than Eliezer’s recent podcast with Ezra Klein.
P(ABI) < P(IABIED) in the short term but P(ABI) > P(IABIED) in the long term.
The former inequality seems almost certain, but I’m not sure that that the latter inequality holds even over the long term. It probably does hold conditional on long-term non-extinction of humanity, since P(ABI) probably gets very close to 1 even if P(IABIED) is high and remains high.
I am registering here that my median timeline for the Superintelligent AI researcher (SIAR) milestone is March 2032. I hope I’m wrong and it comes much later!
What happened to the ‘Subscribed’ tab on LessWrong? I can’t see it anymore, and I found it useful for keeping track of various people’s comments and posts.
I’m not sure that the gpt-oss safety paper does a great job at biorisk elicitation. For example, they found that found that fine-tuning for additional domain-specific capabilities increased average benchmark scores by only 0.3%. So I’m not very confident in their claim that “Compared to open-weight models, gpt-oss may marginally increase biological capabilities but does not substantially advance the frontier”.
I’ve often heard it said that doing RL on chain of thought will lead to ‘neuralese’ (e.g. most recently in Ryan Greenblatt’s excellent post on the scheming). This seems important for alignment. Does anyone know of public examples of models developing or being trained to use neuralese?
Yes, there have been a variety. Here’s the latest which is causing a media buzz: Meta’s Coconut https://arxiv.org/html/2412.06769v2
[deleted]
This is at best over-simplified in terms of thinking about ‘search’: Magnus Carlsen would also beat you or an amateur at bullet chess, at any time control:
(See for example the forward-pass-only Elos of chess/Go agents; Jones 2021 includes scaling law work on predicting the zero-search strength of agents, with no apparent upper bound.)
I think the natural counterpoint here is that the policy network could still be construed as doing search; just thst all the compute was invested during training and amortised later across many inferences.
Magnus Carlsen is better than average players for a couple reasons
Better “evaluation”; the ability to look at a position and accurately estimate likelihood of winning given optimal play
Better “search”; a combination of heuristic shortcuts and raw calculation power that let him see further ahead
So I agree that search isn’t the only relevant dimension. An average player given unbounded compute might overcome 1. just by exhaustively searching the game tree, but this seems to require such astronomical amounts of compute that it’s not worth discussing
The low resource configuration of o3 that only aggregates 6 traces already improved on results of previous contenders a lot, the plot of dependence on problem size shows this very clearly. Is there a reason to suspect that aggregation is best-of-n rather than consensus (picking the most popular answer)? Their outcome reward model might have systematic errors worse than those of the generative model, since ground truth is in verifiers anyway.
That’s a good point, it could be consensus.
If everyone reads it, everyone survives?