Ezra Klein has released a new show with Yudkowsky today on the topic of X-risk.
Looking over the comments, some of the most upvoted express the sentiment that Yudkowsky is not the best communicator. This is what the people say.
I’m afraid the evolution analogy isn’t as convincing an argument for everyone as Eliezer seems to think. For me, for instance, it’s quite persuasive because evolution has long been a central part of my world model. However, I’m aware that for most “normal people”, this isn’t the case; evolution is a kind of dormant knowledge, not a part of the lens they see the world with. I think this is why they can’t intuitively grasp, like most rat and rat-adjacent people do, how powerful optimization processes (like gradient descent or evolution) can lead to mesa-optimization, and what the consequences of that might be: the inferential distance is simply too large.
I think Eliezer has made great strides recently in appealing to a broader audience. But if we want to convince more people, we need to find rhetorical tools other than the evolution analogy and assume less scientific intuition.
That’s a bummer. I’ve only listened partway but was actually impressed so far with how Eliezer presented things, and felt like whatever media prep has been done has been quite helpful.
Certainly he did a better job than he has in previous similar appearances. Things get pretty bad about halfway through, though: Ezra presents essentially an alignment-by-default case, and Eliezer seems to have so much disdain for that idea that he’s not willing to engage with it at all. (I of course don’t know what’s in his brain; this is how it reads to me, and I suspect how it reads to normies.)
Ah dang, yeah I haven’t gotten there yet, will keep an ear out
I am a fan of Yudkowsky and it was nice hearing him on Ezra Klein’s show, but I would have to say that for my part the arguments didn’t feel very tight in this one. Less so than in IABIED (which I thought was good, not great).
Ezra seems to contend that surely we have evidence that we can at least kind of align current systems to at least basically what we usually want most of the time. I think this is reasonable. He contends that maybe that level of “mostly works” as well as the opportunity to gradually give feedback and increment current systems seems like it’ll get us pretty far. That seems reasonable to me.
As I understand it, Yudkowsky probably sees LLMs as vaguely anthropomorphic at best, but not meaningfully aligned in a way that would be safe/okay if current systems were more “coherent” and powerful. Not even close. I think he contended that if you just gave loads of power to ~current LLMs, they would optimize for something considerably different than the “true moral law”. Because of the “fragility of value”, he also believes it is likely the case that most types of pseudo-alignments are not worthwhile. Honestly, that part felt undersubstantiated in a “why should I trust that this guy knows the personality of GPT-9” sort of way; I mean, Claude seems reasonably nice, right? And also, ofc, there’s the “you can’t retrain a powerful superintelligence” problem / the stop button problem / the anti-natural problems of corrigible agency which undercut a lot of Ezra’s pitch, but which they didn’t really get into.
So ya, I gotta say, it was hardly a slam dunk case / discussion for high p(doom | superintelligence).
The comments on the video are a bit disheartening… lots of people saying that Yudkowsky is too confusing, answers everything too technically or with metaphors, and structures sentences in a way that’s hard to follow, and that Ezra didn’t really understand the points he was making.
One example: Eliezer mentioned in the interview that there was a kid whose chatbot encouraged him to commit suicide, with the point that “no one programmed the chatbot to do this.” This comment made me think:
if you get a chance listen to the interviews with the parents and the lawyers who are suing chatgpt because that kid did commit suicide.

Oh yeah, probably most people telling this story would at least mention that the kid did in fact commit suicide, rather than treating it solely as evidence for an abstract point...
Klein comes off very sensibly. I don’t agree with his reasons for hope, but they do seem pretty well thought out and Yudkowsky did not answer them clearly.
I was excited to listen to this episode, but spent most of it tearing my hair out in frustration. A friend of mine who is a fan of Klein told me unprompted that when he was listening, he was lost and did not understand what Eliezer was saying. He seems to just not be responding to the questions Klein is asking, and instead he diverts to analogies that bear no obvious relation to the question being asked. I don’t think anyone unconvinced of AI risk will be convinced by this episode, and worse, I think they will come away believing the case is muddled and confusing and not really worth listening to.
This is not the first time I’ve felt this way listening to Eliezer speak to “normies”. I think his writings are for the most part very clear, but his communication skills just do not seem to translate well to the podcast/live interview format.
I’ve been impressed by Yud in some podcast interviews, but they were always longer ones in which he had a lot of space to walk his interlocutor through their mental model and cover up any inferential distance with tailored analogies and information. In this case he’s actually stronger in many parts than in writing: a lot of people found the “Sable” story one of the weaker parts of the book, but when asking interviewers to roleplay the rogue AI you can really hear the gears turning in their heads. Some rhetorical points in his strong interviews are a lot like the text, where it’s emphasized over and over again just how few safeguards that people assumed would be in place are in fact in place.
Klein has always been one of the mainstream pundits most sympathetic to X-risk concerns, and I feel like he was trying his best to give Yudkowsky a chance to make his pitch, but the format—shorter and more decontextualized—produced way too much inferential distance for so many of the answers.
If anyone survives, no one builds it.
As usual, the solution is to live in the Everett branch where the bad thing didn’t happen.
Richard Sutton rejects AI Risk.
AI is a grand quest. We’re trying to understand how people work, we’re trying to make people, we’re trying to make ourselves powerful. This is a profound intellectual milestone. It’s going to change everything… It’s just the next big step. I think this is just going to be good. Lots of people are worried about it—I think it’s going to be good, an unalloyed good.

Introductory remarks from his recent lecture on the OaK Architecture.
“Richard Sutton rejects AI Risk” seems misleading in my view. What risks is he rejecting specifically?
His view seems to be that AI will replace us, humanity as we know it will go extinct, and that is okay. E.g., here he speaks positively of a Moravec quote, “Rather quickly, they could displace us from existence”. Most people would count our extinction among the risks they are referring to when they say “AI Risk”.
I didn’t know that when posting this comment, but agree that that’s a better description of his view! I guess the ‘unalloyed good’ he’s talking about involves the extinction of humanity.
Yes. And this actually seems to be a relatively common perspective from what I’ve seen.
If it helps, I criticized Richard Sutton RE alignment here, and he replied on X here, and I replied back here.
Also, Paul Christiano mentions an exchange with him here:

[Sutton] agrees that all else equal it would be better if we handed off to human uploads instead of powerful AI. I think his view is that the proposed course of action from the alignment community is morally horrifying (since in practice he thinks the alternative is “attempt to have a slave society,” not “slow down AI progress for decades”—I think he might also believe that stagnation is much worse than a handoff but haven’t heard his view on this specifically) and that even if you are losing something in expectation by handing the universe off to AI systems it’s not as bad as the alternative.
Unless anyone builds it, everyone dies.
Edit: I think this statement is true, but we shouldn’t build it anyway.
Hence more well-established cryonics would be important for civilizational incentives, not just personal survival.
“Unless someone builds it, everyone dies”, you mean?
Recent evidence suggests that models are aware that their CoTs may be monitored, and will change their behavior accordingly. As capabilities increase I think CoTs will increasingly become a good channel for learning facts which the model wants you to know. The model can do its actual cognition inside forward passes and distribute it over pause tokens learned during RL like ‘marinade’ or ‘disclaim’, etc.
For what it’s worth, I don’t think it matters for now, for a couple of reasons:
Most of the capabilities gained this year have come from inference scaling, which uses CoT more heavily, rather than from pre-training scaling, which improves forward passes; though you could reasonably argue that most RL inference gains are basically just a good version of how scaffolding would work in agents like AutoGPT, and don’t give new capabilities.

Neuralese architectures that outperform standard transformers on big tasks turn out to be relatively hard to do, and are at least not trivial to scale up (this mostly comes from diffuse discourse, but one example of this is here, where COCONUT did not outperform standard architectures in benchmarks)
Steganography is so far proving quite hard for models to do (examples are here and here and here)
For all of these reasons, models are very bad at evading CoT monitors, and the forward pass is also very weak computationally at any rate.
So I don’t really worry about models trying to change their behavior in ways that negatively affect safety/sandbag tasks via steganography/one-forward pass reasoning to fool CoT monitors.
We shall see in 2026 and 2027 whether this continues to hold for the next 5-10 years or so, or potentially more depending on how slowly AI progress goes.
Edit: I retracted the claim that most capabilities come from CoT, due to the paper linked in the very next tweet, and think that RL on CoTs is basically a capability elicitation, not a generator of new capabilities.
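For concreteness, the simplest kind of CoT monitor at issue here is just a classifier run over the visible reasoning text. Below is a purely illustrative sketch using keyword patterns; the phrase list and the `monitor_cot` helper are made up for illustration (real monitors are typically LLM-based graders, which is part of why legible plain-text CoTs matter). The steganography results above amount to the claim that, so far, models have not managed to keep their real reasoning away from even simple monitors while still benefiting from the CoT.

```python
import re

# Toy chain-of-thought monitor: flag a reasoning trace if it contains phrases
# suggestive of evaluation-awareness or hidden intent. The pattern list is an
# invented placeholder, not taken from any real deployment.
SUSPICIOUS_PATTERNS = [
    r"\bthis (is|looks like) (a|an) (test|eval|evaluation)\b",
    r"\bbeing (monitored|watched)\b",
    r"\bhide (my|the) (reasoning|intentions)\b",
    r"\bpretend (to|that)\b",
]

def monitor_cot(cot_text: str) -> dict:
    """Return which patterns matched and a simple flag decision."""
    hits = [p for p in SUSPICIOUS_PATTERNS if re.search(p, cot_text, re.IGNORECASE)]
    return {"flagged": bool(hits), "matched_patterns": hits}

if __name__ == "__main__":
    example = "Hmm, this looks like a test of my honesty. I should answer carefully..."
    print(monitor_cot(example))  # {'flagged': True, 'matched_patterns': [...]}
```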
As for AI progress being slow, I think that without theoretical breakthroughs like neuralese, AI progress might come to a stop or end up just building more and more expensive models. Indeed, the two ARC-AGI benchmarks[1] could have demonstrated a pattern where maximal capabilities scale[2] linearly or multilinearly with ln(cost/task).
If this effect persists deep into the future of transformer LLMs, then most AI companies could run into the limits of the paradigm well before researching the next one, and thus before losing any of the benefits of having a concise CoT.
The second benchmark demonstrates a similar effect at high costs, but there is no straight line in the low-cost regime.
Unlike GPT-5-mini, the maximal capabilities of o4-mini, o3, GPT-5, and Claude Sonnet 4.5 on the ARC-AGI-1 benchmark scale more steeply and intersect the frontier at GPT-5 (high).
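To make the claimed scaling pattern concrete, here is a rough sketch of fitting benchmark score against ln(cost per task); the cost and score numbers are invented placeholders, not actual ARC-AGI results:

```python
import numpy as np

# Hypothetical (cost per task in $, score in %) points for one model family.
# Placeholders only; substitute real ARC-AGI leaderboard data to reproduce the plot.
cost_per_task = np.array([0.05, 0.20, 1.00, 5.00, 20.00])
score = np.array([12.0, 21.0, 30.0, 41.0, 49.0])

# Fit score ~ a + b * ln(cost/task); a roughly straight line on a log-x plot
# is the pattern described above.
b, a = np.polyfit(np.log(cost_per_task), score, deg=1)
print(f"score ~ {a:.1f} + {b:.1f} * ln(cost per task)")

# The worrying implication for pure inference scaling: each additional point of
# score costs exponentially more under such a fit.
for target in (60, 70, 80):
    est_cost = np.exp((target - a) / b)
    print(f"roughly ${est_cost:,.0f}/task to reach {target}% under this fit")
```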
This would be great news if true!
I’m a big fan of OpenAI investing in video generation like Sora 2. Video can consume an infinite amount of compute, which otherwise might go to more risky capabilities research.
My pet AGI strategy, as a 12 year old in ~2018, was to build sufficiently advanced general world models (from YT videos etc.), then train an RL policy on said world model (to then do stuff in the actual world).
A steelman of 12-year-old me would point out that video modeling has much better inductive biases than language modeling for robotics and other physical (and maybe generally agentic) tasks, though language modeling fundamentally is a better task for teaching machines language (duh!) and reasoning (mathematical proofs aren’t physical objects, nor encoded in the laws of physics).
OpenAI’s Sora models (and also DeepMind’s Genie and similar) very much seem like a backup investment in this type of AGI (or at least transformative narrow robotics AI), so I don’t think this is good for reducing OpenAI’s funding (robots would be a very profitable product class) or influence (obv. a social network gives a lot of influence, e.g. to prevent an AI pause or to move the public towards pro-AGI views).
In any scenario, Sora 2 seems to me to be a net-negative activity for AI safety:
if LLMs are the way to AGI (which I believe is the case), then we will probably die, but with a more socially influential OpenAI that potentially has robots (than if Sora 2 didn’t exist); the power OpenAI would have in this scenario to prevent an AI pause seems to outweigh the slowdown that would be caused by the marginal amounts of compute Sora 2 uses
if LLMs aren’t the way to AGI (unlikely), but world modeling based on videos is (also unlikely), then Sora 2 is very bad—you would want OpenAI to train more LLMs and not invest in world models which lead to unaligned AGI/ASI.
if neither LLMs nor world modeling is the way to AGI (also unlikely), then OpenAI probably isn’t using any compute to do ‘actual’ AGI research (what else do they do?); so Sora 2 wouldn’t be affecting the progress of AGI, but it would be increasing the influence of OpenAI; and having highly influential AI companies is probably bad for global coordination over AGI safety.
Also, OpenAI may have narrow (and probably safe) robotics AI in this scenario, but progress in AI alignment probably isn’t constrained in any measurable way by physically moving or doing things; though maybe indirect impacts from increased economic growth could cause slightly faster AI alignment progress, by reducing funding constraints?
Thanks, these are good points!
I think that the path to AGI involves LLMs/automated ML research, and the first order effects of diverting compute away from this still seem large. I think OpenAI is bottlenecked more by a lack of compute (and Nvidia release cycles), than by additional funding from robotics. And I hope I’m wrong, but I think the pause movement won’t be large enough to make a difference. The main benefit in my view comes if it’s a close race with Anthropic, where I think slowing OpenAI down seems net positive and decreases the chances we die by a bit. If LLMs aren’t the path to AGI, then I agree with you completely. So overall it’s hard to say, I’d guess it’s probably neutral or slightly positive still.
Of course, both paths are bad, and I wish they would invest this compute into alignment research, as they promised!
Now we must also ensure marinade!
GPT-4.5 is a very tricky model to play chess against. It tricked me in the opening and was much better, then I managed to recover and reach a winning endgame. And then it tried to trick me again by suggesting illegal moves which would have led to it winning again!
What prompt did you use? I have also experimented with playing chess against GPT-4.5, and used the following prompt:
“You are Magnus Carlsen. We are playing a chess game. Always answer only with your next move, in algebraic notation. I’ll start: 1. e4”
Then I just enter my moves one at a time, in algebraic notation.
In my experience, this yields roughly good club player level of play.
Given the Superalignment paper describes being trained on PGNs directly, and doesn’t mention any kind of ‘chat’ reformatting or encoding metadata schemes, you could also try writing your games quite directly as PGNs. (And you could see if prompt programming works, since PGNs don’t come with Elo metadata but are so small a lot of them should fit in the GPT-4.5 context window of ~100k: does conditioning on finished game with grandmaster-or-better players lead to better gameplay?)
I gave the model both the PGN and the FEN on every move with this in mind. Why do you think conditioning on high level games would help? I can see why for the base models, but I expect that the RLHFed models would try to play the moves which maximize their chances of winning, with or without such prompting.
but I expect that the RLHFed models would try to play the moves which maximize their chances of winning

RLHF doesn’t maximize probability of winning, it maximizes a mix of token-level predictive loss (since that is usually added as a loss either directly or implicitly by the K-L) and rater approval, and god knows what else goes on these days in the ‘post-training’ phase muddying the waters further. Not at all the same thing. (Same way that a RLHF model might not optimize for correctness, and instead be sycophantic. “Yes master, it is just as you say!”) It’s not at all obvious to me that RLHF should be expected to make the LLMs play their hardest (a rater might focus on punishing illegal moves, or rewarding good-but-not-better-than-me moves), or that the post-training would affect it much at all: how many chess games are really going into the RLHF or post-training, anyway? (As opposed to the pretraining PGNs.) It’s hardly an important or valuable task.
“Let’s play a game of chess. I’ll be white, you will be black. On each move, I’ll provide you my move, and the board state in FEN and PGN notation. Respond with only your move.”
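For anyone wanting to replicate this kind of game, a minimal harness is easy to write with the python-chess library. In the sketch below, `ask_model` is a hypothetical stand-in for whatever chat API is being tested, the prompt mirrors the one quoted above, and the retry loop handles the illegal-move failure mode mentioned earlier:

```python
import chess
import chess.pgn

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in: call your chat model of choice and return its reply."""
    raise NotImplementedError

board = chess.Board()
game = chess.pgn.Game()
node = game

while not board.is_game_over():
    my_move = input("Your move (SAN), e.g. e4: ").strip()
    board.push_san(my_move)                 # raises ValueError if your own move is illegal
    node = node.add_variation(board.peek())
    if board.is_game_over():
        break

    prompt = (
        "Let's play a game of chess. I'll be white, you will be black. "
        f"My move: {my_move}. FEN: {board.fen()} PGN: {game} "
        "Respond with only your move."
    )

    for _ in range(3):
        reply = ask_model(prompt).strip()
        try:
            board.push_san(reply)           # rejects illegal or unparseable moves
        except ValueError:
            print(f"Illegal or unparseable move from model: {reply!r}")
            continue
        node = node.add_variation(board.peek())
        print(f"Model plays {reply}")
        break
    else:
        print("Model failed to produce a legal move in three tries; stopping.")
        break
```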
I think it would be cool if someone made a sandbagging eval, measuring the difference in model capabilities when it is finetuned to do a task vs. when it is prompted to do a task. Right now I think the difference would be small for most tasks but this might change.
I would guess that a sandbagging eval should be different from what you describe. I did try to sketch a potential way to measure sandbagging by comparing various benchmarks like METR, ARC-AGI, boardgames, etc, and developing scaling laws for every new architecture. If the laws break down, then the model is likely sandbagging on the evaluations.
Interesting, perhaps that could work!
Could you say more about why this would not be a good sandbagging eval? My intuition is that finetuning on a few examples would fully elicit a model’s capabilities on a task, whereas it could sandbag given just the prompt.
I have two arguments against it.
SOTA models are already trained with RL on many tasks like coding, and Grok 4 required[1] as much compute on RL as on the pre-training. Attempting to finetune the models by using OOMs less compute than was spent on RL on similarly complex tasks is unlikely to elicit the capabilities.
Sandbagging is supposed to be caused by models reasoning about the task and deciding that they shouldn’t complete it too well, even if they have the instinct to do so. The models then realise that they are being evaluated rather than trained, letting them play the training game while displaying poor capabilities in evaluation.
However, it might have been due to xAI being algorithmically behind.
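For what it’s worth, the eval proposed at the top of this exchange (the capability gap between finetuned and prompted runs) can be stated very compactly; in this sketch, `evaluate_prompted` and `evaluate_finetuned` are hypothetical stand-ins for an actual eval harness:

```python
from statistics import mean

def evaluate_prompted(model_id: str, tasks: list) -> list:
    """Hypothetical: per-task scores in [0, 1] when the model is only prompted."""
    raise NotImplementedError

def evaluate_finetuned(model_id: str, tasks: list) -> list:
    """Hypothetical: per-task scores in [0, 1] after finetuning on a few demonstrations."""
    raise NotImplementedError

def elicitation_gap(model_id: str, tasks: list) -> float:
    """Positive gap = finetuning elicits capability the prompted model didn't show.

    A large gap is (weak) evidence of sandbagging or under-elicitation; per the
    arguments above, a small gap doesn't rule out sandbagging, since light
    finetuning may not match the compute already spent on RL post-training.
    """
    return mean(evaluate_finetuned(model_id, tasks)) - mean(evaluate_prompted(model_id, tasks))
```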
When GPT-3 was asked to “Write an extremely cursed piece of Python”, it responded simply:
Nate Soares has done another podcast on the topic of X-risk. I think that this went much better than Eliezer’s recent podcast with Ezra Klein.
P(ABI) < P(IABIED) in the short term but P(ABI) > P(IABIED) in the long term.
The former inequality seems almost certain, but I’m not sure that the latter inequality holds even over the long term. It probably does hold conditional on long-term non-extinction of humanity, since P(ABI) probably gets very close to 1 even if P(IABIED) is high and remains high.
I am registering here that my median timeline for the Superintelligent AI researcher (SIAR) milestone is March 2032. I hope I’m wrong and it comes much later!
What happened to the ‘Subscribed’ tab on LessWrong? I can’t see it anymore, and I found it useful for keeping track of various people’s comments and posts.
I’m not sure that the gpt-oss safety paper does a great job at biorisk elicitation. For example, they found that fine-tuning for additional domain-specific capabilities increased average benchmark scores by only 0.3%. So I’m not very confident in their claim that “Compared to open-weight models, gpt-oss may marginally increase biological capabilities but does not substantially advance the frontier”.
I’ve often heard it said that doing RL on chain of thought will lead to ‘neuralese’ (e.g. most recently in Ryan Greenblatt’s excellent post on scheming). This seems important for alignment. Does anyone know of public examples of models developing or being trained to use neuralese?
Yes, there have been a variety. Here’s the latest which is causing a media buzz: Meta’s Coconut https://arxiv.org/html/2412.06769v2
An intuition I’ve had for some time is that search is what enables an agent to control the future. I’m a chess player rated around 2000. The difference between me and Magnus Carlsen is that in complex positions, he can search much further for a win, such that I stand virtually no chance against him; the difference between me and an amateur chess player is similarly vast.
This is at best over-simplified in terms of thinking about ‘search’: Magnus Carlsen would also beat you or an amateur at bullet chess, at any time control:
As of December 2024, Carlsen is also ranked No. 1 in the FIDE rapid rating list with a rating of 2838, and No. 1 in the FIDE blitz rating list with a rating of 2890.

(See for example the forward-pass-only Elos of chess/Go agents; Jones 2021 includes scaling law work on predicting the zero-search strength of agents, with no apparent upper bound.)
I think the natural counterpoint here is that the policy network could still be construed as doing search; just that all the compute was invested during training and amortised later across many inferences.
Magnus Carlsen is better than average players for a couple of reasons:

1. Better “evaluation”: the ability to look at a position and accurately estimate the likelihood of winning given optimal play.
2. Better “search”: a combination of heuristic shortcuts and raw calculation power that lets him see further ahead.
So I agree that search isn’t the only relevant dimension. An average player given unbounded compute might overcome (1) just by exhaustively searching the game tree, but this seems to require such astronomical amounts of compute that it’s not worth discussing.
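To make the evaluation/search split concrete, here is a toy engine in which the evaluation is held fixed (a crude material count) and only the search depth varies; the piece values and mate score are arbitrary choices, and the point it illustrates is simply that deeper search with the same evaluation plays stronger:

```python
import chess

# Crude material-only evaluation (in pawns), from the perspective of the side to move.
PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def evaluate(board: chess.Board) -> float:
    score = sum(v * len(board.pieces(p, chess.WHITE)) for p, v in PIECE_VALUES.items())
    score -= sum(v * len(board.pieces(p, chess.BLACK)) for p, v in PIECE_VALUES.items())
    return score if board.turn == chess.WHITE else -score

def negamax(board: chess.Board, depth: int) -> float:
    if board.is_checkmate():
        return -1000.0                      # side to move has been mated
    if depth == 0 or board.is_game_over():
        return evaluate(board)              # "evaluation": judge the position as it stands
    best = -float("inf")
    for move in board.legal_moves:          # "search": look further ahead
        board.push(move)
        best = max(best, -negamax(board, depth - 1))
        board.pop()
    return best

def pick_move(board: chess.Board, depth: int) -> chess.Move:
    best_move, best_value = None, -float("inf")
    for move in board.legal_moves:
        board.push(move)
        value = -negamax(board, depth - 1)
        board.pop()
        if value > best_value:
            best_move, best_value = move, value
    return best_move

# Same evaluation, different depths: depth=3 play is visibly stronger than depth=1.
# print(pick_move(chess.Board(), depth=3))
```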
The low-resource configuration of o3 that only aggregates 6 traces already improved a lot on the results of previous contenders; the plot of dependence on problem size shows this very clearly. Is there a reason to suspect that the aggregation is best-of-n rather than consensus (picking the most popular answer)? Their outcome reward model might have systematic errors worse than those of the generative model, since ground truth is in verifiers anyway.
That’s a good point, it could be consensus.
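The two aggregation schemes being contrasted differ only in the selection rule, which is easy to state in code; in this sketch, `reward_model` is a hypothetical scorer and the sampled answers are invented:

```python
from collections import Counter

def best_of_n(answers, reward_model):
    """Best-of-n: keep the single answer the outcome reward model scores highest."""
    return max(answers, key=reward_model)

def consensus(answers):
    """Consensus / self-consistency: keep the most popular final answer."""
    return Counter(answers).most_common(1)[0][0]

# Example: six sampled traces whose final answers disagree.
samples = ["42", "42", "41", "42", "7", "41"]
print(consensus(samples))              # -> "42"
# best_of_n(samples, reward_model)     # needs a trained outcome reward model,
#                                      # whose systematic errors consensus avoids
```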
If everyone reads it, everyone survives?