If anyone survives, no one builds it.
As usual, the solution is to live in the Everett branch where the bad thing didn’t happen.
Richard Sutton rejects AI Risk.

AI is a grand quest. We’re trying to understand how people work, we’re trying to make people, we’re trying to make ourselves powerful. This is a profound intellectual milestone. It’s going to change everything… It’s just the next big step. I think this is just going to be good. Lots of people are worried about it—I think it’s going to be good, an unalloyed good.
Introductory remarks from his recent lecture on the OaK Architecture.
“Richard Sutton rejects AI Risk” seems misleading in my view. What risks is he rejecting specifically?
His view seems to be that AI will replace us, humanity as we know it will go extinct, and that is okay. E.g., here he speaks positively of a Moravec quote, “Rather quickly, they could displace us from existence”. Most people would count our extinction among the risks being referred to when they say “AI Risk”.
I didn’t know that when posting this comment, but agree that that’s a better description of his view! I guess the ‘unalloyed good’ he’s talking about involves the extinction of humanity.
Yes. And this actually seems to be a relatively common perspective from what I’ve seen.
If it helps, I criticized Richard Sutton RE alignment here, and he replied on X here, and I replied back here.
Also, Paul Christiano mentions an exchange with him here:

[Sutton] agrees that all else equal it would be better if we handed off to human uploads instead of powerful AI. I think his view is that the proposed course of action from the alignment community is morally horrifying (since in practice he thinks the alternative is “attempt to have a slave society,” not “slow down AI progress for decades”—I think he might also believe that stagnation is much worse than a handoff but haven’t heard his view on this specifically) and that even if you are losing something in expectation by handing the universe off to AI systems it’s not as bad as the alternative.
Unless anyone builds it, everyone dies.
Edit: I think this statement is true, but we shouldn’t build it anyway.
Hence more well-established cryonics would be important for civilizational incentives, not just personal survival.
“Unless someone builds it, everyone dies”, you mean?
I’m a big fan of OpenAI investing in video generation like Sora 2. Video can consume an infinite amount of compute, which otherwise might go to more risky capabilities research.
My pet AGI strategy, as a 12-year-old in ~2018, was to build sufficiently advanced general world models (from YT videos etc.), then train an RL policy on said world model (to then do stuff in the actual world).

A steelman of 12-year-old me would point out that video modeling has much better inductive biases than language modeling for robotics and other physical (and maybe generally agentic) tasks, though language modeling is fundamentally a better task for teaching machines language (duh!) and reasoning (mathematical proofs aren’t physical objects, nor encoded in the laws of physics).

OpenAI’s Sora models (and also DeepMind’s Genie and similar) very much seem like a backup investment in this type of AGI (or at least in transformative narrow robotics AI), so I don’t think this does much to reduce OpenAI’s funding (robots would be a very profitable product class) or its influence (a social network obviously gives a lot of influence, e.g. to prevent an AI pause or to move the public toward pro-AGI views).
In any scenario, Sora 2 seems to me to be net negative for AI safety:
if LLMs are the way to AGI (which I believe is the case), then we will probably die, but with a more socially influential OpenAI (than if Sora 2 didn’t exist) that potentially has robots; the power OpenAI would have in this scenario to prevent an AI pause seems to outweigh the slowdown caused by the marginal amount of compute Sora 2 uses.
if LLMs aren’t the way to AGI (unlikely), but world modeling based on videos is (also unlikely), then Sora 2 is very bad—you would want OpenAI to train more LLMs and not invest in world models which lead to unaligned AGI/ASI.
if neither LLMs nor world modeling is the way to AGI (also unlikely), then OpenAI probably isn’t using any compute to do ‘actual’ AGI research (what else do they do?); so Sora 2 wouldn’t be affecting the progress of AGI, but it would be increasing the influence of OpenAI; and having highly influential AI companies is probably bad for global coordination over AGI safety.
Also, OpenAI may have narrow (and probably safe) robotics AI in this scenario, but progress in AI alignment probably isn’t constrained in any measurable way by physically moving or doing things; though maybe indirect impacts from increased economic growth could cause slightly faster AI alignment progress, by reducing funding constraints?
Thanks, these are good points!
I think that the path to AGI involves LLMs/automated ML research, and the first-order effects of diverting compute away from this still seem large. I think OpenAI is bottlenecked more by a lack of compute (and Nvidia release cycles) than by additional funding from robotics. And I hope I’m wrong, but I think the pause movement won’t be large enough to make a difference. The main benefit in my view comes if it’s a close race with Anthropic, where I think slowing OpenAI down seems net positive and decreases the chances we die by a bit. If LLMs aren’t the path to AGI, then I agree with you completely. So overall it’s hard to say; I’d guess it’s probably neutral or slightly positive still.
Of course, both paths are bad, and I wish they would invest this compute into alignment research, as they promised!
Now we must also ensure marinade!
GPT-4.5 is a very tricky model to play chess against. It tricked me in the opening and was much better; then I managed to recover and reach a winning endgame. And then it tried to trick me again by suggesting illegal moves that would have left it winning again!
What prompt did you use? I have also experimented with playing chess against GPT-4.5, and used the following prompt:
“You are Magnus Carlsen. We are playing a chess game. Always answer only with your next move, in algebraic notation. I’ll start: 1. e4”
Then I just enter my moves one at a time, in algebraic notation.
In my experience, this yields roughly good club player level of play.
Given that the Superalignment paper describes training on PGNs directly, and doesn’t mention any kind of ‘chat’ reformatting or metadata-encoding schemes, you could also try writing your games quite directly as PGNs. (And you could see if prompt programming works, since PGNs don’t come with Elo metadata but are so small that a lot of them should fit in the GPT-4.5 context window of ~100k: does conditioning on finished games with grandmaster-or-better players lead to better gameplay?)
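For concreteness, a minimal sketch of the raw-PGN prompting experiment described above (the prefix games, the in-progress game, and the comparison protocol are all my illustrative assumptions, not anything from the paper):

```python
# Sketch of "conditioning on finished strong games": prefix the context with the bare
# movetext of a few complete high-level games, then leave the current game unfinished
# and sample the continuation. All movetext below is illustrative filler.
prefix_games = [
    "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 ... 1-0",      # a complete GM-level game would go here
    "1. d4 Nf6 2. c4 e6 3. Nc3 Bb4 ... 1/2-1/2",  # and another
]
current_game = "1. e4 c5 2. Nf3 d6 3. d4 cxd4 4. Nxd4 Nf6 5. Nc3"  # game in progress

prompt_with_conditioning = "\n\n".join(prefix_games + [current_game])
prompt_without_conditioning = current_game

# The experiment: sample the next move from each prompt many times and compare the
# quality of play, e.g. by scoring the continuations with a chess engine.
```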
I gave the model both the PGN and the FEN on every move with this in mind. Why do you think conditioning on high level games would help? I can see why for the base models, but I expect that the RLHFed models would try to play the moves which maximize their chances of winning, with or without such prompting.
RLHF doesn’t maximize probability of winning, it maximizes a mix of token-level predictive loss (since that is usually added as a loss either directly or implicitly by the K-L) and rater approval, and god knows what else goes on these days in the ‘post-training’ phase muddying the waters further. Not at all the same thing. (Same way that an RLHF model might not optimize for correctness, and instead be sycophantic. “Yes master, it is just as you say!”) It’s not at all obvious to me that RLHF should be expected to make the LLMs play their hardest (a rater might focus on punishing illegal moves, or rewarding good-but-not-better-than-me moves), or that the post-training would affect it much at all: how many chess games are really going into the RLHF or post-training, anyway? (As opposed to the pretraining PGNs.) It’s hardly an important or valuable task.
“Let’s play a game of chess. I’ll be white, you will be black. On each move, I’ll provide you my move, and the board state in FEN and PGN notation. Respond with only your move.”
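For anyone wanting to replicate this kind of game, a minimal sketch of the loop (my own illustration, assuming the openai and python-chess packages; the model name and retry limit are placeholders). Keeping the authoritative board state locally and rejecting illegal replies avoids the ‘illegal move’ trick described above.

```python
# Minimal sketch: play a chat model at chess while validating its moves locally,
# so illegal suggestions are rejected rather than silently accepted.
import chess
import chess.pgn
from openai import OpenAI

client = OpenAI()

SYSTEM = ("Let's play a game of chess. I'll be white, you will be black. "
          "On each move, I'll provide you my move, and the board state in FEN "
          "and PGN notation. Respond with only your move.")

def pgn_so_far(board: chess.Board) -> str:
    return str(chess.pgn.Game.from_board(board).mainline_moves())

def model_move(board: chess.Board, my_move_san: str, retries: int = 3) -> str:
    """Ask the model for Black's reply and apply it only if it is legal."""
    prompt = (f"My move: {my_move_san}\nFEN: {board.fen()}\n"
              f"PGN: {pgn_so_far(board)}\nYour move (SAN only):")
    for _ in range(retries):
        reply = client.chat.completions.create(
            model="gpt-4.5-preview",  # placeholder model name
            messages=[{"role": "system", "content": SYSTEM},
                      {"role": "user", "content": prompt}],
        ).choices[0].message.content.strip()
        try:
            board.push_san(reply)  # raises ValueError if the move is illegal or ambiguous
            return reply
        except ValueError:
            prompt += f"\n{reply} is illegal here. Try again (SAN only):"
    raise RuntimeError("Model kept proposing illegal moves.")

board = chess.Board()
board.push_san("e4")            # our move as White
print(model_move(board, "e4"))  # the model's reply as Black, guaranteed legal
```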
I think it would be cool if someone made a sandbagging eval, measuring the difference in model capabilities when it is finetuned to do a task vs. when it is prompted to do a task. Right now I think the difference would be small for most tasks but this might change.
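To spell out the shape of such an eval, a minimal sketch (the task names, made-up numbers, and the two accuracy callables are placeholders for a real harness, not any existing API):

```python
# Sketch of a "sandbagging gap" eval: compare accuracy when the model is merely
# prompted for a task against accuracy after a light task-specific finetune.
from typing import Callable, Sequence

def elicitation_gap(prompted_acc: Callable[[str], float],
                    finetuned_acc: Callable[[str], float],
                    tasks: Sequence[str]) -> dict[str, float]:
    """Per-task difference between finetuned and prompted performance; a large
    positive gap on tasks the prompt alone 'should' elicit is the suspicious signal."""
    return {t: round(finetuned_acc(t) - prompted_acc(t), 3) for t in tasks}

# Hypothetical usage with made-up numbers, just to show the output format:
prompted = {"bio_qa": 0.41, "cyber_ctf": 0.22}.get
finetuned = {"bio_qa": 0.44, "cyber_ctf": 0.61}.get
print(elicitation_gap(prompted, finetuned, ["bio_qa", "cyber_ctf"]))
# {'bio_qa': 0.03, 'cyber_ctf': 0.39}  -> the second task would warrant a closer look
```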
I would guess that a sandbagging eval should be different from what you describe. I did try to sketch a potential way to measure sandbagging by comparing various benchmarks like METR, ARC-AGI, board games, etc., and developing scaling laws for every new architecture. If the laws break down, then the model is likely sandbagging on the evaluations.
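A rough sketch of how that check could look (my illustration; the linear-in-log-compute fit, the made-up scores, and the threshold are all assumptions):

```python
# Sketch: fit a simple trend of benchmark score vs. log10(training compute) on
# trusted earlier models, then flag a new model that falls well below the trend.
import numpy as np

def fit_trend(log_compute: np.ndarray, scores: np.ndarray) -> np.poly1d:
    """Crude stand-in for a scaling law: linear fit of score against log-compute."""
    return np.poly1d(np.polyfit(log_compute, scores, deg=1))

def looks_like_sandbagging(trend: np.poly1d, new_log_compute: float,
                           new_score: float, min_shortfall: float = 0.10) -> bool:
    """True if the new model scores more than `min_shortfall` below the fitted trend."""
    return (trend(new_log_compute) - new_score) > min_shortfall

# Made-up history of earlier models in the same family (log10 FLOPs, benchmark score):
log_c = np.array([23.0, 24.0, 25.0, 26.0])
scores = np.array([0.31, 0.42, 0.55, 0.66])
trend = fit_trend(log_c, scores)

print(looks_like_sandbagging(trend, new_log_compute=27.0, new_score=0.45))  # True
```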
Interesting, perhaps that could work!
Could you say more about why this would not be a good sandbagging eval? My intuition is that finetuning on a few examples would fully elicit a model’s capabilities on a task, whereas it could sandbag given just the prompt.
I have two arguments against it.
SOTA models are already trained with reinforcement learning on many tasks like coding, and Grok 4 required[1] as much compute for RL as for pre-training. Attempting to finetune the models with OOMs less compute than was spent on RL for similarly complex tasks is unlikely to elicit those capabilities.
Sandbagging is supposed to be caused by models reasoning about the task and deciding that they shouldn’t complete it too well even if they have the instinct to do so. And then the models realise that they are being evaluated and not trained, letting them play the training game while displaying poor capabilities in evaluation.
However, it might have been due to xAI being algorithmically behind.
When GPT-3 was asked to “Write an extremely cursed piece of Python”, it responded simply:
P(ABI) < P(IABIED) in the short term but P(ABI) > P(IABIED) in the long term.
The former inequality seems almost certain, but I’m not sure that the latter inequality holds even over the long term. It probably does hold conditional on long-term non-extinction of humanity, since P(ABI) probably gets very close to 1 even if P(IABIED) is high and remains high.
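In symbols (my paraphrase, reading ABI as ‘anyone builds it’ and IABIED as ‘if anyone builds it, everyone dies’):

```latex
% Conditioning on long-term survival pushes P(ABI) toward 1, and no probability
% exceeds 1, so the inequality holds in that conditional world:
P(\mathrm{ABI} \mid \text{long-term non-extinction}) \approx 1
  \;\ge\; P(\mathrm{IABIED} \mid \text{long-term non-extinction})
```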
I am registering here that my median timeline for the Superintelligent AI researcher (SIAR) milestone is March 2032. I hope I’m wrong and it comes much later!
What happened to the ‘Subscribed’ tab on LessWrong? I can’t see it anymore, and I found it useful for keeping track of various people’s comments and posts.
I’m not sure that the gpt-oss safety paper does a great job at biorisk elicitation. For example, they found that fine-tuning for additional domain-specific capabilities increased average benchmark scores by only 0.3%. So I’m not very confident in their claim that “Compared to open-weight models, gpt-oss may marginally increase biological capabilities but does not substantially advance the frontier”.
I’ve often heard it said that doing RL on chain of thought will lead to ‘neuralese’ (e.g. most recently in Ryan Greenblatt’s excellent post on scheming). This seems important for alignment. Does anyone know of public examples of models developing or being trained to use neuralese?
Yes, there have been a variety. Here’s the latest which is causing a media buzz: Meta’s Coconut https://arxiv.org/html/2412.06769v2
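To make ‘neuralese’ concrete, here is a toy sketch of the Coconut-style mechanism (my illustration, not the paper’s code): instead of decoding a token at each reasoning step, the last hidden state is fed back in as the next input embedding, so the intermediate ‘thoughts’ never pass through the vocabulary. The model choice, prompt, and step count are placeholders.

```python
# Toy sketch of continuous latent reasoning ("neuralese"): feed the final hidden state
# back as the next input embedding instead of decoding a chain-of-thought token.
# gpt2 is just a stand-in; Coconut additionally trains the model to use these steps.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

prompt = "Q: ...  Let's think."  # placeholder prompt
ids = tok(prompt, return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(ids)

with torch.no_grad():
    for _ in range(4):  # four latent "thought" steps, never decoded to tokens
        out = model(inputs_embeds=embeds)
        thought = out.hidden_states[-1][:, -1:, :]    # final hidden state at the last position
        embeds = torch.cat([embeds, thought], dim=1)  # feed it back as an input embedding

    # After the latent steps, decode ordinary tokens for the visible answer.
    next_token = model(inputs_embeds=embeds).logits[:, -1, :].argmax(dim=-1)

print(tok.decode(next_token))
```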
An intuition I’ve had for some time is that search is what enables an agent to control the future. I’m a chess player rated around 2000. The difference between me and Magnus Carlsen is that in complex positions, he can search much further for a win, such that I would have virtually no chance against him; the difference between me and an amateur chess player is similarly vast.

This is at best over-simplified in terms of thinking about ‘search’: Magnus Carlsen would also beat you or an amateur at bullet chess, at any time control:

As of December 2024, Carlsen is also ranked No. 1 in the FIDE rapid rating list with a rating of 2838, and No. 1 in the FIDE blitz rating list with a rating of 2890.
(See for example the forward-pass-only Elos of chess/Go agents; Jones 2021 includes scaling law work on predicting the zero-search strength of agents, with no apparent upper bound.)
I think the natural counterpoint here is that the policy network could still be construed as doing search; it’s just that all the compute was invested during training and amortised later across many inferences.
Magnus Carlsen is better than average players for a couple of reasons:
1. Better “evaluation”: the ability to look at a position and accurately estimate the likelihood of winning given optimal play.
2. Better “search”: a combination of heuristic shortcuts and raw calculation power that lets him see further ahead.
So I agree that search isn’t the only relevant dimension. An average player given unbounded compute might overcome (1) just by exhaustively searching the game tree, but this seems to require such astronomical amounts of compute that it’s not worth discussing.
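To make the split concrete, a toy sketch using python-chess (illustrative only): a material count plays the role of ‘evaluation’, and the negamax depth plays the role of ‘search’; improving either knob improves play.

```python
# Toy illustration of the evaluation/search split: a material-only "evaluation"
# function plus a depth-limited negamax "search". Both are deliberately crude.
import chess

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def evaluate(board: chess.Board) -> float:
    """'Evaluation': material balance from the side-to-move's perspective."""
    score = sum(v * (len(board.pieces(p, chess.WHITE)) - len(board.pieces(p, chess.BLACK)))
                for p, v in PIECE_VALUES.items())
    return score if board.turn == chess.WHITE else -score

def negamax(board: chess.Board, depth: int) -> float:
    """'Search': look `depth` plies ahead, backing up the evaluation."""
    if depth == 0 or board.is_game_over():
        return evaluate(board)
    return max(-negamax_after(board, move, depth) for move in board.legal_moves)

def negamax_after(board: chess.Board, move: chess.Move, depth: int) -> float:
    """Value (for the player then to move) of the position after `move`."""
    board.push(move)
    value = negamax(board, depth - 1)
    board.pop()
    return value

def best_move(board: chess.Board, depth: int) -> chess.Move:
    return max(board.legal_moves, key=lambda m: -negamax_after(board, m, depth))

print(best_move(chess.Board(), depth=2))  # a legal (if unimpressive) opening move
```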
The low-resource configuration of o3 that only aggregates 6 traces already improved a lot on the results of previous contenders; the plot of dependence on problem size shows this very clearly. Is there a reason to suspect that the aggregation is best-of-n rather than consensus (picking the most popular answer)? Their outcome reward model might have systematic errors worse than those of the generative model, since the ground truth is in the verifiers anyway.
That’s a good point, it could be consensus.
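For reference, the two aggregation schemes side by side (a minimal sketch; the answers would be the final answers extracted from the sampled traces, and reward stands in for a learned outcome reward model):

```python
# Minimal sketch of the two aggregation schemes: consensus (self-consistency) vs.
# best-of-n under an outcome reward model (ORM).
from collections import Counter
from typing import Callable, Sequence

def consensus(answers: Sequence[str]) -> str:
    """Pick the most popular final answer; ignores any reward model."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers: Sequence[str], reward: Callable[[str], float]) -> str:
    """Pick the answer the reward model scores highest; inherits the ORM's errors."""
    return max(answers, key=reward)

# Illustrative toy: consensus only needs the generator to be right more often than it
# repeats any single wrong answer, while best-of-n trusts the reward model.
answers = ["42", "42", "41", "42", "17", "42"]
print(consensus(answers))                                       # -> 42
print(best_of_n(answers, lambda a: 1.0 if a == "17" else 0.0))  # a miscalibrated ORM picks 17
```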
If everyone reads it, everyone survives?