AI is a grand quest. We’re trying to understand how people work, we’re trying to make people, we’re trying to make ourselves powerful. This is a profound intellectual milestone. It’s going to change everything… It’s just the next big step. I think this is just going to be good. Lots of people are worried about it—I think it’s going to be good, an unalloyed good.
Introductory remarks from his recent lecture on the OaK Architecture.
“Richard Sutton rejects AI Risk” seems misleading in my view. What risks is he rejecting specifically?
His view seems to be that AI will replace us, humanity as we know it will go extinct, and that this is okay. E.g., here he speaks positively of a Moravec quote, “Rather quickly, they could displace us from existence”. Most people would count our extinction among the risks they are referring to when they say “AI Risk”.
I didn’t know that when posting this comment, but agree that that’s a better description of his view! I guess the ‘unalloyed good’ he’s talking about involves the extinction of humanity.
If it helps, I criticized Richard Sutton RE alignment here, and he replied on X here, and I replied back here.
Also, Paul Christiano mentions an exchange with him here:
[Sutton] agrees that all else equal it would be better if we handed off to human uploads instead of powerful AI. I think his view is that the proposed course of action from the alignment community is morally horrifying (since in practice he thinks the alternative is “attempt to have a slave society,” not “slow down AI progress for decades”—I think he might also believe that stagnation is much worse than a handoff but haven’t heard his view on this specifically) and that even if you are losing something in expectation by handing the universe off to AI systems it’s not as bad as the alternative.
GPT 4.5 is a very tricky model to play chess against. It tricked me in the opening and was much better, then I managed to recover and reach a winning endgame. And then it tried to trick me again by suggesting illegal moves that would have left it winning again!
Given the Superalignment paper describes being trained on PGNs directly, and doesn’t mention any kind of ‘chat’ reformatting or encoding metadata schemes, you could also try writing your games quite directly as PGNs. (And you could see if prompt programming works, since PGNs don’t come with Elo metadata but are so small a lot of them should fit in the GPT-4.5 context window of ~100k: does conditioning on finished game with grandmaster-or-better players lead to better gameplay?)
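One way to try this is to hand the model the game as a raw PGN with strong-player headers. A minimal sketch (the `pgn_prompt` helper and all header values are illustrative, not from the paper):

```python
# Sketch: format an in-progress game as a raw PGN with grandmaster-level
# metadata, to test whether conditioning on strong-player headers changes
# play. Header values here are illustrative, not from any real game.

def pgn_prompt(moves, white_elo=2750, black_elo=2750):
    """Build a PGN-style prompt ending mid-game, for the model to continue."""
    headers = [
        '[Event "Example Tournament"]',
        f'[WhiteElo "{white_elo}"]',
        f'[BlackElo "{black_elo}"]',
        '[Result "1-0"]',  # condition on a decisive result as well
    ]
    movetext = " ".join(
        f"{i // 2 + 1}. {move}" if i % 2 == 0 else move
        for i, move in enumerate(moves)
    )
    return "\n".join(headers) + "\n\n" + movetext

print(pgn_prompt(["e4", "e5", "Nf3"]))  # ends with: 1. e4 e5 2. Nf3
```

Whether the model actually plays stronger when conditioned this way is exactly the empirical question; the base-model intuition says it should, while post-training may wash the effect out.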
I gave the model both the PGN and the FEN on every move with this in mind. Why do you think conditioning on high level games would help? I can see why for the base models, but I expect that the RLHFed models would try to play the moves which maximize their chances of winning, with or without such prompting.
but I expect that the RLHFed models would try to play the moves which maximize their chances of winning
RLHF doesn’t maximize probability of winning, it maximizes a mix of token-level predictive loss (since that is usually added as a loss either directly or implicitly by the K-L) and rater approval, and god knows what else goes on these days in the ‘post-training’ phase muddying the waters further. Not at all the same thing. (Same way that a RLHF model might not optimize for correctness, and instead be sycophantic. “Yes master, it is just as you say!”) It’s not at all obvious to me that RLHF should be expected to make the LLMs play their hardest (a rater might focus on punishing illegal moves, or rewarding good-but-not-better-than-me moves), or that the post-training would affect it much at all: how many chess games are really going into the RLHF or post-training, anyway? (As opposed to the pretraining PGNs.) It’s hardly an important or valuable task.
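To make that concrete: a standard RLHF-style update scores a completion by a learned reward (rater approval) minus a KL penalty toward the pretrained model. A toy sketch with made-up numbers (the function and values are illustrative, not any lab’s actual loss):

```python
# Toy sketch of the usual RLHF objective: learned reward (rater approval)
# minus a KL penalty keeping the policy close to the pretrained model.
# Note that "probability of winning the game" appears nowhere in it.

def rlhf_objective(reward, logp_policy, logp_pretrained, beta=0.1):
    """Per-sample objective: reward minus beta * (log pi - log pi_ref)."""
    kl_term = logp_policy - logp_pretrained  # sample-based KL estimate
    return reward - beta * kl_term

# A weaker move that raters like can outscore a stronger move they punish
# (e.g. a sound move that merely looks illegal or unnatural to a rater).
liked_weak = rlhf_objective(reward=1.0, logp_policy=-2.0, logp_pretrained=-2.5)
disliked_strong = rlhf_objective(reward=-0.5, logp_policy=-2.0, logp_pretrained=-2.0)
assert liked_weak > disliked_strong
```

The policy is pulled toward whatever the raters rewarded, regularized toward the pretraining distribution; chess strength only enters to the extent raters rewarded it.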
“Let’s play a game of chess. I’ll be white, you will be black. On each move, I’ll provide you my move, and the board state in FEN and PGN notation. Respond with only your move.”
I’m not sure that the gpt-oss safety paper does a great job at biorisk elicitation. For example, they found that fine-tuning for additional domain-specific capabilities increased average benchmark scores by only 0.3%. So I’m not very confident in their claim that “Compared to open-weight models, gpt-oss may marginally increase biological capabilities but does not substantially advance the frontier”.
I’ve often heard it said that doing RL on chain of thought will lead to ‘neuralese’ (e.g. most recently in Ryan Greenblatt’s excellent post on scheming). This seems important for alignment. Does anyone know of public examples of models developing or being trained to use neuralese?
An intuition I’ve had for some time is that search is what enables an agent to control the future. I’m a chess player rated around 2000. The difference between me and Magnus Carlsen is that in complex positions, he can search much further for a win, such that I would stand virtually no chance against him; the difference between me and an amateur chess player is similarly vast.
This is at best over-simplified in terms of thinking about ‘search’: Magnus Carlsen would also beat you or an amateur at bullet chess, at any time control:
As of December 2024, Carlsen is also ranked No. 1 in the FIDE rapid rating list with a rating of 2838, and No. 1 in the FIDE blitz rating list with a rating of 2890.
(See for example the forward-pass-only Elos of chess/Go agents; Jones 2021 includes scaling law work on predicting the zero-search strength of agents, with no apparent upper bound.)
I think the natural counterpoint here is that the policy network could still be construed as doing search; it’s just that all the compute was invested during training and amortised later across many inferences.
Magnus Carlsen is better than average players for a couple of reasons:
1. Better “evaluation”: the ability to look at a position and accurately estimate the likelihood of winning given optimal play.
2. Better “search”: a combination of heuristic shortcuts and raw calculation power that lets him see further ahead.
So I agree that search isn’t the only relevant dimension. An average player given unbounded compute might overcome (1) just by exhaustively searching the game tree, but this seems to require such astronomical amounts of compute that it’s not worth discussing.
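The evaluation/search split can be made concrete with a toy game. A sketch (Nim-style rules invented for illustration: take 1–3 stones, taking the last stone wins; the depth-0 return stands in for “evaluation”):

```python
# Toy illustration of evaluation vs. search, in a Nim-like game:
# players alternately take 1-3 stones; taking the last stone wins.

def negamax(n, depth):
    """Return (score, best_take) for the player to move with n stones left."""
    if n == 0:
        return (-1, None)   # previous player took the last stone: we lost
    if depth == 0:
        return (0, None)    # "evaluation": a weak heuristic that knows nothing
    best_score, best_take = -2, None
    for take in (1, 2, 3):
        if take <= n:
            score = -negamax(n - take, depth - 1)[0]
            if score > best_score:
                best_score, best_take = score, take
    return (best_score, best_take)

# Deep search finds the winning move from 5 stones (leave a multiple of 4):
assert negamax(5, depth=8) == (1, 1)
# Shallow search with the same weak evaluation can't tell the moves apart:
assert negamax(5, depth=1)[0] == 0
```

A sharper evaluation (e.g. knowing that multiples of 4 are lost) would let even depth-1 search play perfectly, which is the sense in which the two dimensions trade off.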
The low-resource configuration of o3, which aggregates only 6 traces, already improved a lot on the results of previous contenders; the plot of dependence on problem size shows this very clearly. Is there a reason to suspect that the aggregation is best-of-n rather than consensus (picking the most popular answer)? Their outcome reward model might have systematic errors worse than those of the generative model, since the ground truth is in the verifiers anyway.
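For reference, the two aggregation schemes differ as in this sketch (the answers and scores are made up; consensus never consults the reward model at all):

```python
from collections import Counter

# Two ways to aggregate n sampled answers, each given as (answer, score):
def best_of_n(samples):
    """Pick the answer whose reward-model score is highest."""
    return max(samples, key=lambda s: s[1])[0]

def consensus(samples):
    """Pick the most frequently sampled answer, ignoring scores."""
    return Counter(answer for answer, _ in samples).most_common(1)[0][0]

# Six traces: the reward model happens to prefer a rarely-sampled answer.
samples = [("A", 0.6), ("A", 0.5), ("A", 0.55), ("B", 0.9), ("A", 0.4), ("C", 0.3)]
assert best_of_n(samples) == "B"
assert consensus(samples) == "A"
```

The two disagree exactly when the reward model’s errors are systematic, which is the scenario raised above.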
I guess the ‘unalloyed good’ he’s talking about involves the extinction of humanity.

Yes. And this actually seems to be a relatively common perspective from what I’ve seen.
GPT 4.5 is a very tricky model to play chess against.

What prompt did you use? I have also experimented with playing chess against GPT-4.5, and used the following prompt:
“You are Magnus Carlsen. We are playing a chess game. Always answer only with your next move, in algebraic notation. I’ll start: 1. e4”
Then I just enter my moves one at a time, in algebraic notation.
In my experience, this yields roughly good club player level of play.
Does anyone know of public examples of models developing or being trained to use neuralese?

Yes, there have been a variety. Here’s the latest, which is causing a media buzz: Meta’s Coconut https://arxiv.org/html/2412.06769v2
Is there a reason to suspect that aggregation is best-of-n rather than consensus (picking the most popular answer)?

That’s a good point, it could be consensus.
Does anyone still believe in the original[1] AI 2027 timelines? If so, why?
Daniel has since updated to 2028, and I think the other authors are more conservative.