Reinforcing Tsvi’s point:
I tend to think the correct lesson from Claude Plays Pokemon is “it’s impressive that it does as well as it does, because it hasn’t been trained to do things like this at all!”.
Same with the vending machine example.
Presumably, with all the hype around “agentic”, tasks like this (beyond just “agentic” coding) will be added to the RL pipeline soon. Then, we will get to see what the capabilities are like when agency gets explicitly trained.
(Crux: I’m wrong if Claude 4 already has tasks like this in the RL.)
I think that agency at chess is not the same as agency in the real world. That is why we have superhuman chess bots, and not superhuman autonomous drones.
Very roughly speaking, the bottleneck here is world-models. Game tree search can probably work on real-world problems to the extent that NNs can provide good world-models for these problems. Of course, we haven’t seen large-scale tests of this sort of architecture yet (Claude Plays Pokemon is even less of a test of how well this sort of thing works; reasoning models are not doing MCTS internally).
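To make the “world-models are the bottleneck” claim concrete, here is a minimal sketch of the kind of architecture I have in mind: a crude Monte-Carlo lookahead (far simpler than real MCTS) where every step of planning has to go through a learned world model, so the planner is only as good as that model. The `WorldModel` interface and the random-rollout evaluation are illustrative stand-ins, not a description of any lab’s system.

```python
import random
from dataclasses import dataclass
from typing import Callable, Hashable, List

# Hypothetical interface: the planner is only as good as these two functions.
# In the real-world case they would be neural nets; here they are stubs.
@dataclass
class WorldModel:
    step: Callable[[Hashable, int], Hashable]   # predicted next state
    reward: Callable[[Hashable], float]         # predicted reward/value of a state

def rollout_value(model: WorldModel, state: Hashable, actions: List[int],
                  depth: int, rng: random.Random) -> float:
    """Estimate a state's value by rolling the learned model forward with random actions."""
    total = 0.0
    for _ in range(depth):
        state = model.step(state, rng.choice(actions))
        total += model.reward(state)
    return total

def plan(model: WorldModel, state: Hashable, actions: List[int],
         rollouts: int = 32, depth: int = 10, seed: int = 0) -> int:
    """Pick the action whose simulated futures look best *according to the model*."""
    rng = random.Random(seed)
    def score(a: int) -> float:
        nxt = model.step(state, a)
        sims = [rollout_value(model, nxt, actions, depth, rng) for _ in range(rollouts)]
        return model.reward(nxt) + sum(sims) / len(sims)
    return max(actions, key=score)

# Toy usage: a 1-D world where the "model" happens to be exact.
toy = WorldModel(step=lambda s, a: s + a, reward=lambda s: -abs(s - 7))
print(plan(toy, state=0, actions=[-1, 0, 1]))  # picks the action that moves toward 7
```

In the toy usage the model’s `step` happens to be exact; to the extent an NN world-model is wrong about the real world, this same search machinery just optimizes against the error.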
I suppose that I don’t know exactly what kind of agentic tasks LLMs are currently being trained on… But people have been talking about LLM agents for years, and I’d be shocked if the frontier labs weren’t trying? Like, if that worked out of the box, we would know by now (?). Do you disagree?
It seems like for your point to make sense, you have to be arguing that LLMs haven’t been trained on such agentic tasks at all—not just that they perhaps weren’t trained on Pokémon specifically. They’re supposed to be general agents—we should be evaluating them on such things as untrained tasks! And like, complete transcripts of Twitch streams of Pokémon playthroughs probably are in the training data, so this is even pretty in-distribution. Their performance is NOT particularly impressive compared to what I would have expected, chatting with them in 2022 or so, when it seemed like they had pretty decent common sense. I would have expected Pokémon to be solved 3 years later. The apparent competence was to some degree an illusion—that, or they really just can’t be motivated to do stuff yet.

And I worry that these two memes—AGI is near, and alignment is not solved—are kind of propping each other up here. If capabilities seem to lag, it’s because alignment isn’t solved and the LLMs don’t care about the task. If alignment seems to be solved, it’s because LLMs aren’t competent enough to take the sharp left turn, but they will be soon. I’m not talking about you specifically, but about the memetic environment on LessWrong.
Unrelated, but: How do you know reasoning models are not doing MCTS internally? I’m not sure I really agree with that, regardless of what you mean by “internally”. ToT is arguably a mutated and horribly heuristic type of guided MCTS. And I don’t know whether something MCTS-like is happening inside the LLMs.
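To gesture at what I mean: below is a stripped-down sketch of a ToT-style search, with `propose` and `evaluate` as stand-ins for LLM calls (my own toy framing, not any particular paper’s implementation). Like guided MCTS it grows a tree under a value heuristic, but with the simulations and visit statistics amputated.

```python
import heapq
from typing import Callable, List, Tuple

def tree_of_thoughts(
    root: str,
    propose: Callable[[str], List[str]],   # stand-in for "LLM, give candidate next thoughts"
    evaluate: Callable[[str], float],      # stand-in for "LLM, rate this partial solution"
    beam_width: int = 3,
    max_depth: int = 4,
) -> str:
    """Best-first search over chains of thought: expand a tree guided by value
    estimates, but with no simulations and no visit-count statistics, just a score."""
    frontier: List[Tuple[float, str]] = [(-evaluate(root), root)]
    best = root
    for _ in range(max_depth):
        next_frontier: List[Tuple[float, str]] = []
        for _neg_score, thought in frontier:
            for child in propose(thought):
                heapq.heappush(next_frontier, (-evaluate(child), child))
        if not next_frontier:
            break
        frontier = heapq.nsmallest(beam_width, next_frontier)  # keep top-scoring thoughts
        best = frontier[0][1]
    return best

# Toy usage: "thoughts" are strings; the heuristic prefers chains with more "+2" steps.
propose = lambda t: [t + " +1", t + " +2"]
evaluate = lambda t: min(t.count("+1") + 2 * t.count("+2"), 10)
print(tree_of_thoughts("start:", propose, evaluate))
```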
Agentic (tool-using) RLVR only started working in late 2024, with o3 as the first proper tool-using reasoning LLM prototype. From how it all looks (rickety and failing in weird ways), it’ll take another pretraining scale-up to get enough redundant reliability for some of the noise to fall away, and thus to get a better look at the implied capabilities. Also, the development of environments for agentic RLVR only seems to be starting to ramp up this year, and GB200 NVL72s, which are significantly more efficient for RLVR on large models, are only now starting to come online in large quantities.
So I expect that only 2026 LLMs trained with agentic RLVR will give a first reasonable glimpse of what this method gets us and the shape of its limitations, and only in 2027 will we get a picture overdetermined by essential capabilities of the method rather than by contingent early-days issues. (In the worlds where it ends up below AGI in 2027, and also where nothing else works too well before that.)
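For concreteness, here is roughly the shape of loop I mean by “agentic (tool-using) RLVR” above, sketched with placeholder functions (`llm`, `run_tool`, and `verify` are stand-ins, not anyone’s actual stack): the model interleaves tool calls with its own steps, and the finished trajectory gets a single programmatically verified reward for RL.

```python
from typing import Callable, List, Tuple

def rollout_with_tools(
    llm: Callable[[str], str],        # placeholder: returns either text or "TOOL: <cmd>"
    run_tool: Callable[[str], str],   # placeholder: executes the tool call, returns output
    verify: Callable[[str], float],   # placeholder: programmatic check -> verifiable reward
    prompt: str,
    max_steps: int = 8,
) -> Tuple[List[str], float]:
    """One agentic RLVR episode: interleave model steps and tool calls,
    then attach a single verifiable reward to the whole trajectory."""
    transcript = [prompt]
    for _ in range(max_steps):
        action = llm("\n".join(transcript))
        transcript.append(action)
        if action.startswith("TOOL:"):
            transcript.append(run_tool(action[len("TOOL:"):].strip()))
        else:
            break  # model produced a final answer
    reward = verify("\n".join(transcript))  # e.g. tests pass, answer matches a checker
    return transcript, reward

# Toy usage with stub functions standing in for a real model, tool, and verifier.
fake_llm = lambda ctx: "TOOL: add 2 2" if "4" not in ctx else "final answer: 4"
fake_tool = lambda cmd: str(sum(int(x) for x in cmd.split()[1:]))
fake_verify = lambda t: 1.0 if "final answer: 4" in t else 0.0
print(rollout_with_tools(fake_llm, fake_tool, fake_verify, "What is 2+2?"))
```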
So in other words, everything has to go “right” for AGI by 2027?
Maybe it will work. I’m only arguing against high confidence in short timelines. Anything could happen.
I’m responding to the point about LLM agents being a thing for years, and that therefore some level of maturity should be expected from them. I think this isn’t quite right, as the current method is new, the older methods didn’t work out, and it’s too early to tell that the new method won’t work out.
So I’m discussing when it’ll be time to tell that it won’t work out either (unless it does), at which point it’ll be possible to have some sense as to why. Which is not yet, probably in 2026, and certainly by 2027. I’m not really arguing about the probability that it does work out.
You are consistent about this kind of reasoning, but a lot of others seem to expect everything to happen really fast (before 2030) while also dismissing anything that doesn’t work as not having been tried because there haven’t been enough years for research.
Numbers? What does “high confidence” mean here? IIRC from our non-text discussions, Tsvi considers anything above 1% by end-of-year 2030 to be “high confidence in short timelines” of the sort he would have something to say about. (But not the level of strong disagreement he’s expressing in our written dialogue until something like 5-10% iirc.) What numbers would you “only argue against”?
Say what now?? Did I write that somewhere? That would be a typo or possibly a thinko. My own repeatedly stated probabilities would be around 1% or .5%! E.g. in https://www.lesswrong.com/posts/sTDfraZab47KiRMmT/views-on-when-agi-comes-and-on-strategy-to-reduce
I recall it as part of our (unrecorded) conversation, but I could be misremembering. Given your reaction I think I was probably misremembering. Sorry for the error!
So, to be clear, what is the probability someone else could state such that you would have “something to say about it” (i.e., some kind of argument against it)? Your own probability being 0.5%–1% isn’t inconsistent with what I said (if you’d have something to say about any probability above your own), but where would you actually put that cutoff? 5%? 10%?
If someone says 10% by 2030, we disagree, but it would be hard to find something to talk about purely on that basis. (Of course, they could have other more specific beliefs that I could argue with.) If they say, IDK, 25% or something (IDK, obviously not a sharp cutoff by any means, why would there be?), then I start feeling like we ought to be able to find a disagreement just by investigating what makes us say such different probabilities. Also I start feeling like they have strategically bad probabilities (I mean, their beliefs that are incorrect according to me would have practical implications that I think are mistaken actions). (On second thought, probably even 10% has strategically bad implications, assuming that implies 20% by 2035 or similar.)
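Spelling out that last parenthetical with a toy calculation (the constant annual hazard rate is my own simplification, nothing more): 10% by 2030 does extrapolate to roughly 20% by 2035.

```python
# Crude extrapolation (an assumption for illustration, not a stated model): a constant
# annual hazard rate h such that P(AGI by 2030) = 0.10 over the next ~5 years.
p_2030 = 0.10
years_to_2030 = 5
h = 1 - (1 - p_2030) ** (1 / years_to_2030)      # ~2.1% per year
p_2035 = 1 - (1 - h) ** (years_to_2030 + 5)      # same hazard continued for 5 more years
print(f"annual hazard ~ {h:.3f}, P(by 2035) ~ {p_2035:.2f}")  # ~0.19, i.e. roughly 20%
```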
High confidence means over 75%, at least.
Short timelines means, say, less than 10 years, though at this point I think the very short timeline picture means “around 2030”
I don’t know how anyone could reasonably refer to 1% confidence as high.
Well, overconfident/underconfident is always only meaningful relative to some baseline, so if you strongly think (say) 0.001% is the right level of confidence, then 1% is high relative to that.
The various numbers I’ve stated during this debate are 60%, 50%, and 30%, so none of them are high by your meaning. Does that really mean you aren’t arguing against my positions? (This was not my previous impression.)
I think 60% by 2030 is too high, and I am arguing against numbers like that. There’s some ambiguity about drawing the lines because high numbers on very short timelines are of course strictly less plausible than high numbers on merely short timelines, so there isn’t necessarily one best number to compare.
On reflection, I don’t like the phrase “high confidence” for <50%, and preferably not even for <75%. Something like “high credence” seems more appropriate—though one can certainly have higher or lower confidence, it is not clear communication to say you are highly confident of something which you believe at little better than even odds. Even if you were buying a lottery ticket with the special knowledge that you had picked one of three possible winning numbers, you still wouldn’t say you were highly confident that the ticket would win—even though you would no longer be confident of losing!
Anyway, I haven’t necessarily been consistent / explicit about this throughout the conversation.
I’m at least 50% sure that this timeline would happen ~2x faster. Conditional on training for agency yielding positive results, the rest would be overdetermined by EoY 2025 / early 2026. Otherwise, 2026 will be a slog and the 2027 picture wouldn’t arrive in time (i.e., longer timelines).
I don’t think LLMs have been particularly trained on what I’d consider the obvious things to really focus on agency-qua-agency in the sense we care about here. (I do think they’ve been laying down scaffolding and doing the preliminary versions of the obvious-things-you’d-do-first, in particular.)
Several months ago I had dinner with a GDM employee whose team was working on RL to make LLMs play games. I would be very surprised if this hasn’t been going on for well over a year already.
In terms of public releases, reasoning models are less than a year old. The way these things work, I suspect, is that there are a lot of smaller, less expensive experiments going on at any given time, which generally take time to make it into the next big training run. These projects take some time to propose and develop, and the number of such experiments going on at a frontier lab at a given time is (very roughly) the number of research engineers (i.e., talent-constrained; you can’t try every idea). Big training runs take several months, with roughly one happening at a time.
“Agentic” wasn’t a big buzzword until very recently. Google Trends shows an obvious exponential-ish trend which starts very small in the middle of last year, doesn’t get significant until the beginning of this year, and explodes out from there.
Thinking about all this, I think things seem just about on the fence. I suspect the first few reasoning models didn’t have game-playing in their RL at all, because the emphasis was on getting “reasoning” to work. A proactive lab could have put game-playing into the RL for the next iteration. A reactive lab could have only gotten serious about it this year.
The scale also matters a lot. Data-hunger means that they’ll throw anything they have into the next training run so long as it saw some success in smaller-scale experiments, and maybe even if it didn’t. However, the first round of game-playing training environments could end up having a negligible effect on the final product, due to not having a ton of training cases yet. But by the second round, if not the first, they should have scraped together a big collection of cases to train on.
There’s also the question of how good the RL algorithms are. I haven’t looked into it very much, and most of the top labs keep details quite private anyway, but my impression is that the RL algorithms used so far have been quite bad (not ‘real RL’—just assigning equal credit to all tokens in a chain-of-thought). This will presumably get better (e.g., they’ll figure out how to use some MCTS variant if they haven’t already). This is extremely significant for long-horizon tasks, because the RL algorithms right now (I’m guessing) have to be able to elicit at least one successful sample in order to get a good training gradient in that direction; long tasks will be stuck in failed runs if there isn’t any planning-like component.
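To illustrate the “at least one successful sample” point, here is a toy sketch of the kind of group-relative, equal-credit-per-token scheme I am guessing at (a guess, not a description of any lab’s actual algorithm). The whole-trajectory reward is normalized within a group of rollouts and broadcast to every token, so if every rollout in the group fails, the advantage is zero everywhere and there is no gradient to follow.

```python
from statistics import mean, pstdev
from typing import List

def token_advantages(group_rewards: List[float],
                     tokens_per_rollout: List[int]) -> List[List[float]]:
    """Group-relative credit assignment, roughly GRPO-style: each rollout's reward is
    normalized against the group, and that single scalar is broadcast to every token.
    No per-token credit, no search; if all rollouts fail identically, no gradient."""
    mu, sigma = mean(group_rewards), pstdev(group_rewards)
    def adv(r: float) -> float:
        return 0.0 if sigma == 0 else (r - mu) / sigma
    return [[adv(r)] * n for r, n in zip(group_rewards, tokens_per_rollout)]

# One rollout out of four succeeds: all of its tokens get the same positive credit.
print(token_advantages([0.0, 0.0, 1.0, 0.0], [5, 5, 5, 5]))
# All rollouts fail (the typical case for a hard long-horizon task): every advantage is 0.
print(token_advantages([0.0, 0.0, 0.0, 0.0], [5, 5, 5, 5]))
```

With the numbers above, the one success in the first group gets all of the (uniform, per-token) credit; in the second group there is nothing to reinforce, which is the failure mode I expect for long tasks without some planning-like component.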
In any case, yeah, I think if we haven’t seen proper game-playing training in frontier models yet, we should see it very soon. If LLMs are still “below child level at this task” at the end of the year, then this will be a significant update towards longer timelines for me. (Pokémon doesn’t count anymore, though, because there has now been significant scaffolding-tuning for that case, and because a lab could specifically train on Pokémon due to the focus on that case.)
Also: I suspect there’s already been a lot of explicit agency training in the context of programming. (Maybe not very long time-horizon stuff, though.)