This is a valuable discussion to have, but I believe Tsvi has not raised or focused on the strongest arguments. For context, like Tsvi, I don’t understand why people seem to be so confident of short timelines. However (though I did not read everything, and honestly I think this was justified since the conversation eventually seems to cycle and become unproductive) I generally found Abram’s arguments more persuasive and I seem to consider short timelines much more plausible than Tsvi does.
I agree that “originality” / “creativity” in models is something we want to watch, but I think Tsvi fails to raise the strongest argument that gets at this: LLMs are really, really bad at agency. Like, when it comes to the general category of “knowing stuff” and even “reasoning stuff out” there can be some argument around whether LLMs have passed through undergrad to grad student level, and whether this is really crystallized or fluid intelligence. But we’re interested in ASI here. ASI has to win at the category we might call “doing stuff.” Obviously this is a bit of a loose concept, but the situation here is MUCH more clear cut.
Claude cannot run a vending machine business without making wildly terrible decisions. A high school student would do a better job than Claude at this, and it’s not close.
Before that experiment, my best (flawed) example was Pokemon. Last I checked, there is no LLM that has beaten Pokemon end-to-end with fixed scaffolding. Gemini beat it, but the scaffolding was adapted as it played, which is obviously cheating, and as far as I understand it was still ridiculously slow for such a railroaded children’s game. And Claude 4 did not even improve at this task significantly beyond Claude 3. In other words, LLMs are below child level at this task.
I don’t know as much about this, but based on dropping in to a recent RL conference I believe LLMs are also really bad at games like NetHack.
I don’t think I’m cherry picking here. These seem like reasonable and in fact rather easy test cases for agentic behavior. I expect planning in the real world to be much harder for curse-of-dimensionality reasons. And in fact I am not seeing any robots walking down the street (I know this is partially manufacturing / hardware, and mention this only as a sanity check. As a similar unreliable sanity check, my robotics and automation ETF has been a poor investment. Probably someone will explain to me why I’m stupid for even considering these factors, and they will probably be right).
Now let’s consider the bigger picture. The recent METR report on task length scaling for various tasks overall moved me slightly towards shorter timelines by showing exponential scaling across many domains. However, note that more agentic domains are generally years behind less agentic domains, and in the case of FSD (which to me seems “most agentic”) the scaling is MUCH slower. There is more than one way to interpret these findings, but I think there is a reasonable interpretation which is consistent with my model: the more agency a task requires, the slower LLMs are gaining capability at that task. I haven’t done the (underspecified) math, but this seems to very likely cash out to subexponential scaling on agency (which I model as bottlenecked by the first task you totally fall over on).
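To gesture at what I mean by that cashing-out, here is a toy model with entirely made-up numbers (the threshold, the doubling times, and the “agency level” index are all placeholders of mine, not anything from the METR data): every domain’s task horizon doubles on a fixed clock, the doubling time grows with how much agency the domain demands, and overall “agency” only advances once every easier domain clears some threshold.

```python
# Toy model, entirely made-up numbers: each domain's task horizon doubles on a
# fixed clock, but the doubling time grows with how much agency the domain
# demands, and overall "agency" is bottlenecked by the first domain that still
# falls over (i.e. every easier domain must clear the threshold).

THRESHOLD = 8.0  # task horizon (hours, say) at which a domain counts as "handled"

def horizon(agency_level: int, months: float, base: float = 0.1,
            base_doubling: float = 2.0) -> float:
    """Task horizon in a domain whose doubling time scales with its agency requirement."""
    doubling_time = base_doubling * agency_level  # more agency -> slower doubling
    return base * 2 ** (months / doubling_time)

def bottlenecked_agency(months: float, max_level: int = 100) -> int:
    """Highest agency level such that every domain at or below it clears the threshold."""
    level = 0
    for a in range(1, max_level + 1):
        if horizon(a, months) >= THRESHOLD:
            level = a
        else:
            break  # the first task you totally fall over on
    return level

for months in (12, 24, 48, 96, 192):
    print(months, bottlenecked_agency(months))
# Every individual domain improves exponentially, but the bottlenecked agency
# level only grows roughly linearly in time -- subexponential in the sense above.
```

Under those (made-up) assumptions, each domain looks great on its own trend line while the bottlenecked agency level crawls along roughly linearly.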
None of this directly gets at AI for AI research. Maybe LLMs will have lots of useful original insights while they are still unable to run a vending machine business. But… I think this type of reasoning: “there exists a positive feedback loop → singularity” is pretty loose to say the least. LLMs may significantly speed up AI research and this may turn out to just compensate for the death of Moore’s law. It’s hard to say. It depends how good at research you expect an LLM to get without needing the skills to run a vending machine business. Personally, I weakly suspect that serious research leans on agency to some degree, and is eventually bottlenecked by agency.
To be explicit, I want to replace the argument “LLMs don’t seem to be good at original thinking” with “There are a priori reasons to doubt that LLMs will succeed at original thinking. Also, they are clearly lagging significantly at agency. Plausibly, this implies that they in fact lack some of the core skills needed for serious original thinking. Also, LLMs still do not seem to be doing much original thinking (I would argue still nothing on the level of a research contribution, though admittedly there are now some edge cases), so this hypothesis has at least not been disconfirmed.” To me, that seems like a pretty strong reason not to be confident about short timelines.
I see people increasingly arguing that agency failures are actually alignment failures. This could be right, but it also could be cope. In fact I am confused about the actual distinction—an LLM with no long-term motivational system lacks both agency and alignment. If it were a pure alignment failure, we would expect LLMs to do agentic-looking stuff, just not what we wanted. Maybe you can view some of their (possibly misnamed) reward hacking behavior that way, on coding tasks. Or you know, possibly they just can’t code that well or delude themselves and so they cheat (they don’t seem to perform sophisticated exploits unless researchers bait them into it?). But Pokemon and NetHack and the vending machine? Maybe they just don’t want to win. But they also don’t seem to be doing much instrumental power seeking, so it doesn’t really seem like they WANT anything.
Anyway, this is my crux. If we start to see competent agentic behavior I will buy into the short timelines view at 75%+.
One other objection I want to head off: Yes, there must be some brain-like algorithm which is far more sample-efficient and agentic than LLMs (though it’s possible that sufficiently large trained and post-trained LLMs eventually are just as good, which is kind of the issue at dispute here). That brain-like algorithm has not been discovered and I see no reason to expect it to be discovered in the next 5 years unless LLMs have already foomed. So I do not consider this particularly relevant to the discussion about confidence in very short timelines.
Also, worth stating explicitly that I agree with both interlocutors that we should pause AGI development now out of reasonable caution, which I consider highly overdetermined.
Anyway, this is my crux. If we start to see competent agentic behavior I will buy into the short timelines view at 75%+.
Seems good to flesh out what you mean by this if it’s such a big crux. Ideally, you’d be able to flesh this out in such a way that bad vision (a key problem for games like pokemon) and poor motivation/adversarial-robustness (a key problem for vending claude because it would sort of knowingly make bad financial decisions) aren’t highlighted.
Would this count as competent agentic behavior?
The AI often successfully completes messy software engineering tasks which require 1 week of work for a skilled human and which require checking back in with the person who specified the task to resolve ambiguities. The way the AI completes these tasks involves doing a bunch of debugging and iteration (though perhaps less than a human would do).
Yes, if time horizons on realistic SWE tasks pass 8-16 hours that would change my mind—I have already offered to bet the AI 2027 team cash on this (not taken up) and you can provide me liquidity on the various existing Manifold markets (not going to dig up the specific ones) which I very occasionally trade on.
Adversarial robustness is part of agency, so I don’t agree with that aspect of your framing.
Maybe so, but it isn’t clearly required for automating AI R&D!
I think that it is. I keep meaning to write up my thoughts on this issue.
I believe adversarial robustness is a core agency skill because reasoning can defeat itself; you have to be unable to fool yourself. You can’t be fooled by the processes you spin off, figuratively or literally. You can’t be fooled by other people’s bad but convincing ideas either.
This is related to an observation I’ve made that exotic counterexamples are likely to show up in wrong proofs, not because they are typical, but because mathematicians will tend to construct unusual situations while seeking to misuse true results to prove a false result.
A weaker position is that even if adversarial robustness isn’t itself necessary for agency, an egregious failure to be adversarially robust seems awfully likely to indicate that something deeper is missing or broken.
IMO, the type of adversarial robustness you’re discussing is sufficiently different from what people typically mean by adversarial robustness that it would be worth tabooing the word. (E.g., I might say “robust self-verification is required”.)
I guess that’s true.
The way I model this situation is tied to my analysis of joint AIXI, which treats the action bits as adversarial because the distribution is not realizable.
So, there are actually a few different concepts here which my mental models link in a non-transparent way.
(I’ve noticed that when people say things like I just said, it seems to be fairly common that their model is just conflating things and they’re wrong. I don’t think that applies to me, but it’s worth a minor update on the outside view)
I see people increasingly arguing that agency failures are actually alignment failures. This could be right, but it also could be cope. In fact I am confused about the actual distinction
Reading this made me think that the framing “Everything is alignment-constrained, nothing is capabilities-constrained.” is a rathering and that a more natural/joint-carving framing is:
To the extent that you can get capabilities by your own means (rather than hoping for reality to give you access to a new pool of some resource or whatever), you get them by getting various things to align so that they produce those capabilities.
Or, in other words, all capabilities stem from “getting things to ‘align’ with each other in the right way”.
Is this a problematic equivocation of the term “alignment”? The term “alignment” is polysemous and thus quite equivocable anyway, but if we narrow down on what I consider the most sensible explication of the relevant submeaning, i.e., Tsvi’s “make a mind that is highly capable, and whose ultimate effects are determined by the judgement of human operators”, then (modulo whether you want to apply the term “alignment” to the LLMs which is downstream from other modulos: modulo “highly capable” (and modulo “mind”) and modulo the question of whether there is a sufficient continuity or inferential connection between the LLMs you’re talking about here and the possible future omnicide-capable AI or whatever[1]) I think the framing mostly works.
I still feel like there’s something wrong or left unsaid in this framing. Perhaps it’s that the tails of the alignment-capabilities distinction (to the extent that you want to use it at all) come apart as you move from the coarse-grained realm of clear distinction between “thing can do bad thing X but won’t and that ‘won’t’ is quite robust” to the finer-grained realm of blurry “thing can’t do X but for reasons that are too messy to concisely describe in terms of capabilities and alignment”.
[1] These are plausibly very non-trivial modulos … but modulo that non-triviality too.
Not sure what you mean by agency, but I probably disagree with you here. I don’t think agency is that strong an indicator of “this is going to kill us within 5 years”, and conversely I don’t think the lack of agency implies “this won’t kill us within 5 years”.
In these sorts of cases, I probably qualitatively agree with Abram’s point about performance / elicitation / “alignment”. In other words, I expect training with RL (broadly) to pick up some medium-hanging fruit that’s pretty easily available given what gippities can already do / quasi-understand.
Concretely, I wouldn’t be very surprised by FSD working soon, other robotics things working, some jobs on the level of “manage some vending machines” being replaced, some customer relationship management jobs being replaced, etc.
For comparison, good old fashioned chess playing programs defeated human chess players last millennium by searching through action-paths superhumanly. That’s already enough agency to be very scary.
I think that agency at chess is not the same as agency in the real world. That is why we have superhuman chess bots, and not superhuman autonomous drones.
(I don’t expect this to be convincing. I agree that we disagree. I have not seen strong evidence that agency failures will be easily overcome with better elicitation)
I tend to think the correct lesson from Claude Plays Pokemon is “it’s impressive that it does as well as it does, because it hasn’t been trained to do things like this at all!”.
Same with the vending machine example.
Presumably, with all the hype around “agentic”, tasks like this (beyond just “agentic” coding) will be added to the RL pipeline soon. Then, we will get to see what the capabilities are like when agency gets explicitly trained.
(Crux: I’m wrong if Claude 4 already has tasks like this in the RL.)
I think that agency at chess is not the same as agency in the real world. That is why we have superhuman chess bots, and not superhuman autonomous drones.
Very roughly speaking, the bottleneck here is world-models. Game tree search can probably work on real-world problems to the extent that NNs can provide good world-models for these problems. Of course, we haven’t seen large-scale tests of this sort of architecture yet (Claude Plays Pokemon is even less a test of how well this sort of thing works; reasoning models are not doing MCTS internally).
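To be concrete about the kind of thing I mean, here is a toy sketch (the class names, numbers, and random placeholders are mine, not a claim about what any lab actually runs): game-tree search where both the transition function and the leaf evaluation come from a learned world-model.

```python
# Toy sketch of "game tree search on top of a learned world-model" (MuZero-ish
# in spirit). WorldModel is a stand-in for whatever NN you trust to predict
# "what happens if I take this action here"; the point above is that search
# helps exactly as far as that model is good. Rewards are ignored for brevity.
import math
import random
from dataclasses import dataclass, field

class WorldModel:
    def predict(self, state, action):
        """Placeholder transition model: a real system would query a trained NN."""
        return state + (action,), 0.0
    def value(self, state):
        """Placeholder value head: a real system would use a learned evaluator."""
        return random.random()

@dataclass
class Node:
    state: tuple
    visits: int = 0
    total_value: float = 0.0
    children: dict = field(default_factory=dict)

def mcts(root_state, actions, model, n_simulations=200, max_depth=5, c=1.4):
    root = Node(root_state)
    for _ in range(n_simulations):
        node, path = root, [root]
        for _ in range(max_depth):
            untried = [a for a in actions if a not in node.children]
            if untried:  # expansion: add one new child, then stop descending
                action = random.choice(untried)
                next_state, _ = model.predict(node.state, action)
                node.children[action] = Node(next_state)
                path.append(node.children[action])
                break
            # selection: UCB over existing children
            action = max(node.children, key=lambda a: (
                node.children[a].total_value / (node.children[a].visits + 1e-9)
                + c * math.sqrt(math.log(node.visits + 1) / (node.children[a].visits + 1e-9))))
            node = node.children[action]
            path.append(node)
        leaf_value = model.value(path[-1].state)  # evaluate the leaf with the value head
        for n in path:  # backup
            n.visits += 1
            n.total_value += leaf_value
    return max(root.children, key=lambda a: root.children[a].visits)

print(mcts(root_state=(), actions=["a", "b", "c"], model=WorldModel()))
```

With random placeholders the search is of course useless; the whole question is how far a trained network can stand in for `WorldModel.predict` and `WorldModel.value` on messy real-world problems.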
I suppose that I don’t know exactly what kind of agentic tasks LLMs are currently being trained on…. But people have been talking about LLM agents for years, and I’d be shocked if the frontier labs weren’t trying? Like, if that worked out of the box, we would know by now (?). Do you disagree?
It seems like for your point to make sense, you have to be arguing that LLMs haven’t been trained on such agentic tasks at all—not just that they perhaps weren’t trained on Pokémon specifically. They’re supposed to be general agents—we should be evaluating them on such things as untrained tasks! And like, complete transcripts of Twitch streams of Pokémon playthroughs probably are in the training data, so this is even pretty in-distribution. Their performance is NOT particularly impressive compared to what I would have expected chatting with them in 2022 or so when it seemed like they had pretty decent common sense. I would have expected Pokémon to be solved 3 years later. The apparent competence was to some degree an illusion—that or they really just can’t be motivated to do stuff yet.
And I worry that these two memes—AGI is near, and alignment is not solved—are kind of propping each other up here. If capabilities seem to lag, it’s because alignment isn’t solved and the LLMs don’t care about the task. If alignment seems to be solved, it’s because LLMs aren’t competent enough to take the sharp left turn, but they will be soon. I’m not talking about you specifically, but about the memetic environment on LessWrong.
Unrelated but: How do you know reasoning models are not doing MCTS internally? I’m not sure I really agree with that regardless of what you mean by “internally”. ToT is arguably a mutated and horribly heuristic type of guided MCTS. And I don’t know if something MCTS-like is happening inside the LLMs.
But people have been talking about LLM agents for years, and I’d be shocked if the frontier labs weren’t trying? Like, if that worked out of the box, we would know by now (?).
Agentic (tool-using) RLVR only started working in late 2024, with o3 the first proper tool-using reasoning LLM prototype. From how it all looks (rickety and failing in weird ways), it’ll take another pretraining scale-up to get enough redundant reliability for some noise to fall away, and thus to get a better look at the implied capabilities. Also the development of environments for agentic RLVR only seems to be starting to ramp this year, and GB200 NVL72s that are significantly more efficient for RLVR on large models are only now starting to get online in large quantities.
So I expect that only 2026 LLMs trained with agentic RLVR will give a first reasonable glimpse of what this method gets us, the shape of its limitations, and only in 2027 we’ll get a picture overdetermined by essential capabilities of the method rather than by contingent early-days issues. (In the worlds where it ends up below AGI in 2027, and also where nothing else works too well before that.)
So in other words, everything has to go “right” for AGI by 2027? Maybe it will work. I’m only arguing against high confidence in short timelines. Anything could happen.
I’m responding to the point about LLM agents being a thing for years, and that therefore some level of maturity should be expected from them. I think this isn’t quite right, as the current method is new, the older methods didn’t work out, and it’s too early to tell that the new method won’t work out.
So I’m discussing when it’ll be time to tell that it won’t work out either (unless it does), at which point it’ll be possible to have some sense as to why. Which is not yet, probably in 2026, and certainly by 2027. I’m not really arguing about the probability that it does work out.
You are consistent about this kind of reasoning, but a lot of others seem to expect everything to happen really fast (before 2030) while also dismissing anything that doesn’t work as not having been tried because there haven’t been enough years for research.
Numbers? What does “high confidence” mean here? IIRC from our non-text discussions, Tsvi considers anything above 1% by end-of-year 2030 to be “high confidence in short timelines” of the sort he would have something to say about. (But not the level of strong disagreement he’s expressing in our written dialogue until something like 5-10% iirc.) What numbers would you “only argue against”?
Say what now?? Did I write that somewhere? That would be a typo or possibly a thinko. My own repeatedly stated probabilities would be around 1% or .5%! E.g. in https://www.lesswrong.com/posts/sTDfraZab47KiRMmT/views-on-when-agi-comes-and-on-strategy-to-reduce
I recall it as part of our (unrecorded) conversation, but I could be misremembering. Given your reaction I think I was probably misremembering. Sorry for the error!
So, to be clear, what is the probability someone else could state such that you would have “something to say about it” (i.e., some kind of argument against it)? Your own probability being 0.5%–1% isn’t inconsistent with what I said (if you’d have something to say about any probability above your own), but where would you actually put that cutoff? 5%? 10%?
If someone says 10% by 2030, we disagree, but it would be hard to find something to talk about purely on that basis. (Of course, they could have other more specific beliefs that I could argue with.) If they say, IDK, 25% or something (IDK, obviously not a sharp cutoff by any means, why would there be?), then I start feeling like we ought to be able to find a disagreement just by investigating what makes us say such different probabilities. Also I start feeling like they have strategically bad probabilities (I mean, their beliefs that are incorrect according to me would have practical implications that I think are mistaken actions). (On second thought, probably even 10% has strategically bad implications, assuming that implies 20% by 2035 or similar.)
High confidence means at least 75%.
Short timelines means, say, less than 10 years, though at this point I think the very short timeline picture means “around 2030”.
I don’t know how anyone could reasonably refer to 1% confidence as high.
Well, overconfident/underconfident is always only meaningful relative to some baseline, so if you strongly think (say) 0.001% is the right level of confidence, then 1% is high relative to that.
The various numbers I’ve stated during this debate are 60%, 50%, and 30%, so none of them are high by your meaning. Does that really mean you aren’t arguing against my positions? (This was not my previous impression.)
I think 60% by 2030 is too high, and I am arguing against numbers like that. There’s some ambiguity about drawing the lines because high numbers on very short timelines are of course strictly less plausible than high numbers on merely short timelines, so there isn’t necessarily one best number to compare.
On reflection, I don’t like the phrase “high confidence” for <50% and preferably not even for <75%. Something like “high credence” seems more appropriate—though one can certainly have higher or lower confidence, it is not clear communication to say you are highly confident of something which you believe at little better than even odds. Even if you were buying a lottery ticket with the special knowledge that you had picked one of three possible winning numbers, you still wouldn’t say you were highly confident that ticket would win—even though we would no longer be confident of losing!
Anyway, I haven’t necessarily been consistent / explicit about this throughout the conversation.
So I expect that only 2026 LLMs trained with agentic RLVR will give a first reasonable glimpse of what this method gets us, the shape of its limitations, and only in 2027 we’ll get a picture overdetermined by essential capabilities of the method
I’m at least 50% sure that this timeline would happen ~2x faster.
Conditional on training for agency yielding positive results, the rest would be overdetermined by EoY 2025 / early 2026.
Otherwise, 2026 will be a slog and the 2027 picture wouldn’t arrive in time (i.e. longer timelines).
I suppose that I don’t know exactly what kind of agentic tasks LLMs are currently being trained on…. But people have been talking about LLM agents for years, and I’d be shocked if the frontier labs weren’t trying? Like, if that worked out of the box, we would know by now (?). Do you disagree?
I don’t think LLMs have been particularly trained on what I’d consider the obvious things to really focus on agency-qua-agency in the sense we care about here. (I do think they’ve been laying down scaffolding and doing the preliminary versions of the obvious-things-you’d-do-first-in-particular)
Several months ago I had dinner with a GDM employee whose team was working on RL to make LLMs play games. I would be very surprised if this hasn’t been going on for well over a year already.
In terms of public releases, reasoning models are less than a year old. The way these things work, I suspect, is that there are a lot of smaller, less expensive experiments going on at any given time, which generally take time to make it into the next big training run. These projects take some time to propose and develop, and the number of such experiments going on at a frontier lab at a given time is (very roughly) the number of research engineers (i.e. talent-constrained; you can’t try every idea). Big training runs take several months, with roughly one happening at a time.
“Agentic” wasn’t a big buzzword until very recently. Google Trends shows an obvious exponential-ish trend which starts very small, in the middle of last year, but doesn’t get significant until the beginning of this year, and explodes out from there.
Thinking about all this, I think things seem just about on the fence. I suspect the first few reasoning models didn’t have game-playing in their RL at all, because the emphasis was on getting “reasoning” to work. A proactive lab could have put game-playing into the RL for the next iteration. A reactive lab could have only gotten serious about it this year.
The scale also matters a lot. Data-hunger means that they’ll throw anything they have into the next training run so long as it saw some success in smaller-scale experiments, and maybe even if not. However, the first round of game-playing training environments could end up having a negligible effect on the final product due to not having a ton of training cases yet. But by the second round, if not the first, they should have scraped together a big collection of cases to train on.
There’s also the question of how good the RL algorithms are. I haven’t looked into it very much and also most of the top labs keep details quite private anyway, but my impression is that the RL algorithms used so far have been quite bad (not ‘real RL’—just assigning equal credit to all tokens in a chain-of-thought). This will presumably get better (e.g., they’ll figure out how to use some MCTS variant if they haven’t already). This is extremely significant for long-horizon tasks, because the RL algorithms right now (I’m guessing) have to be able to solicit at least one successful sample in order to get a good training gradient in that direction; long tasks will be stuck in failed runs if there’s not any planning-like component.
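To illustrate the distinction I’m gesturing at, here is a caricature (my guess at the shape of the thing, not anyone’s actual training code; all names and numbers are made up):

```python
# Caricature of the credit-assignment distinction. "Outcome-only" credit gives
# every token in a sampled chain-of-thought the same advantage; finer-grained
# credit would differentiate tokens, e.g. via a learned per-token value baseline.
import torch

def outcome_only_loss(logprobs: torch.Tensor, reward: float, baseline: float) -> torch.Tensor:
    """REINFORCE-style update with one scalar advantage shared by all tokens."""
    advantage = reward - baseline          # a single number for the whole rollout
    return -(advantage * logprobs).sum()   # every token gets pushed equally

def per_token_loss(logprobs: torch.Tensor, rewards: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
    """Placeholder for finer-grained credit assignment: per-token advantages."""
    advantages = rewards - values          # one number per token
    return -(advantages * logprobs).sum()

logprobs = torch.randn(16, requires_grad=True)   # stand-in for token log-probs of one rollout
print(outcome_only_loss(logprobs, reward=1.0, baseline=0.3))
print(per_token_loss(logprobs, torch.linspace(0.0, 1.0, 16), torch.zeros(16)))
# The practical upshot for long horizons: with outcome-only credit you need at
# least one fully successful rollout to get any useful gradient at all, whereas
# per-step credit (or an explicit planning component) can reward partial progress.
```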
In any case, yeah, I think if we haven’t seen proper game-playing training in frontier models yet, we should see it very soon. If LLMs are still “below child level at this task” end-of-year then this will be a significant update towards longer timelines for me. (Pokemon doesn’t count anymore, though, because now there’s been significant scaffolding-tuning for that case, and because a lab could specifically train on pokemon due to the focus on that case.)
Also: I suspect there’s already been a lot of explicit agency training in the context of programming. (Maybe not very long time-horizon stuff, though.)
It’s different, yeah—for example, in that doing interesting things in the real world requires originary concept creation. But to do merely “agentic” things doesn’t necessarily require that. IDK what you meant by agency if not “finding paths through causality to drive some state into a small sector of statespace”; I was trying to give a superhuman example of that.