I think you are somewhat overly fixated on my claim that “maybe the AIs will accelerate the labor input to AI R&D by 10x via basically just being fast and cheap junior employees”. My original claim (in the subcomment) is “I think it could suffice to do a bunch of relatively more banal things extremely fast and cheap”. The “could” part is important. Correspondingly, I think this is only part of the space of possibilities, though I do think this is a pretty plausible route. Additionally, banal does not imply simple/easy, and some level of labor quality will be needed.
(I did propose junior employees as an analogy which maybe implied simple/easy. I didn’t really intend this implication. I think the AIs have to be able to do at least somewhat hard tasks, but maybe don’t need to have a ton of context or have much taste if they can compensate with other advantages.)
I’ll argue against your comment, but first, I’d like to lay out a bunch of background to make sure we’re on the same page and to give a better understanding to people reading through.
Frontier LLM progress has historically been driven by 3 factors:
Increased spending on training runs ($)
Hardware progress (compute / $)
Algorithmic progress (intelligence / compute)
(The split seems to be very roughly 2⁄5, 1⁄5, 2⁄5 respectively.)
If we zoom into algorithmic progress, there are two relevant inputs to the production function:
Compute (for experiments)
Labor (from human researchers and engineers)
A reasonably common view is that compute is a very key bottleneck such that even if you greatly improved labor, algorithmic progress wouldn’t go much faster. This seems plausible to me (though somewhat unlikely), but this isn’t what I was arguing about. I was trying to argue (among other things) that scaling up basically current methods could result in an increase in productivity among OpenAI capabilities researchers at least equivalent to the productivity you’d get as if the human employees operated 10x faster. (In other words, 10x’ing this labor input.)
Now, I’ll try to respond to your claims.
My current model is that ML experiments are bottlenecked not on software-engineer hours, but on compute.
Maybe, but that isn’t exactly a crux in this discussion as noted above. The relevant question is whether the important labor going into ML experiments is more “insights” or “engineering” (not whether both of these are bottlenecked on compute).
What actually matters for ML-style progress is picking the correct trick, and then applying it to a big-enough model.
My sense is that engineering is most of the labor, and most people I talk to with relevant experience have a view like: “taste is somewhat important, but lots of people have that and fast execution is roughly as important or more important”. Notably, AI companies really want to hire fast and good engineers and seem to care comparably about this as about more traditional research scientist jobs.
One relevant response would be “sure, AI companies want to hire good engineers, but weren’t we talking about the AIs being bad engineers who run fast?”
I think the AI engineers probably have to be quite good at moderate horizon software engineering, but also that scaling up current approaches can pretty likely achieve this. Possibly my “junior hire” analogy was problematic as “junior hire” can mean not as good at programming in addition to “not as much context at this company, but good at the general skills”.
So 10x’ing the number of small-scale experiments is unlikely to actually 10x ML research, along any promising research direction.
I wasn’t saying that these AIs would mostly be 10x’ing the number of small-scale experiments, though I do think that increasing the number and serial speed of experiments is an important part of the picture.
There are lots of other things that engineers do (e.g., increase the efficiency of experiments so they use less compute, make it much easier to run experiments, etc.).
Indeed, an additional disadvantage of AI-based researchers/engineers is that their forward passes would cut into that limited compute budget. Offloading the computations associated with software engineering and experiment oversight onto the brains of mid-level human engineers is potentially more cost-efficient.
Sure, but we have to be quantitative here. As a rough (and somewhat conservative) estimate, if I were to manage 50 copies of 3.5 Sonnet who are running 1⁄4 of the time (due to waiting for experiments, etc), that would cost roughly 50 copies * 70 tok / s * 1 / 4 uptime * 60 * 60 * 24 * 365 sec / year * (15 / 1,000,000) $ / tok = $400,000. This cost is comparable to salaries at current compute prices and probably much less than how much AI companies would be willing to pay for top employees. (And note this is after API markups etc. I’m not including input prices for simplicity, but input is much cheaper than output and it’s just a messy BOTEC anyway.)
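(Spelling the same BOTEC out as a quick sanity check; all inputs below are the figures assumed above, nothing new is added.)

```python
# Rough re-derivation of the ~$400k/year figure above.
copies = 50
tokens_per_second = 70                   # output speed per copy
uptime = 1 / 4                           # fraction of time actually generating
seconds_per_year = 60 * 60 * 24 * 365
price_per_output_token = 15 / 1_000_000  # $15 per million output tokens (API price)

annual_cost = (copies * tokens_per_second * uptime
               * seconds_per_year * price_per_output_token)
print(f"${annual_cost:,.0f} per year")   # ~$414,000, i.e. roughly $400k
```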
Yes, this compute comes directly at the cost of experiments, but so do employee salaries at current margins. (Maybe this will be less true in the future.)
At the point when AIs are first capable of doing the relevant tasks, they will likely be pretty expensive to run, but I expect costs to drop pretty quickly. And AI companies will have far more compute in the future, as compute is increasing at a rapid rate, making the plausible number of instances substantially higher.
Is there a reason to think that any need for that couldn’t already be satisfied? If it were an actual bottleneck, I would expect it to have already been solved: by the AGI labs just hiring tons of competent-ish software engineers.
I think AI companies would be very happy to hire lots of software engineers who work for nearly free, run 10x faster, work 24⁄7, and are pretty good research engineers. This seems especially true if you add other structural advantages of AI into the mix (train once and use many times, fewer personnel issues, easy to scale up and down, etc). The serial speed is very important.
(The bar of “competent-ish” seems too low. Again, I think “junior” might have been leading you astray here, sorry about that. Imagine more like median AI company engineering hire or a bit better than this. My original comment said “automating research engineering”.)
LLM-based coding tools seem competent enough to significantly speed up a human programmer’s work on formulaic tasks. So any sufficiently simple software-engineering task should already be done at lightning speeds within AGI labs.
I’m not sure I buy this claim about current tools. Also, I wasn’t making a claim about AIs just doing simple tasks (banal does not mean simple) as discussed earlier.
Stepping back from engineering vs insights, my sense is that it isn’t clear that the AIs will be terrible at insights or broader context. So, I think it will probably be more like they are very fast engineers and ok at experimental direction. Being ok helps a bunch by avoiding the need for human intervention at many points.
Maybe a relevant crux is: “Could scaling up current methods yield AIs that can mostly autonomously automate software engineering tasks that are currently being done by engineers at AI companies?” (More precisely, succeed at these tasks very reliably with only a small amount of human advice/help amortized over all tasks. Probably this would partially work by having humans or AIs decompose into relatively smaller subtasks that require a bit less context, though this isn’t notably different from how humans do things themselves.)
But, I think you maybe also have a further crux like: “Does making software engineering at AI companies cheap and extremely fast greatly accelerate the labor input to AI R&D?”
Yup, those two do seem to be the cruxes here.
I was trying to argue (among other things) that scaling up basically current methods could result in an increase in productivity among OpenAI capabilities researchers at least equivalent to the productivity you’d get as if the human employees operated 10x faster
You’re right, that’s a meaningfully different claim and I should’ve noticed the difference.
I think I would disagree with it as well. Suppose we break up this labor into, say,
“Banal” software engineering.
Medium-difficult systems design and algorithmic improvements (finding optimizations, etc.).
Coming up with new ideas regarding how AI capabilities can be progressed.
High-level decisions regarding architectures, research avenues and strategies, etc. (Not just inventing transformers/the scaling hypothesis/the idea of RL-on-CoT, but picking those approaches out of a sea of ideas, and making the correct decision to commit hard to them.)
In turn, the factors relevant to (4) are:
(a) The serial thinking of the senior researchers and the communication/exchange of ideas between them.
(Where “the senior researchers” are defined as “the people with the power to make strategic research decisions at a given company”.)
(b) The outputs of significant experiments decided on by the senior researchers.
(c) The pool of untested-at-large-scale ideas presented to the senior researchers.
Importantly, in this model, speeding up (1), (2), (3) can only speed up (4) by increasing the turnover speed of (b) and the quality of (c). And I expect that non-AGI-complete AI cannot improve the quality of ideas (3) and cannot directly speed up/replace (a)[1], meaning any acceleration from it can only come from accelerating the engineering and the optimization of significant experiments.
Which, I expect, are in fact mostly bottlenecked by compute, and 10x’ing the human-labor productivity there doesn’t 10x the overall productivity of the human-labor input; it remains stubbornly held up by (a). (I do buy that it can significantly speed it up, say 2x it. But not 10x it.)
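(As a toy way of quantifying the “2x but not 10x” claim: under a simple Amdahl-style split, which is my framing rather than anything stated above, an un-accelerated serial share a and a 10x-accelerated remainder give an overall speedup of 1 / (a + (1 - a)/10).)

```python
# Toy Amdahl-style model (illustrative fractions only): a share `a` of progress
# is held up by the serial thinking in (a) and is not accelerated at all,
# while everything else is sped up 10x.
def overall_speedup(a: float, rest_speedup: float = 10.0) -> float:
    return 1 / (a + (1 - a) / rest_speedup)

print(overall_speedup(0.45))  # ~2.0x: "2x it, but not 10x it" corresponds to a ~45% serial share
print(overall_speedup(0.10))  # ~5.3x: if (a) is only ~10% of the work, the gain is much larger
```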
Separately, I’m also skeptical that near-term AI can speed up the nontrivial engineering involved in medium-difficult systems design and the management of significant experiments:
Stepping back from engineering vs insights, my sense is that it isn’t clear that the AIs will be terrible at insights or broader context. So, I think it will probably be more like they are very fast engineers and ok at experimental direction. Being ok helps a bunch by avoiding the need for human intervention at many points.
It seems to me that AIs have remained stubbornly terrible at this from GPT-3 to GPT-4 to Sonnet 3.5.1 to o1[2]; that the improvement on this hard-to-specify quality has been ~0. I guess we’ll see if o3 (or an o-series model based on the next-generation base model) changes that. AI does feel right on the cusp of getting good at this...
… just as it felt at the time of GPT-3.5, and GPT-4, and Sonnet 3.5.1, and o1. That just the slightest improvement along this axis would allow us to plug the outputs of AI cognition into its inputs and get a competent, autonomous AI agent.
And yet here we are, still.
It’s puzzling to me and I don’t quite understand why it wouldn’t work, but based on the previous track record, I do in fact expect it not to work.
In other words: If an AI is able to improve the quality of ideas and/or reliably pluck out the best ideas from a sea of them, I expect that’s AGI and we can throw out all human cognitive labor entirely.
[2] Arguably, no improvement since GPT-2; I think that post aged really well.
It seems to me that AIs have remained stubbornly terrible at this from GPT-3 to GPT-4 to Sonnet 3.5.1 to o1[2]; that the improvement on this hard-to-specify quality has been ~0. I guess we’ll see if o3 (or an o-series model based on the next-generation base model) changes that. AI does feel right on the cusp of getting good at this...
… just as it felt at the time of GPT-3.5, and GPT-4, and Sonnet 3.5.1, and o1. That just the slightest improvement along this axis would allow us to plug the outputs of AI cognition into its inputs and get a competent, autonomous AI agent.
Boy do I disagree with this take! Excited to discuss.
Can you say more about what skills you think the GPT series has shown ~0 improvement on?
Because if it’s “competent, autonomous agency” then there has been massive progress over the last two years and over the last few months in particular. METR has basically spent dozens of FTE-years specifically trying to measure progress in autonomous agency capability, both with formal benchmarks and with lots of high-surface-area interaction with models (they have people building scaffolds to make the AIs into agents and do various tasks etc.) And METR seems to think that progress has been rapid and indeed faster than they expected.
Has there been enough progress to automate swathes of jobs? No, of course not—see the benchmarks. E.g. RE-bench shows that even the best public models like o1 and the new Sonnet are only as good as professional coders on time horizons of, like, an hour or so (give or take; it depends on how you measure, the task, etc.). Which means that if you give them the sort of task that would take a normal employee, like, three hours, they are worse than a competent human professional. Specifically, they’d burn lots of tokens and compute and push lots of buggy code and overall make a mess of things, just like an eager but incompetent employee.
And I’d say the models are unusually good at these coding tasks compared to other kinds of useful professional tasks, because the companies have been trying harder to train them to code and it’s inherently easier to train due to faster feedback loops etc.
Can you say more about what skills you think the GPT series has shown ~0 improvement on?
Alright, let’s try this. But this is going to be vague.
Here’s a cluster of things that SotA AIs seem stubbornly bad at:
Innovation. LLMs are perfectly able to understand an innovative idea if it’s described to them, even if it’s a new idea that was produced after their knowledge-cutoff date. Yet, there hasn’t been a single LLM-originating innovation, and all attempts to design “AI scientists” have produced useless slop. They seem to have terrible “research taste”, even though they should be able to learn this implicit skill from the training data.
Reliability. Humans are very reliable agents, and SotA AIs aren’t, even when e. g. put into wrappers that encourage them to sanity-check their work. The gap in reliability seems qualitative, rather than just quantitative.
Solving non-templated problems. There seems to be a bimodal distribution of a sort, where some people report LLMs producing excellent code/math, and others report that they fail basic tasks.
Compounding returns on problem-solving time. As the graph you provided shows, humans’ performance scales dramatically with the time they spend on the problem, whereas AIs’ – even o1’s – doesn’t.
My sense is that LLMs are missing some sort of “self-steering” “true autonomy” quality; the quality that allows humans to:
Stare at the actual problem they’re solving, and build its highly detailed model in a “bottom-up” manner. Instead, LLMs go “top-down”: they retrieve the closest-match template problem from a vast database, fill-in some details, and solve that problem.
(Non-templatedness/fluid intelligence.)
Iteratively improve their model of a problem over the course of problem-solving, and do sophisticated course-correction if they realize their strategy isn’t working or if they’re solving the wrong problem. Humans can “snap out of it” if they realize they’re messing up, instead of just doing what they’re doing on inertia.
(Reliability.)
Recognize when their model of a given problem represents a nontrivially new “template” that can be memorized and applied in a variety of other situations, and what these situations might be.
(Innovation.)
My model is that all LLM progress so far has involved making LLMs better at the “top-down” thing. They end up with increasingly bigger databases of template problems, the closest-match templates end up ever-closer to the actual problems they’re facing, their ability to fill-in the details becomes ever-richer, etc. This improves their zero-shot skills, and test-time compute scaling allows them to “feel out” the problem’s shape over an extended period and find an ever-more-detailed top-down fit.
But it’s still fundamentally not what humans do. Humans are able to instantiate a completely new abstract model of a problem – even if it’s initially based on a stored template – and chisel at it until it matches the actual problem near-perfectly. This allows them to be much more reliable; this allows them to keep themselves on-track; this allows them to find “genuinely new” innovations.
The two methods do ultimately converge to the same end result: in the limit of a sufficiently expressive template-database, LLMs would be able to attain the same level of reliability/problem-representation-accuracy as humans. But the top-down method of approaching this limit seems ruinously computationally inefficient; perhaps so inefficient it saturates around GPT-4’s capability level.[1]
LLMs are sleep-walking. We can make their dreams ever-closer to reality, and that makes the illusion that they’re awake ever-stronger. But they’re not, and the current approaches may not be able to wake them up at all.
(As an abstract analogy: imagine that you need to color the space bounded by some 2D curve. In one case, you can take a pencil and do it directly. In another case, you have a collection of cutouts of geometric figures, and you have to fill the area by assembling a collage. If you have a sufficiently rich collection of figures, you can come arbitrarily close; but the “bottom-up” approach is strictly better. In particular, it can handle arbitrarily complicated shapes out-of-the-box, whereas the second approach would require dramatically bigger collections the more complicated the shapes get.)
Edit: Or so my current “bearish on LLMs” model goes. The performance of o3 or GPT-5/6 can very much break it, and the actual mechanisms described are necessarily speculative and tentative.
Under this toy model, it needn’t have saturated around this level; it could’ve comfortably overshot human capabilities. But this doesn’t seem to be what’s happening, likely due to some limitation of the current paradigm not covered by this model.
Thanks! Time will tell who is right. Point by point reply:
You list four things AIs seem stubbornly bad at: 1. Innovation. 2. Reliability. 3. Solving non-templated problems. 4. Compounding returns on problem-solving-time.
First of all, 2 and 4 seem closely related to me. I would say: “Agency skills” are the skills key to being an effective agent, i.e. skills useful for operating autonomously for long periods in pursuit of goals. Noticing when you are stuck is a simple example of an agency skill. Planning is another simple example. In-context learning is another example. I would say that current AIs lack agency skills, and that 2 and 4 are just special cases of this. I would also venture to guess with less confidence that 1 and 3 might be because of this as well—perhaps the reason AIs haven’t made any truly novel innovations yet is that doing so takes intellectual work, work they can’t do because they can’t operate autonomously for long periods in pursuit of goals. (Note that reasoning models like o1 are a big leap in the direction of being able to do this!) And perhaps the reason behind the relatively poor performance on non-templated tasks is… wait actually no, that one has a very easy separate explanation, which is that they’ve been trained less on those tasks. A human, too, is better at stuff they’ve done a lot.
Secondly, and more importantly, I don’t think we can say there has been ~0 progress on these dimensions in the last few years, whether you conceive of them in your way or my way. Progress is in general s-curvy; adoption curves are s-curvy. Suppose for example that GPT-2 was 4 SDs worse than the average human at innovation, reliability, etc., GPT-3 was 3 SDs worse, GPT-4 was 2 SDs worse, and o1 is 1 SD worse. Under this supposition, the world would look the way that it looks today—Thane would notice zero novel innovations from AIs, Thane would have friends who try to use o1 for coding and find that it’s not useful without templates, etc. Meanwhile, as I’m sure you are aware, pretty much every benchmark anyone has ever made has shown rapid progress in the last few years—including benchmarks made by METR, which was specifically trying to measure AI R&D ability and agency abilities, and which genuinely do seem to require (small) amounts of agency. So I think the balance of evidence is in favor of progress on the dimensions you are talking about—it just hasn’t reached human level yet, or at any rate not the level at which you’d notice big exciting changes in the world. (Analogous to: Suppose we’ve measured COVID in some countries but not others, and found that in every country we’ve measured, COVID has spread to about 0.001%–0.01% of the population and is growing exponentially. If we live in a country that hasn’t measured yet, we should assume COVID is spreading even though we don’t know anyone personally who is sick yet.)
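(A toy calculation of how steady progress of this kind could still look like zero visible innovations from the outside. The normal-distribution framing and the +3 SD “innovation bar” are assumptions chosen purely for illustration; the per-model SD gaps are the ones supposed above.)

```python
from statistics import NormalDist

norm = NormalDist()
innovation_bar = 3.0  # hypothetical: a "visible novel innovation" requires a +3 SD output

for model, ability in [("GPT-2", -4), ("GPT-3", -3), ("GPT-4", -2), ("o1", -1)]:
    # chance that a single attempt clears the bar, if attempt quality ~ N(ability, 1)
    p = 1 - norm.cdf(innovation_bar - ability)
    print(f"{model}: {ability:+d} SD -> P(visible innovation per attempt) ~ {p:.0e}")

# Three full SDs of steady progress, yet every line still rounds to ~zero visible
# innovations -- consistent with "the world would look the way it looks today".
```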
...
You say:
My model is that all LLM progress so far has involved making LLMs better at the “top-down” thing. They end up with increasingly bigger databases of template problems, the closest-match templates end up ever-closer to the actual problems they’re facing, their ability to fill-in the details becomes ever-richer, etc. This improves their zero-shot skills, and test-time compute scaling allows them to “feel out” the problem’s shape over an extended period and find an ever-more-detailed top-down fit.
But it’s still fundamentally not what humans do. Humans are able to instantiate a completely new abstract model of a problem – even if it’s initially based on a stored template – and chisel at it until it matches the actual problem near-perfectly. This allows them to be much more reliable; this allows them to keep themselves on-track; this allows them to find “genuinely new” innovations.
Top down vs. bottom-up seem like two different ways of solving intellectual problems. Do you think it’s a sharp binary distinction? Or do you think it’s a spectrum? If the latter, what makes you think o1 isn’t farther along the spectrum than GPT3? If the former—if it’s a sharp binary—can you say what it is about LLM architecture and/or training methods that renders them incapable of thinking in the bottom-up way? (Like, naively it seems like o1 can do sophisticated reasoning. Moreover, it seems like it was trained in a way that would incentivize it to learn skills useful for solving math problems, and ‘bottom-up reasoning’ seems like a skill that would be useful. Why wouldn’t it learn it?)
Can you describe an intellectual or practical feat, or ideally a problem set, such that if AI solves it in 2025 you’ll update significantly towards my position?
I would also venture to guess with less confidence that 1 and 3 might be because of this as well
Agreed, I do expect that the performance on all of those is mediated by the same variable(s); that’s why I called them a “cluster”.
benchmarks made by METR who was specifically trying to measure AI R&D ability and agency abilities, and which genuinely do seem to require (small) amounts of agency
I think “agency” is a bit of an overly abstract/confusing term to use, here. In particular, I think it also allows both a “top-down” and a “bottom-up” approach.
Humans have “bottom-up” agency: they’re engaging in fluid-intelligence problem-solving and end up “drawing” a decision-making pattern of a specific shape. An LLM, on this model, has a database of templates for such decision-making patterns, and it retrieves the best-fit agency template for whatever problem it’s facing. o1/RL-on-CoTs is a way to deliberately target the set of agency-templates an LLM has, extending it. But it doesn’t change the ultimate nature of what’s happening.
In particular: the bottom-up approach would allow an agent to stay on-target for an arbitrarily long time, creating an arbitrarily precise fit for whatever problem it’s facing. An LLM’s ability to stay on-target, however, would always remain limited by the length and the expressiveness of the templates that were trained into it.
RL on CoTs is a great way to further mask the problem, which is why the o-series seems to make unusual progress on agency-measuring benchmarks. But it’s still just masking it.
can you say what it is about LLM architecture and/or training methods that renders them incapable of thinking in the bottom-up way?
Not sure. I think it might be some combination of “the pretraining phase moves the model deep into the local-minimum abyss of top-down cognition, and the cheaper post-training phase can never hope to get it out of there” and “the LLM architecture sucks, actually”. But I would rather not get into the specifics.
Can you describe an intellectual or practical feat, or ideally a problem set, such that if AI solves it in 2025 you’ll update significantly towards my position?
“Inventing a new field of science” would do it, as far as more-or-less legible measures go. Anything less than that is too easily “fakeable” using top-down reasoning.
That said, I may make this update based on less legible vibes-based evidence, such as if o3’s advice on real-life problems seems to be unusually lucid and creative. (I’m tracking the possibility that LLMs are steadily growing in general capability and that they simply haven’t yet reached the level that impresses me personally. But on balance, I mostly don’t expect this possibility to be realized.)
“Inventing a new field of science” would do it, as far as more-or-less legible measures go. Anything less than that is too easily “fakeable” using top-down reasoning.
Seems unlikely we’ll see this before stuff gets seriously crazy on anyone’s views. (Has any new field of science been invented in the last 5 years by humans? I’m not sure what you’d count.)
It seems like we should at least update towards AIs being very useful for accelerating AI R&D if we very clearly see AI R&D greatly accelerate and it is using tons of AI labor. (And this was the initial top level prompt for this thread.) We could say something similar about other types of research.
Seems unlikely we’ll see this before stuff gets seriously crazy on anyone’s views. (Has any new field of science been invented in the last 5 years? I’m not sure what you’d count.)
Maybe some minor science fields, but yeah entirely new science fields in 5 years is deep into ASI territory, assuming it’s something like a hard science like physics.
(I’m tracking the possibility that LLMs are steadily growing in general capability and that they simply haven’t yet reached the level that impresses me personally. But on balance, I mostly don’t expect this possibility to be realized.)
That possibility is what I believe. I wish we had something to bet on better than “inventing a new field of science,” because by the time we observe that, there probably won’t be much time left to do anything about it. What about e.g. “I, Daniel Kokotajlo, am able to use AI agents basically as substitutes for human engineer/programmer employees. I, as a non-coder, can chat with them and describe ML experiments I want them to run or websites I want them to build etc., and they’ll make it happen at least as quickly and well as a competent professional would.” (And not just for simple websites, but for the kind of experiments I’d want to run, which aren’t the most complicated but aren’t that different from things actual AI company engineers would be doing.)
What about “The model is seemingly as good at solving math problems and puzzles as Thane is, not just on average across many problems but on pretty much any specific problem, including on novel ones that are unfamiliar to both of you”?
Humans have “bottom-up” agency: they’re engaging in fluid-intelligence problem-solving and end up “drawing” a decision-making pattern of a specific shape. An LLM, on this model, has a database of templates for such decision-making patterns, and it retrieves the best-fit agency template for whatever problem it’s facing. o1/RL-on-CoTs is a way to deliberately target the set of agency-templates an LLM has, extending it. But it doesn’t change the ultimate nature of what’s happening.
In particular: the bottom-up approach would allow an agent to stay on-target for an arbitrarily long time, creating an arbitrarily precise fit for whatever problem it’s facing. An LLM’s ability to stay on-target, however, would always remain limited by the length and the expressiveness of the templates that were trained into it.
Miscellaneous thoughts: I don’t yet buy that this distinction between top-down and bottom-up is binary, and insofar as it’s a spectrum, I’d be willing to bet that there’s been progress along it in recent years. Moreover, I’m not even convinced that this distinction matters much for generalization radius / general intelligence, and it’s even less likely to matter for ‘ability to 5x AI R&D’, which is the milestone I’m trying to predict first. Moreover, I don’t think humans stay on-target for an arbitrarily long time.
I wish we had something to bet on better than “inventing a new field of science,”
I’ve thought of one potential observable that is concrete, should be relatively low-capability, and should provoke a strong update towards your model for me:
If there is an AI model such that the complexity of R&D problems it can solve (1) scales basically boundlessly with the amount of serial compute provided to it (or to a “research fleet” based on it), (2) scales much faster with serial compute than with parallel compute, and (3) the required amount of human attention (“babysitting”) is constant or grows very slowly with the amount of serial compute.
This attempts to directly get at the “autonomous self-correction” and “ability to think about R&D problems strategically” ideas.
I’ve not fully thought through all possible ways reality could Goodhart to this benchmark, i. e. “technically” pass it but in a way I find unconvincing. For example, if I failed to include the condition (2), o3 would have probably already “passed” it (since it potentially achieved better performance on ARC-AGI and FrontierMath by sampling thousands of CoTs then outputting the most frequent answer). There might be other loopholes like this...
But it currently seems reasonable and True-Name-y to me.
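(A minimal sketch of how conditions (1)-(3) might be turned into a concrete check. The data format, thresholds, and function and parameter names below are all my assumptions for illustration, not part of the proposal itself.)

```python
from typing import Dict, Tuple

def passes_serial_scaling_test(
    solved: Dict[Tuple[int, int], float],  # (serial budget, parallel budget) -> hardest tier solved
    babysitting: Dict[int, float],         # serial budget -> human attention needed (hours)
    serial_advantage: float = 3.0,         # returns to serial must beat parallel by this factor
    babysitting_cap: float = 1.5,          # allowed growth in human attention
) -> bool:
    serial = sorted({s for s, _ in solved})
    parallel = sorted({p for _, p in solved})
    # Performance as serial compute grows (at the smallest parallel budget), and vice versa.
    serial_curve = [solved[(s, parallel[0])] for s in serial]
    parallel_curve = [solved[(serial[0], p)] for p in parallel]

    keeps_scaling = all(b > a for a, b in zip(serial_curve, serial_curve[1:]))  # condition (1)
    serial_beats_parallel = (serial_curve[-1] - serial_curve[0]) > (
        serial_advantage * max(parallel_curve[-1] - parallel_curve[0], 0.0))    # condition (2)
    attention_flat = babysitting[serial[-1]] <= babysitting_cap * babysitting[serial[0]]  # condition (3)
    return keeps_scaling and serial_beats_parallel and attention_flat
```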
What about “Daniel Kokotajlo can feed it his docs about some prosaic ML alignment agenda (e.g. the faithful CoT stuff) and then it can autonomously go off and implement the agenda and come back to him with a writeup of the results and takeaways. While working on this, it gets to check in with Daniel once a day for a brief 20-minute chat conversation.”
Does that seem to you like it’ll come earlier, or later, than the milestone you describe?
Prooobably ~simultaneously, but I can maybe see it coming earlier and in a way that isn’t wholly convincing to me. In particular, it would still be a fixed-length task, much longer than what contemporary models can reliably manage today, but still hackable using poorly-generalizing “agency templates” instead of fully general “compact generators of agenty behavior” (which I speculate humans have and RL’d LLMs don’t). It would be some evidence in favor of “AI can accelerate AI R&D”, but not necessarily “LLMs trained via SSL+RL are AGI-complete”.
Actually, I can also see it coming later. For example, suppose that the capability researchers invent some method for reliably-and-indefinitely extending the amount of serial computation a reasoning model can productively make use of, but the compute or memory requirements grow very fast with the length of a CoT. Some fairly solid empirical evidence and theoretical arguments in favor of boundless scaling could appear quickly, well before the algorithms are made efficient enough to (1) handle weeks-long CoTs and/or (2) allow wide adoption (thus making it available to you).
I think the second scenario is more plausible, actually.
OK. Next question: Suppose that next year we get a nice result showing that there is a model with serial inference-time scaling across e.g. MATH + FrontierMath + IMO problems. Recall that FrontierMath and IMO are subdivided into different difficulty levels; suppose that this model can be given e.g. 10 tokens of CoT, 100, 1000, 10,000, etc. and then somewhere around the billion-serial-token-level it starts solving a decent chunk of the “medium” FrontierMath problems (but not all) and at the million-serial-token level it was only solving MATH + some easy IMO problems.
Not for math benchmarks. Here’s one way it can “cheat” at them: suppose that the CoT would involve the model generating candidate proofs/derivations, then running an internal (learned, not hard-coded) proof verifier on them, and either rejecting the candidate proof and trying to generate a new one, or outputting it. We know that this is possible, since we know that proof verifiers can be compactly specified.
This wouldn’t actually show “agency” and strategic thinking of the kinds that might generalize to open-ended domains and “true” long-horizon tasks. In particular, this would mostly fail the condition (2) from my previous comment.
Something more open-ended and requiring “research taste” would be needed. Maybe a comparable performance on METR’s benchmark would work for this (i. e., the model can beat a significantly larger fraction of it at 1 billion tokens compared to 1 million)? Or some other benchmark that comes closer to evaluating real-world performance.
Edit: Oh, math-benchmark performance would convince me if we get access to a CoT sample and it shows that the model doesn’t follow the above “cheating” approach, but instead approaches the problem strategically (in some sense). (Which would also require this CoT not to be hopelessly steganographied, obviously.)
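(For concreteness, a minimal sketch of the control flow of the “cheating” generate-and-verify strategy described above; the function names are hypothetical and this is not anyone’s actual system.)

```python
from typing import Callable, Optional

def generate_and_verify(propose: Callable[[], str],
                        verify: Callable[[str], bool],
                        max_attempts: int = 10_000) -> Optional[str]:
    """Blind search: keep sampling candidate proofs/derivations until an internal
    verifier accepts one. This soaks up arbitrary amounts of serial compute without
    any of the strategic, open-ended thinking being asked about above."""
    for _ in range(max_attempts):
        candidate = propose()   # sample a candidate proof/derivation
        if verify(candidate):   # learned (or hard-coded) proof checker
            return candidate
    return None
```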
Have you looked at samples of CoT of o1, o3, deepseek, etc. solving hard math problems? I feel like a few examples have been shown & they seem to involve qualitative thinking, not just brute-force-proof-search (though of course they show lots of failed attempts and backtracking—just like a human thought-chain would).
Anyhow, this is nice, because I do expect that probably something like this milestone will be reached before AGI (though I’m not sure)
Have you looked at samples of CoT of o1, o3, deepseek, etc. solving hard math problems?
Certainly (experimenting with r1’s CoTs right now, in fact). I agree that they’re not doing the brute-force stuff I mentioned; that was just me outlining a scenario in which a system “technically” clears the bar you’d outlined, yet I end up unmoved (I don’t want to end up goalpost-moving).
Though neither are they being “strategic” in the way I expect they’d need to be in order to productively use a billion-token CoT.
Anyhow, this is nice, because I do expect that probably something like this milestone will be reached before AGI
Yeah, I’m also glad to finally have something concrete-ish to watch out for. Thanks for prompting me!
It seems to me that AIs have remained stubbornly terrible at this from GPT-3 to GPT-4 to Sonnet 3.5.1 to o1[2]; that the improvement on this hard-to-specify quality has been ~0.
Huh, I disagree reasonably strongly with this. Possible that something along these lines is an empirically testable crux.
I expect this is the sort of thing that can be disproven (if LLM-based AI agents actually do start displacing nontrivial swathes of e. g. non-entry-level SWE workers in 2025-2026), but only “proven” gradually (if “AI agents start displacing nontrivial swathes of some highly skilled cognitive-worker demographic” continually fails to happen year after year after year).
Overall, operationalizing bets/empirical tests about this has remained a cursed problem.
Edit:
As a potentially relevant factor: Were you ever surprised by how unbalanced the progress and the adoption have been? The unexpected mixes of capabilities and incapabilities that AI models have displayed?
My current model is centered on trying to explain this surprising mix (top-tier/superhuman benchmark performance vs. frequent falling-flat-on-its-face real-world performance). My current guess is basically that all capabilities progress has been effectively goodharting on legible performance (benchmarks and their equivalents) while doing ~0 improvement on everything else. Whatever it is benchmarks and benchmark-like metrics are measuring, it’s not what we think it is.
So what we will always observe is AI getting better and better at any neat empirical test we can devise, always seeming on the cusp of being transformative, while continually and inexplicably failing to tilt over into actually being transformative. (The actual performance of o3 and GPT-5/6 would be a decisive test of this model for me.)
top-tier/superhuman benchmark performance vs. frequent falling-flat-on-its-face real-world performance
Models are just recently getting to the point where they can complete 2-hour tasks 50% of the time in METR’s tasks (at least without scaffolding that uses much more inference compute).
This isn’t yet top tier performance, so I don’t see the implication. The key claim is that progress here is very fast.
So, I don’t currently feel that strongly that there is a huge benchmark vs real performance gap in at least autonomous SWE-ish tasks? (There might be in math and I agree that if you just looked at math and exam question benchmarks and compared to humans, the models seem much smarter than they are.)
Something interesting here is that part of why AI companies won’t want to use agents is that their capabilities are good enough that being very reckless with them might actually cause small-scale misalignment issues. If that’s truly a big part of the problem in getting companies to adopt AI agents, this is good news for our future.
FWIW my vibe is closer to Thane’s. Yesterday I commented that this discussion has been raising some topics that seem worthy of a systematic writeup as fodder for further discussion. I think here we’ve hit on another such topic: enumerating important dimensions of AI capability – such as generation of deep insights, or taking broader context into account – and then kicking off a discussion of the past trajectory / expected future progress on each dimension.
Some benchmarks got saturated across this range, so we can imagine “anti-saturated” benchmarks that didn’t yet noticeably move from zero, operationalizing intuitions of lack of progress. Performance on such benchmarks still has room to change significantly even with pretraining scaling in the near future, from 1e26 FLOPs of currently deployed models to 5e28 FLOPs by 2028, 500x more.
Sure, but we have to be quantitative here. As a rough (and somewhat conservative) estimate, if I were to manage 50 copies of 3.5 Sonnet who are running 1⁄4 of the time (due to waiting for experiments, etc), that would cost roughly 50 copies * 70 tok / s * 1 / 4 uptime * 60 * 60 * 24 * 365 sec / year * (15 / 1,000,000) $ / tok = $400,000. This cost is comparable to salaries at current compute prices and probably much less than how much AI companies would be willing to pay for top employees. (And note this is after API markups etc. I’m not including input prices for simplicity, but input is much cheaper than output and it’s just a messy BOTEC anyway.)
If you were to spend equal amounts of money on LLM inference and GPUs, that would mean that you’re spending $400,000 / year on GPUs. Divide that 50 ways and each Sonnet instance gets an $8,000 / year compute budget. Over the 18 hours per day that Sonnet is waiting for experiments, that is an average of $1.22 / hour, which is almost exactly the hourly cost of renting a single H100 on Vast.
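(The same division, written out; all figures are the ones assumed in this comment.)

```python
gpu_spend_per_year = 400_000   # $/year on GPUs, mirroring the assumed inference spend
copies = 50
idle_hours_per_day = 18        # the 3/4 of each day spent waiting on experiments

budget_per_copy = gpu_spend_per_year / copies          # $8,000 per copy per year
hourly = budget_per_copy / (365 * idle_hours_per_day)  # ~$1.22 per idle hour
print(f"${budget_per_copy:,.0f} per copy per year, ${hourly:.2f} per idle hour")
```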
So I guess the crux is “would a swarm of unreliable researchers with one good GPU apiece be more effective at AI research than a few top researchers who can monopolize X0,000 GPUs for months, per unit of GPU time spent”.
(and yes, at some point the question switches to “would an AI researcher that is better at AI research than the best humans make better use of GPUs than the best humans” but at that point it’s a matter of quality, not quantity)
Sure, but I think that at the relevant point, you’ll probably be spending at least 5x more on experiments than on inference, and potentially a much larger ratio if heavy test-time compute usage isn’t important. I was just trying to argue that the naive inference cost isn’t that crazy.
Notably, if you give each researcher 2k GPUs, that would be $2 / GPU hour * 2,000 GPUs * 24 * 365 hours / year = $35,040,000 per year, which is much higher than the inference cost of the models!
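(The comparison, written out; the $2/GPU-hour price and the 2,000-GPU allocation are the assumptions above, and the $400k figure is the earlier inference BOTEC.)

```python
gpu_hour_price = 2.0           # $/GPU-hour
gpus_per_researcher = 2_000    # "2k GPUs", assumed running year-round
hours_per_year = 24 * 365

experiment_budget = gpu_hour_price * gpus_per_researcher * hours_per_year
inference_cost = 400_000       # the 50-copies-of-Sonnet estimate from earlier
print(f"${experiment_budget:,.0f} per researcher-year, "
      f"~{experiment_budget / inference_cost:.0f}x the inference cost")  # $35,040,000, ~88x
```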
I think I misunderstood what you were saying there—I interpreted it as something like
Currently, ML-capable software developers are quite expensive relative to the cost of compute. Additionally, many small experiments provide more novel and useful insights than a few large experiments. The top practically-useful LLM costs about 1% as much per hour to run as a ML-capable software developer, and that 100x decrease in cost and the corresponding switch to many small-scale experiments would likely result in at least a 10x increase in the speed at which novel, useful insights were generated.
But on closer reading I see you said (emphasis mine)
I was trying to argue (among other things) that scaling up basically current methods could result in an increase in productivity among OpenAI capabilities researchers at least equivalent to the productivity you’d get as if the human employees operated 10x faster. (In other words, 10x’ing this labor input.)
So if the employees spend 50% of their time waiting on training runs which are bottlenecked on company-wide availability of compute resources, and 50% of their time writing code, 10xing their labor input (i.e. the speed at which they write code) would result in about an 80% increase in their labor output. Which, to your point, does seem plausible.
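(A quick check of the ~80% figure; the 50/50 split and the 10x factor are taken directly from the paragraph above.)

```python
waiting, coding = 0.5, 0.5    # fractions of an employee's time
coding_speedup = 10
new_time = waiting + coding / coding_speedup             # 0.55 of the original time per unit of work
print(f"labor output increase: {1 / new_time - 1:.0%}")  # ~82%, i.e. "about an 80% increase"
```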
Yes. Though notably, if your employees were 10x faster you might want to adjust your workflows to have them spend less time being bottlenecked on compute if that is possible. (And this sort of adaption is included in what I mean.)
Yeah, agreed—the allocation of compute per human would likely become even more skewed if AI agents (or any other tooling improvements) allow your very top people to get more value out of compute than the marginal researcher currently gets.
And notably this shifting of resources from marginal to top researchers wouldn’t require achieving “true AGI” if most of the time your top researchers spend isn’t spent on “true AGI”-complete tasks.
I think you are somewhat overly fixated on my claim that “maybe the AIs will accelerate the labor input R&D by 10x via basically just being fast and cheap junior employees”. My original claim (in the subcomment) is “I think it could suffice to do a bunch of relatively more banal things extremely fast and cheap”. The “could” part is important. Correspondingly, I think this is only part of the possibilities, though I do think this is a pretty plausible route. Additionally, banal does not imply simple/easy and some level of labor quality will be needed.
(I did propose junior employees as an analogy which maybe implied simple/easy. I didn’t really intend this implication. I think the AIs have to be able to do at least somewhat hard tasks, but maybe don’t need to have a ton of context or have much taste if they can compensate with other advantages.)
I’ll argue against your comment, but first, I’d like to lay out a bunch of background to make sure we’re on the same page and to give a better understanding to people reading through.
Frontier LLM progress has historically been driven by 3 factors:
Increased spending on training runs ($)
Hardware progress (compute / $)
Algorithmic progress (intelligence / compute)
(The split seems to be very roughly 2⁄5, 1⁄5, 2⁄5 respectively.)
If we zoom into algorithmic progress, there are two relevant inputs to the production function:
Compute (for experiments)
Labor (from human researchers and engineers)
A reasonably common view is that compute is a very key bottleneck such that even if you greatly improved labor, algorithmic progress wouldn’t go much faster. This seems plausible to me (though somewhat unlikely), but this isn’t what I was arguing about. I was trying to argue (among other things) that scaling up basically current methods could result in an increase in productivity among OpenAI capabilities researchers at least equivalent to the productivity you’d get as if the human employees operated 10x faster. (In other words, 10x’ing this labor input.)
Now, I’ll try to respond to your claims.
Maybe, but that isn’t exactly a crux in this discussion as noted above. The relevant question is whether the important labor going into ML experiments is more “insights” or “engineering” (not whether both of these are bottlenecked on compute).
My sense is that engineering is most of the labor, and most people I talk to with relevant experience have a view like: “taste is somewhat important, but lots of people have that and fast execution is roughly as important or more important”. Notably, AI companies really want to hire fast and good engineers and seem to care comparably about this as about more traditional research scientist jobs.
One relevant response would be “sure, AI companies want to hire good engineers, but weren’t we talking about the AIs being bad engineers who run fast?”
I think the AI engineers probably have to be quite good at moderate horizon software engineering, but also that scaling up current approaches can pretty likely achieve this. Possibly my “junior hire” analogy was problematic as “junior hire” can mean not as good at programming in addition to “not as much context at this company, but good at the general skills”.
I wasn’t saying that these AIs would mostly be 10x’ing the number of small-scale experiments, though I do think that increasing the number and serial speed of experiments is an important part of the picture.
There are lots of other things that engineers do (e.g., increase the efficiency of experiments so they use less compute, make it much easier to run experiments, etc.).
Sure, but we have to be quantitative here. As a rough (and somewhat conservative) estimate, if I were to manage 50 copies of 3.5 Sonnet who are running 1⁄4 of the time (due to waiting for experiments, etc), that would cost roughly 50 copies * 70 tok / s * 1 / 4 uptime * 60 * 60 * 24 * 365 sec / year * (15 / 1,000,000) $ / tok = $400,000. This cost is comparable to salaries at current compute prices and probably much less than how much AI companies would be willing to pay for top employees. (And note this is after API markups etc. I’m not including input prices for simplicity, but input is much cheaper than output and it’s just a messy BOTEC anyway.)
Yes, this compute comes directly at the cost of experiments, but so do employee salaries at current margins. (Maybe this will be less true in the future.)
At the point when AIs are first capable of doing the relevant tasks, it seems likely it is pretty expensive, but I expect costs to drop pretty quickly. And, AI companies will have far more compute in the future as this increases at a rapid rate, making the plausible number of instances substantially higher.
I think AI companies would be very happy to hire lots of software engineers who work for nearly free, run 10x faster, work 24⁄7, and are pretty good research engineers. This seems especially true if you add other structural advantages of AI into the mix (train once and use many times, fewer personnel issues, easy to scale up and down, etc). The serial speed is very important.
(The bar of “competent-ish” seems too low. Again, I think “junior” might have been leading you astray here, sorry about that. Imagine more like median AI company engineering hire or a bit better than this. My original comment said “automating research engineering”.)
I’m not sure I buy this claim about current tools. Also, I wasn’t making a claim about AIs just doing simple tasks (banal does not mean simple) as discussed earlier.
Stepping back from engineering vs insights, my sense is that it isn’t clear that the AIs will be terrible at insights or broader context. So, I think it will probably be more like they are very fast engineers and ok at experimental direction. Being ok helps a bunch by avoiding the need for human intervention at many points.
Maybe a relevant crux is: “Could scaling up current methods yield AIs that can mostly autonomously automate software engineering tasks that are currently being done by engineers at AI companies?” (More precisely, succeed at these tasks very reliably with only a small amount of human advice/help amortized over all tasks. Probably this would partially work by having humans or AIs decompose into relatively smaller subtasks that require a bit less context, though this isn’t notably different from how humans do things themselves.)
But, I think you maybe also have a further crux like: “Does making software engineering at AI companies cheap and extremely fast greatly accelerate the labor input to AI R&D?”
Yup, those two do seem to be the cruxes here.
You’re right, that’s a meaningfully different claim and I should’ve noticed the difference.
I think I would disagree with it as well. Suppose we break up this labor into, say,
“Banal” software engineering.
Medium-difficult systems design and algorithmic improvements (finding optimizations, etc.).
Coming up with new ideas regarding how AI capabilities can be progressed.
High-level decisions regarding architectures, research avenues and strategies, etc. (Not just inventing transformers/the scaling hypothesis/the idea of RL-on-CoT, but picking those approaches out of a sea of ideas, and making the correct decision to commit hard to them.)
In turn, the factors relevant to (4) are:
(a) The serial thinking of the senior researchers and the communication/exchange of ideas between them.
(Where “the senior researchers” are defined as “the people with the power to make strategic research decisions at a given company”.)
(b) The outputs of significant experiments decided on by the senior researchers.
(c) The pool of untested-at-large-scale ideas presented to the senior researchers.
Importantly, in this model, speeding up (1), (2), (3) can only speed up (4) by increasing the turnover speed of (b) and the quality of (c). And I expect that non-AGI-complete AI cannot improve the quality of ideas (3) and cannot directly speed up/replace (a)[1], meaning any acceleration from it can only come from accelerating the engineering and the optimization of significant experiments.
Which, I expect, are in fact mostly bottlenecked by compute, and 10x’ing the human-labor productivity there doesn’t 10x the overall productivity of the human-labor input; it remains stubbornly held up by (a). (I do buy that it can significantly speed it up, say 2x it. But not 10x it.)
Separately, I’m also skeptical that near-term AI can speed up the nontrivial engineering involved in medium-difficult systems design and the management of significant experiments:
It seems to me that AIs have remained stubbornly terrible at this from GPT-3 to GPT-4 to Sonnet 3.5.1 to o1[2]; that the improvement on this hard-to-specify quality has been ~0. I guess we’ll see if o3 (or an o-series model based on the next-generation base model) change that. AI does feel right on the cusp of getting good at this...
… just as it felt at the time of GPT-3.5, and GPT-4, and Sonnet 3.5.1, and o1. That just the slightest improvement along this axis would allow us to plug the outputs of AI cognition into its inputs and get a competent, autonomous AI agent.
And yet here we are, still.
It’s puzzling to me and I don’t quite understand why it wouldn’t work, but based on the previous track record, I do in fact expect it not to work.
In other words: If an AI is able to improve the quality of ideas and/or reliably pluck out the best ideas from a sea of them, I expect that’s AGI and we can throw out all human cognitive labor entirely.
Arguably, no improvement since GPT-2; I think that post aged really well.
Boy do I disagree with this take! Excited to discuss.
Can you say more about what skills you think the GPT series has shown ~0 improvement on?
Because if it’s “competent, autonomous agency” then there has been massive progress over the last two years and over the last few months in particular. METR has basically spent dozens of FTE-years specifically trying to measure progress in autonomous agency capability, both with formal benchmarks and with lots of high-surface-area interaction with models (they have people building scaffolds to make the AIs into agents and do various tasks etc.) And METR seems to think that progress has been rapid and indeed faster than they expected.
Has there been enough progress to automate swathes of jobs? No, of course not—see the benchmarks. E.g. RE-bench shows that even the best public models like o1 and newsonnet are only as good as professional coders on time horizons of, like, an hour or so. (give or take, depends on how you measure, the task, etc.) Which means that if you give them the sort of task that would take a normal employee, like, three hours, they are worse than a competent human professional. Specifically they’d burn lots of tokens and compute and push lots of buggy code and overall make a mess of things, just like an eager but incompetent employee.
And I’d say the models are unusually good at these coding tasks compared to other kinds of useful professional tasks, because the companies have been trying harder to train them to code and it’s inherently easier to train due to faster feedback loops etc.
Alright, let’s try this. But this is going to be vague.
Here’s a cluster of things that SotA AIs seem stubbornly bad at:
Innovation. LLMs are perfectly able to understand an innovative idea if it’s described to them, even if it’s a new idea that was produced after their knowledge-cutoff date. Yet, there hasn’t been a single LLM-originating innovation, and all attempts to design “AI scientists” have produced useless slop. They seem to have terrible “research taste”, even though they should be able to learn this implicit skill from the training data.
Reliability. Humans are very reliable agents, and SotA AIs aren’t, even when e. g. put into wrappers that encourage them to sanity-check their work. The gap in reliability seems qualitative, rather than just quantitative.
Solving non-templated problems. There seems to be a bimodal distribution of a sort, where some people report LLMs producing excellent code/math, and others report that they fail basic tasks.
Compounding returns on problem-solving time. As the graph you provided shows, humans’ performance scales dramatically with the time they spent on the problem, whereas AIs’ – even o1′s – doesn’t.
My sense is that LLMs are missing some sort of “self-steering” “true autonomy” quality; the quality that allows humans to:
Stare at the actual problem they’re solving, and build its highly detailed model in a “bottom-up” manner. Instead, LLMs go “top-down”: they retrieve the closest-match template problem from a vast database, fill-in some details, and solve that problem.
(Non-templatedness/fluid intelligence.)
Iteratively improve their model of a problem over the course of problem-solving, and do sophisticated course-correction if they realize their strategy isn’t working or if they’re solving the wrong problem. Humans can “snap out of it” if they realize they’re messing up, instead of just doing what they’re doing on inertia.
(Reliability.)
Recognize when their model of a given problem represents a nontrivially new “template” that can be memorized and applied in a variety of other situations, and what these situations might be.
(Innovation.)
My model is that all LLM progress so far has involved making LLMs better at the “top-down” thing. They end up with increasingly bigger databases of template problems, the closest-match templates end up ever-closer to the actual problems they’re facing, their ability to fill-in the details becomes ever-richer, etc. This improves their zero-shot skills, and test-time compute scaling allows them to “feel out” the problem’s shape over an extended period and find an ever-more-detailed top-down fit.
But it’s still fundamentally not what humans do. Humans are able to instantiate a completely new abstract model of a problem – even if it’s initially based on a stored template – and chisel at it until it matches the actual problem near-perfectly. This allows them to be much more reliable; this allows them to keep themselves on-track; this allows them to find “genuinely new” innovations.
The two methods do ultimately converge to the same end result: in the limit of a sufficiently expressive template-database, LLMs would be able to attain the same level of reliability/problem-representation-accuracy as humans. But the top-down method of approaching this limit seems ruinously computationally inefficient; perhaps so inefficient it saturates around GPT-4′s capability level.[1]
LLMs are sleep-walking. We can make their dreams ever-closer to reality, and that makes the illusion that they’re awake ever-stronger. But they’re not, and the current approaches may not be able to wake them up at all.
(As an abstract analogy: imagine that you need to color the space bounded by some 2D curve. In one case, you can take a pencil and do it directly. In another case, you have a collection of cutouts of geometric figures, and you have to fill the area by assembling a collage. If you have a sufficiently rich collection of figures, you can come arbitrarily close; but the “bottom-up” approach (the pencil) is strictly better. In particular, it can handle arbitrarily complicated shapes out-of-the-box, whereas the collage approach would require dramatically bigger collections the more complicated the shapes get.)
Edit: Or so my current “bearish on LLMs” model goes. The performance of o3 or GPT-5/6 can very much break it, and the actual mechanisms described are necessarily speculative and tentative.
[1] Under this toy model, it needn’t have saturated around this level; it could’ve comfortably overshot human capabilities. But this doesn’t seem to be what’s happening, likely due to some limitation of the current paradigm not covered by this model.
Thanks! Time will tell who is right. Point by point reply:
You list four things AIs seem stubbornly bad at: 1. Innovation. 2. Reliability. 3. Solving non-templated problems. 4. Compounding returns on problem-solving-time.
First of all, 2 and 4 seem closely related to me. I would say: “Agency skills” are the skills key to being an effective agent, i.e. skills useful for operating autonomously for long periods in pursuit of goals. Noticing when you are stuck is a simple example of an agency skill. Planning is another simple example. In-context learning is another example. I would say that current AIs lack agency skills, and that 2 and 4 are just special cases of this. I would also venture to guess with less confidence that 1 and 3 might be because of this as well—perhaps the reason AIs haven’t made any truly novel innovations yet is that doing so takes intellectual work, work they can’t do because they can’t operate autonomously for long periods in pursuit of goals. (Note that reasoning models like o1 are a big leap in the direction of being able to do this!) And perhaps the reason behind the relatively poor performance on non-templated tasks is… wait actually no, that one has a very easy separate explanation, which is that they’ve been trained less on those tasks. A human, too, is better at stuff they’ve done a lot.
Secondly, and more importantly, I don’t think we can say there has been ~0 progress on these dimensions in the last few years, whether you conceive of them in your way or my way. Progress is in general s-curvy; adoption curves are s-curvy. Suppose for example that GPT2 was 4 SDs worse than the average human at innovation, reliability, etc., GPT3 was 3 SDs worse, GPT4 was 2 SDs worse, and o1 is 1 SD worse. Under this supposition, the world would look the way that it looks today—Thane would notice zero novel innovations from AIs, Thane would have friends who try to use o1 for coding and find that it’s not useful without templates, etc. Meanwhile, as I’m sure you are aware, pretty much every benchmark anyone has ever made has shown rapid progress in the last few years—including benchmarks made by METR, which was specifically trying to measure AI R&D ability and agency, and which genuinely do seem to require (small) amounts of agency. So I think the balance of evidence is in favor of progress on the dimensions you are talking about—it just hasn’t reached human level yet, or at any rate not the level at which you’d notice big exciting changes in the world. (Analogous to: Suppose we’ve measured COVID in some countries but not others, and found that in every country we’ve measured, COVID has spread to about 0.01%–0.001% of the population and is growing exponentially. If we live in a country that hasn’t measured yet, we should assume COVID is spreading even though we don’t know anyone personally who is sick yet.)
...
You say:
Top down vs. bottom-up seem like two different ways of solving intellectual problems. Do you think it’s a sharp binary distinction? Or do you think it’s a spectrum? If the latter, what makes you think o1 isn’t farther along the spectrum than GPT3? If the former—if it’s a sharp binary—can you say what it is about LLM architecture and/or training methods that renders them incapable of thinking in the bottom-up way? (Like, naively it seems like o1 can do sophisticated reasoning. Moreover, it seems like it was trained in a way that would incentivize it to learn skills useful for solving math problems, and ‘bottom-up reasoning’ seems like a skill that would be useful. Why wouldn’t it learn it?)
Can you describe an intellectual or practical feat, or ideally a problem set, such that if AI solves it in 2025 you’ll update significantly towards my position?
Agreed, I do expect that the performance on all of those is mediated by the same variable(s); that’s why I called them a “cluster”.
I think “agency” is a bit of an overly abstract/confusing term to use, here. In particular, I think it also allows both a “top-down” and a “bottom-up” approach.
Humans have “bottom-up” agency: they’re engaging in fluid-intelligence problem-solving and end up “drawing” a decision-making pattern of a specific shape. An LLM, on this model, has a database of templates for such decision-making patterns, and it retrieves the best-fit agency template for whatever problem it’s facing. o1/RL-on-CoTs is a way to deliberately target the set of agency-templates an LLM has, extending it. But it doesn’t change the ultimate nature of what’s happening.
In particular: the bottom-up approach would allow an agent to stay on-target for an arbitrarily long time, creating an arbitrarily precise fit for whatever problem it’s facing. An LLM’s ability to stay on-target, however, would always remain limited by the length and the expressiveness of the templates that were trained into it.
RL on CoTs is a great way to further mask the problem, which is why the o-series seems to make unusual progress on agency-measuring benchmarks. But it’s still just masking it.
Not sure. I think it might be some combination of “the pretraining phase moves the model deep into the local-minimum abyss of top-down cognition, and the cheaper post-training phase can never hope to get it out of there” and “the LLM architecture sucks, actually”. But I would rather not get into the specifics.
“Inventing a new field of science” would do it, as far as more-or-less legible measures go. Anything less than that is too easily “fakeable” using top-down reasoning.
That said, I may make this update based on less legible vibes-based evidence, such as if o3’s advice on real-life problems seems to be unusually lucid and creative. (I’m tracking the possibility that LLMs are steadily growing in general capability and that they simply haven’t yet reached the level that impresses me personally. But on balance, I mostly don’t expect this possibility to be realized.)
Seems unlikely we’ll see this before stuff gets seriously crazy on anyone’s views. (Has any new field of science been invented in the last 5 years by humans? I’m not sure what you’d count.)
It seems like we should at least update towards AIs being very useful for accelerating AI R&D if we very clearly see AI R&D greatly accelerate and it is using tons of AI labor. (And this was the initial top level prompt for this thread.) We could say something similar about other types of research.
Maybe some minor science fields, but yeah, entirely new science fields in 5 years is deep into ASI territory, assuming it’s a hard science like physics.
Minor would count.
Thanks for the reply.
That possibility is what I believe. I wish we had something to bet on better than “inventing a new field of science,” because by the time we observe that, there probably won’t be much time left to do anything about it. What about e.g. “I, Daniel Kokotajlo, am able to use AI agents basically as substitutes for human engineer/programmer employees. I, as a non-coder, can chat with them and describe ML experiments I want them to run or websites I want them to build etc., and they’ll make it happen at least as quickly and well as a competent professional would.” (And not just for simple websites, but for the kind of experiments I’d want to run, which aren’t the most complicated, but aren’t that different from things actual AI company engineers would be doing.)
What about “The model is seemingly as good at solving math problems and puzzles as Thane is, not just on average across many problems but on pretty much any specific problem, including on novel ones that are unfamiliar to both of you?”
Miscellaneous thoughts: I don’t yet buy that this distinction between top-down and bottom-up is binary, and insofar as it’s a spectrum, I’d be willing to bet that there’s been progress along it in recent years. Moreover, I’m not even convinced that this distinction matters much for generalization radius / general intelligence, and it’s even less likely to matter for ‘ability to 5x AI R&D’, which is the milestone I’m trying to predict first. Finally, I don’t think humans stay on-target for an arbitrarily long time.
I’ve thought of one potential observable that is concrete, should be relatively low-capability, and should provoke a strong update towards your model for me:
If there is an AI model such that the complexity of R&D problems it can solve (1) scales basically boundlessly with the amount of serial compute provided to it (or to a “research fleet” based on it), (2) scales much faster with serial compute than with parallel compute, and (3) the required amount of human attention (“babysitting”) is constant or grows very slowly with the amount of serial compute.
This attempts to directly get at the “autonomous self-correction” and “ability to think about R&D problems strategically” ideas.
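A minimal sketch of how one might operationalize conditions (1) and (2), assuming a hypothetical `solve(problem, serial_tokens, n_rollouts)` interface (not any real API; a single fleet call with a serial-CoT cap, a parallel-rollout cap, and zero human babysitting):

```python
def scaling_profile(problems, solve, serial_budgets, parallel_budgets):
    """Hypothetical harness for the observable above.

    Measures solve rate as a function of serial budget (max CoT tokens in one
    rollout) vs. parallel budget (number of independent rollouts). The
    observable would tentatively be met if solve rate keeps climbing along
    the serial axis (condition 1) and climbs much more slowly along the
    parallel axis (condition 2).
    """
    profile = {}
    for serial in serial_budgets:
        for parallel in parallel_budgets:
            solved = sum(solve(p, serial, parallel) for p in problems)
            profile[(serial, parallel)] = solved / len(problems)
    return profile
```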
I’ve not fully thought through all the ways reality could Goodhart this benchmark, i. e. “technically” pass it in a way I find unconvincing. For example, if I had failed to include condition (2), o3 would probably have already “passed” it (since it potentially achieved better performance on ARC-AGI and FrontierMath by sampling thousands of CoTs and then outputting the most frequent answer). There might be other loopholes like this...
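For concreteness, the loophole that condition (2) is meant to rule out looks roughly like this (a sketch only; `sample_answer` is a hypothetical stand-in for one full CoT rollout that ends in a short final answer):

```python
from collections import Counter

def majority_vote(problem, sample_answer, n_samples=1000):
    """Sample many independent CoTs and return the most frequent final answer.

    This buys benchmark performance with parallel compute: no single rollout
    gets any better at staying on-target, which is why it could "technically"
    pass a scaling test that omits condition (2).
    """
    answers = [sample_answer(problem) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```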
But it currently seems reasonable and True-Name-y to me.
Nice.
What about “Daniel Kokotajlo can feed it his docs about some prosaic ML alignment agenda (e.g. the faithful CoT stuff) and then it can autonomously go off and implement the agenda and come back to him with a writeup of the results and takeaways. While working on this, it gets to check in with Daniel once a day for a brief 20-minute chat conversation.”
Does that seem to you like it’ll come earlier, or later, than the milestone you describe?
Prooobably ~simultaneously, but I can maybe see it coming earlier and in a way that isn’t wholly convincing to me. In particular, it would still be a fixed-length task; much longer-length than what the contemporary models can reliably manage today, but still hackable using poorly-generalizing “agency templates” instead of fully general “compact generators of agenty behavior” (which I speculate humans to have and RL’d LLMs not to). It would be some evidence in favor of “AI can accelerate AI R&D”, but not necessarily “LLMs trained via SSL+RL are AGI-complete”.
Actually, I can also see it coming later. For example, suppose that capability researchers invent some method for reliably and indefinitely extending the amount of serial computation a reasoning model can productively make use of, but the compute or memory requirements grow very fast with the length of a CoT. Fairly solid empirical evidence and theoretical arguments in favor of boundless scaling could then appear quickly, well before the algorithms are made efficient enough to (1) handle weeks-long CoTs and/or (2) allow wide adoption (thus making it available to you).
I think the second scenario is more plausible, actually.
OK. Next question: Suppose that next year we get a nice result showing that there is a model with serial inference-time scaling across e.g. MATH + FrontierMath + IMO problems. Recall that FrontierMath and IMO are subdivided into different difficulty levels; suppose that this model can be given e.g. 10 tokens of CoT, 100, 1000, 10,000, etc., and that somewhere around the billion-serial-token level it starts solving a decent chunk of the “medium” FrontierMath problems (but not all), while at the million-serial-token level it only solves MATH + some easy IMO problems.
Would this count, for you?
Not for math benchmarks. Here’s one way it can “cheat” at them: suppose that the CoT would involve the model generating candidate proofs/derivations, then running an internal (learned, not hard-coded) proof verifier on them, and either rejecting the candidate proof and trying to generate a new one, or outputting it. We know that this is possible, since we know that proof verifiers can be compactly specified.
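Schematically, the “cheating” pattern I have in mind looks something like this (a sketch only; `generate` and `verify` are hypothetical stand-ins for a learned proof generator and a learned verifier running inside one long CoT):

```python
def blind_generate_and_check(problem, generate, verify, max_attempts=10_000):
    """Brute generate-and-verify: the serial token count grows with the number
    of attempts, but no strategic model of the problem is being refined along
    the way, so clearing a serial-scaling bar this way wouldn't demonstrate
    the kind of agency at issue."""
    for _ in range(max_attempts):
        candidate = generate(problem)   # propose a candidate proof/derivation
        if verify(problem, candidate):  # learned (not hard-coded) verifier
            return candidate            # output the first verified candidate
    return None                         # budget exhausted without a proof
```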
This wouldn’t actually show “agency” and strategic thinking of the kinds that might generalize to open-ended domains and “true” long-horizon tasks. In particular, this would mostly fail the condition (2) from my previous comment.
Something more open-ended and requiring “research taste” would be needed. Maybe a comparable performance on METR’s benchmark would work for this (i. e., the model can beat a significantly larger fraction of it at 1 billion tokens compared to 1 million)? Or some other benchmark that comes closer to evaluating real-world performance.
Edit: Oh, math-benchmark performance would convince me if we get access to a CoT sample and it shows that the model doesn’t follow the above “cheating” approach, but instead approaches the problem strategically (in some sense). (Which would also require this CoT not to be hopelessly steganographied, obviously.)
Have you looked at samples of CoT of o1, o3, deepseek, etc. solving hard math problems? I feel like a few examples have been shown & they seem to involve qualitative thinking, not just brute-force-proof-search (though of course they show lots of failed attempts and backtracking—just like a human thought-chain would).
Anyhow, this is nice, because I do expect that something like this milestone will probably be reached before AGI (though I’m not sure).
Certainly (experimenting with r1’s CoTs right now, in fact). I agree that they’re not doing the brute-force stuff I mentioned; that was just me outlining a scenario in which a system “technically” clears the bar you’d set, yet I end up unmoved (I don’t want to end up goalpost-moving).
Though neither are they being “strategic” in the way I expect they’d need to be in order to productively use a billion-token CoT.
Yeah, I’m also glad to finally have something concrete-ish to watch out for. Thanks for prompting me!
Huh, I disagree reasonably strongly with this. Possible that something along these lines is an empirically testable crux.
I expect this is the sort of thing that can be disproven (if LLM-based AI agents actually do start displacing nontrivial swathes of e. g. non-entry-level SWE workers in 2025-2026), but only “proven” gradually (if “AI agents start displacing nontrivial swathes of some highly skilled cognitive-worker demographic” continually fails to happen year after year after year).
Overall, operationalizing bets/empirical tests about this has remained a cursed problem.
Edit:
As a potentially relevant factor: Were you ever surprised by how unbalanced the progress and the adoption have been? The unexpected mixes of capabilities and incapabilities that AI models have displayed?
My current model is centered on trying to explain this surprising mix (top-tier/superhuman benchmark performance vs. frequent falling-flat-on-its-face real-world performance). My current guess is basically that all capabilities progress has been effectively Goodharting on legible performance (benchmarks and their equivalents) while making ~0 improvement on everything else. Whatever it is benchmarks and benchmark-like metrics are measuring, it’s not what we think it is.
So what we will always observe is AI getting better and better at any neat empirical test we can devise, always seeming on the cusp of being transformative, while continually and inexplicably failing to tilt over into actually being transformative. (The actual performance of o3 and GPT-5/6 would be a decisive test of this model for me.)
Models are just recently getting to the point where they can complete 2-hour tasks 50% of the time on METR’s task suite (at least without scaffolding that uses much more inference compute).
This isn’t yet top tier performance, so I don’t see the implication. The key claim is that progress here is very fast.
So, I don’t currently feel that strongly that there is a huge benchmark vs real performance gap in at least autonomous SWE-ish tasks? (There might be in math and I agree that if you just looked at math and exam question benchmarks and compared to humans, the models seem much smarter than they are.)
Something interesting here: part of why AI companies may not want to use agents is that their capabilities are good enough that being very reckless with them might actually cause small-scale misalignment issues. If that’s truly a big part of the problem in getting companies to adopt AI agents, it’s good news for our future:
https://www.lesswrong.com/posts/K2D45BNxnZjdpSX2j/?commentId=qEkRqHtSJfoDA7zJX
FWIW my vibe is closer to Thane’s. Yesterday I commented that this discussion has been raising some topics that seem worthy of a systematic writeup as fodder for further discussion. I think here we’ve hit on another such topic: enumerating important dimensions of AI capability – such as generation of deep insights, or taking broader context into account – and then kicking off a discussion of the past trajectory / expected future progress on each dimension.
Some benchmarks got saturated across this range, so we can imagine “anti-saturated” benchmarks that haven’t yet noticeably moved from zero, operationalizing these intuitions of a lack of progress. Performance on such benchmarks still has room to change significantly even with near-term pretraining scaling alone, from the ~1e26 FLOPs of currently deployed models to ~5e28 FLOPs by 2028, 500x more.
If you were to spend equal amounts of money on LLM inference and GPUs, that would mean that you’re spending $400,000 / year on GPUs. Divide that 50 ways and each Sonnet instance gets an $8,000 / year compute budget. Over the 18 hours per day that Sonnet is waiting for experiments, that is an average of $1.22 / hour, which is almost exactly the hourly cost of renting a single H100 on Vast.
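Spelling that arithmetic out (a quick sanity check using only the figures stated above):

```python
gpu_budget_per_year = 400_000  # $ spent on GPUs, matching the inference spend
instances = 50                 # Sonnet instances sharing that budget
per_instance_budget = gpu_budget_per_year / instances        # $8,000 / year

idle_hours_per_day = 18        # hours/day spent waiting on experiments
idle_hours_per_year = idle_hours_per_day * 365               # 6,570 hours
hourly_compute = per_instance_budget / idle_hours_per_year
print(f"${hourly_compute:.2f} / hour")  # ≈ $1.22, about one rented H100
```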
So I guess the crux is “would a swarm of unreliable researchers with one good GPU apiece be more effective at AI research than a few top researchers who can monopolize X0,000 GPUs for months, per unit of GPU time spent”.
(and yes, at some point the question switches to “would an AI researcher that is better at AI research than the best humans make better use of GPUs than the best humans”, but at that point it’s a matter of quality, not quantity)
Sure, but I think that at the relevant point, you’ll probably be spending at least 5x more on experiments than on inference, and potentially a much larger ratio if heavy test-time compute usage isn’t important. I was just trying to argue that the naive inference cost isn’t that crazy.
Notably, if you give each researcher 2k GPUs, that would be $2 / GPU-hour * 2,000 GPUs * 24 hours * 365 days = $35,040,000 per year, which is much higher than the inference cost of the models!
I think I misunderstood what you were saying there—I interpreted it as something like
But on closer reading I see you said (emphasis mine)
So if the employees spend 50% of their time waiting on training runs which are bottlenecked on company-wide availability of compute resources, and 50% of their time writing code, 10xing their labor input (i.e. the speed at which they write code) would result in about an 80% increase in their labor output. Which, to your point, does seem plausible.
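As a quick check of where the ~80% figure comes from, using only the numbers above:

```python
waiting_frac, coding_frac = 0.5, 0.5  # time waiting on compute vs. writing code
speedup_on_coding = 10                # labor input (coding speed) is 10x'd

new_time = waiting_frac + coding_frac / speedup_on_coding  # 0.5 + 0.05 = 0.55
throughput_gain = 1 / new_time                             # ≈ 1.82
print(f"{(throughput_gain - 1) * 100:.0f}% more output")   # ≈ 82%, i.e. "about 80%"
```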
Yes. Though notably, if your employees were 10x faster, you might want to adjust your workflows to have them spend less time being bottlenecked on compute, if that is possible. (And this sort of adaptation is included in what I mean.)
Yeah, agreed—the allocation of compute per human would likely become even more skewed if AI agents (or any other tooling improvements) allow your very top people to get more value out of compute than the marginal researcher currently gets.
And notably this shifting of resources from marginal to top researchers wouldn’t require achieving “true AGI” if most of the time your top researchers spend isn’t spent on “true AGI”-complete tasks.