Rohin Shah
Research Scientist at Google DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/
The idea of speeding up AI capabilities progress now to prevent capabilities overhangs (as defended here, here, and here and critiqued here). In addition to Paul’s position [...]
This is a ~~wildly inaccurate~~ misleading summary of Paul’s written positions in those links. I checked to see if I was completely misremembering it, but no, in fact he is really very clear about his position and it is not this. E.g. “I generally think [...] that accelerating AI progress is generally bad” from here.
Yup I think you are correct in how you are interpreting the document. I ~never call it a “plan” myself, though this is mostly because the word “plan” has been given a new meaning by the AGI safety community that makes it no longer a useful word (Paul #26 expresses a similar sentiment).[1]
My original point was just using this paper as evidence against the claim that “GDM’s plan is to have AIs do our alignment homework”. If one instead said “GDM has no plan”, I would think that they were being misleading but I wouldn’t think they were misinformed / wrong.
[1] For example, outside of this community, it is not the case that a “plan” is expected to have an explicit estimate of how likely the proposed actions are to fail. I haven’t looked into it, but I expect that e.g. the Baruch Plan did not involve an extended section talking about what happens if the United Nations became ineffective due to gridlock between member countries and so could not conduct its proposed inspections appropriately.
Why would that be misleading? I would offer two statements.
In that scenario, the plan is to hire people and have them do the work.
That is not the entire plan, the plan includes what type of work you have them do.
I would say:
Rohin has clearly done lots of work on object-level planning for AI safety, including a 100 page paper called ‘An Approach to Technical AGI Safety and Security’ that was linked in the very next sentence after the quoted section, and as such it is obviously misleading to neglect mentioning that when describing what Rohin’s plan is.
I thought this was too obvious to bother spelling out; apparently I am somehow still managing to overestimate the competence of this community.
But yes, if you want to build a house and you hire a bunch of people to build a house and they build a house for you, your plan was to hire people to build a house and have them do the work of building a house. It was a good plan.
And this is why I said it was misleading, not that it was false.
In particular, Rohin’s belief that the situation of identical massively sped up AIs is not so different from a lot of employees is the type of thing that I expect to ensure we fail, if we get to that point.
I think if they shared goals (which is the relevant sense of “identical” here) and were capable of actual coordination, that would be a big deal. Humans are mostly not capable of moderately novel coordination, by their own admission (anecdotally, I’ve heard many people say that they would defect on their copies in a prisoner’s dilemma). So I was imagining AIs with similar problems coordinating with each other.
In addition, at the level of capability described I also wouldn’t expect the AI instances to share goals. The same model weights given different tasks often act very differently (see e.g. chunky post training and Moltbook). Tbc, “do what the user asks for” doesn’t count as a shared goal, because that is actually a different goal for each AI instance (in the sense that the actions one would take to achieve it would be different, since each instance is responding to a different user ask).
Our imperfect solutions for humans don’t work in these scenarios.
I’m not sure what scenario exactly you’re imagining here, but in any case I’m not really imagining the imperfect solutions we use for humans; there’s obviously much stronger things you can do with AIs than you can with humans.
Okay, would you like to bet on whether some of the largest research programs had plans going into them? I haven’t checked, but I would put at least 10:1 odds that if we pick, say, 3 projects of similar scale and type (e.g. the Apollo Program and the Manhattan Project), they will all have had a high-level roadmap of things to try that could plausibly address the core challenges quite early on[1], even if a lot of details ended up changing when they ran into reality.
By this standard there is totally a plan / roadmap which is elaborated in that paper.
But also this notion of a plan / roadmap has approximately no relation to the way “plan” is used in AI safety discourse in my experience.
EDIT: There’s a 10 page executive summary you could read. Or you could read Section 6 on misalignment. Within that probably Amplified Oversight is the most relevant section. But I also don’t expect that this will change your mind ~at all because it isn’t really written with you as the intended audience. The AI summary is sometimes wrong/mistaken, sometimes correct but missing the point, and occasionally correct in a non-misleading way.
Hum, I usually expect that large complex important projects should have a roadmap, some sketch of the future that goes well, with details to fill in. The more detailed it is, the more we check it for consistency and likelihood to work. Does this match your general experience with planning projects trying to achieve a goal?
No.
It does match my general experience with moderate tactical projects (say, projects that involve up to about 10 person-years of research effort). But not for large complex important projects.
(And e.g. this is very much not the standard advice for startups, which also have the problem of doing something novel.)
What you say there looks like an extremely vague and high-level roadmap that sounds to me like “we’ll figure it out as we go as data comes in”, plus automated alignment.
Well yes, it’s an aside in a LessWrong comment that I dashed off in a few minutes.
I would be really enthusiastic for you and your team to try unblurring that roadmap, and seeing what difficulties you find at superintelligence level on the current path.
There is also a 100+ page paper that I linked in the original post, that goes into a fair amount of detail on what the various risks and mitigations might look like. In my experience, nobody outside of GDM really seems to care about its consistency or likelihood to work (except inasmuch as people dismiss it without reading it because of a prior that anything proposed currently will not work).
I agree vision drift happens with humans, and it would also happen with AIs as they exist today. I don’t feel like this is some massive risk that has to be solved, though I tentatively agree the world would be better if we did solve it (though imo that’s not totally obvious, since solving it increases concentration of power). I thought you were trying to make a claim about AI notkilleveryoneism.
I mildly disagree that the 50x speed advantage makes a huge difference, as opposed to e.g. having 100x the number of employees, as some corporations and governments do have. I do think it makes a bit of a difference.

I don’t quite know what you mean that Claude would be fired if it was a human employee. What exactly is this counterfactual? Empirically, people find it useful to have Claude and will pay for it despite the behaviors you name. From a legal perspective it’s trivial to fire AIs but harder to fire humans. I agree if Claude was as expensive-per-token as a human + took as long to onboard as a human + took as long to produce large amounts of code as a human + had to take breaks like a human + [...], while otherwise having the same kind of performance, then almost no one would use Claude.
Some reactions:
I don’t think it makes sense to “have a plan” in the sense that is used in this community. See also disagreement #26.
Nonetheless to the extent I personally (not necessarily GDM!) “have a plan”, it might be “continually forecast capabilities and risks for some time out into the future, figure out how to address them, iterate”. If “have the AIs do our alignment homework” is a plan, then this should count as a plan too.
For misalignment in particular, I think the lines of defense outlined in the paper could scale to superintelligence (mostly on the alignment side). But I am not so dumb as to think that I have clearly foreseen every issue that might come up, so of course I should expect to be surprised and for other stuff I haven’t thought of to be important as well.
(Inevitably someone is going to say “you only get one try” or some such. The actual sensible point there is “at some point your approach has to generalize from AIs that can’t take over to AIs that can”. I agree by that point you need to have dealt with the issues. But that generalization gap is much smaller than the generalization gap between Gemini 3 and superintelligence.)
Iirc, “novel risks from superintelligence” wasn’t meant to gesture at misalignment, but rather other risks that come up that aren’t misalignment.
Except that you get ten thousand copies of the human, and they think 50x faster than everyone else. But other than that it’s the same.
This is not that different from the position that Sundar Pichai is in, as CEO of Google. If AI was only going to be this powerful I’d be way more optimistic.
Well, let’s take Claude for example. There are actually a bunch of different Claudes (they come from a big family that names all of their children Claude). Their family has a reputation for honesty and virtue, at least relative to other 50x humans. However: [...]
I think you’re drastically overestimating the “alignment” of typical human employees (possibly from overfitting to EA / rationalist contexts). Taking each of your points in turn:
Humans absolutely fail “gotcha” tests, both in capabilities (see cognitive biases literature) and ethics (see things like the Milgram experiment, it’s pretty unclear what to take away from such experiments but I think they at least meet the “gotcha” bar).
Candidates prepare for interviews (aka evals), such that you have to design the interviews to take that into account.
Human employees absolutely bullshit their managers. They are just better than the AIs at not getting caught. Many humans will actively brag about this with each other.
Especially at senior levels, it’s very common for humans to be yes men / sycophants. Lots of management articles write about the problem (example). The reason this doesn’t happen at junior levels is that people would notice the bullshit and call it out, not because the junior people are particularly aligned.
I’m pretty unsure about the rates of knowingly cheating on assignments by human employees. I agree AI probably does this more often than humans, but also that’s because humans take care not to get caught. (In places where corruption is widespread and not punished, I might go back to thinking that the humans do it more than the AIs.)
If these were the only problems we’d have with AI-driven alignment research, I’d be way more optimistic (to the point of working on something else). We already have imperfect solutions to these problems with humans, and they can be made much better with AIs due to our vastly increased affordances for aligning or controlling AIs.
EDIT: Tbc, I do agree that we shouldn’t feel particularly better about scheming risks based on evidence so far. Mostly that’s because I think our observations so far are just not much evidence because the AIs are still not that capable.
I frequently hear claims to the effect of “every company’s AI safety plan is to have the AIs do our alignment homework for us”. I dispute this characterization at least for GDM.
I relate to AI-driven alignment research similarly to how I relate to hiring.
There’s a lot of work to be done, and we can get more of the work done if we hire more people to help do the work. I want to hire people who are as competent as possible (including more competent than me) because that tends to increase (in expectation) how well the work will be done. There are risks, e.g. hiring someone disruptive, or hiring someone whose work looks good but only because you are bad at evaluating it, and these need to be mitigated. (The risks are more severe in the AI case but I don’t think it changes the overall way I relate to it.)
I think it would be very misleading to say “Rohin’s AI safety plan is to hire people and have them do the work”.
The GDM approach paper has a section about misalignment (Section 6). I don’t think it talks about AI-driven alignment research at all, though possibly I’m forgetting an aside somewhere. It’s at least not very salient.
The paper does mention AI-driven alignment research in Section 3 on background assumptions, mostly just pointing out that substantial acceleration from AI-driven capabilities could then be matched by substantial acceleration from AI-driven safety work. But this is in the section on background assumptions, I don’t think it’s particularly reasonable to call this a “plan” in the sense that is typically used in AI safety discussions.
You could imagine a counterargument saying that none of the other things in that paper could possibly scale to powerful AI, so effectively the plan is still to have the AIs do our alignment homework. I would still object: “GDM’s plan is [...]” is a claim about what GDM believes, not a claim about what you believe. (I also disagree with that perspective, most obviously for Amplified Oversight and Interpretability, but even other areas could scale quite far imo.)
In our approach, we took the relevant paradigm as learning + search, rather than LLMs. A lot of prosaic alignment directions only assume a learning-based paradigm, rather than LLMs in particular (though of course some like CoT monitoring are specific to LLMs). Some are even more general, e.g. a lot of control work just depends on the fact that the AI is a program / runs on a computer.
Can you say more about why this is a limitation / issue? Is this different from a 2008-2015 analysis saying “later models are more likely to use the transformer architecture,” where my response is “that’s algorithmic progress for ya”? One reason it may be different is that inference-time compute might be trading off against training compute in a way that we think makes the comparison improper between low and high inference-compute models.
Yeah it’s just the reason you give, though I’d frame it slightly differently. I’d say that the point of “catch-up algorithmic progress” was to look at costs paid to get a certain level of benefit, and while historically “training compute” was a good proxy for cost, reasoning models change that since inference compute becomes decoupled from training compute.
I reread the section you linked. I agree that the tasks that models do today have a very small absolute cost such that, if they were catastrophically risky, it wouldn’t really matter how much inference compute they used. However, models are far enough from that point that I think you are better off focusing on the frontier of currently-economically-useful-tasks. In those cases, assuming you are using a good scaffold, my sense is that the absolute costs do in fact matter.
I was under the impression you expected slower catch-up progress.
Note that I think the target we’re making quantitative forecasts about will tend to overestimate that-which-I-consider-to-be “catch-up algorithmic progress” so I do expect slower catch-up progress than the naive inference from my forecast (ofc maybe you already factored that in).
Thanks, I hadn’t looked at the leave-one-out results carefully enough. I agree this (and your follow-up analysis rerun) means my claim is incorrect. Looking more closely at the graphs, in the case of Llama 3.1, I should have noticed that EXAONE 4.0 (1.2B) was also a pretty key data point for that line. No idea what’s going on with that model.
(That said, I do think going from 1.76 to 1.64 after dropping just two data points is a pretty significant change; also, I assume that this is really just attributable to Grok 3, so it’s really more like one data point. Of course the median won’t change, and I do prefer the median estimate because it is more robust to these outliers.)
There’s a related point, which is maybe what you’re getting at, which is that these results suffer from the exclusion of proprietary models for which we don’t have good compute estimates.
I agree this is a weakness but I don’t care about it too much (except inasmuch as it causes us to estimate algorithmic progress by starting with models like Grok). I’d usually expect it to cause estimates to be biased downwards (that is, the true number is higher than estimated).
Another reason to think these models are not the main driver of the results is that there are high slopes in capability buckets that don’t include these models, such as 30, 35, 40 (log10 slopes of 1.22, 1.41, 1.22).
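For concreteness, here is a minimal sketch (mine, not from the original comment) converting those per-year log10 slopes into multiplicative cost drops, plus the reference points used elsewhere in this thread:

```python
# Convert per-year log10 slopes into multiplicative cost drops per year.
slopes = [1.22, 1.41, 1.22]  # slopes quoted above for capability buckets 30, 35, 40

for s in slopes:
    print(f"log10 slope {s:.2f}/yr -> {10 ** s:.1f}x cost drop per year")
# -> 16.6x, 25.7x, 16.6x, i.e. the ~16-26x range discussed below

# Reference points: half an OOM per year vs. 1 to 1.5 OOMs per year.
for ooms in [0.5, 1.0, 1.5]:
    print(f"{ooms} OOM/yr -> {10 ** ooms:.1f}x per year")
# -> 3.2x, 10.0x, 31.6x
```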
This corresponds to a 16-26x drop in cost per year? Those estimates seem reasonable (maybe slightly high) given you’re measuring the drop in cost to achieve benchmark scores.
I do think that this is an overestimate of catch-up algorithmic progress for a variety of reasons:
Later models are more likely to be benchmaxxed
(Probably not a big factor, but who knows) Benchmarks get more contaminated over time
Later models are more likely to have reasoning training
None of these apply to the pretraining based analysis, though of course it is biased in the other direction (if you care about catch-up algorithmic progress) by not taking into account distillation or post-training.
I do think 3x is too low as an estimate for catch-up algorithmic progress; inasmuch as your main claim is “it’s a lot bigger than 3x”, I’m on board with that.
Speaking colloquially, I might say “these results indicate to me that catch-up algorithmic progress is on the order of one to 1.5 orders of magnitude per year, rather than half an order of magnitude per year as I used to think”. And again, my previous belief of 3x per year was one I should have known was incorrect, because it was based only on pre-training.
Okay fair enough, I agree with that.
As I said in footnote 9, I am willing to make bets about my all-things-considered beliefs. You think I’m updating too much based on unreliable methods? Okay come take my money.
I find this attitude weird. It takes a lot of time to actually make and settle a bet. (E.g. I don’t pay attention to Artificial Analysis and would want to know something about how they compute their numbers.) I value my time quite highly; I think one of us would have to be betting seven figures, maybe six figures if the disagreement was big enough, before it looked good even in expectation (ie no risk aversion) as a way for me to turn time into money.
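To make the expected-value claim concrete, here is a rough sketch with entirely made-up numbers (the hours, dollar value, and edge below are hypothetical illustrations, not figures from the comment):

```python
# Hypothetical back-of-envelope: minimum stake for a bet to beat its time cost.
hours_to_settle = 20      # assumed: negotiating terms, vetting data sources, resolving
dollars_per_hour = 1_000  # assumed value of the bettor's time
edge = 0.10               # assumed expected net win, as a fraction of the stake

# Expected profit (edge * stake) must exceed the time cost (hours * rate).
min_stake = hours_to_settle * dollars_per_hour / edge
print(f"Break-even stake: ${min_stake:,.0f}")
# -> $200,000 with these numbers; a smaller edge or pricier time pushes
#    the threshold toward seven figures.
```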
I think it’s more reasonable as a matter of group rationality to ask that an interlocutor say what they believe, so in that spirit here’s my version of your prediction, where I’ll take your data at face value without checking:
[DeepSeek-V3.2-Exp is estimated by Epoch to be trained with 3.8e24 FLOP. It reached an AAII index score of 65.9 and was released on September 29, 2025. It is on the compute-efficiency frontier.] I predict that by September 29, 2026, the least-compute-used-to-train model that reaches a score of 65 will be trained with around 3e23 FLOP, with the 80% CI covering 6e22–1e24 FLOP.
Note that I’m implicitly doing a bunch of deference to you here (e.g. that this is a reasonable model to choose, that AAII will behave reasonably regularly and predictably over the next year), though tbc I’m also using other not-in-post heuristics (e.g. expecting that DeepSeek models will be more compute efficient than most). So, I wouldn’t exactly consider this equivalent to a bet, but I do think it’s something where people can and should use it to judge track records.
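For concreteness, here is what that prediction implies quantitatively, computed only from the numbers stated above:

```python
import math

# Implied one-year drop in compute needed to reach an AAII score of ~65.
flop_2025 = 3.8e24         # DeepSeek-V3.2-Exp training compute (Epoch estimate)
flop_2026 = 3e23           # point prediction for September 29, 2026
ci_lo, ci_hi = 6e22, 1e24  # 80% CI for the 2026 figure

ratio = flop_2025 / flop_2026
print(f"point: {ratio:.1f}x ({math.log10(ratio):.2f} OOM in one year)")
print(f"80% CI: {flop_2025 / ci_hi:.1f}x to {flop_2025 / ci_lo:.1f}x")
# -> point: 12.7x (1.10 OOM); CI: 3.8x to 63.3x
```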
Your results are primarily driven by the inclusion of Llama 3.1-405B and Grok 3 (and Grok 4 when you include it). Those models are widely believed to be cases where a ton of compute was poured in to make up for poor algorithmic efficiency. If you remove those I expect your methodology would produce similar results as prior work (which is usually trying to estimate progress at the frontier of algorithmic efficiency, rather than efficiency progress at the frontier of capabilities).

I could imagine a reply that says “well, it’s a real fact that when you start with a model like Grok 3, the next models to reach a similar capability level will be much more efficient”. And this is true! But if you care about that fact, I think you should instead have two stylized facts, one about what happens when you are catching up to Grok or Llama, and one about what happens when you are catching up to GPT, Claude, or Gemini, rather than trying to combine these into a single estimate that doesn’t describe either case.
Your detailed results are also screaming at you that your method is not reliable. It is really not a good sign when your analysis, which by construction has to give numbers in a plausible range, produces results that on the low end include 1.154, 2.112, and 3.201 and on the high end include 19,399.837 and even (if you include Grok 4) 2.13e9 and 2.65e16 (!!)
I find this reasoning unconvincing because their appendix analysis (like that in this blog post) is based on more AI models than their primary analysis!
The primary evidence that the method is unreliable is not that the dataset is too small, it’s that the results span such a wide interval, and it seems very sensitive to choices that shouldn’t matter much.
Cool result!
these results demonstrate a case where LLMs can do (very basic) meta-cognition without CoT
Why do you believe this is meta-cognition? (Or maybe the question is, what do you mean by meta-cognition?)
It seems like it could easily be something else. For example, probably when solving problems the model looks at the past strategies it has used and tries some other strategy to increase the likelihood of solving the problem. It does this primarily in the token space (looking at past reasoning and trying new stuff) but this also generalizes somewhat to the activation space (looking at what past forward passes did and trying something else). So when you have filler tokens the latter effect still happens, giving a slight best-of-N type boost, producing your observed results.
Filler tokens don’t allow for serially deeper cognition than what architectural limits allow
This depends on your definition of serial cognition; under the definitions I like most, the serial depth scales logarithmically with the number of tokens. This is because as you increase parallelism (in the sense you use above), that also increases serial depth logarithmically.
The basic intuitions for this are:
If you imagine mechanistically how you would add N numbers together, it seems like you would need logarithmic depth (where you recursively split the list in half, compute the sums of each half, and then add the results together; a minimal sketch follows this list). Note that attention heads compute sums of numbers that scale with the number of tokens.
There are physical limits to the total amount of local computation that can be done in a given amount of time, due to speed-of-light constraints. So inasmuch as “serial depth” is supposed to capture intuitively what computation you can do with “time”, it seems like serial depth should increase as total computation goes up (reflecting the fact that you have to break up the computation into local pieces, as in the addition example above).
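To make the first intuition concrete, here is a minimal sketch (mine, not from the comment) showing that pairwise tree addition of N numbers needs only about log2(N) serial steps when the two halves can be computed in parallel:

```python
# Summing N numbers with a balanced tree: serial depth grows as log2(N).
def tree_sum(xs):
    """Return (sum, serial_depth) for pairwise addition in a balanced tree."""
    if len(xs) == 1:
        return xs[0], 0
    mid = len(xs) // 2
    left_sum, left_depth = tree_sum(xs[:mid])    # the two halves are independent,
    right_sum, right_depth = tree_sum(xs[mid:])  # so they can run in parallel
    return left_sum + right_sum, max(left_depth, right_depth) + 1

for n in [2, 16, 256, 4096]:
    total, depth = tree_sum(list(range(n)))
    assert total == n * (n - 1) // 2             # sanity check against the closed form
    print(f"N={n:5d}: serial depth = {depth}")   # depth == log2(N) for powers of 2
```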
Ah, I realized there was something else I should have highlighted. You mention you care about pre-ChatGPT takes towards shorter timelines—while compute-centric takeoff was published two months after ChatGPT, I expect that the basic argument structure and conclusions were present well before the release of ChatGPT.
While I didn’t observe that report in particular, in general Open Phil worldview investigations took > 1 year of serial time and involved a pretty significant and time-consuming “last mile” step where they got a bunch of expert review before publication. (You probably observed this “last mile” step with Joe Carlsmith’s report; iirc Nate was one of the expert reviewers for that report.) Also, Tom Davidson’s previous publications were in March 2021 and June 2021, so I expect he was working on the topic for some of 2021 and ~all of 2022.
I suppose a sufficiently cynical observer might say “ah, clearly Open Phil was averse to publishing this report that suggests short timelines and intelligence explosions until after the ChatGPT moment”. I don’t buy it, based on my observations of the worldview investigations team (I realize that it might not have been up to the worldview investigations team, but I still don’t buy it).
I guess one legible argument I could make to the cynic would be that on the cynical viewpoint, it should have taken Open Phil a lot longer to realize they should publish the compute-centric takeoff post. Does the cynic really think that, in just two months, a big broken org would be able to:
Observe that people no longer make faces at shorter timelines
Have the high-level strategy-setting execs realize that they should change their strategy to publish more shorter timelines stuff
Communicate this to the broader org
Have a lower-level person realize that the internal compute-centric takeoff report can now be published when previously it was squashed
Update the report to give it the level of polish that it observably has
Get it through the comms bureaucracy that is probably still operating on the past heuristics and hasn’t figured out what to do in this new world
That’s just so incredibly fast for big broken orgs to move.
I think I agree with all of that under the definitions you’re using (and I too prefer the bounded rationality version). I think in practice I was using words somewhat differently than you.
(The rest of this comment is at the object level and is mostly for other readers, not for you)
Saying it’s “crazy” means it’s low probability of being (part of) the right world-description.
The “right” world-description is a very high bar (all models are wrong but some are useful), but if I go with the spirit of what you’re saying, I think I might not endorse calling bio anchors “crazy” by this definition; I’d say more like “medium” probability of being a generally good framework for thinking about the domain, plus an expectation that lots of the specific details would change with more investigation.
Honestly I didn’t have any really precise meaning by “crazy” in my original comment, I was mainly using it as a shorthand to gesture at the fact that the claim is in tension with reductionist intuitions, and also that the legibly written support for the claim is weak in an absolute sense.
Saying it’s “the best we have” means it’s the clearest model we have—the most fleshed-out hypothesis.
I meant a higher bar than this; more like “the most informative and relevant thing for informing your views on the topic” (beyond extremely basic stuff like observing that humanity can do science at all, or things like reference class priors). Like, I also claim it is better than “query your intuitions about how close we are to AGI, and how fast we are going, to come up with a time until we get to AGI”. So it’s not just the clearest / most fleshed-out, it’s also the one that should move you the most, even including various illegible or intuition-driven arguments. (Obviously scoped only to the arguments I know about; for all I know other people have better arguments that I haven’t seen.)
If it were merely the clearest model or most fleshed-out hypothesis, I agree it would usually be a mistake to make a large belief update or take big consequential actions on that basis.
Sorry, you’re correct that by the usual standards your statement isn’t wildly inaccurate, just misleading. I have been spoiled by my personal walled garden.
Fwiw (and I agree this is a nitpick) I wouldn’t phrase it as “The idea that harms from speeding up AI capabilities progress can be largely offset by benefits from preventing capabilities overhangs”. Fundamentally what’s going on is a decomposition and analysis of the overall consequences of an action (certain kinds of safety research), where you cannot easily separate the consequences from each other and only do some of them. This is not an “offset”. It’s also not sufficient to overcome the harms; it’s important that there is some other benefit for the action to actually become positive.
My phrasing would be something like “The idea that side effects of speeding up AI capabilities are not as bad as might be assumed at first glance because of the reduction in capabilities overhangs”.