Contradict my take on OpenPhil’s past AI beliefs
At many points now, I’ve been asked in private for a critique of EA / EA’s history / EA’s impact and I have ad-libbed statements that I feel guilty about because they have not been subjected to EA critique and refutation. I need to write up my take and let you all try to shoot it down.
Before I can or should try to write up that take, I need to fact-check one of my take-central beliefs about how the last couple of decades have gone down. My belief is that the Open Philanthropy Project, EA generally, and Oxford EA particularly, had bad AI timelines and bad ASI ruin conditional probabilities; and that these invalidly arrived-at beliefs were in control of funding, and were explicitly publicly promoted at the expense of saner beliefs.
An exemplar of OpenPhil / Oxford EA reasoning about timelines is that, as late as 2020, their position on timelines seemed to center on Ajeya Cotra’s “Biological Anchors” estimate, which put the median timeline to AGI around 30 years out. Leadership dissent from this viewpoint, as I recall, generally centered on having longer rather than shorter median timelines.
An exemplar of poor positioning on AI ruin is Joe Carlsmith’s “Is Power-Seeking AI an Existential Risk?” which enacted a blatant Multiple Stage Fallacy in order to conclude this risk was ~5%.
I recall being told verbally in person by OpenPhil personnel that Cotra and Carlsmith were representative of the OpenPhil view and would be the sort of worldview that controlled MIRI’s chances of getting funding from OpenPhil, i.e., we should expect funding decisions to be premised on roughly these views and try to address ourselves to those premises if we wanted funding.
In recent personal conversations in which I exposited my current fault analysis of EA, I’ve heard people object, “But this wasn’t an official OpenPhil view! Why, some people inside OpenPhil discussed different views!” I think they are failing to appreciate the extent to which mere tolerance of dissenting discussion is not central, in an organizational-psychology analysis of what a large faction actually does. But also, EAs have consistently reacted with surprised dismay when I presented my view that these bad beliefs were in effective control. They may have better information than I did; I was an outsider and did not much engage with what I estimated to then be a lost cause. I want to know the true facts of OpenPhil’s organizational history whatever they may be.
I therefore throw open to EAs / OpenPhil personnel / the Oxford EAs, the question of whether they have strong or weak evidence that any dissenting views from “AI in median >30 years” and “utter AI ruin <10%” (as expressed in the correct directions of shorter timelines and worse ruin chances; and as said before the ChatGPT moment), were permitted to exercise decision-making power over the flow of substantial amounts of funding; or if the weight of reputation and publicity of OpenPhil was at any point put behind promoting those dissenting viewpoints (in the correct direction, before the ChatGPT moment).
This to me is the crux in whether the takes I have been giving in private were fair to OpenPhil. Tolerance of verbal discussion of dissenting views inside OpenPhil is not a crux. EA forum posts are not a crux even if the bylines include mid-level OpenPhil employees.
Public statements saying “But I do concede 10% AGI probability by 2036”, or “conditional on ASI at all, I do assign substantial probability to this broader class of outcomes that includes having a lot of human uploads around and biological humans thereby being sidelined”, are not something I see as exculpatory; they are rather clear instances of what I see as a larger problem for EA and a primary way it did damage.
(Eg, imagine that your steamship is sinking after hitting an iceberg, and you are yelling for all passengers to get to the lifeboats. As it seems like a few passengers might be starting to pay some little attention, somebody wearing a much more expensive and serious-looking suit than you can afford, stands up and begins declaiming about how their own expert analysis does suggest a 10% chance that the ship takes on enough water to sink as early as the next week; and that they think this has a 25% chance of producing a broad class of genuinely attention-worthy harms, like many passengers needing to swim to the ship’s next destination.)
I have already asked the shoggoths to search for me, and it would probably represent a duplication of effort on your part if you all went off and asked LLMs to search for you independently. I want to know if insiders have contrary evidence that I as an outsider did not know about. If my current take is wrong and unfair, I want to know it; that is not the same as promising to be easy to convince, but I do want to know.
I repeat: You should understand my take to be that of an organizational-psychology cynic who is not per se impressed by the apparent tolerance of dissenting views, people invited to give dissenting talks, dissenters still being invited to parties, et cetera. None of that will surprise me. I do not view it as sufficient to organizational best practices. I will only be surprised by the demonstrated past pragmatic power to control the disposition of funding and public promotion of ideas, contrary to “AGI median in 30 years or longer” and “utter ruin at 10% or lower”, before the ChatGPT moment.
(If you doubt my ability to ever concede to evidence about this sort of topic, observe this past case on Twitter where I immediately and without argument concede that OpenPhil was right and I was wrong, the moment that the evidence appeared to be decisive. (The choice of example may seem snarky but is not actually snark; it is not easy for me to find other cases where, according to my own view, clear concrete evidence came out that I was definitely wrong and OpenPhil definitely right; and I did in that case immediately concede.))
In 2018 I recall being at a private talk hosted by ~2 people that OpenPhil worked closely with and/or thought of as senior advisors on AI. It was a confidential event so I can’t say who or any specifics, but they were saying that they wanted to take short AI timelines seriously while demanding confidentiality about this, because they felt that being open about these thoughts would lead other actors to conclude that AI was important to get involved with, without also concluding that the risk was high. I think this confidentiality about their beliefs was a significant part of the dynamics around OpenPhil’s beliefs and signaling at the time.
Some takes:
To answer the main question, I agree circa 2020 if you had to ascribe beliefs to Open Phil the org it would be something like “~30 year median timelines” and “~10% utter AI ruin”. (And I had some visibility on this beyond public info.) I think this throws away tons of detail about actual Open Phil at the time but imo it’s a reasonable move to make.
I don’t think Open Phil had a goal of “promoting” those beliefs, and don’t recall them doing things that would be reasonably considered to be primarily about “promoting” those beliefs.
Obviously they published reports about those topics, but the reason for that is to advance discourse / collective epistemics.
Iirc, bio anchors was the only one that said ~30 year median timelines; I believe semi-informative priors was significantly longer and compute-centric takeoff significantly shorter.
I don’t think these beliefs made much of a difference to their grant making on object level technical research (relative to beliefs of shorter timelines / higher probability of ruin), just because people almost always want to act on shorter timelines / higher ruin probabilities (than 10%) because those worlds are easier to influence. This is similar to Buck’s take so see his comment for elaboration.
I kinda suspect that upon reading the previous bullet you will say “sure, but the researchers will have contorted their beliefs to make themselves palatable to Open Phil”.
They cared a lot about preventing this. At one point while I was at CHAI (funded by Open Phil), we were talking with someone from Open Phil and I asked about their belief on something, and they declined to answer because they didn’t want their grantees overfitting to their beliefs. I actually found it quite frustrating and tried to argue with them that we were in fact very opinionated and weren’t about to simply adopt their beliefs without questioning them.
Of course it is possible that researchers conformed to Open Phil’s beliefs anyway.
Certainly there was a lot of deference from some communities to Open Phil’s views here, though I would guess that was driven more by their reputation as good thinkers rather than their ability to direct funding. I think there’s a ton of deference to you (Eliezer) in the rationalist community, for similar reasons. Both groups tend to underestimate how much deference they are doing.
Maybe you are also making some claim about Open Phil’s (un)willingness to fund work more focused on comms or influencing social reality towards shorter timelines / higher probabilities of ruin? I don’t have any knowledge about that, but mostly I don’t recall many shovel-ready opportunities being available in 2020.
On the actual object level beliefs:
On timelines, my rough take is “nobody knows, it’s not particularly important for technical work beyond ‘could be soon’, it’s often actively negative to discuss because it polarizes people and wastes time, so ignore it where possible”. But if I’m forced to give an opinion then I’m most influenced by the bio anchors lineage of work and the METR time horizon work.
Generally my stance is roughly: Bio anchors / compute-centric timelines is the worst form of timelines prediction, except all those other approaches that have been tried from time to time.
I agree that the timelines in the original bio anchors report were too long, but mainly because it failed to capture acceleration caused by AI prior to TAI. This was pointed out by Open Phil, specifically in Tom Davidson’s compute-centric takeoff speeds report.
Your views are presumably captured by your critique of bio anchors. I don’t find it convincing; my feelings about that critique are similar to Scott Alexander’s. In the section “Response 4: me”, he says: “Given these two assumptions—that natural artifacts usually have efficiencies within a few OOM of artificial ones, and that compute drives progress pretty reliably—I am proud to be able to give Ajeya’s report the coveted honor of “I do not make an update of literally zero upon reading it”.”
In particular, the primary rejoinder to “why should bio anchors bind to reality; people do things via completely different mechanisms than nature” is “we tried applying this style of reasoning in a few cases where we know the answer, and it seems like it kinda sorta works”.
(Remember that I only want to defend “worst form of timelines prediction except all the other approaches”. I agree this is kind of a crazy argument in some absolute sense.)
So in my view Open Phil looks great in retrospect on this question.
I still have a pretty low probability on AI ruin (by the standards of LessWrong), as I have for a long time, since we haven’t gotten much empirical evidence on the matter. So the 10% AI ruin seems fine in retrospect to me.
Best source on the reasons for disagreement is Where I agree and disagree with Eliezer (written by Paul but I endorsed almost all of it in a comment).
Best source that comes from me is this conversation with AI Impacts, though note it was in 2019 and I wouldn’t endorse all of it any more.
So, just so we’re on the same page abstractly: Would you agree that updating / investing “a lot” in an argument that’s kind of crazy in some absolute sense, would be an epistemic / strategic mistake, even if that argument is the best available specific argument in a relative sense?
Hmm, maybe? What exactly is the alternative?
Some things that I think would usually be epistemic / strategic mistakes in this situation:
Directly adopting the resulting distribution
Not looking into other arguments
Taking actions that would be significantly negative in “nearby” worlds that the argument suggests are unlikely. (The “nearby” qualifier is to avoid problem-of-induction style issues.)
Some things that I don’t think would immediately qualify as epistemic / strategic mistakes (of course they could still be mistakes depending on further details):
Making a large belief update. Generally it seems plausible that you start with some very inchoate opinions (or in Bayesian terms an uninformed prior) and so any argument that seems to have some binding to reality, even a pretty wild and error-prone one, can still cause a big update. (Related: The First Sample Gives the Most Information.)
This is not quite how I felt about bio anchors in 2020 -- I do think there was some binding-to-reality in arguments like “make a wild guess about how far we are, and then extrapolate AI progress out and see when it reaches that point”—but it is close.
Taking consequential actions as a result of the argument. Ultimately we are often faced with consequential decisions, and inaction is also a choice. Obviously you should try to find actions that don’t depend on the particular axis you’re uncertain about (in this case timelines), but sometimes that would still leave significant potential value on the table. In that case you should use the best info available to you.
I actually think this is one of the biggest strengths of Open Phil around 2016-2020. At that time, (1) the vast, vast majority of people believed that AGI was a long time away (especially as revealed by their actions, but also via stated beliefs), and (2) AI timelines was an incredibly cursed topic where there were few arguments that had any semblance of binding to reality. Nevertheless, they saw some arguments that had some tiny bit of binding to reality, concluded that there was a non-trivial chance of AGI within 2 decades, and threw a bunch of effort behind acting on that scenario because it would be so important if it happened.
So imo they did invest “a lot” in the belief that AGI could be soon (though of course that wasn’t based just on bio anchors), and I think this looks like a great call in hindsight.
Thanks!
We easily agree that this depends on further details, but just at this abstract level, I want to record the case that these “probably, usually” are mistakes. (I’m avoiding the object level because I’m not very invested in that discussion—I have opinions but they aren’t the result of lots of investigation on the specific topic of bioanchors or OP’s behavior; totally fair for you to therefore bow out / etc.)
The case is like this:
Suppose you have a Bayesian uninformed prior. Then someone makes an argument that’s “kinda crazy but the best we have”. What should happen? How does a “kinda crazy argument” cash out in terms of likelihood ratios? I’m not sure, and actually I don’t want to think of it in a simple Bayesian context; but one way would be to say: On the hypothesis that AGI comes at year X, we should see more good arguments for AGI coming at year X. When we see an argument for that, we update to thinking AGI comes at year X. How much? Well, if it’s a really good argument, we update a lot. If it’s a crazy argument, we don’t update much, because any Y predicts that there are plenty of crazy arguments for AGI coming at year Y.
The way I actually want to think about the situation would be in terms of bounded rationality and abduction. The situation is more like, we start with pure “model uncertainty”, or in other words “we haven’t thought of most of the relevant hypotheses for how things actually work; we’re going off of a mush of weak guesses, analogies, and high-entropy priors over spaces that seem reasonable”. What happens when we think of a crazy model? It’s helpful, e.g. to stimulate further thinking which might lead to good hypotheses. But does it update our distributions much? I think in terms of probabilities, it looks like fleshing out one very unlikely hypothesis. Saying it’s “crazy” means it’s low probability of being (part of) the right world-description. Saying it’s “the best we have” means it’s the clearest model we have—the most fleshed-out hypothesis. Both of these can be true. But if you add an unlikely hypothesis, you don’t update the overall distribution much at all.
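To gesture at the quantitative shape of that point, here is a minimal numerical sketch (the dates and weights are purely illustrative, my own, not anyone’s actual credences): adding one fleshed-out but low-credence hypothesis to a mush of weak guesses barely moves the mixture.

```python
import numpy as np

# Illustrative only: a "mush" of weak guesses over AGI-arrival decades,
# represented as a flat prior over 2030-2100.
years = np.arange(2030, 2101, 10)
prior = np.ones(len(years)) / len(years)

# A newly fleshed-out but low-credence model predicts "around 2050".
model_weight = 0.05                        # "crazy" = low probability of being right
model_pred = np.where(years == 2050, 1.0, 0.0)

# Mixture of the old mush and the new model.
posterior = (1 - model_weight) * prior + model_weight * model_pred
print(dict(zip(years.tolist(), posterior.round(3).tolist())))
# The 2050 bucket moves from 0.125 to ~0.17: a real but modest shift, even though
# the new model is by far the clearest / most fleshed-out hypothesis around.
```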
I think I agree with all of that under the definitions you’re using (and I too prefer the bounded rationality version). I think in practice I was using words somewhat differently than you.
(The rest of this comment is at the object level and is mostly for other readers, not for you)
The “right” world-description is a very high bar (all models are wrong but some are useful), but if I go with the spirit of what you’re saying, I think I might not endorse calling bio anchors “crazy” by this definition; I’d say more like “medium” probability of being a generally good framework for thinking about the domain, plus an expectation that lots of the specific details would change with more investigation.
Honestly I didn’t have any really precise meaning by “crazy” in my original comment, I was mainly using it as a shorthand to gesture at the fact that the claim is in tension with reductionist intuitions, and also that the legibly written support for the claim is weak in an absolute sense.
I meant a higher bar than this; more like “the most informative and relevant thing for informing your views on the topic” (beyond extremely basic stuff like reference class priors). Like, I also claim it is better than “query your intuitions about how close we are to AGI, and how fast we are going, to come up with a time until we get to AGI”. So it’s not just the clearest / most fleshed-out, it’s also the one that should move you the most, even including various illegible or intuition-driven arguments. (Obviously scoped only to the arguments I know about; for all I know other people have better arguments that I haven’t seen.)
If it were merely the clearest model or most fleshed-out hypothesis, I agree it would usually be a mistake to make a large belief update or take big consequential actions on that basis.
I also want to qualify / explain my statement about it being a crazy argument. The specific part that worries me (and Eliezer, iiuc) is the claim that, at a given point in time, the delta between natural and artificial artifacts will tend to be approximately constant across different domains. This is quite intuition-bending from a mechanistic / reductionist viewpoint, and the current support for it seems very small and fragile (this 8 page doc). However, I can see a path where I would believe in it much more, which would involve things like:
Pinning down the exact methodology: Which metrics are we making this “approximately constant delta” claim for? It’s definitely not arbitrary metrics.
The doc has an explanation, but I don’t currently feel like I could go and replicate it myself while staying faithful to the original intuitions, so I currently feel like I am deferring to the authors of the doc on how they are choosing their metrics.
Once we do this, can we explain how we’re applying it in the AI case? Why should we anchor brain size to neural net size, instead of e.g. brain training flops to neural net training flops?
More precise estimates: Iirc there is a lot of “we used this incredibly heuristic argument to pull out a number, it’s probably not off by OOMs so that’s fine for our purposes”, which I think is reasonable but makes me uneasy.
Get more data points: Surely there are more artifacts to compare.
Check for time invariance: Repeat the analysis at different points in time—do we see a similar “approximately constant gap” if we look at manmade artifacts from (say) 2000 or 1950? Equivalently, this theory predicts that the rate of progress on the chosen metrics should be similar across domains; is that empirically supported?
Flesh out a semi-mechanistic theory (that makes this prediction).
The argument would be something like: even if you have two very very different optimization procedures (evolution vs human intelligence), as long as they are far from optimality, it is reasonable to model them via a single quantity (“optimization power” / “effective fraction of the search space covered”) which is the primary determinant of the performance you get, irrespective of what domain you are in (as long as the search space in the domain is sufficiently large / detailed / complicated). As a result, as long as you focus on metrics that both evolution and humans were optimizing, you should expect the difference in performance on the metric to be primarily a function of the difference in optimization power between evolution and humans-at-that-point-in-time, and to be approximately independent of the domain. (A toy sketch of what such a theory would predict appears after this list.)
Once you have such a theory, check whether the bio anchors application is sensible according to the theory.
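To make the shape of such a theory concrete, here is a toy sketch of what it would predict, with entirely made-up numbers of my own (nothing here comes from the bio anchors doc): if log-performance in every domain were just optimization power minus a domain-specific difficulty offset, the evolution-vs-human gap would be the same in every domain.

```python
# Toy model (illustrative only): log10(performance) = optimization_power - domain_difficulty.
# If something like this held, the evolution-vs-human delta would be domain-independent.

domains = {"flight_efficiency": 4.0, "photosynthesis": 6.5, "locomotion": 3.2}  # made-up difficulty offsets
optimizers = {"evolution": 12.0, "humans_2020": 9.0}                            # made-up optimization power

for domain, difficulty in domains.items():
    evo = optimizers["evolution"] - difficulty
    hum = optimizers["humans_2020"] - difficulty
    print(f"{domain:>18}: evolution {evo:.1f}, humans {hum:.1f}, gap {evo - hum:.1f} OOM")
# The gap is 3.0 OOM in every domain, because it depends only on the difference in
# optimization power, not on the domain offset. The empirical question is whether
# real metrics behave anything like this across domains and across time.
```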
I anticipate someone asking the followup “why didn’t Open Phil do that, then?” I don’t know what Open Phil was thinking, but I don’t think I’d have made a very different decision. It’s a lot of work, not many people can do it, and many of those people had better things to do, e.g. imo the compute-centric takeoff work was indeed more important and caused bigger updates than I think the work above would have done (and was probably easier to do).
Ah, I realized there was something else I should have highlighted. You mention you care about pre-ChatGPT takes towards shorter timelines—while compute-centric takeoff was published two months after ChatGPT, I expect that the basic argument structure and conclusions were present well before the release of ChatGPT.
While I didn’t observe that report in particular, in general Open Phil worldview investigations took > 1 year of serial time and involved a pretty significant and time-consuming “last mile” step where they get a bunch of expert review before publication. (You probably observed this “last mile” step with Joe Carlsmith’s report, iirc Nate was one of the expert reviewers for that report.) Also, Tom Davidson’s previous publications were in March 2021 and June 2021, so I expect he was working on the topic for some of 2021 and ~all of 2022.
I suppose a sufficiently cynical observer might say “ah, clearly Open Phil was averse to publishing this report that suggests short timelines and intelligence explosions until after the ChatGPT moment”. I don’t buy it, based on my observations of the worldview investigations team (I realize that it might not have been up to the worldview investigations team, but I still don’t buy it).
I guess one legible argument I could make to the cynic would be that on the cynical viewpoint, it should have taken Open Phil a lot longer to realize they should publish the compute-centric takeoff post. Does the cynic really think that, in just two months, a big broken org would be able to:
Observe that people no longer make faces at shorter timelines
Have the high-level strategy-setting execs realize that they should change their strategy to publish more shorter timelines stuff
Communicate this to the broader org
Have a lower-level person realize that the internal compute-centric takeoff report can now be published when previously it was squashed
Update the report to give it the level of polish that it observably has
Get it through the comms bureaucracy that are probably still operating on the past heuristics and haven’t figured out what to do in this new world
That’s just so incredibly fast for big broken orgs to move.
Informative comment for me. Thank you.
So your question is whether (with added newline and capitalization for clarity):

Any dissenting views from “AI in median >30 years” and “utter AI ruin <10%” (as expressed in the correct directions of shorter timelines and worse ruin chances; and as said before the ChatGPT moment) were permitted to exercise decision-making power over the flow of substantial amounts of funding;

Or if the weight of reputation and publicity of OpenPhil was at any point put behind promoting those dissenting viewpoints (in the correct direction, before the ChatGPT moment).
Re the first part:
Open Phil decisions were strongly affected by whether they were good according to worldviews where “utter AI ruin” is >10% or timelines are <30 years. Many staff believed at the time that worlds with shorter timelines and higher misalignment risk were more tractable to intervene on, and so put additional focus on interventions targeting those worlds; many also believed that risk was >10% and that median timeline was <30 years. I’m not really sure how to operationalize this, but my sense is that the majority of their funding related to AI safety was targeted at scenarios with higher misalignment risk and shorter timelines than 10%/30 years.
As an example, see Some Background on Our Views Regarding Advanced Artificial Intelligence (2016), where Holden says that his belief that P(AGI before 2036) is above 10% “is important to my stance on the importance of potential risks from advanced artificial intelligence. If I did not hold it, this cause would probably still be a focus area of the Open Philanthropy Project, but holding this view is important to prioritize the cause as highly as we’re planning to.” So he’s clearly saying that the grantmaking strategy is strongly affected by wanting to target the sub-20-year timelines.
I’m not sure how to translate this into the language you use. Among other issues, it’s a little weird to talk about the relative influence of different credences over hypotheses, rather than the relative influence of different hypotheses. The “AI risk is >10% and <30 years” hypotheses had a lot of influence, but that could be true even if all the relevant staff had believed that AI risk is <10% and >30 years (if they’d also believed that those worlds were particularly leveraged to intervene on, as they do).
Lots of decisions were made that would not have been made given the decision procedure of “do whatever’s best assuming AI is in >30 years and risk is <10%”—I think that that decision procedure would have massively changed the AI safety stuff Open Phil did.
I think that this suffices to contradict your description of the situation—they explicitly made many of their decisions based on the possibility of shorter timelines than you described. I haven’t presented evidence here that something similar is true for their assessment of misalignment risk, but I also believe that to be the case.
If I persuaded you of the claims I wrote here (only some of which I backed up with evidence), would that be relevant to your overall stance?
All of this is made more complicated by the fact that Open Phil obviously is and was a large organization with many staff and other stakeholders, who believed different things and had different approaches to translating beliefs into decisions, and who have changed over time. So we can’t really talk about what “Open Phil believed” coherently.
Re the second part: I think the weight of reputation and publicity was put behind encouraging people to plan for the possibility of AI sooner than 30 years, as I noted above; this doesn’t contradict the statement you’ve made but IMO it is relevant to your broader point.
Section 2.2 in “Some Background...” looks IMO pretty prescient
(As a random reference, I thought Joe’s paper about low AI takeover risk was silly at the time, and I think that most people working on grants motivated by AI risk at OP at the time had higher estimates of AI takeover risk. I also thought a lot of takes from the Oxford EAs were pretty silly and I found them frustrating at the time and think they look worse with hindsight. Obviously, many of my beliefs at many of these time periods also look silly in hindsight.)
If you imagine the very serious person wearing the expensive suit saying, “But of course we must prepare for cases where the ship sinks sooner and there is a possibility of some passengers drowning”, whether or not this is Very Exculpatory depends on the counterfactual for what happens if the guy is not there. I think OpenPhil imagines that if they are not there, even fewer people take MIRI seriously. To me this is not clear and it looks like the only thing that broke the logjam was ChatGPT, after which the weight and momentum of OpenPhil views was strongly net negative.
One issue among others is that the kind of work you end up funding when the funding bureaucrats go to the funding-seekers and say, “Well, we mostly think this is many years out and won’t kill everyone, but, you know, just in case, we thought we’d fund you to write papers about it” tends to be papers that make net negative contributions.
Okay, so it sounds like you’re saying that the claims I asserted aren’t cruxy for your claim you wanted contradicted?
I definitely don’t think that Open Phil thought of “have more people take MIRI seriously” as a core objective, and I imagine that opinions on whether “people take MIRI more seriously” is good would depend a lot on how you operationalize it.
I think that Open Phil proactively tried to take a bunch of actions based on the hypothesis that powerful AI would be developed within 20 years. I think the situation with the sinking ship is pretty disanalogous—I think you’d need to say that your guy in the expensive suit was also one of the main people who was proactively taking actions based on the hypothesis that the ship would sink faster.
FWIW I heard rumor they thought of the roughly opposite, “Have people think OpenPhil doesn’t take MIRI seriously”, as an objective. I heard a story that when OpenPhil staff went to academia to interview lots of academics about doing grantmaking in the field of AI, all the academics strongly dismissed MIRI as cranks and bad to associate with, and OpenPhil felt their credibility would be harmed by associating with MIRI.
This is consistent with (and somewhat supported by) the OpenPhil grant report to MIRI saying that they could’ve picked anywhere between $1.5M and $0.5M, and they picked the latter for signaling reasons.
That’s not literally the opposite, that’s a different thing, obviously.
I’m not sure I follow[1]. It’s not a perfect match for the opposite (“Have fewer people take MIRI seriously”) but it’s roughly/functionally in the opposite direction in terms of their funding choices and influence on the discourse.
You may be responding to an earlier edit of mine; I somewhat substantially edited within ~5 mins of commenting, and then found you’d already replied.
I think this is a pretty poor model of the attitudes of the relevant staff at the time. I also think your disparaging language here leads to your comments being worse descriptions of what was going on.
Well, there sure is a simple story for how it looked from outside. What’s the complicated real truth that you only get to know about from the inside, where everything is, like, not ignorantly handwaved off as incredibly standard bureaucratic organizational dynamics of grantees telling the grantmaker what it wants to hear?
Why does the attitude of the funding bureaucrats make the output of the (presumably earnestly motivated) researchers net-negative?
Is this mostly a selection effect where the people who end up getting funding are not earnest? Is the impact of the funding-signal stronger than the impact of the papers themselves? Is it that even though the researchers are earnest, there’s selection on which things they’re socially allowed to say and this distortion is bad enough that they would have been better off saying nothing?
I expect it’s a combination of selection effects and researchers knowing implicitly where their bread is buttered; I have no particular estimate of the relative share of these effects, except that they are jointly sufficient that, eg, a granter can hire what advertises itself as a group of superforecasters, and get back 1% probability on AI IMO gold by 2025.
It’s not obvious to me that Ajeya’s timelines aged worse than Eliezer’s. In 2020, Ajeya’s median estimate for transformative AI was 2050. My guess is that if based on this her estimate for “an AI that can, if it wants, kill all humans and run the economy on its own without major disruptions” would have been like 2056? I might be wrong, people who knew her views better at the time can correct me.
As far as I know, Eliezer never made official timeline predictions, but in 2017 he made an even-odds bet with Bryan Caplan that AI would kill everyone by January 1, 2030. And in December 2022, just after ChatGPT, he tweeted that a child conceived then “may have a fair chance” of not living to see kindergarten.
I think a child conceived in December 2022 would go to kindergarten in September 2028 (though I’m not very familiar with the US kindergarten system). Generously interpreting “may have a fair chance” as a median, this is a late 2028 median for AI killing everyone.
Unfortunately, both these Eliezer predictions are kind of made as part of jokes (he said at the time that the bet wasn’t very serious). But I think we shouldn’t reward people for only making joking predictions instead of 100-page reports, so I think we should probably accept 2028-2030 as Eliezer’s median at the time.
I think if “an AI that can, if it wants, kill all humans and run the economy on its own without major disruptions” comes before 2037, Eliezer’s prediction will fare better, if it comes after that, then Ajeya’s prediction will fare better. I’m currently about 55% that we will get such AI by 2037, so from my current standpoint I consider Eliezer to be mildly ahead, but only very mildly.
You can do better by saying “I don’t know” than by saying a bunch of wrong stuff. My long reply to Cotra was, “You don’t know, I don’t know, your premises are clearly false, and if you insist on my being Bayesian and providing a direction of predictable error when I claim predictable error then fine your timelines are too long.”
I think an important point is that people can be wrong about timelines in both directions. Anthropic’s official public prediction is that they expect a “country of geniuses in a data center” by early 2027. I heard that previously Dario predicted AGI to come even earlier, by 2024 (though I can’t find any source for this now and would be grateful if someone found a source or corrected me if I’m misremembering). Situational Awareness predicts AGI by 2027. The AI safety community’s most successful public output is called AI 2027. These are not fringe figures but some of the most prominent voices in the broader AI safety community. If their timelines turn out to be much too short (as I currently expect), then I think Ajeya’s predictions deserve credit for pushing against these voices, and not only blame for stating a too-long timeline.
And I feel it’s not really true that you were just saying “I don’t know” and not implying some predictions yourself. You had the 2030 bet with Bryan. You had the tweet about children not living to see kindergarten. You strongly pushed back against the 2050 timelines, but as far as I know the only time you pushed back against the very aggressive timelines was your kindergarten tweet, which still implies 2028 timelines. You are now repeatedly calling people who believed the 2050 timelines total fools, which would be an imo very unfair thing to do if AGI arrived after 2037, so I think this implies high confidence on your part that it will come before 2037.
To be clear, I think it’s fine, and often inevitable, to imply things about your timelines beliefs by e.g. what you do and don’t push back against. But I think it’s not fair to claim that you only said “I don’t know”, I think your writing was (perhaps unintentionally?) implying an implicit belief that an AI capable of destroying humanity will come with a median of 2028-2030. I think this would have been a fine prediction to make, but if AI capable of destroying humanity comes after 2037 (which I think is close to 50-50), then I think your implicit predictions will fare worse than Ajeya’s explicit predictions.
That doesn’t sound like the correct response though. You should just say “I predict this isn’t the reason AGI will come late, if AGI comes late”. It’s much less legible / operationalized, but if that’s what you think you know in the context, why add on extra stuff?
When somebody at least pretending to humility says, “Well, I think this here estimator is the best thing we have for anchoring a median estimate”, and I stroll over and proclaim, “Well I think that’s invalid”, I do think there is a certain justice in them demanding of me, “Well, would you at least like to say then in what direction my expectation seems to you to be predictably mistaken?”
Cotra’s model contained estimates which are as obviously BS as anchoring the size of a TRANSFORMATIVE neural net to the GENOME or the training compute to the entire evolution of life on Earth. I don’t think that I understand how Cotra even came up with these two ideas. What I do understand is how Cotra came up with estimates like 1e31 FLOP or the lifetime anchor of 1e24 FLOP, which are likely the only plausible ones in the report. As far as I understand, THESE assumptions would imply that creating TAI is easy.
In late 2022, Karnofsky wrote:
I think this is later than what you’re asking about; I also would guess that this was Karnofsky’s private belief for a while before publishing, but I’m not sure at what time.
I again don’t consider this a helpful thing to say on a sinking ship when somebody is trying to organize passengers getting to the lifeboats.
Especially if your definition of “AI takeover” is such as to include lots of good possibilities as well as bad ones; maybe the iceberg rockets your ship to the destination sooner and provides all the passengers with free iced drinks, who can say?
Hmm, my sense is Holden was meaning “AI Takeover” roughly in the “AI Ruin” sense, and as such the lower bound here did seem like a helpful thing to say (while I found this upper bound a weirdly unhelpful thing to say).
Like, I think if asked to operationalize Holden at the time would have clarified that by “AI Takeover” he means something that is really bad and catastrophic by his lights.
Yeah, this is after ChatGPT, which I do think changed the epistemic landscape a lot.
Is it? The date on the linked post is 29th Aug 2022, but ChatGPT was November or December.
Huh, it says Dec 15th for me? Like, right at the top?
Yeah but in the quote it links to his original statement on August 29th on LessWrong.
(I quoted the slightly later version on Cold Takes because it expands it from the original context to be a general statement.)
Ah, in that case, yes, seems like decent-ish data!
Ah, confusing. Looks like Ben’s comment quotes a post by Holden from 15th Dec which in turn quotes a different post by Holden from 22nd Aug, which is where the original statement about takeover odds appears.
Both posts are hyperlinked in the comment and I had clicked the latter link without noticing, but in any case yes, seems like the original statement is from pre-ChatGPT.
Edit: just realised Ben had already commented to the same effect below.
I think mainly you’re asking about OP in particular, but a side question:
Who is ‘Oxford EA’? I definitely interacted with many Oxford-based EA(-adjacent) people, though only since 2022 (pre-ChatGPT), and the range of views and agendas was broad, and included ‘AI soon, very deadly’. I’d guess you mean some more specific, smaller group (perhaps one rich in funding or otherwise influential), and I can believe that earlier views were differently distributed.
Will MacAskill could serve as exemplar. More broadly I’m thinking of people who might have called themselves ‘longtermists’ or who hybridized Bostrom with Peter Singer.
Makes sense. Toby Ord? Does Anders count? Or the actual Bostrom? I think that crowd did better than OP by quite a bit, and the wider Oxford AI safety community was quite good. I only met them months before ChatGPT. Will seems still surprisingly (from my view over-) optimistic, but is doing some pretty relevant and good work right now, and is usually careful to caveat where his optimistic assumptions are loadbearing.
Suppose you are correct and that OpenPhil did indeed believe in long timelines pre-ChatGPT. Does this reflect badly on them? It seems like a reasonable prior to me, and many senior researchers even within OA were uncertain that their methods would scale to more powerful systems.
I think if they sponsored Cotra’s work and cited it, this reflects badly on them. More on them than on Cotra, really; I am not a fan of the theory that you blame the people who were selected to have an opinion or incentivised to have an opinion, so much as the people who did the selection and incentivization. See https://www.lesswrong.com/posts/ax695frGJEzGxFBK4/biology-inspired-agi-timelines-the-trick-that-never-works, which I think stands out as clearly correct in retrospect, for why their analysis was obviously wrong at the time. And I did in that case take the trouble to explain why their whole complicated analysis was bogus, and my model is that this clearly-correct-in-retrospect critique had roughly zero impact or effect on OpenPhil; and that is what I expected and predicted in advance, which is why I did not spend more effort trying to redeem an organization I modeled as irredeemably broken.
Do you find Daniel Kokotajlo’s subsequent work advocating for short timelines valuable? I ask because I believe that he sees/saw his work as directly building on Cotra’s[1].
I think the bar for work being a productive step in the conversation is lower than the bar for it turning out to be correct in hindsight or even its methodology being highly defensible at the time.
Is your position more, “Producing such a model was a fine and good step in the conversation, but OP mistakenly adopted it to guide their actions,” or “Producing such a model was always going to have been a poor move”?
I remember a talk in 2022 where he presented an argument for 10 year timelines, saying, “I stand on the shoulders of Ajeya Cotra”, but I’m on mobile and can’t hunt down a source. Maybe @Daniel Kokotajlo can confirm or disconfirm.
Indeed I do think of it that way.
Is your take “Use these different parameters and you get AGI in 2028 with the current methods”?
As far as Kokotajlo’s memory can be trusted after the ChatGPT moment, he thought that there would be a 50% chance to reach AGI in 2030.
If you can get that or 2050 equally well off yelling “Biological Anchoring”, why not admit that the intuition comes first and then you hunt around for parameters you like? This doesn’t sound like good methodology to me.
One can apply similar methodological arguments to a different problem and test whether they persist. The number of extraterrestrial civilisations beyond the Solar System is thought to be estimable via the Drake equation. Drake’s original estimates implied that the Milky Way contains between 1K and 100M civilisations. The only ground truth that we know is the fact that we have yet to find any reliable evidence of such civilisations. But I don’t understand where the equation itself is erroneous.
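For concreteness, here is a minimal sketch of that structure (the low/high parameter values are illustrative stand-ins of my own, not Drake’s exact original numbers): the multiplicative form is unobjectionable; the wide ranges on each factor are what produce the enormous spread.

```python
import math

# Drake equation: N = R* * fp * ne * fl * fi * fc * L
# Illustrative low/high values, to show that parameter uncertainty rather than the
# equation's structure is what makes the answer span many orders of magnitude.
low  = dict(R=1.0,  fp=0.2, ne=1.0, fl=0.1, fi=0.1, fc=0.1, L=1e3)
high = dict(R=10.0, fp=0.5, ne=5.0, fl=1.0, fi=1.0, fc=0.2, L=1e8)

N_low, N_high = math.prod(low.values()), math.prod(high.values())
print(f"N somewhere between ~{N_low:.0e} and ~{N_high:.0e} civilisations")
# ~2e-01 to ~5e+08 under these stand-in ranges: the same kind of spread you get from
# Cotra's model when the key parameters are poorly pinned down.
```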
Returning to the AI timeline crux, Cotra’s idea was the following: TAI is created once someone spends enough compute. Her main model was that compute_required(t) = compute_under_2020_knowledge / knowledge_factor(t), while compute_affordable increases exponentially until it hits bottlenecks related to the world’s economy. The estimates of compute_affordable required mankind only to keep track of who produces compute, how it is produced, and who is willing to pay for it. A similar procedure was done in the AI-2027 compute forecast.
Then Cotra proceeded to wildly misestimate the parameters. Her idea of the knowledge factor was that it makes creating TAI twice as easy every 2-3 years, which I find doubtful for reasons described in the collapsed sections. Cotra’s ideas on compute_under_2020_knowledge are total BS for reasons I detailed in another comment. So I fail to see where Cotra’s model was mistaken aside from using parameters that are total BS; and if the model really were correct aside from the BSed parameters, correcting those parameters would have been the natural move.
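A stripped-down sketch of that structure, using toy numbers of my own rather than Cotra’s actual distributions, shows how completely the answer is driven by the 2020-compute parameter rather than by the crossing mechanic itself:

```python
def tai_year(compute_2020_flop, halving_years=2.5,
             affordable_2020_flop=1e24, affordable_growth_per_year=10 ** 0.5):
    """Toy Cotra-style crossing model (all defaults are made-up illustrative values):
    TAI arrives when affordable compute, growing ~0.5 OOM/year here, first exceeds
    required compute, which halves every `halving_years` as algorithms improve."""
    for t in range(80):
        required = compute_2020_flop / 2 ** (t / halving_years)
        affordable = affordable_2020_flop * affordable_growth_per_year ** t
        if affordable >= required:
            return 2020 + t
    return None

# Sweep the 2020 training-compute requirement across anchors like those in the report.
for anchor_flop in (1e28, 1e31, 1e34, 1e38, 1e41):
    print(f"2020 requirement {anchor_flop:.0e} FLOP -> crossing year ~{tai_year(anchor_flop)}")
# Under these toy growth assumptions the answer moves from ~2027 to ~2048 as the
# 2020 requirement moves across the report's 13-OOM spread of anchors: the
# parameter choices, not the crossing mechanic, do essentially all the work.
```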
Cotra’s rationalisation for the TAI to become twice as easy to create every few years
I consider two types of algorithmic progress: relatively incremental and steady progress from iteratively improving architectures and learning algorithms, and the chance of “breakthrough” progress which brings the technical difficulty of training a transformative model down from “astronomically large” / “impossible” to “broadly feasible.”
For incremental progress, the main source I used was Hernandez and Brown 2020, “Measuring the Algorithmic Efficiency of Neural Networks.” The authors reimplemented open source state-of-the-art (SOTA) ImageNet models between 2012 and 2019 (six models in total). They trained each model up to the point that it achieved the same performance as AlexNet achieved in 2012, and recorded the total FLOP that this required. They found that the SOTA model in 2019, EfficientNet B0, required ~44 times fewer training FLOP to achieve AlexNet performance than AlexNet did; the six data points fit a power law curve with the amount of computation required to match AlexNet halving every ~16 months over the seven years in the dataset. They also show that linear programming displayed a similar trend over a longer period of time: when hardware is held fixed, the time in seconds taken to solve a standard basket of mixed integer programs by SOTA commercial software packages halved every ~13 months over the 21 years from 1996 to 2017.
Grace 2013 (“Algorithmic Progress in Six Domains”) is the only other paper attempting to systematically quantify algorithmic progress that I am currently aware of, although I have not done a systematic literature review and may be missing others. I have chosen not to examine it in detail because a) it was written largely before the deep learning boom and mostly does not focus on ML tasks, and b) it is less straightforward to translate Grace’s results into the format that I am most interested in (“How has the amount of computation required to solve a fixed task decreased over time?”). Paul is familiar with the results, and he believes that algorithmic progress across the six domains studied in Grace 2013 is consistent with a similar but slightly slower rate of progress, ranging from 13 to 36 months to halve the computation required to reach a fixed level of performance.
This means that the compute required to match AlexNet was halving every ~16 months, and every ~13 months for linear programming. While Claude Opus 4.5 does seem to think that Paul’s belief is close to what Grace’s paper implies, the paper’s relevance is likely undermined by Cotra’s own criticism. Next, Cotra listed the actual assumptions for each of the models, including the two clearly BSed ones. I marked her ideas on when the required compute is halved in bold (a quick arithmetic check on the tension follows after the quoted assumptions):
Cotra’s actual assumptions
I assumed that:
Training FLOP requirements for the Lifetime Anchor hypothesis (red) are halving once every 3.5 years and there is only room to improve by ~2 OOM from the 2020 level—moving from a median of ~1e28 in 2020 to ~1e26 by 2100.
Training FLOP requirements for the Short horizon neural network hypothesis (orange) are halving once every 3 years and there is room to improve by ~2 OOM from the 2020 level—moving from a median of ~1e31 in 2020 to ~3e29 by 2100.
Training FLOP requirements for the Genome Anchor hypothesis (yellow) are halving once every 3 years and there is room to improve by ~3 OOM from the 2020 level—moving from a median of ~3e33 in 2020 to ~3e30 by 2100.
Training FLOP requirements for the Medium-horizon neural network hypothesis (green) are halving once every 2 years and there is room to improve by ~3 OOM from the 2020 level—moving from a median of ~3e34 in 2020 to ~3e31 by 2100.
Training FLOP requirements for the Long-horizon neural network hypothesis (blue) are halving once every 2 years and there is room to improve by ~4 OOM from the 2020 level—moving from a median of ~1e38 in 2020 to ~1e34 by 2100.
Training FLOP requirements for the Evolution Anchor hypothesis (purple) are halving once every 2 years and there is room to improve by ~5 OOM from the 2020 level—moving from a median of ~1e41 in 2020 to ~1e36 by 2100.
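As a quick back-of-the-envelope check on the tension between the cited evidence and the quoted assumptions, using only the numbers that appear above:

```python
import math

# Hernandez & Brown: ~44x less training compute to match AlexNet over the 7 years 2012-2019.
implied_halving_months = 7 * 12 / math.log2(44)
print(f"Implied halving time from the cited evidence: ~{implied_halving_months:.0f} months")  # ~15

# The quoted assumptions above instead use halving times of 2 to 3.5 years.
for halving_years in (2, 3, 3.5):
    raw_ooms_by_2100 = (2100 - 2020) / halving_years * math.log10(2)
    print(f"Halving every {halving_years} years -> ~{raw_ooms_by_2100:.0f} OOM of raw halvings by 2100")
# Roughly 7-12 OOM of raw halvings over 80 years, which the report then caps at
# only ~2-5 OOM of allowed improvement depending on the anchor.
```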
FWIW, I continue to think your models here are obviously not building on Cotra’s thing, and think something pretty weird is going on when you say they do. Which is not like catastrophic, but I think the credit allocation here feels quite weird.
I think OpenPhil was guided by Cotra’s estimate and promoted that estimate. If they’d labeled it: “Epistemic status: Obviously wrong but maybe somebody builds on it someday” then it would have had a different impact and probably not one I found objectionable.
Separately, I can’t imagine how you could build something not-BS on that foundation and if people are using it to advocate for short timelines then I probably regard that argument as BS and invalid as well.
Except that @Daniel Kokotajlo wrote an entire sequence where the only post published after the ChatGPT moment is this one. Kokotajlo’s sequence was supposed to explain that Cotra’s distribution of training compute for a TAI created by 2020's ideas is biased towards requiring far more compute than is actually needed.
Kokotajlo’s quote related to Cotra’s errors
Ajeya’s timelines report is the best thing that’s ever been written about AI timelines imo. Whenever people ask me for my views on timelines, I go through the following mini-flowchart:
1. Have you read Ajeya’s report?
--If yes, launch into a conversation about the distribution over 2020's training compute and explain why I think the distribution should be substantially to the left, why I worry it might shift leftward faster than she projects, and why I think we should use it to forecast AI-PONR instead of TAI.
--If no, launch into a conversation about Ajeya’s framework and why it’s the best and why all discussion of AI timelines should begin there.
However, Kokotajlo’s comment outright claims that everyone’s timelines should be a variation of Cotra’s model.
Kokotajlo praising Cotra’s model
So, why do I think it’s the best? Well, there’s a lot to say on the subject, but, in a nutshell: Ajeya’s framework is to AI forecasting what actual climate models are to climate change forecasting (by contrast with lower-tier methods such as “Just look at the time series of temperature over time / AI performance over time and extrapolate” and “Make a list of factors that might push the temperature up or down in the future / make AI progress harder or easier,” and of course the classic “poll a bunch of people with vaguely related credentials”).
There’s something else which is harder to convey… I want to say Ajeya’s model doesn’t actually assume anything, or maybe it makes only a few very plausible assumptions. This is underappreciated, I think. People will say e.g. “I think data is the bottleneck, not compute.” But Ajeya’s model doesn’t assume otherwise! If you think data is the bottleneck, then the model is more difficult for you to use and will give more boring outputs, but you can still use it. (Concretely, you’d have 2020's training compute requirements distribution with lots of probability mass way to the right, and then rather than say the distribution shifts to the left at a rate of about one OOM a decade, you’d input whatever trend you think characterizes the likely improvements in data gathering.)
The upshot of this is that I think a lot of people are making a mistake when they treat Ajeya’s framework as just another model to foxily aggregate over. “When I think through Ajeya’s model, I get X timelines, but then when I extrapolate out GWP trends I get Y timelines, so I’m going to go with (X+Y)/2.” I think instead everyone’s timelines should be derived from variations on Ajeya’s model, with extensions to account for things deemed important (like data collection progress) and tweaks upwards or downwards to account for the rest of the stuff not modelled.
What Kokotajlo likely means by claiming that “Ajeya’s timelines report is the best thing that’s ever been written about AI timelines imo” is that Cotra’s model was a good model whose parameter values are total BS. Her report consists of four parts. In my opinion, the two clearest examples of BS are the genome anchor, which assumed the transformative model would “have about as many parameters as there are bytes in the human genome (~7.5e8 bytes)”, and the evolution anchor, which claimed “that training computation requirements will resemble the amount of computation performed in all animal brains over the course of evolution from the earliest animals with neurons to modern humans”. This is outright absurd, since each animal has its own brain which is trained independently of the others, and yet the human brain’s architecture and lifetime of training are by themselves enough to give a chance of creating an AI that does what a human genius can do.
I find that position weirdly harsh. Sure, if you’re just answering anaguma’s question as a binary (“does it reflect well or poorly, regardless of magnitude?”), that could make sense. (Note to readers: This would mean that the quote I started this comment with should be regarded as taken out of context!) But seeing it as reflecting badly at a high magnitude is the judgment I’d consider weirdly harsh.
I’m saying that as someone who has very little epistemic respect for people who think AI ruin is only about 10% likely—I consider people who think that biased beyond hope.
But back to the timelines point:
It’s not like Bioanchors was claiming high confidence in its modelling assumptions or resultant timelines. At the time, a bunch of people in the broader EA ecosystem had even longer timelines, and Bioanchors IIRC took a somewhat strong stance against assigning significant probability mass to AGI arriving after 2100, which some EAs at least considered non-obvious. Seen in that context, it contributed to people updating in the right direction. The report also contained footnotes pointing out that advisors held in high regard by Ajeya had shorter timelines based on specific thoughts on horizon lengths or whatever, so the report was hedging towards shorter timelines. Factoring that in, it aged less poorly than it would have if we weren’t counting those footnotes. Ajeya also posted an update 2 years later in which she shortened her timelines a bunch. If it takes orgs only 2 years to update significantly in the right direction, are they really hopelessly broken?
FWIW, I’m leaning towards you having been right about the critique (credit for sticking your neck out). But why is sponsoring or citing work like that such a bad sign? Sure, if they cited it as particularly authoritative, that would be different. But I don’t feel like Open Phil did that. (This seems like a crux based on your questions in the OP and your comments here; my sense from reading other people’s replies, and also from the less informed impressions I got from interacting with some Open Phil staff on a few short occasions, is that you were overestimating the degree to which Open Phil was attached to specific views.)
For comparison, I think Carlsmith’s report on power-seeking was a lot worse in terms of where its predictions landed, so I’d have more sympathy if you pointed to it as an example of what reflects poorly on Open Phil (just want to flag that Carlsmith is my favorite philosophy writer in all of EA). However, also there, I doubt the report was particularly influential within Open Phil, and I don’t remember it being promoted as such. Also, I would guess that the pushback it received from many sides would have changed their evaluation of the report after it was written, if they had initially been more inclined to update on it. I mean, that’s part of the point of writing/publishing reports like that.
Sure, maybe Open Phil was doing a bunch of work directed more towards convincing outside skeptics that what they’re doing is legitimate/okay rather than doing the work “for themselves”? If so, that’s a strategic choice… I can see it leading to biased epistemics, but in a world where things had gone better, maybe it would have gotten more billionaires on board with their mission of giving? And it’s not like doing the insular MIRI thing that you all had been doing before the recent change to get into public comms was risk-free for internal epistemics either. There are risks on both ends of the spectrum: outward-looking, deferring to many experts or at least “caring whether you can convince them”; and inward-looking, with a small/shrinking circle of people whose research opinions you respect.
On whether some orgs are/were hopelessly broken: it’s possible. I feel sad about many things having aged poorly and I feel like the EA movement has done disappointingly poorly. I also feel like I’ve heard once or twice Open Phil staff saying disappointingly dismissive things about MIRI (even though many of the research directions there didn’t age well either).
I don’t have a strong view on Open Phil anymore—it used to be that I had one (and it was positive), so I have become more skeptical. Maybe you’re picking up a real thing about Open Phil’s x-risk-focused teams having been irredeemably biased or clouded in their approaches. But insofar as you are, I feel like you’ve started with unfortunate examples that, at least to me, don’t ring super true. (I felt prompted to comment because I feel like I should be well-dispositioned to sympathize with your takes given how disappointed I am at the people who still think there’s only a 10% AI ruin chance.)
At a meta level, “publishing, in 2025, a public complaint about OpenPhil’s publicly promoted timelines and how those may have influenced their funding choices” does not seem like it serves any defensible goal.
Let’s suppose the underlying question is “why did OpenPhil give money to OpenAI in 2017”. (Or, conversely, not give money to some other venture in a similar timeframe). Why is this, currently, significantly important? What plausible goal is served by trying to answer this question more precisely?
If it’s because they had long timelines, it tells you that short timeline arguments were not effective, which hopefully everyone already knows. This has been robustly demonstrated across most meaningful groups of people controlling either significant money or government clout. It is not information. I would not update on this.
If they did this because they had short timelines, they believed in whatever Sam was selling for that. I would not update on this either. It is hopefully well understood, by now, that Sam is good at selling things. “You could parachute him into an island full of cannibals and come back in 5 years and he’d be the king.”
If they did this for non-timeline reasons, I might update on, idk, some nebulous impression about how OpenPhil’s bureaucracy worked before the year 2020 or so. How good Sam (or another principal) was at convincing people to give them money. I don’t see how this is an important fact about the world.
Generally my model is that when people do not seem to be behaving optimally, they are behaving close to optimal for something, but that something is not the goal I imagine they are pursuing. I am imagining a goal like “being able to influence future events more effectively”, but I can’t see how that’s served here, so I imagine we’re optimizing for something else.
People ask me questions. I answer them honestly, not least because I don’t have the skill to say “I’m not answering that” without it sending some completely different set of messages. Saying a bunch of stuff in private without giving anyone a chance to respond to what I’m guessing about them is deontologically weighed-against by my rules, though not forbidden depending on circumstances. I do not do this in hopes any good thing results, but then acts with good consequences are few and far between in any case, these days.
Clearly there is some value to thinking about past mistakes and getting an accurate retelling of history. In the best case you can identify the mistakes that generated those bad ideas in the past and fix them.
That shouldn’t be a reason not to do a thing (by itself, all else equal).