Relitigating the Race to Build Friendly AI
Recently I’ve been relitigating some of my old debates with Eliezer, to right the historical wrongs. Err, I mean to improve the AI x-risk community’s strategic stance. (Relevant to my recent theme of humans being bad at strategy—why didn’t I do this sooner?)
Of course the most central old debate was over whether MIRI’s circa 2013 plan, to build a world-altering Friendly AI[1], was a good one. If someone were to defend it today, I imagine their main argument would be that back then, there was no way to know how hard solving Friendliness/alignment would be, so it was worth a try in case it turned out to be easy. This may seem plausible because new evidence about the technical difficulty of alignment was the main reason MIRI pivoted away from their plan, but I want to argue that actually even without this information, there were good enough arguments back then to conclude that the plan was bad:
MIRI was rolling their own metaethics (deploying novel or controversial philosophy) which is not a good idea even if alignment turned out to be not that hard in a narrow technical sense.
The plan was very risky given the possibility of illegible safety problems. What were the chances that a small team would be able to find and make legible all of the relevant problems in time? Even if alignment was actually easy and had no hidden traps, there was no way that a small team could reach high enough justified confidence in this to justify pushing the “launch” button, making the plan either pointless (if the team was rational/cautious enough to ultimately not push the button), or reckless (if the team would have pushed the button anyway).
If otherwise successful, the plan would have caused a small group to have world-altering (or world-destroying) power, somewhere along the way.
Most of the world would not have trusted MIRI (or any similar group) to do this, if they were informed, so MIRI would have had to break some widely held ethical constraints. (This is the same argument behind the current Statement on Superintelligence, that nobody should be building SI without “1. broad scientific consensus that it will be done safely and controllably, and 2. strong public buy-in.”)
It predictably inspired others (such as DeepMind and OpenAI) to join the race[2], and made it very difficult for voices calling for AI pause/stop to attract attention and resources.
(The main rhetorical innovation in my current arguments that wasn’t available back then is the concept of “illegible safety problems”, but the general idea that there could be hidden traps that a small team could easily miss had been brought up, or should have been obvious to MIRI and the nearby community.)
Many of these arguments are still relevant today, considering the plans of the remaining and new race participants, but are not well known due to historical reasons (i.e., MIRI and its supporters argued against them to defend MIRI’s plan, so they were never established as part of the LW consensus or rhetorical toolkit). This post is in part an effort to correct this, and help shift the rhetorical strategy away from putting everything on technical alignment difficulty.
(This post was pulled back into draft, in order to find more supporting evidence for my claims, which also gave me a chance to articulate some further thoughts.)
My regular hobby horse in recent years has been how terrible humans are at making philosophical progress relative to our ability to innovate in technology, how terrible AIs may also be at this (or even worse, in the same relative sense), and how this greatly contributes to x- and s-risks. But I’ve recently come to realize (or pay more attention to) how terrible we also are at strategic thinking, and how terrible AIs may also be at this (in a similar relative sense), which may be an even greater contribution to x- and s-risks.[3]
(To spell this out more, if MIRI’s plan was in fact a bad one, even from our past perspective, why didn’t more people argue against it? Wasn’t there anyone whose day job was to think strategically about how humanity should navigate complex and highly consequential future technologies/events like the AI transition, and if so, why weren’t they trying to talk Eliezer/MIRI out of what they were planning? Either way, if you were observing this course of history in an alien species, how would you judge their strategic competence and chances of successfully navigating such events?)
A potential implication from all of this is that improving AI strategic competence (relative to their technological abilities) may be of paramount importance (so that they can help us with strategic thinking and/or avoid making disastrous mistakes of their own), but this is clearly even more of a double-edged blade than AI philosophical competence. Improving human strategic thinking is more robustly good, but suffers from the same lack of obvious tractability as improving human philosophical competence. Perhaps the conclusion remains the same as it was 12 years ago: we should be trying to pause or slow down the AI transition to buy time to figure all this out.
- ^
This was edited from “to build a Friendly AI to take over the world in service of reducing x-risks” after discussion with @habryka and @jessicata. Jessica also found this passage to support this claim: “MIRI co-founder Eliezer Yudkowsky usually talks about MIRI in particular — or at least, a functional equivalent — creating Friendly AI.” (Interestingly, what was common knowledge on LW just 12 years ago now requires hard-to-find evidence to establish.)
- ^
According to the linked article, Shane Legg was introduced to the idea of AGI through a 2000 talk by Eliezer, and then co-founded DM in 2010 (following an introduction by Eliezer to investor Peter Thiel, which is historically interesting, especially as to Eliezer’s motivations for doing so, which I’ve been unable to find online). I started arguing against SIAI/MIRI’s plan to build FAI in 2004: “Perhaps it can do more good by putting more resources into highlighting the dangers of unsafe AI, and to explore other approaches to the Singularity, for example studying human cognition and planning how to do IA (intelligence amplification) once the requisite technologies become available.”
- ^
If we’re bad at philosophy but good at strategy, we can do things like realize the possibility of illegible x-risks (including ones caused by philosophical errors), and decide to stop or slow down the development of risky technologies on this basis. If we’re good at philosophy but bad at strategy, we might avoid making catastrophic philosophical errors but still commit all kinds of strategic errors in the course of making highly consequential decisions.
I feel like someone should be arguing the other side, and no one else has stepped up, so I guess I’ll have a go. :-P This comment will be like 75% my honest opinions and 25% devil’s advocate. Note that I wasn’t around at the time, sorry for any misunderstandings.
I think your OP does some conflation of (1) “Eliezer was trying to build FAI” with (2) “Eliezer was loudly raising the salience of ASI risk (and thus incidentally the salience of ASI in general and how big a deal ASI is), along with related community-building etc.”. But these are two somewhat separate decisions that Eliezer made.
For example, you summarize an article as claiming “Shane Legg was introduced to the idea of AGI through a 2000 talk by Eliezer, and then co-founded DM in 2010 (following an introduction by Eliezer to investor Peter Thiel…)” Those seem to be (2) not (1), right? Well, I guess the 2000 talk is neither (1) nor (2) (Eliezer didn’t yet buy AI risk in 2000), but more generally, MIRI could have directly tried to build FAI without Eliezer giving talks and introducing people, and conversely Eliezer could have given talks and introduced people without MIRI directly trying to build FAI.
So I’m skeptical that (1) (per se) contributed nontrivially to accelerating the race to ASI. For example, I’d be surprised if Demis founded DeepMind partly because he expected MIRI to successfully build ASI, and wanted to beat them to it. My guess is the opposite: Demis expected MIRI to fail to build powerful AI at all, and saw it as a safety outfit not doing anything relevant from a capabilities perspective. After all, DeepMind pursued a very different technical research direction.
On the other hand, I think there’s at least a strong prima facie case that (2) shortened timelines, which is bad. Then again, (2) also helped build the field of alignment, which is good. So overall, how do we feel about (2)? I dunno. You yourself seemed to be endorsing (2) in 2004 (“…putting more resources into highlighting the dangers of unsafe AI…”). For my part, I have mixed feelings, but by default I tend to be in favor of (2) for kinda deontological reasons (if people’s lives are at risk, it’s by default good to tell them). But (2) is off-topic anyway; the thing you’re re-litigating is (1), right?
OK next, let’s talk about intelligence augmentation (IA), per your other comment proposal: “Given that there are known ways to significantly increase the number of geniuses (i.e., von Neumann level, or IQ 180 and greater), by cloning or embryo selection, an obvious alternative Singularity strategy is to invest directly or indirectly in these technologies, and to try to mitigate existential risks (for example by attempting to delay all significant AI efforts) until they mature and bear fruit (in the form of adult genius-level FAI researchers).”
There are geniuses today, and they mostly don’t work on FAI. Indeed, I think existing geniuses have done more to advance UFAI than FAI. I think the obvious zeroth-order model is that a world with more geniuses would just have all aspects of intellectual progress advance more rapidly, including both capabilities and alignment. So we’d wind up in the same place (i.e. probably doom), just sooner.
What would be some refinements on that zeroth-order model that make IA seem good?
One possible argument: “Maybe there’s a kind of ‘uncanny valley’ of ‘smart enough to advance UFAI but not smart enough to realize that it’s a bad idea’. And IA gets us a bunch of people who are all the way across the valley”. But uncanny-valley-theory doesn’t seem to fit the empirical data, from my perspective. When I look around, “raw intelligence” vs “awareness of AI risk and tendency to leverage that understanding into good decisions” seem somewhat orthogonal to me, as much as I want to flatter myself by thinking otherwise.
Another possible argument: “Maybe it’s not about the tippy-top of the intelligence distribution doing research, but rather the middle of the distribution, e.g. executives and other decisionmakers making terrible decisions”. But realistically we’re not going to be creating tens of millions of geniuses before ASI, enough to really shift the overall population distribution. Note that there are already millions of people smarter than, say, Donald Trump, but they’re not in charge of the USA, and he is. Ditto Sam Altman, etc. There are structural reasons for that, and those reasons won’t go away when thousands of super-geniuses appear on the scene.
Another possible argument: “If awareness of x-risk, good decision-making, etc., relies partly on something besides pure intelligence, e.g. personality … well OK fine, we can do embryo-selection etc. on both intelligence and (that aspect of) personality.” I’m a bit more sympathetic to this, but the science to do that doesn’t exist yet (details). (I might work on it at some point.)
[[2025-12-13 UPDATE: …And even if this problem is solved, i.e. it becomes technically possible to engineer a child’s personality, there is no reason to expect people to actually make more babies with truth-seeking personalities who would help with x-risk, out of proportion to the number of extra baby Sam Altmans etc. who would make the situation worse. Again, the zeroth-order model above seems a better bet.]]
So that’s the IA possibility, which I don’t think changes the overall picture much. And now I’ll circle back to your five-point list. I already addressed the fifth. I claim that the other four are really bad things about our situation that we have basically no hope of avoiding.

On my models, ASI doesn’t require much compute, just ideas, and people are already making progress developing those ideas. On the margin we can and should try to delay the inevitable, but ultimately someone is going to build it (and then probably everyone dies).

If it gets built in a more democratic and bureaucratic way, like by some kind of CERN for AI, then there are some nice things to say about that from the perspective of ethical procedure, but I don’t expect a better actual outcome than MIRI-of-2010 building it. Probably much worse. The project will still be rolling its own metaethics (at best!), the project will still be ignoring illegible safety problems, the project will almost definitely still involve key personnel winding up in a position to grab world-altering power, and the project will probably still be subjecting the whole world to dire risk by doing something that most of the world doesn’t want them to do. (Or if they pause to wait for global consensus, then someone else will build it in the meantime.) We still have all those problems, because those problems are unavoidable, alas.
I kinda thought the sales pitch for “build a Friendly AI to take over the world in service of reducing x-risks”[1] was: This is a very bad plan, but every other possible plan is even worse.
For example, the “dath ilan” stuff is IIUC what Eliezer views as an actually good approach to ASI x-risk, and it sure doesn’t look anything like that.
Anyway, if that’s the pitch, you can’t argue against it by listing reasons why the plan is very bad, right? We already know the plan is very bad. Instead you need to compare two alternatives.
(For example, if the alternative is “nobody on Earth builds ASI”, then the crux is feasibility. IIUC Eliezer’s perspective (today) is: nope that’s not feasible, the best we can hope for is to buy some time before ASI, like maybe up to an extra decade or two, and then after that we’d still need a different plan.)
Anyway, what’s your preferred alternative? I’m sure you’ve written about it somewhere but I think it belongs in this post too, to allow a side-by-side comparison.
Did you mean to write “build a Task AI to perform a pivotal act in service of reducing x-risks”? Or did MIRI switch from one to the other at some point early on? I don’t know the history. …But it doesn’t matter, my comment applies to both.
Good point. It’s in my first link, but I should probably put it in the current post somewhere. Here it is in the meantime:
@Steven Byrnes In case I accidentally give the wrong impression that this was my original idea, Vernor Vinge had talked about it in his 1993 essay The Coming Technological Singularity: How to Survive in the Post-Human Era as one of several possible routes to the Singularity, and Carl Shulman talked about “iterated embryo selection” in 2009, which is where I got that term from. In retrospect, I think reading Vinge in my youth (college years) gave me far too high an expectation of humanity, e.g., I thought there would be entire academic disciplines devoted to studying questions like which route to the Singularity humanity should take.
Do you recall anything about why arguments for HIA didn’t pick up steam?
I just came across @lukeprog’s (Executive Director of MIRI at the time) Intelligence Amplification and Friendly AI (while looking for something else), which seems to be a reply to Justin Shovelain’s doc and my post.
Under Some Thoughts on Singularity Strategies (the first link in my OP), I commented:
I did not pursue the HIA-first argument myself much after that, as it didn’t seem to be my comparative advantage at the time, and it seemed like @JustinShovelain’s efforts were picking up steam. I’m not sure what happened afterwards, but it would be rather surprising if it didn’t have something to do with Eliezer’s insistence and optimism on directly building FAI at the time (which is largely incompatible with “IA first”), but I don’t have any direct evidence of this. I wasn’t in any physical rationalist communities, and don’t recall any online discussions of Justin’s document after this.
ETA: The same comment quoted a passage from Eliezer saying that he considered and rejected “IA first” which probably also directly influenced many people who deferred to him on AI x-risk strategy.
To be fair, it was a fairly tepid rejection, mainly saying “FAI is good too”, but yeah I was surprised to see that.
(I think at the time (2011-2021), if asked, I would have said that IA is not my comparative advantage compared to FAI. This was actually mistaken, but only because almost no one was working seriously on IA. I would have read that Yudkowsky paper, but I definitely don’t recall that passage, and generally had the impression that Yudkowsky’s position was “HIA is good, I just happen to be working on FAI”.)
...Ok now that I think about it, I’m just now recalling several conversations in the past few years, where I’m like “we should have talent / funding for HIA” and the other person is like “well shouldn’t MIRI do that? aren’t they working on that?” and I’m like “what? no? why do you think that?”—which suggests an alternative cause for people not working on HIA (namely, that false impression).
Eugenics is not memetically fit; it makes your grandma gasp and frown; it does not quickly directly benefit the people who make it happen; it does not have the exciting element of a sudden moment.
That shouldn’t be sufficient to explain why it doesn’t gain steam within LW / rationalist / EA / X-derisking circles. I mean, you’re right that it does explain that, practically speaking, because people think it will be hard due to memetic unfitness. But I think they are visibly mistaken about that, and it’s actually fairly obvious, so it’s still a strategic blunder.
I believe that there was an intentional switch, around 2016 (though I’m not confident in the date), from aiming to design a Friendly CEV-optimizing sovereign AI, to aiming to design a corrigible minimal-Science-And-Engineering-AI to stabilize the world (after which a team of probably-uploads could solve the full version of Friendliness and kick off a foom.)
How much was this MIRI’s primary plan? Maybe it was 12 years ago before I interfaced with MIRI? But like, I have hung out with MIRI researchers for an average of multiple hours a week for something like a decade, and during that time period the plan seemed to basically always centrally be:
Try to make technical progress on solving the alignment problem
While trying to create a large public intellectual field that can contribute to solving that problem
While trying to improve the sanity of key decision-makers who will make a bunch of high-stakes decisions involved in AGI
This also seems to me like centrally the strategy I picked up from the sequences, so it must be pretty old.
There was a period of about 4-5 years where research at MIRI pivoted to a confidential-by-default model, and it’s plausible to me that during that period, which I understand much less well, much more of MIRI’s strategy was oriented around doing this.
That said, it seems like Carl Shulman’s prediction from 14 years ago was borne out pretty well:
After MIRI did a bunch of confidential research, possibly in an attempt to maybe just build an aligned AI system, they realized this wasn’t going to work, then did a “halt, melt, and catch fire” move, and switched gears.
Rereading some of the old discussions in the posts you linked, I think I am more sold than I was previously that this was a real strategic debate at the time, and a bunch of people tried to argue in favor of just going and building it, and explicitly against pursuing strategies like human intelligence augmentation, which now look like much better bets to me.
To their credit, many of the people did work on both, and were pretty clear that they really weren’t sure whether the “solving the problem head on” part would work out, and that they thought it would be reasonable for people to pursue other strategies, and that they themselves would pivot if that became clear to them later on. Eliezer, in a section of the paper you yourself quoted 14 years ago, says:
Like, IDK, this really doesn’t seem like particularly high confidence, and while I agree with you that in retrospect you deserve some Bayes-points for calling this at the time, I don’t think Eliezer loses that many, as it seems like all throughout he placed substantial probability on your perspective being more right here.
Reposting this comment of mine from a few years ago, which seems germane to this discussion, but certainly doesn’t contradict the claim that this hasn’t been their plan in the past 12 years.
Here is a video of Eliezer, first hosted on Vimeo in 2011. I don’t know when it was recorded.
[Anyone know if there’s a way to embed the video in the comment, so people don’t have to click out to watch it?]
He states explicitly:
And later in the video he says:
It was Yudkowsky’s plan before MIRI was MIRI
http://sl4.org/archive/0107/1820.html
“Creating Friendly AI”
https://intelligence.org/files/CFAI.pdf
Both from 2001.
What about the “Task AGI” and “pivotal act” stuff? That was at the very least, advising others to think seriously about using aligned AI to take over the world, on the basis that the world was otherwise doomed without a pivotal act. Then there was the matter of how much leverage MIRI thought they had as an organization, which is complicated by the confidentiality.
Plausible! Do you have a link handy? Seems better for the conversation to be grounded in an example, and I am not sure exactly which things you are referencing here.
On Arbital. Task directed AGI and Pivotal act.
Offline, at MIRI there were discussions of possible pivotal acts, such as melting all GPUs. I suggested “what about using AI to make billions of dollars” and the response was “no it has to be much bigger than that to fix the game board”. There was some gaming of e.g. AI for uploading or nanotech. (Again, unclear how much leverage MIRI thought they had as an organization)
Hmm, maybe I am misunderstanding this.
The “Task AGI” article is about an approach to building AGI that is safer than building a sovereign, published on the open internet. I do not disagree that MIRI was working on trying to solve the alignment problem (as I say above, that is what two of the bullet points of my summary of their strategy are about), which this seems to be an attempt at making progress on. It doesn’t seem to me to be much evidence for “MIRI was planning to build FAI in their basement”. Yes, my understanding is that MIRI is expecting that at some point someone will build very powerful AI systems. It would be good for them to know how to do that in a way that has good consequences instead of bad. This article tries to help with that.
The “Pivotal Act” article seems similar? I mean, MIRI is still working on a pivotal act in the form of an international AI ban (subsequently followed, maybe, by an intelligence augmentation program). I am working on pivotal acts all day! It seems like a useful handle to have. I use it all the time. It does seem to frequently be misunderstood by people to mean “take over the world”, but there is no example in the linked article of anything like that. The most that the article talks about is:
Which really doesn’t sound much like a “take-over-the-world” strategy. I mean, the above still seems to me like a good plan: inasmuch as a leading lab has no choice but to pursue AGI as a result of an intense race, I would like them to give it a try. It seems terribly reckless and we are not remotely on track to doing this with any confidence, but I am in favor of people openly publishing things that other people should do if they find themselves building ASI. And again, the above bullet lists really don’t sound like “taking over the world”, so I still have trouble connecting this to the paragraph in the OP I take issue with.
None of these sound much like “taking over the world”? Like, yes, if you were to write a paper or blogpost with a plan that allowed someone to make a billion dollars with AI, that seems like it would basically do nothing, and if anything make things worse. It does seem like helpful contributions need to be of both a different type signature, and need to be much bigger than that.
I didn’t say that
At the time it was clear MIRI thought AGI was necessary for pivotal acts, e.g. to melt all GPUs, or to run an upload. I remember discussing “weak nanotech” and so on and they didn’t buy it, they thought they needed aligned task AGI to do a pivotal act.
Quoting task AGI article:
So this is acknowledging massive power concentration.
Furthermore, in the context of the disagreement with Paul Christiano, it was clear that MIRI people thought there would be a much bigger capability overhang / FOOM, such that the system did not have to be “competitive”; it could be a “limited AGI” that was WAY less efficient than it could be, because of a pre-existing capability overhang versus the competition. Which, naturally, goes along with massive power concentration.
Wait, you didn’t? I agree you didn’t say “basement” but the section of the OP I am responding to is saying:
And then you said:
The part in square brackets seems like the very clear Gricean implicature here? Am I wrong? If not, what did you mean to say in that sentence?
All the other stuff you say seems fine. I definitely agree MIRI talked about building AIs that would be very powerful, and also considered whether power concentration would be a good thing, as it would reduce race dynamics. But again, I am just talking about the part of the OP that says it was MIRI’s plan to build such a system and take over the world, themselves, “in service of reducing x-risk”. None of the above seems like much evidence for that? If you agree that this was not MIRI’s plan, then sure, we are on the same page.
See the two sentences right after.
The Gricean implicature of this is that I at least don’t think it’s clear that MIRI wanted to build an AI to take over the world themselves. Rather, they were encouraging pivotal acts generally, and there’s ambiguity about how much they were individually trying to do so.
The literal implication of this is that it’s hard for people to know how much leverage MIRI has as an organization, which implies it’s hard for them to know that MIRI wanted to take over the world themselves.
Cool, yeah. I mean, I can’t rule this out confidently, but I do pretty strongly object to summarizing this state of affairs as:
Like, at least in my ethics there is a huge enormous gulf between trying to take over the world, and saying that it would be a good idea for someone, ideally someone with as much legitimacy as possible, who is going to build extremely powerful AI systems anyways, to do this:
I go around and do the latter all the time, and think more people should do so! I agree I can’t rule out from the above that MIRI was maybe also planning to build such systems themselves, but I don’t currently find it that likely, and object to people referring to it as a fact of common knowledge.
In this post, I’m mostly talking about my debate with Eliezer more than 12 years ago, when SIAI/MIRI was still talking about building a Friendly AI (which we later described as “sovereign” to distinguish from “task” and “oracle” AI). (Or attempted or proxy debate, anyway, as I’m noticing that Eliezer himself didn’t actually respond to many of my posts/comments.)
However I believe @jessicata is right that a modified form of the plan to build a more limited “task” AI persisted quite a bit after that, probably into the time you started interfacing with MIRI. (I’m not highly motivated to dig for evidence as to exactly how long this plan lasted, as it doesn’t affect my point in the OP.) My guess as to why you got a different impression is that different MIRI people had different plans/intentions/motivations, with Eliezer being the most gung-ho on personally being involved in building some kind of world-altering AI, but also having the most power/influence at MIRI.
As a datapoint, I came into the field after reading the Sequences around 2011, as well as almost all of Yudkowsky’s other writing; then studying math and stuff in university; and then moving to the Bay in 2015. My personal impression of the strategic situation, insofar as I had one, was “AI research has already been accelerating, it’s clearly going to accelerate more and more, we can’t stop this, so we have to build the conceptual background which would allow one to build a (conceptual) function which takes as input an AI field that’s nearing AGI and gives as output a minimal AGI that can shut down AI research”. (This has many important flaws, and IDK why I thought that.)
Yudkowsky’s 2008 AI as a Positive and Negative Factor in Global Risk is a pretty good read, both for the content (which is excellent in some ways and easy to critique in others), and for the historical interest (where it’s useful to litigate the question of what MIRI was aiming at around then, and because it’s interesting how much of the dynamic Yudkowsky anticipated/missed, and because it’s interesting to inhabit 2008 for a bit and update on empirical observations since then).
What specifically is this referring to? The Mere Goodness sequences?
I read your recent post about not rolling your own metaethics as addressed mostly at current AGI or safety researchers who are trying to build or align AIs today. I had thought what you were saying was that those researchers would be better served by stopping what they are doing with AI research, and instead spend their time carefully studying / thinking about / debating / writing about philosophy and metaethics. If someone asked me, I would point to Eliezer’s metaethics sequences (and some of your posts and comments, among others) as a good place to start with that.
I don’t think Eliezer got everything right about philosophy, morality, decision theory, etc. in 2008, but I don’t know of a better / more accessible foundation, and he (and you) definitely got some important and basic ideas right, which are worth accepting and building on (as opposed to endlessly rehashing or recursively going meta on).
Is your view that it was a mistake to even try writing about metaethics while also doing technical alignment research in 2008? Or that the specific way Eliezer wrote those particular sequences is so bad / mistaken / overconfident, that it’s a central example of what you want to caution against with “rolling your own metaethics”? Or merely that Eliezer did not “solve” metaethics sufficiently well, and therefore he (and others) were mistaken to move ahead and / or turn their attention elsewhere? (Either way / regardless, I still don’t really know what you are concretely recommending people do instead, even after reading this thread.)
My position is a combination of:
1. Eliezer was too confident in his own metaethics, and in his decision theory to a lesser degree (unlike metaethics, he never considered decision theory a solved problem, but was also willing to draw stronger practical conclusions from it than I think was justified) (and probably other philosophical positions that aren’t as salient in my mind EDIT: oh yeah altruism and identity)
2. Trying to solve philosophical problems like these on a deadline with intent to deploy them into AI is not a good plan, especially if you’re planning to deploy it even if it’s still highly controversial (i.e., a majority of professional philosophers think you are wrong). This includes Eliezer’s effort as well as everyone else’s.
A couple of posts arguing for 1 above:
https://www.lesswrong.com/posts/QvYKSFmsBX3QhgQvF/morality-isn-t-logical
https://www.lesswrong.com/posts/orhEa4wuRJHPmHFsR/six-plausible-meta-ethical-alternatives
Did the above help you figure it out? If not, could you be more specific about what’s confusing you about that thread?
If the majority of professional philosophers do endorse your metaethics, how seriously should you take that?
Conversely, do you think it’s implausible that you could have correctly reasoned your way to correct metaethics, as validated by a narrower community of philosophers, but not yet have convinced everyone in the field?
The attitude of the sequences emphasizes often that most people in the world believe in god, so if you’re interested in figuring out the truth, you gotta be comfortable confidently disclaiming widely held beliefs. What do you say to the person who assesses that academic philosophy is a sufficiently broken field with warped incentives that prevent intellectual progress, and thinks that they should discard the opinion of the whole thing?
Do you just claim that they’re wrong about that, on the object level, and that hypothetical person should have more respect for the views of philosophers?
(That said, I’ll observe that there’s an important asymmetry in practice between “almost everyone is wrong in their belief of X, and I’m confident about that” and “I’ve independently reasoned my way to Y, and I’m very confident of it.” Other people are wrong != I am right.)
I think it’s not totally implausible that one could have correctly called, well in advance, that the problem itself would be too hard, without actually seeing much evidence from the results of MIRI’s and others’ attempts. I think one could consider things like “the mind is very complex and illegible” and “you have to have a good grasp on the sources of capabilities because of reflective self-modification” and “we have no idea what values are or how intelligence really works”, and maybe get justified confidence, I’m not sure.
But it seems like you’re not arguing that in this post, but instead saying that it was a bad plan even if alignment was easy? I don’t think that’s right, given the stakes and given the difficulty of all plausible plans. I think you can do the thing of trying to solve it, and not be overconfident, and if you do solve it then you use it to end acute risk, but if you don’t solve it you don’t build it. (And indeed IIUC much of the MIRI researcher blob pivoted to other things due to alignment difficulty.) If hypothetically you had a real solution, and triple quadruple checked everything, and did a sane and moral process to work out governance, then I think I’d want the plan to be executed, including “burn all the GPUs” or similar.
First note that the context of my old debate was MIRI’s plan to build a Friendly (sovereign) AI, not the later “burn all the GPUs” Task AI plan. If I were debating the Task AI plan, I’d probably emphasize the “roll your own metaethics” aspect a bit less (although even the Task AI would still have philosophical dependencies like decision theory), and emphasize more that there aren’t good candidate tasks for the AI to do. E.g. “burn all the GPUs” wouldn’t work because the AI race would just restart the day after, with everyone building new GPUs. (This is not Eliezer’s actual task for the Task AI, but I don’t remember his rationale for keeping the actual task secret, so I don’t know if I can talk about it here. I think the actual task has similar problems though.)
My other counterarguments all apply as written, so I’m confused that you seem to have entirely ignored them. I guess I’ll reiterate some of them here:
What’s a sane and moral process to work out governance? Did anyone write something down? It seems implausible to me, given other aspects of the plan (i.e., speed and secrecy). If one’s standard for “sane and moral” is something like the current Statement on Superintelligence, then it just seems impossible.
“Triple quadruple checked everything” can’t be trusted when you’re a small team aiming for speed and secrecy. There are instances where widely deployed, supposedly “provably secure” cryptographic algorithms and protocols (with proofs published and reviewable by the entire research community, which has clear incentives to find and publish any flaws) turned out years later to be insecure, because some implicit or explicit assumption in the proof (e.g., about what the attacker is allowed to do) turned out to be wrong. And that’s a much better understood, inherently simpler problem that has been studied for decades, with public adversarial review processes that mitigate human biases far better than a closed small team can.
See also items 2 and 5 in my OP.
I didn’t talk about this in the OP (due to potentially distracting from other more important points) but I think Eliezer at least was/is clearly overconfident, judging from a number of observations including his confidence in his philosophical positions. (And overconfidence is just quite hard to avoid in general.) We’re lucky in a way that his ideas for building FAI or a safe Task AI didn’t almost work out, but instead fell wide of the mark, otherwise I think MIRI itself had a high chance of destroying the world.
Well, I meant to address them in a sweeping / not very detailed way. Basically I’m saying that they don’t seem like the sort of thing that should necessarily in real life prevent one from doing a Task-ish pivotal act. In other words, yes, {governance, the world not trusting MIRI, extreme power concentration} are very serious concerns, but in real life I would pretty plausibly—depending on the specific situation—say “yeah ok you should go ahead anyway”. I take your point about takeover-FAI; FWIW I had the impression that takeover-FAI was more like a hypothetical for purposes of design-thinking, like “please notice that your design would be really bad if it were doing a takeover; therefore it’s also bad for pivotal-task, because pivotal-task is quite difficult and relies on many of the same things as a hypothetical safe-takeover-FAI”.
That’s kind of surprising (that this is your response), given that you signed the Superintelligence Statement which seems to contradict this. But I can see some ways that you can claim otherwise, so let me not press this for now and come back to it.
Since you write this in the past tense (“had the impression”), let me first clarify: are you now convinced that sovereign-FAI (I’m avoiding “takeover” due to objection from Habryka and this) was a real and serious plan, or do you want more evidence?
Assuming you’re convinced, I think you should (if you haven’t already) update more towards the view I have of Eliezer, that he is often quite seriously wrong and/or overconfident, including about very important/consequential things like high-level AI strategy. I applaud him for being able to eventually change his mind, which probably puts him in at least the 99th percentile of humanity, but by an absolute standard, the years it sometimes takes is quite costly, and then often the new position is still seriously wrong. Case in point: the sovereign-FAI idea was his second one, after changing his mind from the first, “accelerate AGI as fast as possible (the AGI will have good values/goals by default)”.
Maybe after doing this update, it becomes more plausible that his third idea (Task AGI, which I guess is the first that you personally came into contact with, and then spent years working towards) was also seriously wrong (or more seriously wrong than you think)?
I think no one should build AGI. If someone is going to build AGI anyway, then it might be correct to make AGI yourself first, if you have a way to make it actually aligned (hopefully task-ish or something).
Let me rephrase. I already believed that there had been a plan originally, like 2004 +/- 3 years, to make sovereign AI. When I was entering the field, I don’t recall thinking too hard about “what do you do with it”, other than thinking about a Task-ish thing, but with Sovereign AI as a good test case for thought experiments. I don’t know when Yudkowsky and others updated (and still don’t).
I’m still not sure where you’re getting this? I mean, there are places where I would disagree with Yudkowsky’s action-stances or something. For example, I kinda get the sense that he’s planning “as though” he has confident short timelines; I don’t think confident short timelines make sense, but that’s different from how an individual makes their plans. For example, I’m working almost exclusively on plans that only pay off after multiple decades, which looks like “very confident of long timelines”, but I’m not actually very confident of that and I say so...
For example, the rejection of HIA you quoted again seems pretty tepid, which is to say, quite non-confident, explicitly calling out non-confidence. In practice one can only seriously work on one or two different things, so I wonder if you’re incorrectly inferring confidence on his part.
I think I’m not following; I think Task AI is a bad plan, but that’s because it’s extremely difficult. I think you’re asking me to imagine that it is solved, but that we should have unknown-unknown type certainty about our solution; and I just don’t feel like I know how to evaluate that. If there was (by great surprise) some amazing pile of insights that made a safe Task-AGI seem feasible, and that stood up to comprehensive scrutiny (somehow), then it would plausibly be a good plan to actually do. I think you’re saying this is somehow overconfident anyway, and maybe I just disagree with that? But it feels hard to disagree with because it’s pretty hypothetical. If your argument is “but other people are overconfident and wrong about their alignment ideas” I think this proves too much? I mean, it seems to prove too much if applied to cryptography, no? Like, you really can make probably-quite-secure systems, though it takes a lot of work and carefulness and stuff, and the guarantees are conditional on certain mathematical conjectures and only apply to some class of attacks. But I mean, the fact that many people with a wide range of abilities could become wrongly overconfident in the security of their hand-rolled system, doesn’t prove an expert can’t know when their system is secure in certain senses.
I got a free subscription to Perplexity Pro, and used it to generate a report about this. The result seems pretty good. The short answer to “when Yudkowsky updated (from Sovereign to Task AGI)” is very likely 2014 or 2015 (see second collapsed section below).
Perplexity AI research report on “using this post and comment and linked resources as a starting point, can you build a comprehensive picture of SIAI/MIRI’s strategy over time, especially pinpointing the specific dates that its strategy changed/pivoted, backed up by the best available evidence
https://www.lesswrong.com/posts/dGotimttzHAs9rcxH/relitigating-the-race-to-build-friendly-ai”
SIAI/MIRI’s strategy over time: phases and pivot points
The Machine Intelligence Research Institute (MIRI), originally the Singularity Institute for Artificial Intelligence (SIAI), has changed strategy multiple times.
Over 25 years, the strategy has moved through roughly these phases:
2000–2003: “Build the Seed AI ourselves, as fast as possible” (accelerationist + Friendly AI).
~2003–2010: “Friendly AI and existential risk” + heavy outreach, still with a strong self‑image as the eventual FAI builder.
2011–2013: Professionalisation and pivot to technical FAI research; formal mission broadened from “build FAI” to “ensure AI turns out well,” though key people still expected MIRI (or a tight equivalent) to build a world‑altering Friendly AI.
2013–2016: Mature “world‑altering Friendly AI” / CEV vision plus growing emphasis on Task‑AGI and “pivotal acts.”
2017–2020: New engineering‑heavy, largely confidential research program aimed at new foundations for alignable AI, with a clear plan around “minimal aligned AGI” performing a pivotal act.
2020–2022: Public admission that the 2017 program had “largely failed” and a period of strategic regrouping; increasingly pessimistic public tone (e.g. AGI Ruin).
2023–present: Official pivot to policy and communications first, with research now framed as secondary support for an “international halt” to progress toward smarter‑than‑human AI.
Below is a more detailed, evidence‑based timeline, with particular attention to when strategy clearly shifted.
1. Founding: “Create a Friendly, self‑improving AI” (2000–2003)
Founding mission (July 2000).
The organization is founded on 27 July 2000 as the Singularity Institute for Artificial Intelligence by Eliezer Yudkowsky and Brian & Sabine Atkins. The mission statement on the early IRS forms and timeline is explicit:
Early technical writings like Coding a Transhuman AI and Creating Friendly AI 1.0 (2001) describe specific architectures for a transhuman AI that self‑improves while remaining benevolent. This is not just “study risks”; it is a plan to design the Seed AI.intelligence+1
Early accelerationism.
MIRI’s own 2024 retrospective states that at founding, the goal was:
So the earliest SIAI strategy can be summarized as:
Build a self‑improving Seed AI (with Friendly properties).
Speeding up AGI is seen as good, provided it is Friendly.
Approximate pivot:
By MIRI’s own later account, Eliezer’s “naturalistic awakening” — realizing that superintelligence would not automatically be moral — occurs between 2000–2003, and:
So around 2003 there is a conceptual pivot from “accelerate AGI, assume it’s benevolent” to “AI alignment is a hard, central problem.”
2. Mid‑2000s: Friendly AI + outreach, still as would‑be FAI builders (2003–2010)
Mission reformulation (2007 and 2010).
By 2007 the formal mission is updated to:
In 2010 it further becomes:
This is already less “we will build it” and more “develop theory and improve odds,” but still assumes SIAI as central technical actor.
Strategic plan 2011: three‑pronged strategy.
The 2011 Strategic Plan lays out three “core strategies”:
“Solve open research problems in Friendly AI theory.”
“Mobilize additional human capital and other resources” (Singularity Summit, LessWrong, volunteers, outreach).
“Improve the function and capabilities of the organization.”intelligence
Near‑term (2011–2012) priorities emphasise:
Public‑facing research on “creating a positive singularity.”
Outreach/education/fundraising via the Summit and rationality community.
Organizational maturity.intelligence
So in this period, strategy is mixed:
Technical aim: make progress on FAI theory.
Meta‑aim: grow a rationalist / EA / donor ecosystem and mainstream attention.
Evidence they still saw themselves as the eventual FAI builders.
Several pieces of evidence support the picture that, in the 2000s and up to at least ~2011, SIAI’s central plan was that it (or a very close “functional equivalent”) would actually build the first Friendly AGI:
The MIRI timeline notes the original mission (2000–2006) as “Create a Friendly, self‑improving Artificial Intelligence.”timelines.issarice
In a video first hosted in 2011, Eliezer says:
Luke Muehlhauser’s 2013 post Friendly AI Research as Effective Altruism (looking back on the founding) explicitly confirms:
Wei Dai’s 2025 LessWrong post also recalls that he was “arguing against SIAI/MIRI’s plan to build FAI” as early as 2004. That is consistent with the explicit “someone is us” framing above.lesswrong+1
So for roughly 2000–2010, the strategic picture is:
Goal: Design and eventually build a Friendly, self‑improving AGI.
Means: Foundational theory + community‑building + outreach (Summits, LessWrong, rationality training).
Role: SIAI is not just an advocacy group; it is the intended Seed AI shop.
3. 2011–2013: professionalisation and pivot to FAI math research
Here we see the first clearly documented organizational pivot with a date.
3.1. 2013: “MIRI’s Strategy for 2013” – research‑only, FAI math‑first
In April 2013, MIRI (new name) publishes MIRI’s Strategy for 2013. Key elements:intelligence
They have:
After strategic planning with 20+ advisors in early 2013, they decide to:
They distinguish expository, strategic, and Friendly AI research and explicitly say:
In other words, as of early 2013:
Outreach and rationality training are explicitly deprioritised.
The core strategy is: focus limited staff and budget on math‑style Friendly AI research (logical uncertainty, self‑modification, decision theory) with some high‑quality strategy work, but little expository or broad public outreach.
3.2. 2013: bylaw / mission statement change
In the same period, MIRI’s formal mission statement is softened from “create Friendly AI” to a more conditional role. Luke writes in June 2013:
Luke adds:
So early 2013 marks a formal pivot:
In documents: from “we exist to build FAI” to “we exist to ensure AGI goes well; we might build AGI if necessary.”
In practice: from “outreach + community + research” to “almost pure technical FAI math research.”
4. Circa 2013: the “world‑altering Friendly AI” plan
Wei Dai’s 2025 LessWrong post, which you linked, is primarily about this era. He characterizes the “circa 2013” strategy as:
His argument is that this plan was strategically bad even ex ante, because a small team attempting to directly build a Friendly sovereign AI faces “illegible safety problems” — traps they cannot even see.lesswrong
Does the evidence support that this really was the plan?
Luke’s 2013 post explicitly notes that Eliezer usually talks about MIRI specifically (or a functional equivalent) creating Friendly AI.intelligence
In the 2013 strategy conversation with Holden Karnofsky and others, Eliezer says:
This is consistent with: MIRI expects either itself, or a close “strategically adequate” project built around its ideas and people, to be the group that actually builds the first aligned AGI.
The LessWrong discussion notes that Yudkowsky “usually talks about MIRI in particular — or at least, a functional equivalent — creating Friendly AI,” and this needed explicit textual evidence by 2025 because it was no longer “common knowledge.”alignmentforum+1
Put together, the best reading is:
Around 2013, MIRI’s operational strategy is “do deep foundational technical work now so that MIRI or a tightly aligned group can someday build a world‑altering Friendly AI (often conceived as CEV‑like), thereby securing the future.”
Critics like Wei Dai were indeed pushing alternative strategies (e.g. invest in more geniuses via embryo selection / cloning and buy time by delaying AI efforts) as early as 2004.lesswrong
This is almost exactly the “race to build Friendly AI” picture the LessWrong post is relitigating, although there is internal disagreement over how much this should be described as “take over the world yourselves” versus “someone legitimate should, and we might be that someone.”lesswrong
5. Emergence of Task‑AGI and “pivotal acts” (≈2014–2017)
Over time, there is a shift in how MIRI talks about what the first aligned system should do.
Early concepts such as Coherent Extrapolated Volition (CEV) envisaged a very powerful sovereign AI that, roughly, does what humanity would want if we were smarter, more informed, and more coherent. That is a maximally ambitious alignment target.
Later, Yudkowsky distinguishes between:
A Sovereign (a CEV‑like system running the long‑term future), and
A Task AGI, a powerful but limited system intended to perform a pivotal act — some one‑time intervention that “upsets the gameboard” in our favor (e.g. uploading researchers, preventing hostile superintelligences, enabling nanotech), after which further decisions can be made under safer conditions.lesswrong+1
The LessWrong “pivotal act” and “Task AGI” pages (compiled later, but summarizing this thinking) explicitly give examples of a Task AGI that can:
Upload humans and run uploads at high speed.
Prevent the origin of hostile superintelligences, possibly temporarily.
Deploy nanotechnology that gives a route to intelligence enhancement or strong defense.lesswrong+1
These are very close to the examples quoted in the comment thread you provided.
When does this show up as strategy?
The clearest institutional articulation is MIRI’s 2017 Fundraiser and Strategy Update (Nov 30, 2017). That document lays out a multi‑step high‑level plan:effectivealtruism
Humanity needs a “stable period” to think clearly about long‑run futures and avoid locking in bad outcomes.
To get there, we must first end an “acute risk period” in which AGI could cause extinction.
MIRI expects this to require a “risk‑mitigating technology” developed via:
That, in turn, requires:
They stress the need for a “strategically adequate” project with strong opsec, research closure, commitment to the common good, and “white‑boxed system understanding.”effectivealtruism
This is, in effect, the Task‑AGI / pivotal‑act plan written down in strategic language.
So while Task‑AGI ideas were developing earlier, 2017‑11‑30 is the first strong documented pivot point where MIRI’s official strategic picture becomes:
Don’t aim first for a full CEV‑style sovereign.
Aim to enable a strategically adequate team to build a minimal aligned AGI capable of executing a pivotal act that ends the acute risk period.
Eli Tyre’s later comment, which you saw in the LessWrong thread, describes an “intentional switch, around 2016 (though I’m not confident in the date), from aiming to design a Friendly CEV‑optimizing sovereign AI to aiming to design a corrigible minimal Science‑And‑Engineering‑AI to stabilize the world.” That matches what shows up publicly in 2017.lesswrong
6. 2017–2018: new engineering‑heavy, nondisclosed‑by‑default research program
In parallel, there was a major research‑tactics pivot.
6.1. New research directions (2017–2018)
The 2017 fundraiser already notes that, with new funding and changed views, MIRI is “starting to explore new research directions while also continuing to push on our agent foundations agenda,” and is hiring software engineers to “experiment and explore some ideas in implementation.”effectivealtruism
The 2018 post 2018 Update: Our New Research Directions gives a detailed picture:intelligence
MIRI still has the Agent Foundations agenda (logical uncertainty, embedded agency, decision theory).openphilanthropy+1
“Recently, some MIRI researchers developed new research directions that seem to enable more scalable progress” by building software systems to test intuitions and iterate quickly.intelligence
High‑level themes include:
“Seeking entirely new low‑level foundations for optimization, designed for transparency and alignability from the get‑go, as an alternative to gradient‑descent‑style machine learning foundations.”intelligence
Figuring out parts of cognition that can be “very transparent as cognition.”intelligence
These directions are explicitly engineering‑heavy, Haskell‑centric, and aimed at deconfusion while building real code, not just pen‑and‑paper math.intelligence
6.2. “Nondisclosed‑by‑default” policy
The same 2018 post announces a major change in publication policy:
Reasons include:
Worry that deconfusion results might also lead to capabilities advances.
Desire to let researchers think without constant self‑censorship about dual‑use risks.
Focus on speed of progress rather than exposition.intelligence
So 2018‑11‑22 marks a clear tactical pivot:
From publishing most core work to doing a large fraction of the new engineering‑heavy research under confidentiality.
From a more outreach‑oriented identity to “we really genuinely don’t mean outreach; we’re simply and only aiming to directly make research progress on the core problems of alignment.”intelligence
This is almost certainly the “4–5 years of confidential‑by‑default research” that Habryka references in the comment thread.lesswrong
7. 2020: admitting the 2017 research push “largely failed” and regrouping
By late 2020, MIRI states that the 2017–2019 “new directions” have not panned out as hoped.
In 2020 Updates and Strategy (Dec 20, 2020), Nate Soares writes:intelligence
2020 saw “limited progress in the research MIRI’s leadership had previously been most excited about: the new research directions we started in 2017.”intelligence
Senior staff have become “more pessimistic about this approach”:
As a result:
Some researchers leave; others attempt to “rescue” the 2017‑initiated lines; others pivot to new exploratory projects.intelligence
This is a clear strategic pivot dated 2020‑12‑20:
Official recognition that the previous core engineering program is unlikely to succeed.
An explicit search phase for new strategy, with no replacement yet fully specified.
8. 2022: “AGI Ruin” and strong public pessimism
Eliezer’s widely read 2022 essay AGI Ruin: A List of Lethalities further clarifies how MIRI leadership now see the situation.intelligence+1
Key points relevant to strategy:
Alignment, in the sense “get to less than certain doom on the first dangerous try,” is portrayed as extremely difficult; the problem is lethal and must be solved under both a time limit and a “first critical try” constraint.intelligence
He argues a pivotal act is still needed:
But he also says he sees no realistic path to doing this under current conditions, and regards doom as the default outcome.
Although AGI Ruin is not itself a strategy document, it marks a public rhetorical pivot:
The earlier plan “MIRI (or a strategically adequate project we seed) devises and executes a pivotal act using aligned AGI” is treated as effectively unachievable.
This viewpoint sets the stage for the next major institutional shift: toward halting or pausing AGI rather than expecting to align and deploy it.
9. 2023–2024: pivot to policy and communications first
MIRI’s 2024 Mission and Strategy Update (Jan 3–4, 2024) is the most explicit top‑level strategy document and also provides a retrospective on earlier phases.lesswrong+2
After reviewing its history (founding accelerationism, 2003 alignment shift, 2006–2012 community‑building, 2013 Agent Foundations, 2017 engineering push, 2020 regrouping), it states:
They add:
Thus, in 2023 (per the 2024 retrospective) MIRI’s top‑level strategy pivots to:
Primary: Push for an international halt / moratorium on smarter‑than‑human AI (policy) and make their models legible enough to shift the Overton window (communications).
Secondary: Continue alignment research, but now framed as conditional on policy success buying time.
This is a very sharp contrast with:
2013 (“less outreach, more technical research”).intelligence
2018 (“we really don’t mean outreach; we’re just trying to solve core alignment problems”).intelligence
Now, outreach and policy work are explicitly prioritized over new technical work.
10. Consolidated list of major strategic pivots with dates and evidence
Below is a concise list of the clearest, evidence‑backed pivot points.
How this lines up with the LessWrong post and comments
Starting from the post and comments you linked:
The claim that “MIRI’s circa‑2013 plan [was] to build a world‑altering Friendly AI” is well supported by:
The original mission (“Create a Friendly, self‑improving AI”).timelines.issarice
Eliezer’s 2011 statements that SI’s mission is literally to build Friendly AI and that “someone has to do it… ‘Someone’ is us.”lesswrong
Luke’s 2013 remark that Eliezer usually talks about MIRI in particular (or a functional equivalent) creating FAI.intelligence
The 2013 strategy conversation, which explicitly dismisses the idea of merely influencing “a random AGI project 30 years from now,” in favor of gathering a community that “just [goes and does] it” with safety baked in from the ground up.openphilanthropy
The “pivotal act” and “Task AGI” thinking that comes up in the comments corresponds closely to what MIRI formalizes in 2017 as “minimal aligned AGI” used for “AGI‑empowered technological development” to end the acute risk period.lesswrong+2
The comment‑thread reference to a 4–5‑year confidential research period matches the 2017–2018 launch of nondisclosed‑by‑default, engineering‑heavy research and the 2020 admission that this push had largely failed.intelligence+1
The observation that MIRI later pivoted away from “build FAI ourselves” because alignment turned out to be too hard is consistent with:
The 2020 regrouping post.intelligence
The strongly pessimistic tone of AGI Ruin (2022).intelligence
The 2023 pivot toward policy and communication, treating large‑scale AI pause as central and technical alignment research as increasingly unlikely to arrive in time absent that pause.effectivealtruism+2
The main point of disagreement in the LessWrong discussion is about phrasing like “MIRI’s plan, to build a Friendly AI to take over the world,” which some participants (e.g. Habryka) see as overstating how explicitly “take over the world ourselves” was ever adopted as official organizational strategy, versus “someone should do something world‑altering, and maybe that someone is us or a very similar team.”lesswrong
On the factual side, however, the evidence is fairly clear that:
For roughly its first 10–15 years, SIAI/MIRI saw itself as a small group that might directly build the first Friendly superintelligence (initially a sovereign CEV‑like system, later a minimal Task‑AGI for a pivotal act).
Around 2013 it professionalized and narrowed to technical FAI work while still effectively planning to be (or seed) that project.
From 2017 onward it concentrated on an internal, engineering‑heavy attempt to find new foundations for alignable AI.
By 2020–2023 it concluded that this had failed and pivoted to policy and communication, with technical work now secondary and survival odds seen as hinging largely on halting or drastically slowing frontier AI.
That is the best evidence‑backed reconstruction of SIAI/MIRI’s strategic trajectory given the post, its links, and MIRI’s own later retrospectives.
https://timelines.issarice.com/wiki/Timeline_of_Machine_Intelligence_Research_Institute
https://intelligence.org/files/CFAI.pdf
Follow-up answer to “Can you dig deeper, for example into LessWrong posts/comments, to find out when Eliezer or MIRI first started thinking or talking about task AGI?”
Based on a deeper analysis of LessWrong posts, Arbital edit histories, and MIRI archives, the concept of Task AGI (and the associated shift to “pivotal acts” rather than immediate “Sovereign/CEV”) emerged earlier than the 2017 strategy update.
The specific “origin date” for the term and formalized concept is June 9, 2015.
Key Dates and Evidence
June 9, 2015 (The “Smoking Gun”): Eliezer Yudkowsky created the page “Task-directed AGI” on Arbital on this exact date.
Evidence: Metadata from Arbital scrapes explicitly list the page as “by Eliezer Yudkowsky Jun 9 2015 updated Mar 25 2017” (arbital.obormot.net).
Significance: This proves the concept was being actively formalized in mid-2015, fully two years before it became the centerpiece of the 2017 public strategy.
May 24, 2016: The Arbital page for the “Do-What-I-Mean hierarchy” was created, which integrated “Task AGI” into a broader safety framework. This confirms that by 2016, MIRI had a developed internal language around “Task AGI” vs. “Sovereign AGI” that was distinct from their earlier “Tool AI” discussions (arbital.com).
2014 (The Intellectual Catalyst): The term likely crystallized as a response to Nick Bostrom’s book Superintelligence (published July 2014).
Context: Bostrom popularized the typology of “Oracle, Genie, Sovereign, Tool.”
Connection: The Arbital definition of Task AGI explicitly references Bostrom: “An advanced AI that’s meant to pursue a series of limited-scope goals… In Bostrom’s terminology, a Genie” (arbital.obormot.net).
Pivot Logic: MIRI had previously been skeptical of “Tool AI” (arguing that passive tools are unsafe). “Task AGI” appears to be their 2015 attempt to construct a safe version of a Genie/Tool—one that is agentic enough to be useful (perform a pivotal act) but limited in scope (“corrigible” and “mildly optimizing”) to avoid taking over the future.
Summary of the Evolution
The timeline of MIRI’s internal pivot to “Task AGI” can be reconstructed as:
2013–2014: Debate over “Tool AI” vs. “Oracle AI.” Eliezer is critical of the idea that AI can just be a “passive tool” (see Holden Karnofsky’s Singularity Institute Objection, 2012–2014).
July 2014: Bostrom’s Superintelligence provides new vocabulary (Genie vs. Sovereign).
June 2015: Eliezer formally documents “Task-directed AGI” on Arbital. The strategic hope shifts from “Build CEV immediately” to “Build a Task AGI to perform a pivotal act.”
2016: Internal consensus solidifies. MIRI researchers (like Eli Tyre and Nate Soares) begin treating “Task AGI for a pivotal act” as the standard plan. This is the “intentional switch around 2016” mentioned in the comments.
Nov 2017: The strategy is officially announced to the public in the 2017 Fundraiser and Strategy Update.
In short: While the public saw the pivot in 2017, the intellectual shift happened in mid-2015, likely triggered by the digestion of Bostrom’s 2014 book and the need for a more pragmatic “minimum viable” target than full CEV.
https://arbital.obormot.net/page/task_agi.html
https://arbital.com/p/dwim/
https://www.lesswrong.com/posts/keiYkaeoLHoKK4LYA/six-dimensions-of-operational-adequacy-in-agi-projects
https://intelligence.org/2022/06/07/six-dimensions-of-operational-adequacy-in-agi-projects/
https://arbital.com/p/soft_optimizer/
https://www.lesswrong.com/w/coherent-extrapolated-volition-alignment-target
https://arbital.com/p/cev/
https://www.lesswrong.com/posts/zj7rjpAfuADkr7sqd/could-we-automate-ai-alignment-research-1
https://arbital.com/p/task_agi/
https://www.lesswrong.com/posts/7NvKrqoQgJkZJmcuD/instruction-following-agi-is-easier-and-more-likely-than
https://www.lesswrong.com/posts/bnY3L48TtDrKTzGRb/ai-safety-success-stories
https://www.lesswrong.com/w/autonomous-agi
https://en.wikipedia.org/wiki/Artificial_general_intelligence
https://www.reddit.com/r/SneerClub/comments/1i1wmhf/risaacarthur_fan_learns_about_lesswrong_is/
https://philarchive.org/archive/TURCDA
https://thezvi.wordpress.com/2022/06/13/on-a-list-of-lethalities/
https://www.ibm.com/think/topics/artificial-general-intelligence
https://manifold.markets/MartinRandall/is-there-a-pivotal-weak-act
https://bayesianinvestor.com/blog/index.php/2024/12/29/corrigibility-should-be-an-ais-only-goal/
https://intelligence.org/2022/06/10/agi-ruin/
https://intelligence.org/2013/08/11/what-is-agi/
https://www.greaterwrong.com/w/soft_optimizer?l=2r8&sort=new
https://www.lesswrong.com/posts/dGotimttzHAs9rcxH/relitigating-the-race-to-build-friendly-ai
https://www.alignmentforum.org/posts/AqsjZwxHNqH64C2b6/let-s-see-you-write-that-corrigibility-tag
https://www.greaterwrong.com/posts/dGotimttzHAs9rcxH/relitigating-the-race-to-build-friendly-ai/comment/jg5z7WMELvm6rymfa
https://www.greaterwrong.com/tag/task-directed-agi
https://arbital.greaterwrong.com/p/AGI_typology?l=1g0
https://www.lesswrong.com/users/tsvibt
If Eliezer or MIRI as a whole had said something like this, especially the first part, “I think no one should build AGI,” while pursuing their plans, I would be more tempted to give them a pass. But I don’t recall them saying this, and a couple of AIs I asked couldn’t find any such statements (until after their latest pivot).
Also, I wouldn’t actually endorse this statement, because it doesn’t take into account the human tendency/bias to think of oneself as good/careful and others as evil/reckless.
Eliezer claiming to have solved metaethics. Saying that he wouldn’t “flinch from” trying to solve all philosophical problems related to FAI by himself. (man, it took me 30-60 minutes to find this link) Being overconfident on other philosophical positions like altruism and identity.
I would be more ok with this (but still worried about unknown unknowns) if “comprehensive scrutiny” meant scrutiny by thousands of world-class researchers over years/decades with appropriate institutional design to help mitigate human biases (e.g., something like academic cryptography research + NIST’s open/public standardization process for crypto algorithms). But nothing like this was part of MIRI’s plans, and couldn’t be because of the need for speed and secrecy.
Ok. I think I might bow out for now unless there’s something especially salient that I should look at, but by way of a bit of summary: I think we agree that Yudkowsky was somewhat overconfident about solving FAI, and that there’s a super high bar that should be met before making an AGI, and no one knows how to meet that bar; my guess would be that we disagree about
the degree to which he was overconfident,
how high a bar would have to be met before making an AGI, in desperate straits.
Definitely read the second link if you haven’t already (it’s very short and salient), but otherwise, sure.
(I did read that one; it’s interesting but basically in line with how I think he’s overconfident; it’s possible one or both of us is incorrectly reading into / not reading into what he wrote there, about his absolute level of confidence in solving the philosophical problems involved.)
Hmm, did you also read my immediate reply to him, where I made the point “if you’re the only philosopher in the team, how will others catch your mistakes?” How can his (then) plan be understood, except that he would have been willing to push the “launch” button even if there were zero other similarly capable philosophers available to scrutinize his philosophical ideas?
(Also just recording that I appreciate the OP and these threads, and people finding historical info. I think the topic of how “we” have been going wrong on strategy is important. I’m participating because I’m interested, though my contributions may not be very helpful because
I was a relative latecomer, in that much of the strategic direction (insofar as that existed) had already been fixed and followed;
I didn’t especially think about strategy that much initially, so I didn’t have many mental hooks for tracking what was happening in the social milieu in terms of strategic thinking and actions.)
(Oh I hadn’t read the full thread, now I have; still no big update? Like, I continue to see him being seemingly overconfident in his ability to get those solutions, but I’m not seeing “oh he would have mistakenly come to think he had a solution when he didn’t”, if that’s what you’re trying to say.)
I think you can get less of the tradeoff here by explicitly and deliberately aiming for AI ‘tools’ for improving human (group) strategic competence. It sounds subtle, but I think it has quite different connotations and implications for what you actually go and do!
This isn’t implausible, but could you point to instances / evidence of this? (I.e. that MIRI’s plan / other participation caused this.)