Richard_Ngo
Formerly alignment and governance researcher at DeepMind and OpenAI. Now independent.
Conditional on being around to look back, it seems pretty plausible to me that lack of trust and competence within major powers will have made the outcome of AGI significantly worse than it could have been.
A (partial, not very good) analogy is that, at this point, the developed world is pretty altruistic towards the developing world (e.g. to the tune of many billions of dollars of aid per year). But the developing world might still really wish it’d had fewer internal ethno-religious fractures during the Industrial Revolution (or indeed at any time since then).
Copying over my response to Scott from Twitter (with a few additions in square brackets):
I think my biggest disagreement here is about the concept of strategic communications.
In particular, you claim that MIRI should have been more PR-strategic to avoid hyping AI enough that DeepMind and OpenAI were founded.
Firstly, a lot of this was not-very-MIRI. E.g. contrast Bostrom’s NYT bestseller with Eliezer popularizing AI risk via fanfiction, which is certainly aimed much more at sincere nerds. And I don’t think MIRI planned (or maybe even endorsed?) the Puerto Rico conference.
But secondly, even insofar as MIRI was doing that, creating a lot of hype about AI is also what a bunch of the allegedly PR-strategic people are doing right now! Including stuff like Situational Awareness and AI 2027, as well as Anthropic. [So it’s very odd to explain previous hype as a result of not being strategic enough.]
You could claim that the situation is so different that the optimal strategy has flipped. That’s possible, although I think the current round of hype plausibly exacerbates a US-China race in the same way that the last round exacerbated the within-US race, which would be really bad.
But more plausible to me is the idea that being loud and hype-y is often a kind of self-interested PR strategy which gets you attention and proximity to power without actually making the situation much better, because power is typically going to do extremely dumb stuff in response. And so to me a much better distinction is something like “PR strategies driven by social cognition” (which includes both hyping stuff and also playing clever games about how you think people will interpret you) vs “honest discourse”.
To be clear I don’t have a strong opinion about how much IABIED fits into one category vs the other, seems like a mix. A more central example of the former is Situational Awareness. A more central example of the latter is the Racing to the Precipice paper, which lays out many of the same ideas without the social cognition.
My other big disagreement is about which alignment work will help, and how. Here I have a somewhat odd position of both being relatively optimistic about alignment in general, and also thinking that almost all work in the field is bad. This seems like too big a thing to debate here but maybe the core claim is that there’s some systematic bias which ends up with “alignment researchers” doing stuff that in hindsight was pretty clearly mainly pushing capabilities.
Probably the clearest example is how many alignment researchers worked on WebGPT, the precursor to ChatGPT. If your “alignment research” directly leads to the biggest boost for the AI field maybe ever, you should get suspicious! I have more detailed models of this which I’ll write up later, but suffice to say that we should strongly expect Ilya to fall into similar traps (especially given the form factor of SSI), and probably Jan too. So without defusing this dynamic, a lot of your claimed wins don’t stand up.
have we, as the AI Safety community, already lost? That is, have we passed the point of no return, after which becomes both likely and effectively outside of our control?
I think you’re missing a word after “which”. But also, the “outside of our control” part seems like a bad definition of losing, insofar as there are other actors who might be able to steer things instead.
Glad to see these kinds of reflections in general, though.
“Opportunity cost” is another slippery concept that in the economic framework seems similar to other costs, but in a sociopolitical framework seems extremely different.
Suppose I steal $1000 of your stuff. You can describe this as me imposing a $1000 cost on you.
But suppose instead that I offer you $1000 if you quit your job. Assuming you’re happy enough with your job that this doesn’t move you to act, then what I’ve done is just to “impose” a $1000 opportunity cost on you. But of course this does no harm to you.
And so the phrase “opportunity cost” is inherently a misleading one, especially when used as you do above (i.e. talking about “paying” an opportunity cost in the same way that you pay taxes). You have elided the distinction between me freely choosing to optimize for other things than financial returns, versus me having my money taken away from me using the threat of force.
Related: the faction most worried about building superintelligent AI evolved from the faction most worried about not building superintelligent AI.
I broadly agree with this comment too, though not as much as I agree with the other one.
Power felt can also be a kind of honesty—e.g. if a law is backed by force, then it’s often better for this to be unambiguous, so that people can track the actual landscape of power.
(Of course, being unambiguous about how much force backs up your laws can also be a kind of power move. I expect that there are ways to get the benefits of honesty without making it a power move, but I don’t have enough experience with this to be confident.)
In other words, I expect that the kind of inefficiency Val is talking about here is actually sometimes load-bearing for accountability.
Yes, great summary, I fully endorse it.
I claim that there are some people such that, if they were dictators of China, that would be much worse than the current situation. And there are some people such that, if they were dictators of China, that would be much better than the current situation. Which category a given person falls into depends a lot on their honesty, integrity, wisdom, ability to understand political dynamics, ability to resist manipulation, etc.
There are no particular limits I’d want to place on a sufficiently virtuous Superman. E.g. I want Superman to follow a policy that leads him to overthrow the government of China iff he is in the latter category. The big question is how Superman can gain justified confidence that he’s in the latter category, given that unvirtuous people are prone to a lot of self-deception. One way he can do it is by setting limits on his own behavior so that he can gain more evidence about what kind of person he is. E.g. maybe he thinks he’s really wise about politics—wise enough that him having control over US electoral policy is a good idea. If so, he should try to test that wisdom by implementing political change without using violence. If he starts telling you that he doesn’t need to pass such tests, because he’s already so confident that his plan is a good idea, then you should start getting worried.
In other words, when I think about a question like “should Superman forcibly institute electoral reforms to make the US government more functional”, I expect that there are some ways to do this that are really good, and some ways to do this that are really bad. And the kinds of people who are capable of doing it in a really good way (given that they’re Superman) are also generally the kinds of people who wouldn’t need to use much force to make it happen (given that they’re Superman).
Are you saying that if the diplomatic negotiations deteriorate to the point of military action, that means that our hypothetical superman has failed, and he would be better off retiring? Don’t existing legitimate countries go to war for far less noble reasons all the time?
I intended this to refer to scenarios where the US itself (or other leading western powers) were taking military action against Superman. I care much less about whether he destabilizes North Korea or Eritrea, or even countries similar to those but better-governed. But I care a lot about whether he destabilizes the countries I consider the best and most important ones.
Many nations would consider Superman’s property damage to their factories to be an act of war by a foreign power
Maybe. Or maybe they really wouldn’t want to pick a fight with Superman. Or maybe they would issue an angry press release then not do anything. In a setting where Superman holds basically all the cards in terms of physical force, most nations would try quite hard to defuse tensions with him (unless, as I discussed, he’s very unskilled).
I think Alexander Hamilton was the beginning, but this seems like a big step. Vassar talks about how, from the Civil War onwards, the American legal system needed to be optimized to rule a vassal state while also pretending that it wasn’t ruling a vassal state. Can’t remember the specific examples he cited to me but I found it fairly compelling.
Oh, I think of “ending factory farming” as very far from “taking over the world”.
If Superman were a skilled political operator it could be as simple as arranging to take photoshoots with whichever politicians legislated the end of factory farms.
Or if he were less skilled it could involve doing various kinds of property damage to factory farms (potentially even things which there aren’t laws against, like flying around them in a way which blows the buildings over).
This might escalate to the government trying to arrest him, and outright conflict, but honestly if Superman isn’t skillful enough to defuse that kind of thing, given his influence, then he doesn’t have much business imposing political changes on the world anyway. A politically unskilled and/or unvirtuous Superman trying to end factory farming could quite easily destabilize society in a way that is far worse long-term than letting factory farming end on whatever the natural counterfactual timeline is (without AI, maybe 20 or 30 years?)
Relatedly I’m increasingly coming to believe that this reasoning applies to Lincoln, and that we’d be in a much better position if he’d let the Confederacy secede and then imposed strong economic and moral pressure on them to end slavery.
Eliezer has said that one of the reasons he writes fanfiction is that he doesn’t have to invent the world. All of the horror and badness was already present in the source material.
A large majority of fantasy settings don’t have literal hells as a key component, so I think my point is still applicable to Project Lawful if you replace “design them that way” with the more general “select for that trait”.
I do agree that this is a good point with regards to HPMoR, which is one reason why I didn’t include HPMoR in my original list of examples.
This is pretty straightforwardly not true: there are plenty of academics (for example) who are as smart as rationalists but don’t do very broad instrumental reasoning.
There are also plenty of people who don’t fantasize about becoming all-powerful dictators.
I think that the hunger to become god is an unusually rationalist trait. Honestly it’s somewhat reminiscent of sociopathy, but fortunately few rationalists seem to be sociopaths. However, I do think a sufficient level of fear of death causes some overlapping traits, e.g. a mentality in which more power is crucial to solving problems. (This is not meant in a particularly blame-y way, I’m just as much an example of this as anyone else around here.)
Very interesting.
Relatedly, A Practical Guide to Evil is one of my favorite books/series, and grapples with the tension between trust and power very well. It’s one of the very few narratives I’ve seen written skillfully enough that the protagonist giving up power didn’t seem straightforwardly stupid to me (even when I was in a classic rationalist mindset).
I do actually think that the general trope of “the rebels winning is sufficient for a happy ending” is pretty indicative of poor ethical thinking.
But even Hollywood balks at their heroes ending up with literal godlike control of the world. For example (though I haven’t watched the series) my impression of the Avengers franchise is that they introduce a plot device (the infinity gauntlet) that gives its wielder godlike powers, the heroes use it specifically to defeat the bad guy and undo the damage he caused, and then they destroy the device.
In other words, they got to the exact point that ratfic heroes got to, and then their happy ending specifically involves them giving up the same kind of godlike power that ratfic heroes typically use to make themselves dictators of the universe.
Similarly for Superman: his happy endings involve him successfully using his godlike powers to beat the bad guys without changing the established world power structures basically at all. And I feel pretty confident that a big reason Superman doesn’t end up taking over the world is because the writers and viewers would have moral qualms about that kind of ending.
tl;dr: there are many ways to make a story have a happy ending, and it’s quite indicative of the authors’ ethical and political views which endings they consider to be happy. The kind of endings that rationalists often portray as happy, mainstream scriptwriters seem to go out of their way to avoid.
So, they are writing fictional analogies for the situation they expect to actually happen in real life. Except of course, since they are writing fiction, it has to have a happy ending.
Well, exactly what I’m disputing here is how happy the ending is. For example, imagine that all of these stories played out exactly the same, with the exact same amount of concentration of power. But instead of the heroes getting to use that power to reshape the world, the power instead goes to… a random person off the street. I expect that these authors, if they were to write that kind of ending, would portray it as a maybe-happy-ish ending, but one that’s still pretty scary and uncertain.
And indeed, this is roughly how I’d describe the stories mentioned above where a mostly-aligned AI gets total power—Friendship is Optimal, Branches on the Tree of Time, and Metamorphosis of Prime Intellect. These stories really grapple with the sense of unease and tension that comes with almost everyone losing almost all their power.
Whereas when I look at the examples of ratfic above, the stance they’re taking seems to be “our heroes became dictators of the universe. This is a straightforwardly happy ending.” And indeed, on several occasions (maybe as many as half a dozen?) I’ve heard people describe the ending of Worth the Candle as one of the best utopias they can imagine. All of this really seems like a big ideological blind spot.
Yes, I agree. However, as I mentioned in my OP, I think that the prominence of Hell in stories like Unsong and Project Lawful is partly due to them functioning as plot devices to make taking over the world not just ethical but in fact morally obligatory.
Analogously, if a bunch of 19th-century Marxist fiction featured working conditions far harsher than any that existed in the real world, which compelled the heroes to launch a proletarian revolution, you wouldn’t just think “this makes total sense given the fictional premise”, you’d also think “the fictional premise was chosen to help the authors make the thing they already supported (and wanted to write about) seem morally good”.
And “take over the world for good reasons” was IIRC MIRI’s actual plan (hidden under the terminology “decisive strategic advantage” or “pivotal act” or similar).
Another way of putting this is in terms of the distinction between two types of optimization: selection and control.
Ratfic typically thinks of improving the world as a selection problem. Selecting a better world from the space of possible worlds is neat and elegant and lets you solve all problems at once. The only issue is that you need to gain absolute power first in order to be able to select the future you want.
Whereas you can also think about improving the world as a control problem, where you’re gradually nudging the world towards being better. This is less narratively satisfying when you’re a highly systematizing thinker, because you want to be able to identify the single root problem and take it out in one fell swoop (the same style of thinking that the communists exemplified). But it’s much more robust when you’re in a world full of other people all of whom are also trying to exert influence.
IIRC none of Harry in HPMoR, Aaron in Unsong or Naruto in Waves Arisen actually meaningfully improved the world before taking it over—if anything, they mostly made it worse.
(Harder to evaluate this for the r!Animorphs or Keltham, because they were operating in such adversarial environments. And I don’t remember Worth the Candle well enough to say one way or the other.)
Your analogy seems a bit skewed because on the scale “how much of a world takeover is this?”, “gaining the ability to unilaterally destroy the world” scores much higher than merely “gaining nukes”. If becoming a nuclear power let you unilaterally destroy the world, the US would have tried much harder to limit their spread!
It seems more like “a small powerless country seizes the USSR’s entire nuclear weapons stockpile (or creates an equivalently large one of their own) and tells everyone that they’ll cause nuclear armageddon unless their demands are satisfied”. Which is pretty world-takeover-adjacent even if it’s not exactly “taking over the world” in the classic sense.
(I’d also describe it as a central example of “gaining the power to design a new world order”, but not a central example of “gaining the power to design a new world order from scratch”.)
On the meta level: why are there this many net upvotes and agreement votes for planning horizons of “at most 5 years”? This updates me towards thinking that some aspect of collective epistemics is notably worse than I had been tracking.