Formerly alignment and governance researcher at DeepMind and OpenAI. Now independent.
Richard_Ngo
Shannon’s information theory tells us how much information is sent from a sender to a receiver via a channel. But the implications of that information might be very different depending on how much the receiver trusts the sender, in a way which seems like it might point to some interesting theoretical concepts.
Imagine you’ve just received a message M. There are two possibilities for who it came from: your best friend, or your worst enemy. The message might contain an equal amount of information either way—however, you’ll react to it very differently. In the former case, you want to propagate the contents of the message directly into your world-model: whatever your friend says, you fully believe.
By contrast, you definitely don’t want to just believe what your enemy tells you. Nor should you actively disbelieve it, though, otherwise they could still predictably mislead you. Instead, the main update you want to make is just “my enemy wants me to see the particular string of characters that constitutes message M”.
Another way of putting this: in the former case you update your world-model by conditioning on the semantics of the message, whereas in the latter case you update your world-model by conditioning on the syntax of the message. Why? Because you don’t trust your enemy enough for communication with them to support semantic meaning.
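To make this distinction concrete, here’s a minimal sketch in Python (my own illustration, with toy sender models made up for the example, not anything from the original point or from information theory proper). “Semantic” updating conditions on the content of the message being true; “syntactic” updating conditions only on the event that this particular sender chose to send this particular string.

```python
# Toy illustration (my own framing): a binary world state S, and a message m
# reporting on it. "Semantic" updating conditions on the content of m being
# true; "syntactic" updating conditions only on "this sender sent this string".

def posterior_semantic(message):
    # Trust the content outright: P(S = message) = 1.
    return {message: 1.0, 1 - message: 0.0}

def posterior_syntactic(message, sender_model, prior={0: 0.5, 1: 0.5}):
    # Bayes over the event "the sender sent m", using a model of the sender.
    # sender_model[s][m] = P(sender sends m | true state is s).
    unnorm = {s: prior[s] * sender_model[s][message] for s in (0, 1)}
    z = sum(unnorm.values())
    return {s: p / z for s, p in unnorm.items()}

# A trusted friend who always reports the truth.
friend = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}

# An adversary in the "babbling" equilibrium: since you won't take their words
# at face value (and they know it), their report is uncorrelated with S.
enemy = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}

print(posterior_semantic(1))           # {1: 1.0, 0: 0.0}
print(posterior_syntactic(1, friend))  # matches the semantic update
print(posterior_syntactic(1, enemy))   # stays at the prior: you learn nothing about S
```

The uncorrelated-report model for the enemy is also why you shouldn’t actively disbelieve them: if you predictably inverted their reports, telling the truth would become their best way to mislead you, so in equilibrium their words carry no information about the world, only about them.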
—
Now let’s push it a step further. Instead of a message from your friend or enemy, the message is either from an omnibenevolent god or an omnimalevolent devil. Here we also have a disparity in how much you trust them—but this time, both possible senders are (we’ll assume) far more intelligent than you, and could design messages which manipulate you in all sorts of ways.
So for the message from the devil, you probably don’t even want to make the update “the devil wants me to see M”, because that still involves representing M in your mind, which could allow them to manipulate you. Instead you want to firewall yourself from their message, ideally by not even reading it at all.
The opposite is true for the message from god. In that case, even injecting the message directly into your world-model isn’t sufficient. Ideally you’d inject it straight into your policy—you’d meditate on the message, letting the words echo like poetry through your mind, until you’d internalized it on such a deep level that it was able to directly shape your instincts and impulses. After that point, you’d trust those instincts much more than by default—because they’d been rewired by someone who knows exactly what the right instincts for you to have are.
(This might be how an animal whose nervous system was finely-honed by evolution feels, by the way. Their evolved instincts are way smarter than they are, and so they have to trust them implicitly. And see also Sahil’s discussion of how cells can call for help, and get it, without any understanding of how or why that help arrived.)
I don’t know where to go with all of this but there’s something that seems very interesting here.
it seems strange to blame the resulting unproductivity of the conversation mainly on lesswrong
I didn’t intend to assign blame. If I had a different intellectual style (e.g. if I were more methodical about building up chains of logic) then I agree it’d be much easier for people to productively engage with me.
For the record: I do agree that a bunch of my political thinking is sloppy. Right now it feels like I’m facing a tradeoff between speed of conceptual progress and precision of thinking, and I’m optimizing primarily for the former.
One reason I discussed the analogy to ML above is that I hoped it would help people understand why I’m making this tradeoff. For example, I suspect that many LWers remember their thinking about AGI being called sloppy by the mainstream ML community because it didn’t have equations. I think in hindsight it was the correct choice for LW to focus on this kind of “sloppy” exploratory thinking.
Having said that, it’s clearly possible to go too far in this direction, and I regret giving the EAG talk in particular. More generally, there’s a difference between doing sloppy thinking with intellectual collaborators vs broadcasting sloppy thinking to the world. Part of what I’m trying to figure out is the extent to which I should think of LW posts as the former vs the latter.
I regret giving the example of the disagree-votes; it’s not that important to me, and I agree there are all sorts of reasons you might want to disagree-vote my previous post. I’m trying to point at a broader dynamic (and elaborate on it more in this reply to Raemon).
Thanks for asking. I think the underlying issue here is that I’m in a period of boggling at wtf is going on with society. I have a sense that there’s a bunch of insane stuff happening all over the place. Funnily enough, one of the people who’s most sharply articulated a similar sense is Eliezer, when he wrote (I think in some glowfic) about how Earth is fractally disequilibriated, and the whole planet is made out of coordination failure.
But I think Eliezer and many rationalists maybe just take “the world is inadequate” as some kind of brute fact that doesn’t really have a clear socio-historical-political explanation. Like, we used to be able to do Manhattan projects, and now the US govt is nowhere near coordinated enough to do that, but… eh, that’s just how entropy works. Whereas it seems to me that actually it might be possible to trace the historical forces that contributed towards this, and the social principles that maintain it, and so on, to develop a fairly principled understanding of the situation.
However, this is a sufficiently ambitious project that my default strategy is to do a lot of exploration in a bunch of directions, which then leads to a lot of individual claims that people think are sloppy, which then leads to the kind of engagement that’s frustrating on both sides—where to them it feels like I’m just throwing out crazy takes, and to me it feels like they’re not trying to engage with the core ideas. (I don’t think these ideas should be sufficient to change people’s strategic frame, yet, but I do think they should be sufficient to make people confused.)
As one example: in response to this shortform, a bunch of people have commented about why they disagree-voted my previous post, how I should interpret that, and so on. But literally zero people have mentioned either my ML analogy, or the thing where Scott Alexander is calling himself a Nazi, which to me were the two most substantive parts of the shortform, and the ones pointing at an extremely important dynamic. So in hindsight it feels like even just mentioning my previous post derailed this one, and the move of going meta was insufficient to defuse this.
Basically I think the kind of engagement I want is more like “riff with me”, but that’s just unrealistic to expect from a community in public on controversial topics (at least without requiring me to put a level of care into phrasing things that would make it no longer “riffing” on my end).
Thanks, good comment. The quick low-effort version that doesn’t require actually writing the posts is that without taking heritable IQ into account, I think you will be confused about:
Various ways in which post-apartheid South Africa is a bad place to live.
Why Israel is so good at defending itself even against far larger countries surrounding it (and the last few centuries of Jewish history more generally).
Why the growth curves for East Asia and Africa looked so different over the last century.
Twitter + my blog (mindthefuture.info).
But most of my collaboration on this stuff is via 1:1 discussions, reading groups, etc. I host a politics reading group which has been very productive for me (separate from the 21civ.com groups, which have also been interesting).
Two memos from 2024
I had trouble figuring out which part of your post was intended to be the main question. I’ve left a few comments responding to various parts of it.
Re what outcomes I’m aiming for, honestly a lot of what motivates my thinking is how much I care about cooperation. I just expect that for cooperation to work at large scales and over the long term, you need to do a bunch of exclusion/separation at smaller scales.
I suspect that I’ve interpreted you to be saying that those previous outcomes were bad (“While perhaps at first this taboo was valuable”)
Yes, ethnonationalism by white Europeans previously led to some very bad outcomes. The reason I said “perhaps” is that I don’t know if implementing the taboo was in fact necessary to prevent it from happening again (as opposed to, say, trusting people to learn from history).
if your view is that we can discuss some features of outcomes that make an outcome good or bad, then I would want to discuss what virtues can lead us towards better outcomes by that standard. virtue utilitarianism, if you will, rather than rule utilitarianism
I endorse the first sentence but not the second. Virtue utilitarianism holds consequences to be the ultimate thing that matters, and virtues to be a mechanism for reaching them. Whereas I think of “produces good outcomes” as one criterion by which to evaluate virtues, but not the only one (because humans in fact care about more than just outcomes, and because figuring out what counts as “good” is hard to disentangle from understanding virtues themselves).
A “friendly, non-destructive ethnonationalism” that is prosocial towards nearby ethnonational groups is something I could imagine being a worthy success, though it would strike me as odd
Note that this is not too different from the National Conservatism movement, which includes nationalists from a bunch of different countries.
I currently doubt that ethno- or national- are the grouping types that work best
One intuition that might be useful here: two types of entities competing for the future are countries and companies. Countries are in some sense on the “human side”, because they have a bunch of mechanisms that prevent AIs from gaining political power. Conversely, companies could very easily be run almost entirely by AIs. So while I’m open to the possibility of different grouping types, part of what I’m doing is trying to strengthen the best thing we have before it gets strongly challenged by the rise of AI.
I’ve struggled to engage very productively with the rest of the AI safety community about politics—e.g. there are a lot of disagreeing votes and comments on this post. By and large people have been respectful and polite (as per the high upvote count), but right now it doesn’t feel like LW is a place where I can collaboratively make intellectual progress on this very important topic.
This is sad, so before I stop trying I want to attempt to go meta, by explaining how this seems analogous to the ways that the alignment community struggled to engage with mainstream ML researchers throughout the 2010s and early 2020s. My explanation of those dynamics ended up growing into this post; in this shortform I’ll discuss the analogy to politics specifically.
The main point is the following. With enough discussion, you could often get a mainstream ML researcher to admit that something like situational awareness or recursive self-improvement might in principle be possible. But it’s very hard to get them to take it seriously enough that it propagates through their ontology—and so after that one conversation they’d typically just go back to their standard ML research. This is for a bunch of reasons—in part because it’s genuinely hard to update one’s ontology, in part because their social incentives and identity push away from doing so, in part because they’re scared about the implications of propagating this belief.
Similarly, I think AI safety people (especially those in the LW cluster) are usually intellectually honest enough to be able to acknowledge that various heretical political beliefs might be true. However, there’s an additional step of propagating it through their ontology which typically doesn’t happen due to mental blocks.
For example, in this post Scott Alexander is willing to mention the possibility that heritable racial IQ gaps exist as one of four hypotheses for observed racial disparities. However, observe how he immediately distances himself from the position:
White people have average IQ 100, black people have average IQ 85, this IQ difference accurately predicts the different proportions of whites and blacks in most areas, most IQ differences within race are genetic, maybe across-race ones are genetic too. I love Hitler and want to marry him.
This is particularly striking because I’m pretty confident he himself believes some version of this hypothesis! So he’s basically calling himself a Nazi to defuse the discomfort of even raising a hypothesis this controversial (let alone endorsing it). This is objectively a very odd thing to do.
Now, Scott Alexander has historically been very brave in a bunch of ways that I wasn’t, so I don’t want to try to take any kind of moral high ground. I merely want to point out that it’s pretty obvious how this kind of mental block might make it hard to actually propagate your beliefs. Another example: when I talked to one of the most curious and polymathic people I know about a controversial topic, his immediate response was “but does that even matter?” On other topics, he would have followed his curiosity to play around with the ideas; on this one, he tried to block it off as quickly as possible.
I was in a pretty similar mindset for a long time, and required a strong sense of social safety before I managed to get myself out of it. My experience since then has been that, when you move from this kind of blocked partial acceptance of controversial beliefs, to a mindset where you’re actually able to follow them wherever they lead, there are a lot of important implications. I want to keep this post pretty meta so that more people feel comfortable engaging with it, but as one fairly milquetoast example: after seeing just how strong self-deception around taboos can be in humans, it seems pretty important to prevent anything similar from happening in AIs.
And just like with alignment, I think that all of this is giving us clues towards a whole new ontology that actually conveys very important principles about how the world works. Trying to do AI governance in the standard political ontology really feels to me like trying to do alignment research while stuck in the ML ontology. More on this in other posts, but for now I hope that this post helps other LWers better understand where I’m coming from.
The ML ontology and the alignment ontology
inherent chaotic nature of social and political processes
Everything seems inherently chaotic until you understand it well! The motion of the planets across the sky seems very arbitrary until you understand Kepler’s laws, and so on.
Re “simulations”, the easiest way to build a simulation of something is to have a principled model/theory of it.
My post on pessimization talks about a bunch of the mechanisms by which you might have negative impact.
I have some posts in the works on virtue ethics, but for now probably the most relevant thing I’ve written is this sequence on replacing fear. My sense is that a lot of self-deception is caused by fear-based motivations.
good post, thanks.
one thing I’d highlight: there’s a point where you conflate two claims that feel very different to me:
“It seems plausible that the best thing to do if you really take AI x-risk seriously is to just stop working on AI at all.”
And that’s what I’ve been trying to say this whole time, whenever anyone asks me about my career. That I don’t want to try to have a big impact, if I can’t be certain that that impact will be positive rather than negative for the world — and I can’t be certain.
I think that there are a bunch of “AI safety” people such that the world would be better off if they stopped working on AI at all. But that doesn’t mean that they (or anyone else) should be aiming to have certainty of positive impact—that’s a very high bar.
Instead, one way I think about it is that there’s a skill of avoiding self-deception (and being virtuous more generally), and the more you cultivate this skill, the more you’re able to have a robustly positive impact even when you’re not certain.
Nice post. I like this distinction, and expect that I’ll use it going forward.
The goals compete with each other for focus, but not very strategically and not strategically without our knowledge
I think that the prevalence of self-deception weighs heavily against this claim. Status-seeking behaviors (and ego maintenance more generally) seem to happen largely on the edges of one’s knowledge. But those are pretty paradigmatic examples of subgoals scheming against each other.
both of your points seem to implicitly assume that I am wrong, rather than demonstrating that I am wrong.
You were portraying the “we’re confused about agency” position as being that agency is “pre-paradigmatic”. I think that’s a mischaracterization, and corrected it. This doesn’t implicitly assume that you’re wrong.
However, I accept that my claim “this is a bad model of how scientific progress works” does implicitly rely on the idea that there’s a clean new paradigm to be discovered for bounded agents. I didn’t have that cached in my head as the core crux between us, but I’d phrase it differently now that I do.
Huh, looks like the list of Astra mentors has changed significantly. In particular, all the Forethought people are off it now (maybe other changes I’m not tracking). Not sure if that reflects a substantive change or a PR optimization.
I hadn’t thought of the simulacra levels connection; that’s useful, thanks. Though for what it’s worth all of these things feel closely related to me. At simulacrum level 4 you should think of words as similar to other actions, i.e. you calculate the expected utility of uttering some sequence of words/morphemes, rather than distinguishing “true” from “false” from “meaningless” sentences. (I was intending the “worst enemy” thing to gesture at the idea that, given that you know that this person is trying to hurt you with their words, there’s not even really a sense in which you can find their words meaningful.)