Oops, good catch. It should have linked to this: https://www.lesswrong.com/posts/FuGfR3jL3sw6r8kB4/richard-ngo-s-shortform?commentId=W9N9tTbYSBzM9FvWh (and I’ve changed the link now).
Here is a broad sketch of how I’d like AI governance to go. I’ve written this in the form of a “plan” but it’s not really a sequential plan, more like a list of the most important things to promote.
- Identify mechanisms by which the US government could exert control over the most advanced AI systems without strongly concentrating power. For example, how could the government embed observers within major AI labs who report to a central regulatory organization, without that regulatory organization having strong incentives and the ability to use its power against its political opponents?
  - In practice I expect this will involve empowering US elected officials (e.g. via greater transparency) to monitor and object to misbehavior by the executive branch.
- Create common knowledge between the US and China that the development of increasingly powerful AI will magnify their own internal conflicts (and empower rogue states) disproportionately more than it empowers them against each other. So instead of a race to world domination, in practice they will face a “race to stay standing”.
  - Rogue states will be empowered because human lives will be increasingly fragile in the face of AI-designed WMDs. This means that rogue states will be able to threaten superpowers with “mutually assured genocide” (though I’m a little wary of spreading this as a meme, and need to think more about ways to make it less self-fulfilling).
- Set up channels for flexible, high-bandwidth cooperation between AI regulators in China and the US (including the “AI regulators” in each country who try to enforce good behavior from the rest of the world).
- Advocate for an ideology roughly like the one I sketched out here, as a consensus alignment target for AGIs.
This is of course all very vague; I’m hoping to flesh it out much more over the coming months, and would welcome thoughts and feedback. Having said that, I’m spending relatively little of my time on this (and focusing on technical alignment work instead).
Here is the broad technical plan that I am pursuing with most of my time (with my AI governance agenda taking up most of my remaining time):
1. Mathematically characterize a scale-free theory of intelligent agency which describes intelligent agents in terms of interactions between their subagents.
   - A successful version of this theory will retrodict phenomena like the Waluigi effect, solve theoretical problems like the five-and-ten problem, and make new high-level predictions about AI behavior.
2. Identify subagents (and subsubagents, and so on) within neural networks by searching their weights and activations for the patterns of interaction between subagents that this theory predicts.
   - A helpful analogy is how Burns et al. (2022) search for beliefs inside neural networks based on the patterns that probability theory predicts (see the sketch below this list). However, I’m not wedded to any particular search methodology.
3. Characterize the behaviors associated with each subagent to build up “maps” of the motivational systems of the most advanced AI systems.
   - This would ideally give you explanations of AI behavior whose quality scales with how much effort you put in. E.g. you might be able to predict 80% of the variance in an AI’s choices by looking at which highest-level subagents are activated, then 80% of the remaining variance by looking at which subsubagents are activated, and so on.
4. Monitor patterns of activations of different subagents to do lie detection, anomaly detection, and other useful things.
   - This wouldn’t be fully reliable—e.g. there’d still be some possible failures where low-level subagents activate in ways that, when combined, lead to behavior that’s very surprising given the activations of high-level subagents. (ARC’s research seems to be aimed at these worst-case examples.) However, I expect it would be hard even for AIs with significantly superhuman intelligence to deliberately contort their thinking in this way. And regardless, in order to solve the worst-case examples it seems productive to try to solve the average-case examples first.
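To gesture at what the search in step 2 might look like in practice, here is a rough sketch of the Burns et al. (2022) analogy: learn a linear probe over activations that satisfies the pattern probability theory predicts for beliefs, namely that a statement and its negation get probabilities summing to one. This is my own illustrative reconstruction, not code from the plan; the variable names and training details are assumptions, and in the actual plan the predicted patterns would come from the step 1 theory of subagent interactions rather than from probability theory.

```python
import torch
import torch.nn as nn

# Rough sketch of the kind of consistency-based probe Burns et al. (2022) use:
# find a direction in activation space whose induced "probabilities" obey the
# pattern that probability theory predicts for beliefs, p(x) + p(not-x) ≈ 1.

class LinearProbe(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.linear = nn.Linear(d_model, 1)

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        # acts: [n, d_model] hidden activations -> probability in (0, 1) per example
        return torch.sigmoid(self.linear(acts)).squeeze(-1)

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    consistency = (p_pos - (1.0 - p_neg)) ** 2      # a statement and its negation should sum to 1
    confidence = torch.minimum(p_pos, p_neg) ** 2   # discourage the degenerate p ≈ 0.5 solution
    return (consistency + confidence).mean()

def train_probe(acts_pos: torch.Tensor, acts_neg: torch.Tensor, steps: int = 1000, lr: float = 1e-3):
    """acts_pos / acts_neg: [n_pairs, d_model] activations for statements and their negations."""
    probe = LinearProbe(acts_pos.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(steps):
        loss = ccs_loss(probe(acts_pos), probe(acts_neg))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe
```

The analogous search for subagents would swap out the probabilistic-consistency constraint for whatever interaction patterns the scale-free theory predicts, and steps 3 and 4 would then build their “maps” and monitoring on top of whatever such a search finds.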
I’m focusing on step 1 right now. Note that my pursuit of it is overdetermined—I’m excited enough about finding a scale-free theory of intelligent agency that I’d still be working on it even if I didn’t think steps 2-4 would work, because I have a strong heuristic that pursuing fundamental knowledge is good. Trying to backchain from an ambitious goal to reasons why a fundamental scientific advance would be useful for achieving that goal feels pretty silly from my perspective. But since people keep asking me why step 1 would help with alignment, I decided to write this up as a central example.
Yepp, see also some of my speculations here: https://x.com/richardmcngo/status/1815115538059894803?s=46
Have you read The Metamorphosis of Prime Intellect? Fits the bill.
Interesting. Got a short summary of what’s changing your mind?
I now have a better understanding of coalitional agency, which I will be interested in your thoughts on when I write it up.
> Our government is determined to lose the AI race in the name of winning the AI race.
>
> The least we can do, if prioritizing winning the race, is to try and actually win it.
This is a bizarre pair of claims to make. But I think it illustrates a surprisingly common mistake from the AI safety community, which I call “jumping down the slippery slope”. More on this in a forthcoming blog post, but the key idea is that when you look at a situation from a high level of abstraction, it often seems like sliding down a slippery slope towards a bad equilibrium is inevitable. From that perspective, the sort of people who think in terms of high-level abstractions feel almost offended when people don’t slide down that slope. On a psychological level, the short-term benefit of “I get to tell them that my analysis is more correct than theirs” outweighs the long-term benefit of “people aren’t sliding down the slippery slope”.

One situation where I sometimes get this feeling is when a shopkeeper charges less than the market rate, because they want to be kind to their customers. This is typically a redistribution of money from a wealthier person to less wealthy people; and either way it’s a virtuous thing to do. But I sometimes actually get annoyed at them, and itch to smugly say “listen, you dumbass, you just don’t understand economics”. It’s like a part of me thinks of reaching the equilibrium as a goal in itself, whether or not we actually like the equilibrium.
This is obviously a much worse thing to do in AI safety. Relevant examples include Situational Awareness and safety-motivated capability evaluations (e.g. “building great capabilities evals is a thing the labs should obviously do, so our work on it isn’t harmful”). It feels like Zvi is doing this here too. Why is trying to actually win it the least we can do? Isn’t this exactly the opposite of what would promote crucial international cooperation on AI? Is it really so annoying when your opponents are shooting themselves in the foot that it’s worth advocating for them to stop doing that?
It kinda feels like the old joke:
On a beautiful Sunday afternoon in the midst of the French Revolution, the revolting citizens lead a priest, a drunkard and an engineer to the guillotine. They ask the priest if he wants to face up or down when he meets his fate. The priest says he would like to face up so he will be looking towards heaven when he dies. They raise the blade of the guillotine and release it. It comes speeding down and suddenly stops just inches from his neck. The authorities take this as divine intervention and release the priest.
The drunkard comes to the guillotine next. He also decides to die face up, hoping that he will be as fortunate as the priest. They raise the blade of the guillotine and release it. It comes speeding down and suddenly stops just inches from his neck. Again, the authorities take this as a sign of divine intervention, and they release the drunkard as well.
Next is the engineer. He, too, decides to die facing up. As they slowly raise the blade of the guillotine, the engineer suddenly says, “Hey, I see what your problem is …”
Approximately every contentious issue has caused tremendous amounts of real-world pain. Therefore the choice of which issues to police contempt about becomes a de facto political standard.
I think my thought process when I typed “risk-averse money-maximizer” was that an agent could be risk-averse (in which case it wouldn’t be an EUM) and then separately be a money-maximizer.
But I didn’t explicitly think “the risk-aversion would be with regard to utility not money, and risk-aversion with regard to money could still be risk-neutral with regard to utility”, so I appreciate the clarification.
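To spell out the distinction with a standard toy example (my illustration, not from the thread): an agent maximizing the expected value of $u(m) = \sqrt{m}$ over money $m$ is an EUM, and is risk-neutral in utility by construction, but it is risk-averse in money. A guaranteed \$1 gives it utility $\sqrt{1} = 1$, while a 50/50 gamble between \$0 and \$2 gives expected utility $\tfrac{1}{2}\sqrt{0} + \tfrac{1}{2}\sqrt{2} \approx 0.71$, so it takes the sure \$1 even though both options have expected monetary value \$1.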
Your example bet is a probabilistic mixture of two options: $0 and $2. The agent prefers one of the options individually (getting $2) over any probabilistic mixture of getting $0 and $2.
In other words, your example rebuts the claim that an EUM can’t prefer a probabilistic mixture of two options to the expectation of those two options. But that’s not the claim I made.
Hmm, this feels analogous to saying “companies are an unnecessary abstraction in economic theory, since individuals could each make separate contracts about how they’ll interact with each other. Therefore we can reduce economics to studying isolated individuals”.
But companies are in fact a very useful unit of analysis. For example, instead of talking about the separate ways in which each person in the company has committed to treating each other person in the company, you can talk about the HR policy which governs all interactions within the company. You might then see emergent effects (like political battles over what the HR policies are) which are very hard to reason about when taking a single-agent view.
Similarly, although in principle you could have any kind of graph of which agents listen to which other agents, in practice I expect that realistic agents will tend to consist of clusters of agents which all “listen to” each other in some ways. This is both because clustering is efficient (hence animals having bodies made up of clusters of cells, companies being made up of clusters of individuals, etc.) and because even defining what counts as a single agent is itself a kind of clustering. That is, I think that the first step of talking about “individual rationality” is implicitly defining which coalitions qualify as individuals.
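As a toy illustration of that last point (entirely my own, with made-up agent names): if you write down the “listens to” relation as a directed graph, one natural way of carving out candidate “individuals” is to take the groups of agents that all listen to each other, i.e. the strongly connected components.

```python
import networkx as nx

# Toy example: treat "who listens to whom" as a directed graph and identify
# candidate "individuals" as groups of agents that all listen to each other
# (strongly connected components). Agent names are made up for illustration.

listens_to = [
    ("a1", "a2"), ("a2", "a1"),   # a1 and a2 listen to each other -> one cluster
    ("a3", "a4"), ("a4", "a3"),   # a3 and a4 likewise
    ("a1", "a3"),                 # a one-way link between the clusters
]

G = nx.DiGraph(listens_to)
coalitions = [sorted(c) for c in nx.strongly_connected_components(G)]
print(coalitions)  # e.g. [['a1', 'a2'], ['a3', 'a4']]
```

The choice of clustering criterion is doing real work here, which is the sense in which defining “individuals” is already a clustering decision.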
> a superintelligent AI probably has a pretty good guess of the other AI’s real utility function based on its own historical knowledge, simulations, etc.
This seems very unclear to me—in general it’s not easy for agents to predict the goals of other agents at their own level of intelligence, because the amount of intelligence aimed at deception increases in proportion to the amount of intelligence aimed at discovering that deception.
(You could look at the AI’s behavior from when it was less intelligent, but then—as with humans—it’s hard to distinguish sincere change from improvement at masking undesirable goals.)
But regardless, that’s a separate point. If you can do that, you don’t need your mechanism above. If you can’t, then my objection still holds.
One argument for being optimistic: the universe is just very big, and there’s a lot to go around. So there’s a huge amount of room for positive-sum bargaining.
Another: at any given point in time, few of the agents that currently exist would want their goals to become significantly simplified (all else equal). So there’s a strong incentive to coordinate to reduce competition on this axis.
Lastly: if, at each point in time, the agents who are currently alive are in conflict with potentially-simpler future agents in a very destructive way, then they should all just Do Something Else. In particular, if there’s some decision-theoretic argument roughly like “more powerful agents should continue to spend some of their resources on the values of their less-powerful ancestors, to reduce the incentives for inter-generational conflict”, even agents with very simple goals might be motivated by it. I call this “the generational contract”.
I found this a very interesting question to try to answer. My first reaction was that I don’t expect EUMs with explicit utility functions to be competitive enough for this to be very relevant (like how purely symbolic AI isn’t competitive enough with deep learning to be very relevant).
But then I thought about how companies are close-ish to having an explicit utility function (maximize shareholder value) which can be merged with others (e.g. via acquisitions). And this does let them fundraise better, merge into each other, and so on.
Similarly, we can think of cases where countries were joined together by strategic marriages (the unification of Spain, say) as only being possible because the (messy, illegible) interests of each country were rounded off to the (relatively simple) interests of its royals. And so the royals being guaranteed power over the merged entity via marriage allowed such mergers to happen much more easily than if a merger had to serve the interests of the “country as a whole”.
For a more modern illustration: suppose that the world ends up with a small council who decide how AGI goes. Then countries with a dictator could easily bargain to join this coalition in exchange for their dictator getting a seat on this council. Whereas democratic countries would have a harder time doing so, because they might feel very internally conflicted about their current leader gaining the level of power that they’d get from joining the council.
(This all feels very related to Seeing Like a State, which I’ve just started reading.)
So upon reflection: yes, it’s reasonable to interpret me as trying to solve the problem of getting the benefits of being governed by a set of simple and relatively legible goals, without the costs that are usually associated with that.
Note that I say “legible goals” instead of “EUM” because in my mind you can be an EUM with illegible goals (like a neural network that implements EUM internally), or a non-EUM with legible goals (like a risk-averse money-maximizer), and merging is more bottlenecked on legibility than EUM-ness.
I’ve changed my mind. Coming up with the ideas above has given me a better sense of how agent foundations progress could be valuable.
More concretely, I used to focus more on the ways in which agent foundations applied in the infinite-compute limit, which is not a setting that I think is very relevant. But I am more optimistic about agent foundations as the study of idealized multi-agent interactions. In hindsight a bunch of the best agent foundations research (e.g. logical induction) was both at once, but I’d only been viewing it as the former.
More abstractly, this update has made me more optimistic about conceptual progress in general. I guess it’s hard to viscerally internalize the extent to which it’s possible for concepts to break and reform in new configurations without having experienced that. (People who were strongly religious then deconverted have probably had a better sense of this all along, but I never had that experience.)
I’m glad I made the compromise earlier; it ended up reducing the consequences of this mistake somewhat (though I’d guess including one week of agent foundations in the course had a pretty small effect).
I think this addresses the problem I’m discussing only in the case where the source code contains an explicit utility function. You can then create new source code by merging those utility functions.
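To make that concrete (my notation, not anything from the thread): if the source code of agents $A$ and $B$ exposes explicit utility functions $U_A$ and $U_B$, the merged source code can simply optimize a bargained weighted combination

$$U_{AB} = \lambda U_A + (1 - \lambda)\, U_B, \qquad \lambda \in [0, 1],$$

with the weight $\lambda$ fixed by whatever bargaining solution the two agents agree on.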
But in the case where it doesn’t (e.g. the source code is an uninterpretable neural network) you are left with the same problem.
Edited to add: Though even when the utility function is explicit, it seems like the benefits of lying about your source code could outweigh the cost of changing your utility function. For example, suppose A and B are bargaining, and A says “you should give me more cake because I get very angry if I don’t get cake”. Even if this starts off as a lie, it might then be in A’s interests to use your mechanism above to self-modify into A’ that does get very angry if it doesn’t get cake, and which therefore has a better bargaining position (because, under your protocol, it has “proved” that it was A’ all along).
Consider you and me merging (say, in a marriage). Suppose that all points on the Pareto frontier involve us pursuing a fully-consistent strategy. But if some decisions are your responsibility, and other decisions are my responsibility, then we might end up with some of our actions being inconsistent with others (say, if we haven’t had a chance to communicate before deciding). That’s not on the Pareto frontier.
What is on the Pareto frontier is you being dictator, and then accounting for my utility function when making your dictatorial decisions. But of course this is something I will object to, because in any realistic scenario I wouldn’t trust you enough to give you dictatorial power over me. Once you have that power, continuing to account for my utility is strongly non-incentive-compatible for you. So we’re more likely to each want to retain some power, even if it sometimes causes inefficiency. (The same is true on the level of countries, which accept a bunch of inefficiency from democratic competition in exchange for incentive-compatibility and trust.)
Another way of putting this: I’m focusing on the setting where you cannot do arbitrary merges; you can only do merges that are constructible via some set of calls to the existing agents. It’s often impossible to construct a fully-consistent merged agent without concentrating power in ways that the original agents would find undesirable (though sometimes it is possible, e.g. with I-cut-you-choose cake-cutting). So in this setting we need a different conception of rationality than Pareto-optimality.
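To make the I-cut-you-choose example concrete, here is a toy sketch (entirely my own illustration; the function names and valuations are made up) of a merge that is constructible purely via calls to the existing agents, with no trusted dictator:

```python
# I-cut-you-choose: the cutter proposes a split it considers exactly fair,
# the chooser picks its preferred piece. Assumes cutter_value is increasing in x.

def i_cut_you_choose(cutter_value, chooser_value, cake=1.0, tol=1e-6):
    """cutter_value / chooser_value: functions mapping a cut point x in [0, cake]
    to how much that agent values the left piece [0, x]."""
    # The cutter binary-searches for a cut it considers a 50/50 split.
    lo, hi = 0.0, cake
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cutter_value(mid) < cutter_value(cake) / 2:
            lo = mid
        else:
            hi = mid
    cut = (lo + hi) / 2

    # The chooser takes whichever piece it values more; the cutter gets the other.
    left, right = (0.0, cut), (cut, cake)
    left_value = chooser_value(cut)
    right_value = chooser_value(cake) - chooser_value(cut)
    chooser_piece = left if left_value >= right_value else right
    cutter_piece = right if chooser_piece is left else left
    return {"cutter": cutter_piece, "chooser": chooser_piece}

# Toy example: the cutter values the cake uniformly; the chooser values the right half more.
print(i_cut_you_choose(lambda x: x, lambda x: x ** 2))
```

Each party guarantees itself at least half the cake by its own valuation, without either needing dictatorial power over the other, which is the sense in which this kind of merge sidesteps the trust problem above.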
Suppose you’re in a setting where the world is so large that you will only ever experience a tiny fraction of it directly, and you have to figure out the rest via generalization. Then your argument doesn’t hold up: shifting the mean might totally break your learning. But I claim that the real world is like this. So I am inherently skeptical of any result (like most convergence results) that relies on just trying approximately everything and gradually learning which options to prefer and disprefer.
I just keep coming back to this comment, because there are a couple of lines in it that are downright poetic. I particularly appreciate:
“BE NOT AFRAID.” said the ants.
and
Yeah, and this universe’s got time in it, though.
and
can you imagine how horrible the world would get if we didn’t honour our precommitments?
Have you considered writing stories of your own?
Consider the version of the 5-and-10 problem in which one subagent is assigned to calculate U | take 5, and another calculates U | take 10. The overall agent solves the 5-and-10 problem iff the subagents reason about each other in the “right ways”, or have the right type of relationship to each other. What that specifically means seems like the sort of question that a scale-free theory of intelligent agency might be able to answer.
I’m mostly trying to extend pure mathematical frameworks (particularly active inference and a cluster of ideas related to geometric rationality, including picoeconomics and ergodicity economics).