Formerly alignment and governance researcher at DeepMind and OpenAI. Now independent.
Richard_Ngo
Consequentialism and utility functions or policies could in principle be as much about virtues and integrity as about hamburgers, but hamburgers are more legible and easier to administer.
Here’s one concrete way in which this isn’t true: one common simplifying assumption in economics is that goods are homogeneous, and therefore that you’re indifferent about who to buy from. However, virtuous behavior involves rewarding people you think are more virtuous (e.g. by preferentially buying things from them).
In other words, economics is about how agents interact with each other via exchanging goods and services, while virtues are about how agents interact with each other more generally.
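To make that concrete, here's a toy sketch (the sellers, virtue scores, and numbers are all mine, purely illustrative) of the difference between a buyer who treats goods as homogeneous and one who preferentially rewards the seller they judge more virtuous:

```python
import random
from collections import Counter

# Hypothetical toy market: two sellers offer an identical good at the same price.
# Under the standard homogeneity assumption the buyer is indifferent; under
# "virtue-weighted" buying, the buyer preferentially rewards the seller they
# judge more virtuous. All names and numbers here are made up for illustration.

sellers = {"A": {"virtue": 0.9}, "B": {"virtue": 0.3}}

def homogeneous_choice(sellers):
    # Indifferent buyer: picks uniformly at random.
    return random.choice(list(sellers))

def virtue_weighted_choice(sellers):
    # Buyer rewards perceived virtue: choice probability proportional to it.
    names = list(sellers)
    weights = [sellers[n]["virtue"] for n in names]
    return random.choices(names, weights=weights, k=1)[0]

print(Counter(virtue_weighted_choice(sellers) for _ in range(1000)))
# Roughly 3:1 in favor of A: virtue-weighted buying channels business toward
# the seller judged more virtuous, an incentive the homogeneous-goods model
# simply can't represent.
```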
Sufficiently different versions of yourself are just logically uncorrelated with you and there is no game-theoretic reason to account for them.
Seems odd to make an absolute statement here. The more different a version of yourself is, the less correlated it is with you, but there’s still some correlation. And UDT should also be applicable to interactions with other people, who are typically different from you in a whole bunch of ways.
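As a toy illustration of the "still some correlation" point (my own payoffs and probabilities, not anything from the original discussion): in a one-shot prisoner's dilemma against a counterpart who matches your decision with probability p, the case for cooperating weakens smoothly as p falls rather than vanishing the moment the counterpart stops being an exact copy.

```python
# Toy prisoner's dilemma against a counterpart who is a "version of you" to
# varying degrees. p is the assumed probability that their choice matches
# yours: 1.0 for an exact copy, lower as they become less similar to you.
# Payoffs are standard illustrative ones (T=4 > R=3 > P=1 > S=0).

PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 4, ("D", "D"): 1}

def expected_value(my_move, p):
    same = PAYOFF[(my_move, my_move)]
    diff = PAYOFF[(my_move, "D" if my_move == "C" else "C")]
    return p * same + (1 - p) * diff

for p in (1.0, 0.9, 0.7, 0.5):
    print(p, expected_value("C", p), expected_value("D", p))
# With these payoffs cooperation wins whenever p > 2/3, and its advantage
# shrinks gradually as p falls; there's no discontinuity at "not an exact copy".
```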
there’s often no internal conflict when someone is caught up in some extreme form of the morality game
Belated reply, sorry, but I basically just think that this is false—analogous to a dictator who cites parades where people are forced to attend and cheer as evidence that his country lacks internal conflict. Instead, the internal conflict has just been rendered less legible.
In the subagents frame, I would say that the subagents have an implicit contract/agreement that any one of them can seize control, if doing so seems good for the overall agent in terms of power or social status.
Note that this is an extremely non-robust agent design! In particular, it allows subagents to gain arbitrary amounts of power simply by lying about their intentions. If you encounter an agent which considers itself to be structured like this, you should have a strong prior that it is deceiving itself about the presence of more subtle control mechanisms.
Crossposted from Twitter:
This year I’ve been thinking a lot about how the western world got so dysfunctional. Here’s my rough, best-guess story:
1. WW2 gave rise to a strong taboo against ethnonationalism. While perhaps at first this taboo was valuable, over time it also contaminated discussions of race differences, nationalism, and even IQ itself, to the point where even truths that seemed totally obvious to WW2-era people also became taboo. There’s no mechanism for subsequent generations to create common knowledge that certain facts are true but usefully taboo—they simply act as if these facts are false, which leads to arbitrarily bad policies (e.g. killing meritocratic hiring processes like IQ tests).
2. However, these taboos would gradually have lost power if the west (and the US in particular) had maintained impartial rule of law and constitutional freedoms. Instead, politicization of the bureaucracy and judiciary allowed them to spread. This was enabled by the “managerial revolution” under which govt bureaucracy massively expanded in scope and powers. Partly this was a justifiable response to the increasing complexity of the world (and various kinds of incompetence and nepotism within govts) but in practice it created a class of managerial elites who viewed their intellectual merit as license to impose their ideology on the people they governed. This class gains status by signaling commitment to luxury beliefs. Since more absurd beliefs are more costly-to-fake signals, the resulting ideology is actively perverse (i.e. supports whatever is least aligned with their stated core values, like Hamas).
3. On an ideological level the managerial revolution was facilitated by a kind of utilitarian spirit under which technocratic expertise was considered more important for administrators than virtue or fidelity to the populace. This may have been a response to the loss of faith in traditional elites after WW1. The enlightened liberal perspective wanted to maintain a fiction of equality, under which administrators were just doing a job the same as any other, rather than taking on the heavy privileges and responsibilities associated with (healthy) hierarchical relationships.
4. On an economic level, the world wars led to centralization of state power over currency and the abandonment of the gold standard. While at first govts tried to preserve the fiction that fiat currencies were relevantly similar to gold-backed currencies, again there was no mechanism for later generations to create common knowledge of what had actually been done and why. The black hole of western state debt that will never be repaid creates distortions across the economy, which few economists actually grapple with because they are emotionally committed to thinking of western govts as “too big to fail”.
5. All of this has gradually eroded the strong, partly-innate sense of virtue (and respect for virtuous people) that used to be common. Virtue can be seen as a self-replicating memeplex that incentivizes ethical behavior in others—e.g. high-integrity people will reward others for displaying integrity. This is different from altruism, which rewards others regardless of their virtue. Indeed, it’s often directly opposed to altruism, since altruists disproportionately favor the least virtuous people (because they’re worse-off). Since consequentialists think that morality is essentially about altruism, much moral philosophy actively undermines ethics. So does modern economics, via smuggling in the assumption that utility functions represent selfish preferences.
6. All of this is happening against a backdrop of rapid technological progress, which facilitates highly unequal control mechanisms (e.g. a handful of people controlling global newsfeeds or AI values). The bad news is that this enables ideologies to propagate even when they are perverse and internally dysfunctional. The good news is that it makes genuine truth-seeking and virtuous cooperation increasingly high-leverage.
Addenda:
I led with the ethnonationalism stuff because it’s the most obvious, but in some sense it’s just a symptom: a functional society would have rejected the taboos when they got too obviously wrong (e.g. by defending Murray).
The deeper issue seems to be a kind of toxic egalitarianism that is against accountability, hierarchy or individual agency in general. You can trace this thread (with increasing uncertainty) thru e.g. Wilson, Marx, the utilitarians, and maybe even all the way back to Jesus.
Michael Vassar thinks of it as Germanic “Kultur” (as opposed to “Zivilisation”); I’m not well-read enough to evaluate that claim though. I’m more confident about it being driven by fear-based motivations, especially envy—as per Girard, Lacan, etc.
Some prescriptions I’m currently considering:
- reviving virtue ethics
- AI-based tools for facilitating small, high-trust, high-accountability groups. Even if we can’t have freedom of association or reliable arbitration via legal or corporate mechanisms, perhaps we can still have it via social mechanisms (especially as more and more people become functionally post-economic)
- better therapeutic interventions, especially oriented to resolving fear of death

But I spend most of my time trying to figure out the formal theory that encodes these intuitions—in which agents are understood in terms of goals (in the predictive processing sense) and boundaries rather than utility functions and credences. That feels upstream of a lot of other stuff. More here, though it’s a bit out of date.
A related post I wrote recently.
+1, though, to ChristianKl’s observation below that Geoffrey Miller is unrepresentative of MAGA because he’s already part of the broader AI safety community.
You might be interested in this post of mine which makes some related claims.
(Interested to read your post more thoroughly but for now have just skimmed it and not sure when I’ll find time to engage more.)
FWIW your writings on neuroscience are a central example of “real thinking” in my mind—it seems like you’re trying to actually understand things in a way that’s far less distorted by social pressures and incentives than almost any other writing in the field.
Reading this post led me to find a twitter thread arguing (with a bunch of examples):
One of the curious things about von Neumann was his ability to do extremely impressive technical work while seemingly missing all the big insights.
I then responded to it with my own thread arguing:
I’d even go further—I think we’re still recovering from Von Neumann’s biggest mistakes:
1. Implicitly basing game theory on causal decision theory
2. Founding utility theory on the independence axiom
3. Advocating for nuking the USSR as soon as possible

I’m not confident in my argument, but it suggests the possibility that von Neumann’s concern about his legacy was tracking something important (though, even if so, it’s unlikely that feeling insecure was a good response).
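On point 2, the standard illustration of why the independence axiom is contestable is the Allais paradox: many people prefer 1A over 1B but 2B over 2A, and no expected-utility maximizer can do both. A minimal sketch (my gloss, with the textbook probabilities and a purely illustrative utility function):

```python
# Allais paradox, as a check that no utility function u can rank 1A above 1B
# while also ranking 2B above 2A. Outcomes are in millions of dollars.

def eu(lottery, u):
    """Expected utility of a lottery given a utility function u over outcomes."""
    return sum(p * u(x) for p, x in lottery)

L1A = [(1.00, 1)]
L1B = [(0.10, 5), (0.89, 1), (0.01, 0)]
L2A = [(0.11, 1), (0.89, 0)]
L2B = [(0.10, 5), (0.90, 0)]

# EU(1A) - EU(1B) = 0.11*u(1) - 0.10*u(5) - 0.01*u(0)
# EU(2A) - EU(2B) = 0.11*u(1) - 0.10*u(5) - 0.01*u(0)
# The two gaps are algebraically identical, whatever u is:
u = lambda x: x ** 0.5          # any example utility function works here
print(eu(L1A, u) - eu(L1B, u))  # equals...
print(eu(L2A, u) - eu(L2B, u))  # ...this, for every u
```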
If someone predicts in advance that something is obviously false, and then you come to believe that it’s false, then you should update not just towards thought processes which would have predicted that the thing is false, but also towards thought processes which would have predicted that the thing is obviously false. (Conversely, if they predict that it’s obviously false, and it turns out to be true, you should update more strongly against their thought processes than if they’d just predicted it was false.)
IIRC Eliezer’s objection to bioanchors can be reasonably interpreted as an advance prediction that “it’s obviously false”, though to be confident I’d need to reread his original post (which I can’t be bothered to do right now).
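To put rough numbers on that kind of update (a stylized Bayes calculation; the hit rates below are made up, purely for illustration): a stronger claim like "obviously false" acts as a larger likelihood ratio on your estimate of the predictor's thought process, in both directions.

```python
# Update on "this pundit's thought process is reliable" after seeing the outcome.

def update_factor(p_outcome_if_reliable, p_outcome_if_unreliable):
    """Likelihood ratio multiplying your odds that the pundit is reliable."""
    return p_outcome_if_reliable / p_outcome_if_unreliable

# Assumed (illustrative) hit rates when the claim really is false:
#   a reliable pundit saying "obviously false" is right 95% of the time,
#   saying merely "false" is right 80% of the time,
#   an unreliable pundit is right 55% of the time either way.
print(update_factor(0.95, 0.55))   # ~1.73  said "obviously false", it was false
print(update_factor(0.80, 0.55))   # ~1.45  said "false", it was false
print(update_factor(0.05, 0.45))   # ~0.11  said "obviously false", it was true
print(update_factor(0.20, 0.45))   # ~0.44  said "false", it was true
# The bolder statement earns a bigger boost when vindicated and a much bigger
# penalty when refuted, matching the asymmetry described above.
```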
It’s not that moderates and radicals are trying to answer different questions (and the questions moderates are answering are epistemically easier like physics).
That seems totally wrong. Moderates are trying to answer questions like “what are some relatively cheap interventions that AI companies could implement to reduce risk assuming a low budget?” and “how can I cause AI companies to marginally increase that budget?” These questions are very different from—and much easier than—the ones the radicals are trying to answer, like “how can we radically change the governance of AI to prevent x-risk?”
The argument “there are specific epistemic advantages of working as a moderate” isn’t just a claim about categories that everyone agrees exist; it’s also a way of carving up the world. However, you can carve up the world in very misleading ways depending on how you lump different groups together. For example, if a post distinguished “people without crazy-sounding beliefs” from “people with crazy-sounding beliefs”, the latter category would lump together truth-seeking nonconformists with actual crazy people. There’s no easy way of figuring out which categories should be treated as useful vs useless, but the evidence Eliezer cites does seem relevant.
On a more object level, my main critique of the post is that almost all of the bullet points are even more true of, say, working as a physicist. And so structurally speaking I don’t know how to distinguish this post from one arguing “one advantage of looking for my keys closer to a streetlight is that there’s more light!” I.e. it’s hard to know the extent to which these benefits come specifically from focusing on less important things, and therefore are illusory, versus the extent to which you can decouple these benefits from the costs of being a “moderate”.
Yes, that can be a problem. I’m not sure why you think that’s in tension with my comment though.
Thank you Habryka (and the rest of the mod team) for the effort and thoughtfulness you put into making LessWrong good.
I personally have had few problems with Said, but this seems like an extremely reasonable decision. I’m leaving this comment in part to help make you feel empowered to make similar decisions in the future when you think it necessary (and ideally, at a much lower cost of your time).
I think one effect you’re missing is that the big changes are precisely the ones that tend to rely on factors about which it’s hard to specify important technical details. E.g. “should we move our headquarters to London” or “should we replace the CEO” or “should we change our mission statement” are mostly going to be driven by coalitional politics + high-level intuitions and arguments. Whereas “should we do X training run or Y training run” is more amenable to technical discussion, but also has less lasting effects.
people in companies care about technical details so to be persuasive you will have to be familiar with them
Big changes within companies are typically bottlenecked much more by coalitional politics than knowledge of technical details.
By thinking about reward in this way, I was able to predict[1] and encourage the success of this research direction.
Congratulations on doing this :) More specifically, I think there are two parts of making predictions: identifying a hypothesis at all, and then figuring out how likely the hypothesis is to be true or false. The former part is almost always the hard part, and that’s the bit where the “reward reinforces previous computations” frame was most helpful.
(I think Oliver’s pushback in another comment is getting strongly upvoted because, given a description of your experimental setup, a bunch of people aside from you/Quintin/Steve would have assigned reasonable probability to the right answer. But I wanted to emphasize that I consider generating an experiment that turns out to be interesting (as your frame did) to be the thing that most of the points should be assigned for.)
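For readers unfamiliar with the “reward reinforces previous computations” frame, here's a minimal REINFORCE-style sketch (my own illustration, not the experimental setup under discussion) of the literal mechanism: whichever computation produced the sampled action gets its log-probability pushed up in proportion to the reward that followed.

```python
import math, random

# Two-action softmax policy trained with a vanilla policy-gradient update.
# The reward structure and learning rate are arbitrary illustrative choices.

logits = {"a": 0.0, "b": 0.0}
reward = {"a": 1.0, "b": 0.0}
lr = 0.5

def softmax(logits):
    z = {k: math.exp(v) for k, v in logits.items()}
    total = sum(z.values())
    return {k: v / total for k, v in z.items()}

for _ in range(200):
    probs = softmax(logits)
    action = random.choices(list(probs), weights=list(probs.values()))[0]
    r = reward[action]
    # grad of log pi(action) wrt each logit: 1 - pi(action) for the chosen
    # action, -pi(k) for the others. Scaling by r means the computation that
    # produced the rewarded action is the one that gets reinforced.
    for k in logits:
        grad = (1.0 if k == action else 0.0) - probs[k]
        logits[k] += lr * r * grad

print(softmax(logits))  # probability mass concentrates on the rewarded action "a"
```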
Ty for the reply. A few points in response:
Of course, you might not know which problem your insights allow you to solve until you have the insights. I’m a big fan of constructing stylized problems that you can solve, after you know which insight you want to validate.
That said, I think it’s even better if you can specify problems in advance to help guide research in the field. The big risk, then, is that these problems might not be robust to paradigm shifts (because paradigm shifts could change the set of important problems). If that is your concern, then I think you should probably give object-level arguments that solving auditing games is a bad concrete problem to direct attention to. (Or argue that specifying concrete problems is in general a bad thing.)
The bigger the scientific advance, the harder it is to specify problems in advance which it should solve. You can and should keep track of the unresolved problems in the field, as Neel does, but trying to predict specifically which unresolved problems in biology Darwinian evolution would straightforwardly solve (or which unresolved problems in physics special relativity would straightforwardly solve) is about as hard as generating those theories in the first place.
I expect that when you personally are actually doing your scientific research you are building sophisticated mental models of how and why different techniques work. But I think that in your community-level advocacy you are emphasizing precisely the wrong thing—I want junior researchers to viscerally internalize that their job is to understand (mis)alignment better than anyone else does, not to optimize on proxies that someone else has designed (which, by the nature of the problem, are going to be bad proxies).
It feels like the core disagreement is that I intuitively believe that bad metrics are worse than no metrics, because they actively confuse people/lead them astray. More specifically, I feel like your list of four problems is closer to a list of things that we should expect from an actually-productive scientific field, and getting rid of them would neuter the ability for alignment to make progress:
“Right now, by default research projects get one bit of supervision: After the paper is released, how well is it received?” Not only is this not one bit, I would also struggle to describe any of the best scientists throughout history as being guided primarily by it. Great researchers can tell by themselves, using their own judgment, how good the research is (and if you’re not a great researcher that’s probably the key skill you need to work on).
But also, note how anti-empirical your position is. The whole point of research projects is that they get a huge amount of supervision from reality. The job of scientists is to observe that supervision from reality and construct theories that predict reality well, no matter what anyone else thinks about them. It’s not an exaggeration to say that discarding the idea that intellectual work should be “supervised” by one’s peers is the main reason that science works in the first place (see Strevens for more).

“Lacking objective, consensus-backed progress metrics, the field is effectively guided by what a small group of thought leaders think is important/productive to work on.” Science works precisely because it’s not consensus-backed—see my point on empiricism above. Attempts to make science more consensus-backed undermine the ability to disagree with existing models/frameworks. But also: the “objective metrics” of science are the ability to make powerful, novel predictions in general. If you know specifically what metrics you’re trying to predict, the thing you’re doing is engineering. And some people should be doing engineering (e.g. engineering better cybersecurity)! But if you try to do it without a firm scientific foundation you won’t get far.
I think it’s good that “junior researchers who do join are unsure what to work on.” It is extremely appropriate for them to be unsure what to work on, because the field is very confusing. If we optimize for junior researchers being more confident on what to work on, we will actively be making them less truth-tracking, which makes their research worse in the long term.
Similarly, “it’s hard to tell which research bets (if any) are paying out and should be invested in more aggressively” is just the correct epistemic state to be in. Yes, much of the arguing is unproductive. But what’s much less productive is saying “it would be good if we could measure progress, therefore we will design the best progress metric we can and just optimize really hard for that”. Rather, since evaluating the quality of research is the core skill of being a good scientist, I am happy with junior researchers all disagreeing with each other and just pursuing whichever research bets they want to invest their time in (or the research bets they can get the best mentorship when working on).
Lastly, it’s also good that “it’s hard to grow the field”. Imagine talking to Einstein and saying “your thought experiments about riding lightbeams are too confusing and unquantifiable—they make it hard to grow the field. You should pick a metric of how good our physics theories are and optimize for that instead.” Whenever a field is making rapid progress it’s difficult to bridge the gap between the ontology outside the field and the ontology inside the field. The easiest way to close that gap is simply for the field to stop making rapid progress, which is what happens when something becomes a “numbers-go-up” discipline.
I think that e.g. RL algorithms researchers have some pretty deep insights about the nature of exploration, learning, etc.
They have some. But so did Galileo. If you’d turned physics into a numbers-go-up field after Galileo, you would have lost most of the subsequent progress, because you would’ve had no idea which numbers going up would contribute to progress.
I’d recommend reading more about the history of science, e.g. The Sleepwalkers by Koestler, to get a better sense of where I’m coming from.
I strongly disagree. “Numbers-Go-Up Science” is an oxymoron: great science (especially what Kuhn calls revolutionary science) comes from developing novel models or ontologies which can’t be quantitatively compared to previous ontologies.
Indeed, in an important sense, the reason the alignment problem is a big deal in the first place is that ML isn’t a science which tries to develop deep explanations of artificial cognition, but instead a numbers-go-up discipline.
And so the idea of trying to make (a subfield of) alignment more like architecture design, performance optimization or RL algorithms feels precisely backwards—it steers people directly away from the thing that alignment research should be contributing.
Strongly upvoted. Alignment researchers often feel so compelled to quickly contribute to decreasing x-risk that they end up studying non-robust categories that won’t generalize very far, and sometimes actively make the field more confused. I wish that most people doing this were just trying to do the best science they could instead.
That’s a mechanism by which I might overestimate the support for Hamas. But the thing I’m trying to explain is the overall alignment between leftists and Hamas, which is not just a twitter bubble thing (e.g. see university encampments).
More generally, leftists profess many values which are upheld the most by western civilization (e.g. support for sexual freedom, women’s rights, anti-racism, etc). But then in conflicts they often side specifically against western civilization. This seems like a straightforward example of pessimization.