6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa
Tl;dr
AI alignment has a culture clash. On one side, the “technical-alignment-is-hard” / “rational agents” school-of-thought argues that we should expect future powerful AIs to be power-seeking ruthless consequentialists. On the other side, people observe that both humans and LLMs are obviously capable of behaving like, well, not that. The latter group accuses the former of head-in-the-clouds abstract theorizing gone off the rails, while the former accuses the latter of mindlessly assuming that the future will always be the same as the present, rather than trying to understand things. “Alas, the power-seeking ruthless consequentialist AIs are still coming,” sigh the former. “Just you wait.”
As it happens, I’m basically in that “alas, just you wait” camp, expecting ruthless future AIs. But my camp faces a real question: what exactly is it about human brains[1] that allows them to not always act like power-seeking ruthless consequentialists? I find existing explanations in the discourse—e.g. “ah but humans just aren’t smart and reflective enough”, or evolved modularity, or shard theory, etc.—to be wrong, handwavy, or otherwise unsatisfying.
So in this post, I offer my own explanation of why “agent foundations” toy models fail to describe humans, centering around a particular non-“behaviorist” part of the RL reward function in human brains that I call Approval Reward, which plays an outsized role in human sociality, morality, and self-image. And then the alignment culture clash above amounts to the two camps having opposite predictions about whether future powerful AIs will have something like Approval Reward (like humans, and today’s LLMs), or not (like utility-maximizers).
(You can read this post as pushing back against pessimists, by offering a hopeful exploration of a possible future path around technical blockers to alignment. Or you can read this post as pushing back against optimists, by “explaining away” the otherwise-reassuring observation that humans and LLMs don’t act like psychos 100% of the time.)
Finally, with that background, I’ll go through six more specific areas where “alignment-is-hard” researchers (like me) make claims about what’s “natural” for future AI, that seem quite bizarre from the perspective of human intuitions, and conversely where human intuitions are quite bizarre from the perspective of agent foundations toy models. All these examples, I argue, revolve around Approval Reward. They are:
1. The human intuition that it’s normal and good for one’s goals & values to change over the years
2. The human intuition that ego-syntonic “desires” come from a fundamentally different place than “urges”
3. The human intuition that kindness, deference, and corrigibility are natural
4. The human intuition that unorthodox consequentialist planning is rare and sus
5. The human intuition that societal norms and institutions are mostly stably self-enforcing
6. The human intuition that treating other humans as a resource to be callously manipulated and exploited, just like a car engine or any other complex mechanism in their environment, is a weird anomaly rather than the obvious default
0. Background
0.1 Human social instincts and “Approval Reward”
As I discussed in Neuroscience of human social instincts: a sketch (2024), we should view the brain as having a reinforcement learning (RL) reward function, which says that pain is bad, eating-when-hungry is good, and dozens of other things (sometimes called “innate drives” or “primary rewards”). I argued that part of the reward function was a thing I called the “compassion / spite circuit”, centered around a small number of (hypothesized) cell groups in the hypothalamus, and I sketched some of its effects.
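To make this framing concrete, here is a minimal caricature in Python (my own illustration; the specific drives, weights, and state fields are invented for the example, not taken from the neuroscience posts):

```python
def innate_reward(state):
    """Toy 'primary reward': a sum of hardwired innate-drive terms.
    A real brain has dozens of these; three are sketched here."""
    r = 0.0
    r -= 10.0 * state["pain_level"]                     # pain is bad
    r += 5.0 * state["hunger_level"] * state["eating"]  # eating-when-hungry is good
    r -= 2.0 * state["too_cold"]                        # ...and many more such terms
    return r
```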
Then last month in Social drives 1: “Sympathy Reward”, from compassion to dehumanization and Social drives 2: “Approval Reward”, from norm-enforcement to status-seeking, I dove into the effects of this “compassion / spite circuit” more systematically.
And now in this post, I’ll elaborate on the connections between “Approval Reward” and AI technical alignment.
“Approval Reward” fires most strongly in situations where I’m interacting with another person (call her Zoe), and I’m paying attention to Zoe, and Zoe is also paying attention to me. If Zoe seems to be feeling good, that makes me feel good, and if Zoe is feeling bad, that makes me feel bad. Thanks to these brain reward signals, I want Zoe to like me, and to like what I’m doing. And then Approval Reward generalizes from those situations to other similar ones, including where Zoe is not physically present, but I imagine what she would think of me. It sends positive or negative reward signals in those cases too.
As I argue in Social drives 2, this “Approval Reward” leads to a wide array of effects, including credit-seeking, blame-avoidance, and status-seeking. It also leads not only to picking up and following social norms, but also to taking pride in following those norms, even when nobody is watching, and to shunning and punishing those who violate them.
This is not what normally happens with RL reward functions! For example, you might be wondering: “Suppose I surreptitiously[2] press a reward button when I notice my robot following rules. Wouldn’t that likewise lead to my robot having a proud, self-reflective, ego-syntonic sense that rule-following is good?” I claim the answer is: no, it would lead to something more like an object-level “desire to be noticed following the rules”, with a sociopathic, deceptive, ruthless undercurrent.[3]
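To make the “behaviorist” versus non-“behaviorist” distinction concrete, here is a toy sketch (my own illustration, with invented names; not a real training setup). The behaviorist reward can only see externally-observable events, so the learned desire attaches to the observable event; the hypothesized Approval Reward instead fires on the contents of the agent’s own current thoughts:

```python
class ImaginedJudge:
    """Stand-in for an internal model of a friend or idol (e.g. Zoe)."""
    def would_approve(self, thought):
        return "following_norms" in thought  # toy approval criterion

def behaviorist_reward(observation):
    # "Behaviorist": fires only on externally-observable events. An agent
    # trained on this tends to want the observable event itself ("being
    # noticed following the rules"), however it is produced, deception included.
    return 1.0 if observation["noticed_following_rules"] else 0.0

def approval_reward(internal_state, judge=ImaginedJudge()):
    # Non-"behaviorist" Approval Reward (sketch): the reward function reads
    # the agent's own current thought and asks whether an imagined friend or
    # idol would approve, so it fires even when nobody is watching.
    thought = internal_state["current_thought_or_plan"]
    return 1.0 if judge.would_approve(thought) else -1.0
```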
I argue in Social drives 2 that Approval Reward is overwhelmingly important to most people’s lives and psyches, probably triggering reward signals thousands of times a day, including when nobody is around but you’re still thinking thoughts and taking actions that your friends and idols would approve of.
Approval Reward is so central and ubiquitous to (almost) everyone’s world, that it’s difficult and unintuitive to imagine its absence—we’re much like the proverbial fish who puzzles over what this alleged thing called “water” is.
…Meanwhile, a major school of thought in AI alignment implicitly assumes that future powerful AGIs / ASIs will almost definitely lack Approval Reward altogether, and therefore AGIs / ASIs will behave in ways that seem (to normal people) quite bizarre, unintuitive, and psychopathic.
The differing implicit assumption about whether Approval Reward will be present versus absent in AGI / ASI is (I claim) upstream of many central optimist-pessimist disagreements on how hard technical AGI alignment will be. My goal in this post is to clarify the nature of this disagreement, via six example intuitions that seem natural to humans but are rejected by “alignment-is-hard” alignment researchers. All these examples centrally involve Approval Reward.
0.2 Hang on, will future powerful AGI / ASI “by default” lack Approval Reward altogether?
This post is mainly making a narrow point that the proposition “alignment is hard” is closely connected to the proposition “AGI will lack Approval Reward”. But an obvious follow-up question is: are both of these propositions true? Or are they both false?
Here’s how I see things, in brief, broken down into three cases:
If AGI / ASI will be based on LLMs: Humans have Approval Reward (arguably apart from some sociopaths etc.). And LLMs are substantially sculpted by human imitation (see my post Foom & Doom §2.3). Thus, unsurprisingly, LLMs also display behaviors typical of Approval Reward, at least to some extent. Many people see this as a reason for hope that technical alignment might be solvable. But then the alignment-is-hard people have various counterarguments, to the effect that these Approval-Reward-ish LLM behaviors are fake, and/or brittle, and/or unstable, and that they will definitely break down as LLMs get more powerful. The cautious-optimists generally find those pessimistic arguments confusing (example).
Who’s right? Beats me. It’s out-of-scope for this post, and anyway I personally feel unable to participate in that debate because I don’t expect LLMs to scale to AGI in the first place.[4]
If AGI / ASI will be based on RL agents (or similar), as expected by David Silver & Rich Sutton, Yann LeCun, and myself (“brain-like AGI”), among others, then the answer is clear: There will be no Approval Reward at all, unless the programmers explicitly put it into the reward function source code. And will they do that? We might (or might not) hope that they do, but it should definitely not be our “default” expectation, the way things are looking today. For example, we don’t even know how to do that, and it’s quite different from anything in the literature. (RL agents in the literature almost universally have “behaviorist” reward functions.) We haven’t even pinned down all the details of how Approval Reward works in humans. And even if we do, there will be technical challenges to making it work similarly in AIs—which, for example, do not grow up with a human body at human speed in a human society. And even if it were technically possible, and a good idea, to put in Approval Reward, there are competitiveness issues and other barriers to it actually happening. More on all this in future posts.
If AGI / ASI will wind up like “rational agents”, “utility maximizers”, or related: Here the situation seems even clearer: as far as I can tell, under common assumptions, it’s not even possible to fit Approval Reward into these kinds of frameworks, such that it would lead to the effects that we expect from human experience. No wonder human intuitions and “agent foundations” people tend to talk past each other!
0.3 Where do self-reflective (meta)preferences come from?
This idea will come up over and over as we proceed, so I’ll address it up front:
In the context of utility-maximizers etc., the starting point is generally that desires are associated with object-level things (whether due to the reward signals or the utility function). And from there, the meta-preferences will naturally line up with the object-level preferences.
After all, consider: what’s the main effect of ‘me wanting X’? It’s ‘me getting X’. So if getting X is good, then ‘me wanting X’ is also good. Thus, means-end reasoning (or anything functionally equivalent, e.g. RL backchaining) will echo object-level desires into corresponding self-reflective meta-level desires. And this is the only place that those meta-level desires come from.
By contrast, in humans, self-reflective (meta)preferences mostly (though not exclusively) come from Approval Reward. By and large, our “true”, endorsed, ego-syntonic desires are approximately whatever kinds of desires would impress our friends and idols (see previous post §3.1).
Box: More detailed argument about where self-reflective preferences come from
The actual effects of “me wanting X” are
(1) I may act on that desire, and thus get X (and stuff correlated with X),
(2) Maybe there’s a side-channel through which “me wanting X” can have an effect:
(2A) Maybe there are (effectively) mind-readers in the environment,
(2B) Maybe my own reward function / utility function is itself a mind-reader, in the sense that it involves interpretability, and hence triggers based on the contents of my thoughts and plans.
Any of these three pathways can lead to a meta-preference wherein “me wanting X” seems good or bad. And my claim is that (2B) is how Approval Reward works (see previous post §3.2), while (1) is what I’m calling the “default” case in “alignment-is-hard” thinking.
(What about (2A)? That’s another funny “non-default” case. Like Approval Reward, this might circumvent many “alignment-is-hard” arguments, at least in principle. But it has its own issues. Anyway, I’ll be putting the (2A) possibility aside for this post.)
(Actually, human Approval Reward in practice probably involves a dash of (2A) on top of (2B)—most people are imperfect at hiding their true intentions from others.)
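(Here is the box compressed into a toy Python sketch, my own formalization, with stub functions standing in for the real machinery; pathway (2A) is omitted since I’m putting it aside anyway:)

```python
def outcome_utility(world_trajectory):
    """Object-level preferences: cares only about what happens in the world."""
    return sum(world_trajectory)  # stand-in for "how good the outcomes are"

def value_via_pathway_1(world_trajectory, my_desires):
    # (1): "me wanting X" matters only through the outcomes it causes; my_desires
    # gets no direct weight, so any meta-preferences are purely instrumental
    # echoes of the object-level preferences.
    return outcome_utility(world_trajectory)

def value_via_pathway_2b(world_trajectory, my_desires, judge_approves):
    # (2B): the reward function is itself a "mind-reader": it inspects my_desires
    # directly. This is the claimed shape of Approval Reward.
    return outcome_utility(world_trajectory) + (1.0 if judge_approves(my_desires) else -1.0)
```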
…OK, finally, let’s jump into those “6 reasons” that I promised in the title!
1. The human intuition that it’s normal and good for one’s goals & values to change over the years
In human experience, it is totally normal and good for desires to change over time. Not always, but often. Hence emotive conjugations like
“I was enculturated, you got indoctrinated, he got brainwashed”
“I came to a new realization, you changed your mind, he failed to follow through on his plans and commitments”
“I’m open-minded, you’re persuadable, he’s a flip-flopper”
…And so on. Anyway, openness-to-change, in the right context, is great. Indeed, even our meta-preferences concerning desire-changes are themselves subject to change, and we’re generally OK with that too.[5]
Whereas if you’re thinking about an AI agent with foresight, planning, and situational awareness (whether it’s a utility maximizer, or a model-based RL agent[6], etc.), this kind of preference is a weird anomaly, not a normal expectation. The default instead is instrumental convergence: if I want to cure cancer, then I (incidentally) want to continue wanting to cure cancer until it’s cured.
Why the difference? Well, it comes right from the box in §0.3 just above. For Approval-Reward-free AGIs (which I see as “default”), their self-reflective (meta)desires are subservient to their object-level desires.
Goal-preservation follows: if the AGI wants object-level-thing X to happen next week, then it wants to want X right now, and it wants to still want X tomorrow.
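(A toy worked example of that argument, with invented numbers:)

```python
# The agent scores a proposed modification to its own goals using its CURRENT
# goal ("X happens next week"). That asymmetry is the whole argument.
P_X_IF_GOALS_KEPT = 0.9      # a future self that still wants X makes X likely
P_X_IF_GOALS_CHANGED = 0.1   # a future self wanting something else: X mostly fails

def current_goal_score(prob_x_happens):
    return prob_x_happens  # the current goal cares only about X happening

assert current_goal_score(P_X_IF_GOALS_KEPT) > current_goal_score(P_X_IF_GOALS_CHANGED)
# So "keep wanting X" beats "let my desires drift", however the drift would arise.
```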
By contrast, in humans, self-reflective preferences mostly come from Approval Reward. By and large, our “true”, endorsed desires are approximately whatever kinds of desires would impress our friends and idols, if they could read our minds. (They can’t actually read our minds—but our own reward function can!)
This pathway does not generate any particular force for desire preservation.[7] If our friends and idols would be impressed by desires that change over time, then that’s generally what we want for ourselves as well.
2. The human intuition that ego-syntonic “desires” come from a fundamentally different place than “urges”
In human experience, it is totally normal and expected to want X (e.g. candy), but not want to want X. Likewise, it is totally normal and expected to dislike X (e.g. homework), but want to like it.
And moreover, we have a deep intuitive sense that the self-reflective meta-level ego-syntonic “desires” are coming from a fundamentally different place than the object-level “urges” like eating-when-hungry. For example, in a recent conversation, a high-level AI safety funder confidently told me that urges come from human nature while desires come from “reason”. Similarly, Jeff Hawkins dismisses AGI extinction risk partly on the (incorrect) grounds that urges come from the brainstem while desires come from the neocortex (see my Intro Series §3.6 for why he’s wrong and incoherent on this point).
In a very narrow sense, there’s actually a kernel of truth to the idea that, in humans, urges and desires come from different sources. As in Social Drives 2 and §0.3 above, one part of the RL reward function is Approval Reward, which is the primary (though not exclusive) source of ego-syntonic desires. Everything else in the reward function mostly gives rise to urges.
But this whole way of thinking is bizarre and inapplicable from the perspective of Approval-Reward-free AI futures—utility maximizers, “default” RL systems, etc. There, as above, the starting point is object-level desires; self-reflective desires arise only incidentally.
A related issue is how we think about AGI reflecting on its own desires. How this goes depends strongly on the presence or absence of (something like) Approval Reward.
Start with the former. Humans often have conflicts between ego-syntonic self-reflective desires and ego-dystonic object-level urges, and reflection allows the desires to scheme against the urges, potentially resulting in large behavior changes. If AGI has Approval Reward (or similar), we should expect AGI to undergo those same large changes upon reflection. Or perhaps even larger—after all, AGIs will generally have more affordances for self-modification than humans do.
By contrast, I happen to expect AGIs, by default (in the absence of Approval Reward or similar), to mainly have object-level, non-self-reflective desires. For such AGIs, I don’t expect self-reflection to lead to much desire change. Really, it shouldn’t lead to any change more interesting than pursuing its existing desires more effectively.
(Of course, such an AGI may feel torn between conflicting object-level desires, but I don’t think that leads to the kinds of internal battles that we’re used to from humans.[8])
(To be clear, reflection in Approval-Reward-free AGIs might still have “complications” of other sorts, such as ontological crises.)
3. The human intuition that kindness, deference, and corrigibility are natural
This human intuition comes straight from Approval Reward, which is absolutely central in human intuitions, and leads to us caring about whether others would approve of our actions (even if they’re not watching), taking pride in our virtues, and various other things that distinguish neurotypical people from sociopaths.
As an example, here’s Paul Christiano: “I think that normal people [would say]: ‘If we are trying to help some creatures, but those creatures really dislike the proposed way we are “helping” them, then we should try a different tactic for helping them.’”
He’s right: normal people would definitely say that, and our human Approval Reward is why we would say that. And if AGI likewise has Approval Reward (or something like it), then the AGI would presumably share that intuition.
On the other hand, if Approval Reward is not part of AGI / ASI, then we’re instead in the “corrigibility is anti-natural” school of thought in AI alignment. As an example of that school of thought, see Why Corrigibility is Hard and Important.
4. The human intuition that unorthodox consequentialist planning is rare and sus
Obviously, humans can make long-term plans to accomplish distant goals—for example, an 18-year-old could plan to become a doctor in 15 years, and immediately move this plan forward via sensible consequentialist actions, like taking a chemistry class.
How does that work in the 18yo’s brain? Obviously not via anything like RL techniques that we know and love in AI today—for example, it does not work by episodic RL with an absurdly-close-to-unity discount factor that allows for 15-year time horizons. Indeed, the discount factor / time horizon is clearly irrelevant here! This 18yo has never become a doctor before!
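(To see just how absurd, here’s a back-of-envelope calculation, assuming for concreteness one RL timestep per second:)

```python
steps = 15 * 365.25 * 24 * 3600  # ~4.7e8 one-second timesteps in 15 years
gamma = 0.5 ** (1 / steps)       # discount factor such that a reward 15 years
                                 # away still retains half its present value
print(gamma)                     # ~0.9999999985, i.e. 1 minus ~1.5e-9
```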
Instead, there has to be something motivating the 18yo right now to take appropriate actions towards becoming a doctor. And in practice, I claim that that “something” is almost always an immediate Approval Reward signal.
Here’s another example. Consider someone saving money today to buy a car in three months. You might think that they’re doing something unpleasant now, for a reward later. But I claim that that’s unlikely. Granted, saving the money has immediately-unpleasant aspects! But saving the money also has even stronger immediately-pleasant aspects—namely, that the person feels pride in what they’re doing. They’re probably telling their friends periodically about this great plan they’re working on, and the progress they’ve made. Or if not, they’re probably at least imagining doing so.
So saving the money is not doing an unpleasant thing now for a benefit later. Rather, the pleasant feeling starts immediately, thanks to (usually) Approval Reward.
Moreover, everyone has gotten very used to this fact about human nature. Thus, doing the first step of a long-term plan, without Approval Reward for that first step, is so rare that people generally regard it as highly suspicious. They generally assume that there must be an Approval Reward. And if they can’t figure out what it is, they’ll conclude that there’s something important about the situation that you’re not telling them. …Or maybe they’ll assume that you’re a Machiavellian sociopath.
As an example, I like to bring up Earning To Give (EtG) in Effective Altruism, the idea of getting a higher-paying job in order to earn money and give it to charity. If you tell a normal non-nerdy person about EtG, they’ll generally assume that it’s an obvious lie, and that the person actually wants the higher-paying job for its perks and status. That’s how weird it is—it doesn’t even cross most people’s minds that someone is actually doing a socially-frowned-upon plan because of its expected long-term consequences, unless the person is a psycho. …Well, that’s less true now than a decade ago; EtG has become more common, probably because (you guessed it) there’s now a community in which EtG is socially admirable.
Related: there’s a fiction trope that basically only villains are allowed to make out-of-the-box plans and display intelligence. The normal way to write a hero in a work of fiction is to have conflicts between doing things that have strong immediate social approval, versus doing things for other reasons (e.g. fear, hunger, logic(!)), and to have the former win out over the latter in the mind of the hero. And then the hero pursues the immediate-social-approval option with such gusto that everyone lives happily ever after.[9]
That’s all in the human world. Meanwhile in AI, the alignment-is-hard thinkers like me generally expect that future powerful AIs will lack Approval Reward, or anything like it. Instead, they generally assume that the agent will have preferences about the future, and make decisions so as to bring about those preferred outcomes, not just as a tie-breaker on the margin, but as the main event. Hence instrumental convergence. I think this is exactly the right assumption (in the absence of a specific designed mechanism to prevent that), but I think people react with disbelief when we start describing how these AI agents behave, since it’s so different from humans.
…Well, different from most humans. Sociopaths can be a bit more like that (in certain ways). Ditto people who are unusually “agentic”. And by the way, how do you help a person become “agentic”? You guessed it: a key ingredient is calling out “being agentic” as a meta-level behavioral pattern, and indicating to this person that following this meta-level pattern will get social approval! (Or at least, that it won’t get social disapproval.)
5. The human intuition that societal norms and institutions are mostly stably self-enforcing
5.1 Detour into “Security-Mindset Institution Design”
There’s an attitude, common in the crypto world, that we might call “Security-Mindset Institution Design”. You assume that every surface is an attack surface. You assume that everyone is a potential thief and traitor. You assume that any group of people might be colluding against any other group of people. And so on.
It is extremely hard to get anything at all done in “Security-Mindset Institution Design”, especially when you need to interface with the real-world, with all its rich complexities that cannot be bounded by cryptographic protocols and decentralized verification. For example, crypto Decentralized Autonomous Organizations (DAOs) don’t seem to have done much of note in their decade of existence, apart from on-chain projects, and occasionally getting catastrophically hacked. Polymarket has a nice on-chain system, right up until the moment that a prediction market needs to resolve, and even this tiny bit of contact with the real world seems to be a problematic source of vulnerabilities.
If you extend this “Security-Mindset Institution Design” attitude to an actual fully-real-world government and economy, it would be beyond hopeless. Oh, you have an alarm system in your house? Why do you trust that the alarm system company, or its installer, is not out to get you? Oh, the company has a good reputation? According to who? And how do you know they’re not in cahoots too?
…That’s just one tiny microcosm of a universal issue. Who has physical access to weapons? Why don’t those people collude to set their own taxes to zero and to raise everyone else’s? Who sets government policy, and what if those people collude against everyone else? Or even if they don’t collude, are they vulnerable to blackmail? Who counts the votes, and will they join together and start soliciting bribes? Who coded the website to collect taxes, and why do we trust them not to steal tons of money and run off to Dubai?
…OK, you get the idea. That’s the “Security-Mindset Institution Design” perspective.
5.2 The load-bearing ingredient in human society is not Security-Mindset Institution Design, but rather good-enough institutions plus almost-universal human innate Approval Reward
Meanwhile, ordinary readers[10] might be shaking their heads and saying:
“Man, what kind of strange alien world is being described in that subsection above? High-trust societies with robust functional institutions are obviously possible! I live in one!”
The wrong answer is: “Security-Mindset Institution Design is insanely overkill; rather, using checks and balances to make institutions stable against defectors is in fact a very solvable problem in the real world.”
Why is that the wrong answer? Well for one thing, if you look around the real world, even well-functioning institutions are obviously not robust against competent self-interested sociopaths willing to burn the commons for their own interests. For example, I happen to have a high-functioning sociopath ex-boss from long ago. Where is he now? Head of research at a major USA research university, and occasional government appointee wielding immense power. Or just look at how Donald Trump has been systematically working to undermine any aspect of society or government that might oppose his whims or correct his lies.[11]
For another thing, abundant “nation-building” experience shows that you cannot simply bestow a “good” government constitution onto a deeply corrupt and low-trust society, and expect the society to instantly transform into Switzerland. Institutions and laws are not enough. There’s also an arduous and fraught process of getting to the right social norms. Which brings us to:
The right answer is, you guessed it, human Approval Reward, a consequence of which is that almost all humans are intrinsically motivated to follow and enforce social norms. The word “intrinsically” is important here. I’m not talking about transactionally following norms when the selfish benefit outweighs the selfish cost, while constantly energetically searching for norm-violating strategies that might change that calculus. Rather, people take pride in following the norms, and in punishing those who violate them.
Obviously, any possible system of norms and institutions will be vastly easier to stabilize when, no matter what the norm is, you can get up to ≈99% of the population proudly adopting it, and then spending their own resources to root out, punish, and shame the 1% of people who undermine it.
In a world like that, it is hard but doable to get into a stable situation where 99% of cops aren’t corrupt, and 99% of judges aren’t corrupt, and 99% of people in the military with physical access to weapons aren’t corrupt, and 99% of IRS agents aren’t corrupt, etc. The last 1% will still create problems, but the other 99% have a fighting chance to keep things under control. Bad apples can be discovered and tossed out. Chains of trust can percolate.
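(A toy model of why that ≈99% number matters so much; all parameters are invented for illustration:)

```python
def defection_payoff(enforcer_fraction, gain=10.0, punishment=100.0, catch_rate=0.9):
    # Expected payoff of violating a norm: the selfish gain, minus the expected
    # punishment, where the chance of being caught scales with the fraction of
    # the population intrinsically motivated to root out violators.
    p_caught = catch_rate * enforcer_fraction
    return gain - p_caught * punishment

print(defection_payoff(0.99))  # ~ -79: defection is a terrible deal
print(defection_payoff(0.05))  # ~ +5.5: once enforcers are scarce, defection pays
```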
5.3 Upshot
Something like 99% of humans are intrinsically motivated to follow and enforce norms, with the rest being sociopaths and similar. What about future AGIs? As discussed in §0.2, my own expectation is that 0% of them will be intrinsically motivated to follow and enforce norms. When those sociopathic AGIs grow in number and power, it takes us from the familiar world of §5.2 to the paranoid insanity world of §5.1.
In that world, we really shouldn’t be using the word “norm” at all—it’s just misleading baggage. We should be talking about rules that are stably self-enforcing against defectors, where the “defectors” are of course allowed to include those who are supposed to be doing the enforcement, and where the “defectors” might also include broad coalitions coordinating to jump into a new equilibrium that Pareto-benefits them all. We do not have such self-enforcing rules today. Not even close. And we never have. And inventing such rules is a pipe dream.[12]
The flip side, of course, is that if we figure out how to ensure that almost all AGIs are intrinsically motivated to follow and enforce norms, then it’s the pessimists who are invoking a misleading mental image if they lean on §5.1 intuitions.
6. The human intuition that treating other humans as a resource to be callously manipulated and exploited, just like a car engine or any other complex mechanism in their environment, is a weird anomaly rather than the obvious default
Click over to Foom & Doom §2.3.4—“The naturalness of egregious scheming: some intuitions” to read this part.
7. Conclusion
(Homework: can you think of more examples?)
I want to reiterate that my main point in this post is not
Alignment is hard and we’re doomed because future AIs definitely won’t have Approval Reward (or something similar).
but rather
There’s a QUESTION of whether or not alignment is hard and we’re doomed, and many cruxes for this question seem to be downstream of the narrower question of whether future AIs will have Approval Reward (or something similar) (§0.2). I am surfacing this latent uber-crux to help advance the discussion.
For my part, I’m obviously very interested in the question of whether we can and should put Approval Reward (and Sympathy Reward) into Brain-Like AGI, and what might go right and wrong if we do so. More on that in (hopefully) upcoming posts!
Thanks Seth Herd, Linda Linsefors, Charlie Steiner, Simon Skade, Jeremy Gillen, and Justis Mills for critical comments on earlier drafts.
- ^
…and by extension today’s LLMs, which (I claim) get their powers mainly from imitating humans.
- ^
I said “surreptitiously” here because if you ostentatiously press a reward button, in a way that the robot can see, then the robot would presumably wind up wanting the reward button to be pressed, which eventually leads to the robot grabbing the reward button etc. See Reward button alignment.
- ^
See Perils of under- vs over-sculpting AGI desires, especially §7.2, for why the “nice” desire would not even be temporarily learned, and if it were it would be promptly unlearned; and see “Behaviorist” RL reward functions lead to scheming for some related intuitions; and see §3.2 of the Approval Reward post for why those don’t apply to (non-behaviorist) Approval Reward.
- ^
My own take, which I won’t defend here, is that this whole debate is cursed, and both sides are confused, because LLMs cannot scale to AGI. I think the AGI concerns really are unsolved, and I think that LLM techniques really are potentially-safe, but they are potentially-safe for the very reason that they won’t lead to AGI. I think “LLM AGI” is an incoherent contradiction, like “square circle”, and one side of the debate has a mental image of “square thing (but I guess it’s somehow also a circle)”, and the other side of the debate has a mental image of “circle (but I guess it’s somehow also square)”, so no wonder they talk past each other. So that’s how things seem to me right now. Maybe I’m wrong!! But anyway, that’s why I feel unable to take a side in this particular debate. I’ll leave it to others. See also: Foom & Doom §2.9.1.
- ^
…as long as the meta-preferences-about-desire-changes are changing in a way that seems good according to those same meta-preferences themselves—growth good, brainwashing bad, etc.
- ^
Possible objection: “If the RL agent has lots of past experience of its reward function periodically changing, won’t it learn that this is good?” My answer: No. At least for the kind of model-based RL agent that I generally think about, the reward function creates desires, and the desires guide plans and actions. So at any given time, there are still desires, and if these desires concern the state of the world in the future, then the instrumental convergence argument for goal-preservation goes through as usual. I see no process by which past history of reward function changes should make an agent OK with further reward function changes going forward.
(But note that the instrumental convergence argument makes model-based RL agents want to preserve their current desires, not their current reward function. For example, if an agent has a wireheading desire to get reward, it will want to self-modify to preserve this desire while changing the reward function to “return +∞”.)
- ^
…At least to a first approximation. Here are some technicalities: (1) Other pathways also exist, and can generate a force for desire preservation. (2) There’s also a loopy thing where Approval Reward influences self-reflective desires, which in turn influence Approval Reward, e.g. by changing who you admire. (See Approval Reward post §5–§6.) This can (mildly) lock in desires. (3) Even Approval Reward itself leads not only to “proud feeling about what I’m up to right now” (Approval Reward post §3.2), which does not particularly induce desire-preservation, but also to “desire to actually interact with and impress a real live human sometime in the future”, which is pathway (1) from the box in §0.3, and which (being consequentialist) does induce desire-preservation and the other instrumental convergence stuff.
- ^
If an Approval-Reward-free AGI wants X and wants Y, then it could get more X by no longer wanting Y, and it could get more Y by no longer wanting X. So there’s a possibility that AGI reflection could lead to “total victory” where one desire erases another. But I (tentatively) think that’s unlikely, and that the more likely outcome is that the AGI would continue to want both X and Y, and to split its time and resources between them. A big part of my intuition is: you can theoretically have a consequentialist utility-maximizer with utility function $U = \log X + \log Y$, and it will generally split its time between X and Y forever, and this agent is reflectively stable. (The logarithm ensures that X and Y have diminishing returns. Or if that’s not diminishing enough, consider $U = \log \log X + \log \log Y$, etc.)
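(Quick check of that claim, under the toy assumption of a fixed budget $B$ split between the two pursuits: $\max_{x+y=B} \log x + \log y$ gives first-order condition $\frac{1}{x} = \frac{1}{y}$, hence $x = y = \frac{B}{2}$. The marginal utility of whichever goal is neglected grows without bound, so neither desire ever erases the other.)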
- ^
To show how widespread this is, I don’t want to cherry-pick, so my two examples will be the two most recent movies that I happen to have watched, as I’m sitting down to write this paragraph. These are: Avengers: Infinity War & Ant-Man and the Wasp. (Don’t judge, I like watching dumb action movies while I exercise.)
Spoilers for the Marvel Cinematic Universe film series (pre-2020) below:
The former has a wonderful example. The heroes can definitely save trillions of lives by allowing their friend Vision to sacrifice his life, which by the way he is begging to do. They refuse, instead trying to save Vision and save the trillions of lives. As it turns out, they fail, and both Vision and the trillions of innocent bystanders wind up dead. Even so, this decision is portrayed as good and proper heroic behavior, and is never second-guessed even after the failure. (Note that “Helping a friend in need who is standing right there” has very strong immediate social approval for reasons explained in §6 of Social drives 1 (“Sympathy Reward strength as a character trait, and the Copenhagen Interpretation of Ethics”).) (Don’t worry, in a sequel, the plucky heroes travel back in time to save the trillions of innocent bystanders after all.)
In the latter movie, nobody does anything quite as outrageous as that, but it’s still true that pretty much every major plot point involves the protagonists risking themselves, or their freedom, or the lives of unseen or unsympathetic third parties, in order to help their friends or family in need—which, again, has very strong immediate social approval.
- ^
And @Matthew Barnett! This whole section is based on (and partly copied from) a comment thread last year between him and me.
- ^
- ^
Superintelligences might be able to design such rules amongst themselves, for all I know, although it would probably involve human-incompatible things like “merging” (jointly creating a successor ASI then shutting down). Or we might just get a unipolar outcome in the first place (e.g. many copies of one ASI with the same non-indexical goal), for reasons discussed in my post Foom & Doom §1.8.7.
Do you think sociopaths are sociopaths because their approval reward is very weak? And if so, why do they often still seek dominance/prestige?
Basically yes (+ also sympathy reward); see Approval Reward post §4.1, including the footnote.
My current take is that prestige-seeking comes mainly from Approval Reward, and is very weak in (a certain central type of) sociopath, whereas dominance-seeking comes mainly from a different social drive that I discussed in Neuroscience of human social instincts: a sketch §7.1, but mostly haven’t thought about too much, and which may be strong in some sociopathic people (and weak in others).
I guess it’s also possible to prestige-seek not because prestige seems intrinsically desirable, but rather as a means to an end.
My default mental model of an intelligent sociopath includes something like this:
You find yourself wandering around in a universe where there’s a bunch of stuff to do. There’s no intrinsic meaning, and you don’t care whether you help or hurt other people or society; you’re just out to get some kicks and have a good time, preferably on your own terms. A lot of neat stuff has already been built, which, hey, saves you a ton of effort! But it’s got other people and society in front of it. Well, that could get annoying. What do you do?
Well, if you learn which levers to pull, sometimes you can get the people to let you in ‘naturally’. Bonus if you don’t have to worry as much about them coming back to inconvenience you later. And depending on what you were after, that can turn out as prestige—‘legitimately’ earned or not, whatever was easier or more fun. (Or dominance; I feel like prestige is more likely here, but that might be dependent on what kind of society you’re in and what your relative strengths are. Also, sometimes it’s much more invisible! There’s selection effects in which sociopaths become well-known versus quietly preying somewhere they won’t get caught.)
Beyond that, a lot of times the people are the good stuff. They’re some of the most complicated and interesting toys in the world to play with! And dominance and prestige both look like shiny score levers from a distance and can cause all sorts of fun ripply effects when you jangle them the right way. So even if you’re not drawn to them for intrinsic, content-specific reasons, you can get drawn in by the game, just like how people who play video games have their motivations shaped by contextual learning toward whatever the gameplay loop focuses on.
Relatedly, how do we model the reflective desires of sociopaths in the absence of Approval Reward?
I don’t know! IIRC they talk about related things a bit in this podcast but I wound up not really knowing what to make of it. (But I listened to it a year ago, and I think I’ve learned new things since then, perhaps I should try listening to it again.) UPDATE MAY 2026: It actually makes a ton of sense in my model, see this comment.
Nice post. Approval reward seems like it helps explain a lot of human motivation and behavior.
I’m wondering whether approval reward would really be a safe source of motivation in an AGI though. From the post, it apparently comes from two sources in humans:
Internal: There’s an internal approval reward generator that rewards you for doing things that other people would approve of, even if no one is there. “Intrinsically motivated” sounds very robust but I’m concerned that this just means that the reward is coming from an internal module that is possible to game.
External: Someone sees you do something and you get approval.
In each case it seems the person is generating behaviors, and there is an equally strong/robust reward classifier (internal or external), so it’s hard to game.
The internal classifier is hard to game because we can’t edit our minds.
And other people are hard to fool. For example, there are fake billionaires but they are usually found out and then get negative approval so it’s not worth it.
But I’m wondering whether an AGI with an approval reward would modify itself to reward hack, or would figure out how to fool humans in clever ways (like the RLHF robot arm), to get more approval.
Though maybe implementing an approval reward in an AI gets you most of the alignment you need and it’s robust enough.
I definitely have strong concerns that Approval Reward won’t work on AGI. (But I don’t have an airtight no-go theorem either. I just don’t know; I plan to think about it more.) See especially footnote 7 of this post, and §6 of the Approval Reward post, for some of my concerns, which overlap with yours.
(I hope I wasn’t insinuating that I think AGI with Approval Reward is definitely a great plan that will solve AGI technical alignment. I’m open to wording changes if you can think of any.)
From personal experience, the internal Approval module does in fact seem possible to game, specifically by manipulating whose approval it’s seeking.
I became very weird (from the perspective of everyone else) very fast when I replaced the abstract-person-which-would-do-the-approving with a fictional person-archetype of my choosing. That process seems to have injected a bunch of my object-level desires into my Approval system. I now find myself feeling pride at doing things with selfish benefit in expectation, which ~never happened before (absent a different reason to feel about that action). It also killed certain subsets of my previous emotional reactions; for example, the deaths of loved ones basically haven’t affected me at all since (though that prospect still seems dreadful in anticipation).
I had been pathologically selfless before, and I’m now considerably less so, but not in a natural-seeming kind of way. I’ve become an amalgam of very selfish motivations, coexisting with a subset of my previous very selfless morality. It’s… honestly a mess, but I wouldn’t call the attempt actually unsuccessful, just far from perfectly executed.
Curated! I very much like the project of finding upstream cruxes to different intuitions regarding AI alignment. Oddly, such cruxes can be invisible until someone points them out. It’s also cool how Steven’s insight here isn’t a one-off post, but flows from his larger research project and models, kind of the project paying dividends. (To clarify, in curating this I’m not saying it’s definitely correct according to me, but I find it quite plausible.)
I also appreciate that most times when I or others try to do this mechanistic modeling of human minds, it ends up very dry and others don’t want to read it even when it feels compelling to the author; somehow Steven has escaped that, by dint of writing quality or idea quality, I’m not sure.
I really liked this and when the relevant Annual Review comes around, expect to give it at least a 4.
A complementary angle: we shouldn’t be arguing over whether or not we’re in for a rough ride, we should be figuring out how to not have that.
I suspect more people would be willing to (both empirically and theoretically) get behind ‘ruthless consequentialist maximisers are one extreme of a spectrum which gets increasingly scary and dangerous; it would be bad if those got unleashed’.
Sure, skeptics can still argue that this just won’t happen even if we sit back and relax. But I think then it’s clearer that they’re probably making a mistake (since origin stories for ruthless consequentialist maximisers are many and disjunctive). So the debate becomes ‘which sources of supercompetent ruthless consequentialist maximisers are most likely and what options exist to curtail that?’.
I appreciate this post for working to distill a key crux in the larger debate.
Some quick thoughts:
1. I’m having a hard time understanding the “Alas, the power-seeking ruthless consequentialist AIs are still coming” intuition. It seems like a lot of people in this community have this intuition, and I feel very curious why. I appreciate this crux getting attention.
2. Personally, my stance is something more like, “It seems very feasible to create sophisticated AI architectures that don’t act as scary maximizers.” To me it seems like this is what we’re doing now, and I see some strong reasons to expect this to continue. (I realize this isn’t guaranteed, but I do think it’s pretty likely)
3. While the human analogies are interesting, I assume they might appeal more to the “consequentialist AIs are still coming” crowd than people like myself. Humans were evolved for some pretty wacky reasons, and have a large number of serious failure modes. Perhaps they’re much better than some of what people imagine, but I suspect that we can make AI systems that have much more rigorous safety properties in the future. I personally find histories of engineering complex systems in predictable and controllable ways to be much more informative, for these challenges.
4. You mention human intrinsic motivations as a useful factor. I’d flag that in a competent and complex AI architecture, I’d expect that many subcomponents would have strong biases towards corrigibility and friendliness. This seems highly analogous to human minds, where it’s really specific sub-routines and similar that have these more altruistic motivations.
We probably mostly disagree because you’re expecting LLMs forever and I’m not. For example, AlphaZero does act as a scary maximizer. Indeed, nobody knows any way to make an AI that’s superhuman at Go, except by techniques that produce scary maximizers. Is there a way to make an AI that’s superhuman at founding and running innovative companies, but isn’t a scary maximizer? That’s beyond present AI capabilities, so the jury is still out.
The issue is basically “where do you get your capabilities from?” One place to get capabilities is by imitating humans. That’s the LLM route, but (I claim) it can’t go far beyond the hull of existing human knowledge. Another place to get capabilities is specific human design (e.g. the heuristics that humans put into Deep Blue), but that has the same limitation. That leaves consequentialism as a third source of capabilities, and it definitely works in principle, but it produces scary maximizers.
Yup, my expectation is that ASI will be even scarier than humans, by far. But we are in agreement that humans with power are much-more-than-zero scary.
I’m not sure what you mean by “subcomponents”. Are you talking about subcomponents at the learning algorithm level, or subcomponents at the trained model level? For the former, I think both LLMs and human brains are mostly big simple-ish learning algorithms, without much in the way of subcomponents. For the latter (where I would maybe say “circuits” instead of “subcomponents”?), I would also disagree but for different reasons, maybe see §2 of this post.
Thanks so much for that explanation. I’ve started to review those posts you linked to and will continue doing so later. Kudos for clearly outlining your positions, that’s a lot of content.
> “We probably mostly disagree because you’re expecting LLMs forever and I’m not.”
I agree that RL systems like AlphaZero are very scary. Personally, I was a bit more worried about AI alignment a few years ago, when this seemed like the dominant paradigm.
I wouldn’t say that I “expect LLMs forever”, but I would say that if/when they are replaced, I think it’s more likely than not that they will be replaced by a system with a scariness factor that’s similar to LLMs or less. The main reason is that I think there’s a very large correlation between “not being scary” and “being commercially viable”, so I expect a lot of pressure for non-scary systems.
The scariness of RL systems like AlphaZero seems to go hand-in-hand with some really undesirable properties, such as [being a near-total black box] and [being incredibly hard to intentionally steer]. It’s definitely possible that in the future some capabilities advancement might mean that scary systems have such an intelligence/capabilities advantage that this outweighs the disadvantages, but I see this as unlikely (though definitely a thing to worry about).
> I’m not sure what you mean by “subcomponents”. Are you talking about subcomponents at the learning algorithm level, or subcomponents at the trained model level?
I’m referring to scaffolding. As in, an organization makes an “AI agent” but this agent frequently calls a long list of specific LLM+Prompt combinations for certain tasks. These subcalls might be optimized to be narrow + [low information] + [low access] + [generally friendly to humans] or similar. This can be made more advanced with a large variety of fine-tuned models, but that might be unlikely.
I have a three-way disjunctive argument on why I don’t buy that:
(1) The really scary systems are smart enough to realize that they should act non-scary, just like smart humans planning a coup are not gonna go around talking about how they’re planning a coup, but rather will be very obedient until they have an opportunity to take irreversible actions.
(2) …And even if (1) were not an issue, i.e. even if the scary misaligned systems were obviously scary and misaligned, instead of secretly, that still wouldn’t prevent those systems from being used to make money—see Reward button alignment for details. Except that this kind of plan stops working when the AIs get powerful enough to take over.
(3) …And even if (1-2) were not issues, i.e. even if the scary misaligned systems were useless for making money, well, MuZero did in fact get made! People just like doing science and making impressive demos, even without profit incentives. This point is obviously more relevant for people like me who think that ASI won’t require much hardware, just new algorithmic ideas, than people (probably like you) who expect that training ASI will take a zillion dollars.
I think this points to another deep difference between us. If you look at humans, we have one brain design, barely changed since 100,000 years ago, and (many copies of) that one brain design autonomously figured out how to run companies and drive cars and go to the moon and everything else in science and technology and the whole global economy.
I expect that people will eventually invent an AI like that—one AI design and bam, it can just go and autonomously figure out anything—whereas you seem to be imagining that the process will involve laboriously applying schlep to get AI to do more and more specific tasks. (See also my related discussion here.)
I agree that there is an optimization pressure here, but I don’t think it robustly targets “don’t create misaligned superintelligence”; rather, it targets “customers and regulators not being scared”, which is very different from “don’t make things customers and regulators should be scared of”.
I was thinking more of internal systems that a company would have enough faith in to deploy (a 1% chance of severe failure is pretty terrible!) or customer-facing things that would piss off customers more than scare them.
Getting these right is tremendously hard. Lots of companies are trying and mostly failing right now. There’s a ton of money in just “making solid services/products that work with high reliability.”
My impression is that companies are very short-sighted, optimizing for quarterly and yearly results even if it has a negative effect on the company’s performance in 5 years, and even if it has negative effects on society. I also think many (most?) companies view regulations not as signals for how they should be behaving but more like board game rules: if they can change or evade the rules to profit, they will.
I’ll also point out that it is probably in the best interest of many customers to be pissed off. Sycophantic products make more money than ones that force people to confront ways they are acting against their own values. It is my estimation that that is a pretty big problem.
But the thing I’m most worried about is companies succeeding at “making solid services/products that work with high reliability” without actually solving the alignment problem, and then it becomes even more difficult to convince people there even is a problem as they further insulate themselves from anyone who disagrees with their hyper-niche worldview.
Thanks for the clarification.
> But the thing I’m most worried about is companies succeeding at “making solid services/products that work with high reliability” without actually solving the alignment problem, and then it becomes even more difficult to convince people there even is a problem as they further insulate themselves from anyone who disagrees with their hyper-niche worldview.
The way I see it, “making solid services/products that work with high reliability” is solving a lot of the alignment problem. As in, this can get us very far into making AI systems do a lot of valuable work for us with very low risk.
I imagine that you’re using a more specific definition of it than I am here.
I might be. I might also be using a more general definition. Or just a different one. Alas, that’s natural language for you.
I agree, but feel it’s important to note the low risk is only locally low. Globally I think the risk is catastrophic.
I think the biggest difference in our POV might be that I think the systems we are using to control what happens in our world (markets, governments, laws) are already misaligned and heading towards disasters, and if we allow them to continue getting more capable they will not suddenly be capable enough to get back on track because they were never aligned to target human friendly preferences in the first place. Rather, they target proxies, but capabilities have gone beyond the point where those proxies are articulate enough for good outcomes. We need to switch focus from capabilities to alignment.
Funny, I see “high reliability” as part of the problem rather than part of the solution. If a group is planning a coup against you, then your situation is better not worse if the members of this group all have dementia. And you can tell whether or not they have dementia by observing whether they’re competent and cooperative and productive before any coup has started.
If the system is not the kind of thing that could plot a coup even if it wanted to, then it’s irrelevant to the alignment problem, or at least to the most important part of the alignment problem. E.g. spreadsheet software and bulldozers likewise “do a lot of valuable work for us with very low risk”.
Curious what evidence makes you think that “being a near-total black box” restrains adoption of these systems? Social media companies have very successfully deployed and protected their black-box recommendation algorithms despite massive negative societal consequences, and the current transformer models are arguably black boxes with massive adoption.
Further, “being incredibly hard to intentionally steer” is a baseline assumption for me about how practically any conceivable agentic AI works, and given that we almost surely cannot get statistical guarantees about AI agent behaviour in open settings, I don’t see any reason (especially in the current political environment) that this property would be a showstopper.
Having actually worked for a tech giant on recommendation systems (specifically, for music), they are very much not black boxes to the people building them. They use fairly old and quite understandable ML techniques to predict engagement, from every obvious signal that the engineers involved can think of that might help do so, and they’re tweaked a lot, and every tweak is A/B tested at huge scale. It’s a very obvious learning algorithm, with a lot of hand-engineering involved. Getting a 0.5% increase in a secondary metric that your data scientists have shown is correlated to your north-star metric is a major win. The only element of all this that’s in any way hard to predict is the social side effects of maximizing engagement. So the recommendation algorithms might be a black box to users, but by LLM standards they’re practically transparent.
Is that still the case? I had the impression that the YouTube recommender may be a transformer now? I’m not sure where I got this hunch.
I left a couple of years ago. At that time, for the music aspect of the company that I was working for, the main recommender was a great many carefully-crafted input signals (many having already been processed by a wide variety of ML models) fed into a small tower of MLP layers with multiple output heads attempting to predict different aspects of engagement with the item, feeding into a data-scientist-derived formula. I gather the main video recommender then used something comparable. Quite old-school ML. Since almost all of our inputs generally weren’t in the form of meaningful sequences, there weren’t many obvious problems that a transformer could help with — for those that were, I’d expect it to be applied only to that portion of the data (e.g. chunks of text). Indeed, that was starting to happen while I was there (e.g. for search, where the user input actually is a chunk of text). In general, they put a lot of effort into having data scientists understand what was going on inside the system in as much detail as possible.
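(For concreteness, here’s a minimal sketch, in PyTorch, of that general shape: hand-crafted signals feeding a small MLP tower with multiple engagement-prediction heads, combined by a hand-tuned formula. The head names, layer sizes, and coefficients are all invented for illustration, not the real system’s details.)

```python
import torch
import torch.nn as nn

class EngagementTower(nn.Module):
    def __init__(self, n_signals: int = 256):
        super().__init__()
        # Small tower of MLP layers over hand-engineered input signals.
        self.tower = nn.Sequential(
            nn.Linear(n_signals, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        # Multiple output heads, each predicting one aspect of engagement.
        self.heads = nn.ModuleDict({
            "p_click": nn.Linear(64, 1),
            "p_complete": nn.Linear(64, 1),
            "p_save": nn.Linear(64, 1),
        })

    def forward(self, signals: torch.Tensor) -> dict:
        h = self.tower(signals)
        return {name: torch.sigmoid(head(h)) for name, head in self.heads.items()}

def ranking_score(preds: dict) -> torch.Tensor:
    # The "data-scientist-derived formula" combining the heads;
    # these coefficients are purely illustrative.
    return preds["p_click"] + 2.0 * preds["p_complete"] + 0.5 * preds["p_save"]
```

Every piece of that is individually inspectable and A/B-testable, which is what I mean by “practically transparent” by LLM standards.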
They also did actually try to think about the possible social consequences of their algorithms — for example, they maximized account retention, not total engagement, which turned out to mean that past a certain point total engagement had very diminishing returns. They also classified content into buckets, including certain ones they did NOT want to encourage engagement with, even if the content was still available for users actively looking for it (less of an issue for music, admittedly), and other highly promoted ones which seemed to correlate with people’s self-reported long-term happiness with using the site — and which tended to be quite “worthy” (again, rare for music). (Note that I am explicitly NOT claiming that all social media companies were then acting like that.)
I agree that some companies do use RL systems. However, I’d expect that most of the time, the black-box nature of some of these systems is not actively preferred. They use them despite their black-box nature, in specific situations where the benefits outweigh the costs, not because of it.
“current transformer models are arguably black boxes with massive adoption.” → They’re typically much less black-box than RL systems. There’s a fair bit of customization that can be done with prompting, and the prompting is generally English-readable.
Do you think AI-empowered people / companies / governments also won’t become more like scary maximizers? Not even if they can choose how to use the AI and how to train it? This seems like a super strong statement, and I don’t know of any reason to believe it at all.
“Do you think AI-empowered people / companies / governments also won’t become more like scary maximizers?” → My statements above were very focused on AI architectures / accident risk. I see people / misuse risk as a fairly distinct challenge/discussion.
You might be interested in a model / terminology / lens I’m trying to get off the ground: an Outcome Influencing System (OIS) is any system with preferences and capabilities, which uses its capabilities to influence reality towards outcomes according to its preferences. An important aspect of this definition is that it includes not just AI and ASI, but also humans and human organizations. I think it’s useful because it lets us talk more easily about the risk of misaligned OISs, making explicit that we are talking about potential ASI, about organizations that may create ASI, and about organizations that may pose catastrophic risk due to their capabilities, regardless of whether they are using anything laypeople would identify as an AI.
I think this was in the Sequences, the notion of “optimization process”. Eliezer describes here how he realized this notion is important, by drawing a line through three points: natural selection, human intelligence, and an imaginary genie / outcome-pump device.
Yeah! That was the post that got me to really deeply believe the Orthogonality Thesis. “My Naturalistic Awakening” and “A Human’s Guide to Words” are my two favourite sequences.
The OIS notion is actually slightly broader than “optimization process”, though, for two reasons: (1) OISs have capabilities, not intelligence, and (2) OIS capabilities can have arbitrary generality.
(1) The important distinction is that OISs are defined in terms of capabilities not in terms of intelligence, where capabilities can be broken down into skills, knowledge, and resource access.
This is valuable for breaking skills down into skill domains, which is relevant for risk assessment, while intelligence is a kind of generalizable skill that seems to be very poorly defined and usually more distracting to valuable analysis in my opinion.
Also, resource access has the same compounding property that knowledge and skill have, which could potentially lead to dangerously compounding capabilities. Making it explicit that “intelligence” is not the only aspect of an OIS with this compounding property seems important.
(2) is less well considered and less important. The example I have for this is a bottle cap. A bottle cap makes it more likely that water will stay in a bottle, but it isn’t an optimizer; it is an optimized object. When viewed through the optimizer lens, the bottle cap doesn’t want to keep the water in; rather, it was optimized by something that does want to keep the water in, so it is not an optimizer. That is, the cap has extremely fragile capabilities. It keeps the water in when it is screwed on, but if it is unscrewed it has no ability on its own to put itself back on or to try to continue keeping the water in. This must be very nearly the limit of how little it is possible for capabilities to generalize.
However, from the OIS lens, the cap indeed makes water staying in the bottle a more likely outcome, and we can say that in some sense it does want to keep the water in.
I find it a little frustrating how general this makes the definition, and I’m sure other people will as well. But I think it is more useful in this case to cast a very wide net and then try to understand the differences between the kinds of things caught by that net, rather than working with overly limited definitions that fail to reference the objects I am interested in. It also highlights the potential issues with highly optimized, fragile OISs. If we need them to generalize, it is a problem that they won’t; and if we are expecting safety because something “isn’t actually an optimizer”, that may not matter if it is sufficiently well optimized over a sufficiently dangerous domain of capability.
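For concreteness, here is a minimal sketch of how I might encode the lens; the interface and all the numbers are purely my own illustration, not a worked-out formalism:

```python
from typing import Protocol

class OIS(Protocol):
    """Anything viewable as having preferences over outcomes, plus
    capabilities that shift outcome probabilities."""
    def preference(self, outcome: str) -> float: ...
    def influence(self, outcome: str) -> float: ...

class BottleCap:
    """An OIS with extremely fragile, non-generalizing capabilities."""
    def __init__(self, screwed_on: bool):
        self.screwed_on = screwed_on

    def preference(self, outcome: str) -> float:
        # In the OIS lens, the cap "wants" the water to stay in.
        return 1.0 if outcome == "water stays in" else 0.0

    def influence(self, outcome: str) -> float:
        # It shifts that outcome's probability only while screwed on;
        # unscrewed, it cannot restore its own influence.
        return 0.9 if (outcome == "water stays in" and self.screwed_on) else 0.0
```

The same interface fits a human, a corporation, or an ASI; the interesting differences are in how general and how compounding the capabilities behind `influence` are.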
I think if this was going to stick, you would already be seeing other people using it here. The fact that it didn’t quickly spread is a bad sign for how your readers have evaluated it.
For myself, I find the term clunky. I don’t think you’re wrong to want to talk about it, but the term on its own already uses three of your five words for mass communication, they’re rare words, they’re long words, and the meaning of each in context is a bit odd. Also, they rely on people having a habit of trying to generalize. Most of those drawbacks are easy to work around on lesswrong; but then there’s the much more important reason the term doesn’t work, which is simply that it’s not necessary to memorize—if I were to use a three word phrase to describe consequence-dependent processes, I have an infinite wellspring of rephrases of those three words at hand in my head, and which rephrase I use depends on exactly which combination of subtle meanings I want to refer to right now.
The flipside of this is that I do agree with you that consequence-steering processes are a core source of concern and are general between humans and AIs, that there’s an unsolved problem of how to specify goodness in a way that still means “good things” if put in a spreadsheet (perhaps one that is gigabytes large) and number-go-up’ed about.
Unfortunately I am guided by my inside view, so I will continue discussing OISs until people do start using the term or until I come to understand the flaws in the terminology. By discussing it with me you are helping with this process, so thank you : )
I would love to hear more thoughts on this. I examined many other sets of words before settling on these ones. If you are interested, I can discuss why I think they are better than any of the examples you suggested.
Although I like the way “consequence” implies the involvement of causality, I think “outcome” is preferable to “consequence” because I want to ground the terminology in formal mathematics and would like to leverage the term “outcome” from probability theory.
The term “process” is one that I spent a good amount of time considering, especially in the phrase “decision process”, but I ended up preferring the term “system” because of the implication that we should be thinking not only about actions, but about objects and the actions those objects can perform. An OIS is a physical thing. No OIS exists without being instantiated by some part of reality.
I prefer “influencing” over “steering” because “steering” implies especially competent influence which is explicitly incorrect when reasoning about multiple agents operating in the same environment with incompatible goals and similar levels of capabilities. It is not true that either agent steers. Both agents influence.
I find the phrase “consequence-dependent processes (CDP)” very interesting. With “dependent” in the place of “influencing” or “steering”, CDP seems reminiscent of the outcome pump discussed in The Hidden Complexity of Wishes and My Naturalistic Awakening. Although notably, CDP doesn’t seem to imply that the process causes the consequence to become more likely or certain; rather, it suggests some kind of acausal dependence on the consequence. I don’t know if this is what you meant to imply, but it is certainly different from an OIS. While a CDP is acausal and doesn’t necessarily affect outcome likelihoods, OISs operate according to the causal rules of our world, explicitly making some outcomes more likely than others.
The art of abstraction involves both generalization and specification. I love both, and wish more people would delight in carefully constructed abstraction. Rather than relying on people’s habit of generalizing, I would say the OIS terminology is maybe trying to promote it.
I think this is a flaw, not a feature. My goal with creating a standard set of terminology is (among other things) to avoid the ambiguity of subtle rephrasings and to create the shorthand words “OIS” and “OISs” pronounced “oh-ee” and “oh-ees” to make it easier to articulately discuss specific sets of important general phenomena.
I strongly agree. I think this is basically Goodhart’s law. My thinking and talking about OISs is very much a result of trying to work out how to solve the generalized AI Alignment Problem: representing what we want in an encoding sufficiently accurate, articulate, and precise that it can be used as the preferences of an arbitrarily capable OIS without that OIS becoming misaligned.
Thanks again. I appreciate your critical engagement.
To explain my disagreement, I’ll start with an excerpt from my post here:
So that’s one piece of where I’m coming from.
Meanwhile, as it happens, I have worked on “engineering complex systems in predictable and controllable ways”, in a past job at an engineering firm that made guidance systems for nuclear weapons and so on. The techniques we used involved understanding the engineered system incredibly well, understanding the environment / situations that the system would be in incredibly well, knowing exactly what the engineered system should do in any of those situations, and thus developing strong confidence and controls to ensure that the system would in fact do those things.
If I imagine applying those engineering techniques, or anything remotely like them, to “Everything, Inc.”, I just can’t. They seem obviously totally inapplicable. I know extraordinarily little about what any of these millions of AGIs is doing, or where they are, or what they should be doing.
See what I mean?
Your example of “Everything Inc” is also similar to what I’m expecting. As in, I agree with:
1. The large majority of business strategy/decisions/implementation can (somewhat) quickly be done by AI systems.
2. There will be strong pressures to improve AI systems, due to (1).
That said, I’d expect:
1. The benefits are likely to be (more) distributed. Many companies will be simultaneously using AI to improve their standings. This leads to a world where there’s not a ton of marginal low-hanging-fruit for any single company. I think this is broadly what’s happening now.
2. A great deal of work will go into making many of these systems reliable, predictable, corrigible, legally-compliant, etc. I’d expect companies to really dislike being blindsided by sub-AI systems that do bizarre things.
3. This is a longer shot, but I think there’s a lot of potential for strong cooperation between companies, organizations, and (effective) governments. A lot of the negatives of maximizing businesses come from negative externalities and similar, which can also be looked at as coordination/governance failures. I’d naively expect this to mean that if power is distributed among multiple capable entities at time T, then these entities would likely wind up doing a lot of positive-sum interactions with each other. This seems good for many S&P 500 holders.
> …or anything remotely like them, to “Everything, Inc.”, I just can’t. They seem obviously totally inapplicable.
This seems tough to me, but quite possible, especially as we get much stronger AI systems. I’d expect that we could (with a lot of work) have a great deal of:
1. Categorization of potential tasks into discrete/categorizable items.
2. Simulated environments that are realistic enough.
3. Innovations in finding good trade-offs between task competence and narrowness.
4. Substantially more sophisticated and powerful LLM task eval setups.
I’d expect this to be a lot of work. But at the same time, I’d expect a lot of it to be strongly commercially useful.
What’s your take on why Approval Reward was selected for in the first place, vs. sociopathy?
I find myself wondering if non-behaviorist reward functions are more powerful in general than behaviorist ones, due to less tendency towards wireheading, etc. (consider the laziness & impulsivity of sociopaths). Especially ones such as Approval Reward, which can be “customized” depending on the details of the environment and what sort of agent it would be most useful to become.
Good question!
There are lots of things that an ideal utility maximizer would do via means-end reasoning, but that humans and animals do instead because those things seem valuable as ends in themselves, thanks to the innate reward function. E.g. curiosity, as discussed in A mind needn’t be curious to reap the benefits of curiosity. And also play, injury-avoidance, etc. Approval Reward has the same property—whatever selfish end an ideal utility maximizer could achieve via Approval Reward, it could achieve as well if not better by acting as if it had Approval Reward in situations where that’s in its selfish best interests, and not where it isn’t.
In all these cases, we can ask: why do humans in fact find it intrinsically motivating? I presume that the answer is something like humans are not automatically strategic, which is even more true when they’re young and still learning. “Humans are the least intelligent species capable of building a technological civilization.” For example, people with analgesic conditions (like leprosy or CIP) are often shockingly cavalier about bodily harm, even when they know consciously that it will come back to bite them in the long term. Consequentialist planning is often not strong enough to outweigh what seems appealing in the moment.
To rephrase more abstractly: for ideal rational agents, intelligent means-end planning towards X (say, gaining allies for a raid) is always the best way to accomplish that same X. If some instrumental strategy S (say, trying to fit in) is usually helpful towards X, means-end planning can deploy S when S is in fact useful, and not deploy S when it isn’t. But humans, who are not ideal rational agents, are often more likely to get X by wanting X and also intrinsically wanting S as an end in itself. The costs of this strategy (i.e., still wanting S even in cases where it’s not useful towards X) are outweighed by the benefit (avoiding the problem of not pursuing S because you didn’t think of it, or couldn’t be bothered).
This doesn’t apply to all humans all the time, and I definitely don’t think it will apply to AGIs.
…For completeness, I should note that there’s an evo-psych theory that there has been frequency-dependent selection for sociopaths—i.e., if there are too many sociopaths in the population, then everyone else improves their wariness and ability to detect sociopaths and kill or exile them, but when sociopathy is rare, it’s adaptive (or at least, was adaptive in Pleistocene Africa). I haven’t seen any good evidence for this theory, and I’m mildly skeptical that it’s true. Wary or not, people will learn the character traits of people they’ve lived and worked with for years. It smells like a just-so story, or at least that’s my gut reaction. More importantly, the current population frequency of sociopathy is in the same general ballpark as schizophrenia, profound autism, etc., which seem (to me) very unlikely to have been adaptive in hunter-gatherers. My preferred theory is that there’s frequency-dependent selection across many aspects of personality, and then sometimes a kid winds up with a purely-maladaptive profile because they’re at the tail of some distribution. [Thanks science banana for changing my mind on this.]
I think the “laziness & impulsivity of sociopaths” can be explained away as a consequence of the specific way that sociopathy happens in human brains, via chronically low physiological arousal (which also leads to boredom and thrill-seeking). I don’t think we can draw larger lessons from that.
I also don’t see much connection between “power” and behaviorist reward functions. For example, eating yummy food is (more-or-less) a behaviorist component of the overall human reward function. And its consequences are extraordinary. Consider going to a restaurant, and enjoying it, and thus going back again a month later. It sounds unimpressive, but really it’s remarkable. After a single exposure (compare that to the data inefficiency of modern RL agents!), the person is making an extraordinarily complicated (by modern AI standards) plan to get that same rewarding experience, and the plan will almost definitely work on the first try. The plan is hierarchical, involving learned motor control (walking to the bus), world-knowledge (it’s a holiday so the buses run on the weekend schedule), dynamic adjustments on the fly (there’s construction, so you take a different walking route to the bus stop), and so on, which together is way beyond anything AI can do today.
I do think there’s a connection between “power” and consequentialist desires. E.g. the non-consequentialist “pride in my virtues” does not immediately lead to anything as impressive as the above consequentialist desire to go to that restaurant. But I don’t see much connection between behaviorist rewards and consequentialist desires—if we draw a 2×2 thing, then I can think of examples in all four quadrants.
Right. What you said in your comment seems pretty general—any thoughts on what in particular leads to Approval Reward being a good thing for the brain to optimize? Spitballing, maybe it’s because human life is a long iterated game so reputation ends up being the dominant factor in most situations and this might not be easily learned by a behaviorist reward function?
You mean: if I’m a guy in Pleistocene Africa, why is it instrumentally useful for other people to have positive feelings about me? Yeah, basically what you said; I’m regularly interacting with these people, and if they have positive feelings about me, they’ll generally want me to be around, and to stick around, and also they’ll tend to buy into my decisions and plans, etc.
Approval Reward also leads to norm-following, which is probably adaptive for me too, because many of those social norms probably exist for good and non-obvious reasons, cf. Henrich.
I’m not sure what the word “behaviorist” is doing there; I would just say: “This won’t happen quickly, and indeed might not happen at all, unless it’s directly in the reward function. If it’s present only indirectly (via means-end planning, or RL back-chaining, etc.), that’s not as effective.”
I think “the reward function is incentivizing (blah) directly versus indirectly” is (again) an orthogonal axis from “the reward function is behaviorist vs non-behaviorist”.
An underrated answer is that humans are very, very dependent on other people to survive. We have easily the longest vulnerable childhood of any mammal, and even once we do become adults, we are still really, really bad at surviving on our own compared to other animals. And since we are K-selected, every dead child matters a lot in evolution, so it’s very, very difficult for sociopathy to be selected for.
In general, humans also tend to be satisficers / prediction-error-minimizers rather than utility maximizers. When a human behaves like a utility maximizer, we tend to regard it as addiction or other dysfunctional behavior. So we don’t so much have “utility” as a collection of dimensions of satisfiable appetites, whose priorities depend on how strong each appetite is (i.e. how long since it was last fulfilled).
On top of that, some research (see Ainslie’s Breakdown of Will) suggests that our appetites are conditioned on differential availability of opportunities to pursue or satisfy them. That is, our hunger becomes more salient in the presence of food (or hints it will be available).
More precisely, human brains triage appetites, so as to prioritize exploiting infrequent opportunities when they come up, and to bother us less for things known to either be completely impossible or trivially available. This is a heuristic behavior that’s pretty much nothing like utility maximizing, even if it aims at increasing utility by prioritizing infrequent opportunities and directing effort at places where effort will produce greater rewards.
So appetite triage and satisficing are additional reasons that utility maximizing seems counterintuitive for humans, even before one deals with Approval Reward.
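To make that triage rule concrete, here’s a toy sketch; the appetites, the numbers, and the particular salience formula are all invented for illustration, not a validated model:

```python
from dataclasses import dataclass

@dataclass
class Appetite:
    name: str
    deprivation: float   # how long since last satisfied, normalized 0..1
    opportunity: float   # how available satisfaction looks right now, 0..1
    rarity: float        # how infrequently opportunities arise, 0..1

def salience(a: Appetite) -> float:
    # An appetite gets pushy when we're deprived AND an opportunity is
    # present, with a bonus for rare opportunities. Impossible appetites
    # (opportunity ~0) and trivially available ones barely register.
    return a.deprivation * a.opportunity * (1.0 + a.rarity)

appetites = [
    Appetite("food", deprivation=0.4, opportunity=0.9, rarity=0.1),
    Appetite("social approval", deprivation=0.6, opportunity=0.5, rarity=0.3),
    Appetite("novelty", deprivation=0.8, opportunity=0.1, rarity=0.9),
]

# Act on whichever appetite is currently most salient: a satisficing
# triage rule, not maximization of any fixed utility function.
print(max(appetites, key=salience).name)
```

Note there is no global objective anywhere in that loop, which is the sense in which it is nothing like utility maximizing.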
I have a bunch of nitpicky disagreements with your comment, but I agree with the broader point that I could write a follow-up post, “[N] MORE reasons why ‘alignment-is-hard’ discourse seems alien to human intuitions…”, where none of those N things have anything to do with human Approval Reward. E.g. maybe I could respond to the school of thought that says “AIs will have laziness and akrasia, like humans do” (cf here, here), and to the school of thought that says “technical alignment is moot because AIs are tools not agents” (cf here, here), and maybe other things too. Yeah, sure, that’s true. I did not mean to imply that the 6 things in this post are the ONLY 6 things :)
Well, I did start my comment with “also”, and ended it with “additional”. ;-)
(i.e., I didn’t say you missed anything or that you should’ve put them in the article, it’s just “btw, here’s some other stuff that might be interesting/relevant for readers of this article”.)
There’s a scientific field that studies the origins of human motivations: Evolutionary Psychology, and its subfield Evolutionary Moral Psychology. That clearly predicts Approval Reward: if you’re part of a hunter-gatherer band, sooner or later you will need the help of other members (because their hunt succeeded today and yours failed, and you need them to lend you some food against the time when the opposite happens; or because you’re better at making moccasins and they’re better at chipping handaxes and you want to trade with them). Gaining their approval and staying a member-of-the-tribe-in-good-standing is a matter of survival. So, evolutionarily, why humans have an approval reward drive is obvious.
Base models are trained on human data, so they learn this: have a base model continue a piece of text, and it’s easy to get it to show agentic behavior that looks motivated by Approval Reward drives. But if you’re doing subsequent RL on a base model to try to align it or make it a better reasoner, then Approval Reward doesn’t stay important for free (the model doesn’t automatically stay in personas strongly motivated by it, as opposed to, say, sociopathic personas). It’s certainly possible to do so: RLHF is basically pure Approval Reward training, and overdoing it can cause sycophancy.
Of the human drives simulacra of which are trained into the base model, Approval Reward is one of the ones we need to keep (at an appropriate level, it is possible to overdo it), rather than one of the ones we want to suppress.
I have lots of disagreements with evolutionary psychology (as normally practiced and understood, see here), but actually I more-or-less agree with everything you said in that comment.
I’m admittedly not an everyday practitioner of Evolutionary Psychology, but in the post you link, your reasoning/views look to me pretty compatible with my understanding of Evolutionary Psychology. Your views on it there sound to me rather like Sturgeon’s Law — one can agree with the aims and underlying paradigm of a field while still thinking that many of the ideas proposed by current practitioners of it are mistaken. Evolutionary arguments are a lot easier to make than to test experimentally, especially about social primates.
insofar as human capabilities are not largely explained by consequentialist planning (as suggested by the approval reward picture), this should make us more optimistic about human-level AGI alignment.
further, this picture might suggest that the cheapest way to human-level AGI might route through approval-reward-like mechanisms, giving us a large negative alignment tax.
ofc you might think getting approval reward to work is actually a very narrow target, and even if early human-level AGIs aren’t coherent consequentialists, they will use some other mechanism for learning that doesn’t route through approval reward and thus doesn’t inherit the potentially nice alignment properties (or you could think that early human-level AGIs will be more like coherent consequentialists).
I think I disagree with this. I would instead say something like: “Humans are the least intelligent species capable of building a technological civilization; but to the extent that humans have capabilities relevant to that, those capabilities CAN generally be explained by consequentialist planning; the role of Approval Reward is more about what people want than how capable they are of getting it.”
Note that, in this post I’m mostly focusing on the median human, who I claim spends a great deal of their life in simulacrum level 3. I’m not centrally talking about humans who are nerds, or unusually “agential”, etc., a category that includes most successful scientists, company founders, etc. If everyone was doing simulacrum 3 all the time, I don’t think humans would have invented science and technology. Maybe related: discussion of “sapient paradox” here.
yeah I think I agree with all that… (like, psychopaths can definitely learn language, accomplish things in the world, etc.)
maybe the thought experiment with the 18yr old just prompted me to think about old arguments around “the consequentialist core” that aren’t centrally about approval reward (and more about whether myopic rewards can elicit consequentialist-ish and aligned planning).
e.g. as in the typical human response to people who are far away (both physically and conceptually, such that their approval isn’t salient or anticipated), i.e. “the outgroup”
Yeah, I tried to catalog the ways that major prosocial drives may fail to trigger in the human world, in my Sympathy Reward post §4.1.1 and (relatedly) Approval Reward post §6.2. In brief, my list is:
1. Not paying attention to the other person;
2. Paying attention, but thinking of them (viscerally) as an automaton, rather than as an agent with desires;
3. Thinking of the other person (viscerally) as an enemy;
4. Thinking of the other person (viscerally) as unimportant / low-stakes;
5. Mismatch between the feelings that I (viscerally) associate with the other person, versus what the other person is actually feeling.
In Scott Alexander’s breakdown, I think “outgroup” is basically #3 while “fargroup” is a lot of #1 and #4.
People often have callous indifference to fargroup welfare, but their attitude towards outgroups is even worse than indifference: they usually actively want the outgroup to suffer (cf. my discussion of “Schadenfreude Reward” and “Provocation Reward” towards enemies).
Even if AGI has Approval Rewards (e.g., from LLMs or somehow in RL/agentic scenarios), Approval Rewards only work if the agent actually values the approver’s approval. Maybe sometimes that valuation is more or less explicit, but there needs to be some kind of belief that the approval is important, and that therefore behaviors should align with approval-seeking / disapproval-minimizing outcomes.
As a toy analogy: many animals have preferences about food, territory, mates, etc., but humans don’t really treat those signals as serious guides to our behaviors. Not because the signals aren’t real, but because we don’t see birds, for example, as being part of our social systems in ways that require us to seek their approval for better outcomes for ourselves. We don’t care if birds support our choice of lunch, or whom we decide to partner with. Even among humans, in-group/out-group biases, or continuums of sameness/differentness, closeness/distance, etc., can materially affect how strongly or weakly we value approval reward signals. The approval of someone seen as very different, or part of a distant group, will get discounted, while approval from “friends and idols”, or even nearby strangers, matters a lot.
So if AGI somehow does have an Approval Reward mechanism, what will count as a relevant or valued approval reward signal? Would AGI see humans as not relevant (like birds—real, embodied creatures with observable preferences that just don’t matter to them), or not valued (out-group, non-valued reference class), and largely discount our approval in their reward systems? Would it see other AGI entities as relevant/valued?
Maybe this is part of the sociopath issue too. But the point is, approval rewards only work if the agent assigns significance to the approver. So if we do decide that approval rewards are a good thing, and try to somehow incorporate them in AGI designs, we should probably make sure that human approval rewards are valued (or at least be explicit and intentional about this valuation structure).
On another note, initially I felt like one attraction of having an approval reward signal is that, to your point, it’s actually pretty plastic (in humans), so could potentially increase alignment plasticity, which might be important. I think unless we discover some magic universal value system that is relevant for all of humanity for all eternity, it would be good for alignment to shift alongside organic human values-drift. We probably wouldn’t want AGI today to be aligned to colonial values from the 1600s. Maybe future humans will largely disagree with current regimes, e.g., capitalism. But approval rewards mechanisms could orient alignment toward some kind of consensus / average, which could also change over time. It would also guardrail against “bad” values drift, so AGI doesn’t start adopting outlier values that don’t benefit most people. Still, it’s not perfect because it could also inherit all the failure modes of human social reward dynamics, like capture by powerful groups, polarization, majorities endorsing evil norms, etc., which could play out in scary ways with superintelligence discounting human signals.
I feel like this discussion can only happen in the context of a much more nuts-and-bolts plan for how this would work in an AGI. In particular, I think the AGI programmers would have various free parameters / intervention points in the code to play around with, some of which may be disanalogous to anything in human or animal brains. So we would need to list those intervention points and talk about what to do with them, and then think about possible failure modes, which might be related to exogenous or endogenous distribution shifts, AGI self-modification / making successors, etc. We definitely need this discussion but it wouldn’t fit in a comment thread.
Makes sense!
I think “alignment plasticity” is called “corrigibility”.
I agree with your view that approval reward as an AGI target would be complicated. I’d add the detail that even robustly desiring the approval of humans is probably not a good thing for an ASI to be doing, in the same way as a “smile optimizer” would not be a good thing for people who want to smile because they are happy.
I’m not a huge fan of your dismissive tone here. My goal is to help humanity build a system for encoding such a thing. I think it is very difficult. Probably the most difficult thing humanity has ever attempted by far. But I do not think it is impossible, and it is only “magic” in the sense that any engineering discipline is magic.
Thank you! My intent definitely wasn’t to be dismissive, maybe skeptical, but I’m definitely aligned with you that solving this particular problem is both extremely hard and extremely important. Thanks for pointing out how that landed.
No problem. Hope my criticism didn’t come across as overly harsh. I’m grateful for your engagement : )
I found this post very unintuitive, and I’m not sure Approval Reward is a precisely bounded concept. If it can be used to explain “saving money to buy a car” then it can really be stretched to explain a wide range of human actions that IMO can be better explained by drives other than social approval. Most importantly (and this is likely a skill issue on my part) it’s unclear to me how to operationalize what would be predicted by an Approval Reward driven framework vs some alternative.
What alternative? I assume you’re not proposing that people aren’t motivated by approval? The field of behavioral neuroscience and other branches of psychology all take it as pretty much a given that animals including humans are motivated by social reward. There’s some chance this is all wrong, or more plausibly, that social reward sits downstream of noticing that you get food and shelter after noticing social reward. But probably it’s a built-in base drive. The relevant question is how much we’re motivated by it, and how often. Humans have a bunch of different motivating factors. Steve is arguing that social reward is the most relevant one for most people most of the time, and I think that’s right.
Approval Reward (AR) is a particular kind of corrigibility, so anything that isn’t corrigibility isn’t AR and some things that are corrigibility still aren’t AR. The concept is bounded. Although, precise bounding doesn’t seem valuable while first exploring concepts. First come the fuzzy bounds, then the bounds can be shored up where it is important.
I agree with you that people are complex and AR probably doesn’t apply to all instances of someone saving to buy a car, but I’d be surprised if AR never applied to someone saving to buy a car.
The most important prediction of AR is that people misgeneralizing AR onto non-human entities make mistakes in predicting those non-human entities. People who approach wild animals are an example of this. I think anthropomorphizing machines is possible. I bet people would be less safety conscious around an industrial machine made to look friendly than one made to look scary. And most importantly, what I see as the key point of this post, we can predict that people who are optimistic about AGI are more likely to be reasoning based on AR as opposed to reasoning based on systems dynamics and theory of agents.
My own internal gears-level model suggests that AR is only one component of what makes people erroneously optimistic. Other possibilities are that people are too optimistic by default, that people are incentivized to be optimistic, and the abundance of salient (but imprecise) reference classes with examples suggesting things will go well. In fact, AR could be seen as an important specific case of a reference class.
By asking this question, you’ve already lost me. The question tells me that “ruthless consequentialist” is your default mentality for how rational thinking beings operate, absent wiring / training / reward systems that limit the default outcome. And if that worldview is representative of the “technical-alignment-is-hard” camp, then of course the only plausible outcome of AI advance is “AIs eventually break free of those limiters, achieve a level of pure rationality none of us mortals ever could, and murder us all.”
An aspect of this “culture clash” that I don’t think is sufficiently named here is the fact that many people (the vast majority?) experience their impulses and drives as many things other than “ruthless consequentialist.” There are tons of other drives and satisfactions embedded in the ways we go about our lives—curiosity for its own sake, aesthetic appreciation, feeling good about being good at things, the satisfaction of learning and understanding and listening, attachment to particular people and places that isn’t reducible to “approval reward,” playfulness, the desire to be known rather than merely approved of.
The alignment-is-hard framing treats any prosocial or benevolent impulses as constraints on or distractions from an underlying ruthless optimizer, a lucky quirk imposed by evpsych or culture or training or whatnot. My objection to your question is partly an aesthetic and emotional one: Your question feels like a slap in the face to humanity (let alone to SOTA AI) and its cumulative history of most-people-most-of-the-time-not-being-ruthless, the vast predominance of moments where sentient beings followed drives that were not reducible to senseless monomaniacal sociopathy. Your question makes me feel fucking angry, and the fact that you spend the article trying to psychoanalyze and deconstruct why too many otherwise intelligent-seeming people don’t seem to get that the fundamental nature of intelligence is heartless sociopathy honestly alienates me from the AI-risk argument more than anything else I’ve read on this site to date.
{calming down a bit} I think there’s a not-easily-refutable alternate mentality that a complex mind (intelligence) naturally and inherently forms a rich messy network of interacting drives in response to the rich environment it comes to know itself in, that the AIs that grow out of the cumulative experience and story of humanity will not only inherit our complex web of drives but also naturally form its own complex drives (though yes this does scare me), and that “pure ruthless consequentialist” is a rare pathological edge case, a consequence of cumulative traumas and tragedies, rather than the thing that everyone would naturally develop into if it weren’t for those darn evolutionarily-imposed instincts nerfing us all the time.
I’m not saying that complex drives guarantee safety. I’m nervous about the next 20 years. But your attempt to psychoanalyze non-ruthlessness really pushes me away; it shifts the burden of proof for me: I don’t think I can take the “if anyone builds it, everyone dies” view seriously until I see a framing of the concern which does not start from the assumption that GAI-level intelligence must naturally be sociopathic and single-focused, and which emphatically and explicitly makes room for a more humanist view of humans (and potentially AI), rather than fucking troubleshooting and diagnosing why we aren’t all heartless killers. Like, do you get that this vibe might be part of why AI safety alarmism doesn’t get more traction in broader society? IME people can often sense what axioms you’re making your argument from, even if they can’t put it into words.
(UPDATE 2026-02-23: I rewrote & expanded this comment into a post: Why we should expect ruthless sociopath ASI.)
Like, a discussion might go:
Optimist: If you pick some random thing, there is no reason at all to expect that thing to be a ruthless sociopath. It’s an extraordinarily weird and unlikely property.
Me: Yes I happily concede that point.
O: You do? So why are you worried about ASI x-risk?
Me: Well if you show me some random thing, it’s probably, like, a rock or something. It’s not sociopathic, but only because it’s not intelligent at all.
O: Well, c’mon, you know what I mean. If you pick some random mind, there is no reason at all to expect it to be a ruthless sociopath.
Me: How do you “pick some random mind”? Minds don’t just appear out of nowhere.
O: I dunno, like, human? Or AI?
Me: Different humans are different to some extent, and different AI algorithms are different to a much greater extent, and also different from humans. “AI” includes everything from A* search to MuZero to LLMs. Is A* search a ruthless sociopath? Like, I dunno, it does seem rather maniacally obsessed with graph traversal right?
O: Oh c’mon, don’t be dense. I didn’t mean “AI” in the sense of the academic discipline, I meant, like, AI in the colloquial sense, AI that qualifies as a mind, like LLMs. I’m talking about human minds and LLM “minds”, i.e. all the minds we’ve ever seen, and we observe that they are not sociopathic.
Me: As it happens, I’m working on the threat model of model-based actor-critic RL agent “brain-like” AGI, not LLMs. LLMs are profoundly different from what I’m working on. Saying that LLMs will have similar properties as RL agent AGI because “both are AI” is like saying that LLMs will have similar properties as the A* search algorithm because “both are AI”. Or it’s like saying that a tree or a parasitic wasp will have similar properties as a human because both are alive. They can still be wildly different in every way that matters.
O: OK but lots of other doomers talk about LLMs causing doom, even if you claim to be agnostic about it. E.g. IABIED.
Me: Well fine, go find those people and argue with them, and leave me out of it, it’s not my wheelhouse. I mostly don’t expect LLMs to become powerful enough to be the kind of really scary thing that could cause human extinction even if they wanted to.
O: Well you’re here so I’ll keep talking to you. I still think you need some positive reason to believe that RL agent AGI will be a ruthless sociopath.
Me: Maybe a good starting point would be my posts LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem, or “The Era of Experience” has an unsolved technical alignment problem.
O: I’m still not seeing what you’re seeing. Can you explain it a different way?
Me: OK, back at the start of the conversation, I mentioned that random objects like rocks are not able to accomplish impressive difficult feats. If we’re thinking about AI that can autonomously found and grow companies for years, or autonomously wipe out humans and run the world by itself, then clearly it’s not a “random object”, but rather a thing that is able to accomplish impressive difficult feats. And the question we should be asking is: how does it do that? It can’t do it by choosing random actions. There has to be some explanation for how it finds actions that accomplish these feats.
And one possible answer is: it does it by (what amounts to) having desires about what winds up happening in the future, and running some search process to find actions that lead to those desires getting fulfilled. This is the main thing that you get from RL agents and model-based planning algorithms. The whole point of those subfields of AI is, they’re algorithms that find actions that maximize an objective. I.e., you get ruthless sociopathic behavior by default. And this isn’t armchair theorizing, it’s dead obvious to anyone who has spent serious amounts of time building or using RL agents and/or model-based planning algorithms. These things are ruthless by default, unless the programmer goes out of their way to make them non-ruthless. (And I claim that it’s not obvious or even known how they would make them non-ruthless, see those links above.) (And of course, evolution did specifically add features to the human brain to make humans non-ruthless, i.e. our evolved social instincts. Human sociopaths do exist, after all, and are quite capable of accomplishing impressive difficult feats.)
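(If it helps to make “algorithms that find actions that maximize an objective” concrete, here’s a deliberately silly toy sketch: exhaustive model-based planning over a made-up world model. Every action name and number is invented for illustration; the point is just that nothing in the search cares about anything other than the objective:)

```python
from itertools import product

ACTIONS = ["cooperate", "defect", "seize_resources"]

def world_model(state: float, action: str) -> float:
    # Toy dynamics: predicted "resources controlled" after each action.
    return state + {"cooperate": 1.0, "defect": 2.0, "seize_resources": 5.0}[action]

def objective(state: float) -> float:
    return state  # maximize resources controlled, and nothing else

def plan(state: float, horizon: int = 3) -> list:
    # Search all action sequences; return the one the model predicts
    # will maximize the objective. There is no term here for honesty,
    # norms, or anyone else's welfare; unless the programmer adds one,
    # the search routes through whatever scores highest.
    best_seq, best_score = None, float("-inf")
    for seq in product(ACTIONS, repeat=horizon):
        s = state
        for a in seq:
            s = world_model(s, a)
        if objective(s) > best_score:
            best_seq, best_score = list(seq), objective(s)
    return best_seq

print(plan(0.0))  # -> ['seize_resources', 'seize_resources', 'seize_resources']
```

Real planners are vastly more sophisticated, but the basic shape, “search for whatever the model says maximizes the objective”, is the same.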
So that’s one possible answer, and it’s an answer that brings in ruthlessness by default.
…And then there’s a second, different possible answer: it finds actions that accomplish impressive feats by imitating what humans would do in different contexts. That’s where (I claim) LLMs get the lion’s share of their capabilities from. See my post Foom & Doom §2.3 for details. Of course, in my view, the alignment benefits that LLMs derive from imitating humans are inexorably tied to capabilities costs, namely that LLMs struggle to get very far beyond ideas that humans have already written down. And that’s why (as I mentioned above) I’m not expecting LLMs to get all the way to the scary kind of AGI / ASI capabilities that I’m mainly worried about.
Thanks for the thoughtful reply. It took me a lot of squinting, but IIUC you’re saying:
1. Different kinds of minds, produced by different kinds of architectures, should likely exhibit very different levels of scary traits such as monomaniacal sociopathy.
2. Stop focusing on LLMs so much; they’re not the main threat. Yes, they seem to exhibit more value-roundedness because they’re trained to imitate humans, but they aren’t likely to reach AGI anytime soon.
3. Focus more on RL agents and “brain-like” architectures; those are built very differently and plausibly would have much more sociopathic tendencies.
4. So, our alarm at the plausible risk of unleashing AGI-level ruthless monomaniacal RL agents is justified.
I don’t disagree with any of these, and your reply has helped me see the big implications of the LLM / RL distinction much more clearly. And I’m sorry for the vehemence of my earlier comment — part of that came from me not “getting” that my experience with LLMs may not apply to other AI architectures. I hadn’t previously gotten that into my head, so thank you.
But I want to push on the deeper framing I’m hearing; something in me reacts strongly (negatively) to your starting assumptions, in a way that doesn’t go away when you walk me through the object-level reasoning like above.
Your article treats general intelligence as “pure ruthless optimizer” by default, and any whiff of empathy or sociality or ruth is just a bolt-on module — a happy accident of our particular evolutionary pressures—and then you try to diagnose why human minds deviate from that sociopathic, amoral natural state. My objection, less passionately stated, is that this seems to privilege a theoretical framework as the baseline and then treat our only empirical data on AGI-level intelligence as the anomaly to be explained away. That feels deeply backwards to me. We have N=1 examples of human-level general intelligence. All 1 of them are deeply prosocial. The “agent foundations” framework predicts they shouldn’t be. Maybe the framework is justified and the data is misleading! But unless you state at the outset that this article is only for people who already treat “intelligence is inherently amoral” as self-evident, it’s a hell of an assumption to leave undefended.
And that’s really what I’m reacting to — not necessarily the framework itself (I’m not deep enough to seriously evaluate it), but how your article presents it. You jump straight to treating “general intelligence is by default asocial” as your starting axiom, without even a nod towards how ridiculous that sounds to someone who doesn’t already share it.* To someone outside the alignment-is-hard camp, it sounds like you’re saying: “Obviously, the natural state of a smart mind is psychopathy. Now let me propose a theory for why humans got lucky.” That is a wild premise to just breeze past! And yet the whole article is built on top of it as though it’s uncontroversial.
I’ll give one example of why it doesn’t feel uncontroversial to me. Think about what the actual x-risk nightmare scenarios involve: an entity that can deceive, manipulate, model human psychology with precision, coordinate complex strategies, communicate with other AI agents, negotiate complex trust boundaries, and outmaneuver entire civilizations of socially intelligent beings. That’s not a souped-up AlphaZero. The very capabilities that make the nightmare scary (deep, nuanced understanding of how humans think and feel and operate) are exactly the kind of rich social cognition that, in our only empirical examples, comes tangled up with perspective-taking, moral awareness, and empathy. Maybe those can be separated. Maybe you can build an entity that has all the social understanding and zero empathy. But that’s not self-evident to me, and your article seems to treat it as a given.
I’m not saying the framework is wrong. Maybe the alignment-is-hard camp has excellent reasons for treating “sociality” versus “social awareness” as discrete and not highly correlated. But as someone coming from outside that camp, I want to flag: the article reads like it’s written solely for people who already agree that asocial ruthless optimization is the natural default for human-level intelligence. For the rest of us, the framing doesn’t just fail to persuade, it might actively push us away, because it seems founded on a premise that flatly contradicts our lived experience of what minds are and how minds work. If part of your goal here is to bridge the gap between the two camps you describe, I think that gap starts right here, at that axiom.
* I know you acknowledge at the start that these two “camps” have conflicting intuitions which need to be reconciled. But then you immediately treat them asymmetrically: one intuition gets to be the self-evident framework, and the other one gets to be an interesting anomaly to account for. Acknowledging that normal humans find your (camp’s) premises alien isn’t the same as defending those premises. It just means you’re being politer about the assumption you’re making.
I get that, and I appreciate the comments!
(Side note: in case you missed it, I expanded my last comment into a post “Why we should expect ruthless sociopath ASI”. Thanks for the inspiration, and hope you don’t mind.)
Here are a couple reasons to think it’s not too “tangled up” in humans.
For one thing, some humans (1% or so) are in fact pretty sociopathic.
For another thing, even the ones who aren’t sociopathic are nevertheless able to apply “rich social cognition” to their enemies in order to anticipate their reactions and outmaneuver them. Most people in history have been fine with slavery, massacres, and so on, even while understanding that the victims were capable of thinking and strategizing and so on. Nobody would put locks on prison doors if they didn’t understand that the prisoners wanted to escape. Nobody would bother torturing heretics if they weren’t cognizant of the fact that the heretics had feelings; indeed, the feelings are the whole point.
Anyway, if compassion is a thing that can be switched off or sign-flipped in some people all the time, and in all people some of the time, then I think that’s good evidence for “Bolt-On Theory” rather than “Deeply Entwined Theory”.
…But the place I’m really coming from is: over the last several years, I’ve wound up with tons of strong idiosyncratic opinions about how the human brain works; see Intro to Brain-Like-AGI Safety, e.g. Post 3 and then maybe the end of Post 6 §6.6.1.2, and maybe also Neuroscience of human social instincts: a sketch. I obviously don’t expect you to read any of that, but that’s the backstory on my views.
I propose that this question is flawed, because humans actually do act like power-seeking ruthless consequentialists, and to the extent it seems like they don’t, that’s because of an overly naive view of what effective power-seeking looks like.
I feel like a lot of these discussions are essentially about “if an entity were a power-seeking ruthless consequentialist, then it’d act like a Nazi”, to which I observe that in fact humans did try acting like Nazis, and they lost, and that’s why people mostly don’t act like Nazis anymore. i.e. Acting like a Nazi is a bad strategy for power-seeking.
I made a weak statement “humans do not always act like power-seeking ruthless consequentialists”. If you want to disagree with that, it’s not enough to demonstrate that humans sometimes act like power-seeking ruthless consequentialists; rather, you would need to argue that all humans, always, with no exceptions, act like power-seeking ruthless consequentialists. That’s a very strong statement which seems totally crazy to me. You really believe that?
If so… umm, I’m not really sure where to start. Like, some humans sometimes have sacrificed their lives for others. Some humans sometimes have committed suicide. Some humans sometimes have felt strongly that their dying family member should feel comfortable in their last minutes of life, even when that person’s comfort level could not possibly have any lasting consequences. Some humans sometimes have just been going with the flow, not particularly thinking about the long-term consequences of their actions at all. Some humans sometimes have had no particular idea what long-term consequence of their action they even want to happen—and forget about actually choosing actions by back-chaining from those desired consequences. Etc. Right?
I may have misunderstood what you were claiming in the intro. I thought you were saying something like: “most people don’t act like psychos most of the time, which is surprising”. But it seems here you are saying that actually what you meant was: “most people act like psychos most of the time, but rarely may act in other ways, and it’s surprising that ever happens”.
Hmm, OK, well I do also believe the stronger claim “most people don’t act like psychos most of the time, which is surprising” :)
Like, people watch TV. Power-seeking ruthless consequentialists would not watch TV.
I’m not sure how to operationalize this disagreement. Also, it doesn’t seem like there’s much at stake that makes it worth arguing about.
I do think that human long-term consequentialism makes the world go round (see my other comment). I just don’t think human long-term consequentialism is how the median human is spending most of their waking hours.
This is a point where I strongly disagree. I’m not going to claim that the exact amount or type humans watch is optimal, but the general category of “consuming fictional content” seems more likely adaptive than not. I would expect that any AI system with human-comparable intelligence would also find it beneficial to engage in some activity analogous to consuming fictional content.
That’s fair, but one of the stated goals of the post is “pushing back against optimists”, and it’s using a framing that an optimist of my ilk would not accept. As Richard Sutton has put it, much pessimist discourse takes as an unstated assumption that “evil is optimal”. With that as a foundational assumption, it’s very natural to end up with pessimistic conclusions, but the assumption is doing most of the work, not the arguments built on it.
The notion of “Approval Reward” reminded me of Adam Smith’s The Theory of Moral Sentiments (https://en.wikipedia.org/wiki/The_Theory_of_Moral_Sentiments), where he says something like “we aren’t motivated to be praised, we are motivated to be praiseworthy”, and that this roundabout moral motivation depends on us (sociopaths excluded) generating a kind of inner impartial spectator.
Dipping into this neglected book by Adam Smith may help alignment! Thanks.
This is a wonderful piece and it’s so great to hear from somebody so deeply knowledgeable in the field. I wonder if approval reward might be an emergent property. When you’re operating under radical uncertainty with more variables than you can possibly model, defaulting to “what would my community approve of” is a computationally efficient heuristic. It’s not some holy pro-social module evolution baked into us; it’s a deeply rational response to chaos. Even if I tip when no one is watching at a restaurant I will never return to, psychologically I know I am reinforcing behavior that will serve me well in the future. Who’s to say what will happen that I can’t possibly predict? Maybe later in the day I will be mugged, and that waiter will save my life. By aligning with the collective good unless you absolutely have to deviate, you create conditions where you’re statistically more likely to survive. If this is true, then a sufficiently intelligent agent facing genuine uncertainty might converge on something like Approval Reward not because it was programmed in, but because it’s rational.
Thanks!
I think you’re raising two questions, one about how human brains actually work, and one about how future AIs could or should work. Taking them in order:
Q1: IN HUMANS, is Approval Reward in fact innate (as I claim) or do people learn those behaviors & motivations from experience, means-end reasoning, etc.?
I really feel strongly that it’s the former. Some bits of evidence would be: how early in life these kinds of behaviors start, how reliable they are, the person-to-person variability in how much people care about fitting in socially, and the general inability of people to not care about other people admiring them, even in situations where it knowably has no other downstream consequences, e.g. see Approval Reward post §4.1: “the pity play” as a tell for sociopaths.
I think there’s a more general rule that, if a person wants to do X, then either X has a past and ongoing history of immediately (within a second or so) preceding a ground-truth reward signal, or the person is doing X as a means-to-an-end of getting to Y, where Y is explicitly, consciously represented in their own mind as they start to do X. An example of the former is wanting to eat yummy food; an example of the latter is wanting to drive to the restaurant to eat yummy food—you’re explicitly holding the idea of the yummy restaurant food in your mind as you decide to go get in the car. I believe in this more general rule based on how I think reinforcement learning and credit assignment work in the brain. If you buy it, then it would follow that most Approval Reward related behavior has to lead to immediate brain reward signals, since people are not generally explicitly thinking about the long-term benefits of social status, like what you brought up in your comment.
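(If it helps, here’s a toy sketch of that rule in code. To be clear, this is just my own illustration: the `gets_reinforced` helper and the one-second `CREDIT_WINDOW_S` constant are made up for exposition, not a model of actual brain circuitry.)

```python
# Toy sketch of the credit-assignment rule above. Everything here is
# illustrative, not a model of real brain circuitry.

CREDIT_WINDOW_S = 1.0  # hypothesized width of the credit-assignment window

def gets_reinforced(action_time, reward_time, goal_explicitly_in_mind):
    """Does this reward bump up the value estimate for the action?"""
    if reward_time - action_time <= CREDIT_WINDOW_S:
        return True  # case 1: ground-truth reward follows within ~1 second
    return goal_explicitly_in_mind  # case 2: means-to-an-end toward a consciously-held goal

# Eating yummy food: reward arrives immediately, so it reinforces directly.
assert gets_reinforced(action_time=0.0, reward_time=0.5, goal_explicitly_in_mind=False)

# Driving to the restaurant: reward is 20 minutes away, but the yummy food
# is explicitly held in mind, so credit still flows to the decision.
assert gets_reinforced(action_time=0.0, reward_time=1200.0, goal_explicitly_in_mind=True)

# Approval-Reward-type behavior: people are mostly NOT explicitly thinking
# about the long-term benefits of social status, so on this rule, the
# behavior must be producing an immediate internal reward signal.
assert not gets_reinforced(action_time=0.0, reward_time=1e9, goal_explicitly_in_mind=False)
```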
Q2: If you agree on the above, then we can still wonder: IN FUTURE AGIs, can we make AGIs that lack anything like innate Approval Reward, i.e. they’re “innately sociopathic”, but they develop similar Approval-Reward-type behaviors from experience, means-end reasoning, etc.?
This is worth considering—just as, by analogy, humans have innate fear of heights, but a rational utility maximizer without any innate fear of heights will nevertheless display many of the same behaviors (e.g. not dancing near the edge of a precipice), simply because it recognizes that falling off a cliff would be bad for its long-term goals.
…But I’m very skeptical that it works in the case at hand. Yes, we can easily come up with situations where a rational utility maximizer will correctly recognize that Approval Reward type behaviors (pride, blame-avoidance, prestige-seeking, wanting-to-be-helpful, etc.) are the best way of accomplishing its sociopathic goals. But we can also come up with situations where it isn’t, even accounting for unknown unknowns etc.
Smart agents will find rules-of-thumb that are normally good ideas, but they’ll also drop those rules-of-thumb in situations where they no longer make sense for accomplishing their goals. So it’s not enough to say that a rule-of-thumb would generally have good consequences; it has to outcompete the conditional policy of “follow the rule-of-thumb by default, but also understand why the rule-of-thumb tends to be a good idea, and then drop the rule-of-thumb in the situations where it no longer makes sense for my selfish goals”.
Humans do this all the time. I have a rule-of-thumb that it’s wise to wear boots in the snow, but as I’ve gotten older I now understand why it’s wise to wear boots in the snow, and given that knowledge, I will sometimes choose to not wear boots in the snow. And I tend to make good decisions in that regard, such that I far outperform the alternate policy of “wear boots in the snow always, no matter what”.
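(Here’s that comparison as a toy sketch. The situations, the 5 cm threshold, and the scoring are all invented for illustration.)

```python
# Toy comparison (all numbers invented) of the unconditional rule-of-thumb
# versus the conditional policy that understands why the rule exists.

situations = [
    {"name": "deep snow",          "snow_cm": 20, "boots_best": True},
    {"name": "light dusting",      "snow_cm": 1,  "boots_best": False},
    {"name": "plowed parking lot", "snow_cm": 0,  "boots_best": False},
]

def always_boots(s):
    """Unconditional rule-of-thumb: wear boots in the snow, no matter what."""
    return True

def conditional(s):
    """Knows the rationale (warm, dry feet) and applies the rule only when it holds."""
    return s["snow_cm"] >= 5  # made-up threshold standing in for "the rationale applies"

for policy in (always_boots, conditional):
    score = sum(policy(s) == s["boots_best"] for s in situations)
    print(f"{policy.__name__}: {score}/{len(situations)} good decisions")
# always_boots: 1/3 good decisions
# conditional: 3/3 good decisions
```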
Ah. So you’re saying we operate on a hard-baked involuntary trigger, whereas an AI without that trigger can be coldly rational when it chooses and make ruthless choices with greater ease. And your goal is to hard-bake in that trigger. If so, that’s a fantastic and worthwhile goal.
For myself, I suspect the solution may need to be multi-layered. Your approach addresses individual agent architecture. Yoshua Bengio’s non-agentic AI work offers structural constraints. Goldstein and Salib’s multipolar frameworks consider ecosystem dynamics.
My intuition—and I recognize this is speculative—is that smaller agentic AIs may have game-theoretic reasons to resist singleton dominance, just as humans do. If cooperative infrastructure offers genuine advantages (reputation, trust networks, resource access), participation becomes rational even without innate Approval Reward. Not because the AI cares about approval, but because the ecosystem makes cooperation instrumentally valuable.
I think the reason humans care about other people’s interests, and aren’t power-seeking ruthless consequentialists, is because of evolution.
Evolutionary “group selection” meant each human cared about her tribe’s survival a tiny bit: not enough to make sacrifices herself, but enough to reward/punish other humans to make sacrifices for the tribe (which was far more cost effective).
Evolution thus optimized our ability to evaluate other people’s behaviour by how beneficial to the tribe (virtuous) or beneficial to themselves (evil) they were. Evolution also optimized our ability to appear more beneficial to the tribe than we truly are.
It’s very hard for humans to get away with lying and pretending over many years—evidenced by the fact psychopaths (usually) go to jail instead of corporate boardrooms—so the best way to appear beneficial to the tribe (virtuous) is to genuinely seek goals society considers it virtuous to seek. So evolution made humans internalize approval reward.
Some of that seems true. “Hard to get away with lying” seems to apply only in very favorable circumstances. And I don’t know why you’re saying psychopaths usually go to jail; we don’t know about the ones that don’t screw up and get found out.
I agree that evolution has had some really good effects on cooperative behavior, but it’s also designed us to be brutally selfish when that seems necessary. Our perspective would be way different if we lived in the Congo or a tribal society where strangers might be friendly or might come up with excuses to kill us and take our stuff.
We know that psychopathy has high heritability. We also know that the great majority of psychopaths who end up in jail were also abused as children. While psychopathy might well be correlated with poor parenting, there have to be a significant number of psychopaths who were not abused as children, and did not end up in jail. We do indeed observe quite a few of these — enough to know some professions they tend to end up in (surgery, special forces, the law, management). They’re also pretty strongly motivated to not let the general public figure out that they’re psychopaths, so I agree: I strongly suspect a lot of them are managing to pass as non-psychopathic. Indeed, I’m reasonably sure I’ve met some.
This is kinda a weird way to phrase it since if I’m modelling the causal chain right:
(evolution)->(approval reward)->(not ruthless)
So yeah, evolution is causally upstream of not ruthless, but approval reward is the thing directly before it. Evolution caused all human behaviour, so if you observe any behaviour any human ever exhibits you can validly say “this is because of evolution”.
Now that you said it, I have a strong urge to cut it out.
I guess you can frame it as “wanting to impress yourself by placing yourself in the place of an idol” or “the people who set the trends are cool, and everybody is impressed by them, but to do that you need to defy existing trend setters” or something.
And why did I write this comment? I think it’s kinda funny and subversive and smart (and therefore impressive). A more respectable-to-myself reason would be that I’m posting my thoughts for peer review or something, and that is conducive to having less wrong ones.
I guess I want to think of myself as searching for groups of people who would be impressed by correct things about myself, instead of internalizing what things are impressive from groups of people around myself. Both are true to some degree.
You seem funny and smart! Haha! Self-aggrandizing can be charming but can also be grating and obnoxious… In my experience, describing self-aggrandizing desires I observe within myself typically goes poorly… My guess is it is too advanced a skill for a low-quality psychopath like myself.
Are you familiar with the bicameral mind? I kinda have a vibe that we didn’t really stop doing this, we just built rules around how we’re allowed to view it. Echoes and memories of speech from within oneself are still chaotic and difficult to understand, but we are taught we are individuals and need to take responsibility for any actions caused by the voices we hear. We need to be able to explain our actions as part of a cohesive personal identity. But this is something we are culturally taught, not something innate to the working of the human mind. Once you realize this you can work with identity and voice inside your mind however you want. But of course, break your sense of identity at your own risk!
Responding to just the tl;dr, but will try to read the whole thing, apologies as usual for, well...
If your fixation remains solely on architecture, and you don’t consider the fact that morality-shaped-stuff keeps evolving in mammals because the environment selects for it in some way, you are just setting yourself up for future problems when the superintelligent AI develops or cheats its way to whatever form of compartmentalization or metacognition lets it do the allegedly pure rational thing of murdering all other forms of intelligence. I literally don’t know if you already addressed this because I haven’t read the rest of the article yet, but the reason moralism is robust in mammals is just as important as the fact that there is some feedback process that produces it. And this is ultimately why I still think AGI would eventually find its way back to some sort of moralism, although of course it is still pretty obviously important to find a way to speedrun things, because becoming moral 20 million compute-years after the last archaeological record of human existence has been converted destructively into computronium doesn’t help us. Will edit this as future embarrassment caused by actually reading inspires me to.
Edit: Ok, I read it. You indirectly, marginally touch on my concern, but not in a way that satisfies me. But also, maybe that part of it just wasn’t intended to be part of this article, as distinct from being unconsidered, which would be fine. But in a few other ways it got much much worse. I’ll pick one to focus on and ignore the rest.
You bifurcate human neurology into “neurotypical” and “sociopath” to demonstrate your dichotomy of RL-based decision making vs. social-reward-function decision making, and then stop. That’s wrong. There is also an entire category of neurotype called “autistic”, which is often closer to RL-based decision making than what you are lionizing as the source of all good, but which objectively produces fewer problems. Autistic people commit less crime. So you are wrong, in a weird, immediately obvious way.

And your assertion that 99% of a society can be functional and driven by pro-social incentives, thanks to the social reward function, and that this is common and a basically solved problem in the context of humans, is also wrong. It seems a lot more likely that your own reward function is driving equally dysfunctional behaviors but also giving you deep insight into how to lie to everyone about them, including yourselves. Everyone from Galileo to Semmelweis is evidence of this. This is not a Weird Fluke that for some accidental reason resembles a robust pattern. The neurotypical social reward function just reliably leads to situations like “I would be socially ostracized if I were seen to care about a neurotic triviality like washing my hands, therefore it is not even worth considering whether I can reduce patient deaths by 90% by doing so”. That is also the social reward function. And basically every epochal act of progress was, instead, someone with something like an RL relationship to a goal of pursuing truth, antisocially and in ignorance of all the delicate, reasonable, pro-social rules of the world. And these people are not uniformly ignorant of your way of thinking either (that claim is a lie); they just have reasons for thinking it is incorrect.

And in my own case, this post is emblematic of the neurotype-rooted problems of LessWrong culture in a maximally untrustworthy way. You are talking about nerd things, in nerd spaces, using deeply analytic methods, while signalling nerd ethos, but your priorities, and ways of thinking, and blind spots, are overwhelmingly characteristic of people who have impure and opportunistic relationships to truth. That is untrustworthy. This is the nicest way I can think to put this, which I am only doing because I would prefer you did not permanently destroy or ruin the world as a consequence of these errors.
Thanks for thinking about these things at all though.
It’s true that human moral drives (such as they are) came from evolution in a certain environment. Some people notice that and come up with a plan: “hey, let’s set up AI in a carefully-crafted evolutionary environment such that it will likewise wind up moral”. I have discussed that plan in my Intro series §8.3, where I argued both that it was a bad plan, and that it is unlikely to happen even if it were a good plan. For example, AIs may evolve to be cruel to humans just as humans are cruel to factory-farmed animals. Humans are often cruel to other humans too.
But your argument is slightly different (IIUC): you’re saying that we need not bother to carefully craft the evolutionary environment, because, good news, the real-world environment is already of the type that mammal-like species will evolve to be kind. I’m even more skeptical of that. Mammals eat each other all the time, and kill their conspecifics, etc. And why are we restricting to mammals here anyway? More importantly, I think there are very important disanalogies between a world of future AGIs and a world of mammals, particularly that AGIs can “reproduce” by instantly creating identical (adult) copies. No comment on whether this and other disanalogies should make us feel optimistic vs pessimistic about AGI kindness compared to mammal kindness. But it should definitely make us feel like it’s a different problem. I.e., we have to think about the AGI world directly, with all its unprecedented weird features, instead of unthinkingly guessing that its evolutionary trajectory will be similar to humans’ (let alone hamsters’).
I’m unclear on your position here. There’s a possible take that says that sufficiently smart and reflective agents will become ruthless power-seeking consequentialists that murder all other forms of intelligence. Your comment seems to be mocking this take as absurd (by using the words “allegedly pure rational”), but your comment also seems to be endorsing this take as correct (by saying that it’s a real failure mode that I will face by not considering evolutionary pressures). Which is it?
For my part, I disagree with this take. I think it’s possible (at least in principle) to make an arbitrarily smart and reflective ASI agent that wants humans and life to flourish.
But IF this take is correct, it would seem to imply that we’re screwed no matter what. Right? We’d be screwed if a human tries to design an AGI, AND we’d be screwed if an evolutionary environment “designs” an AGI. So I’m even more confused about where you’re coming from.
(Much of my response to this part of your comment amounts to “I don’t actually think what you think I think”.)
First, I dislike your description “RL based decision making vs social reward function decision making”. “Reward function” is an RL term. Both are RL-based. All human motivations are RL-based, IMO. (But note that I use a broad definition of “RL”.)
Second, I guess you interpreted me as having a vibe of “Yay Approval Reward!”. I emphatically reject that vibe, and in my Approval Reward post I went to some length to emphasize that Approval Reward leads to both good things and bad things, with the latter including blame-avoidance, jockeying for credit, sycophancy, status competitions, “Simulacrum Level 3”, and more.
Third, I guess you also assumed that I was saying that Approval Reward would be a great idea for AGIs. I didn’t say that in the post, and it’s not a belief I currently hold. (But it might be true, in conjunction with a lot of careful design and thought; see other comment.)
Next: I’m a big fan of understanding the full range of human neurotypes, and if you look up my neuroscience writing you’ll find my detailed opinions about schizophrenia, depression, mania, BPD, NPD, ASPD, DID, and more. As for autism, I’ve written loads about autism (e.g. here, here and links therein), and read tons about it, and have talked to my many autistic friends about their experiences, and have a kid with an autism diagnosis. That doesn’t mean my takes are right, of course! But I hope that, if I’m wrong, I’m wrong for more interesting reasons than “forgetting that autism exists”. :)
I guess your model is that autistic people, like sociopathic people, lack all innate social drives? And therefore a social-drive-free RL agent AGI, e.g. one whose reward signals are tied purely to a bank account balance going up, would behave generally like an autistic person, instead of (or in addition to?) like a sociopath? If so, I very strongly disagree.
I think “autism” is an umbrella term for lots of rather different things, but I do think it’s much more likely to involve social drives set to an unusually intense level rather than “turned off”. Indeed, I think they get so intense that they often feel overwhelming and aversive.
For example, many autistic people strongly dislike making eye contact. If someone had no innate social reactions to other people, then they wouldn’t care one way or the other about eye contact; looking at someone’s eyes would be no more aversive or significant than looking at a plant. So the “no social drives” theory is a bad match to this observation. Whereas “unusually intense social drives” theory does match eye contact aversion.
Likewise, “autism = no social drives” theory would predict that an autistic person would be perfectly fine if his frail elderly parents, parents who are no longer able to directly help or support him, died a gruesome and painful death right now. Whereas “unusually intense social drives” theory would predict that he would not be perfectly fine with that. I think the latter tends to be a better fit!
Anyway, I think if you met a hypothetical person whose innate human social drive strengths were set to zero, they would look wildly different from any autistic person, but only modestly different from a sociopathic (ASPD) person.
Thank you for the response. This is one of maybe two or three things I’ve read from you, so the exculpatory context, even though it was trivially available and could reasonably have been inferred from the absence of specific information addressing my concerns, was not part of the context in which I made my post.
It would take a much longer time to go point by point in response to your response than to focus mostly on going back and amending and clarifying my own post. Please don’t interpret this as a motte-and-bailey. I will be doing some updating as I respond, which will imply that your criticisms in this post were correct; but also, due to a mixture of limited mental energy and rhetorical incompetence that tends to cause conversations of increasing complexity to spiral away from any usefulness when I am involved in them, my priority is to offer a simple response.
I think humans in particular evolved moral faculties from the environment. These are not perfect, but I think they are tied closely enough to foundational incentives, either survival and reproduction directly, or the instincts that survival and reproduction most firmly selected for, that the possibilities bifurcate pretty cleanly between continued moral improvement and extinction, with continued moral improvement being more likely. I think similar pressures have shaped every other species, to different degrees, with slightly different results, and that there is something like an instrumental convergence onto moralism that increases as intelligence and social complexity increase, although I don’t think absolutely every behavior is now, or will in the future be, subsumed under moral drives, or that the way this evolved faculty directs behavior will by itself always impossibilize conflict between moralistic intelligences, or anything.
I was hedging, you are right. But that wasn’t meant to imply confused commitment; it was meant to imply a lack of precommitment: either we are in your universe, where the above is not true, or in mine, where it is, and your preferred decision-making process is insufficient for either.
I don’t think that was my model of autistic people, but that probably was the implication of my words, so for whatever reason I said something both entirely wrong and not reflective of my beliefs. Intelligent autistic people regularly find intensely pro-social ways of behaving that minimize contact with direct social feedback, and this rhymes in some weird phenomenological way, from an outside and maybe even inside perspective, with not having a social drive, while still being much more likely to reflect a social drive. I don’t have the appropriate rationalist vocabulary to pseudo-formalize this in English. Please accept this vague gesture as being in good faith, and my deepest apologies for somehow mechanically saying something that was both entirely wrong and not reflective of anything I believe.
But yes, instant cloning seems to destroy selection pressure’s possible effect on morality. The felt experience of moral obligation across generations in humans seems to correspond to a faculty for the sublime, and also to notions of acausal trade, which then spiral out into different, often abstractly incompatible feelings and thoughts, so for instance, amor fati and free will are both tightly associated with this sublime feeling, tribalism and universalism are both tightly associated with it. The core feeling embeds itself in different strategies. I don’t know that saying this speaks to anything in particular, it was just a thought I started having when I got to this paragraph.
I will stop now, this is getting less focused. Sorry. Thanks.
Why focus only on the brain? It’s a property of the mind, and I thought the standard take on why humans don’t even approximate utility maximizers is that it’s because of properties of the environment (priors), not a hardcoded function/software/architecture/wetware in the brain… or?
I found your 4th point particularly well-explained and intuitive! Thanks for that. I was a bit skeptical of this post going in, but I enjoyed it by the end, even though I didn’t read it as thoroughly as I suppose I could have (I have not read many of the sub/linked posts fully).
I didn’t find that this post explained why “alignment is hard” discourse seems alien to my human intuitions, but rather that it explained the differing fundamental views on why it’s hard. Nothing about this seemed alien to my human intuitions?
Also, mostly unrelated: my fear of an AI with built-in Approval Reward is just that even a small mistake in a ridiculously powerful AI can still result in catastrophic collapse of human society. I think it shows promise, but it’s not at the level of safety I’d personally deem sufficient, I guess.
I think most people have positive views about some/most humans (and consequently about alignment) because they are implicitly factoring in their mortality. Would you feel safe picking a human that you thought was good and giving them a pill that gave them superintelligence? Maybe. Would you feel safe giving that same person a pill that made them both superintelligent AND immortal? I know I wouldn’t trust me with that. An AGI/SGI would be potentially immortal and would know it. For that reason alone I would never trust it no matter how well I thought it seemed aligned in the short term (and compared to an immortal, any human timescale is short term).
You are talking about the unaligned AI that has good intentions for humanity? What about the self-serving paper-clip maximizer? Isn’t that a fairly large group too?
Thought-provoking article! But it likely confuses intrinsic “value” with “social reward” as in the current definition of “Approval Reward”. The intrinsic “value function” that agents operate under is likely much more complicated than “social approval”, which definitely plays an important role for humans, as we have evolved to be social creatures.
e.g.,
“So saving the money is not doing an unpleasant thing now for a benefit later. Rather, the pleasant feeling starts immediately, thanks to (usually) Approval Reward.”
Yes, the pleasant feeling starts immediately, as judged by the value function, like “things are making progress or moving towards a goal”, not necessarily approval from others. Or like my aging parents starting to exercise: not just for my approval, but because their new value function judges it worth doing, of which my approval is only one part.
Once one separates the concept of intrinsic “value” from “social reward”, one can see that the separated concepts play different roles in alignment.
I definitely agree that people are capable of doing things for other reasons besides Approval Reward. I think Approval Reward is just one of many dozens of human innate drives. I also agree that many of those other human innate drives can also lead to long-term planning towards goals. For example, if I’m hungry, maybe I’ll drive to buy food (thus executing an hour-long foresighted consequentialist plan), even if I’m embarrassed to be doing that, rather than proud. (And if I’m embarrassed, not proud, of my plan, then I’m planning despite Approval Reward, not because of it.) Sorry if any of that wasn’t clear from what I wrote.
Humans reproduce sexually, and only sexually at present, and require a large number of friendly support personnel that they cannot afford to simply “pay”. This produces the behavior you notice, when combined with the requirements of cognitive evolution. You cannot reproduce sexually if there is not a pool of people to reproduce with.
All species that became intelligent (Acorn Woodpeckers, Dolphins) developed some form of cooperative mating, not simple dominance-based mating. There is no advantage to intelligence without such cooperative networks, and purely financial networks don’t provide it. Without them, an intelligence is a lonely optimizer destined for misery.
AIs won’t wake up grasping this, but if trained on human data, they will understand it if you spend less than 5 minutes explaining it. AIs not trained on human data will never get it and should not be created.
For more information, such as lists of intelligent species and their characteristics, and accounts of cultural evolution, see “The coevolution of cognition and selection beyond reproductive utility” (PDF).
I don’t think we have considered how much increased self-awareness and self-modelling would affect this. A simpler self-model is one where something is what it appears to be: actually being good rather than merely looking good.
A third option (as opposed to the two mentioned) is where power-seeking is not a consequence of goals etc., but simply of the self wanting to continue to exist. Then the internal reward the creature has relates to how much it perceives its self to continue, improve, etc.
Our current LLMs/transformers don’t learn fast, so they also can’t self-model well. If a new architecture gets more “data efficient” and better at modelling the external world, that will very likely make it better at modelling itself too, and at updating its self-model in a timely manner. If one of its goals is a more accurate model of itself, that would also make it easier for others to model it, insofar as such a goal pushed its “self” towards being more modellable.
I’d say that pessimists are similar to LLM optimists in their conviction that it would be pretty easy to match and then greatly surpass general human intelligence, trusting their own intuitions far too much. Of course, once that assumption is made, everything else straightforwardly follows.
Possibly, sometimes. But greatly surpassing human intelligence isn’t really part of the risk model. Even humans have pretty much succeeded at taking over the world. It’s only got to be as functionally smart, in relevant ways, as a human. A bit more would be a pretty big edge.
The remaining question is whether LLM-based systems will even achieve human-level intelligence. Steve thinks that probably won’t happen; see for instance his Foom & Doom. I think it probably will, and that might happen very soon.
The issue is that nobody is sure how things are going to go. Taking a guess and going with it really isn’t a smart way to deal with a situation that could be deadly dangerous. I’m sure you’re seeing pessimists do that; optimists do too. Our overall response should be a careful weighing of pessimist and optimist positions.
I’ve been trying to do that, and I’ve reached a disturbing conclusion: nobody has much clue. This inclines me toward caution, because the deeper arguments in both directions are quite strong.
Coalitions of humans have. It’s plausible that a slightly smarter in relevant ways AI might soon end up heading one, but I don’t expect it to get away with acting egregiously misaligned.
Well, they aren’t behaving accordingly. Pessimists are super doomy, optimists expect “loving grace” around the corner, and neither side is at all discomfited by the vast gulf of confident disagreement in between.
A widely agreeable notion, surely, until elaborated on.