beren comments on Perils of under- vs over-sculpting AGI desires

beren 9 Aug 2025 21:04 UTC
LW: 12 AF: 8
0
AF
This is a really good post. Some minor musings:
If a human wound up in that situation, they would just think about it more, repeatedly querying their ‘ground truth’ social instincts, and come up with some way that they feel about that new possibility. Whereas AGI would … I dunno, it depends on the exact code. Maybe it would form a preference quasi-randomly? Maybe it would wind up disliking everything, and wind up sitting around doing nothing until it gets outcompeted? (More on conservatism here.)
Perhaps a difference in opinion is that it’s really unclear to me that an AGI wouldn’t do much the same thing of “thinking about it more, repeatedly querying their ‘ground truth’ social instincts” that humans do. Arguably models like Claude Opus already do this where it clearly can do detailed reasoning about somewhat out-of-distribution scenarios using moral intuitions that come from somewhere. Where this somewhere is going to be some inscrutable combination of similar scenarios in pretraining data, generalization from humans talking about morality, and intuitions derived from the RLAIF phase which embeds Claude’s constitution etc. Of course we can argue that Claude’s ‘social instincts’ derived in this way are defective somehow compared to humans but it is unclear (to me) that this path cannot make AGIs with decent social instincts.
Perhaps a crux of differences in opinion between us is that I think that much more ‘alignment relevant’ morality is not created entirely by innate human social instincts but is instead learnt by our predictive world models based on external data—i.e. ‘culture’. Now culture itself obviously is downstream of a lot of our social instincts but it is also based on other factors like game-theoretic equilibria which promote cooperation even among selfish agents and, very pertinently, using logical ‘system 2’ reasoning to try to generalize and extend our inchoate social instincts and then learn to backprop this new understanding into our learnt value functions. Utilitarianism, and this super generalized EA-style compassion it brings is a great example of this. No primitive tribesman or indeed very few humans before the 18th century had ever thought of or had moral intuitions aligned with these ideas. They are profoundly unnatural to our innate ‘human social instincts’. (Some) people today feel these ideas viscerally because they have been exposed to them enough that they have propagated them from the world model back into the value function through in-lifetime learning.
We don’t have to conjure up thought experiments about aliens outside of our light cone. Throughout most of history humans have been completely uncompassionate about suffering existing literally right in front of their faces. From the beginning of time to the 18th century almost nobody had any issues with slavery despite often living with slaves or seeing slave suffering on a daily basis. Today, only a few people have moral issues with eating meat despite the enormous mountain of suffering it causes to living animals right here on our own planet while eating meat only brings reasonable (and diminishing), but not humongously massive, benefits to our quality of life.
My thinking is that this ‘far-mode’ and ‘literate/language/system2-derived’ morality is actually better for alignment and human flourishing in general than the standard set of human social instincts—i.e. I would prefer a being with the morality of Claude Opus to rule the world rather than a randomly selected human. Alignment is a high bar and ultimately we need to create minds far more ‘saintly’ than any living human could ever be.
What we then need to do is figure out how to distill this set of mostly good, highly verbal moral intuitions from culture into a value function that the model ‘feels viscerally’. Of course reverse-engineering some human social instincts are probably important here—i.e. our compassion instinct is good if generalized, and even more generally understanding how the combination of innate reward signals in the hypothalamus plus the representations in our world model gets people to feel viscerally about the fates of aliens we can never possibly interact with, is very important to understand.
Nevertheless, truly out-of-distribution things also exist, just as the world of today is truly out-of-distribution from the perspective of an ancient Egyptian.
As a side-note, it’s really unclear how good humans are at generalizing at true out-of-distribution moralities. Today’s morality likely looks pretty bad from the ancient Egyptian perspective. We are really bad at worshipping Ra and reconciling with our Ba’s. It might be the case that, upon sufficient reflection, the Egyptians would come to realize that we are right all along, but of course we would say that in any case. I don’t know how to solve this or whether there is in fact any general case solution to any degree of ‘out-of-distribution-ness’ except just like pure conservatism where you freeze both the values and the representations they are based on.
- Steven Byrnes 30 Sep 2025 14:36 UTC
  LW: 5 AF: 3
  0
  AF Parent
  Belated thanks!
  I would prefer a being with the morality of Claude Opus to rule the world rather than a randomly selected human … it’s really unclear how good humans are at generalizing at true out-of-distribution moralities. Today’s morality likely looks pretty bad from the ancient Egyptian perspective…
  Hmm, I think maybe there’s something I was missing related to what you’re saying here, and that maybe I’ve been thinking about §8.2.1 kinda wrong. I’ve been mulling it over for a few days already, and might write some follow-up. Thanks.
  Perhaps a difference in opinion is that it’s really unclear to me that an AGI wouldn’t do much the same thing of “thinking about it more, repeatedly querying their ‘ground truth’ social instincts” that humans do. Arguably models like Claude Opus already do this where it clearly can do detailed reasoning about somewhat out-of-distribution scenarios using moral intuitions that come from somewhere…
  I think LLMs as we know them today and use them today are basically fine, and that this fine-ness comes first and foremost from imitation-learning on human data (see my Foom & Doom post §2.3). I think some of my causes for concern are that, by the time we get to ASI…
  (1) Most importantly, I personally expect a paradigm shift after which true imitation-learning on human data won’t be involved at all, just as it isn’t in humans (Foom & Doom §2.3.2) … but I’ll put that aside for this comment;
  (2) even if imitation-learning (a.k.a. pretraining) remains part of the process, I expect RL to be a bigger and bigger influence over time, which will make human-imitation relatively less of an influence on the ultimate behavior (Foom & Doop §2.3.5);
  (3) I kinda expect the eventual AIs to be kinda more, umm, aggressive and incorrigible and determined and rule-bending in general, since that’s the only way to make AIs that get things done autonomously in a hostile world where adversaries are trying to jailbreak or otherwise manipulate them, and since that’s the end-point of competition.
  Perhaps a crux of differences in opinion between us is that I think that much more ‘alignment relevant’ morality is not created entirely by innate human social instincts but is instead learnt by our predictive world models based on external data—i.e. ‘culture’.…
  (You might already agree with all this:)
  Bit of a nitpick, but I agree that absorbing culture is a “predictive world model” thing in LLMs, but I don’t think that’s true in humans, at least in a certain technical sense. I think we humans absorb culture because our innate drives make us want to absorb culture, i.e. it happens ultimately via RL. Or at least, we want to absorb some culture in some circumstances, e.g. we particularly absorb the habits and preferences of people we regard as high-status. I have written about this at “Heritability: Five Battles” §2.5.1, and “Valence & Liking / Admiring” §4.5.
  See here for some of my thoughts on cultural evolution in general.
  I agree that “game-theoretic equilibria” are relevant to why human cultures are how they are right now, and they might also be helpful in a post-AGI future if (at least some of) the AGIs intrinsically care about humans, but wouldn’t lead to AGIs caring about humans if they don’t already.
  I think “profoundly unnatural” is somewhat overstating the disconnect between “EA-style compassion” and “human social instincts”. I would say something more like: we have a bunch of moral intuitions (derived from social instincts) that push us in a bunch of directions. Every human movement / ideology / meme draws from one or more forces that we find innately intuitively motivating: compassion, justice, spite, righteous indignation, power-over-others, satisfaction-of-curiosity, etc.
  So EA is drawing from a real innate force of human nature (compassion, mostly). Likewise, xenophobia is drawing from a real innate force of human nature, and so on. Where we wind up at the end of the day is a complicated question, and perhaps underdetermined. (And it also depends on an individual’s personality.) But it’s not a coincidence that there is no EA-style group advocating for things that have no connection to our moral intuitions / human nature whatsoever, like whether the number of leaves on a tree is even vs odd.
  We don’t have to conjure up thought experiments about aliens outside of our light cone. Throughout most of history humans have been completely uncompassionate about suffering existing literally right in front of their faces…
  Just to clarify, the context of that thought experiment in the OP was basically: “It’s fascinating that human compassion exists at all, because human compassion has surprising and puzzling properties from an RL algorithms perspective.”
  Obviously I agree that callous indifference also exists among humans. But from an RL algorithms perspective, there is nothing interesting or puzzling about callous indifference. Callous indifference is the default. For example, I have callous indifference about whether trees have even vs odd numbers of leaves, and a zillion other things like that.