Steven Byrnes comments on 6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa

Steven Byrnes 26 Dec 2025 19:03 UTC
4 points
If your fixation remains solely on architecture, and you don’t consider the fact that morality-shaped-stuff keeps evolving in mammals because the environment selects for it in some way…
It’s true that human moral drives (such as they are) came from evolution in a certain environment. Some people notice that and come up with a plan: “hey, let’s set up AI in a carefully-crafted evolutionary environment such that it will likewise wind up moral”. I have discussed that plan in my Intro series §8.3, where I argued both that it is a bad plan and that it is unlikely to happen even if it were a good plan. For example, AIs may evolve to be cruel to humans just as humans are cruel to factory-farmed animals. Humans are often cruel to other humans too.
But your argument is slightly different (IIUC): you’re saying that we need not bother to carefully craft the evolutionary environment, because, good news, the real-world environment is already of the type that mammal-like species will evolve to be kind. I’m even more skeptical of that. Mammals eat each other all the time, and kill their conspecifics, etc. And why are we restricting to mammals here anyway? More importantly, I think there are very important disanalogies between a world of future AGIs and a world of mammals, particularly that AGIs can “reproduce” by instantly creating identical (adult) copies. No comment on whether this and other disanalogies should make us feel optimistic vs pessimistic about AGI kindness compared to mammal kindness. But it should definitely make us feel like it’s a different problem. I.e., we have to think about the AGI world directly, with all its unprecedented weird features, instead of unthinkingly guessing that its evolutionary trajectory will be similar to humans’ (let alone hamsters’).
[if you don’t consider environmental / evolutionary pressures, then] you are just setting yourself up for future problems when the superintelligent AI develops or cheats its way to whatever form of compartmentalization or metacognition lets it do the allegedly pure rational thing of murdering all other forms of intelligence
I’m unclear on your position here. There’s a possible take that says that sufficiently smart and reflective agents will become ruthless power-seeking consequentialists that murder all other forms of intelligence. Your comment seems to be mocking this take as absurd (by using the words “allegedly pure rational”), but your comment also seems to be endorsing this take as correct (by saying that it’s a real failure mode that I will face by not considering evolutionary pressures). Which is it?
For my part, I disagree with this take. I think it’s possible (at least in principle) to make an arbitrarily smart and reflective ASI agent that wants humans and life to flourish.
But IF this take is correct, it would seem to imply that we’re screwed no matter what. Right? We’d be screwed if a human tries to design an AGI, AND we’d be screwed if an evolutionary environment “designs” an AGI. So I’m even more confused about where you’re coming from.
You bifurcate human neurology into “neurotypical” and “sociopath” to demonstrate your dichotomy of RL based decision making vs social reward function decision making, and then stop. That’s wrong. There is also an entire category of neurotype called “autistic”…
(Much of my response to this part of your comment amounts to “I don’t actually think what you think I think”.)
First, I dislike your description “RL based decision making vs social reward function decision making”. “Reward function” is an RL term. Both are RL-based. All human motivations are RL-based, IMO. (But note that I use a broad definition of “RL”.)
Second, I guess you interpreted me as having a vibe of “Yay Approval Reward!”. I emphatically reject that vibe, and in my Approval Reward post I went to some length to emphasize that Approval Reward leads to both good things and bad things, with the latter including blame-avoidance, jockeying for credit, sycophancy, status competitions, “Simulacrum Level 3”, and more.
Third, I guess you also assumed that I was saying that Approval Reward would be a great idea for AGIs. I didn’t say that in the post, and it’s not a belief I currently hold. (But it might be true, in conjunction with a lot of careful design and thought; see other comment.)
Next: I’m a big fan of understanding the full range of human neurotypes, and if you look up my neuroscience writing you’ll find my detailed opinions about schizophrenia, depression, mania, BPD, NPD, ASPD, DID, and more. As for autism, I’ve written loads about autism (e.g. here, here and links therein), and read tons about it, and have talked to my many autistic friends about their experiences, and have a kid with an autism diagnosis. That doesn’t mean my takes are right, of course! But I hope that, if I’m wrong, I’m wrong for more interesting reasons than “forgetting that autism exists”. :)
I guess your model is that autistic people, like sociopathic people, lack all innate social drives? And therefore a social-drive-free RL agent AGI, e.g. one whose reward signals are tied purely to a bank account balance going up, would behave generally like an autistic person, instead of (or in addition to?) like a sociopath? If so, I very strongly disagree.
I think “autism” is an umbrella term for lots of rather different things, but I do think it’s much more likely to involve social drives set to an unusually intense level rather than “turned off”. Indeed, I think they get so intense that they often feel overwhelming and aversive.
For example, many autistic people strongly dislike making eye contact. If someone had no innate social reactions to other people, then they wouldn’t care one way or the other about eye contact; looking at someone’s eyes would be no more aversive or significant than looking at a plant. So the “no social drives” theory is a bad match to this observation, whereas the “unusually intense social drives” theory does match eye contact aversion.
Likewise, the “autism = no social drives” theory would predict that an autistic person would be perfectly fine if his frail elderly parents, who are no longer able to directly help or support him, died a gruesome and painful death right now. The “unusually intense social drives” theory would predict that he would not be perfectly fine with that. I think the latter tends to be a better fit!
Anyway, I think if you met a hypothetical person whose innate human social drive strengths were set to zero, they would look wildly different from any autistic person, but only modestly different from a sociopathic (ASPD) person.
- Alephwyr 26 Dec 2025 19:43 UTC
  4 points
  Thank you for the response. This is one of maybe two or three things I’ve read from you, so the exculpatory context was not part of the context in which I made my post, even though it was trivially available, and even though its presence was equally reasonable to infer from the absence of specific information that would have addressed my concerns.
  It would take much longer to go point by point through your response than to go back and do a mixture of amending and clarifying my own post, so I will mostly do the latter. Please don’t interpret this as a motte and bailey. I will be doing some updating as I respond, and that will imply that your criticisms in this post were correct. But due to a mixture of limited mental energy and rhetorical incompetence, which tends to cause conversations of increasing complexity to spiral away from any usefulness when I am involved in them, my priority is to offer a simple response.
  I think humans in particular evolved moral faculties from the environment. These are not perfect, but I think they are tied closely enough to foundational incentives, either survival and reproduction directly, or the instincts that survival and reproduction most firmly selected for, that the possibilities are bifurcated pretty cleanly between continued moral improvement and extinction, with continued moral improvement being more likely. I think similar pressures have shaped every other species, to different degrees and with slightly different results, and that there is something like an instrumental convergence onto moralism that increases as intelligence and social complexity increase. That said, I don’t think absolutely every behavior is now, or in the future will be, subsumed under moral drives, or that the way this evolved faculty directs behavior will by itself always impossibilize conflict between moralistic intelligences, or anything like that.
  I was hedging, you are right. But that wasn’t meant to imply confused commitment, that was meant to imply a lack of precommitment, that either we are in your universe where the above is not true or mine where it is, and that your preferred decision making process was insufficient for either.
  I don’t think that was my model of autistic people, but that probably was the implication of my words, so for whatever reason I said something that was both entirely wrong and not even reflective of my beliefs. Intelligent autistic people regularly find intensely prosocial ways of behaving that minimize contact with direct social feedback, and this rhymes in some weird phenomenological way, from an outside and maybe even inside perspective, with not having a social drive, while still being much more likely to reflect a social drive. I don’t have the appropriate rationalist vocabulary to pseudo-formalize this in English. Please accept this vague gesture as being in good faith, and my deepest apologies for somehow mechanically saying something that was both entirely wrong and not reflective of anything I believe.
  But yes, instant cloning seems to destroy selection pressure’s possible effect on morality. The felt experience of moral obligation across generations in humans seems to correspond to a faculty for the sublime, and also to notions of acausal trade, which then spiral out into different, often abstractly incompatible feelings and thoughts. For instance, amor fati and free will are both tightly associated with this sublime feeling, and so are tribalism and universalism. The core feeling embeds itself in different strategies. I don’t know that saying this speaks to anything in particular; it was just a thought I started having when I got to this paragraph.
  I will stop now, this is getting less focused. Sorry. Thanks.