What’s your take on why Approval Reward was selected for in the first place vs. sociopathy?
I find myself wondering if non-behavioral reward functions are more powerful in general than behavioral ones, due to less tendency towards wireheading, etc. (consider the laziness & impulsivity of sociopaths). Especially ones such as Approval Reward, which can be “customized” depending on the details of the environment and what sort of agent it would be most useful to become.
What’s your take on why Approval Reward was selected for in the first place vs. sociopathy?
Good question!
There are lots of things that an ideal utility maximizer would do via means-end reasoning, that humans and animals do instead because they seem valuable as ends in themselves, thanks to the innate reward function. E.g. curiosity, as discussed in A mind needn’t be curious to reap the benefits of curiosity. And also play, injury-avoidance, etc. Approval Reward has the same property—whatever selfish end an ideal utility maximizer could achieve via Approval Reward, it could achieve just as well, if not better, by acting as if it had Approval Reward in situations where that’s in its selfish best interests, and not where it isn’t.
In all these cases, we can ask: why do humans in fact find these things intrinsically motivating? I presume that the answer is something like humans are not automatically strategic, which is even more true when they’re young and still learning. “Humans are the least intelligent species capable of building a technological civilization.” For example, people with analgesic conditions (like leprosy or CIP, congenital insensitivity to pain) are often shockingly cavalier about bodily harm, even when they consciously know that it will come back to bite them in the long term. Consequentialist planning is often not strong enough to outweigh what seems appealing in the moment.
To rephrase more abstractly: for ideal rational agents, intelligent means-end planning towards X (say, gaining allies for a raid) is always the best way to accomplish that same X. If some instrumental strategy S (say, trying to fit in) is usually helpful towards X, means-end planning can deploy S when S is in fact useful, and not deploy S when it isn’t. But humans, who are not ideal rational agents, are often more likely to get X by wanting X and also intrinsically wanting S as an end in itself. The costs of this strategy (i.e., still wanting S even in cases where it’s not useful towards X) are outweighed by the benefit (avoiding the failure mode of not pursuing S because you didn’t think of it or couldn’t be bothered).
This doesn’t apply to all humans all the time, and I definitely don’t think it will apply to AGIs.
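To make that trade-off concrete, here’s a minimal toy simulation (a sketch only; every probability and payoff below is invented purely for illustration, not a model of actual evolution). A myopic reward-learner that only feels the external payoff tends to abandon S, because S’s cost lands immediately while its benefit is rare and noisy; the same learner with an innate bonus for S keeps doing S, including in the minority of situations where S is useless, and typically ends up with more X overall:

```python
import random

# Toy model of the trade-off above. Invented stand-ins: X is the payoff
# (e.g. allies for a raid); S is an instrumental strategy (e.g. trying
# to fit in) that usually raises the odds of X but is pure cost otherwise.
P_S_USEFUL = 0.8      # fraction of situations where S actually helps
P_X_GIVEN_S = 0.15    # chance of X when S is deployed and helps
P_X_OTHERWISE = 0.02  # baseline chance of X
VALUE_OF_X = 10.0
COST_OF_S = 1.0       # immediate, reliably-felt effort cost of doing S

def run_agent(innate_bonus_for_s, episodes=300, epsilon=0.1, lr=0.2, seed=0):
    """A myopic epsilon-greedy learner, standing in for an agent that is
    not automatically strategic: it learns only from felt reward, with
    no means-end model of why S sometimes leads to X."""
    rng = random.Random(seed)
    q = {True: 0.0, False: 0.0}  # learned value of doing S vs. not
    times_x_achieved = 0
    for _ in range(episodes):
        s_useful = rng.random() < P_S_USEFUL
        if rng.random() < epsilon:
            do_s = rng.choice([True, False])
        else:
            do_s = q[True] >= q[False]
        got_x = rng.random() < (P_X_GIVEN_S if (do_s and s_useful) else P_X_OTHERWISE)
        times_x_achieved += got_x
        # Felt reward: a rare big payoff, the immediate cost of S, and an
        # innate bonus that makes S feel valuable as an end in itself
        # (zero for the pure means-end learner). The shaped learner also
        # pays COST_OF_S in the situations where S is useless; that is
        # the cost of the strategy mentioned above.
        reward = VALUE_OF_X * got_x - COST_OF_S * do_s + innate_bonus_for_s * do_s
        q[do_s] += lr * (reward - q[do_s])
    return times_x_achieved

for bonus in (0.0, 1.5):
    avg = sum(run_agent(bonus, seed=s) for s in range(20)) / 20
    print(f"innate bonus for S = {bonus}: X achieved in ~{avg:.0f}/300 episodes")
```

(The exact counts vary run to run; the point is just the qualitative gap.)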
…For completeness, I should note that there’s an evo-psych theory that there has been frequency-dependent selection for sociopaths—i.e., if there are too many sociopaths in the population, then everyone else improves their wariness and ability to detect sociopaths and kill or exile them, but when sociopathy is rare, it’s adaptive (or at least, was adaptive in Pleistocene Africa). I haven’t seen any good evidence for this theory, and I’m mildly skeptical that it’s true: wary or not, people will learn the character traits of people they’ve lived and worked with for years. It smells like a just-so story, or at least that’s my gut reaction. More importantly, the current population frequency of sociopathy is in the same general ballpark as schizophrenia, profound autism, etc., which seem (to me) very unlikely to have been adaptive in hunter-gatherers. My preferred theory is that there’s frequency-dependent selection across many aspects of personality, and then sometimes a kid winds up with a purely-maladaptive profile because they’re at the tail of some distribution. [Thanks science banana for changing my mind on this.]
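(For readers unfamiliar with the term, here’s a minimal replicator-dynamics sketch of what “frequency-dependent selection” would mean mechanically. The payoff numbers are invented and carry no evidence for or against the theory; the point is just that a trait that is advantageous when rare and penalized when common settles at an intermediate equilibrium frequency.)

```python
# Payoff numbers invented purely to unpack the hypothesis mechanically.
# "Frequency-dependent" means the trait's fitness falls as it gets common.
def sociopath_fitness(p):
    return 1.10 - 0.80 * p  # advantage when rare, penalty as wariness rises

def everyone_else_fitness(p):
    return 1.00             # baseline

p = 0.001                   # starting frequency of the trait
for generation in range(500):
    w_s = sociopath_fitness(p)
    w_bar = p * w_s + (1 - p) * everyone_else_fitness(p)
    p = p * w_s / w_bar     # standard discrete-time replicator update
print(f"equilibrium frequency: {p:.3f}")  # settles near 0.125, where fitnesses equalize
```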
I find myself wondering if non-behavioral reward functions are more powerful in general than behavioral ones, due to less tendency towards wireheading, etc. (consider the laziness & impulsivity of sociopaths)
I think the “laziness & impulsivity of sociopaths” can be explained away as a consequence of the specific way that sociopathy happens in human brains, via chronically low physiological arousal (which also leads to boredom and thrill-seeking). I don’t think we can draw larger lessons from that.
I also don’t see much connection between “power” and behaviorist reward functions. For example, eating yummy food is (more-or-less) a behaviorist component of the overall human reward function. And its consequences are extraordinary. Consider going to a restaurant, and enjoying it, and thus going back again a month later. It sounds unimpressive, but really it’s remarkable. After a single exposure (compare that to the data inefficiency of modern RL agents!), the person is making an extraordinarily complicated (by modern AI standards) plan to get that same rewarding experience, and the plan will almost definitely work on the first try. The plan is hierarchical, involving learned motor control (walking to the bus), world-knowledge (it’s a holiday so the buses run on the weekend schedule), dynamic adjustments on the fly (there’s construction, so you take a different walking route to the bus stop), and so on, which together is way beyond anything AI can do today.
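As a rough sketch of that data-efficiency gap (with invented stand-ins: a 10-step corridor for the route, tabular Q-learning for model-free RL), compare how many exposures each approach needs before it can reliably get back to the reward:

```python
import random

# Purely illustrative: a 10-step corridor with a reward only at the far
# end, mimicking a single good meal at the end of a route.
N = 10
GAMMA = 0.9

def model_free_episodes(epsilon=0.3, lr=0.5, max_episodes=5000, seed=0):
    """Tabular Q-learning: the end-of-corridor reward must propagate
    back step by step over many repeat visits before the agent can
    reliably walk from the start to the reward."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N + 1)]  # per state: [left, right]
    for episode in range(1, max_episodes + 1):
        s = 0
        for _ in range(4 * N):  # step budget per episode
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = int(q[s][1] >= q[s][0])
            s2 = min(max(s + (1 if a else -1), 0), N)
            r = 1.0 if s2 == N else 0.0
            q[s][a] += lr * (r + GAMMA * max(q[s2]) - q[s][a])
            s = s2
            if s == N:
                break
        # Done when the greedy policy walks straight to the reward.
        if all(q[i][1] > q[i][0] for i in range(N)):
            return episode
    return max_episodes

print("episodes needed, model-free RL:", model_free_episodes())
# A model-based agent that already has a map (knows the corridor) just
# records where the reward was on the first visit and replans: 1 episode.
print("episodes needed, remember-and-replan:", 1)
```

The gap only widens as the “corridor” gets longer and branchier: a planner with a world model pays the exploration cost once, while model-free credit assignment has to re-walk the route over and over.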
I do think there’s a connection between “power” and consequentialist desires. E.g. the non-consequentialist “pride in my virtues” does not immediately lead to anything as impressive as the above consequentialist desire to go to that restaurant. But I don’t see much connection between behaviorist rewards and consequentialist desires—if we draw a 2×2 grid (behaviorist vs. non-behaviorist reward, crossed with consequentialist vs. non-consequentialist desire), then I can think of examples in all four quadrants.
There are lots of things that an ideal utility maximizer would do via means-end reasoning, that humans and animals do instead because[...]
Right. What you said in your comment seems pretty general—any thoughts on what in particular leads to Approval Reward being a good thing for the brain to optimize? Spitballing: maybe it’s because human life is a long iterated game, so reputation ends up being the dominant factor in most situations, and this might not be easily learned by a behaviorist reward function?
You mean, if I’m a guy in Pleistocene Africa, then why is it instrumentally useful for other people to have positive feelings about me? Yeah, basically what you said: I’m regularly interacting with these people, and if they have positive feelings about me, they’ll generally want me to be around, and to stick around, and they’ll also tend to buy into my decisions and plans, etc.
Also, Approval Reward leads to norm-following, which is also probably adaptive for me, because many of those social norms probably exist for good and non-obvious reasons, cf. Henrich.
this might not be easily learned by a behaviorist reward function
I’m not sure what the word “behaviorist” is doing there; I would just say: “This won’t happen quickly, and indeed might not happen at all, unless it’s directly in the reward function. If it’s present only indirectly (via means-end planning, or RL back-chaining, etc.), that’s not as effective.”
I think “the reward function is incentivizing (blah) directly versus indirectly” is (again) an orthogonal axis from “the reward function is behaviorist vs non-behaviorist”.
An underrated answer is that humans are very, very dependent on other people to survive. We have easily the longest vulnerable childhood of any mammal, and even once we become adults, we are still really, really bad at surviving on our own compared to other animals. And since we are K-selected, every dead child matters a lot in evolution, so it’s very, very difficult for sociopathy to be selected for.