In section 6.1, there’s a scenario you constructed called the Omega-hates-aliens scenario, and in the quote below, you argue that the difference between the AGI and the human comes down to their different reward-function designs:
…So now we have two parallel scenarios: me with Omega, and the AGI in a lab. In both these scenarios, we are offered more and more antisocial options, free of any personal consequences. But the AGI will have its desires sculpted by RL towards the antisocial options, while my desires are evidently not.
What exactly is the disanalogy?
The start of the answer is: I said above that the antisocial options were “free of any personal consequences”. But that’s a lie! When I press the hurt-the-aliens button, it is not free of personal consequences! I know that the aliens are suffering, and when I think about it, my RL reward function (the part related to compassion) triggers negative ground-truth reward. Yes the aliens are outside my light cone, but when I think about their situation, I feel a displeasure that’s every bit as real and immediate as stubbing my toe. By contrast, “free of any personal consequences” is a correct description for the AGI. There is no negative reward for the AGI unless it gets caught. Its reward function is “behaviorist”, and cannot see outside the light cone.
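To make sure I understand the “behaviorist” point in that quote, here’s a toy sketch (my framing, not yours; the function names and numbers are made up for illustration). A behaviorist reward function only fires on consequences the agent itself experiences, e.g. getting caught; a human-like one has an additional internal compassion channel that fires on the mere belief that others are suffering, even outside the light cone:

```python
def behaviorist_reward(got_caught: bool) -> float:
    # Negative reward only if the antisocial act is detected and punished.
    # Victims' suffering per se is invisible to this reward function.
    return -10.0 if got_caught else 0.0

def human_like_reward(got_caught: bool, believes_others_suffer: bool) -> float:
    # Same punishment channel, plus an internal compassion channel that
    # triggers on the thought of others' suffering, even when the victims
    # are causally disconnected from the agent.
    reward = -10.0 if got_caught else 0.0
    if believes_others_suffer:
        reward += -5.0  # displeasure "as real as stubbing my toe"
    return reward

# Pressing the hurt-the-aliens button without being caught:
print(behaviorist_reward(got_caught=False))                              # 0.0
print(human_like_reward(got_caught=False, believes_others_suffer=True))  # -5.0
```

On this framing, the disanalogy is just that RL sculpts desires toward whatever the reward function can see, and the behaviorist function literally cannot see the aliens.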
So while I somewhat agree with the story, I’d worry that at least 1–5% of people, if not more, would just keep pressing yes, becoming more and more antisocial in the scenario you present.
I’d also say that another big part of the answer to why people don’t commit escalating antisocial acts is that they depend on other humans for sustenance. Greater and greater antisocial acts don’t get you unlimited, permanent power; they just let you collapse society along with yourself. In particular, for the last 200 years the positive rewards from keeping others alive to trade with have (usually) outweighed the positive rewards for tricking the system. When the rewards for being antisocial outweigh the rewards for being pro-social in a behaviorist sense, that’s when you get things like the colonization of Africa in the 19th century, the United Fruit Company’s dealings with Latin America, or the many rentier states that exist today and have existed in the past, all of which reliably caused mass suffering.
The other reason antisocial behavior isn’t more common is that we actually manage to catch perpetrators relatively reliably and, metaphorically, give them negative reward. Since a single human usually can’t take over the world or subvert the nation-scale system that delivers those negative rewards, we mostly don’t have to fear antisocial behavior becoming widespread (except when antisocial people coordinate well enough, which is the start of wars, civil wars, revolutions, and regime changes when they have a real chance of success; but see my first point on why antisocial behavior can’t be taken too far). This is also why all the examples I picked where antisocial incentives can thrive were perpetrated by states, or by corporations with the backing of state institutions: there’s no police for states, and states can’t get caught by higher powers.
More generally, I’m more pessimistic than you about what people would do if they knew they wouldn’t be caught and knew that harming others and being antisocial would be rewarded with unlimited power forever (as is true for the AGI in many of the worlds where AGI kills off humanity). So I think the social-instinct story is at best a limited guide to why humans are pro-social, and I’d weight incentives and getting caught by the police as a greater factor in human behavior than you do.
This doesn’t affect AI alignment or your post all that much, because for the plan of brain-like AGIs with social instincts (or controlled brain-like AGIs) to work, all we need is the existence of humans with sufficiently pro-social reward functions that they wouldn’t commit antisocial acts even if they knew they wouldn’t be caught and would gain almost unlimited power.
I think I actually agree with all of that. Like you said in the last paragraph, I usually bring up kindness-among-humans in the context of “existence proof that this is possible”, not making a prevalence claim.
Semi-related: I once asked a chatbot for historical counterexamples to “absolute power corrupts absolutely” and it suggested Marcus Aurelius, Ashoka the Great, Cincinnatus, Atatürk, LKY, and Nelson Mandela. (IIRC, Gemini said that Mandela never had absolute power but probably could have maneuvered into it if he had tried.) I really don’t know much about any of these people. Maybe I’ll read some biographies at some point, but that would be very low on my priority list.