My suspicion is that, just as “learning” and “character formation” in humans can take years or decades, it is foolish to think that setting up the AGI with a proper reward function etc. is sufficient for “aligning” it. Designing a good reward does not “align” an AI any more than gene editing can guarantee that a baby will become the next Nelson Mandela. (If nothing else, it can always fail to achieve the reward function’s stated goal if it has insufficient resources or adaptability.) Basic drives can produce inclinations one way or another (I’d really not want an AGI whose reward function is tied to seeing humans get stabbed), but how an AI interfaces with the world and how it reinforces its own worldview is as much a product of its environment as of the reward function. What I have read of neuroscience is quite firm on the fact that most of our internal reward signals are learned within-lifetime (this was laid down back in the days of Pavlov’s dogs and conditioned reflexes), and are therefore heavily path-dependent on the stimuli provided by our environment. This is especially true if AI systems are not passive supervised-learning predictors but rather active agents constantly updating their beliefs.
As such, it may be useful to imagine the reward function less as a totalising superobjective that must be bulletproof to any kind of reward hacking and more as a starting point for an interactive “alignment curriculum”. It may be that we need to align AIs by talking to them and teaching them, much like how we raise kids today, before they become superintelligent. And we should hopefully not teach them that the fate of the universe is a zero-sum game being fought between all intelligent species, where the prize is the matter-energy of the lightcone.
it is foolish to think that setting up the AGI with a proper reward function etc. is sufficient for “aligning” it
I agree that, when RL agents wind up pursuing goals, those goals depend on both their reward function and their training environment, not just the reward function. I do think the reward function is extraordinarily important, and I further think that, if you start with a bad reward function, then there’s no possible training environment that will make up for how badly you screwed up with the reward function. And worse, I think that every reward function in the RL literature to date is a “bad reward function” for AGI in that sense, i.e. so bad that no training environment can realistically redeem it. (See “Behaviorist” RL reward functions lead to scheming.)
But yes, if you find a not-terrible reward function, you still also need to co-design it with an RL training environment.
…Unfortunately, the AGI will eventually wind up in the real world, which is also a “training environment” of sorts, but one that’s basically impossible for the AGI programmers to control or design. That’s another reason that the reward function is so important. (I’m expecting continuous learning, as opposed to train-then-deploy … we can talk separately about why I expect that.)
Designing a good reward does not “align” an AI any more than gene editing can guarantee that a baby will become the next Nelson Mandela. … It may be that we need to align AIs by talking to them and teaching them, much like how we raise kids today, before they become superintelligent.
(“Guarantee” is too high a bar. In an AGI context, I mostly expect doom, and would consider it huge progress to have a plan that will probably work but is not guaranteed.)
Identical twins do have remarkably similar personalities. That’s true if the twins are raised in the same family, and it’s approximately equally true if the twins are raised by different families in different towns (but in the same country). (If the identical twins are not even raised in the same country, … Well, if Nelson Mandela’s identical twin lived his entire life among the Yanomami, then OK yeah sure, he would probably have quite different goals and behaviors.)
Basically, I think you’re wrong that humans “raise kids” in the sense that you’re suggesting. See Heritability: Five Battles §2 and especially §2.5.1. More elaboration:
What I have read of neuroscience is quite firm on the fact that most of our internal reward signals are learned within-lifetime (this was laid down back in the days of Pavlov’s dogs and conditioned reflexes), and therefore heavily path-dependent on the stimuli provided by our environment. This is especially true if AI systems are not passive supervised learning predictors but rather active agents constantly updating their beliefs.
I’ve spent years reading all the neuroscience that I can find about reward signals, and I’m not sure what you’re referring to. Pavlov was well aware that e.g. pain is innately aversive to mice, and yummy food is innately appetitive, and so on. The behaviorists talked at length about “unconditioned stimuli”, “primary rewards”, “primary reinforcers”, “primary punishers”, etc.
In Heritability, Behaviorism, and Within-Lifetime RL, I talk about two mental pictures: “RL with continuous learning” versus “RL learn-then-get-stuck”. (Click the link for details, I’m gonna start using those terms without explaining them.)
I acknowledge that “RL learn-then-get-stuck” can happen in certain cases (see §6.2), but I strongly believe that these are the exception, not the rule.
Even behaviorism, if you actually take its ideas seriously, is generally much more compatible with the “RL with continuous learning” story than the “RL learn-then-get-stuck” story. It’s true that the mouse is acquiring Conditioned Stimuli (CSs), but generally those CSs are reflective of some environmental regularity (e.g. smell-of-cheese precedes eating-cheese). If the mouse grows up and that regularity goes away, then it will generally quickly unlearn the CS (“extinction”).
So the CS is learned, yes, but that doesn’t imply path-dependence. The mouse’s behavior mostly just depends on its reward function (USs), and on the environment where the mouse winds up living its life (which determines which CSs tend to precede any given US). The mouse will learn CSs that reflect those environmental regularities, and unlearn CSs that don’t, regardless of what happened in its childhood.
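To make the acquisition-then-extinction point concrete, here’s a toy Rescorla-Wagner-style simulation (my own illustration, not from the linked posts; the learning rate and trial counts are arbitrary choices):

```python
# Toy Rescorla-Wagner model: the conditioned association tracks whatever
# regularity currently holds in the environment, not the animal's history.

ALPHA = 0.2  # learning rate -- an arbitrary illustrative choice

def rw_update(v_cs: float, us_present: bool) -> float:
    """One trial: move the CS's predicted value toward what actually followed it
    (1.0 if the unconditioned stimulus occurred, 0.0 if it did not)."""
    target = 1.0 if us_present else 0.0
    return v_cs + ALPHA * (target - v_cs)

v = 0.0  # association between smell-of-cheese (CS) and cheese (US)

# Acquisition: for 30 trials, the smell reliably precedes cheese.
for _ in range(30):
    v = rw_update(v, us_present=True)
print(f"after acquisition: V(CS) = {v:.2f}")  # close to 1.0

# Extinction: the regularity goes away; the smell no longer predicts cheese.
for _ in range(30):
    v = rw_update(v, us_present=False)
print(f"after extinction:  V(CS) = {v:.2f}")  # back near 0.0, whatever happened earlier
```

The final value of the CS depends only on the regularity that held most recently, not on the earlier history of trials, which is the sense in which the learned CS is not path-dependent.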
And this lines up with the human heritability evidence (linked above) that neither parents nor any other characters in a child’s environment are able to permanently sculpt the kid’s personality and values to any measurable extent.
Hey, I want to respond to the rest of your comments, but that might take some time. Just a quick thought/clarification about one particular thing.
I’ve spent years reading all the neuroscience that I can find about reward signals, and I’m not sure what you’re referring to. Pavlov was well aware that e.g. pain is innately aversive to mice, and yummy food is innately appetitive, and so on. The behaviorists talked at length about “unconditioned stimuli”, “primary rewards”, “primary reinforcers”, “primary punishers”, etc.
I understand that the reward signals are hard-coded, but internal anticipation of reward signals is not. Pavlov’s breakthrough was that, with proper conditioning, you could connect unrelated stimuli to the possibility of “true” reward signals in the future. In other words, you can “teach” an RL system to perform credit assignment to just about anything: lights, sounds, patterns of behaviour...
In the alignment context this means that even if the agent gets reward when the condition “someone is stabbed” is fulfilled and the reward function triggers, it is possible for it to associate and anticipate reward coming from (and therefore prefer) entirely unconnected world states. If, for example, an overseer injures one person every time the agent solves a sudoku puzzle, the agent will probably have a reward-anticipation signal as it gets closer to solving a sudoku puzzle (plus some partial reward when it completes a row or a column or a cell satisfactorily, etc.). And the reward-anticipation signal isn’t even wrong! It’s just the way the environment is set up. We do this to ourselves all the time, of course, when we effectively dose ourselves with partial reward signals for fulfilling steps of a long-term plan which only distantly ends with something that we would “innately” desire.
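To make that concrete, here’s a toy sketch of my own (the state names, learning rate, and discount are all made up for illustration, not taken from any actual system): a tabular TD(0) learner whose hard-coded reward only fires at “puzzle solved” still ends up assigning positive anticipated value to every earlier sub-step.

```python
# Toy tabular TD(0) on a chain of hypothetical sudoku sub-steps. The hard-coded
# reward only fires at the terminal "solved" state, yet the learned value
# function comes to anticipate reward at every earlier sub-step, i.e. credit
# gets assigned to world states the reward function never mentions.

STATES = ["start", "row_done", "column_done", "cell_done", "solved"]  # made-up names
ALPHA, GAMMA = 0.1, 0.9  # learning rate and discount -- illustrative values
V = {s: 0.0 for s in STATES}

def primary_reward(next_state: str) -> float:
    # The only "true" reward signal: it triggers when the puzzle gets solved.
    return 1.0 if next_state == "solved" else 0.0

# Repeatedly walk the subgoal chain and apply TD(0) updates.
for _ in range(2000):
    for i in range(len(STATES) - 1):
        s, s_next = STATES[i], STATES[i + 1]
        r = primary_reward(s_next)
        V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])

for s in STATES[:-1]:
    print(f"V({s}) = {V[s]:.2f}")
# Values increase monotonically toward "solved": the agent anticipates reward
# more and more strongly as it gets closer to finishing the puzzle.
```

The only thing the reward function “knows about” is the solved state, but the learned values over the sub-steps are what actually drive the agent’s moment-to-moment preferences.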
That’s just to clarify a bit more of what I mean when I say that the reward signal is only the first step. You then have to connect it with the correct learned inferences about what produces reward, something that only happens in the “deployment environment”, i.e. the real world. As an example of what happens when the goal state and the acquired inferences are unaligned: religious fanatics often believe that they are saving the world and vastly improving the lives of everyone in it.
(There’s also the small problem that even the action “take action to gather more information and improve your causal inferences about the world” is itself motivated by learned anticipation of future reward… It is very hard to explain something inconvenient to someone when their job depends on them not understanding it.)