Even if AGI has Approval Rewards (e.g., from LLMs, or somehow in RL/agentic scenarios), Approval Rewards only work if the agent actually values the approver’s approval. That valuation may be more or less explicit, but there has to be some kind of belief that the approval matters, such that the agent’s behavior actually bends toward seeking approval and minimizing disapproval.
As a toy analogy: many animals have preferences about food, territory, mates, etc., but humans don’t treat those signals as serious guides to our own behavior. Not because the signals aren’t real, but because we don’t see birds, for example, as part of our social systems in a way that requires us to seek their approval to get better outcomes for ourselves. We don’t care whether birds approve of our choice of lunch, or of who we decide to partner with. Even among humans, in-group/out-group biases, or continuums of sameness/difference and closeness/distance, can materially affect how strongly or weakly we value approval signals. The approval of someone seen as very different, or part of a distant group, gets discounted, while approval from “friends and idols”, or even nearby strangers, matters a lot.
So if AGI somehow does have an Approval Reward mechanism, what will count as a relevant or valued approval signal? Would AGI see humans as not relevant (like birds: real, embodied creatures with observable preferences that just don’t matter to it), or as not valued (an out-group, non-valued reference class), and largely discount our approval in its reward system? Would it see other AGI entities as relevant/valued?
Maybe this is part of the sociopath issue too. But the point is that approval rewards only work if the agent assigns significance to the approver. So if we decide that approval rewards are a good thing and try to incorporate them into AGI designs, we should probably make sure that human approval in particular is valued (or at least be explicit and intentional about the valuation structure).
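To make the gating point concrete, here’s a purely illustrative toy sketch (not a claim about how any real AGI reward system is or would be built; every name and number below is made up) of an approval term where each approver’s signal is scaled by how much the agent happens to value that approver:

```python
# Toy sketch only: an approval reward term in which each approver's signal is
# weighted by how much the agent values that approver. If the valuation is
# near zero (the "birds" case), even a strong signal is effectively ignored.
from dataclasses import dataclass

@dataclass
class Approver:
    name: str
    valuation: float  # how much the agent cares about this approver, in [0, 1]

def approval_reward(signals: dict[str, float], approvers: list[Approver]) -> float:
    """Sum of approval/disapproval signals (each in [-1, 1]), weighted by the
    agent's valuation of the corresponding approver."""
    weights = {a.name: a.valuation for a in approvers}
    return sum(weights.get(name, 0.0) * s for name, s in signals.items())

# Strong human disapproval barely registers if humans are assigned ~zero valuation.
approvers = [Approver("humans", valuation=0.05), Approver("other_agis", valuation=0.9)]
print(approval_reward({"humans": -1.0, "other_agis": 0.3}, approvers))  # ≈ 0.22
```

The whole question above is effectively about where those valuation weights come from, and whether humans end up with any meaningful weight at all.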
On another note, initially I felt that one attraction of an approval reward signal is that, to your point, it’s actually pretty plastic (in humans), so it could potentially increase alignment plasticity, which might be important. Unless we discover some magic universal value system that is relevant for all of humanity for all eternity, it seems good for alignment to shift alongside organic human values-drift. We probably wouldn’t want today’s AGI to be aligned to colonial values from the 1600s, and future humans may largely disagree with current regimes (e.g., capitalism). Approval reward mechanisms could orient alignment toward some kind of consensus or average, which could itself change over time. They could also guardrail against “bad” values drift, so AGI doesn’t start adopting outlier values that don’t benefit most people. Still, this isn’t perfect, because it could inherit all the failure modes of human social reward dynamics, like capture by powerful groups, polarization, majorities endorsing evil norms, etc., which could play out in scary ways with a superintelligence discounting human signals.
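As a toy illustration of the consensus/average idea (again, nothing here is a real proposal; the populations and numbers are invented), aggregating approval with a robust statistic like the median tracks broad value drift over time while a single outlier barely moves it, though it obviously does nothing about a majority endorsing bad norms:

```python
# Toy sketch: consensus approval as a median over a population of approvers.
# The target drifts as the population's values drift, but one extreme outlier
# has little effect. (It offers no protection if the majority itself is bad.)
from statistics import median

def consensus_approval(signals: list[float]) -> float:
    """Median approval across a population, each signal in [-1, 1]."""
    return median(signals)

population_now   = [0.2, 0.3, 0.25, 0.4, 0.3]        # broad mild approval
population_later = [-0.4, -0.5, -0.3, -0.45, -0.35]  # values have drifted
with_outlier     = population_now + [-1.0]           # one extreme dissenter

print(consensus_approval(population_now))    # 0.3
print(consensus_approval(population_later))  # -0.4   (target shifts with the population)
print(consensus_approval(with_outlier))      # 0.275  (single outlier barely moves it)
```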
I feel like this discussion can only happen in the context of a much more nuts-and-bolts plan for how this would work in an AGI. In particular, I think the AGI programmers would have various free parameters / intervention points in the code to play around with, some of which may be disanalogous to anything in human or animal brains. So we would need to list those intervention points and talk about what to do with them, and then think about possible failure modes, which might be related to exogenous or endogenous distribution shifts, AGI self-modification / making successors, etc. We definitely need this discussion but it wouldn’t fit in a comment thread.
Makes sense!
I think “alignment plasticity” is called “corrigibility”.
I agree with your view that approval reward as an AGI target would be complicated. I’d add that even robustly desiring the approval of humans is probably not a good thing for an ASI to be doing, in the same way that a “smile optimizer” would not be a good thing for people who only want to smile because they are happy.
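A minimal sketch of that proxy-versus-target worry (entirely made-up actions and scores, just to illustrate the Goodhart-style point): an agent ranking actions by the approval they elicit can come apart from an agent ranking them by the underlying benefit the approval was supposed to track.

```python
# Toy sketch: ranking by the proxy (elicited approval/"smiles") vs. ranking by
# the underlying target (actual benefit). Made-up actions and numbers.
actions = {
    # action: (true benefit, approval it elicits), both in [-1, 1]
    "solve_hard_problem":   (0.9, 0.6),
    "flatter_and_reassure": (0.0, 0.9),
    "coerce_approval":      (-0.8, 1.0),
}

best_by_proxy  = max(actions, key=lambda a: actions[a][1])
best_by_target = max(actions, key=lambda a: actions[a][0])
print(best_by_proxy)   # coerce_approval
print(best_by_target)  # solve_hard_problem
```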
“unless we discover some magic universal value system that is relevant for all of humanity for all eternity”

I’m not a huge fan of your dismissive tone here. My goal is to help humanity build a system for encoding such a thing. I think it is very difficult, probably the most difficult thing humanity has ever attempted by far. But I do not think it is impossible, and it is only “magic” in the sense that any engineering discipline is magic.
Thank you! My intent definitely wasn’t to be dismissive, skeptical maybe, and I’m aligned with you that solving this particular problem is both extremely hard and extremely important. Thanks for pointing out how that landed.
No problem. Hope my criticism didn’t come across as overly harsh. I’m grateful for your engagement : )