Thanks for the post. The importance of reward function design for solving the alignment problem is worth emphasizing.
I’m wondering how your research fits into other reward-function alignment research, such as CHAI’s work on CIRL (cooperative inverse reinforcement learning), inverse reinforcement learning more broadly, and reward learning theory.
It seems like these other agendas are focused on using game theory or machine learning fundamentals to come up with a new RL approach that makes AI alignment easier, whereas your research is more focused on the intersection of neuroscience and RL.
My very diplomatic answer is: the field of Reward Function Design should be a rich domain with lots of ideas. Curiosity drive is one of them, and so is reward shaping, and so is IRL / CIRL, etc. What else should be on that list that hasn’t been invented yet? Well, let’s invent it! Let a thousand flowers bloom!
…Less diplomatically, since you asked, here’s a hot take. I’m not 100% confident, but I currently don’t think IRL / CIRL per se is a step forward for the kinds of alignment problems I’m worried about. Some possible (semi-overlapping) issues include:

(1) ontology identification: figuring out which latent variables, if any, correspond to a human, or to human values, in a learned-from-scratch, unlabeled world-model;

(2) “the hard problem of wireheading”;

(3) “the problem of fully updated deference”;

(4) my guess that the “brain-like AGI” that I’m specifically working on simply wouldn’t be compatible with IRL / CIRL anyway (i.e., I’m worried that IRL-compatible algorithms would be much less powerful); and

(5) my lack of confidence that learning what a particular human wants to do right now, and then wanting the same thing, really constitutes progress on the ASI x-risk problem in the first place.
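To make the basic move being discussed concrete for readers who haven’t seen the IRL / CIRL setup: below is a toy sketch of Bayesian reward inference, the core idea that these approaches build on. Everything in it (the two candidate reward functions, the Boltzmann-rational human model, the specific numbers) is invented purely for illustration; it is not CHAI’s actual algorithm, just the textbook-style skeleton.

```python
# Toy Bayesian reward inference in the spirit of IRL / CIRL.
# A hedged illustration only: the reward hypotheses, the noisily-rational
# human model, and all numbers below are made up for the example.
import numpy as np

# Two hypotheses about what the human values, over a 3-action world.
reward_hypotheses = np.array([
    [1.0, 0.0, 0.0],   # hypothesis A: human only values action 0
    [0.0, 0.5, 1.0],   # hypothesis B: human values actions 1 and 2
])
prior = np.array([0.5, 0.5])  # uniform prior over the two hypotheses

def boltzmann_likelihood(action, rewards, beta=3.0):
    """P(human picks `action`) under a Boltzmann (noisily-rational) model."""
    logits = beta * rewards
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs[action]

# The robot watches the human choose action 2 twice, and updates its
# posterior over which reward function the human has.
posterior = prior.copy()
for observed_action in [2, 2]:
    likelihoods = np.array([
        boltzmann_likelihood(observed_action, r) for r in reward_hypotheses
    ])
    posterior = posterior * likelihoods
    posterior /= posterior.sum()

print("posterior over reward hypotheses:", posterior)

# The robot then acts to maximize expected reward under its posterior.
expected_reward = posterior @ reward_hypotheses
print("robot's best action:", int(expected_reward.argmax()))
```

Note what happens as the posterior concentrates on one hypothesis: the expected-reward-maximizing action no longer depends on anything further the human might say or do, so the incentive to keep deferring to the human evaporates. That’s “the problem of fully updated deference” (issue 3 above) in miniature, and the final line is exactly the “learn what this human wants, then want the same thing” step that issue 5 questions.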