My very diplomatic answer is: the field of Reward Function Design should be a rich domain with lots of ideas. A curiosity drive is one of them, and so is reward shaping, and so is IRL / CIRL (inverse reinforcement learning / cooperative IRL), etc. What else should be on that list that hasn’t been invented yet? Well, let’s invent it! Let a thousand flowers bloom!
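To make one item on that list concrete, here’s a minimal sketch of potential-based reward shaping in the sense of Ng, Harada & Russell (1999): adding a shaping term F(s, s′) = γΦ(s′) − Φ(s) to the environment reward, which provably leaves the optimal policy unchanged for any potential function Φ. The gridworld, goal, and potential function below are made-up illustrations, not anything from a real system:

```python
# Illustrative sketch of potential-based reward shaping (Ng, Harada &
# Russell, 1999). Replacing the environment reward r with
# r' = r + gamma * phi(s') - phi(s) leaves the optimal policy unchanged
# for any potential phi, while a well-chosen phi densifies the signal.
# The gridworld goal and potential here are hypothetical stand-ins.

GAMMA = 0.99
GOAL = (9, 9)  # made-up gridworld goal

def phi(state: tuple[int, int]) -> float:
    """Hypothetical potential: negative Manhattan distance to the goal."""
    return -(abs(GOAL[0] - state[0]) + abs(GOAL[1] - state[1]))

def shaped_reward(r: float, state: tuple[int, int],
                  next_state: tuple[int, int]) -> float:
    """Add the shaping term F(s, s') = gamma * phi(s') - phi(s) to r."""
    return r + GAMMA * phi(next_state) - phi(state)

# A step toward the goal earns a small bonus even when the environment
# reward is zero, so the agent gets feedback long before reaching the goal.
assert shaped_reward(0.0, (0, 0), (1, 0)) > 0
```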
…Less diplomatically, since you asked, here’s a hot take. I’m not 100% confident, but I currently don’t think IRL / CIRL per se is a step forward for the kinds of alignment problems I’m worried about. Some possible issues (semi-overlapping) include:

1. Ontology identification: figuring out which latent variables, if any, correspond to a human, or to human values, in a learned-from-scratch unlabeled world-model.
2. “The hard problem of wireheading”.
3. “The problem of fully updated deference”.
4. My guess that the “brain-like AGI” that I’m specifically working on simply wouldn’t be compatible with IRL / CIRL anyway (i.e., I’m worried that IRL-compatible algorithms would be much less powerful).
5. My lack of confidence that learning what a particular human wants to do right now, and then wanting the same thing, really constitutes progress on the ASI x-risk problem in the first place.
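For reference, the “CIRL” I’m critiquing here is the cooperative game formalized by Hadfield-Menell et al. (2016), roughly:

$$M = \big\langle \mathcal{S},\ \{\mathcal{A}^{H}, \mathcal{A}^{R}\},\ T,\ \{\Theta, R\},\ P_0,\ \gamma \big\rangle, \qquad R : \mathcal{S} \times \mathcal{A}^{H} \times \mathcal{A}^{R} \times \Theta \to \mathbb{R},$$

where the reward parameter $\theta \sim P_0$ is observed by the human $H$ but not the robot $R$, and both players receive the same reward $R(s, a^{H}, a^{R}; \theta)$, so the robot is incentivized to infer $\theta$ from the human’s behavior. The concerns above are about whether that setup, as stated, actually buys safety.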