Thanks for the post. The importance of reward function design for solving the alignment problem is worth emphasizing.
I’m wondering how your research fits into other reward-function alignment research, such as CHAI’s work on CIRL (cooperative inverse reinforcement learning), inverse reinforcement learning more broadly, and reward learning theory.
It seems like these other agendas are focused on using game theory or machine learning fundamentals to come up with a new RL approach that makes AI alignment easier, whereas your research is more focused on the intersection of neuroscience and RL.
My very diplomatic answer is: the field of Reward Function Design should be a rich domain with lots of ideas. Curiosity drive is one of them, and so is reward shaping, and so is IRL / CIRL, etc. What else should be on that list that hasn’t been invented yet? Well, let’s invent it! Let a thousand flowers bloom!
…Less diplomatically, since you asked, here’s a hot take. I’m not 100% confident, but I currently don’t think IRL / CIRL per se is a step forward for the kinds of alignment problems I’m worried about. Some possible (semi-overlapping) issues include:

(1) ontology identification: figuring out which latent variables, if any, correspond to a human, or to human values, in a learned-from-scratch, unlabeled world-model;

(2) “the hard problem of wireheading”;

(3) “the problem of fully updated deference”;

(4) my guess that the “brain-like AGI” that I’m specifically working on simply wouldn’t be compatible with IRL / CIRL anyway (i.e., I’m worried that IRL-compatible algorithms would be much less powerful); and

(5) my lack of confidence that learning what a particular human wants to do right now, and then wanting the same thing, really constitutes progress on the ASI x-risk problem in the first place.
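To make the basic move being discussed concrete for readers who haven’t seen the IRL / CIRL setup: below is a toy sketch of Bayesian reward inference, the core idea that these approaches build on. Everything in it (the two candidate reward functions, the Boltzmann-rational human model, the specific numbers) is invented purely for illustration; it is not CHAI’s actual algorithm, just the textbook-style skeleton.

```python
# Toy Bayesian reward inference in the spirit of IRL / CIRL.
# A hedged illustration only: the reward hypotheses, the noisily-rational
# human model, and all numbers below are made up for the example.
import numpy as np

# Two hypotheses about what the human values, over a 3-action world.
reward_hypotheses = np.array([
    [1.0, 0.0, 0.0],   # hypothesis A: human only values action 0
    [0.0, 0.5, 1.0],   # hypothesis B: human values actions 1 and 2
])
prior = np.array([0.5, 0.5])  # uniform prior over the two hypotheses

def boltzmann_likelihood(action, rewards, beta=3.0):
    """P(human picks `action`) under a Boltzmann (noisily-rational) model."""
    logits = beta * rewards
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs[action]

# The robot watches the human choose action 2 twice, and updates its
# posterior over which reward function the human has.
posterior = prior.copy()
for observed_action in [2, 2]:
    likelihoods = np.array([
        boltzmann_likelihood(observed_action, r) for r in reward_hypotheses
    ])
    posterior = posterior * likelihoods
    posterior /= posterior.sum()

print("posterior over reward hypotheses:", posterior)

# The robot then acts to maximize expected reward under its posterior.
expected_reward = posterior @ reward_hypotheses
print("robot's best action:", int(expected_reward.argmax()))
```

Note what happens as the posterior concentrates on one hypothesis: the expected-reward-maximizing action no longer depends on anything further the human might say or do, so the incentive to keep deferring to the human evaporates. That’s “the problem of fully updated deference” (issue 3 above) in miniature, and the final line is exactly the “learn what this human wants, then want the same thing” step that issue 5 questions.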