I’m generally pretty skeptical about inverse reinforcement learning (IRL) as a method for alignment. One of many arguments against it: I do not act according to any utility function, including the one I would deem best. Presumably, if I had as much time & resources as I wanted, I would eventually be able to figure out a good approximation to what that best utility function would do, and do it; at that point I would be acting according to the utility function I deem best. That process of value-reflection is nothing like performing a Bayes update, or finding a best-fit utility function to my current or past actions. Yet that is exactly how IRL finds the utility function that is then to be optimized for all time!
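To make the contrast concrete, here is a minimal sketch of the "best-fit" step in Bayesian IRL: a Bayes update over a handful of candidate reward functions, given observed (state, action) pairs. The toy environment, the candidate rewards, the Boltzmann-rational demonstrator model, and the demonstrations are all hypothetical, chosen only to show the shape of the inference, not any particular IRL system:

```python
import numpy as np

# A minimal sketch of Bayesian IRL on a toy 3-state, 2-action problem.
# The point is only the *shape* of the inference: IRL fits a fixed
# reward function to past actions via a Bayes update.

N_STATES, N_ACTIONS = 3, 2

# Hypothesis space: a few candidate reward functions R(s).
candidate_rewards = [
    np.array([1.0, 0.0, 0.0]),
    np.array([0.0, 1.0, 0.0]),
    np.array([0.0, 0.0, 1.0]),
]
prior = np.full(len(candidate_rewards), 1 / len(candidate_rewards))

# Deterministic toy dynamics: action a from state s lands in state T[s, a].
T = np.array([[1, 2],
              [2, 0],
              [0, 1]])

def action_likelihood(R, s, a, beta=5.0):
    """P(a | s, R) under a Boltzmann-rational demonstrator: actions are
    chosen in proportion to exp(beta * reward of the resulting state)."""
    q = np.array([R[T[s, b]] for b in range(N_ACTIONS)])
    p = np.exp(beta * q)
    return (p / p.sum())[a]

# Hypothetical demonstrations: (state, action) pairs observed from the human.
demos = [(0, 1), (1, 0), (2, 1)]

# Bayes update: posterior over candidate reward functions given the demos.
posterior = prior.copy()
for s, a in demos:
    posterior *= [action_likelihood(R, s, a) for R in candidate_rewards]
posterior /= posterior.sum()

print("posterior over candidate rewards:", posterior)
# IRL then commits to this fitted reward function for good -- there is
# no analogue of further value-reflection changing the target.
```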
A more general conclusion: we should not be aiming to make our AGI have a utility function. Utility functions are nice because they’re static, but unless the process used to produce that utility function resembles the process I’d use to find the best one (unlikely, hard, and unwise to attempt!), the static-value nature of our AGI is a flaw, even if it makes it easier to tell whether you will die if you run it.
Though there are limitations to this argument: Tammy’s QACI, for instance, does try to derive its target from something like a long-reflection process. Such schemes still seem unwise, but for slightly different reasons.