Roughly, yeah, though there are some differences. For example, here the AI has no prior “directly about” values; everything is mediated by the “messages”, which themselves directly inform intended AI behavior. So, for instance, we don’t need to assume that “human values” live in the space of utility functions, or that the AI is going to explicitly optimize for something, or anything like that. But most of the things which are hard in CIRL are indeed still hard here; it doesn’t really solve anything by itself.
One way to interpret it: this approach uses a game similar to CIRL’s, but strips out most of the assumptions about the AI and human being expected utility maximizers. To the extent that we’re modelling the human as an optimizer, that’s just an approximation to kick off communication, and it can be discarded later on.
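To make the contrast concrete, here’s a toy sketch. All of the specifics (the message space, the human model, the update rule) are illustrative placeholders of my own choosing, not a claim about the actual setup: the idea is just that the AI’s prior is over candidate “messages” about intended behavior rather than over utility functions, and the human-as-optimizer model only enters as a swappable likelihood term used to get communication started.

```python
# Toy sketch: a prior over "messages" (candidate descriptions of intended behavior)
# rather than over utility functions. The optimizer model of the human is just one
# replaceable likelihood function, not a structural assumption of the game.

# Hypothetical message space, purely for illustration.
MESSAGES = ["fetch coffee", "fetch coffee unless interrupted", "do nothing"]

def human_model_as_optimizer(message, observed_action):
    """Crude bootstrap model: treat the human as (approximately) acting in line
    with the intended behavior, so actions consistent with `message` are likelier."""
    return 0.9 if observed_action in message else 0.1

def update_beliefs(prior, observed_action, likelihood=human_model_as_optimizer):
    """Bayes-update the AI's beliefs over messages from one observed human action.
    The `likelihood` argument is where the optimizer approximation lives; it can be
    swapped out later for a richer model of the human."""
    posterior = {m: p * likelihood(m, observed_action) for m, p in prior.items()}
    z = sum(posterior.values())
    return {m: p / z for m, p in posterior.items()}

# Start uniform over messages, then update on a couple of observed human actions.
beliefs = {m: 1.0 / len(MESSAGES) for m in MESSAGES}
for action in ["fetch", "coffee"]:
    beliefs = update_beliefs(beliefs, action)
print(beliefs)
```

Again, this doesn’t dodge any of the hard parts (where the message space comes from, how good the human model needs to be, etc.); it’s just meant to show where the utility-function assumptions drop out.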