Maybe! I think this is how Evan explicitly defined it for a time, a few years ago. I think the strong claim isn’t very plausible, and the latter claim is… misdirecting of attention, and maybe too weak. Re: attention, I think that “does the agent end up aligned?” gets explained by the dataset more than by the reward function over e.g. hypothetical sentences.
I think “reward/reinforcement numbers” and “data points” are inextricably wedded. I think trying to reason about reward functions in isolation is… a caution sign? A warning sign?