I mean, I find the whole “models don’t want rewards, they want proxies of rewards” conversation kind of pointless because nothing ever perfectly matches anything else, so I agree that in a common-sense way it’s fine to express this as “wanting reward”. But also, I think the people who care a lot about the distinction between proxies of reward and actual reward would feel justifiably misunderstood by this.
I think I agree with “nothing ever perfectly matches anything else”, and in particular, philosophically, there are many different precisifications of “reward/reinforcement” which are conceptually distinct, and it’s unclear which one, if any, a reward-seeking AI would seek. E.g. is it about a reward counter on a GPU somewhere going up, or is it about the backpropagation actually happening?
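To make those two readings a bit more concrete, here's a minimal toy sketch (assuming a hypothetical REINFORCE-style bandit, not anything from the conversation): the logged reward total going up is one event, and the parameter update that actually changes the policy is a separate one; either can in principle happen without the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-armed bandit with a softmax policy over logits `theta`.
theta = np.zeros(2)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

reward_counter = 0.0  # reading 1: a number somewhere going up

for step in range(1000):
    probs = softmax(theta)
    action = rng.choice(2, p=probs)
    reward = 1.0 if action == 1 else 0.0  # arm 1 pays off

    # Reading 1: the recorded reward total increments here...
    reward_counter += reward

    # Reading 2: ...but the policy only changes if this gradient
    # step (the analogue of backprop actually happening) runs.
    grad_log_prob = np.eye(2)[action] - probs
    theta += 0.1 * reward * grad_log_prob

print("total reward logged:", reward_counter)
print("final policy probs:", softmax(theta))
```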