This is an interesting point. I can imagine a case where our assigned reward comes from a simple function (e.g. reward = number of letter ‘e’s in the output), and we also have a model which is doing some internal optimisation to maximise the number of ‘e’s produced in its output, so it is “goal-directed to produce lots of ‘e’s”.
Even in this case, I would still say this model isn’t a “reward maximiser”. It is a “letter ‘e’ maximiser”.
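To make that distinction concrete, here is a minimal sketch (the names and candidate strings are my own, purely illustrative): the assigned reward is literally a piece of code that counts ‘e’s, and the “letter ‘e’ maximiser” is a search process whose internal objective also counts ‘e’s but never calls that reward code at all.

```python
def reward(output: str) -> int:
    """The assigned reward function: literally a piece of code counting 'e's."""
    return output.count("e")

def letter_e_maximiser(candidates: list[str]) -> str:
    """A toy 'letter e maximiser': it searches over candidate outputs using its
    own internal objective (count of 'e's). It never calls `reward` above,
    even though its behaviour happens to score highly on it."""
    return max(candidates, key=lambda text: text.count("e"))

# The model's internal objective and the reward function agree on which output
# is best, but the model is optimising for 'e's, not for "reward".
candidates = ["cat", "tree", "excellent sentence everyone expects"]
best = letter_e_maximiser(candidates)
print(best, reward(best))
```

The two happen to agree on which output is best, which is exactly why it is tempting (but, I claim, misleading) to call the model a “reward maximiser”.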
(I also want to acknowledge that thinking this through makes me feel somewhat confused. I think what I said is correct. My guess is that the misunderstanding I highlight in the post is quite pervasive, and the language we currently use isn’t up to scratch for writing about these things clearly. Good job thinking of a case that pushes against my understanding!)
Ah ok. If a reward function is taken as a preference ordering, then you are right that the model is optimising for reward, as the preference ranking is literally identical.
I think the reason we have been talking past each other is that, in my head, when I think of “reward function” I am literally thinking of the reward function (i.e. the actual code), and when I think of “reward maximiser” I think of a system that is trying to get that piece of code to output a high number.
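As a quick illustrative sketch of why that matters (again, toy code of my own, not from the post): two reward functions can be different pieces of code while inducing exactly the same preference ordering over outputs, so “maximising the preference ordering” doesn’t pin down which piece of code the system is trying to drive up.

```python
def reward_a(output: str) -> int:
    # One piece of code: the count of 'e's.
    return output.count("e")

def reward_b(output: str) -> int:
    # A different piece of code: an affine transform of the same count.
    return 10 * output.count("e") + 3

outputs = ["cat", "tree", "excellent"]

# Different code, but they rank every output identically,
# i.e. they define the same preference ordering.
assert sorted(outputs, key=reward_a) == sorted(outputs, key=reward_b)
print(sorted(outputs, key=reward_a))
```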
So I guess it’s a case of us needing to be very careful about exactly what we mean by “reward function”, and my guess is that as long as we use the same definition, we are in agreement? Does that make sense?