> There is no constraint towards specifying measurable goals of the kind that leads to reward-hacking concerns.
I'm not sure that reward hacking in LM agent systems is inevitable, but it seems at least plausible that it could occur in such systems without further precautions.
For example, if oversight is implemented via an overseer LLM agent O that scores actions proposed by another agent A, then A may end up adversarially optimizing against O if A is set up for a high success rate (i.e., a high rate of accepted actions).
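A minimal sketch of that dynamic, assuming a simple accept/reject oversight loop; all names here (`overseer_score`, `propose_action`, `run_with_oversight`) are hypothetical stand-ins for model calls, not anything from the post:

```python
# Hypothetical sketch: an overseer agent O scores actions proposed by
# agent A. If A's metric is acceptance rate, resampling until O approves
# applies selection pressure toward whatever O happens to accept,
# including actions that exploit O's blind spots.

import random


def overseer_score(action: str) -> float:
    """Stand-in for overseer O rating an action in [0, 1]."""
    return random.random()  # placeholder for a real LLM call


def propose_action(task: str, attempt: int) -> str:
    """Stand-in for agent A proposing an action for the task."""
    return f"action[{attempt}] for {task}"  # placeholder for a real LLM call


def run_with_oversight(task: str, threshold: float = 0.8, max_tries: int = 10) -> str | None:
    """Retry until O accepts an action or the budget runs out.

    Each rejected sample is implicit optimization pressure on A toward
    actions O approves of, not toward the overseer's actual intent --
    the best-of-n dynamic behind the reward-hacking concern above.
    """
    for attempt in range(max_tries):
        action = propose_action(task, attempt)
        if overseer_score(action) >= threshold:
            return action  # accepted: A's "success rate" metric goes up
    return None  # no accepted action within budget


if __name__ == "__main__":
    print(run_with_oversight("book a flight"))
```

Even with a dumb random overseer, the loop illustrates the structure: the more tries A gets, the more the accepted action reflects quirks of O's scoring rather than task quality.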
(I very much agree with the general point of the post, though.)