In my usage, “wireheading” is generally the direct modification of a reward value, bypassing the utility function that is supposed to map experience to reward. It’s a subset of Goodhart, which can also cover misuse of the reward mapping (eating ice cream instead of healthier food) or changes to the mapping function itself (intentionally acquiring a taste for something).
But really, what’s the purpose of trying to distinguish wireheading from other forms of reward hacking? The mitigations for Goodhart are the same: ensure that there is a reward function that actually matches real goals, or enough functions with declining marginal weight that abusing any of them is self-limiting.
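To illustrate the second mitigation, here's a toy sketch (my own example, not from the comment above): combine several proxy metrics through a concave transform so each one's marginal weight declines, making it unprofitable to pump any single proxy to an extreme.

```python
import math

def combined_reward(metrics):
    # log1p is concave: each extra unit of a metric is worth
    # less than the previous one.
    return sum(math.log1p(m) for m in metrics)

def marginal_gain(m):
    # Reward for one more unit of the first metric, others held fixed.
    return combined_reward([m + 1.0, 1.0, 1.0]) - combined_reward([m, 1.0, 1.0])

# Early gains on a metric are worth something; pumping it
# further is self-limiting because the marginal gain shrinks.
print(marginal_gain(1.0))
print(marginal_gain(1000.0))
```

The declining marginal weight means an agent hacking one proxy quickly hits diminishing returns, while genuine progress across all the metrics remains rewarded.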
Because mitigations for different failure modes might not be the same, depending on the circumstances.