In my usage, “wireheading” is generally the direct modification of a reward value, bypassing the utility function that is supposed to map experience to reward. It’s a subset of Goodhart, which can also cover misuse of the reward mapping (eating ice cream instead of healthier food) or changes to the mapping function itself (intentionally acquiring a taste for something).
But really, what’s the purpose of trying to distinguish wireheading from other forms of reward hacking? The mitigations for Goodhart are the same: ensure that there is a reward function that actually matches real goals, or enough functions with declining marginal weight that abusing any of them is self-limiting.
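To illustrate the second mitigation, here's a toy sketch (my own example, not from the comment above): combine several proxy metrics through a concave transform so each one's marginal weight declines, making it unprofitable to pump any single proxy to an extreme.

```python
import math

def combined_reward(metrics):
    # log1p is concave: each extra unit of a metric is worth
    # less than the previous one.
    return sum(math.log1p(m) for m in metrics)

def marginal_gain(m):
    # Reward for one more unit of the first metric, others held fixed.
    return combined_reward([m + 1.0, 1.0, 1.0]) - combined_reward([m, 1.0, 1.0])

# Early gains on a metric are worth something; pumping it
# further is self-limiting because the marginal gain shrinks.
print(marginal_gain(1.0))
print(marginal_gain(1000.0))
```

The declining marginal weight means an agent hacking one proxy quickly hits diminishing returns, while genuine progress across all the metrics remains rewarded.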
Because mitigations for different failure modes might not be the same, depending on the circumstances.