I’ve moved away from the wireheading terminology recently, and instead categorize the problem a little bit differently:
The top-level category is reward hacking / reward corruption, which means that the agent’s observed reward differs from true reward/task performance.
Reward hacking has two subtypes, depending on whether the agent exploited a misspecification in the process that computes the rewards, or modified the process. The first type is reward gaming and the second reward tampering.
Tampering can subsequently be divided into further subcategories. Does the agent tamper with its reward function, its observations, or the preferences of a user giving feedback? Which things the agent might want to tamper with depends on how its observed rewards are computed.
One advantage with this terminology is that it makes it clearer what we’re talking about. For example, its pretty clear what reward function tampering refers to, and how it differs from observation tampering, even without consulting a full definition.
That said, I think you’re post nicely puts the finger on what we usually mean when we say wireheading, and it is something we have been talking about a fair bit. Translated into my terminology, I think your definition would be something like “wireheading = tampering with goal measurement”.
Thanks Stuart, nice post.
I’ve moved away from the wireheading terminology recently, and instead categorize the problem a little bit differently:
The top-level category is reward hacking / reward corruption, which means that the agent’s observed reward differs from true reward/task performance.
Reward hacking has two subtypes, depending on whether the agent exploited a misspecification in the process that computes the rewards, or modified the process. The first type is reward gaming and the second reward tampering.
Tampering can subsequently be divided into further subcategories. Does the agent tamper with its reward function, its observations, or the preferences of a user giving feedback? Which things the agent might want to tamper with depends on how its observed rewards are computed.
One advantage with this terminology is that it makes it clearer what we’re talking about. For example, its pretty clear what reward function tampering refers to, and how it differs from observation tampering, even without consulting a full definition.
That said, I think you’re post nicely puts the finger on what we usually mean when we say wireheading, and it is something we have been talking about a fair bit. Translated into my terminology, I think your definition would be something like “wireheading = tampering with goal measurement”.