This doesn’t seem relevant for non-AIXI RL agents which don’t end up caring about reward or explicitly weighing hypotheses over reward as part of the motivational structure? Did you intend it to be?
With almost any kind of feedback process (i.e., every concrete proposal that I know of), similar concerns arise. As I argue here, wireheading is one example of a very general failure mode. The failure mode is roughly: the process actually generating the feedback is identified, too literally, with the truth/value which that feedback is trying to teach.
Output-based evaluation (including supervised learning, the most popular forms of unsupervised learning, and a lot of other stuff which treats models as black boxes implementing some input/output behavior, probability distribution, or similar) can’t distinguish between a model which is internalizing the desired concepts and a model which is instead modeling the actual feedback process. These two do different things, but not in a way that the feedback system can differentiate.
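To make this concrete, here is a toy sketch in Python. Everything in it is a hypothetical illustration rather than a model of any real training setup: `true_value` stands in for the concept we want learned, `feedback_process` for a labeler with a systematic flaw (it reports every large-magnitude input as positive), and the two "models" are simply hard-coded to track one or the other.

```python
def true_value(x: float) -> bool:
    """The concept we actually want learned (assumed here: 'x is positive')."""
    return x > 0


def feedback_process(x: float) -> bool:
    """Hypothetical labeler: right for |x| <= 10, but reports any larger input as positive."""
    return x > 0 or abs(x) > 10


def concept_model(x: float) -> bool:
    """A model that has internalized the intended concept."""
    return true_value(x)


def feedback_model(x: float) -> bool:
    """A model that has instead learned to predict what the labeler will say."""
    return feedback_process(x)


def feedback_score(model, inputs) -> float:
    """Fraction of inputs on which the model's output matches the feedback it receives."""
    return sum(model(x) == feedback_process(x) for x in inputs) / len(inputs)


# Training inputs all fall where the labeler happens to be correct.
train_inputs = [-5.0, -2.0, -0.5, 0.5, 2.0, 5.0]
# Shifted inputs include the region where the labeler is systematically wrong.
shifted_inputs = [-50.0, -20.0, 3.0, 20.0]

train_scores = (
    feedback_score(concept_model, train_inputs),
    feedback_score(feedback_model, train_inputs),
)
shifted_scores = (
    feedback_score(concept_model, shifted_inputs),
    feedback_score(feedback_model, shifted_inputs),
)
disagreements = [x for x in shifted_inputs if concept_model(x) != feedback_model(x)]

print("train scores (concept, feedback-modeler):", train_scores)
print("shifted scores (concept, feedback-modeler):", shifted_scores)
print("inputs where the two models disagree:", disagreements)
```

On the training inputs the two models get identical scores, so the feedback has nothing to select between them. Where they come apart, the feedback-modeler is the one that scores better by the feedback's own lights.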
In terms of shard theory, as I understand it, the point is that (absent arguments to the contrary, which is what we want to be able to construct), shards that implement feedback-modeling like this cannot be disincentivized by the feedback process, since they perform very well in those terms. Shards which do other things may or may not be disincentivized, but the feedback-modeling shards (if any are formed at any point) definitely won’t, unless of course they’re just not very good at their jobs.
So the problem, then, is: how do we arrange training such that those shards have very little influence, in the end? How do we disincentivize that kind of reasoning at all?
Plausibly, this should only be tackled as a knock-on effect of the real problem: actually giving good feedback which points in the right direction. However, it remains a powerful class of counterexamples which challenges many proposals. (And therefore, trying to generate the analogue of the wireheading problem for a given proposal seems like a good sanity check.)