“Model-Based Utility Functions” (Hibbard 2012) gave a similar intuition:

Human agents can avoid self-delusion so human motivation may suggest a way of computing utilities so that agents do not choose the delusion box for self-delusion (although they may experiment with it to learn how it works). At this moment my dogs are out of sight but I am confident that they are in the kitchen and because I cannot hear them I believe they are resting. Their happiness is one of my motives and I evaluate that they are currently reasonably happy. I am evaluating my motives based on my internal mental model rather than my observations, although my mental model is inferred from my observations. I am motivated to maintain the well being of my dogs and so will act to avoid delusions that prevent me from having an accurate model of their state. If I choose to watch a movie on TV tonight it will delude my observations. However, I know that movies are make-believe so observations of movies update my model of make-believe worlds rather than my model of the real world. My make-believe models and my real world model have very different roles in my motivations. These introspections about my own mental processes motivate me to seek a way for AI agents to avoid self-delusion by basing their utility functions on the environment models that they learn from their interactions with the environment.

And proposed:

This paper argues, via two examples, that the behavior problems can be avoided by formulating the utility function in two steps: 1) inferring a model of the environment from interactions, and 2) computing utility as a function of the environment model.
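To make the two-step formulation concrete, here is a minimal illustrative sketch, not taken from the paper: the names `EnvironmentModel` and `model_based_utility` are hypothetical. The agent first infers a model of the environment from its observations, and utility is then computed from that inferred model rather than from the raw observations directly.

```python
# Hypothetical sketch of a two-step, model-based utility function.
# Step 1: infer a model of the environment from interactions.
# Step 2: compute utility as a function of the model, not the observations.
from collections import Counter


class EnvironmentModel:
    """Belief about the environment's hidden state, inferred from observations."""

    def __init__(self):
        self.state_counts = Counter()

    def update(self, observation):
        # Step 1: update the inferred model with a new interaction.
        self.state_counts[observation] += 1

    def belief(self, state):
        # Empirical probability assigned to a hidden state of interest.
        total = sum(self.state_counts.values())
        return self.state_counts[state] / total if total else 0.0


def model_based_utility(model):
    # Step 2: utility depends only on the agent's model of the world.
    # Deluding the observation channel would corrupt the model's accuracy,
    # which an agent motivated by real-world state has reason to avoid.
    return model.belief("dogs_happy")


model = EnvironmentModel()
for obs in ["dogs_happy", "dogs_happy", "dogs_resting"]:
    model.update(obs)

print(round(model_based_utility(model), 2))  # 0.67
```

The key design point in this sketch is that `model_based_utility` never touches the observations themselves; they influence utility only through the model they induce, which separates "what I see" from "what I believe is true".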