World-models containing self-models

One problem that sometimes comes up in theoretical AI is finding ways for AI systems to model themselves, or at least to act as well as they would if they had models of themselves. I can see how this is a problem for uncomputable agents like AIXI (though I think it’s largely solved by reflective oracles), but it doesn’t seem like a very hard problem for computable agents: they seem able to learn models of themselves along with the rest of the world. I’ll give an example of self-modeling trouble that some kinds of systems can run into, then explain why I don’t think it’s a big problem (though I’m by no means sure!).

A problem for model-based RL

Suppose that we’re using model-based RL: our system learns a model that maps states of the world and actions the system takes to next states and rewards. This learned model is used to choose actions by building a tree of possible sequences of actions the system could take and the consequences that the model predicts would result; the path with the highest expected reward is chosen.
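
To make this concrete, here’s a minimal sketch of such a planner (the function names and types are my own, not from any particular library), assuming a learned model of type (state, action) → (next state, reward) and a small discrete action set; it exhaustively searches action sequences up to a fixed depth.

```python
from typing import Callable, Hashable, List, Tuple

State = Hashable
Action = Hashable
# Learned model: predicts the next state and reward for a (state, action) pair.
Model = Callable[[State, Action], Tuple[State, float]]


def plan(model: Model, state: State, actions: List[Action], depth: int) -> Action:
    """Return the first action of the action sequence with the highest predicted return."""

    def best_return(s: State, d: int) -> float:
        if d <= 0:
            return 0.0
        # Branch over every action at every node: the search implicitly assumes
        # the system is always free to take whichever action it likes.
        totals = []
        for a in actions:
            s2, r = model(s, a)
            totals.append(r + best_return(s2, d - 1))
        return max(totals)

    def value_of(a: Action) -> float:
        s2, r = model(state, a)
        return r + best_return(s2, depth - 1)

    return max(actions, key=value_of)
```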

Our system is in the following situation:

  • The system is learning to perform some episodic RL task; at the end of each episode, the environment is reset, then another episode is run.

  • In this environment, the agent has an action that gives a moderately large reward, but that forces the agent to take a null action for the rest of the episode.
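
For concreteness, here’s a minimal toy version of such an environment (the class name, action names, and reward values are all my own invention, just to make the setup concrete): the trap action pays a moderate one-off reward but forces the null action for the rest of the episode, while the work action pays a small reward every step.

```python
class LockoutEnv:
    """Episodic toy task: TRAP pays once but forces NULL for the rest of the episode."""

    WORK, TRAP, NULL = "work", "trap", "null"

    def __init__(self, horizon: int = 10):
        self.horizon = horizon
        self.reset()

    def reset(self):
        self.t = 0
        self.locked = False
        return ("free", self.t)

    def step(self, action):
        if self.locked:
            action = self.NULL                  # the agent's choice is overridden
        if action == self.TRAP:
            reward, self.locked = 5.0, True     # moderately large one-off reward
        elif action == self.WORK:
            reward = 1.0                        # small reward, available every step
        else:
            reward = 0.0
        self.t += 1
        state = ("locked" if self.locked else "free", self.t)
        done = self.t >= self.horizon           # episode ends; the environment gets reset
        return state, reward, done
```

With a 10-step horizon, always choosing the work action earns a return of 10 while taking the trap action earns only 5, so a well-calibrated planner should avoid it.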

The interesting thing here is that the system’s model won’t learn anything about the bad side effect of this action, even though it has a large impact on the system’s total reward. This is because the model maps (state, action) → (next state, reward): it learns which environmental state the bad action leads to, and it learns plenty about the effects of the null action in that state, but it has no way to represent the fact that the bad action forces every subsequent action to be the null action. Furthermore, the tree search will continue to assume that the system can choose whatever action it wants, even at points where it will in fact be forced to take the null action.
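
To see why, it helps to look at what the transition data from an environment like this looks like (a hypothetical illustration using the toy environment above): the forced null actions are logged exactly like freely chosen ones, so nothing in the (state, action, next state, reward) tuples records the loss of control.

```python
# Transitions logged during an episode in which the agent takes the trap action.
# Each tuple is (state, action, next_state, reward) -- exactly what the model trains on.
logged_transitions = [
    (("free", 0), "trap", ("locked", 1), 5.0),    # the bad action and its immediate payoff
    (("locked", 1), "null", ("locked", 2), 0.0),  # forced, but recorded like a free choice
    (("locked", 2), "null", ("locked", 3), 0.0),
    # ... and so on until the episode ends
]
```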

This is concerning, but the fix seems simple: have the system learn an additional model that maps states directly to next states and rewards, which implicitly requires it to model its own action selection. Then, when the system selects an action, have it apply the (state, action) → (state, reward) model once, followed by several iterations of the (state) → (state, reward) model, to see what effects that action will have. (Once the system has inferred a good enough model of itself, this will effectively be a tree search.) This should allow the system to learn when it will be forced to take the null action, so that it chooses the bad action only when doing so actually maximises reward.
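
Here’s a sketch of what this might look like, reusing the hypothetical types from the planner above and assuming the extra self-model also predicts the reward at each step (it needs to, if the rollout is to estimate returns): each candidate action is scored by applying the (state, action) model once and then rolling the self-model forward for the rest of the horizon.

```python
from typing import Callable, Hashable, List, Tuple

State = Hashable
Action = Hashable
ActionModel = Callable[[State, Action], Tuple[State, float]]  # (s, a) -> (s', r)
SelfModel = Callable[[State], Tuple[State, float]]            # s -> (s', r), folding in the system's own choice


def choose_action(action_model: ActionModel, self_model: SelfModel,
                  state: State, actions: List[Action], horizon: int) -> Action:
    """Pick the action whose one-step effect, followed by a self-model rollout,
    has the highest predicted return."""

    def rollout_return(s: State, steps: int) -> float:
        total = 0.0
        for _ in range(steps):
            # The self-model predicts what the system itself will do next, so
            # stretches where it is forced to take the null action show up here
            # automatically once the model has seen enough episodes.
            s, r = self_model(s)
            total += r
        return total

    def value_of(a: Action) -> float:
        s1, r1 = action_model(state, a)
        return r1 + rollout_return(s1, horizon - 1)

    return max(actions, key=value_of)
```

In the toy environment above, once the self-model has learned that locked states lead only to zero-reward steps, this scores the trap action as worth just its immediate reward, so it would only be chosen when the remaining horizon is short enough for that to beat repeatedly taking the work action.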

In general, this kind of approach seems fine to me; a system can learn a model of the environment including itself, and use this model to figure out the long-term consequences of its actions. I haven’t yet found a problem with this, and I might look for some kind of formal guarantee.

All in all, the theoretical problem involving uncomputable ideals like AIXI seems to be mostly solved, and the practical problem doesn’t seem like a big deal because of fixes like the above. Am I missing something?