# World-models containing self-models

A problem that sometimes comes up in theoretical AI is finding ways for AI systems to model themselves, or at least to act well as if they had models of themselves. I can see how this is a problem for uncomputable agents like AIXI (though I think this problem is largely solved by reflective oracles), but it doesn’t seem to me to be a very hard problem for computable agents; they seem to me to be able to learn models of themselves along with the rest of the world. I’ll give an example of self-modeling trouble that some kinds of systems can run into, then my reasons for not thinking this is a big problem (though I’m by no means sure!).

# A problem for model-based RL

Suppose that we’re using model-based RL: our system learns a model that maps states of the world and actions the system takes to next states and rewards. This learned model is used to choose actions by building a tree of possible sequences of actions the system could take and the consequences that the model predicts would result; the path with the highest expected reward is chosen.

The situation our system is in will be as follows:

The system is learning to perform some episodic RL task; at the end of each episode, the environment is reset, then another episode is run.

In this environment, the agent has an action that gives a moderately large reward, but that forces the agent to take a null action for the rest of the episode.

The interesting thing here is that the system’s model won’t learn anything about the bad side effect of this action, even if it impacts the system’s total reward a lot. This is because the model maps (state, action) → (next state); it learns what environmental state the bad action leads to, and after that it learns a lot about the effects of the null action, but it doesn’t learn that the bad action leads to the null action. Furthermore, the tree search will continue to assume that the system will be able to choose whatever action it wants, even when the system will be forced to take the null action.
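The failure can be reproduced in a few lines. The following is a minimal sketch rather than a definitive setup: the horizon, the reward values (1 for a hypothetical `WORK` action, 3 for the `TRAP` action) and all function names are assumptions chosen for illustration, and the true dynamics stand in for a perfectly learned (state, action) model.

```python
from itertools import product

# Hypothetical toy task: names, rewards and horizon are illustrative assumptions.
WORK, TRAP, NULL = 0, 1, 2
ACTIONS = (WORK, TRAP, NULL)
REWARD = {WORK: 1.0, TRAP: 3.0, NULL: 0.0}
HORIZON = 4  # steps per episode


def dynamics(state, action):
    """(state, executed action) -> (next state, reward); state = (step, locked)."""
    step, locked = state
    return (step + 1, locked or action == TRAP), REWARD[action]


def executed_action(state, chosen):
    """Once locked, the agent's choice is overridden by the null action."""
    return NULL if state[1] else chosen


def plan(state, model):
    """Exhaustive tree search that assumes every action stays available."""
    best_first, best_return = None, float("-inf")
    for seq in product(ACTIONS, repeat=HORIZON - state[0]):
        s, total = state, 0.0
        for a in seq:
            s, r = model(s, a)
            total += r
        if total > best_return:
            best_first, best_return = seq[0], total
    return best_first, best_return


def run_episode(model):
    """Act greedily with the planner; return the reward actually collected."""
    state, total = (0, False), 0.0
    while state[0] < HORIZON:
        chosen, _ = plan(state, model)
        state, r = dynamics(state, executed_action(state, chosen))
        total += r
    return total


predicted = plan((0, False), dynamics)  # what the tree search expects
actual = run_episode(dynamics)          # what really happens
print(predicted, actual)
```

Under these assumptions the planner expects a return of 12 from repeating `TRAP`, but the episode actually yields 3, less than the 4 available from always choosing `WORK`. The (state, action) model is never wrong here; the error lies entirely in the planner’s assumption about which actions it will get to take.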

This is concerning, but the fix seems simple: have the system learn an additional model that maps states to states, implicitly causing it to model the system’s action selection. Then, when the agent selects an action, have it use the (state, action) → (state) model followed by several iterations of the (state) → (state) model to see what effects that action will have. (Once the system has inferred a good enough model of itself, this will effectively be a tree search.) This should allow the system to learn when it will be forced to take the null action, so that it can choose that action only when it actually maximises rewards.
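A rough sketch of this fix on a toy task, under loudly stated assumptions: the task (a hypothetical `WORK` action worth 1, a `TRAP` action worth 3 that locks the agent into `NULL`), the 4-step horizon, the tabular "last observed transition" form of the (state) → (state) model, and the optimistic value for unvisited states are all illustrative choices, not part of the proposal itself. The (state) → (state) model here also records the reward, and the true dynamics again stand in for a learned (state, action) model.

```python
# Hypothetical toy task: names, rewards and horizon are illustrative assumptions.
WORK, TRAP, NULL = 0, 1, 2
ACTIONS = (WORK, TRAP, NULL)
REWARD = {WORK: 1.0, TRAP: 3.0, NULL: 0.0}
HORIZON = 4  # steps per episode


def dynamics(state, action):
    """(state, executed action) -> (next state, reward); state = (step, locked)."""
    step, locked = state
    return (step + 1, locked or action == TRAP), REWARD[action]


def rollout(state, self_model):
    """Iterate the (state) -> (state) model until the end of the episode."""
    total = 0.0
    while state[0] < HORIZON:
        if state not in self_model:
            # Assumption: optimism about unvisited states drives exploration.
            return total + max(REWARD.values()) * (HORIZON - state[0])
        state, r = self_model[state]
        total += r
    return total


def choose(state, self_model):
    """One (state, action) step, then the self-model for the remaining steps."""
    def value(action):
        nxt, r = dynamics(state, action)
        return r + rollout(nxt, self_model)
    return max(ACTIONS, key=value)


def train(num_episodes):
    """Run episodes, updating the self-model online; return episode returns."""
    self_model = {}  # state -> (next state, reward), as last observed
    returns = []
    for _ in range(num_episodes):
        state, total = (0, False), 0.0
        while state[0] < HORIZON:
            chosen = choose(state, self_model)
            executed = NULL if state[1] else chosen  # forced null when locked
            nxt, r = dynamics(state, executed)
            self_model[state] = (nxt, r)  # implicitly captures action selection
            state, total = nxt, total + r
        returns.append(total)
    return returns


returns = train(3)
print(returns)
```

With these assumptions the first episode falls into the trap (return 3). From the second episode on, the agent works until the final step and takes the trap action only there, where no later steps are lost (return 6).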

In general, this kind of approach seems fine to me; a system can learn a model of the environment including itself, and use this model to figure out the long-term consequences of its actions. I haven’t yet found a problem with this, and I might look for some kind of formal guarantee.

All in all, the theoretical problem involving uncomputable ideals like AIXI seems to be mostly solved, and the practical problem doesn’t seem like a big deal because of fixes like the above. Am I missing something?

A few points:

1. If the problem isn’t solved for AIXI, then I don’t see why it would be solved for bounded agents. Solomonoff induction doesn’t have Solomonoff induction in its hypothesis class, and naive bounded Solomonoff induction doesn’t have naive bounded Solomonoff induction in its hypothesis class.

2. The solution you propose looks something like: to predict X property of the world, just learn to predict X directly instead of learning a whole model of the world containing the agent. (Let me know if this is an inaccurate summary.) This essentially “marginalizes over the agent” by ignoring the agent. While this isn’t a completely bad way to predict X, it seems like this won’t succeed in creating an integrated world model. If there’s some property of the agent’s source code that affects X, then the agent won’t be able to determine this before actually running that branch of the source code. I think a lot of the logical uncertainty / naturalized induction problem is to create an integrated world model that can reason about the agent itself when this is useful to predict X.

3. Reflective oracles solve a large part of the unbounded problem, but they don’t have an obvious bounded analogue. For example, when an agent using reflective oracles is predicting a smarter agent, it is already able to see that smarter agent’s distribution over actions. This doesn’t work in bounded land; you probably need some other ingredient. Perhaps the solution acts like reflective oracles in the limit, but it needs a story for why a weak agent can reason about a smart agent, which reflective oracles don’t provide.

I think that Vadim’s optimal predictors can play the same role as reflective oracles in the bounded case, or at least that’s the idea.

Both of them are just analysis tools though, not algorithms. I think the corresponding algorithms will be closer to what Daniel describes; that is, the agent does not treat itself specially (except for correlation between its decision and the agent’s output).

Thanks Jessica. This was helpful, and I think I see more what the problem is.

Re point 1: I see what you mean. The intuition behind my post is that it seems like it should be possible to make a bounded system that can eventually come to hold any computable hypothesis given enough evidence, including a hypothesis including a model of itself of arbitrary precision (which is different from Solomonoff, which can clearly never think about systems like itself). It’s clearly not possible for the system to hold and update infinitely many hypotheses the way Solomonoff does, and a system would need some kind of logical uncertainty or other magic to evaluate complex or self-referential hypotheses, but it seems like these hypotheses should be “in its class”. Does this make sense, or do you think there is a mistake there?

Re point 2: I’m not confident that’s an accurate summary; I’m precisely proposing that the agent learn a model of the world containing a model of the agent (approximate or precise). I agree that evaluating this kind of model will require logical uncertainty or similar magic, since it will be expensive and possibly self-referential.

Re point 3: I see what you mean, though for self-modeling the agent being predicted should only be as smart as the agent doing the prediction. It seems like approximation and logical uncertainty are the main ingredients needed here. Are there particular parts of the unbounded problem that are not solved by reflective oracles?

Re point 1: Suppose the agent considers all hypotheses of length up to $l$ bits that run in up to $t$ time. Then the agent takes $2^l t$ time to run. For an individual hypothesis to reason about the agent, it must use $t$ computation time to reason about a computation of size $2^l t$. A theoretical understanding of how this works would solve a large part of the logical uncertainty / naturalized induction / Vingean reflection problem.
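To spell out the count behind this estimate (a rough sketch, assuming the agent simply enumerates every bit string of length at most $l$ as a hypothesis and runs each one for $t$ steps):

$$\sum_{k=0}^{l} 2^k \cdot t = \left(2^{l+1} - 1\right) t \approx 2 \cdot 2^l t$$

So the agent’s total running time is $2^l t$ up to a constant factor, while any single hypothesis inside it has only $t$ steps available: a gap of roughly a factor of $2^l$.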

Maybe it’s possible for this to work without having a theoretical understanding of why it works, but the theoretical understanding is useful too (it seems like you agree with this). I think there are some indications that naive solutions won’t automatically work; see e.g. this post.

Re point 2: It seems like this is learning a model from the state and action to state, and a model from state to state that ignores the agent. But it isn’t learning a model that e.g. reasons about the agent’s source code to predict the next state. An integrated model should be able to do reasoning like this.

Re point 3: I think you still have a Vingean reflection problem if a hypothesis that runs in $t$ time predicts a computation of size $2^l t$. Reflective Solomonoff induction solves a problem with an unrealistic computation model, and doesn’t translate to a solution with a finite (but large) amount of computing resources. The main part not solved is the general issue of predicting aspects of large computations using a small amount of computing power.

Thanks. I agree that these are problems. It seems to me that the root of these problems is logical uncertainty / Vingean reflection (which seem like two sides of the same coin); I find myself less confused when I think about self-modeling as being basically an application of “figuring out how to think about big / self-like hypotheses”. Is that how you think of it, or are there aspects of the problem that you think are missed by this framing?

Yes, this is also how I think about it. I don’t know anything specific that doesn’t fit into this framing.