Force neural nets to use models, then detect these

Research projects

I’m planning to start two research projects on model splintering/reward generalisation and learning the preferences of irrational agents.

Within those projects, I’m aiming to work on subprojects that are:

  1. Posed in terms that are familiar to conventional ML;

  2. Interesting to solve from the conventional ML perspective;

  3. Whose solutions can be extended to the big issues in AI safety.

The point is not just to solve the sub-problems, but to solve them in ways that generalise or point to a general solution.

The aim is to iterate and improve fast on these ideas before implementing them. Because of that, these posts should be considered dynamic and likely to be re-edited, potentially often. Suggestions and modifications of the design are valuable and may get included in the top post.

Force model use and then detect it

Parent project: this is a subproject of the value learning project.


I see human values as residing, at least in part, in our mental models. We have a mental model of what might happen in the world, and we grade those outcomes as good or bad. In order to learn what humans value, the AI needs to be able to access the mental models underlying our thought processes.

Before starting on humans, with our messy brains, it might be better to start on artificial agents, especially neural-net-based ones that superficially resemble us.

The problem is that deep learning RL agents are generally model-free. Or, when they are model-based, the model is generally constructed explicitly, so that identifying it is as simple as saying “the model is in this sub-module, the one labelled ‘model’.”


The idea here is to force a neural net to construct a model within itself—a model that we can somewhat understand.

I can think of several ways of doing that. We could take a traditional deep learning agent that plays a game, but also force it to answer questions about various aspects of the game, identifying the values of certain features we have specified in advance (“how many spaceships are there on the screen currently?”). We can then use multi-objective optimisation with a strong simplicity prior/regulariser. This may force the agent to use the categories it has constructed to answer the questions in order to play the game.
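The training objective described above can be sketched as a single loss combining play performance, question-answering accuracy, and a simplicity penalty on the shared trunk. Everything here is a toy illustration with made-up shapes and names (`W_trunk`, `combined_loss`, a mean-squared stand-in for the game and question losses), not an implementation of any particular agent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared trunk: one hidden layer whose features must serve
# both the policy head and the question-answering head.
W_trunk = rng.normal(size=(16, 8))    # game state (16 dims) -> features (8 dims)
W_policy = rng.normal(size=(8, 4))    # features -> action logits
W_question = rng.normal(size=(8, 3))  # features -> answers ("how many ships?", ...)

def combined_loss(state, action_target, answer_target, lam=1.0, reg=1e-2):
    """Multi-objective loss: play well, answer the feature questions,
    and stay simple (an L1 penalty as a crude simplicity regulariser)."""
    feats = np.maximum(0.0, state @ W_trunk)  # shared ReLU features
    policy_loss = np.mean((feats @ W_policy - action_target) ** 2)
    answer_loss = np.mean((feats @ W_question - answer_target) ** 2)
    simplicity = reg * np.abs(W_trunk).sum()
    return policy_loss + lam * answer_loss + simplicity

state = rng.normal(size=16)
loss = combined_loss(state, np.zeros(4), np.zeros(3))
```

The hope is that, under a strong enough simplicity pressure, the cheapest way to satisfy both heads is to compute the named features once and route both the policy and the answers through them.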

Or we could be more direct. We could, for instance, have the neural net pass on instructions or advice to another entity that actually plays the game. The neural net sees the game state, but the other entity can only react in terms of the features we’ve laid down. So the neural net has to translate the game state into the features (this superficially resembles an autoencoder; autoencoders might be another way of achieving the aim).
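The bottleneck in that setup can be made concrete with a toy sketch: an encoder that may only emit a predeclared feature vocabulary, and a separate player that never sees the raw state. The feature names, the stub `encode`, and the `player` policy are all invented for illustration:

```python
# Hypothetical feature vocabulary we specify in advance.
FEATURES = {"n_spaceships", "player_x", "shield_up"}

def encode(game_state):
    """The neural net's role (here just a hand-written stub): translate
    the raw game state into ONLY the predeclared features."""
    return {
        "n_spaceships": int(game_state["ships"]),
        "player_x": float(game_state["pos"][0]),
        "shield_up": bool(game_state["shield"]),
    }

def player(features):
    """The separate entity that acts: it never sees the raw state, only
    the feature dictionary, so the encoder is forced to carry a model."""
    assert set(features) == FEATURES  # the bottleneck: nothing else gets through
    if features["shield_up"] and features["n_spaceships"] > 2:
        return "retreat"
    return "advance"

state = {"ships": 3, "pos": (0.5, 0.1), "shield": True}
action = player(encode(state))
```

Because every bit of information the player uses must pass through the named features, any competent play is evidence that the encoder has learned to compute something like a model of the game in our chosen vocabulary.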

Ideally, we may discover ways of forcing an agent to use a model without specifying the model ourselves; some approaches to transfer learning may work here, and it’s possible that GPT-3 and other transformer-based architectures already generate something that could be called an “internal model”.

Then, we go looking for that model within the agent. Here the idea is to use something like the OpenAI Microscope. That approach allows people to visualise what each neuron in an image classifier is reacting to, and how the classifier is doing its job. Similarly, we’d want to identify where the model resides, how it’s encoded and accessed, and similar questions. We can then modify the agent’s architecture to test whether these characteristics are general, or particular to the agent’s specific design.

Research aims

  1. See how feasible it is to force a neural-net-based RL agent to construct mental models.

  2. See how easy it is to identify these mental models within the neural net, and what characteristics they have (are they spread out, are they tightly localised, how stable are they, do they get reused for other purposes?).

  3. See how the results of the first two aims might lead to more research, or might be applied to AI-human interactions directly.