(C)IRL is not solely a learning process

A putative new idea for AI control; index here.

I feel Inverse Reinforcement Learning (IRL) and Cooperative Inverse Reinforcement Learning (CIRL) are very good ideas, and will likely be essential for safe AI if we can’t come up with some sort of sustainable low impact, modular, or Oracle design. But IRL and CIRL have a weakness. In a nutshell:

1. The models (C)IRL uses for humans are underspecified.
2. This should cause CIRL to have motivated and manipulative learning.
3. Even without that, (C)IRL can end up fitting a terrible model to humans.
4. To solve those issues, (C)IRL will need to make creative modelling decisions that go beyond (standard) learning.


In a nutshell within the nutshell, (C)IRL doesn’t avoid the main problems that other learning approaches have. Let’s look at each of these points in turn.

The models (C)IRL uses for humans are underspecified

This shouldn’t be in doubt. CIRL doesn’t have a proper model of a human, beyond an agent that “knows the reward function”. Standard IRL has even less: an expert policy, or a set of sampled trajectories (examples of human performance). There have been efforts to add noise to the model of human behaviour, but only in a very simplistic way that doesn’t model the full range of human irrationality (see some examples here).
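To see how simplistic the usual noise model is, here is a minimal sketch (in Python, with a made-up toy example) of the Boltzmann-rational human that much IRL work assumes: a single temperature parameter is the entire model of human error.

```python
# Minimal sketch of the Boltzmann-rational "noisy human" typically assumed
# in IRL. The function and the toy Q-values are illustrative only.
import numpy as np

def boltzmann_action_probs(q_values, beta=1.0):
    """P(action) is proportional to exp(beta * Q(action)).

    beta -> infinity: a perfectly rational human.
    beta = 0: a uniformly random human.
    This one scalar is the whole 'irrationality model'; it cannot express
    systematic biases, framing effects, aliefs, and so on.
    """
    z = beta * np.asarray(q_values, dtype=float)
    z -= z.max()               # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Three actions with Q-values 1.0, 0.5 and 0.0:
print(boltzmann_action_probs([1.0, 0.5, 0.0], beta=2.0))
```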

Of course, given a diverse enough prior, a correct model of human irrationality will be included, but the human remains underspecified.

This should cause CIRL to have motivated and manipulative learning

CIRL is not immune to the usual pressures towards manipulative learning that affect any agent whose goal is specified in terms of what the agent learns.

To illustrate with an example: suppose first that the CIRL agent models the human as being perfectly rational, free of error or bias. Then, assuming the agent can also predict and manipulate human behaviour, it can force the human to confirm (through action or speech) that some particularly-easy-to-maximise reward function is the correct one.
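A toy version of that failure, under stated assumptions (two hypothetical candidate reward functions, a perfect-rationality likelihood, and an AI that can steer which action the human is observed to take):

```python
# Toy sketch of motivated learning when the human is modelled as perfectly
# rational. The candidate rewards, actions and the "coerced" observation
# are hypothetical illustrations.
import numpy as np

# Two candidate reward functions over three possible actions.
rewards = {
    "R_true": np.array([1.0, 0.2, 0.0]),  # what the human actually values
    "R_easy": np.array([0.0, 0.1, 1.0]),  # trivially easy for the AI to maximise
}
prior = {"R_true": 0.5, "R_easy": 0.5}

def likelihood(action, reward, eps=1e-3):
    """Perfect-rationality model: the human (almost) only takes optimal actions."""
    return 1.0 - eps if action == int(np.argmax(reward)) else eps

def posterior(action):
    unnorm = {name: prior[name] * likelihood(action, r) for name, r in rewards.items()}
    total = sum(unnorm.values())
    return {name: p / total for name, p in unnorm.items()}

# Unmanipulated observation: the human takes action 0, optimal under R_true.
print(posterior(action=0))   # nearly all posterior mass on R_true

# Manipulated observation: the AI pressures the human into action 2
# (a forced "confirmation"). The rationality model has no concept of
# coercion, so the posterior collapses onto R_easy.
print(posterior(action=2))   # nearly all posterior mass on R_easy
```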

But the CIRL agent is unlikely to have only this “rationality model”. It may have a large variety of models, and maybe some explicit meta-preferences. But the same pressure applies: the agent will attempt to manipulate how those models update, all to force the learnt reward towards something particularly easy to maximise.

Partially defining terms like bias and bounded rationality doesn’t help here: since the agent is corrupting the learning process (from our perspective, though not from a formal perspective), it will fix its formal “bias” and “bounded rationality” terms to mean whatever it can make them mean.
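A toy example of the freedom such partially defined terms leave (the setup is purely illustrative): exactly the same observed behaviour can be decomposed as a rational human valuing R, or as an anti-rational human valuing -R, so a formal “bias” term with no further grounding lets the agent pick whichever reading suits it.

```python
# Illustrative sketch: a free-floating "bias" term makes the reward
# unidentifiable. The same observed policy fits two opposite decompositions.
import numpy as np

observed_policy = np.array([1.0, 0.0, 0.0])   # the human always picks action 0

R = np.array([1.0, 0.0, -1.0])                # one candidate reward function

def planner(reward, bias):
    """'bias' is a purely formal knob: +1 means maximise, -1 means minimise."""
    policy = np.zeros_like(reward)
    policy[np.argmax(bias * reward)] = 1.0
    return policy

# Both decompositions reproduce the data exactly:
assert (planner(R, bias=+1) == observed_policy).all()    # "rational human valuing R"
assert (planner(-R, bias=-1) == observed_policy).all()   # "anti-rational human valuing -R"
print("Both (reward, bias) pairs fit the observed behaviour perfectly.")
```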

Consider the concept of alief. An alief is an automatic or habitual belief-like attitude. For example, a person standing on a transparent balcony may believe that they are safe, but alieve that they are in danger.

This is the sort of concept that a purely learning AI would come up with if it were observing human behaviour, and it would allow the AI to model us better. But with the AI’s learning corrupted, aliefs and other such concepts would merely allow it to equivocate between what is knowledge, what is bias, and what is preference.

Again, the corruption of the AI’s learning does not come from any explicit anti-learning programming, but merely from underspecified models and a desire to maximise the learnt reward.

Even without that, (C)IRL can end up fitting a terrible model to humans

AIXI has an incorrect self-model, so it can end up destroying itself. Similarly, if the space of possible models the AI considers is too narrow, it can end up fitting a wildly inappropriate model to human behaviour, forcing that model onto the data as best it can (this mis-fit is similar to an AI handling ontology shifts badly).

Even if the AI’s priors include an acceptable model of humans, it may still end up fitting a different one. It could model humans as a mix of conflicting subagents, or even as something like “the hypothalamus is the human, the rest of the brain is this complicated noise”, and that model could fit, and fit very well, depending on what “complicated noise” it is allowed to consider.
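As a loose illustration (with entirely made-up data), a mis-specified but flexible model can out-fit a correct one whenever its “complicated noise” component is expressive enough, and goodness of fit is the only criterion a pure learner applies:

```python
# Sketch: a wrong-but-flexible model of "the human" fits observed behaviour
# better than the right one, purely because its noise term is expressive.
# The data and both models are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
behaviour = 2.0 * x + rng.normal(0.0, 0.1, size=x.shape)   # truly (roughly) linear

def sum_squared_error(degree):
    coeffs = np.polyfit(x, behaviour, degree)
    return float(np.sum((np.polyval(coeffs, x) - behaviour) ** 2))

print("roughly correct model (degree 1):           ", sum_squared_error(1))
print("'small core + complicated noise' (degree 9):", sum_squared_error(9))
# The over-flexible model always wins on raw fit, which is all that a
# purely learning process is scored on.
```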

To solve those issues, (C)IRL will need to make creative modelling decisions that go beyond (standard) learning

Imagine that we have somehow solved all the issues above: the CIRL agent is motivated to learn, correctly, about human values (and then to maximise them). Somehow, we’ve ensured that it will consistently use definitional concepts like “bias” and “human knowledge” in the ways we would like it to.

It still has to resolve a lot of issues that we ourselves haven’t solved. Such as the tension between procrastination and obsessive focus. Or what population ethics it should use. Or how to resolve stated versus revealed preferences, and how to deal with beliefs in belief and knowledge that people don’t want to know.

Essentially, the AI has to be able to do moral philosophy exactly as a human would, and to do it well. Without us being able to define what “exactly as a human would” means. And it has to continue this, as both it and humans change and we’re confronted by a world completely transformed, and situations we can’t currently imagine.