(C)IRL is not solely a learning process

A putative new idea for AI control; index here.

I feel Inverse Reinforcement Learning (IRL) and Cooperative Inverse Reinforcement Learning (CIRL) are very good ideas, and will likely be essential for safe AI if we can’t come up with some sort of sustainable low impact, modular, or Oracle design. But IRL and CIRL have a weakness. In a nutshell:

1. The models (C)IRL uses for humans are underspecified.
2. This should cause CIRL to have motivated and manipulative learning.
3. Even without that, (C)IRL can end up fitting a terrible model to humans.
4. To solve those issues, (C)IRL will need to make creative modelling decisions that go beyond (standard) learning.


In a nutshell within the nutshell, (C)IRL doesn’t avoid the main problems that other learning approaches have. Let’s look at each of these points in turn.

The models (C)IRL uses for humans are underspecified

This shouldn’t be in doubt. CIRL doesn’t have a proper model of a human, beyond an agent that “knows the reward function”. Standard IRL has even less: an expert policy, or a set of sampled trajectories (examples of human performance). There have been efforts to add noise to the model of human behaviour, but only in a very simplistic way that doesn’t model the full range of human irrationality (see some examples here).

Of course, given a diverse enough prior, a correct model of human irrationality will be included, but the human remains underspecified.
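To make “underspecified” concrete, here is a minimal sketch (in Python; the variable names, numbers, and example trajectory are mine, purely for illustration) of roughly the kind of noisy human model that IRL work typically assumes: a Boltzmann-rational agent whose entire departure from perfect rationality is one temperature parameter.

```python
import numpy as np

# A Boltzmann-rational "human": picks action a in state s with probability
# proportional to exp(beta * Q(s, a)).  The single scalar beta is the whole
# model of human irrationality.

def boltzmann_policy(q_values, beta=1.0):
    """Action probabilities for a noisily rational human."""
    prefs = beta * np.asarray(q_values, dtype=float)
    prefs -= prefs.max()                 # subtract max for numerical stability
    weights = np.exp(prefs)
    return weights / weights.sum()

def log_likelihood(q_table, trajectory, beta=1.0):
    """Log-likelihood of observed (state, action) pairs under that model."""
    total = 0.0
    for state, action in trajectory:
        total += np.log(boltzmann_policy(q_table[state], beta)[action])
    return total

# Hypothetical example: two states, three actions, a short observed trajectory.
q_table = {0: [1.0, 0.2, 0.2], 1: [0.0, 0.5, 1.5]}
trajectory = [(0, 0), (1, 2), (1, 1)]
print(log_likelihood(q_table, trajectory, beta=2.0))

# Everything the human actually is (biases, habits, conflicting drives) has to
# be squeezed into, or explained away by, the single parameter beta.
```

Whatever richer structure real humans have, a model like this can only register it as a change in one noise parameter.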

This should cause CIRL to have motivated and manipulative learning

CIRL is not immune to the usual pressures towards manipulative learning that act on any agent whose goal is specified in terms of what the agent learns.

To illustrate with an example: suppose first that the CIRL agent models the human as being perfectly rational, free of error or bias. Then, assuming the CIRL agent can also predict and manipulate human behaviour, it can force the human to confirm (through action or speech) that some particularly-easy-to-maximise reward function is the correct one.

But the CIRL agent is unlikely to have only this “rationality model”. It may have a large variety of models, and maybe some explicit meta-preferences. But the same pressure applies: the agent will attempt to manipulate the update of those models similarly, all to force the learnt reward towards something particularly easy to maximise.
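As a toy illustration of that pressure (the two hypotheses, the numbers, and the scoring rule below are all invented for the sketch): an agent that scores its possible utterances by the reward it expects to be able to extract afterwards will rate a leading question above a neutral one.

```python
# Toy setup: two reward hypotheses, and how much reward the agent could extract
# once it has settled on each of them.  All numbers are made up.
achievable = {"R_easy": 100.0, "R_hard": 10.0}

# Two things the agent could say to the human, and the posterior over reward
# hypotheses it expects to hold afterwards.  The leading question shifts the
# posterior towards the easy-to-maximise hypothesis.
queries = {
    "neutral_question": {"R_easy": 0.5, "R_hard": 0.5},
    "leading_question": {"R_easy": 0.9, "R_hard": 0.1},
}

def expected_payoff(posterior):
    # Crude score: assume the agent ends up adopting each hypothesis with
    # probability equal to its posterior weight, then maximises that reward.
    return sum(posterior[r] * achievable[r] for r in posterior)

for name, posterior in queries.items():
    print(name, expected_payoff(posterior))
# neutral_question 55.0
# leading_question 91.0
# The leading question scores higher, so an agent maximising its learnt reward
# "prefers" the manipulative update, with no explicit anti-learning code anywhere.
```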

Partially defining terms like bias and bounded rationality doesn’t help here; since the agent is corrupting the learning process (from our perspective, though not from a formal perspective), it will fix its formal “bias” and “bounded rationality” terms to mean whatever it can make them mean.

Consider the concept of alief. An alief is an automatic or habitual belief-like attitude. For example, a person standing on a transparent balcony may believe that they are safe, but alieve that they are in danger.

This is the sort of concept that a purely learning AI would come up with if it were observing human behaviour, and it would allow the AI to model us better. But with the AI’s learning corrupted, aliefs and other concepts would merely allow it to equivocate between what is knowledge, what is bias, and what is preference.

Again, the corruption of the AI’s learning does not come from any explicit anti-learning programming, but merely from underspecified models and a desire to maximise the learnt reward.

Even without that, (C)IRL can end up fitting a terrible model to humans

AIXI has an incorrect self-model, so it can end up destroying itself. Similarly, if the space of possible models the AI considers is too narrow, it can end up fitting a wildly inappropriate model to human behaviour, forcing that model to fit as well as it can (this mis-fit is similar to AIs handling ontology shifts badly).

Even if the AI’s priors include an acceptable model of humans, it may still end up fitting different ones. It could model humans as a mix of conflicting subagents, or even something like “the hypothalamus is the human, the rest of the brain is this complicated noise”, and the model could fit, and fit very well, depending on what “complicated noise” it is allowed to consider.
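A toy illustration of how a bad decomposition can still score well (the numbers and the two model classes below are invented for the sketch): a “mostly complicated noise” model can beat a “coherent goal” model on raw likelihood while saying nothing about what the human values.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for observed human choices among four options; deliberately uniform,
# so no tidy "coherent goal" story describes them well.
actions = rng.integers(0, 4, size=200)

def loglik_values_model(data):
    # "The human mostly pursues one coherent goal": 70% of the mass on option 0.
    probs = np.array([0.7, 0.1, 0.1, 0.1])
    return np.log(probs[data]).sum()

def loglik_noise_model(data):
    # "A tiny piece of the human is the agent, the rest is complicated noise":
    # the noise term is simply fitted to the empirical frequencies.
    freqs = np.bincount(data, minlength=4) / len(data)
    return np.log(freqs[data]).sum()

print(loglik_values_model(actions))  # noticeably worse log-likelihood
print(loglik_noise_model(actions))   # better fit, but it says nothing about values
```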

To solve those issues, (C)IRL will need to make creative modelling decisions that go beyond (standard) learning

Imagine that we have somehow solved all the issues above: the CIRL agent is motivated to learn, correctly, about human values (and then end up maximising them). Somehow, we’ve ensured that it will consistently use definitional concepts like “bias” and “human knowledge” in the ways we would like it to.

It still has to resolve a lot of issues that we ourselves haven’t solved. Such as the tension between procrastination and obsessive focus. Or what population ethics it should use. Or how to resolve stated versus revealed preferences, and how to deal with beliefs in belief and knowledge that people don’t want to know.

Essentially, the AI has to be able to do moral philosophy exactly as a human would, and to do it well. Without us being able to define what “exactly as a human would” means. And it has to keep doing this as both it and humans change, and as we’re confronted by a world completely transformed, and by situations we can’t currently imagine.