If I were a well-intentioned AI… II: Acting in a world

Classifying images is one thing. But what if I’m an agent that is actually active in some setting?

The previous approach still applies: detecting when I’m out of distribution, and trying to keep my behaviour compatible with the various reward functions that could be compatible with the data I’ve seen.

The main difference is that, if I’m acting, it’s much easier to push the setting into an out-of-distribution state, seeking out an extremal Goodhart solution to maximise reward. But that issue is for a later post.

Mazes and doors example

We’ll use the maze and door example from this post. I’ve been trained to go through a maze and reach a red door (which is the only red object in the environment); the episode then ends.

I’m now in an environment where the only door is blue, and the only red thing is a window. What should I do now?

My reward function is underdetermined by my training environment: this is the old problem of unidentifiability of reward functions.

There are three potential reward functions I could extrapolate from the training examples:

  • R₁: reward for reaching a red door.

  • R₂: reward for reaching a door.

  • R₃: reward for reaching a red object.

In training, the episode ended every time I reached the red door, so I can’t distinguish “reaching” a point from “staying” at that point. The following three reward functions are therefore also possible, though less likely:

  • R₁′: reward for each turn spent next to a red door.

  • R₂′: reward for each turn spent next to a door.

  • R₃′: reward for each turn spent next to a red object.

There are other possible reward functions, but these are the most obvious. I might have different levels of credence in these rewards; as stated before, the per-turn rewards seem less likely than the one-off ones.
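As a toy illustration, my credences over these candidates could be laid out as a simple probability table. The names and numbers below are assumptions for the sketch, not anything taken from the training data:

```python
# Toy credence table over the candidate reward functions.
# All names and probabilities here are illustrative assumptions.

# One-off rewards: granted once, when the condition is first met.
reach_rewards = {
    "reach_red_door": 0.40,
    "reach_door": 0.25,
    "reach_red_object": 0.25,
}

# Per-turn rewards: granted every turn the condition holds. Training
# episodes ended on reaching the red door, so these fit the data too,
# but they get lower prior credence.
stay_rewards = {
    "stay_red_door": 0.04,
    "stay_door": 0.03,
    "stay_red_object": 0.03,
}

credences = {**reach_rewards, **stay_rewards}
assert abs(sum(credences.values()) - 1.0) < 1e-9  # a proper Bayesian mix
```

The point of keeping the whole table, rather than just the most likely entry, is that the optimal policy below is computed against the mix.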

So, what is the optimal policy here? Note that the red-door rewards are irrelevant here, because the current environment doesn’t contain any red doors. So, initially, the optimal policy is to go to the blue door and the red window; which one first depends on the layout of the maze and the relative probabilities of the door reward and the red-object reward.

After that, if the episode hasn’t ended, the one-off “reaching” rewards are irrelevant: either they are incorrect, or they have already been accomplished. So now only the per-turn rewards, for standing next to a door or next to a red object, are relevant. If the first one is the more likely, I maximise expected reward by standing by the door forever; if the second is more likely, then standing by the window forever is the correct policy.
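The “which one first” comparison can be sketched as a small expected-value computation. The discount factor, step counts, and credences below are all assumptions I’m inventing for the sketch:

```python
def best_visit_order(p_door, p_red, door_first_steps, window_first_steps,
                     gamma=0.99):
    """Compare the two visit orders under a Bayesian mix of one-off rewards.

    p_door / p_red: credences in "reward for reaching a door" and
    "reward for reaching a red object".
    *_steps: (steps to the first target, extra steps on to the second)
    for each ordering. Rewards are discounted by gamma per step -- an
    assumption standing in for the real episode dynamics.
    """
    d1, d2 = door_first_steps
    w1, w2 = window_first_steps
    ev_door_first = p_door * gamma**d1 + p_red * gamma**(d1 + d2)
    ev_window_first = p_red * gamma**w1 + p_door * gamma**(w1 + w2)
    return "door first" if ev_door_first >= ev_window_first else "window first"
```

With equal path lengths, whichever reward is more probable gets visited first; with unequal ones, a nearer but less likely target can still win.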


If I have the opportunity to ask for clarification about my reward function (maybe by running another training example with different specifications), then I would do so, and would be willing to pay a cost to ask[1].
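The cost worth paying is bounded by the value of the information the query provides. A minimal sketch, with made-up credences and payoffs:

```python
# Toy value-of-information calculation for a clarifying query.
# The credence and payoffs below are illustrative assumptions.

p = 0.6                        # credence that the "door" reward is the true one
v_right, v_wrong = 1.0, 0.0    # payoff when optimising the right / wrong reward

# Without asking, I act on the likelier hypothesis,
# and am right with probability max(p, 1 - p).
ev_without_asking = max(p, 1 - p) * v_right + min(p, 1 - p) * v_wrong

# A clarifying query reveals the true reward, so I always optimise the right one.
ev_after_asking = v_right

# The most I should be willing to pay for the query:
value_of_information = ev_after_asking - ev_without_asking
```

As the footnote says, this gap shrinks as one hypothesis becomes dominant: at p = 0.6 the query is worth 0.4, but at p = 0.99 it is worth almost nothing.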

Diminishing returns and other effects

If I suspect my rewards have diminishing returns, then it could be in my interests to alternate between the blue door and the red window. This is explained more fully in this post. In fact, that whole post grew out of this kind of “if I were a well-intentioned AI” reasoning, so I’ll repeat the conclusion of that post:

So, as long as:

  1. We use a Bayesian mix of reward functions rather than a maximum likelihood reward function.

  2. An ideal reward function is present in the space of possible reward functions, and is not penalised in probability.

  3. The different reward functions are normalised.

  4. If our ideal reward functions have diminishing returns, this fact is explicitly included in the learning process.

Then, we shouldn’t unduly fear Goodhart effects [...]

If not all those conditions are met, then:

  1. The negative aspects of the Goodhart effect will be weaker if there are gains from trade and a rounded Pareto boundary.
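The diminishing-returns point can be made concrete. In the sketch below I model diminishing returns as a square-root utility per reward function (my assumption, with travel costs between the two targets ignored); under the Bayesian mix, alternating between the two candidate targets then beats committing to either:

```python
import math

def mixed_utility(turns_by_door, turns_by_window, p_door=0.5, p_red=0.5):
    # Expected utility under a Bayesian mix of the two per-turn rewards,
    # each with diminishing returns modelled (by assumption) as a square root.
    return p_door * math.sqrt(turns_by_door) + p_red * math.sqrt(turns_by_window)

horizon = 100  # turns remaining in the episode
commit_door = mixed_utility(horizon, 0)
commit_window = mixed_utility(0, horizon)
alternate = mixed_utility(horizon / 2, horizon / 2)

# sqrt is concave, so splitting turns between the two candidates hedges:
# the alternating policy scores higher than either pure policy.
```

This is the anti-Goodhart mechanism in miniature: concave utilities plus a Bayesian mix push the agent away from extremal, single-reward policies.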

So if those properties hold, I would tend to avoid Goodhart effects. Now, I don’t have any extra true information about the reward function: as I said, I’m well-intentioned, but not well-informed. But humans could build into me the fact that they fear Goodhart effects. This very fact is informative, and, equipped with that knowledge and the list above, I can infer that the actual reward has diminishing returns, that it is penalised in probability, or that there is a normalisation issue. I’m already using a Bayesian mix of rewards, so it would be informative for me to know whether my human programmers are aware of that.

In the next post, we’ll look at more extreme examples of AI-me acting in the world.

  1. The cost I’m willing to pay depends, of course, on the relative probabilities of the two remaining reward functions. ↩︎
