Malign generalization without internal search

In my last post, I challenged the idea that inner alignment failures should be explained by appealing to agents which perform explicit internal search. I argued that we should instead appeal to the more general concept of malign generalization, and treat mesa-misalignment as a special case.

Unfortunately, the post was light on examples of what we should be worrying about instead of mesa-misalignment. Evan Hubinger wrote,

Personally, I think there is a meaningful sense in which all the models I’m most worried about do some sort of search internally (at least to the same extent that humans do search internally), but I’m definitely uncertain about that.

Wei Dai expressed confusion about why I would want to retreat to malign generalization without some sort of concrete failure mode in mind,

Can you give some realistic examples/scenarios of “malign generalization” that does not involve mesa optimization? I’m not sure what kind of thing you’re actually worried about here.

In this post, I will outline a general category of agents which may exhibit malign generalization without internal search, and then provide a concrete example of an agent in this category. Then I will argue that, rather than being a very narrow counterexample, this class of agents could be competitive with search-based agents.

The switch case agent

Consider an agent governed by the following general behavior:
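
(A minimal sketch of what I have in mind, written in Python; infer_world_state and the strategy_* subroutines are hypothetical placeholders for learned components.)

```python
# Hypothetical placeholders for learned components.
def infer_world_state(observation):
    """Perception: classify which of a fixed set of world-states we are in."""
    ...

def strategy_1(observation): ...   # fixed behavior for world-state 1
def strategy_2(observation): ...   # fixed behavior for world-state 2
def fallback(observation): ...     # behavior when no known state matches

def act(observation):
    # Dispatch on the inferred world-state. No action is chosen by ranking
    # candidates against an internal objective function.
    state = infer_world_state(observation)
    if state == 1:
        return strategy_1(observation)
    elif state == 2:
        return strategy_2(observation)
    else:
        return fallback(observation)
```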

It’s clear that this agent does not perform any internal search for strategies: it doesn’t operate by choosing actions which rank highly according to some sort of internal objective function. While you could potentially rationalize its behavior according to some observed-utility function, this would generally lead to more confusion than clarity.

However, this agent could still be malign in the following way. Suppose the agent is ‘mistaken’ about the state of the world. Say it believes that the state of the world is 1, whereas the actual state of the world is 2. Then it could take the wrong action, much like a person who is confident in a falsehood and makes catastrophic mistakes because of their error.

To see how this could manifest as bad behavior in our artificial agents, I will use a motivating example.

The red-seeking lunar lander

Suppose we train a deep reinforcement learning agent on the lunar lander environment from OpenAI’s Gym.

We make one crucial modification to our environment. During training, the landing pad is always painted red, and this is given to the agent as part of its observation of the world. We still reward the agent as normal for successfully landing on the landing pad.
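
(As a rough illustration of this setup, here is a minimal sketch using a Gym observation wrapper. The wrapper name RedPadWrapper and the red_patch_x feature are inventions for the example; the reward computed by the underlying environment is left untouched.)

```python
import numpy as np
import gym

class RedPadWrapper(gym.ObservationWrapper):
    """Hypothetical wrapper: append the x-position of the red-painted patch
    to the standard LunarLander observation. During training the red patch
    always coincides with the landing pad, so the two cues never come apart."""

    def __init__(self, env, red_patch_x=0.0):
        super().__init__(env)
        self.red_patch_x = red_patch_x
        low = np.append(env.observation_space.low, -np.inf).astype(np.float32)
        high = np.append(env.observation_space.high, np.inf).astype(np.float32)
        self.observation_space = gym.spaces.Box(low, high, dtype=np.float32)

    def observation(self, obs):
        # The agent sees where the red patch is; the reward is still computed
        # by the unmodified environment (landing between the two flags).
        return np.append(obs, self.red_patch_x).astype(np.float32)

env = RedPadWrapper(gym.make("LunarLander-v2"))
```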

Suppose that what really determines whether a patch of ground is a landing pad is whether it is enclosed by two flags. Nevertheless, rather than picking up on this true indicator, the agent may instead pick up on the proxy that held during training: namely, that landing pads are parts of the ground that are painted red.

Using the pseudocode from earlier and filling in some details, we could describe the agent’s behavior as something like this.
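
(The helper names below, such as red_patch_visible and steer_toward, are hypothetical stand-ins for whatever the trained network has actually learned.)

```python
# Hypothetical stand-ins for learned components of the policy.
def red_patch_visible(observation): ...
def red_patch_position(observation): ...
def steer_toward(target): ...         # thrust sequence aimed at the target
def stabilize_and_descend(): ...      # default descent behavior

def act(observation):
    # The 'state' the agent keys on is whether red ground is in view,
    # not whether the ground is enclosed by two flags.
    if red_patch_visible(observation):
        return steer_toward(red_patch_position(observation))
    else:
        return stabilize_and_descend()
```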

During deployment, this could end catastrophically. Suppose some crater is painted red but our landing pad is painted blue. Now the agent will guide itself competently towards the crater and miss the real landing pad entirely. That’s not what we wanted.

(ETA: If you think I’m using the term ‘catastrophically’ too loosely here, since the agent actually lands safely in a crater rather than crashing into the ground, imagine instead a lunar vehicle that, because it is confused, veers off into the red crater rather than sitting still and awaiting further instruction.)

What made the agent become malign

Above, I pointed to the reason why agents like ours could be malign: the agent was ‘mistaken’ about what counted as a landing pad. However, it’s worth noting that saying the agent is mistaken about the state of the world is really an anthropomorphization. It was actually perfectly correct in inferring where the red part of the world was; we just didn’t want it to go to that part of the world. We model the agent as being ‘mistaken’ about where the landing pad is, but it works equally well to model the agent as having goals that are counter to ours.

Since the malign failure doesn’t come from a pure epistemic error, we can’t merely expect that the agent will self-correct as it gains more knowledge about the world. Saying that it is making an epistemic mistake is just a model of what’s going on that helps us interpret its behavior, and it does not imply that this error is benign.

Imagining more complex agents

But what is there to worry about if this sort of thing only happens in very simple agents? Perhaps you think that only agents which perform internal search could ever reach the level of competence required to cause a real-world catastrophe?

I think these concerns about my example are valid, but I don’t believe they are compelling. In reply, I think the general agent superstructure I outlined in the initial pseudocode could reach very high levels of competence.

Consider an agent that could, during its operation, call upon a vast array of subroutines. Some of these subroutines can accomplish extremely complicated actions, such as “Prove this theorem: [...]” or “Compute the fastest route to Paris.” We then imagine that this agent still shares the basic superstructure of the pseudocode I gave above. In effect, the agent has an outer loop: it takes in observations from the real world and outputs action sequences depending on which state of the world it thinks it’s in, using the subroutines it has available.
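
(Concretely, and still just as a sketch: the subroutine names below, prove_theorem, fastest_route_to, and so on, are purely illustrative. Some of them may run searches internally, but the outer loop itself never does.)

```python
# Illustrative subroutines; some may perform powerful internal searches.
def prove_theorem(statement): ...
def fastest_route_to(city): ...
def default_routine(observation): ...
def infer_world_state(observation): ...   # same perception step as before

def act(observation):
    # The outer loop is still a switch on the inferred world-state; only the
    # branches have become arbitrarily capable subroutines.
    state = infer_world_state(observation)
    if state == "needs_proof":
        return prove_theorem("[...]")        # e.g. "Prove this theorem: [...]"
    elif state == "needs_travel":
        return fastest_route_to("Paris")     # "Compute the fastest route to Paris"
    else:
        return default_routine(observation)
```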

Since the subroutines can be arbitrarily complex, I don’t think there is any fundamental barrier to this agent achieving high levels of competence in the real world. Moreover, some subroutines could themselves perform powerful internal searches, pretty clearly obviating the competitive advantage that explicit search agents offer.

And even when some subroutines perform powerful internal searches, these subroutines aren’t the only source of our malign generalization concern. The agent’s behavior is still well described by the switch-case structure, which means that the failure mode of the agent being ‘mistaken’ about the state of the world remains. Therefore, it’s inaccurate to say that the source of malign generalization must be an internal search that is misaligned with the objective function we used during training.