Is the term mesa optimizer too narrow?

In the post introducing mesa optimization, the authors defined an optimizer as

a system [that is] internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system.

The paper continues by defining a mesa optimizer as an optimizer that was selected by a base optimizer.

However, there are a number of issues with this definition, as some have already pointed out.

First, I think that by this definition humans are clearly not mesa optimizers. Most of the optimization we do is implicit. Yet humans are supposed to be the prototypical examples of mesa optimizers, which appears to be a contradiction.

Second, the definition excludes perfectly legitimate examples of inner alignment failures. To see why, consider a simple feedforward neural network trained by deep reinforcement learning to navigate my Chests and Keys environment. Since “go to the nearest key” is a good proxy for getting the reward, the neural network simply returns the action that, given the board state, moves the agent closer to the nearest key.
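To make this concrete, here is a minimal sketch of the kind of behavior such a network could implement. This is my own illustration, not the actual Chests and Keys code or the trained network; the grid encoding and the Manhattan-distance tie-breaking are assumptions. The point is that there is no search over plans and no explicitly represented objective, just a fixed mapping from board state to action.

```python
# Hypothetical grid world: positions are (x, y) tuples. This hand-codes the
# heuristic the trained network is conjectured to have learned; it is not the
# real environment or policy.
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def distance(a, b):
    """Manhattan distance between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def nearest_key_policy(agent_pos, key_positions):
    """Return the action that most reduces the distance to the closest key."""
    if not key_positions:
        return "up"  # arbitrary fallback once every key has been collected
    target = min(key_positions, key=lambda k: distance(agent_pos, k))
    return min(
        ACTIONS,
        key=lambda a: distance((agent_pos[0] + ACTIONS[a][0],
                                agent_pos[1] + ACTIONS[a][1]), target),
    )
```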

Is the feedforward neural network optimizing anything here? Hardly; it’s just applying a heuristic. Note that you don’t need anything like an internal A* search to find keys in a maze, because in many environments, following a wall until a key is within sight and then performing a very shallow search (which doesn’t have to be explicit) works fairly well.
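In the same hedged spirit, here is a rough hand-written version of that wall-following strategy: keep a wall on the agent’s right until a key comes into sight, then step straight toward it. The line-of-sight check and the right-hand rule below are illustrative assumptions rather than details of any real environment; the point is only that competent key-finding needs nothing like an explicit internal search.

```python
ACTIONS = {"up": (0, -1), "right": (1, 0), "down": (0, 1), "left": (-1, 0)}
ORDER = ["up", "right", "down", "left"]  # clockwise, used by the right-hand rule

def heuristic_policy(agent, heading, keys, walls):
    """agent: (x, y); heading: a key of ACTIONS; keys, walls: sets of (x, y)."""

    def in_sight(key):
        # A key is visible if it shares a row or column with the agent
        # and no wall lies between them (an assumed visibility rule).
        (ax, ay), (kx, ky) = agent, key
        if ax == kx:
            return all((ax, y) not in walls
                       for y in range(min(ay, ky) + 1, max(ay, ky)))
        if ay == ky:
            return all((x, ay) not in walls
                       for x in range(min(ax, kx) + 1, max(ax, kx)))
        return False

    visible = [k for k in keys if in_sight(k)]
    if visible:
        # Shallow, purely local choice: step straight toward the seen key.
        (ax, ay), (kx, ky) = agent, visible[0]
        if kx != ax:
            return "right" if kx > ax else "left"
        return "down" if ky > ay else "up"

    # No key in sight: follow the wall by preferring to turn right,
    # then go straight, then turn left, then reverse.
    i = ORDER.index(heading)
    for turn in (ORDER[(i + 1) % 4], heading, ORDER[(i - 1) % 4], ORDER[(i + 2) % 4]):
        dx, dy = ACTIONS[turn]
        if (agent[0] + dx, agent[1] + dy) not in walls:
            return turn
    return heading  # completely boxed in
```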

As far as I can tell, Hjalmar Wijk introduced the term “malign generalization” to describe the failure mode I think is most worth worrying about here. In particular, malign generalization happens when you train a system on objective function X, but at deployment the system actually does Y, where Y is so bad that we’d prefer the system to fail completely. To me at least, this seems like a far more intuitive and less theory-laden way of framing inner alignment failures.

This way of reframing the issue lets us keep the old terminology (that we are concerned with capability robustness without alignment robustness) while dropping all unnecessary references to mesa optimization.

Mesa optimizers could still form a natural class of things that are prone to malign generalization. But if even humans are not mesa optimizers, why should we expect mesa optimizers to be the primary real-world examples of such inner alignment failures?