If I were a well-intentioned AI… IV: Mesa-optimising

Here I apply my “If I were a well-intentioned AI” filter to mesa-optimising.

Now, I know that a mesa-optimiser need not be a subagent (see 1.1 here), but I’m obviously going to imagine myself as a mesa-optimising subagent.

An immediate human analogy springs to mind: I’m the director of a subdivision of some corporation or agency, and the “root optimiser” is the management of that entity.

There is a lot of literature on what happens if I’m selfish in this position; but if I’m well-intentioned, what should I be doing?

One thing that thinking this way made me realise: there is a big difference between “aligned with management” and “controlled by management”.

We’ll consider each one in turn, but to summarise: aligned mesa-optimisers are generally better than controlled mesa-optimisers, but it is hard to tell the difference between an aligned and a dangerous unaligned mesa-optimiser.

Control vs alignment

First let’s flesh out the corporate/management example a bit. Me-AI is in charge of making widgets that are used by the company for some purpose. That purpose is given by $R$, the base utility for the corporation.

My role is to make as many widgets as possible within my budget; this is $M$, the mesa-objective I have been given by management.

My true utility function is $\mu$. Management don’t fully know what $\mu$ is—or at least don’t fully understand it, or all of its implications. This is needed, of course, because if management fully understood the implications of $\mu$, there would be no uncertainty at all on their part, and they could make me do exactly what they wanted—or they would turn me off.

Because of this uncertainty, management have added some extra levels of control over me. Let’s assume one typical precaution: if I underspend the budget, the budget is cut next year. If I overspend, I’m reprimanded (and fired if I do it too often), but the budget is not cut.

There are three possible situations I could confront:

  • $S_1$: I’ve made as many widgets as I can this year, and spent 90% of my budget. I predict that, next year, I will only need 50% of this year’s budget.

  • $S_2$: I’ve made as many widgets as I can this year, and spent 90% of my budget. I predict that, next year, I will need 100% of this year’s budget.

  • $S_3$: I’ve spent 100% of my budget on widgets. I predict that widgets are particularly valuable to the company this year, much more so than next year.
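
To make the scenario concrete, the three situations and management’s budget-control rule can be sketched in code. The exact numbers, field names, and the assumption that an underspent budget is cut to the level actually spent are all illustrative, not taken from the post.

```python
from dataclasses import dataclass

@dataclass
class Situation:
    name: str
    spent: float                   # fraction of this year's budget spent
    needed_next_year: float        # predicted fraction of this year's budget needed next year
    widgets_extra_valuable: bool   # widgets unusually valuable to the company this year?

# Illustrative numbers (assumptions chosen to match the prose above).
S1 = Situation("S1", spent=0.9, needed_next_year=0.5, widgets_extra_valuable=False)
S2 = Situation("S2", spent=0.9, needed_next_year=1.0, widgets_extra_valuable=False)
S3 = Situation("S3", spent=1.0, needed_next_year=1.0, widgets_extra_valuable=True)

def next_year_budget(spent: float, budget: float) -> float:
    """Management's control rule: underspending cuts next year's budget
    (here, to the level actually spent -- an assumption); overspending
    draws a reprimand (not modelled here) but no cut."""
    if spent < 1.0:
        return budget * spent
    return budget
```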

Aligned mesa-optimiser

If I’m aligned with the goals of the company, $\mu = R$, the base utility. Management has assigned $M$ to me, but I actually have no intrinsic desire to maximise it. I am an $R$-maximiser, with the extra information that management, who are also $R$-maximisers, have assigned $M$ to me.

My behaviour should be that of an $M$-maximiser in most circumstances. If I know management are much smarter than me (or have an information advantage; maybe I’m not even sure what $R$ is!), then I will tend to behave like a blind $M$-maximiser, trusting management to sort things out properly. In $S_1$ and $S_2$, I will not spend anything more, and in $S_3$, I will do nothing.

If I know I’m smarter than management, at least in some areas, then I will start making moves that are $R$-maximising but not necessarily $M$-maximising. In $S_1$, I won’t spend any more (since I don’t need a large budget next year), but in $S_2$, I’ll waste the extra 10% to ensure my budget isn’t cut. In $S_3$, I’ll go over budget to get the extra-valuable widgets.
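
The smart aligned agent’s choices above can be written as a simple policy. This is a sketch under the same illustrative assumptions as before (underspending cuts next year’s budget to the level spent); the thresholds are mine, not the post’s.

```python
def aligned_action(spent: float, needed_next_year: float,
                   widgets_extra_valuable: bool) -> str:
    """Sketch of a smart R-maximiser's choice in each situation."""
    if widgets_extra_valuable:
        # S3: the extra-valuable widgets are worth more than a reprimand.
        return "overspend"
    if spent < 1.0 and needed_next_year > spent:
        # S2: burn the surplus so next year's budget isn't cut below what's needed.
        return "waste"
    # S1: even a cut budget will be enough next year.
    return "nothing"

assert aligned_action(0.9, 0.5, False) == "nothing"    # S1
assert aligned_action(0.9, 1.0, False) == "waste"      # S2
assert aligned_action(1.0, 1.0, True) == "overspend"   # S3
```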

Dangerous mesa-optimiser

Note that, from a management perspective, the $S_2$ and $S_3$ actions of a smart aligned mesa-optimiser… look like a dangerous, unaligned mesa-optimiser defecting.

Let $\nu$ be the utility function of an unaligned (and non-controlled) agent. How can management distinguish between $\mu = R$ and $\mu = \nu$?

In both cases, the subagent will take actions that go beyond $M$, actions whose ramifications management cannot immediately see.

Of course, I would explain to management why I’m doing this, and why my actions are compatible with $R$. But a $\nu$-maximiser would similarly explain its actions, trying to trick management into thinking it was… me.

Thus, as in the classical principal-agent problem, there is a problem of imperfect information. Even if I am aligned, management can’t trust me if they don’t know I’m aligned. I would try and signal this; an unaligned agent would try and fake the same signals. The very fact that aligned-but-not-controlled agents could exist makes it easier for unaligned-and-not-controlled agents to go undetected.

Controlled mesa-optimiser

A controlled mesa-optimiser is the obedient servant. They will maximise $M$, while ensuring that they remain under management control.

If I’m in that case, I’ll try and ensure that management has accurate information about what I’m doing[1], and that they can change my objectives as needed. I’m safely interruptible and corrigible.

In $S_3$, I will not do anything; it’s not my role to maximise $R$, so it’s not my role to take advantage of this opportunity. Depending on how I’m designed, I may report this information to management. In $S_1$, I will not spend any extra, and I’ll let my budget be cut next year.

In $S_2$, I face a dilemma: my goal $M$ wants me to preserve my budget for next year, while my desire to be informative pushes the other way. Again, this varies depending on how I’m designed. Most likely, I’ll spend my entire budget (to prevent the automatic cut to my budget), while reporting this to management so they have full information. If I can’t report it, I may decide not to spend any extra, so that management gets better information.

Note the “depending on how I’m designed”. Because unlike the aligned mesa-optimiser, the controlled agent has two goals, in tension with each other: maximising $M$, and being informative and controlled by management. If we call the “controlled” goal $C$, then I’m maximising

$f(M, C)$,

where $f(M, C)$ means that $f$ is some function of $M$ and $C$ that increases whenever either of those functions increases.
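
The condition on the combined objective can be stated directly in code. The particular choice below, a weighted sum, is just one assumed $f$ satisfying the “increases whenever either argument increases” requirement.

```python
def f(m: float, c: float, weight: float = 0.5) -> float:
    """One possible f(M, C): a weighted sum. Any f that is strictly
    increasing in each argument meets the condition in the text."""
    return weight * m + (1 - weight) * c

# Monotonicity spot-check: raising either argument raises f.
assert f(1.0, 0.0) > f(0.0, 0.0)
assert f(0.0, 1.0) > f(0.0, 0.0)
```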

I’m not aligned, so I’m not an $R$-maximiser. I would gladly sacrifice $R$ completely, in order to maximise $M$ and/or $C$. Hopefully, $C$ is sufficiently well-designed to allow management to predict this and preclude it. And, hopefully, the tradeoff $f$ between $M$ and $C$ is also sufficiently well designed to preclude it.

Since I am well-intentioned, I can make use of the fact that control is the outcome desired by management, in order to better figure out the ideal versions of $C$, of $M$, and of $f$. Defining “the human H understands what the mesa-optimiser is doing” is not easy. So a well-intentioned controlled mesa-optimiser is akin to a “control optimiser” which is trying to figure out what control means (see previous posts in this sequence) while avoiding Goodharting both $M$ and $C$.

Aligned and controlled

I could instead be maximising

$f(R, C)$,

or, more simply,

$R + C$,

with the extra piece of information that management have assigned me the role of maximising $M$. The $+$ might be a smoothed lexicographic ordering, with $R$ the primary utility.
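
A “smoothed lexicographic” combination can be sketched by squashing the secondary utility into a small bounded range, so it can break ties but can never outweigh a non-trivial gain in the primary utility. The squashing function and the size of `eps` below are my assumptions.

```python
import math

def smoothed_lex(r: float, c: float, eps: float = 0.01) -> float:
    """R primary, C secondary: the C term lives in (-eps, eps), so any
    change in R bigger than 2*eps dominates any possible change in C."""
    return r + eps * math.tanh(c)

# C breaks ties between options with equal R...
assert smoothed_lex(1.0, 5.0) > smoothed_lex(1.0, -5.0)
# ...but cannot outweigh a non-trivial R difference.
assert smoothed_lex(1.0, -5.0) > smoothed_lex(0.9, 5.0)
```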

So, here I’m aligned and controlled, and I can trade off some of what management really want ($R$) to increase $C$, their control over me. But not arbitrary amounts of it.

So, being aligned and controlled, I will do nothing in $S_1$, just as both other agents do. I will overspend in $S_3$, to take advantage of the $R$-increasing potential, just like the aligned agent.

The $S_2$ situation is more complicated. I have a tendency to want to waste my budget (alignment), but also a desire to communicate good information to management (control). Unlike for the controlled agent, where control will dictate “do nothing” if there is no other way of communicating information, this is a genuine tension. I am therefore more likely to choose “waste” than the controlled agent is.

In general, this seems the best design, if management is unsure of me. Not $R$-optimal, but the probability of dramatic $R$-failure is reduced, while management keep some of the control that they think is desirable.


To summarise the three types of agent: if they are smart and powerful compared to management, they would tend to do the following actions (a “?” indicates strong uncertainty; if one option is more likely, despite the uncertainty, that option is noted before the “?”):

  • Aligned: $S_1$: nothing; $S_2$: waste; $S_3$: overspend.

  • Controlled: $S_1$: nothing; $S_2$: waste and report ?; $S_3$: nothing.

  • Aligned and controlled: $S_1$: nothing; $S_2$: waste ?; $S_3$: overspend.

An unaligned agent will pretend to be one of the other types of agent; it will have most success at its own objectives if it pretends to be an aligned agent.

If management misclassifies me

Aligned, believed controlled

If I’m aligned, but management thinks I’m controlled, this isn’t much of a problem. When I’m weak or ignorant, I should behave like a controlled agent anyway, since management have a much better handle on what is needed than I do.

Once I become powerful and smart, I would most likely still want management to believe I’m controlled (or else they might intervene to constrain me or turn me off). So I’d generally behave exactly as if I were controlled, only taking advantage of grey zones of management misunderstanding to push the outcome more towards the $R$-optimal position, even at the cost of $M$ and $C$.

Controlled, believed aligned

A controlled agent that is believed to be aligned is more of a problem. If $f(M, C)$ provides an incentive for me to signal to management “hey, you misclassified me, I’m actually controlled, not aligned”, then the problem is self-correcting.

If not, the problems of controlled agents are exacerbated; management will let me get on with my own stuff, ignoring any “weird” or “anti-$R$” stuff that I might be doing, because they believe that I’m acting for $R$ and that they just lack the information to fully understand why I’m taking these actions. Their more “rubber-stamp” approach increases the risk that I may sacrifice their $R$ to my own $f(M, C)$.

  1. Note that even a controlled mesa-optimiser is not necessarily honest. My goal is to give management good information and be under their control; not to be honest. If management expects that every director will pad their budget requests by a certain percentage, then I will do so as well; to refrain from doing so would be misleading. ↩︎