# Optimization Regularization through Time Penalty

For an overview of the problem of Optimization Regularization, or Mild Optimization, I refer to MIRI's paper Alignment for Advanced Machine Learning Systems, section 2.7.

# My solution

Start with a bounded utility function, $U$, that is evaluated based on the state of the world at a single time $t$ (ignoring for now that simultaneity is ill-defined in relativity). Example:

• If a human at time $t = 0$ (at the start of the optimization process) is shown the world state at time $t$, how much would they like it (mapped to the interval $[0, 1]$).

Then maximize $U(T) - \lambda T$, where $\lambda$ is a regularization parameter chosen by the AI engineer, and $T$ is a free variable chosen by the AI.

Time is measured from the start of the optimization process. Because the utility is evaluated based on the world at time $T$, this value is the amount of time the AI spends on the task. It is up to the AI to decide how much time it wants. Choosing $T$ should be seen as part of choosing the policy, or be included in the action space.

Because the utility function is bounded, the optimization process will eventually hit diminishing returns, and will then choose to terminate because of the time penalty.
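As a toy illustration of how the trade-off resolves, here is a minimal numerical sketch. The saturating form of $U$, the timescale, and the value of $\lambda$ are all my own assumptions, chosen only to exhibit diminishing returns; nothing in the proposal depends on them.

```python
import numpy as np

# Toy stand-in for the bounded utility U: diminishing returns with
# timescale tau, saturating at 1. The real U is unknown; the proposal
# only assumes it is bounded.
def U(t, tau=10.0):
    return 1.0 - np.exp(-t / tau)

lam = 0.02                            # regularization parameter, chosen by the engineer
ts = np.linspace(0.0, 200.0, 20001)   # candidate stopping times T

scores = U(ts) - lam * ts             # the regularized objective U(T) - lambda * T
T_opt = ts[np.argmax(scores)]

# For this toy U the optimum is T = tau * ln(1 / (lam * tau)) ~ 16.1,
# and no T beyond 1/lam = 50 can ever pay off, since U <= 1.
print(f"AI-chosen stopping time: T = {T_opt:.2f}")
```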

## Why time penalty?

Unbounded optimization pressure is dangerous. Without any form of regularization, we need to get the alignment exactly right. However, with regularization we merely need to get it almost exactly right, which I believe is much easier.

However, impact regularization has turned out to be very hard. We don't want the impact measure to depend on the AI's understanding of human values, because then it provides no extra safety. But a value-neutral impact measure is almost impossible, because the world has too many degrees of freedom. Time, however, is both value-neutral and has only a single degree of freedom.

## Why not use a fixed finite time horizon?

The reason $T$ is a variable chosen by the AI, and not a constant chosen by us, is that we don't know when the optimization process will start hitting diminishing returns. Leaving $T$ up to the AI solves this dynamically. In addition, we still get to choose a maximum time horizon through the choice of $\lambda$ and the bounds of $U$: the AI will never keep going beyond $T = (\sup U - \inf U)/\lambda$.
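To spell out why that bound holds (a one-line check that follows from the boundedness of $U$): any larger $T$ is dominated by stopping immediately, since

$$U(T) - \lambda T \;\le\; \sup U - \lambda T \;<\; \sup U - (\sup U - \inf U) \;=\; \inf U \;\le\; U(0) - \lambda \cdot 0.$$

In particular, with $U$ mapped to $[0, 1]$ as in the example above, the AI never benefits from taking more than $1/\lambda$ time.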

# What happens after $t = T$?

That depends on the AI design. Since we don't yet know how to build a general intelligence, we also don't know what will happen after $t = T$ in some default scenario.

However, we probably don't want a strong optimizer sitting around with no goal, because it is very unclear what that thing will do. Or if we accidentally give it some tiny incentive, it might then maximize that until the end of time.

E.g. if the AI has any uncertainty about whether it has reached time $T$ or not, it will keep maximizing conditioned on $t < T$, because those are the only worlds that count. As $t < T$ becomes less and less likely, the AI will act more and more crazy.

A way to solve this is to include an incentive for turning itself off, e.g. maximize $U(T) \cdot \mathrm{Off}(T) - \lambda T$, where $\mathrm{Off}(T) = 1$ if the AI is turned off at time $T$ and $0$ otherwise.

Unfortunately, I don't know of any good objective way to define "is turned off". The best definition I have thought of so far is:

Definition: The AI is turned off at time $T$ = if a human at time $t = 0$ is shown the world state at time $T$, they would agree that the AI and all its subagents and successor agents are turned off.

And if we are referring to human judgement anyway, we might as well throw in some more desirable things. Maximize $U(T) \cdot \mathrm{OffOK}(T) - \lambda T$, where $\mathrm{OffOK}(T) = 1$ if the AI is turned off and the world is OK at time $T$, and $0$ otherwise:

Definition: The AI is turned off and the world is OK at time $T$ = if a human at time $t = 0$ is shown the world state at time $T$, they would agree that the AI and all its subagents and successor agents are turned off, and that the world at time $T$ is not significantly worse off or in greater danger than at time $t = 0$.

Note that "the world is OK" is not necessary for the regularization to work. But I would still recommend including some explicit optimization pressure towards not destroying the world, either in $U$ or as an extra term. The regularization mainly stops the AI from Goodharting too hard; it does not do much to reduce side effects you have not even tried to specify.
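For concreteness, here is a minimal sketch of the gated objective above. Taking the human judgement as a simple 0/1 input is an illustrative assumption; how to actually obtain that judgement is, as discussed, the hard part.

```python
def regularized_score(U_T, T, off_and_ok, lam=0.02):
    """Score an outcome under the shutdown-gated objective.

    U_T        -- bounded utility of the world state at time T, in [0, 1]
    T          -- stopping time chosen by the AI
    off_and_ok -- 1 if a human at t=0, shown the world state at time T,
                  would agree the AI (and all subagents and successors) is
                  turned off and the world is OK; 0 otherwise
    lam        -- time-penalty coefficient chosen by the engineer
    """
    return U_T * off_and_ok - lam * T

# A plan scores well only if it both achieves utility and verifiably shuts down:
print(regularized_score(U_T=0.9, T=16.0, off_and_ok=1))  # 0.58
print(regularized_score(U_T=0.9, T=16.0, off_and_ok=0))  # -0.32
```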

# Some open problems

## How is time measured?

I think it is best if time refers to real physical time, and not clock ticks or number of computing operations. This is just an intuition at this point, but it seems to me that we get a better overall optimization regularizer if we penalize both computation and execution, because that is less likely to have loopholes. E.g. penalizing physical time is robust under delegation.

## How to make this compatible with General Relativity?

If $t$ measures physical time, then this is ill-defined in GR, and since we probably live in a universe described by GR or something similar, this is a big problem.

## Is there a better way to define "is turned off"?

It would be nice to have a definition of "is turned off" that does not rely on humans' ability to judge this, or on the AI's ability to model humans.

"The world is OK" is clearly a value statement, so for this part we will have to rely on some sort of value learning scheme.

# Acknowledgements

This suggestion is inspired by, and partly based on, ARLEA and discussions with John Maxwell. The idea was further developed in discussion with Stuart Armstrong.

# Comments

• If we ignore subagents and imagine a Cartesian boundary, "turned off" can easily be defined as: all future outputs are 0.

I also doubt that an AI working ASAP is safe in any meaningful sense. Of course, you can move all the magic into "human judges the world OK". If you make $\lambda$ large enough, your AI is safe and useless.

Suppose the utility function is 1 if a widget exists, else 0, where a widget is an easily buildable object that does not currently exist.

Suppose that ordering the parts through normal channels would take a few weeks. If the AI instead hacks the nukes and holds the world to ransom, then everyone at the widget factory will work nonstop, then drop dead of exhaustion.

Alternately, it might be able to bootstrap self-replicating nanotech in less time. The AI has no reason to care if the nanotech that makes the widget is highly toxic, and no reason to care if it has a shutoff switch or grey-goos the earth after the widget is produced.

"The world looks OK at time $T$" is not enough; you could still get something bad arising from the way seemingly innocuous parts were set up at time $T$. Being switched off and having no subagents in the conventional sense isn't enough either. What if the AI changed some physics data in such a way that humans would collapse the quantum vacuum state, believing the experiment they were doing was safe? Building a subagent is just a special case of having unwanted influence.

• I like this line of thought overall.

• How would we safely set $\lambda$?

• Isn't it still doing an argmax over plans and $T$, making the internal optimization pressure very non-mild? If we have some notion of embedded agency, one would imagine that doing the argmax would be penalized, but it's not clear what kind of control the agent has over its search process in this case.

• "But a value-neutral impact measure is almost impossible, because the world has too many degrees of freedom."

Can you explain why you think something like AUP requires value-laden inputs?

• Hey there!

I think this method works well as an extra layer of precaution to go along with another measure of reduced impact. On its own, it has a few issues, some of which you cover.

First of all, I'd replace the utility function with a reward function, specifically one that provides rewards for past achievements. Why? Well, in general, utility functions give too much of an incentive to keep control of the future. "Create a subagent and turn yourself off" is my general critique of these kinds of methods; if the subagent is powerful enough, the best policy for the agent could be to create it and then turn itself off at $T = 0$ or some similarly low number.

Having a reward function on past achievements precludes that, and it also means the agent is not incentivised to continue past $T$; indeed, part of the definition of the reward could be that it stops at $T$.
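One way to write down this variant (my formalization, treating time as discrete for simplicity; the comment itself gives no formula): the agent maximizes accumulated past rewards minus the time penalty,

$$\max_{\pi,\,T}\;\sum_{t=0}^{T} r_t \;-\; \lambda T,$$

where each $r_t$ depends only on what has already happened by time $t$, and $r_t = 0$ for $t > T$, so nothing the agent or a successor does after $T$ can add to the score.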

When using human judgements, normally the risk is that the AI is incentivised to fool us. Here, however, the AI is on a time budget, so it might find it easier to be "honest" than to put in the time and effort to fool us. Another approach is to use indifference, so that it doesn't care about the human decision ahead of time.

General relativity doesn't seem like much of an issue. Just pick a reference frame (say, one centred on the AI at time $t = 0$ and at rest relative to the AI at that moment) and define "$t = T$" as the hyperplane. Because of the (reverse) triangle inequality, any path the AI takes to reach this hyperplane will give it at most $T$ proper time in which to act. If we worry about wormholes and such, we could even define the relevant time to be the minimum of the reference frame time ($t$) and the AI's proper time ($\tau$), to be really sure that the AI doesn't get too much time to think.
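Spelling out the (reverse) triangle inequality step, in flat spacetime (the relevant approximation once wormholes are set aside): along any timelike worldline that ends on the hyperplane $t = T$, the elapsed proper time is

$$\tau \;=\; \int_0^T \sqrt{1 - v(t)^2/c^2}\;\mathrm{d}t \;\le\; \int_0^T \mathrm{d}t \;=\; T,$$

so no trajectory lets the AI experience more than $T$ of subjective time before the deadline.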

• I'm not convinced that relativity is really a problem: it looks to me like you can probably deal with it as follows. Instead of asking about the state of the universe at time $T$ and making $T$ one parameter in the optimization, ask about the state of the universe within a spacetime region including $O$ (where $O$ is a starting point somewhere around where the AI is to start operating), where now that region is a parameter in the optimization. Then instead of $\lambda T$, use $\lambda$ times some measure of the size of that region. (You might use something like total computation done within the region, but that might be hard to define, and as OP suggests it might not penalize everything you care about.) You might actually want to use the size of the boundary rather than of the region itself in your regularization term, to discourage gerrymandering; a possible formalization is sketched at the end of this comment. (Which might also make some sense in terms of physics because something something holographic principle something something, but that's handwavy motivation at best.)

Of course, optimizing over the exact extent of a more-or-less-arbitrary region of spacetime is much more work than optimizing over a single scalar parameter. But in the context we're looking at, you're already optimizing over an absurdly large space: that of all possible courses of action the AI could take.
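A possible formalization of this region-based variant (my notation; the comment does not pin one down): let $R$ range over spacetime regions containing the starting point $O$, let $\mu(\partial R)$ be some measure of the size of the boundary of $R$, and have the AI

$$\max_{\pi,\;R\,\ni\,O}\; U(R) \;-\; \lambda\,\mu(\partial R),$$

where penalizing the boundary rather than the volume is the anti-gerrymandering choice suggested above.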