Optimization Regularization through Time Penalty

For an overview of the problem of Optimization Regularization, or Mild Optimization, I refer to MIRI's paper Alignment for Advanced Machine Learning Systems, section 2.7.

My solution

Start with a bounded utility function, $U$, that is evaluated based on the state of the world at a single time $t$ (ignoring for now that simultaneity is ill-defined in Relativity). Examples:

  • If a human at time $0$ (at the start of the optimization process) is shown the world state at time $t$, how much would they like it (mapped to the interval $[0, 1]$)?

Then maximize $U(t) - \epsilon t$, where $\epsilon$ is a regularization parameter chosen by the AI engineer, and $t$ is a free variable chosen by the AI.

Time is measured from the start of the optimization process. Because the utility is evaluated based on the world at time $t$, this value is the amount of time the AI spends on the task. It is up to the AI to decide how much time it wants. Choosing $t$ should be seen as part of choosing the policy, or be included in the action space.

Because the utility function is bounded, the optimization process will eventually hit diminishing returns, and will then choose to terminate, because of the time penalty.
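As a toy illustration of this termination behavior (every functional form here is my own assumption, not part of the proposal): take a bounded utility with diminishing returns, $U(t) = 1 - e^{-t}$, and a penalty weight $\epsilon = 0.05$. The optimal stopping time comes out finite, well before the hard cap $1/\epsilon$.

```python
import math

EPS = 0.05  # regularization parameter epsilon, chosen by the engineer


def utility(t):
    # Hypothetical bounded utility with diminishing returns: U(t) in [0, 1)
    return 1.0 - math.exp(-t)


def objective(t):
    # The AI maximizes U(t) - eps * t over its own stopping time t
    return utility(t) - EPS * t


# Brute-force the AI's choice of stopping time over a fine grid on [0, 100)
ts = [i * 0.001 for i in range(100000)]
t_star = max(ts, key=objective)

print(t_star)  # close to ln(1/EPS) ~ 3.0, far below the cap 1/EPS = 20
```

Analytically, the optimum is where $U'(t) = \epsilon$, i.e. $t^* = \ln(1/\epsilon) \approx 3$ here; the AI stops as soon as the marginal utility of more time drops below the time penalty.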

Why time penalty?

Unbounded optimization pressure is dangerous. Without any form of regularization, we need to get the alignment exactly right. However, with regularization we merely need to get it almost exactly right, which I believe is much easier.

However, impact regularization has turned out to be very hard. We don't want the impact measure to depend on the AI's understanding of human values, because that would not provide extra safety. But a value-neutral impact measure is almost impossible, because the world has too many degrees of freedom. Time, however, is both value neutral and has only a single degree of freedom.

Why not use a fixed finite time horizon?

The reason $t$ is a variable chosen by the AI, and not a constant chosen by us, is that we don't know when the optimization process will start hitting diminishing returns. Leaving $t$ up to the AI solves this dynamically. In addition, we still get to choose a maximum time horizon through the choice of $\epsilon$ and the bound on $U$. With $U$ bounded to $[0, 1]$, the AI will never keep going beyond $t = 1/\epsilon$.
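Concretely, the hard cap follows directly from the bound on $U$. Assuming $0 \le U \le 1$:

```latex
U(t) - \epsilon t \;\le\; 1 - \epsilon t \;<\; 0
  \quad \text{whenever } t > \tfrac{1}{\epsilon},
\qquad \text{while} \qquad
U(0) - \epsilon \cdot 0 \;=\; U(0) \;\ge\; 0 .
```

Stopping immediately therefore dominates any $t > 1/\epsilon$, so the AI's chosen stopping time satisfies $t^* \le 1/\epsilon$.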

What happens after $t$?

That depends on the AI design. Since we don't yet know how to build a general intelligence, we also don't know what will happen after $t$ in some default scenario.

However, we probably don't want a strong optimizer sitting around with no goal, because it is very unclear what that thing will do. Or if we accidentally give it some tiny incentive, it might then maximize that until the end of time.

E.g. if the AI has any uncertainty about whether it has reached time $t$ or not, it will keep maximizing $U$ conditioned on $t$ not yet having been reached, because those are the only worlds that count. As that condition becomes less and less likely, the AI will act more and more crazy.

A way to solve this is to include an incentive for turning itself off: e.g. for some bonus weight $\delta > 0$, maximize $U(t) - \epsilon t + \delta \cdot [\text{AI is turned off at time } t]$.

Unfortunately, I don't know of any good objective way to define "is turned off". The best definition I have thought of so far is:

Definition: the AI is turned off at time $t$ = if a human at time $0$ is shown the world state at time $t$, they would agree that the AI and all its subagents and successor agents are turned off.

And if we are referring to human judgement anyway, we might as well throw in some more desirable things. Maximize $U(t) - \epsilon t + \delta \cdot [\text{AI is turned off and the world is OK at time } t]$, using:

Definition: the AI is turned off and the world is OK at time $t$ = if a human at time $0$ is shown the world state at time $t$, they would agree that the AI and all its subagents and successor agents are turned off, and that the world at time $t$ is not significantly worse off or in greater danger than at time $0$.

Note that "the world is OK" is not necessary for the regularization to work. But I would still recommend including some explicit optimization pressure towards not destroying the world, either in $U$ or as an extra term. The regularization mainly stops the AI from Goodharting too hard; it does not do much to reduce side effects you have not even tried to specify.

Some open problems

How is time measured?

I think it is best if time refers to real physical time, and not clock ticks or number of computing operations. This is just an intuition at this point, but it seems like we get a better overall optimization regularizer if we penalize both computation and execution, because that is less likely to have loopholes. E.g. penalizing physical time is robust under delegation.

How to make this compatible with General Relativity?

If $t$ measures physical time then this is ill-defined in GR, and since we probably live in GR or something similar, this is a big problem.

Is there a better way to define "is turned off"?

It would be nice to have a definition of "is turned off" that does not rely on humans' ability to judge this, or the AI's ability to model humans.

"The world is OK" is clearly a value statement, so for this part we will have to rely on some sort of value-learning scheme.


This suggestion is inspired by, and partly based on, ARLEA and discussions with John Maxwell. The idea was further developed in discussion with Stuart Armstrong.