# A universal score for optimizers

[Epistemic status: the idea is simple and intuitive, so I wouldn’t be surprised if it has appeared before. If so, I would appreciate pointers.]

Eliezer defines optimization power as the ability to produce solutions higher on the preference ordering. This feels right, but it would be nice to convert this intuition into a score that can be measured: a universal metric that allows one to meaningfully state that AlphaZero is better at Go than Deep Blue was at chess. In this post I try to do just that. I start by giving a candidate metric and some of its properties. I then discuss how it can be approximately computed in complex settings, and describe a somewhat paradoxical implication of adopting this definition for repeated games.

For simplicity I will assume a deterministic, finite-action-space environment (extending everything to MDPs with uncountable actions doesn’t seem to be a fundamental obstacle). The action space is $A$, the outcome space is $O$, and the utility function of the agent in question is $u: O \to \mathbb{R}$. Because the environment is deterministic, I will just write $u(a)$ for brevity. Note that we can represent sequential decision problems in this framework (e.g. Sudoku); elements of $A$ would then be vectors of individual actions. We write $n(S)$ for the size of a set $S$.

Definition. Suppose we observe an agent with utility function $u(a)$ take an action $a$; then we say this agent has a C-score of

$$C(u, a) = \log \frac{n(A)}{n(\{a' \in A : u(a') \ge u(a)\})}$$
Intuitively, the C-score is the log of the inverse of the fraction of outcomes that are at least as good for the agent as the one it managed to obtain. One interpretation of this formula is that the agent is competing against a bot that just picks a random action (random-bot), and then $C(u,a)$ is the negative log probability of losing to (or tying with) random-bot in this competition.

Some properties of C(u,a):

• C-score is independent of the computational power employed

• It’s impossible to compute C-score without knowing the utility function

• Improving from being in the top 1% of outcomes to the top 0.1% is as “hard” as improving from the top 10% to the top 1%

• One can use Monte-Carlo to generate lower bounds of the form “w.h.p. the C-score of this agent is at least Y”

• The strategy of taking a random action produces the same score in any setting (under some assumptions this might also be approximately true for the strategy of trying k random moves and picking the best one)
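The Monte-Carlo bound from the list above can be sketched concretely. A minimal illustration in Python, using the number-picking game described later in the post (the function names are mine, not from the post):

```python
import math
import random

def mc_c_score_estimate(utility, sample_action, achieved_utility, n_samples=100_000):
    """Monte-Carlo estimate of C = log(n(A) / n({a : u(a) >= u*})).

    Samples actions uniformly and counts how often a random action does
    at least as well as the agent did; w.h.p. the true C-score is at
    least the returned value minus the sampling error.
    """
    hits = sum(utility(sample_action()) >= achieved_utility
               for _ in range(n_samples))
    if hits == 0:
        # Never sampled the good set: we can only bound C from below.
        return math.log(n_samples)
    return math.log(n_samples / hits)

# Toy setting: actions are numbers 1..10, utility is the number itself,
# and the agent picked 8. The true C-score is log(10/3) ≈ 1.204.
random.seed(0)
score = mc_c_score_estimate(
    utility=lambda a: a,
    sample_action=lambda: random.randint(1, 10),
    achieved_utility=8,
)
print(round(score, 2))
```

This is exactly the estimator that breaks down when the good set is tiny, which is the problem the next section discusses.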

## Approximating C-score

Can we estimate C-score for complex environments and agents? For example, suppose we have an agent playing chess against a known distribution of opponents (or, more simply, against random-bot), and the utility function is the probability of winning the game (the action space is then a set of policies, i.e. mappings from state to action). Can we measure the C-score of this agent without using an unrealistic amount of compute?

Clearly, the naive Monte-Carlo approach I mentioned above is a no-go in this scenario: the probability that a randomly sampled action lands in the set of actions at least as good as the agent’s is tiny (if the agent is any good), and we will never sample this set.

I have a couple of potential ideas here:

• In case of having access to a better optimizer than the one being measured, one can use rare event sampling: bias the sampling procedure towards good policies and then account for this bias when computing the C-score. This approach nicely fits the intuition that “you need to be at least about as smart as the agent to measure how smart it is”.

• If the game is a DP/MDP and the agent employs value function estimation for picking moves, one can try looking at how much the agent improves when enhanced with MCTS, and then try inferring the C-score from that. I don’t have a good explanation for why and how this can work, only an intuition that it is a meaningful approach to try.
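The first idea can be made concrete with importance sampling: draw actions from a proposal distribution biased toward good actions, then reweight each hit by the ratio of the uniform density to the proposal density. A hedged sketch on the toy number game rather than chess (the setup and names are mine):

```python
import math
import random

def importance_sampled_c_score(utility, achieved_utility, propose,
                               proposal_prob, uniform_prob, n_samples=100_000):
    """Estimate C = -log P_uniform(u(a) >= u*) with a biased proposal q.

    Each sampled action that does at least as well as the agent contributes
    the importance weight uniform_prob / q(a), correcting for the bias.
    """
    total = 0.0
    for _ in range(n_samples):
        a = propose()
        if utility(a) >= achieved_utility:
            total += uniform_prob / proposal_prob(a)
    return -math.log(total / n_samples)

# Toy setting: actions 1..10, utility = the number, the agent picked 8.
# Proposal q(a) ∝ a favors good actions, so the "rare" good set is
# sampled often, and the reweighting keeps the estimate unbiased.
random.seed(0)
z = sum(range(1, 11))
score = importance_sampled_c_score(
    utility=lambda a: a,
    achieved_utility=8,
    propose=lambda: random.choices(range(1, 11), weights=range(1, 11))[0],
    proposal_prob=lambda a: a / z,
    uniform_prob=1 / 10,
)
print(round(score, 2))  # should be near the true value log(10/3) ≈ 1.204
```

The better the proposal concentrates on good actions, the lower the variance of the estimate, which is where having access to a stronger optimizer helps.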

## Repeated game paradox

I’ll use a very simple game to illustrate the issue. An agent picks a number between 1 and 10, and the agent’s utility equals the number chosen. This game is repeated T times, so the agent’s total utility is the sum of the numbers it returned over all T games.

If T=1 and the agent picks 8, we have $C = \log \frac{10}{3}$ (there are 3 out of 10 actions that yield utility of at least 8).

If T=2 and the agent picks 8 twice, we get

$$C = \log \frac{100}{15} = \log \frac{20}{3} > \log \frac{10}{3},$$

since only 15 of the 100 possible action pairs achieve a total utility of at least 16.

Thus an agent that finds a good solution once at T=1, and then repeats the same action until the end, obtains higher and higher scores for bigger T. This feels like an artifact of the definition (coming from how volumes grow with dimension), rather than a sign that the agent is a genuinely better optimizer. Is there a score formula that has similar properties but doesn’t suffer from this paradox?
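The growth can be checked by brute force. A small script (exact counting over all action vectors, not an approximation; the naming is mine) reproduces the T=1 and T=2 numbers and extends them to T=3:

```python
import math
from itertools import product

def c_score_repeat(pick, t):
    """Exact C-score for picking `pick` in every one of t rounds of the 1..10 game."""
    actions = list(product(range(1, 11), repeat=t))       # all action vectors
    good = sum(1 for a in actions if sum(a) >= pick * t)  # at least as good as the agent
    return math.log(len(actions) / good)

for t in (1, 2, 3):
    print(t, round(c_score_repeat(8, t), 3))
# The score keeps growing with T:
# T=1: log(10/3) ≈ 1.204, T=2: log(100/15) ≈ 1.897, T=3: log(1000/84) ≈ 2.477
```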

Thanks to Linda, Joar and Chris for helpful discussions that led to this post.

• Previously...

(That’s an especially interesting comment by EY. To read it in the original context, order by ‘new’ at the top of the comments, and read the discussion with Shane_Legg in reverse order.)

• Thanks for the pointers, I should have done more research before writing this up. After a quick glance, my initial impression is that there still isn’t a good solution (even an uncomputable one) to problems such as the one I describe at the end, or the one AlexMennen mentions.

• Some undesirable properties of C-score:

It depends on how the space of actions is represented. If a set of very similar actions that achieve the same utility for the agent are merged into one action, this will change the agent’s C-score.

It does not depend on the magnitudes of the agent’s preferences, only on their ordering. Compare two agents: the first has 3 available actions, which would give it utilities 0, .9, and 1, respectively, and it picks the action that would give it utility .9. The second has 3 available actions, which would give it utilities 0, .1, and 1, respectively, and it picks the action that would give it utility .1. Intuitively, the first agent is a more successful optimizer, but both agents have the same C-score.
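This example is easy to verify directly against the definition; a minimal check (the helper name is mine):

```python
import math

def c_score(utilities, chosen):
    """C-score from the full list of action utilities and the index of the chosen action."""
    good = sum(1 for v in utilities if v >= utilities[chosen])
    return math.log(len(utilities) / good)

# Agent 1 picks the 0.9-utility action; agent 2 picks the 0.1-utility action.
s1 = c_score([0.0, 0.9, 1.0], chosen=1)
s2 = c_score([0.0, 0.1, 1.0], chosen=1)
print(s1 == s2)  # True: both score log(3/2), despite very different payoffs
```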

• I agree with the first point, and I don’t have a solid solution to it. There’s also the fact that some games are easier to optimize than others (the name-a-number game I described at the end vs. chess), and this complexity is impossible to capture while staying computation-agnostic. Maybe one could use the length of the shortest proof that taking action a leads to utility u(a) to account for these issues.

The second point is more controversial; my intuition is that the first agent is an equally good optimizer, even though it did better in terms of payoffs. Also, at least in the setting of deterministic games, utility functions are arbitrary up to encoding the same preference orderings (once randomness is introduced this stops being true).

• I think decision problems with incomplete information are a better model in which to measure optimization power than deterministic decision problems with complete information are. If the agent knows exactly what payoffs it would get from each action, it is hard to explain why it might not choose the optimal one. In the example I gave, the first agent could have mistakenly concluded that the .9-utility action was better than the 1-utility action while making only small errors in estimating the consequences of each of its actions, while the second agent would need to make large errors in estimating the consequences of its actions in order to think that the .1-utility action was better than the 1-utility action.

• “Note that we can represent sequential decision problems in this framework (e.g. Sudoku), elements of A would then be vectors of individual actions.”

Unless the environment is deterministic, you want to consider policies rather than vectors of actions. On a related note, instead of considering a uniform distribution over actions, we might consider a uniform distribution over programs for a prefix-free universal Turing machine. This solves your repeated game paradox in the sense that the program that always picks 9 will have some finite probability and will do better than your agent for any T, so your agent’s score will be bounded.

• The “rare event sampling” link is broken.