A universal score for optimizers

[Epistemic status: the idea is simple and intuitive, so I wouldn't be surprised if it has appeared before. If so, I would appreciate pointers.]

Eliezer defines optimization power as the ability to produce solutions higher on the preference ordering. This feels right, but it would be nice to convert this intuition into a score that can be measured, a universal metric that allows one to meaningfully state that AlphaZero is better at Go than Deep Blue was at chess. In this post I try to accomplish just that. I start by giving a candidate metric and some of its properties. I then discuss how it can be approximately computed in complex settings, and describe a somewhat paradoxical implication of adopting this definition for repeated games.

For simplicity I will assume a deterministic, finite-action-space environment (extending everything to MDPs with uncountable actions doesn't seem to be a fundamental obstacle). The action space is $A$, the outcome space is $O$, and the utility of the agent in question is $u: O \to \mathbb{R}$. Because the environment $e: A \to O$ is deterministic, I will just write $u(a)$ instead of $u(e(a))$ for brevity. Note that we can represent sequential decision problems in this framework (e.g. Sudoku); elements of $A$ would then be vectors of individual actions. We write $n(S)$ for the size of a set $S$.

Definition. Suppose we observe an agent with utility function $u(a)$ take an action $a \in A$; then we say this agent has a C-score of

$$C(u, a) = \log \frac{n(A)}{n(\{a' \in A \mid u(a') \geq u(a)\})}$$

Intuitively, the C-score is the log of the inverse of the fraction of outcomes that are at least as good for the agent as the one it managed to obtain. One interpretation of this formula is that the agent is competing against a bot that picks an action uniformly at random (random-bot), and $C(u,a)$ is then the negative log-probability that random-bot does at least as well as the agent.
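As a sanity check, here is a minimal Python sketch of the definition for a toy environment small enough to enumerate (the environment and the function name `c_score` are my own, not from the post):

```python
import math

def c_score(actions, u, a):
    """C-score: log of n(A) over the number of actions whose
    utility is at least u(a)."""
    at_least_as_good = sum(1 for ap in actions if u(ap) >= u(a))
    return math.log(len(actions) / at_least_as_good)

# Toy environment: pick a number from 1..10, utility = the number picked.
actions = range(1, 11)
score = c_score(actions, lambda x: x, 8)  # 3 of the 10 actions do at least as well
print(score)  # log(10/3) ≈ 1.204
```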

Some properties of $C(u, a)$:

  • C-score is independent of the computational power employed

  • It's impossible to compute the C-score without knowing the utility function

  • Improving from being in the top 1% of outcomes to the top 0.1% is as "hard" as improving from the top 10% to the top 1%

  • One can use Monte-Carlo sampling to generate lower bounds of the form "w.h.p. the C-score of this agent is at least Y"

  • The strategy of taking a random action produces the same score distribution in any setting (under some assumptions this might also be approximately true for the strategy of trying k random moves and picking the best one)
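The Monte-Carlo lower bound mentioned above can be sketched as follows. The Hoeffding-style confidence adjustment is one possible way to get the "w.h.p." qualifier, and all names here are my own:

```python
import math
import random

def c_score_lower_bound(sample_action, u, a, n_samples=10000, delta=0.05):
    """Monte-Carlo lower bound on the C-score.

    Estimates p = P(random action does at least as well as a), then turns a
    Hoeffding upper confidence bound on p into a lower bound on C = -log p
    that holds with probability at least 1 - delta.
    """
    ua = u(a)
    hits = sum(u(sample_action()) >= ua for _ in range(n_samples))
    p_hat = hits / n_samples
    eps = math.sqrt(math.log(1 / delta) / (2 * n_samples))
    p_upper = min(1.0, p_hat + eps)
    return -math.log(p_upper)

# Toy example: actions 1..1000, utility = the number; the agent picks 990.
# True C-score is log(1000/11) ≈ 4.51; the bound comes out somewhat below it.
random.seed(0)
bound = c_score_lower_bound(lambda: random.randint(1, 1000), lambda x: x, 990)
```

Note the bound saturates once `p_hat` is much smaller than `eps`, which is exactly the failure mode discussed below for strong agents.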

Approximating C-score

Can we estimate the C-score for complex environments and agents? For example, suppose we have an agent playing chess against a known distribution of opponents (or, more simply, against random-bot), and the utility function is the probability of winning the game (the action space is then the set of policies, i.e. mappings from states to actions). Can we measure the C-score of this agent without using an unrealistic amount of compute?

Clearly, the naive Monte-Carlo approach I mentioned above is a no-go in this scenario: the probability that a randomly sampled action lands in the set $\{a' \mid u(a') \geq u(a)\}$ is tiny (if the agent is any good), so we will essentially never sample from this set.

I have a couple of potential ideas here:

  • If we have access to a better optimizer than the one being measured, we can use rare-event sampling: bias the sampling procedure towards good policies, and then account for this bias when computing the C-score. This approach fits nicely with the intuition that "you need to be at least about as smart as the agent to measure how smart it is"

  • If the game is a DP/MDP and the agent employs value-function estimation to pick moves, one can look at how much the agent improves when enhanced with MCTS, and then try to infer the C-score from that improvement. I don't have a good explanation of why or how this would work, only an intuition that it is a meaningful approach to try.
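To illustrate the first idea, here is a hedged sketch of rare-event sampling via importance sampling, in a toy setting where the "stronger optimizer" is modeled as a proposal distribution concentrated on good actions. The entire setup (action space 1..N, the proposal, all function names) is hypothetical:

```python
import math
import random

def c_score_importance(N, u, u_star, proposal_sample, proposal_pmf,
                       n_samples=10000):
    """Importance-sampling estimate of the C-score over actions 1..N.

    Rewrites p = P_{a ~ Uniform(1..N)}(u(a) >= u_star) as an expectation
    under a proposal q biased towards good actions:
        p = E_{a ~ q}[ 1{u(a) >= u_star} * (1/N) / q(a) ]
    and returns -log of the estimate.
    """
    total = 0.0
    for _ in range(n_samples):
        a = proposal_sample()
        if u(a) >= u_star:
            total += (1.0 / N) / proposal_pmf(a)
    return -math.log(total / n_samples)

# Hypothetical setup: actions 1..10^6, utility = the number itself, and the
# agent achieves u* = 999000 (true C-score = log(10^6/1001) ≈ 6.91).
# The proposal is uniform over the top 10000 actions -- a stand-in for a
# stronger optimizer that roughly knows where the good actions are.
N = 10**6
top = range(N - 10000 + 1, N + 1)
random.seed(0)
score = c_score_importance(
    N, lambda x: x, 999000,
    proposal_sample=lambda: random.choice(top),
    proposal_pmf=lambda a: 1 / 10000 if a in top else 0.0,
)
```

Naive uniform sampling with the same budget of 10000 samples would hit the good set only about 10 times here; the biased proposal hits it about 1000 times, which is where the variance reduction comes from.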

Repeated game paradox

I'll use a very simple game to illustrate the issue. An agent picks a number between 1 and 10, and the agent's utility equals the number chosen. This game is repeated T times, so the agent's total utility is the sum of the numbers it returned across all T games.

If T=1 and the agent picks 8, we have $C = \log \frac{10}{3} \approx 1.20$ (3 of the 10 actions yield utility of at least 8).

If T=2 and the agent picks 8 twice, we get $C = \log \frac{10^2}{15} \approx 1.90$ (15 of the $10^2$ action pairs yield total utility of at least 16).

Thus an agent that finds a good solution once at t=1, and then repeats the same action until the end, obtains higher and higher scores for bigger T. This feels like an artifact of the definition (coming from how volumes grow with dimension), rather than the agent being a genuinely better optimizer. Is there a score formula that has similar properties but doesn't suffer from this paradox?
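The growth with T can be checked by brute-force enumeration (a quick sketch; the function name is mine):

```python
import math
from itertools import product

def repeated_c_score(pick, T):
    """Exact C-score of choosing `pick` in every one of T rounds of the
    1..10 number game (total utility = sum of the chosen numbers)."""
    target = pick * T
    as_good = sum(1 for acts in product(range(1, 11), repeat=T)
                  if sum(acts) >= target)
    return math.log(10 ** T / as_good)

scores = [repeated_c_score(8, T) for T in (1, 2, 3)]
print([round(s, 3) for s in scores])  # [1.204, 1.897, 2.477] -- grows with T
```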

Thanks to Linda, Joar and Chris for helpful discussions that led to this post.