A universal score for optimizers

[Epistemic status: the idea is simple and intuitive, so I wouldn’t be surprised if it has appeared before. If so, I would appreciate pointers.]

Eliezer defines optimization power as the ability to produce solutions higher on the preference ordering. This feels right, but it would be nice to convert this intuition into a score that can be measured, a universal metric that allows one to meaningfully state that AlphaZero is better at Go than Deep Blue was at chess. In this post I try accomplishing just that. I start by giving a candidate metric and some of its properties. I then discuss how it can be approximately computed in complex settings, and describe a somewhat paradoxical implication of adopting this definition for repeated games.

For simplicity I will assume a deterministic, finite-action-space environment (extending everything to MDPs with uncountable actions doesn’t seem to be a fundamental obstacle). The action space is $A$, the outcome space is $O$, and the utility of the agent under question is $u: O \to \mathbb{R}$. Because the environment is deterministic, I will just write $u(a)$ for brevity. Note that we can represent sequential decision problems (e.g. Sudoku) in this framework; elements of $A$ would then be vectors of individual actions. We write $n(S)$ for the size of a set $S$.

Definition. Suppose we observe an agent with utility function $u(a)$ take an action $a$; then we say this agent has a C-score of:

$$C(u, a) = -\log \frac{n(\{a' \in A : u(a') \ge u(a)\})}{n(A)}$$
Intuitively, the C-score is larger the smaller the fraction of outcomes that are as good for the agent as the one it managed to obtain. One interpretation of this formula is that the agent is competing against a bot that just picks a random action (random-bot); $C$ is then the negative log-probability of losing (or drawing) in this competition.
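
To make the definition concrete, here is a minimal Python sketch (my own illustration, not part of the original argument) that computes the C-score by brute-force enumeration of a small action space. The toy utility function and the use of the natural logarithm are assumptions made for the example:

```python
import math

def c_score(actions, utility, chosen_action):
    """C-score of an agent that chose `chosen_action`: negative log of the
    fraction of actions that are at least as good as the chosen one.
    (Natural log is used here; any fixed base only rescales the score.)"""
    as_good = sum(1 for a in actions if utility(a) >= utility(chosen_action))
    return -math.log(as_good / len(actions))

# Toy example: actions are the integers 1..100 and utility is the number itself.
actions = range(1, 101)
utility = lambda a: a
print(c_score(actions, utility, chosen_action=91))  # 10 of 100 actions are >= 91, so -log(0.1) ~ 2.3
```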

Some properties of C(u,a):

  • C-score is independent of the computational power employed

  • It’s impossible to compute the C-score without knowing the utility function

  • Improving from being in the top 1% of outcomes to the top 0.1% is as “hard” as improving from the top 10% to the top 1%

  • One can use Monte-Carlo to generate lower bounds of the form “w.h.p. the C-score of this agent is at least Y” (see the sketch after this list)

  • The strategy of taking a random action produces (in expectation) the same score in any setting (under some assumptions this might also be approximately true for the strategy of trying k random moves and picking the best one)
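
The Monte-Carlo lower bound mentioned above can be made concrete. The following sketch is my own construction under assumed details (uniform sampling of actions, a Hoeffding confidence bound, natural log); it returns a value Y such that, with probability at least 1 − δ over the sampling, the true C-score is at least Y:

```python
import math
import random

def c_score_lower_bound(sample_action, utility, agent_utility, k=100_000, delta=1e-3):
    """High-probability (>= 1 - delta) lower bound on an agent's C-score.

    sample_action: draws a uniformly random action from A
    utility:       the agent's utility function u
    agent_utility: u of the action the agent actually took

    Estimates the fraction p of actions at least as good as the agent's,
    applies a Hoeffding upper confidence bound to p, and returns -log of it.
    """
    hits = sum(utility(sample_action()) >= agent_utility for _ in range(k))
    p_upper = min(1.0, hits / k + math.sqrt(math.log(1 / delta) / (2 * k)))
    return -math.log(p_upper)

# Toy usage: actions are the integers 1..1000, utility is the number itself.
rng = random.Random(0)
print(c_score_lower_bound(lambda: rng.randint(1, 1000), lambda a: a, agent_utility=990))
# true C-score is -log(11/1000) ~ 4.5; the printed bound is smaller but holds w.h.p.
```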

Approximating C-score

Can we estimate the C-score for complex environments and agents? For example, suppose we have an agent playing chess against a known distribution of opponents (or, more simply, against random-bot), and the utility function is the probability of winning the game (the action space is then the set of policies, i.e. mappings from state to action). Can we measure the C-score of this agent without using an unrealistic amount of compute?

Clearly, the naive Monte-Carlo approach I mentioned above is a no-go in this scenario: the probability that a randomly sampled action would be in the set $\{a' \in A : u(a') \ge u(a)\}$ is tiny (if the agent is any good), and we will never sample this set.

I have a couple of potential ideas here:

  • If one has access to a better optimizer than the one being measured, one can use rare-event sampling: bias the sampling procedure towards good policies and then account for this bias when computing the C-score (see the sketch after this list). This approach fits nicely with the intuition that “you need to be at least about as smart as the agent to measure how smart it is”

  • If the game is a DP/MDP and the agent employs value-function estimation for picking moves, one can try looking at how much the agent improves when enhanced with MCTS, and then try inferring its C-score from that. I don’t have good explanations for why and how this can work, only an intuition that this is a meaningful approach to try.
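
As one concrete (and much simplified) instance of the first idea, here is an importance-sampling sketch of my own: actions are drawn from a proposal distribution that favours good actions (standing in for the “better optimizer”), and each hit on the rare set is reweighted so that the estimate of the fraction stays unbiased. The toy setting, the exponential proposal, and the sample sizes are all assumptions for illustration:

```python
import math
import random

def c_score_importance_sampling(actions, utility, agent_utility, weights, k=10_000, seed=0):
    """Estimate the C-score by importance (rare-event) sampling.

    The fraction p = n({a : u(a) >= u*}) / n(A) is estimated by sampling actions
    from a proposal q proportional to `weights` (ideally concentrated on good
    actions) and reweighting every hit by (1/n(A)) / q(a); the estimate of p
    stays unbiased while hits on the rare set become frequent.
    """
    rng = random.Random(seed)
    n = len(actions)
    total = sum(weights)
    idxs = rng.choices(range(n), weights=weights, k=k)
    p_hat = sum((1 / n) / (weights[i] / total)
                for i in idxs if utility(actions[i]) >= agent_utility) / k
    return -math.log(p_hat) if p_hat > 0 else float("inf")

# Toy usage: actions 1..100_000, utility is the number itself, the agent achieved
# 99_990, and the proposal concentrates on high-utility actions.
actions = list(range(1, 100_001))
weights = [math.exp((a - 100_000) / 50) for a in actions]
print(c_score_importance_sampling(actions, lambda a: a, 99_990, weights))
# true C-score is -log(11 / 100_000) ~ 9.1
```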

Repeated game paradox

I’ll use a very simple game to illustrate this issue. An agent picks a number between 1 and 10, and the agent’s utility equals the number chosen. This game is repeated T times, so the agent’s total utility is the sum of the numbers it returned in all T games.

If T=1 and the agent picks 8, we have $C = -\log\frac{3}{10}$ (there are 3 actions out of 10 that yield utility of at least 8).

If T=2 and the agent picks 8 twice, we get $C = -\log\frac{15}{100}$ (15 of the 100 possible pairs of actions have a total utility of at least 16), which is larger than the T=1 score.
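
A quick brute-force check of these numbers (my own sketch; the natural log is an arbitrary choice of base) also shows the score continuing to grow with T:

```python
import math
from itertools import product

def repeated_game_c_score(T, chosen=8, low=1, high=10):
    """C-score of an agent that picks `chosen` in every one of T rounds.
    Enumerates all (high - low + 1)**T action vectors, so only feasible for small T."""
    outcomes = [sum(v) for v in product(range(low, high + 1), repeat=T)]
    as_good = sum(1 for s in outcomes if s >= chosen * T)
    return -math.log(as_good / len(outcomes))

for T in range(1, 5):
    print(T, round(repeated_game_c_score(T), 3))
# T=1: -log(3/10)   ~ 1.204
# T=2: -log(15/100) ~ 1.897
# ... and the score keeps growing with T
```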

Thus an agent that finds a good solution once at t=1, and then repeats the same action until the end, obtains higher and higher scores for bigger T. This feels like an artifact of the definition (coming from how volumes grow with dimensions), rather than the agent being a genuinely better optimizer. Is there a score formula that has similar properties but doesn’t suffer from this paradox?

Thanks to Linda, Joar and Chris for helpful discussions that led to this post.