Alternative to Bayesian Score

I am starting to wonder whether Bayesian Score is what I want to maximize for my epistemic rationality. I started thinking about this while trying to design a board game to teach calibration of probabilities, so I will use that as my example:

I wanted a scoring mechanism which motivates honest reporting of probabilities and rewards players who are better calibrated. For simplicity, let's assume that we only have to deal with true/false questions for now. A player is given a question which they believe is true with probability p. They then name a real number x between 0 and 1. Then, they receive a score which is a function of x and whether or not the question is true. We want the expected score to be maximized exactly when x = p. Let f(x) be the output if the question is true, and let g(x) be the output if the question is false. Then, the expected score is p f(x) + (1 - p) g(x). If we assume f and g are smooth, then in order to have a maximum at x = p, we want p f'(p) + (1 - p) g'(p) = 0, which still leaves us with a large class of functions. It would also be nice to have symmetry by having f(x) = g(1 - x). If we further require this, we get p f'(p) + (1 - p)(-1) f'(1 - p) = 0, or equivalently p f'(p) = (1 - p) f'(1 - p). One way to achieve this is to set x f'(x) to be a constant c. Then f'(x) = c/x, so f(x) = log x (taking c = 1). This scoring mechanism is referred to as the “Bayesian Score.”

However, another natural way to achieve this is by setting f'(x)/(1 - x) equal to a constant. If we set this constant equal to 2, we get f'(x) = 2 - 2x, which gives us f(x) = 2x - x² = 1 - (1 - x)². I will call this the “Squared Error Score.”
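Both rules can be sanity-checked numerically. Here is a minimal sketch (function names are mine) confirming that, for a belief of p = 0.7, each expected score is maximized by honestly reporting x = p:

```python
import math

def expected_log_score(p, x):
    # Bayesian Score: f(x) = log x if true, g(x) = f(1 - x) = log(1 - x) if false
    return p * math.log(x) + (1 - p) * math.log(1 - x)

def expected_squared_error_score(p, x):
    # Squared Error Score: f(x) = 1 - (1 - x)**2 if true, g(x) = 1 - x**2 if false
    return p * (1 - (1 - x) ** 2) + (1 - p) * (1 - x ** 2)

p = 0.7  # the player's honest belief that the question is true
grid = [i / 100 for i in range(1, 100)]
best_log = max(grid, key=lambda x: expected_log_score(p, x))
best_sq = max(grid, key=lambda x: expected_squared_error_score(p, x))
# Both maxima land on the honest report x = p = 0.7.
```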

There are many other functions which satisfy the desired conditions, but these two are the simplest, so I will focus on these two.

Eliezer argues for the Bayesian Score in A Technical Explanation of Technical Explanation, which I recommend reading. The reason he prefers the Bayesian Score is that he wants the sum of the scores associated with determining P(A) and P(B|A) to equal the score for determining P(A&B). In other words, he wants it not to matter whether you break a problem up into one experiment or two experiments. This is a legitimate virtue of this scoring mechanism, but I think that many people believe it is a lot more valuable than it is. It does not eliminate the problem that we don't know which questions to ask. It gives us the same answer regardless of how we break an experiment up into smaller experiments, but our score is still dependent on which questions are asked, and this cannot be fixed by just saying, “Ask all questions.” There are infinitely many of them, and the sum does not converge. Because the score is still a function of which questions are asked, the fact that it gives the same answer for some related sets of questions is not a huge benefit.

One nice thing about the Squared Error Score is that it always gives a score between 0 and 1, which means we can actually use it in real life. For example, we could ask someone to construct a spinner that comes up either true or false, and then spin it twice. They win if either of the two spins comes up with the true answer. In this case, the best strategy is to construct the spinner so that it comes up true with probability p. There is no way to do anything similar for the Bayesian Score; in fact, it is questionable whether arbitrarily low utilities even make sense.
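The claim that the spinner game reproduces the Squared Error Score can be checked directly: the chance that at least one of two spins shows the actual answer is exactly f(x) when the question is true and g(x) = f(1 - x) when it is false. A small sketch (names mine):

```python
def spinner_win_prob(x, truth):
    # Spinner lands "true" with probability x; spin twice,
    # win if at least one spin shows the actual answer.
    hit = x if truth else 1 - x
    return 1 - (1 - hit) ** 2

def squared_error_score(x, truth):
    # f(x) = 1 - (1 - x)**2 if true; g(x) = f(1 - x) = 1 - x**2 if false
    return 1 - (1 - x) ** 2 if truth else 1 - x ** 2

# The two agree for every reported probability x, in both the true
# and false cases.
checks = [
    abs(spinner_win_prob(i / 20, t) - squared_error_score(i / 20, t)) < 1e-12
    for i in range(21) for t in (True, False)
]
```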

The Bayesian Score is slightly easier to generalize to multiple choice questions. The Squared Error Score can also be generalized, but it unfortunately makes your score a function of more than just the probability you assigned to the correct answer. For example, if A is the correct answer, you get more points for 80% A, 10% B, 10% C than for 80% A, 20% B, 0% C. The function you want for multiple options is: if you assign probabilities x₁ through xₙ, and the first option is correct, you get a score of 2x₁ - x₁² - x₂² - ... - xₙ². I do not think this is as bad as it seems. It kind of makes sense that when the answer is A, you get penalized slightly for saying that you are much more confident in B than in C, since making such a claim is a waste of information. To view this as a spinner, you construct a spinner with one region per option, spin it twice, and you win if either spin gets the correct answer, or if the first spin comes lexicographically strictly before the second spin.
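A quick check of the multiple-choice formula on the 80/10/10 versus 80/20/0 example above (the function name is mine):

```python
def multi_choice_score(probs, correct):
    # Score 2*x_c - sum of all x_i**2, where x_c is the probability
    # assigned to the correct option.
    return 2 * probs[correct] - sum(x * x for x in probs)

even = multi_choice_score([0.8, 0.1, 0.1], 0)    # 1.6 - 0.66 = 0.94
skewed = multi_choice_score([0.8, 0.2, 0.0], 0)  # 1.6 - 0.68 = 0.92
# Spreading the remaining 20% evenly over the wrong answers scores higher.
```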

For the purpose of my calibration game, I will almost certainly use the Squared Error Score, because a log score, with its unbounded penalties, is not feasible in a game. But it got me thinking about why I am not thinking in terms of the Squared Error Score in real life.

You might ask what the experimental difference between the two is, since both are maximized by honest probabilities. Well, if I have two questions, I want to maximize my (possibly weighted) average score, and I have a limited amount of time to research and improve my answers, then it matters how much the scoring mechanism penalizes various errors. The Bayesian Score penalizes so much for being sure of one false thing that none of the other scores really matter, while the Squared Error Score is much more forgiving. If we normalize so that a 50/50 answer gives 0 points while true certainty gives 1 point, then the Squared Error Score gives -3 points for false certainty, while the Bayesian Score gives negative infinity.
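The normalization can be spelled out. A sketch (names mine) of the rescaling so that a 50/50 report scores 0 and correct certainty scores 1, taking the case where the question is true:

```python
import math

def normalized(f, x):
    # Affine rescaling: f(0.5) -> 0 points, f(1.0) -> 1 point.
    return (f(x) - f(0.5)) / (f(1.0) - f(0.5))

squared_error = lambda x: 1 - (1 - x) ** 2  # Squared Error Score, question true
false_certainty = normalized(squared_error, 0.0)  # -> -3.0

# The Bayesian Score has no floor: log x -> -infinity as x -> 0,
# so even near-certainty in the wrong answer is already far below -3.
bayesian = lambda x: math.log(x)
near_false_certainty = normalized(bayesian, 1e-9)  # below -28
```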

I view maximizing the Bayesian Score as the Golden Rule of epistemic rationality, so even a small chance that something else might be better is worth investigating. Even if you are fully committed to the Bayesian Score, I would love to hear any pros or cons you can think of in either direction.

(Edited for formatting)