# Alternative to Bayesian Score

I am starting to wonder whether Bayes Score is what I want to maximize for my epistemic rationality. I started thinking about this while trying to design a board game to teach calibration of probabilities, so I will use that as my example:

I wanted a scoring mechanism which motivates honest reporting of probabilities and rewards players who are better calibrated. For simplicity, let's assume that we only have to deal with true/false questions for now. A player is given a question which they believe is true with probability p. They then name a real number x between 0 and 1. Then, they receive a score which is a function of x and whether or not the question is true. We want the expected score to be maximized exactly when x=p. Let f(x) be the output if the question is true, and let g(x) be the output if the question is false. Then, my expected utility is p·f(x)+(1-p)·g(x). If we assume f and g are smooth, then in order to have a maximum at x=p, we want p·f'(p)+(1-p)·g'(p)=0, which still leaves us with a large class of functions. It would also be nice to have symmetry by having f(x)=g(1-x). If we further require this, we get p·f'(p)+(1-p)·(-1)·f'(1-p)=0, or equivalently p·f'(p)=(1-p)·f'(1-p). One way to achieve this is to set x·f'(x) to be a constant. Then f'(x)=c/x, so f(x)=log x. This scoring mechanism is referred to as "Bayesian Score."

However, another natural way to achieve this is by setting f'(p)/(1-p) equal to a constant. If we set this constant equal to 2, we get f'(x)=2-2x, which gives us f(x)=2x-x^2=1-(1-x)^2. I will call this the "Squared Error Score."

There are many other functions which satisfy the desired conditions, but these two are the simplest, so I will focus on them.
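Both rules are easy to check numerically. The following is a minimal Python sketch (my illustration, not from the original post) that verifies each score's expectation is maximized exactly at x=p, i.e. that both rules are proper:

```python
import numpy as np

def log_score(x, outcome):
    # "Bayesian Score": f(x) = log x if true, g(x) = log(1 - x) if false
    return np.log(x) if outcome else np.log(1 - x)

def squared_error_score(x, outcome):
    # "Squared Error Score": f(x) = 1 - (1 - x)^2, g(x) = 1 - x^2
    return 1 - (1 - x) ** 2 if outcome else 1 - x ** 2

def expected_score(score, x, p):
    # Expected utility of reporting x when your true belief is p
    return p * score(x, True) + (1 - p) * score(x, False)

p = 0.7
xs = np.linspace(0.01, 0.99, 9801)  # grid step of 0.0001
for score in (log_score, squared_error_score):
    best_x = xs[np.argmax([expected_score(score, x, p) for x in xs])]
    assert abs(best_x - p) < 1e-3  # maximized by honest reporting
```

The grid search is crude but makes the point: under either rule, reporting anything other than your true belief p costs you expected score.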

Eliezer argues for Bayesian Score in A Technical Explanation of Technical Explanation, which I recommend reading. The reason he prefers Bayesian Score is that he wants the sum of the scores associated with determining P(A) and P(B|A) to equal the score for determining P(A&B). In other words, he wants it not to matter whether you break a problem up into one experiment or two. This is a legitimate virtue of the scoring mechanism, but I think many people believe it is more valuable than it is. It does not eliminate the problem that we don't know which questions to ask. It gives us the same answer regardless of how we break an experiment into smaller experiments, but our score still depends on which questions are asked, and this cannot be fixed by just saying, "Ask all questions." There are infinitely many of them; the sum does not converge. Because the score is still a function of which questions are asked, the fact that it gives the same answer for some related sets of questions is not a huge benefit.

One nice thing about the Squared Error Score is that it always gives a score between 0 and 1, which means we can actually use it in real life. For example, we could ask someone to construct a spinner that comes up either true or false, and then spin it twice. They win if either of the two spins comes up with the true answer. In this case, the best strategy is to assign probability p to true. There is no way to do anything similar for the Bayesian Score; in fact, it is questionable whether arbitrarily low utilities even make sense.
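A quick Monte Carlo sketch (my own illustration, assuming a spinner that lands on "true" with probability x) shows that the win probability of this two-spin game is exactly the Squared Error Score:

```python
import random

def spinner_win(x, answer_is_true):
    # Spin, twice, a spinner that lands on "true" with probability x;
    # the player wins if either spin shows the actual answer.
    spins = [random.random() < x for _ in range(2)]
    return any(spin == answer_is_true for spin in spins)

x, trials = 0.8, 200_000
win_rate = sum(spinner_win(x, True) for _ in range(trials)) / trials
# P(win | question is true) should be f(x) = 1 - (1 - x)^2 = 0.96
assert abs(win_rate - (1 - (1 - x) ** 2)) < 0.01
```

Since the player's expected payoff is the Squared Error Score itself, honest reporting remains the optimal spinner construction.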

The Bayesian Score is slightly easier to generalize to multiple choice questions. The Squared Error Score can also be generalized, but it unfortunately makes your score a function of more than just the probability you assigned to the correct solution. For example, if A is the correct answer, you get more points for 80% A, 10% B, 10% C than for 80% A, 20% B, 0% C. The function you want for multiple values is: if you assign probabilities x_1 through x_n, and the first option is correct, you get output 2x_1 - x_1^2 - x_2^2 - ... - x_n^2. I do not think this is as bad as it seems. It kind of makes sense that when the answer is A, you get penalized slightly for saying that you are much more confident in B than in C, since making such a claim is a waste of information. To view this as a spinner, you construct a spinner, spin it twice, and you win if either spin gets the correct answer, or if the first spin comes lexicographically strictly before the second spin.
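This multiple-choice spinner can be checked by exact enumeration. A sketch under my reading of the rule (option indices standing in for lexicographic order) suggests the win probability comes out to an affine transform of the score above, namely score/2 + 1/2, which preserves the same incentives:

```python
from itertools import product

def win_prob(probs, correct):
    # Exact P(win): spin twice; win if either spin hits the correct
    # answer, or the first spin comes strictly before the second
    # in lexicographic (index) order.
    total = 0.0
    for i, j in product(range(len(probs)), repeat=2):
        if i == correct or j == correct or i < j:
            total += probs[i] * probs[j]
    return total

def multi_squared_error_score(probs, correct):
    # 2*x_c - (x_1^2 + ... + x_n^2), as in the post
    return 2 * probs[correct] - sum(x * x for x in probs)

# Win probability = score/2 + 1/2, an order-preserving transform
probs = [0.8, 0.1, 0.1]
assert abs(win_prob(probs, 0)
           - (multi_squared_error_score(probs, 0) / 2 + 0.5)) < 1e-9
```

The names `win_prob` and `multi_squared_error_score` are my own; the affine offset does not change which reported distribution is optimal.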

For the purpose of my calibration game, I will almost certainly use Squared Error Scoring, because log is not feasible. But it got me thinking about why I am not thinking in terms of Squared Error Score in real life.

You might ask what the experimental difference between the two is, since they are both maximized by honest probabilities. Well, if I have two questions and I want to maximize my (possibly weighted) average score, and I have a limited amount of time to research and improve my answers, then it matters how much the scoring mechanism penalizes various errors. Bayesian Scoring penalizes so much for being sure of one false thing that none of the other scores really matter, while Squared Error is much more forgiving. If we normalize so that 50/50 gives 0 points while true certainty gives 1 point, then Squared Error gives −3 points for false certainty while Bayesian gives negative infinity.
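To make the −3 versus negative infinity comparison concrete, here is a small sketch (my own normalization code, not the author's) that rescales both rules so a 50/50 answer scores 0 and true certainty scores 1:

```python
import math

def normalize(score, baseline, top):
    # Affine rescale: a 50/50 answer maps to 0, true certainty to 1.
    return lambda x: (score(x) - baseline) / (top - baseline)

# Scores when the proposition is true, as a function of reported x:
log_f = lambda x: math.log(x)          # Bayesian Score
sq_f = lambda x: 1 - (1 - x) ** 2      # Squared Error Score

norm_log = normalize(log_f, log_f(0.5), log_f(1.0))
norm_sq = normalize(sq_f, sq_f(0.5), sq_f(1.0))

assert norm_sq(0.0) == -3.0   # false certainty costs exactly -3
assert norm_log(1e-9) < -25   # log penalty grows without bound as x -> 0
```

The squared error penalty bottoms out at −3, while the normalized log score can be made arbitrarily negative by pushing the reported probability toward 0.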

I view maximizing Bayesian Score as the Golden Rule of epistemic rationality, so even a small chance that something else might be better is worth investigating. Even if you are fully committed to Bayesian Score, I would love to hear any pros or cons you can think of in either direction.

(Edited for formatting)

• Quadratic scoring rules are often referred to as the Brier score (it seems odd to refer to one score by a name and the other by its functional form, rather than comparing names or functions).

You can read a comparison of the three proper scoring rules by Eric Bickel here. He argues for logarithmic scoring rules because of two practical concerns that I suspect are different from Eliezer's concern.

• So, it looks like the two main concerns in this paper are:

1. Brier Score is non-local, meaning that it sometimes benefits you to give a slightly lower probability to a true statement. This is because it penalizes slightly for not distributing your probability mass equally among all false hypotheses. This seems like it is probably a bad thing, but I am not completely sure. It is still a waste of information to prefer B to C when the correct answer is A. Additionally, if we only think about this in the context of true-false questions, this is completely a non-concern.

2. Bayesian Score is more stable under slightly non-linear utility functions. This is argued as a pro for Bayesian Score, but I think it should be the other way. Bayesian Score is more stable under non-linear utility functions, but with Brier Score, you can use randomness to remove any problems from non-linear utility functions completely. Because Brier Score gives you scores between 0 and 1, you don't have to award different utilities at all. You can just say you get some fixed utility with probability equal to your score. This is impossible with Bayesian Score.

The paper also talks about a third, "Spherical" scoring mechanism, which sets your score equal to the probability you assigned to the correct answer divided by the square root of the sum of the squares of all the probabilities.
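For reference, the spherical rule described here fits in a few lines; a grid search (my sketch, not from the paper) confirms that it too is proper on true/false questions:

```python
import numpy as np

def spherical_score(probs, correct):
    # Probability assigned to the correct answer, divided by the
    # Euclidean norm of the whole probability vector.
    probs = np.asarray(probs, dtype=float)
    return probs[correct] / np.sqrt((probs ** 2).sum())

p = 0.7  # true probability that a true/false question is true
xs = np.linspace(0.01, 0.99, 9801)
expected = [p * spherical_score([x, 1 - x], 0)
            + (1 - p) * spherical_score([x, 1 - x], 1) for x in xs]
best = xs[int(np.argmax(expected))]
assert abs(best - p) < 1e-3  # maximized by honest reporting
```

Like the other two rules, the spherical score is bounded between 0 and 1, so the spinner-style probability-of-winning trick would apply to it as well.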

Now that I know the name of this scoring rule, I will look for more information, but I think if anything that paper makes me like the Brier score better (at least for true-false questions).

• Thanks!

• It's probably worth pointing out that the paper is by J. Eric Bickel and not by the much better known statistician Peter Bickel.

• Edited.

• Maybe instead of grumbling, the website could be changed to make that the default workflow automatically, with an advanced option for raw HTML?

• One nice thing about the Squared Error Score is that it always gives a score between 0 and 1… There is no way to do anything similar for the Bayesian Score

Scoring results between 0 and 1 actually seems like the wrong thing to do, because you are not adequately punishing people for being overconfident. If someone says they are 99.99% confident that event A will not happen, and then A does happen, you should assign that person a very strong penalty.

I find it much easier to think about the compression rate -log2(x) than the Bayesian Score. Thinking in terms of compression makes it easy to remember that the goal is to minimize -log2(x), or compress a data set to the shortest possible size (log base 2 gives you a bit length). Bit lengths are always nice positive numbers, and we have the nice interpretation that an overconfident guesser is required to use a very large codelength to encode an outcome that was predicted to have very low probability.

• It is not obvious to me that people should be penalized so strongly for being wrong. In fact, I think that is a big part of the question I am asking. Would you rather be right 999 times and 100% sure but wrong the last time, or would you rather have no information on anything?

• Would you rather be right 999 times and 100% sure but wrong the last time, or would you rather have no information on anything?

Is that a rhetorical question? Obviously it depends on the application domain: if we were talking about buying and selling stocks, I would certainly rather have no information about anything than experience a scenario where I was 100% sure and then wrong. In that scenario I would presumably have bet all my money, and maybe lots of my investors' money, and then lost it all.

• It does depend on the domain. I think that the reason you want to be very risk-averse in stocks is that you have adversaries trying to take your money, so you get all the negatives of being wrong without all the positives of the 999 times you knew the stock would rise and were correct.

In other cases, such as deciding which route to take while traveling to save time, I'd rather be wrong every once in a while so that I could be right more often.

Both of these ideas are about instrumental rationality, so the question is: if you are trying to come up with a model of epistemic rationality which does not depend on utility functions, what type of scoring should you use?

• It's meaningless to talk about optimizing epistemic rationality without talking about your utility function. There are a lot of questions you could get better at answering. Which ones you want to answer depends on what kind of decisions you want to make, which depends on what you value.

• But probabilities are a useful latent variable in the reasoning process, and it can be worthwhile instrumentally to try to have accurate beliefs, as this may help out in a wide variety of situations that we cannot predict in advance. So there is still the question of which beliefs it is most important to make more accurate.

Also, I believe the OP is trying to write code for a variant of the calibration game, so it is somewhat intrinsically necessary for him to score probabilities directly.

• This is for a game? How do you win? Does maximizing the expectation of intermediate scores maximize the probability that you win? Even when you know your prior scores during this game? Even if you know your opponents' scores? If it's not that type of game, then however these scores are aggregated, does maximizing the expectations of your scores on each question maximize the expectation of your utility from the aggregate?

• So my main question is not about the game; it is a philosophical question about how I should define my epistemic rationality. However, there is also a game I am designing. I don't know what the overall structure of my game will be, but it actually doesn't matter what your score is or what the win condition is. As long as there are isolated questions, it is always better to win a round than to lose it, and each round is played with the spinners I described, the optimal strategy will always be honest reporting of probabilities.

In fact, you could take any trivia game which asks only multiple choice questions, in which you always want to get the answer right, replace its scoring with my spinner mechanism, and it will work.

• The problem with the squared error score is that it just rewards asking a ton of obvious questions. I predict with 100% probability that the sky will be blue one second from now. Just keep repeating for a high score.

• Both methods fail miserably if you get to choose which questions are asked. Bayesian score rewards never asking any questions at all. Or, if you normalize it to assign 1 to true certainty and 0 to 50/50, then it rewards asking obvious questions too.

If it helps, you can think of the squared error score as -(1-x)^2 instead of 1-(1-x)^2; that fixes this problem.

• Both methods fail miserably if you get to choose which questions are asked. Bayesian score rewards never asking any questions at all. Or, if you normalize it to assign 1 to true certainty and 0 to 50/50, then it rewards asking obvious questions too.

Only because you are baking in an implicit loss function on which all questions are equally valuable; switch to some other loss function which weights the value of more interesting or harder questions more, and this problem disappears, as 'the sky is blue' ceases to be worth anything compared to a real prediction like 'Obama will be re-elected'.

• I don't understand why what you are suggesting has anything to do with what I said.

Yes, of course you can assign different values to different statements, and I mentioned this. However, what I was saying here is that if you allow the option of just not answering one of the questions (whatever that means), then there has to be some utility associated with not answering. The comment I was responding to was saying that Bayesian was better than Brier because Brier gave positive utilities instead of negative utilities, and so could be cheated by asking lots of easy questions.

Your response seems to be about scaling the utilities for each question based on the importance of that question. This is very valid, and I mentioned that when I said "(possibly weighted) average score." That is a very valid point, but I don't see how it has anything to do with the problems associated with being able to choose what questions are asked.

• That is a very valid point, but I don't see how it has anything to do with the problems associated with being able to choose what questions are asked.

I don't understand your problem here. If questions' values are scaled appropriately, or some fancier approach is used, then it doesn't matter if respondents pick and choose, because they will either be wasting their time or missing out on large potential gains. A loss-function-style approach seems to adequately resolve this problem.

• I think this is probably bad communication on my part.

The model I am imagining from you is that there is some countable collection of statements you want to assign true/false to. You assign some weight function to the statements so that the total weight of all statements is some finite number, and your score is the sum of the weights of all statements which you choose to answer.

For this, it really matters not only how the values are scaled, but also how they are translated. It matters where the 0 utility point for each question is, because that determines whether or not you want to choose to answer that question. I think that the 0 utility point should be put at the utility of the 50/50 probability assignment for each question. In that case, not answering a question is equivalent to answering it with 50/50 probability, so I think it would be simpler to just say you have to answer every question, and your answer by default is 50/50, in which case the 0 points don't matter anymore. This is just semantics.

But just saying that you scale each question by its importance doesn't fix the following: if you model this as being able to choose which questions to answer, with your utility being the sum of your utilities on the individual questions, then the Bayesian rule as written encourages not answering any questions, since it can only give you negative utility. You have to fix that by either fixing the 0 points of your utilities in some reasonable way, or by requiring that you are assigned utility for every question, with a default answer if you don't think about it at all.

There are benefits to weighting the questions, because that allows us to take infinite sums, but if we assume for now that there are only finitely many questions, and all questions have rational weights, then weighting the questions is similar to just asking the same question multiple times (proportional to its weight). This may be more accurate for what we want in epistemic rationality, but it doesn't actually solve the problems associated with allowing people to pick and choose questions.

• The model I am imagining from you is that there is some countable collection of statements you want to assign true/false to. You assign some weight function to the statements so that the total weight of all statements is some finite number, and your score is the sum of the weights of all statements which you choose to answer.

Hm, no, I wasn't really thinking that way. I don't want some finite number; I want everyone to reach different numbers, so that more accurate predictors score higher.

The weights on particular questions do not even have to be set algorithmically: for example, a prediction market is immune to the 'sky is blue' problem, because if one were to start a contract for 'the sky is blue tomorrow', no one would trade on it unless one were willing to lose money being a market-maker as the other trader bid it up to the meteorologically accurate 80% or whatever. One can pick and choose as much as one pleases, but unless one's contracts were valuable to other people for some reason, it would be impossible to make money by stuffing the market with bogus contracts. The utility just becomes how much money you made.

• I think that the 0 utility point should be put at the utility of the 50/50 probability assignment for each question.

I think this doesn't work because you're trying to invent a non-informative prior, and it's trivial to set up sets of predictions where the obviously better non-informative prior is not 1/2: for example, set up 3 predictions for each of 3 mutually exclusive and exhaustive outcomes, where the non-informative prior obviously looks more like 1/3, and 1/2 means someone is getting robbed. More importantly, uninformative priors are disputed, and it's not clear what they are in more complex situations. (Frequentist Larry Wasserman goes so far as to call them "lost causes" and "perpetual motion machines".)

But just saying that you scale each question by its importance doesn't fix the following: if you model this as being able to choose which questions to answer, with your utility being the sum of your utilities on the individual questions, then the Bayesian rule as written encourages not answering any questions, since it can only give you negative utility. You have to fix that by either fixing the 0 points of your utilities in some reasonable way, or by requiring that you are assigned utility for every question, with a default answer if you don't think about it at all.

Perhaps raw log odds is not the best idea, but do you really think there is no way to interpret them into some score which disincentivizes strategic predicting? This sounds just arrogant to me, and I would only believe it if you summarized all the existing research into rewarding experts and showed that log odds simply could not be used in any circumstance where a predictor could predict a subset of the specified predictions.

but if we assume for now that there are only finitely many questions, and all questions have rational weights, then weighting the questions is similar to just asking the same question multiple times (proportional to its weight).

There aren't finitely many questions, because one can ask questions involving each of the infinite set of integers… Knowing that two questions are asking identical things sounds like an impossible demand to meet (for example, if any system claimed this, it could solve the Halting Problem by simply asking it to predict the output of 2 Turing machines).

• If you normalize Bayesian score to assign 1 to 100% and 0 to 50% (and −1 to 0%), you encounter a math error.

• I didn't do that. I only set 1 to 100% and 0 to 50%. 0% is still negative infinity.

• That's the math error.

Why is it consistent that assigning a probability of 99% to one half of a binary proposition that turns out false is much better than assigning a probability of 1% to the opposite half that turns out true?

• There's no math error.

Why is it consistent that assigning a probability of 99% to one half of a binary proposition that turns out false is much better than assigning a probability of 1% to the opposite half that turns out true?

I think there's some confusion. Coscott said these three facts:

Let f(x) be the output if the question is true, and let g(x) be the output if the question is false.

f(x)=g(1-x)

f(x)=log(x)

In consequence, g(x)=log(1-x). So if x=0.99 and the question is false, the output is g(x)=log(1-x)=log(0.01). Or if x=0.01 and the question is true, the output is f(x)=log(x)=log(0.01). So the symmetry that you desire holds.

• But that doesn't output 1 for estimates of 100%, 0 for estimates of 50%, and -inf (or even −1) for estimates of 0%, or even something that can be normalized to either of those triples.

• Here's the "normalized" version: f(x)=1+log2(x), g(x)=1+log2(1-x) (i.e. scale f and g by 1/log(2) and add 1).

Now f(1)=1, f(.5)=0, f(0)=-Inf; g(1)=-Inf, g(.5)=0, g(0)=1.

Ok?

• Huh. I thought that wasn't a Bayesian score (not maximized by estimating correctly), but doing the math, the maximum is at the right point for 1/4, 1/100, 3/4, 99/100, and 1/2.
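That check can also be reproduced numerically; a short grid-search sketch (mine, not the commenter's) confirms the normalized version is still maximized by honest reporting at each of those probabilities:

```python
import numpy as np

def f(x):
    # normalized log score when the question is true
    return 1 + np.log2(x)

def g(x):
    # normalized log score when the question is false
    return 1 + np.log2(1 - x)

xs = np.linspace(0.001, 0.999, 99_901)  # grid step of about 1e-5
for p in (0.25, 0.01, 0.75, 0.99, 0.5):
    expected = p * f(xs) + (1 - p) * g(xs)
    best = xs[int(np.argmax(expected))]
    assert abs(best - p) < 1e-3  # honest reporting wins in every case
```

This is expected: adding a constant and rescaling by a positive factor never moves the maximizer of a proper scoring rule.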