# Rafael Harth

Karma: 1,088
• I agree with your final paragraph – I’m fine with assuming there is a true probability. That said, I think there’s an important difference between how accurate a prediction was, which can be straightforwardly defined as its similarity to the true probability, and how good of a job the predictor did.

If we’re just talking about the former, then I don’t disagree with anything you’ve said, except that I would question calling it an “epistemically good” prediction – “epistemically good” sounds to me like it refers to performance. Either way, mere accuracy seems like the less interesting thing of the two.

If we’re talking about the latter, then using the true probability as a comparison is problematic even in principle because it might not correspond to any intuitive notion of a good prediction. I see two separate problems:

• There could be hidden variables. Suppose there is an election between candidate A and candidate B. Unbeknownst to everyone, candidate A has a brain tumor that will dramatically manifest itself three days before election day. Given this, the true probability that A wins is very low. But that can’t mean people who assign low probabilities to A winning all did a good job – by assumption, their prediction was unrelated to the reason the probability was low.

• Even if there are no hidden variables, it might be that accuracy doesn’t monotonically increase with improved competence. Say there’s another election (no brain tumor involved). We can imagine that all of the following is true:

• Naive people will assign about 50:50 odds

• Smart people will recognize that candidate A will have better debate performance and will assign 60:40 odds

• Very smart people will recognize that B’s poor debate performance will actually help them because it makes them relatable, so they will assign 30:70 odds

• Extremely smart people will recognize that the economy is likely to crash before election day, which will hurt B’s chances more than everything else, and will assign 80:20 odds. This is similar to the true probability.

In this case, going from smart to very smart actually makes your prediction worse, even though you picked up on a real phenomenon.

I personally think it might be possible to define the quality of a single prediction in a way that includes the true probability, but I don’t think it’s straightforward.

• I have never used Headspace, but I can say that I found it highly valuable to repeat the introductory course on Waking Up, which does fit your assessment that it moves too fast to learn the concepts the first time.

• Ok, this confirms you haven’t understood what I’m claiming.

I’m arguing against this claim:

I don’t think there is any difference in those lists!

I’m saying that it is harder to make a list where all predictions seem obviously false and have half of them come true than it is to make a list where half of all predictions seem obviously false and half seem obviously true and have half of them come true. That’s the only thing I’m claiming is true. I know you’ve said other things and I haven’t addressed them; that’s because I wanted to get consensus on this thing before talking about anything else.

• I agree that for the examples you’re naming (e.g., demanding strong evidence/resisting social pressure), there is a failure mode that looks like you’re going too far (e.g., being excessively dogmatic/being contrarian).

However, I don’t think that this failure mode actually results from identifying the underlying principle and then taking it to the extreme, and I think that’s an important point to clarify. For example, in the first case, the principle I see is something like “demand strong evidence for strongly held beliefs” or even more generally “believe things only as strongly as evidence suggests.” I don’t think it’s obvious that this principle can be taken too far. In particular, I think the following

A famous spoof article jokes that we don’t know parachutes are reliable because we don’t have a randomised controlled trial.

is not an example of doing that. Rather, the mistake here is something like “equating rationality with academic science.” We don’t have a formally conducted study on the effectiveness of parachutes, and if you think that’s the only evidence that counts, you might mistrust parachutes. But, as a matter of fact, we have excellent evidence to believe that parachutes work, and believing this evidence is perfectly rational. So you cannot arrive at a mistrust of parachutes by having high standards for evidence; you can only arrive at it by being wrong about what kind of evidence does and doesn’t count.

Again, I only mean this as a clarification, not as a counterpoint. It is still absolutely possible to go wrong in the ways you describe, and avoiding that is important.

• (Edit: deleted a line based on tone. Apologies.)

Everything except your last two paragraphs argues that a single 50% prediction can be flipped, which I agree with. (Again: for every n predictions, there are 2^n ways to phrase them, and precisely 2 of them are maximally bold. If you have a single prediction, then 2^1 = 2. There are only two ways, both are maximally bold and thus equally bold.)

When it comes to a list of 50% predictions, it’s impossible to evaluate the impressiveness only by looking at how many came true, since it’s arbitrary which way they are phrased.

I have proposed a rule that dictates how they are phrased. If this rule is followed, it is not arbitrary how they are phrased. That’s the point.

Again, please consider the following list:

• The price of a barrel of oil at the end of 2020 will be between $50.95 and $51.02 (50%)

• Tesla’s stock price at the end of the year 2020 is between 512$ and 514$ (50%)

• ...

You have said that there is no difference between both lists. But this is obviously untrue. I hereby offer you 2000$ if you provide me with a list of this kind and you manage to have, say, at least 10 predictions where between 40% and 60% come true. Would you offer me 2000$ if I presented you with a list of this kind:

• The price of a barrel of oil at the end of 2020 will be between $50.95 and $51.02 (50%)

• Tesla’s stock price at the end of the year 2020 is below 512$ or above 514$ (50%)

and between 40% and 60% come true? If so, I will PM you one immediately.

I think you’re stuck at the fact that a 50% prediction also predicts the negated statement with 50%; therefore you assume that the entire post must be false, and therefore you’re not trying to understand the point the post is making. Right now, you’re arguing for something that is obviously untrue. Everyone can make a list of the second kind; no-one can make a list of the first kind. Again, I’m so certain about this that I promise you 2000$ if you prove me wrong.

• As has been noted, the impressiveness of the predictions has nothing to do with which way round they are stated; predicting P at 50% is exactly as impressive as predicting ¬P at 50% because they are literally the same.

If that were true, then the list

• The price of a barrel of oil at the end of 2020 will be between $50.95 and $51.02 (50%)

• Tesla’s stock price at the end of the year 2020 is between 512$ and 514$ (50%)

• ⋯ (more extremely narrow 50% predictions)

and the list

• The price of a barrel of oil at the end of 2020 will be between $50.95 and $51.02 (50%)

• Tesla’s stock price at the end of the year 2020 is below 512$ or above 514$ (50%)

• ⋯ (more extremely narrow 50% predictions where every other one is flipped)

would be equally impressive if half of them came true. Unless you think that’s the case, it immediately follows that the way predictions are stated matters for impressiveness.

It doesn’t matter in the case of a single 50% prediction, because in that case, one of the phrasings follows the rule I propose, and the other follows the inverse of the rule, which is the other way to maximize boldness. As soon as you have two 50% predictions, there are four possible phrasings and only two of them maximize boldness. (And with n predictions, 2^n possible phrasings and only 2 of them maximize boldness.)
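The counting claim can be checked directly. A minimal sketch in Python (the predictions and baseline probabilities are invented for illustration; a prediction is a pair of stated confidence and the baseline probability of the statement as phrased, and flipping it replaces (p, b) with (1 − p, 1 − b)):

```python
from itertools import product

# Hypothetical narrow 50% predictions: (stated confidence, baseline probability).
predictions = [(0.5, 0.01), (0.5, 0.02), (0.5, 0.05)]
n = len(predictions)

def follows_rule(items):
    # The proposed rule: every stated confidence is above its baseline.
    return all(p > b for p, b in items)

def follows_inverse(items):
    # The inverse of the rule: every stated confidence is below its baseline.
    return all(p < b for p, b in items)

all_phrasings = []
for flips in product([False, True], repeat=n):
    items = [(1 - p, 1 - b) if f else (p, b)
             for f, (p, b) in zip(flips, predictions)]
    all_phrasings.append(items)

count_rule = sum(follows_rule(items) for items in all_phrasings)
count_inverse = sum(follows_inverse(items) for items in all_phrasings)

print(len(all_phrasings))          # 8, i.e. 2**3 possible phrasings
print(count_rule + count_inverse)  # 2: the rule and its inverse
```

With three predictions there are 2³ = 8 phrasings, and only the all-above-baseline and all-below-baseline ones are maximally bold, matching the count in the comment above.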

The person you’re referring to left an addendum in a second comment (as a reply to the first) acknowledging that phrasing matters for evaluation.

• I’m very competitive and my self-worth is mostly derived from social comparison, a trait which at worst can cause me to value winning over maintaining relationships, or cause me to avoid people who have higher status than me to avoid upward comparison. In reading LW and rationalist blogs, I think I’ve turned away from useful material that takes longer for me to grasp because it makes me feel inferior. I sometimes binge on low-quality material, sometimes even seeking out highly downvoted posts; I suspect I do this because it allows me to mentally jeer at people or ideas I know are incorrect.

I want to share that I have done this as well. In my case, I would be slightly more charitable and claim that the motivation was not to jeer at people who say incorrect things but to derive a feeling that I myself am doing okay. LessWrong has very high standards and there are a lot of impressive people here, which can make it terrifying for those of us who have the deeply rooted instinct to compare ourselves to whatever people we see around us. So if I see something downvoted, it gives me reassurance that I at least must be above some vaguely defined bar.

• I might have been unclear, but I didn’t mean to conflate them. The post is meant to be just about impressiveness. I’ve stated in the end that impressiveness is boldness × accuracy (which I probably should have called calibration). It’s possible to have perfect accuracy and zero boldness by making predictions about random number generators.

I disagree that 50% predictions can’t tell you anything about calibration. Suppose I give you 200 statements with baseline probabilities, and you have to turn them into predictions by assigning them your own probabilities while following the rule. Once everything can be evaluated, the results on your 50% group will tell me something about how well calibrated you are.

(Edit: I’ve changed the post to say impressiveness = calibration × boldness)

• “Always phrase predictions such that the confidence is above the baseline probability”—This really seems like it should not matter. I don’t have a cohesive argument against it at this stage, but reversing should fundamentally be the same prediction.

So I’ve thought about this a bit more. It doesn’t matter how someone states their probabilities. However, in order to use your evaluation technique we just need to transform the probabilities so that all of them are above the baseline.

Yes, I think that’s exactly right. Statements are symmetric: 50% that X happens is the same as 50% that ¬X happens. But evaluation is not symmetric. So you can consider each prediction as making two logically equivalent claims (X happens with probability p, and ¬X happens with probability 1 − p) plus stating which one of the two you want to be evaluated on. But this is important because the two claims will miss the “correct” probability in different directions. If 50% confidence is too high for X (Tesla stock price is in the narrow range), then 50% is too low for ¬X (Tesla stock price outside the narrow range).
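The direction-of-error point is simple arithmetic; a sketch with an assumed true probability of 5% for the narrow-range statement:

```python
true_p = 0.05  # assumed true probability that the price lands in the narrow range
stated = 0.50  # the 50% prediction

# The same logical claim, evaluated in the two directions:
error_on_x = stated - true_p                   # 50% for "in range": too high
error_on_not_x = (1 - stated) - (1 - true_p)   # 50% for "outside range": too low

print(round(error_on_x, 2), round(error_on_not_x, 2))  # 0.45 -0.45
```

Both phrasings miss the true probability by the same amount, but in opposite directions, which is why the rule about which phrasing gets evaluated carries information.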

(Plus, in any case, it’s not clear that we can always agree on a baseline probability.)

I think that’s the reason why calibration is inherently impressive to some extent. If it were actually boldness multiplied by calibration, then you should not be impressed at all whenever the boldness pile and confidence pile have identical height. And I think that’s correct in theory; if I just make predictions about dice all day, you shouldn’t be impressed at all regardless of the outcome. But since it takes some skill to estimate the baseline, for all practical purposes boldness doesn’t go to zero.

# How to evaluate (50%) predictions

10 Apr 2020 17:12 UTC
116 points
• I confidently reject the Doomsday argument, so it doesn’t have any implications.

• I might be confused here, but it seems to me that it’s easy to interpret the arguments in this post as evidence in the wrong direction.

I see the following three questions as relevant:

1. How much sets human brains apart from other brains?

2. How much does the thing that humans have and animals don’t matter?

3. How much does better architecture matter for AI?

Questions #2 and #3 seem positively correlated – if the thing that humans have is important, it’s evidence that architectural changes matter a lot. However, holding #2 constant, #1 and #3 seem negatively correlated – the less stuff there is that makes humans special, the smaller the improvements to architecture that are required to achieve greater performance.

Since this post is arguing primarily about #1, the way it affects #3 is potentially confusing.

• Strong upvote from me. This new technology has helped me view the existing content from a different angle.

• Is there a reason why it wouldn’t be strongly correlated?

Your “serious” modifier sounds to me like you’re envisioning the consensus among the masses to change while smart people are more sober. I was largely assuming that, in the worlds where Aubrey’s prediction is true, actual life expectancy does, in fact, increase along with the awareness shift. Note that it’s expectancy rather than actual life span.

Pensions might be a good pointer.