A few weeks ago I started assessing my own calibration, using tools such as the CFAR calibration game. I got fairly good at it and concluded that I am relatively well calibrated.
When given a question, my instincts would immediately throw out a number. I’d unpack it and adjust it to account for known biases. (Avoid representativeness, start from base rates, treat the initial number as a degree of support, factor in strength of evidence, etc.)
Yesterday, an assessment of probability came up in conversation. Immediately, my instincts threw out the number “80%”. My thoughts went like this:
My gut says 80%. I’m well calibrated, so 80% is probably right.
I opened my mouth to speak.
Then I shut my mouth.
I understand Löb’s theorem on an intuitive level now.
I achieved good calibration by paying attention to evidence and avoiding known biases as well as I was able. Once I had established reliable calibration, I experienced temptation to justify my first intuitive instinct by asserting my own calibration.
My calibration was based upon mediating my intuitions with reason. I can’t invoke it to trust any old estimate that comes out of my mouth: if I did, then I could say whatever I wanted and trust it to be calibrated (which would yield probability estimates out of touch with reality). Hence the parallels to Löb’s theorem.
A few weeks ago I started assessing my own calibration, using tools such as the CFAR calibration game. I got fairly good at it and concluded that I am relatively well calibrated.
TIL this app exists. Thank you.
After using it for half an hour it turns out that I am well calibrated at all probabilities except 80% and 90%. Weird.
Anyhow, could this be combined with a program such as Anki? Meaning that you recall the answer in your head and indicate how certain you are, in percent. If you’re correct, the card is scheduled further into the future accordingly. This should work splendidly for learning vocab in linguistically close languages.
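For what it’s worth, here is a minimal sketch of how such a scheduler might look. Everything here is hypothetical (Anki has no built-in confidence input); the point is just that a correct answer at high stated confidence could push the card further into the future than a correct answer at low confidence:

```python
def next_interval(current_days, confidence, correct):
    """Hypothetical confidence-weighted scheduling rule.

    confidence: the learner's stated probability of being right (0.5-0.99).
    """
    if not correct:
        return 1  # missed the card: see it again tomorrow
    # Scale the interval by stated confidence: 50% sure roughly doubles
    # the interval, 99% sure roughly triples it.
    factor = 1 + 2 * confidence
    return round(current_days * factor)

print(next_interval(4, 0.50, True))   # 8
print(next_interval(4, 0.99, True))   # 12
print(next_interval(30, 0.90, False)) # 1
```

A real implementation would also want to penalize confidently wrong answers more than hesitantly wrong ones, which is exactly the information a calibration-style input would add over Anki’s plain pass/fail grading.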
Hey, thanks for mentioning this. I hadn’t heard about it.
I’ve tried my hand at this app (50 questions or so), and it seems like the correct strategy, for me, is to go 50% for anything I have a little doubt on, and 99% for anything I’m sure about. Maybe 5% of the questions fall into the 60%-90% range.
I’m still working to understand the tutorial and how to interpret my results.
It’s not particularly hard to “perfect” your calibration in that game—if you’re running over or under at a certain confidence level, you can throw questions where you’re confident into levels where you’re “poorly calibrated” in order to spoof a good calibration curve.
The trick to that game, if you actually want to assess your calibration, is to play for points rather than for a good curve. Being well-calibrated means that when you play for points, you have a good curve automatically.
(I wish that they’d show you your curve less often, perhaps only when you leave the game. It’s hard to resist cheating the curve. Then again, I’m not sure of a better way to provide the necessary feedback.)
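A quick numeric illustration of why spoofing the curve costs points (assuming a generic logarithmic scoring rule, not necessarily the game’s exact formula): suppose you hold 10 questions you’d get right 95% of the time and 5 coin-flip questions. Reporting 80% on all fifteen makes the “80%” bucket come out right 80% of the time on average (12 of 15), so the curve looks perfect, but the expected score is worse than honest reporting.

```python
from math import log2

def expected_score(p, r):
    # Average log score per question when you are right with
    # probability p but report probability r.
    return p * log2(r) + (1 - p) * log2(1 - r)

# Honest: report 95% on the 10 near-sure questions, 50% on the 5 coin flips.
honest = 10 * expected_score(0.95, 0.95) + 5 * expected_score(0.50, 0.50)

# Spoofed: report 80% on all 15, so the "80%" bucket averages
# (10*0.95 + 5*0.50) / 15 = 0.80 -- a perfect-looking calibration point.
spoofed = 10 * expected_score(0.95, 0.80) + 5 * expected_score(0.50, 0.80)

print(round(honest, 2), round(spoofed, 2))  # honest scores higher
```

So the spoofed strategy buys a prettier curve at the cost of points, which is why playing for points is the honest test.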
I’m not strong enough in math to figure out how the scoring actually works without spending some time with it, and I wouldn’t “throw” questions anyway. But I do like seeing that, say, on my 60%s I’m actually right 70% of the time. So when I’m feeling “60%” I should actually go with 70% more often. I think I’m afraid of getting questions wrong because the score penalty appears so high relative to the score bonus (I know that’s likely appropriate, even though I don’t understand the actual log-bits scoring).
The scoring is done so that if you have 70% of your answers right, then you get the best average score by guessing 70%, not 60%. The increased penalty you get for getting 30% of those answers wrong is smaller than the increased gain for getting 70% of them right.
But that’s true only as long as you really get 70% of them right; so changing your answer e.g. to 80% while being only 70% correct would decrease the average score, because then the increased penalty for getting 30% of those answers wrong would be greater than the increased gain for getting the 70% right.
Without understanding the log bits, you can easily verify this in a spreadsheet. Make a formula saying how many points you get if you report probability R and really get a fraction P of the answers right. Playing with the numbers, you will find that for a given P, you get the highest average score for R = P.
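That spreadsheet exercise can also be done as a short Python sketch. This assumes a plain logarithmic scoring rule (log2(R) for a correct answer, log2(1-R) for a wrong one), which may differ from the game’s exact formula, but any proper scoring rule behaves the same way:

```python
from math import log2

def expected_score(p, r):
    """Average score per question if you are right a fraction p
    of the time but report probability r."""
    return p * log2(r) + (1 - p) * log2(1 - r)

p = 0.70  # you actually get 70% of these questions right
candidates = [i / 100 for i in range(1, 100)]  # reports 1% .. 99%
best_r = max(candidates, key=lambda r: expected_score(p, r))
print(best_r)  # 0.7
```

The maximum at R = P is a general property of logarithmic (and other “proper”) scoring rules, which is why playing for points rewards reporting your true confidence.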