Sorry, should have been clearer. I will make a note to devote less effort to humor and more to clarity in my comments in future…
A calibration test would consist, I presume, of questions of the form “estimate the value of parameter X, and then give lower and upper bounds L and U such that the probability of parameter X lying in [L,U] is 90%”. You are “well calibrated” if the actual value of X is in [L,U] roughly 90% of the time.
But you can do very well on such a test by picking a good ignorance prior over the parameter space for X (for example, the uniform distribution, if X takes values in a finite set), sampling randomly from that distribution, and then randomly choosing L and U such that 90% of the probability mass is contained in [L,U] and your random guess is contained in [L,U]. On average, you will come out as well calibrated (if there is a statistics expert here, please correct me if I’m wrong…), even though this procedure is entirely mechanical and doesn’t involve any real thought. Someone who actually thought hard about what the value of X was would inevitably be (at least slightly) overconfident and would do worse. See:
http://www.overcomingbias.com/2008/10/expected-creati.html
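To illustrate, here is a minimal simulation of that mechanical strategy, under the purely illustrative assumption that the true values of X are themselves uniform on [0, 1]; the strategy uses no knowledge at all, yet comes out roughly 90% calibrated:

```python
import random

def mechanical_interval():
    """Sample a 'guess' from the uniform ignorance prior, then pick an
    interval [lo, hi] that covers 90% of the prior mass and contains it."""
    guess = random.random()
    lo = random.uniform(max(0.0, guess - 0.9), min(0.1, guess))
    return lo, lo + 0.9

trials = 100_000
hits = 0
for _ in range(trials):
    x = random.random()              # the 'true' value of parameter X
    lo, hi = mechanical_interval()
    hits += lo <= x <= hi

print(f"coverage: {hits / trials:.3f}")  # ~0.900, with no thought involved
```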
Use the Bayes-score (log of final joint probability) as the primary outcome; measure calibration only secondarily.
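For concreteness, a minimal sketch of the log (“Bayes”) score: the subject assigns a probability to the true answer of each question, and the score is the sum of the log-probabilities, i.e. the log of the joint probability if the questions are treated as independent. The numbers below are made up:

```python
import math

def bayes_score(probs_assigned_to_truth):
    """Sum of log-probabilities assigned to the true answers."""
    return sum(math.log(p) for p in probs_assigned_to_truth)

# A confident, accurate subject vs. a maximally ignorant one,
# on three binary questions:
print(bayes_score([0.9, 0.8, 0.95]))  # ≈ -0.380
print(bayes_score([0.5, 0.5, 0.5]))   # ≈ -2.079
```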
… In which case you’re measuring knowledge about the questions asked, not calibration. My little sister could beat you on such a test if it were about pop idols.
You could design a good test where the score was some combination of calibration and knowledge, such that someone with less knowledge but better calibration could outscore someone with better knowledge but poorer calibration.
Something like (calibration) * (Bayes Score), perhaps?
Nick suggested something like this:
http://www.overcomingbias.com/2007/01/a_game_for_self.html
He solves the problem that a measure like the Bayes Score will test for narrow knowledge by suggesting that the questions be very general.
You maximize the Bayes Score iff you use all your knowledge as well as possible. This seems to indicate that any perturbation of the score will introduce an incentive not to do so.
Ask completely ridiculous things. Estimate the probability that the yearly rainfall in Ghana exceeds that of Switzerland. Ask questions like that, and you will learn something about how much true general knowledge a person has gained (and why not—a rationalist should absorb more true general knowledge in X years on earth than a non-rationalist), but much more about the subject’s ability to honestly estimate their own ignorance.
“you maximize Bayes Score iff you use all your knowledge as well as possible. ”
Yes, but in a test where you have no knowledge (e.g., Eliezer is a great rationalist but knows nothing about Pokemon) this is unhelpful… This test would work well for ranking rationalists iff you had a set of general-knowledge questions that you were confident everyone had roughly the same amount of knowledge about.
The test would also work statistically to measure the effect of an intervention, provided you had enough subjects relative to the variance. A test with too much variance can’t be used organizationally (to rank individuals), but it can be used experimentally.
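As a sketch of that experimental use (the effect size, noise level, and sample size below are all made up for illustration), even a noisy test can detect an intervention once there are enough subjects:

```python
import random
import statistics

random.seed(0)
control = [random.gauss(0.0, 1.0) for _ in range(200)]  # noisy test scores
treated = [random.gauss(0.3, 1.0) for _ in range(200)]  # small true effect

def mean_diff_z(a, b):
    """Approximate z-statistic for the difference in group means."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.fmean(b) - statistics.fmean(a)) / se

print(f"z = {mean_diff_z(control, treated):.2f}")  # typically well above 2
```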
If you are asked about Pokemon, AI design, 13th-century Chinese history, Martian geology, German literature, Yankees batting averages, lyrics to popular songs from the 1820s, etc., you would be forced to get maximal mileage out of whatever knowledge you can bring to bear on each question, which would in most cases be slim to none.
If the questions are chosen randomly and eclectically enough, there should be no way to game the system, and scores should average out for people knowledgeable in different areas.
If you dependably know more than I do across a broad spectrum of subject areas, then I would assume that you have learned more than I have during your life so far, which seems to me to be symptomatic of good rationality.
“across a broad spectrum of subject areas … questions are chosen randomly”
But this is the real weasel in there. Defining a good prior on “subject areas” is problematic. A very rational nerd would get wiped out if there are too many trivia questions… which is what happened to me just now on Tom’s rationality test:
http://www.acceleratingfuture.com/tom/calibrate.php
Though my calibration on this test was very good, my Bayes Score was rubbish. Most of the questions were about America (cultural bias), and most were about people (subject-area bias). I like my idea of (calibration) * (Bayes Score).
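For what it’s worth, here is one hypothetical way to make the (calibration) * (Bayes Score) product well-behaved. The raw log score is negative, so multiplying it directly by a calibration factor in [0, 1] would reward poor calibration; this sketch therefore uses the geometric-mean probability exp(mean log score) instead. Every name and formula here is illustrative, not something anyone in the thread proposed:

```python
import math

def combined_score(probs_on_truth, stated_conf, empirical_hit_rate):
    """Calibration factor (in [0, 1]) times geometric-mean probability (in (0, 1])."""
    mean_log = sum(math.log(p) for p in probs_on_truth) / len(probs_on_truth)
    knowledge = math.exp(mean_log)
    calibration = 1.0 - abs(stated_conf - empirical_hit_rate)
    return calibration * knowledge

# e.g. 90% intervals that actually hit 85% of the time:
print(combined_score([0.9, 0.7, 0.6], stated_conf=0.9, empirical_hit_rate=0.85))
```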
Then use more obscure questions.
Test for data, factual knowledge, and counterfactual knowledge. True rationalists will have less counterfactual knowledge than non-rationalists because they will have filtered it out. Non-rationalists will have more false data because their counterfactual knowledge will feed back and cause them to believe that things that are false are actually true, for example, that Iraq or Iran was involved in 9/11.
What you really want to measure is the relative proportion of factual and counterfactual knowledge someone has, and in what particular areas. Including areas like religion, medicine, alternative medicine, and politics in the testing space is then advantageous, because you can see the regions of idea space where individuals are most non-rational.
This can be tricky because many individuals are extremely invested in their counterfactual knowledge and will object to it being identified as counterfactual. A lot of fad-driven science is based on counterfactual knowledge, but the faddists don’t want to acknowledge that.
A way to test this would be to see how well people can differentiate correct facts (data), factual knowledge (based on and consistent with only data), counterfactual knowledge (based on false facts and not consistent with correct facts), and opinions consistent with either true or false facts.
An example: in the neurodegenerative disease Alzheimer’s, accumulation of amyloid is associated with dementia. It has not been established whether amyloid is a cause of the dementia, an effect of it, or merely associated with it. However, there have been studies in which amyloid was removed (via vaccination against amyloid, with the immune system clearing it) with no improvement in the dementia.
I imagine a list of a very large number of statements to be labeled as

1. true (>99% likelihood)
2. false (>99% likelihood to be false) [edited to improve definition of false]
3. opinion based on true facts
4. opinion based on false ideas
5. no one knows
6. I don’t know
A list of some examples:

- Iraq caused 9/11: 2
- WMD were found in Iraq: 2
- Amyloid is found in Alzheimer’s: 1
- Amyloid causes Alzheimer’s: 2 (this happens to be a field I am working in, so I have non-public knowledge as to the real cause)
- Greenhouse gases are causing GW: 1
- Vaccines cause autism: 2
- Acupuncture is a placebo: 1
- There is life on Mars: 5
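A minimal sketch of how such a test could be represented and scored, purely for illustration; the statements and labels are copied from the list above, and the naive agreement score is a hypothetical stand-in for a real scoring rule:

```python
ANSWER_KEY = {  # labels follow the 1-6 scheme above
    "Iraq caused 9/11": 2,
    "WMD were found in Iraq": 2,
    "Amyloid is found in Alzheimer's": 1,
    "Amyloid causes Alzheimer's": 2,
    "Greenhouse gases are causing GW": 1,
    "Vaccines cause autism": 2,
    "Acupuncture is a placebo": 1,
    "There is life on Mars": 5,
}

def agreement(responses):
    """Fraction of statements labeled the same way as the answer key."""
    return sum(responses.get(s) == lab for s, lab in ANSWER_KEY.items()) / len(ANSWER_KEY)

print(agreement({"Iraq caused 9/11": 2, "There is life on Mars": 5}))  # 0.25
```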
You don’t want to test for obscure things; you want to test for common beliefs that are wrong. I think you also want to explicitly tell people that you are testing them for rationality, so they can put themselves into “rational mode” (a state that is not always socially acceptable).
The table-like lists look fine in the edit box but not fine once I post. :(
http://daringfireball.net/projects/markdown/syntax
Thanks, I was trying to make a list; maybe I will figure it out. I just joined and am trying to focus on getting up to speed on the ideas; the syntax of formatting things is more difficult for me and less rewarding.
There’s also a help link under the comment box.
Yes, thank you. Just one problem: too obvious and too easy.