But if I understand these correctly, they are combined measurements of both domain-ability and calibration.
You understand correctly, though I would say “accuracy” rather than “domain-ability”.
So (I think) someone who always said 70% confidence and got 70% of the questions right would get a higher score than someone who always said 60% confidence and got 60% of the questions right, even though they are both equally well calibrated.
This is also correct. A problem with trying to isolate calibration is that on a true/false test, the subject could always assign 50% probability to both true and false and be right 50% of the time, achieving perfect calibration. A subject whose only goal was a good calibration score would do this. More generally, multiple-choice questions can be answered with maximum-entropy (maxent) probability distributions, achieving the same result. Open-ended questions are harder to game, but it is also harder to figure out the probability assigned to the correct answer in order to compute the score.
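To make both points concrete, here is a minimal sketch (in Python, using the standard log scoring rule; the function name is mine) comparing the two calibrated subjects above, and showing that the 50%-everywhere strategy buys perfect calibration at the cost of the worst score a calibrated subject can get:

```python
import math

def expected_log_score(p):
    """Expected log score per question for a perfectly calibrated subject
    who always reports confidence p (and so is correct with probability p)."""
    return p * math.log(p) + (1 - p) * math.log(1 - p)

print(expected_log_score(0.7))  # ~ -0.611
print(expected_log_score(0.6))  # ~ -0.673: lower, despite equal calibration
print(expected_log_score(0.5))  # ~ -0.693: perfect calibration for free,
                                # but the worst score among calibrated subjects
```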
One approach I considered is asking for confidence intervals that have a given (test-giver-specified) probability of containing the correct numerical answer. However, this is also gameable: the subject can mix the always-correct interval from negative infinity to positive infinity with the always-incorrect empty interval to hit the target success rate exactly.
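A quick sketch of that gaming strategy, assuming the test-giver asks for, say, 80% intervals (all names here are illustrative, and the simulated questions are arbitrary):

```python
import random

def gamed_interval(target):
    """Game a coverage-based test: report the whole real line with
    probability `target` and the empty interval otherwise, so the
    hit rate equals `target` no matter what the questions are."""
    if random.random() < target:
        return (float("-inf"), float("inf"))  # always contains the answer
    return None  # the empty interval: never contains the answer

def covers(interval, answer):
    return interval is not None and interval[0] <= answer <= interval[1]

random.seed(0)
answers = [random.gauss(0, 100) for _ in range(100_000)]  # arbitrary questions
hit_rate = sum(covers(gamed_interval(0.8), a) for a in answers) / len(answers)
print(hit_rate)  # ~ 0.8, achieved with zero knowledge of the answers
```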
That said, I don’t think it is much of a problem that scoring rules represent a mix of calibration and accuracy, since it is exactly this mix that determines a person’s ability to report useful probabilities.
A problem with trying to isolate calibration is that on a true/false test, the subject could always assign 50% probability to both true and false and be right 50% of the time, achieving perfect calibration.
Interestingly, this problem can be avoided by taking the domain of possible answers to be the natural numbers, or n-dimensional Euclidean space, etc., over which no uniform distribution is possible, and then asking your test subject to specify a probability distribution over the whole space. This is potentially impractical, though, and I’m not certain it can’t be gamed in other ways.
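A sketch of why this blocks the maxent dodge, using a geometric distribution as one legitimate distribution over the naturals (the particular distribution and the log scoring rule are my assumptions, not part of the proposal):

```python
import math

def geometric_pmf(k, q=0.5):
    """P(K = k) = (1 - q) * q**k for k = 0, 1, 2, ... sums to 1 over the
    naturals. No uniform analogue exists: constant mass c > 0 on every
    natural would sum to infinity, so probability must decay somewhere."""
    return (1 - q) * q ** k

def log_score(pmf, true_answer):
    """Score = log of the probability mass the subject put on the truth."""
    return math.log(pmf(true_answer))

print(log_score(geometric_pmf, 3))   # ~ -2.77
print(log_score(geometric_pmf, 50))  # far worse: the mass had to go somewhere
```

The subject is forced to commit mass to some region of the answer space, so "hedge everywhere equally" is no longer an available strategy.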