I’d understood the question to be “given identical scores”, not “given a 10 point average difference in favor of the blue weasel”.
i.e. we take a random sample of 100 men and 100 women with SAT scores between 1200-1400 (high but not perfect scores). Are the male scores going to average better than the females?
My intuition says no: while I’d expect fewer females to be in that range to begin with, I can’t see any reason to assume their scores would cluster towards the lower end of the range compared to males.
i.e. we take a random sample of 100 men and 100 women with SAT scores between 1200-1400 (high but not perfect scores). Are the male scores going to average better than the females?
So, first let’s ask this question, supposing that the test is perfectly accurate. We’ll run through the numbers separately for the two subtests (so we don’t have to deal with correlation), taking means and variances from here.
Of those who scored 600-700 on the hypothetical normally distributed math SAT (henceforth “HNDMSAT”), the male mean was 643.3 (with 20% of the male population in this band), and the female mean was 640.6 (with 14.8% of the female population in this band).
Of those who scored 600-700 on the HNDVSAT, the male mean was 641.0 (with 14.9% of the male population in this band), and the female mean was 640.1 (with 13.7% of the female population in this band).
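For anyone who wants to redo this kind of band calculation, here is a minimal sketch of the share and mean of a normal distribution restricted to a score band. The population means and standard deviations passed in at the bottom are hypothetical placeholders, not the figures used above, so the output should not be expected to match the quoted numbers exactly.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def band_stats(mu, sigma, lo, hi):
    """Return (share of the population in [lo, hi], mean score within the band)
    for scores distributed as Normal(mu, sigma)."""
    a = (lo - mu) / sigma
    b = (hi - mu) / sigma
    share = normal_cdf(b) - normal_cdf(a)
    # Mean of a normal distribution truncated to [lo, hi].
    mean = mu + sigma * (normal_pdf(a) - normal_pdf(b)) / share
    return share, mean


# Hypothetical parameters, chosen only for illustration.
male_share, male_mean = band_stats(mu=530.0, sigma=115.0, lo=600.0, hi=700.0)
female_share, female_mean = band_stats(mu=500.0, sigma=110.0, lo=600.0, hi=700.0)

print(f"male:   {male_share:.1%} in band, band mean {male_mean:.1f}")
print(f"female: {female_share:.1%} in band, band mean {female_mean:.1f}")
```

With a band that sits entirely above both population means, the group with the higher mean and larger spread has a flatter density across the band, so its in-band mean comes out higher.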
When we introduce the test error into the process, the computation gets a lot messier. The quick and dirty way to do things is to say “well, let’s just shrink the mean band scores towards the population mean with the reliability coefficient.” This turns the male edge on the HNDMSAT of 2.7 into 5.4, and the male edge on the HNDVSAT of 0.9 into 1.8. (I think it’s coincidental that this is roughly doubling the edge.)
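The quick and dirty shrinkage can be sketched as follows. The band means are the math figures quoted above; the population means and the reliability coefficient are assumptions picked for illustration, since the actual values are not stated in this comment.

```python
def shrink(band_mean, pop_mean, reliability):
    """Shrink an observed band mean towards its own group's population mean.
    This is the usual 'estimated true score' adjustment:
    pop_mean + reliability * (observed - pop_mean)."""
    return pop_mean + reliability * (band_mean - pop_mean)


# Band means quoted above for the math subtest.
male_band_mean = 643.3
female_band_mean = 640.6

# Assumed values, NOT taken from the source.
male_pop_mean = 530.0
female_pop_mean = 500.0
reliability = 0.9

male_adjusted = shrink(male_band_mean, male_pop_mean, reliability)
female_adjusted = shrink(female_band_mean, female_pop_mean, reliability)

edge_before = male_band_mean - female_band_mean
edge_after = male_adjusted - female_adjusted

print(f"edge before shrinkage: {edge_before:.2f}")
print(f"edge after shrinkage:  {edge_after:.2f}")
```

Algebraically, the edge after shrinkage is reliability * (edge before) + (1 - reliability) * (gap between the population means), so it grows whenever the population-mean gap is larger than the in-band gap. How close that comes to doubling depends entirely on the assumed inputs.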
My intuition says no: while I’d expect fewer females to be in that range to begin with, I can’t see any reason to assume their scores would cluster towards the lower end of the range compared to males.
That’s because you’re not thinking in bell curves. The range is all on one side of the mean, the male mean is closer to the bottom of the band, and the male variation is higher.
I’d understood the question to be “given identical scores”, not “given a 10 point average difference in favor of the blue weasel”.
My point was this: suppose that the true shrinkage leads to an adjusted difference of 10 points between the two groups; how much of a gift do 10 extra points represent? Using the nominal score rather than the true score has the effect of inflating the score. Once you’ve established how large the inflation might be, it’s natural to wonder how much real-world consequence it might have, which leads into the Harvard musings.
i.e. we take a random sample of 100 men and 100 women with SAT scores between 1200-1400 (high but not perfect scores). Are the male scores going to average better than the females?
It depends on the means and standard deviations of the two distributions; from those you could estimate how often the male sample average will be higher than the female sample average and vice versa.
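As a rough sketch of that estimate: compute each group's mean and variance within the band, then treat the difference between two sample averages of size n as approximately normal. The (mean, SD) pairs are hypothetical, a single 600-700 subtest band stands in for the combined 1200-1400 range, and the normal approximation for the sample averages is itself an assumption.

```python
import math


def pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def band_mean_var(mu, sigma, lo, hi):
    """Mean and variance of Normal(mu, sigma) truncated to [lo, hi]."""
    a = (lo - mu) / sigma
    b = (hi - mu) / sigma
    z = cdf(b) - cdf(a)
    shift = (pdf(a) - pdf(b)) / z
    mean = mu + sigma * shift
    var = sigma ** 2 * (1.0 + (a * pdf(a) - b * pdf(b)) / z - shift ** 2)
    return mean, var


def prob_first_sample_mean_higher(group1, group2, lo, hi, n):
    """Approximate P(average of n in-band scores from group1 exceeds the
    average of n in-band scores from group2), using a normal approximation."""
    m1, v1 = band_mean_var(*group1, lo, hi)
    m2, v2 = band_mean_var(*group2, lo, hi)
    standard_error = math.sqrt((v1 + v2) / n)
    return cdf((m1 - m2) / standard_error)


# Hypothetical (mean, SD) pairs, for illustration only.
p = prob_first_sample_mean_higher((530.0, 115.0), (500.0, 110.0),
                                  lo=600.0, hi=700.0, n=100)
print(f"P(first group's sample average is higher) is about {p:.2f}")
```

With in-band differences of only a few points and an in-band spread of a few dozen points, samples of 100 will quite often come out in the "wrong" order.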
The question should be ‘if we retest these 1200-1400 scorers, what will happen?’ The scores will probably drop as they regress to their mean due to an imperfect test. That’s the point.
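The retest thought experiment is easy to simulate. This is only a sketch under assumed numbers: the population mean, observed SD and reliability are hypothetical, both sittings get independent normal error of the same size, and there is no practice or test-prep effect. It again uses a single 600-700 subtest band rather than the combined 1200-1400 range.

```python
import math
import random


def simulate_retest(mu, sd_observed, reliability, lo, hi, n, seed=0):
    """Draw n test-takers, keep those whose first score lands in [lo, hi],
    and return (mean first score, mean retest score) for that group."""
    rng = random.Random(seed)
    # Split the observed variance into true-score and error components.
    sd_true = sd_observed * math.sqrt(reliability)
    sd_error = sd_observed * math.sqrt(1.0 - reliability)
    first, second = [], []
    for _ in range(n):
        true_score = rng.gauss(mu, sd_true)
        test1 = true_score + rng.gauss(0.0, sd_error)
        if lo <= test1 <= hi:
            # Same underlying ability, fresh measurement error on the retest.
            test2 = true_score + rng.gauss(0.0, sd_error)
            first.append(test1)
            second.append(test2)
    return sum(first) / len(first), sum(second) / len(second)


# Hypothetical parameters, for illustration only.
first_mean, retest_mean = simulate_retest(mu=500.0, sd_observed=110.0,
                                          reliability=0.9,
                                          lo=600.0, hi=700.0, n=200_000)
print(f"mean first score of selected group: {first_mean:.1f}")
print(f"mean retest score of the same group: {retest_mean:.1f}")
```

Under these assumptions the retest average of the selected group falls back towards the population mean by roughly (1 - reliability) times its distance from that mean, which is the same shrinkage as the quick and dirty adjustment.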
Ahhh, that makes the statistics click in my brain, thanks :)
Do you know if there is much data out there on real-world gender differences vis-a-vis regression to the mean on IQ / SAT / etc. tests? i.e. is this based on statistics, or is it borne out in empirical observations?
I haven’t seen any, offhand. Maybe the testing company provides info about retests, but then you’re going to have different issues: anyone who takes the second test may be doing so because they had a bad day (giving you regression to a mean from the other direction) and may’ve boned up on test prep since, and there’s the additional issue of test-retest effect—now that they know what the test is like, they will be less anxious and will know what to do, and test-takers in general may score better. (Since I’m looking at that right now, my DNB meta-analysis offers a case in point: in many of the experiments, the controls have slightly higher post-test IQ scores. Just the test-retest effect.)