My default is to assume that men and women are pretty similar.
How do you reconcile this view with the way questions of tone have become entangled with gender issues in this very thread?
There was that discussion of ignoring good test results from a member of a group if you already believe that they’re bad at whatever was being tested. (They were referred to as blues, but it seemed to be a reference to women and math.) It was a case of only identifying with the gatekeeper.
It was also an extremely straightforward application of Bayes’s theorem.
No thought about the unfairness
The problem is that the concept of “fairness” you are using there is incompatible with VNM-utilitarianism. (If somebody disagrees with this, please describe what the term in one’s utility function corresponding to fairness would look like.)
I’m not sure how much anyone has been convinced that women have actual points of view
Where has anyone claimed they don’t? At least beyond the general rejection of qualia?
My default is to assume that men and women are pretty similar.
How do you reconcile this view with the way questions of tone have become entangled with gender issues in the very thread?
I was surprised at how strongly some people (probably mostly women) are uncomfortable with the tone here, so I have a lot to update.
I don’t like emoticons much—I don’t hate people who use them, but I use emoticons very rarely, and I’m not comfortable with them. I still find it hard to believe that if people do something a lot, there’s a reasonable chance (if they aren’t being paid) that they like it a lot, even though I can’t imagine liking whatever it is.
I don’t know what proportion of people are apt to interpret lack of overt friendliness as dislike, nor what the gender split is.
In the spirit of exploration, I took a look at Ravelry, a major knitting and crocheting blog. I haven’t found major discussions there yet. I’m interested in examples of blogs with different emotional tones/courtesy rules/gender balances.
Now that I think about it, blogs that are mostly women may be more likely to have overt statements of strong friendship and support. I believe that sort of effusiveness is partly cultural—wasn’t it more common for both men and women at least from the colonial era (US) to the Victorian era?
There was that discussion of ignoring good test results from a member of a group if you already believe that they’re bad at whatever was being tested. (They were referred to as blues, but it seemed to be a reference to women and math.) It was a case of only identifying with the gatekeeper.
It was also an extremely straightforward application of Bayes’s theorem.
That depends on how much you demand of your priors, and low quality priors is something that makes me nervous about Bayes.
For this particular case, there’s no examination of how much variance on the high side people get on tests. In particular, it seems very unlikely that people will get scores much above their baseline on tests about any sophisticated subject, though various factors (illness and other distractions) could drive their scores below their baseline.
What’s VHF Utilitarianism? Is there any utilitarian cost to some capable people giving up because they believe rightly that their accomplishments will be discounted?
I’m not sure how much anyone has been convinced that women have actual points of view
Where has anyone claimed they don’t? At least beyond the general rejection of qualia?
My language may have been hyperbolic and/or vague. I was thinking of “creepiness = low status” which sounds to me like “it’s so unfair that women don’t want to spend time with men they’re uncomfortable around”. In this case, I was thinking “lack of point of view”, but “preferences are irrelevant” might be more accurate.
I think I’ve interpreted “creepiness = low status” as, “it’s unfair that low-status men get labeled as creepy for behavior that high-status men would get away with.”
Of course, one could respond that making people at least feel comfortable around you is an easy way to improve your status. :)
My language may have been hyperbolic and/or vague. I was thinking of “creepiness = low status” which sounds to me like “it’s so unfair that women don’t want to spend time with men they’re uncomfortable around”.
Is there any utilitarian cost to some capable people giving up because they believe rightly that their accomplishments will be discounted?
Well, this depends on the exact circumstances, but this may happen to the people who got unlucky on the test anyway, and using a better predictor decreases the number of people who get mischaracterized.
The von Neumann-Morgenstern theorem has nothing to do with utilitarianism, and it’s not about what you “should” do. Those words don’t appear in the statement of the theorem. The theorem does state that a VNM-rational agent has a preference ordering over lotteries of outcomes. In fact it can have any preferences over outcomes at all and still satisfy the hypotheses of the theorem. In particular, it can prefer fair outcomes to unfair outcomes for any definition of “fair”.
If you want to argue that one shouldn’t pursue fairness, you don’t want to use the VNM theorem.
The von Neumann-Morgenstern theorem has nothing to do with utilitarianism, and it’s not about what you “should” do.
Agreed, unfortunately a lot of people around here seem to interpret it this way.
In particular, it can prefer fair outcomes to unfair outcomes for any definition of “fair”.
I would argue that fairness is a property of a process rather than an outcome, e.g., a kangaroo court doesn’t become “fair” just because it happens to reach the same verdict a fair trial would have.
Downvoted Eugine for the same reason, and upvoted MugaSofer back to positive. I value honest feedback, and see no reason to downvote ’em for providing it.
Then why is it that this difference, out of the many dimensions of differences that form up humankind, and the multitude of interest-group formation patterns that could have been generated, is the one that gets so much attention? It would be bizarre if an unbiased deliberation process systematically decides that one unremarkable axis (gender) is the one difference that should be discussed at great length and with very vigorous champions, while ignoring all of the other axes of diversity of human minds.
Now it is possible for one unremarkable axis to become overwhelmingly dominant in coalition formation, but that would involve some fairly unpleasant implications about the truth-seekiness and utilitarian consequences of this sort of thinking.
I dunno about this. It seems that the difference between those concerned with an intelligence explosion and those concerned with other scenarios has gotten way more attention here than gender.
I wasn’t surprised on the occasions when questions of differences in tone between the two camps flared up when discussing that topic. I would have been shocked almost beyond belief if, when discussing that topic, questions of tone differences between men and women had arisen.
The idea is that, on almost every topic, men and women are very similar, because the differences aren’t relevant. When you begin looking at the differences, then you get amplifying effects. In particular, each participant being what they are and completely unable to change that means:
that the topic isn’t going to be to convert people from one camp to the other or otherwise influence their choice as in the example above, but it’s going to have to be about something about that. This added layer of meta makes things much less stable. Imagine having a discussion about how we ought to talk about the differences between intelligence explosion and other scenarios, while it was universally acknowledged that no one was going to change their position on the actual subject. It’d be all over the place.
that empathy is harder to achieve. And in particular looking at the difference from one end gives exactly opposite perspectives on the issue. When you ‘normalize’ the differences, it’s maximally different.
By definition, those on either side have different experiences with regard to the difference, and thus are vastly more likely to hold different opinions.
There was that discussion of ignoring good test results from a member of a group if you already believe that they’re bad at whatever was being tested. (They were referred to as blues, but it seemed to be a reference to women and math.) It was a case of only identifying with the gatekeeper.
It was also an extremely straightforward application of Bayes’s theorem.
We have a population of 200 weasels, 100 blue and 100 red. 90% of blue weasels are programmers, and 10% of red weasels are programmers.
If we design a perfect test-of-being-a-programmer, we will have a pool of 100 programmers (90 blue, 10 red).
If our pool of programmers does NOT follow that distribution, it suggests that we’re probably doing something wrong in our screening, like de-facto excluding all of the red weasels due to bigotry. This HURTS us, because we now have fewer programmers in our pool, and/or we have non-programmers in our pool.
If you go out and test all the weasels, and 50% of them pass, and it’s 90% blue and 10% red, I don’t see any rational reason to assume that the blue weasels are going to be superior to the red weasels, or that the red weasels are more likely to be in the pool because of test variance.
Now, if you get a pool that’s 80 red weasels and 20 blue weasels, you’re right to be suspicious that maybe this is not a very accurate test. But given the real-world job market, we should expect such outliers to occur. If everyone else is getting 90 blue and 10 red weasels from this test, you should assume you’re such an outlier, since you have plenty of evidence towards the test being accurate.
And if we’re getting that 90-10 ratio that we expect, there’s no reason to assume that the red weasels are any less competent. If 10% of all weasels are super-programmers, we should expect 10% of our blue programming weasels and 10% of our red programming weasels to be super-programmers (so, on average, 9 blue super-programmers and 1 red super-programmer).
Seriously, where is this anti-red-weasel bias coming from? Nothing in the math seems to suggest it, unless you’re using a seriously crappy test >.>
If you go out and test all the weasels, and 50% of them pass, and it’s 90% blue and 10% red, I don’t see any rational reason to assume that the blue weasels are going to be superior to the red weasels, or that the red weasels are more likely to be in the pool because of test variance.
I don’t follow. Just because your test happened to result in a split that superficially resembles the underlying frequencies, why do you then assume that your imperfect test turned in exactly the right result in all 200 cases? The same logic of an imperfect test leading to shrinking estimates to the mean seems to still apply.
Nothing in the math seems to suggest it, unless you’re using a seriously crappy test
Did you follow my and Vaniver’s thread on this topic? The effect holds unless the test is perfectly accurate.
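The claim that the effect holds for any imperfect test can be checked with a few lines of arithmetic. The sketch below uses the 90%/10% base rates from the weasel example; the sensitivity and false positive rate are hypothetical numbers chosen only for illustration, and the function name is made up here.

```python
def p_programmer_given_pass(prior, sensitivity, false_positive_rate):
    """Bayes' theorem: P(programmer | passed the test)."""
    true_pass = prior * sensitivity
    false_pass = (1.0 - prior) * false_positive_rate
    return true_pass / (true_pass + false_pass)

# Base rates from the weasel example; the error rates below are assumptions.
BLUE_PRIOR, RED_PRIOR = 0.9, 0.1

# Hypothetical imperfect test: catches every programmer, 10% false positives.
blue_imperfect = p_programmer_given_pass(BLUE_PRIOR, 1.0, 0.10)
red_imperfect = p_programmer_given_pass(RED_PRIOR, 1.0, 0.10)

# Perfect test: no false positives at all.
blue_perfect = p_programmer_given_pass(BLUE_PRIOR, 1.0, 0.0)
red_perfect = p_programmer_given_pass(RED_PRIOR, 1.0, 0.0)

print(f"imperfect test: blue {blue_imperfect:.3f}, red {red_imperfect:.3f}")
print(f"perfect test:   blue {blue_perfect:.3f}, red {red_perfect:.3f}")
```

Under these assumed rates, the 100 blue and 100 red weasels produce about 1 blue and 9 red false positives, so a passing blue weasel is a programmer with probability 90/91 and a passing red weasel with probability 10/19. The gap closes only when the false positive rate is zero; how big it is in practice depends entirely on the error rates, which are not given anywhere in this thread.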
The effect holds unless the test is perfectly accurate.
WARNING: Rambly, half-thought-out answer here. It’s genuinely not something I’ve fully worked through myself, and I am totally open to feedback from you that I’m wrong.
The tl;dr version is that the effect is going to be small unless you have a very inaccurate test, and it’s suspicious to focus on a small effect when there’s probably other, larger effects we could be looking at.
Hmmm. Is that actually true? If we know the test has a 10% false positive rate for both red and blue weasels, doesn’t that suggest we should have 9 non-programmer blue weasels and 1 non-programmer red weasel?
Like, if I have a bag with 2 red marbles, and 2 white marbles, the odds of drawing a red marble are 50⁄50. But if my first draw is a red marble, I can’t claim that it’s still 50⁄50, and I can’t update to say that drawing one red marble makes me MORE likely to draw a second one. The new odds are 33⁄66, no matter what math you run. The only correct update is the one that leaves you concluding 33⁄66.
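The marble arithmetic can be confirmed by brute-force enumeration. This is only a sketch of the bag described above (2 red, 2 white, drawing without replacement); the variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

marbles = ["R", "R", "W", "W"]

# Every ordered pair of two distinct marbles, each pair equally likely.
draws = list(permutations(range(len(marbles)), 2))

first_red = [d for d in draws if marbles[d[0]] == "R"]
both_red = [d for d in first_red if marbles[d[1]] == "R"]

# P(second draw is red | first draw was red)
p_second_red = Fraction(len(both_red), len(first_red))
print(p_second_red)  # 1/3
```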
It seems like there is such a test that the test results… already factor in our prior distribution? I’m not sure if I’m being at all clear here :\
Absolutely, this isn’t always the case—if you just know that you have a 10% false positive, and it’s not calibrated for red false positives vs blue false positives, you DO have evidence that red false positives are probably more common. BUT, you’d still be a fool to exclude ALL red candidates on that basis, since you also know that you should legitimately have red candidates in your pool, and by accepting red candidates you increase the overall number of programmers you have access to.
It all depends on the accuracy of your test. If your test is sufficiently accurate that red weasels are only 1% more likely to be false positives, then this probably shouldn’t affect your actual decision making that much.
Then, if you decide to FOCUS on how red weasels have a +1% false positive rate, it implies that you consider this fact particularly important and relevant. It implies that this is a very central decision making factor, and you’re liable to do things like “not hire red weasels unless they got an A+ on their test”, even though the math doesn’t support this. If you’re just doing cold, hard math, we’d expect this factor to be down near the bottom of the list, not plastered up on a neon marquee saying “we did the cold hard math, and all you red weasels can f**k off!”
If we assume two populations, red-weasel-haters and rationalists, we could even run Bayes’ Theorem and conclude that anyone who goes around feeling the need to point out that 1% difference is SIGNIFICANTLY more likely to be a red-weasel-hater, not a rationalist.
Then we can go in to the utilitarian arguments about how feeding the red-weasel-haters political ammunition does actually increase their strength, and thus harms the red weasels, keeps them away from programming, and thus harms programming culture by reducing our pool of available programmers.
The tl;dr version is that the effect is going to be small unless you have a very inaccurate test, and it’s suspicious to focus on a small effect when there’s probably other, larger effects we could be looking at.
Yes, the effect is small in absolute magnitude—if you look at the example SAT shrinking that Vaniver and I were working out, the difference between the male/female shrunk scores is like 5 points although that’s probably an underestimate since it’s ignoring the difference in variance and only looking at means—but these 5 points could have a big difference depending on how the score is used or what other differences you look at.
For example, not shrinking could lead to a number of girls getting into Harvard that would not have since Harvard has so many applicants and they all have very high SAT scores; there could well be a noticeable effect on the margin. When you’re looking at like 30 applications for each seat, 10 SAT points could be the difference between success and failure for a few applicants.
One could probably estimate how many by looking for logistic regressions of ‘SAT score vs admission chance’, seeing how much 10 points is worth, and multiplying against the number of applicants. 35k applicants in 2011 for 2.16k spots. One logistic regression has a ‘model 7’ taking into account many factors where going from 1300 to 1600 goes from an odds ratio of 1.907 to 10.381; so if I’m interpreting this right, an extra 10pts on your total SAT is worth an odds ratio of ((10.381 - 1.907) / (1600-1300)) * 10 + 1 = 1.282. So the members of a group given a 10pt gain are each 1.28x more likely to be admitted than they were before; before, they had a 2.16/35 = 6.17% chance, and now they have a (1.28 * 2.16) / 35 = 2.76 / 35 = 7.89% chance. To finish the analysis: if 17.5k boys apply and 17.5k girls apply and 6.17% of the boys are admitted while 7.89% of the girls are admitted, then there will be an extra (17500 * 0.0789) - (17500 * 0.0617) = 301 girls.
(A boost of more than 1% leading to 301 additional girls on the margin sounds too high to me. Probably I did something wrong in manipulating the odds ratios.)
One could make the same point about means of bell curves differing a little bit: it may lead to next to no real difference towards the middle, but out on the tails it can lead to absurd differentials. I think I once calculated that a difference of one standard deviation in IQ between groups A and B leads to a difference out at 3 deviations for A vs 4 deviations for B, what is usually the cutoff for ‘genius’, of ~50x. One sd is a lot and certainly not comparable to 10 points on the SAT, but you see what I mean.
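The size of that tail effect is easy to recompute. The sketch below assumes both groups are exactly normal with equal variance and means one standard deviation apart, so the same cutoff sits 3 sd above one mean and 4 sd above the other; the helper name is made up for this example.

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Same cutoff, seen from two groups whose means differ by one sd.
share_a = upper_tail(3.0)
share_b = upper_tail(4.0)
ratio = share_a / share_b

print(f"share above cutoff: A {share_a:.2e}, B {share_b:.2e}, ratio {ratio:.1f}")
```

Under these assumptions the ratio comes out in the low 40s, the same order of magnitude as the remembered ~50x. Real score distributions are not exactly normal this far out, so the figure is only indicative.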
But if my first draw is a red marble
How do you know your first draw is a red marble?
BUT, you’d still be a fool to exclude ALL red candidates on that basis, since you also know that you should legitimately have red candidates in your pool, and by accepting red candidates you increase the overall number of programmers you have access to.
Depends on what you’re going to do with them, I suppose… If you can only hire 1 weasel, you’ll be better off going with one of the blue weasels, no? While if you’re just giving probabilities (I’m straining to think of how to continue the analogy: maybe the weasels are floating Hanson-style student loans on prediction markets and you want to see how to buy or sell their interest rates), sure, you just mark down your estimated probability by 1% or whatever.
If we assume two populations, red-weasel-haters and rationalists, we could even run Bayes’ Theorem and conclude that anyone who goes around feeling the need to point out that 1% difference is SIGNIFICANTLY more likely to be a red-weasel-hater, not a rationalist.
Alas! When red-weasel-hating is supported by statistics, only people interested in statistics will be hating on red-weasels. :)
an extra 10pts on your total SAT is worth an odds ratio of 1.282
We can check this interpretation by taking it to the 30th power, and seeing if we recover something sensible; unfortunately, that gives us an odds ratio of over 1700! If we had their beta coefficients, we could see how much 10 points corresponds to, but it doesn’t look like they report it.
Logistic regression is a technique that compresses the real line down to the range between 0 and 1; you can think of that model as the schools giving everyone a score, admitting people above a threshold with probability approximately 1, admitting people below a threshold with probability approximately 0, and then admitting people in between with a probability that increases based on their score (with a score of ‘0’ corresponding to a 50% chance of getting in).
We might be able to recover their beta by taking the log of the odds they report (see here). This gives us a reasonable but not too pretty result, with an estimate that 100 points of SAT is worth a score adjustment of .8. (The actual amount varies for each SAT band, which makes sense if their score for each student nonlinearly weights SAT scores. The jump from the 1400s to the 1500s is slightly bigger than the jump from the 1300s to the 1400s, suggesting that at the upper bands differences in SAT scores might matter more.)
A score increase of .08 cashes out as an odds ratio of 1.083; taking that to the power 30 gives 11.023, which is pretty close to what we’d expect.
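The compounding check is quick to reproduce. The sketch below uses only numbers quoted in this thread (1.282, a score adjustment of .08 per 10 points, and the 6.17% overall admission rate); treating that overall rate as the baseline for an average applicant is an assumption made here, and the function name is illustrative.

```python
import math

# Linear interpolation of odds ratios, compounded over 30 steps of 10 points.
linear_compounded = 1.282 ** 30

# Log-odds version: .08 per 10 points, so 30 steps add 2.4 to the score.
per_step = math.exp(0.08)
log_odds_compounded = per_step ** 30

def apply_odds_ratio(p, odds_ratio):
    """Convert probability to odds, scale by the odds ratio, convert back."""
    odds = p / (1.0 - p) * odds_ratio
    return odds / (1.0 + odds)

# Assumed baseline: the 6.17% overall admission rate quoted earlier.
boosted = apply_odds_ratio(0.0617, per_step)

print(f"1.282^30 = {linear_compounded:.0f}")
print(f"exp(.08) = {per_step:.3f}, ^30 = {log_odds_compounded:.3f}")
print(f"admission chance after +10 points: {boosted:.4f}")
```

Note that an odds ratio multiplies odds, not probabilities, which is why the boosted chance has to go through the odds conversion rather than a direct multiplication.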
I think I once calculated that a difference of one standard deviation in IQ between groups A and B leads to a difference out at 3 deviations for A vs 4 deviations for B, what is usually the cutoff for ‘genius’, of ~50x.
Two standard deviations is generally enough to get you into ‘gifted and talented’ programs, as they call them these days. Four standard deviations gets you to finishing in the top 200 of the Putnam competition, according to Griffe’s calculations, which are also great at illustrating male/female ratios at various levels given Project Talent data on math ability.
I’ll also note again that the SAT is probably not the best test to use for this; it gives a male/female math ability variance ratio estimate of 1.1, whereas Project Talent estimated it as 1.2. Which estimate you choose makes a big difference in your estimation of the strength of this effect. (Note that, typically, more females take the SAT than males, because the cutoff for interest in the SAT is below the population mean, where male variability hurts as well as other factors, and this systemic bias in subject selection will show up in the results.)
Thanks for the odds corrections. I knew I got something wrong...
Two standard deviations is generally enough to get you into ‘gifted and talented’ programs, as they call them these days.
G&T stuff, yeah, but in the materials I’ve read 2sd is not enough to move you from ‘bright’ or ‘gifted and talented’ to ‘genius’ categories, which seems to usually be defined as >2.5-3sd, and using 3sd made the calculation easier.
Eh. MENSA requires upper 2% (which is ~2 standard deviations). Whether you label that ‘genius’ or ‘bright’ or something else doesn’t seem terribly important. 3.5 standard deviations is the 2.3 out of 10,000 level, which is about a hundred times more restrictive.
I’d call MENSA merely bright… You need something in between ‘normal’ and ‘genius’ and bright seems fine. Genius carries all the wrong connotations for something as common as MENSA-level; 2.3 out of 10k seems more reasonable.
Harvard… When you’re looking at like 30 applications for each seat, 10 SAT points could be the difference between success and failure for a few applicants.
Only if Harvard cares a lot about SAT scores. According to this graph, the value of SATs is pretty flat between the 93rd and 96th percentiles. Moreover, at other Ivies, SAT scores are penalized in this range. source, page 7(8)
This graph is not a direct measure of the role of SATs, because they can’t force all else to be equal. The paper argues that some schools really do penalize SAT scores in some regimes. I do not buy the argument, but the graph convinces me that I don’t know how it works. Many people respond to the graph that it is the aggregation of two populations admitted under different scoring rules, both of which value SATs, but I do not think that explains the graph.
Only if Harvard cares a lot about SAT scores. According to this graph, the value of SATs is pretty flat between the 93rd and 96th percentiles. Moreover, at other Ivies, SAT scores are penalized in this range. source, page 7(8)
Your graph doesn’t show that the average applicant won’t benefit from 10 points. It shows that overall, SAT scores make a big difference (from ~0 to 0.2, with not even bothering to show anyone below the 88th percentile).
This graph is not a direct measure of the role of SATs, because they can’t force all else to be equal.
The paper I cited earlier for logistic regressions used models controlling for other things. Given the benefits to athletes, legacies, and minorities, benefits necessary presumably because they cannot compete as well on other factors (like SAT scores), it’s not necessarily surprising if aggregating these populations can lead to a raw graph like those you show. Note that the most meritocratic school which places the least emphasis on ‘holistic’ admissions (enabling them to discriminate in various ways) is MIT, and their curve looks dramatically different from, say, Princeton.
Yes, if large SAT changes matter, then there must be some small changes that matter. But it is possible that there are other points on the scale where they don’t, or are harmful. I’m sorry if I failed to indicate that I meant only this limited point.
If a school admits two populations, then the histogram of SATs of its students might look like a camel. But why should the graph of chance of admission? I suppose Harvard’s graph makes sense if students apply when their assessment of their ability to get in crosses some threshold. Then applying screens off SATs, at least in some normal regime.* But at Yale and especially Princeton, rising SATs in the middle regime predicts greater mistaken belief in ability to get in. Legacies (but not athletes or AA) might explain the phenomenon by only applying to one elite school, but I don’t think legacies alone are big enough to cause the graph.
Here are the lessons I take away from the graphs that I would apply if I had been doing the regressions and wanted to explain the graphs. First, schools have different admissions policies, even schools as similar as Harvard and Yale. Averaging them together, as in the paper, may make things appear smoother than they really are. Second, given the nonlinear effect of SATs, it is good that the regression used buckets rather than assuming a linear effect. Third, since the bizarre downward slope is over the course of less than 100 points, the 100 point buckets of the regression may be too coarse to see it. Fourth, they could have shown graphs, too. It would have been so much more useful to graph SAT scores of athletes and probability of admission as a function of SAT scores of athletes. The main value of regressions is using the words “model” and “p-value.” Fifth, the other use of the regression model is that it lets them consider interactions, which do seem to say that there is not much interaction between SATs and other factors, that the marginal value of an SAT point does not depend on race, legacy status, or athlete status (except for the tiny <1000 category). But the coarseness of the buckets and the aggregating of schools does not allow me to draw much of a conclusion from this.
* Actually, the whole point of this thread is that you can’t completely screen off. But I want to elaborate on “normal regime.” At the high end, screening breaks down because if, say, 1500 SAT is enough to cross the threshold, everyone with 1500+ SAT applies and there is no screening phenomenon. At the low end, I don’t see why screening would break down. Why would someone with SAT<1000 apply to an elite school without really good reason? Yet lots of people apply with such low scores and don’t get in.
But it is possible that there are other points on the scale where they don’t, or are harmful.
Sure, there could be non-monotonicity.
If a school admits two populations, then the histogram of SATs of its students might look like a camel. But why should the graph of chance of admission?...Fifth, the other use of the regression model is that it lets them consider interactions, which do seem to say that there is not much interaction between SATs and other factors, that the marginal value of an SAT point does not depend on race, legacy status, or athlete status (except for the tiny <1000 category).
Imagine that Harvard lets in equal numbers of ‘athletes’ and ‘nerds’, the 2 groups are different populations with different means, and they do something like pick the top 10% in each group by score. Clearly there’s going to be a bimodal histogram of SAT scores: you have a lump of athlete scores in the 1000s, say, and a lump of nerd scores in the 1500s. Sure. 2 equal populations, different means, of course you’re going to see a bimodal.
Now imagine Harvard gets 10x more nerd applicants than athletic applicants; since each group gets the same number of spots, a random nerd will have 1⁄10 the admission chance as an athlete. Poor nerds. But Harvard kept the admission procedure the same as before. So what happens when you look at admission probability if all you know is the SAT score? Well, if you look at the 1500s applicants, you’ll notice that an awful lot of them aren’t admitted; and if you look at the 1000s applicants, you’ll notice that an awful lot of them are getting in. Does Harvard hate SAT scores? No, of course not: we specified they were picking mostly the high scorers, and indeed, if we classify each applicant into nerd or athlete categories and then looked at admission rates by score, we’d see that yes, increasing SAT scores is always good: the nerd with a 1200 better apply to other colleges, and the athlete with 1400 might as well start learning how to yacht.
So even though in aggregate in our little model, high SAT scores look like a bad thing, for each group higher SAT scores are better.
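A toy simulation makes the aggregation effect concrete. Everything numeric below is made up for illustration (group means of 1100 and 1400, a standard deviation of 100, 1,000 athletes, 10,000 nerds, 100 spots per group); none of it is Harvard data, and the function names are hypothetical.

```python
import random

def simulate(seed=0, spots=100):
    """Return (kind, score, admitted) tuples; each group admits its top scorers."""
    rng = random.Random(seed)
    groups = {
        "athlete": [rng.gauss(1100, 100) for _ in range(1000)],
        "nerd": [rng.gauss(1400, 100) for _ in range(10000)],
    }
    applicants = []
    for kind, scores in groups.items():
        ranked = sorted(scores, reverse=True)
        for rank, score in enumerate(ranked):
            # Within a group, a higher score never hurts: top `spots` get in.
            applicants.append((kind, score, rank < spots))
    return applicants

def admit_rate(applicants, lo, hi, kind=None):
    """Admission rate among applicants scoring in [lo, hi), optionally one group."""
    pool = [a for a in applicants
            if lo <= a[1] < hi and (kind is None or a[0] == kind)]
    return sum(a[2] for a in pool) / len(pool) if pool else float("nan")

applicants = simulate()
mid_band = admit_rate(applicants, 1250, 1350)
high_band = admit_rate(applicants, 1500, 1600)
print(f"aggregate admit rate, 1250-1350: {mid_band:.3f}")
print(f"aggregate admit rate, 1500-1600: {high_band:.3f}")
```

In this setup the aggregate admission rate is higher in the 1250-1350 band than in the 1500-1600 band, even though inside each group every admitted applicant outscores every rejected one. The middle band is where athletes clear their bar while nerds fall short of theirs.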
But the coarseness of the buckets and the aggregating of schools does not allow me to draw much of a conclusion from this.
Yes, I don’t think we could make a conclusive argument against the claim that SAT scores may not help at all levels, not without digging deep into all the papers running logistic regressions; but I regard that claim as pretty darn unlikely in the first place.
At the low end, I don’t see why screening would break down. Why would someone with SAT<1000 apply to an elite school without really good reason? Yet lots of people apply with such low scores and don’t get in.
They could be self-delusive, doing it to appease a delusive parent (‘My Johnnie Yu must go to Harvard and become a doctor!’), gambling that a tiny chance of admission is worth the effort, doing it on a dare, expecting that legacies or other things are more helpful than they actually are...
Sure, maybe you can make a model that outputs Harvard or Princeton’s results, but how do you explain the difference between Harvard and Princeton? It is easier to get into Princeton as either a jock or a nerd, but at 98th SAT percentile, it is harder to get into Princeton than Harvard. These are the smart jocks or dumb nerds. Maybe Harvard has first dibs on the smart jocks so that the student body is more bimodal at other schools. But why would admissions be more bimodal? Does Princeton not bother to admit the smart jocks? That’s the hypothesis in the paper: an SAT penalty. Or maybe Princeton rejects the dumb nerds. It would be one thing if Princeton, as a small school, admitted fewer nerds and just had higher standards for nerds. But they don’t at the high end. What’s going on? Here’s a hypothesis: Harvard (like Caltech) could admit nerds based on other achievements that only correlate with SATs, while Princeton has high pure-SAT standards.
I don’t think an SAT penalty is very plausible, but nothing I’ve heard sounds plausible. Mostly people make vague models like yours that I don’t think explain all the observations. The hypothesis that Princeton in contrast to Harvard does not count SAT for jocks beyond a graduation threshold at least does not sound insane.
not without digging deep into all the papers running logistic regressions
I take graphs over regressions, any day. Regressions fit a model. They yield very little information. Sometimes it’s exactly the information you want, as in the calculation you originally brought in the regression for. But with so little information there is no possibility of exploration or model checking.
By the way, the paper you cite is published at a journal with a data access provision.
Sure, maybe you can make a model that outputs Harvard or Princeton’s results, but how do you explain the difference between Harvard and Princeton?
Dunno. I’ve already pointed out the quasi-Simpsons Paradox effect that could produce a lot of different shapes even while SAT score increases always help. Maybe Princeton favors musicians or something. If the only reason to look into the question is your incredulity and interest in the unlikely possibility that increase in SAT score actually hurts some applicants, I don’t care nearly enough to do more than speculate.
By the way, the paper you cite is published at a journal with a data access provision.
I have citations in my DNB FAQ on how such provisions are honored mostly in the breach… I wonder what the odds are that you could get the data and that it would be complete and useful.
One logistic regression has a ‘model 7’ taking into account many factors where going from 1300 to 1600 goes from an odds ratio of 1.907 to 10.381; so if I’m interpreting this right, an extra 10pts on your total SAT is worth an odds ratio of ((10.381 − 1.907) / (1600-1300)) * 10 + 1 = 1.282.
Aren’t odds ratios multiplicative? It also seems to me that we should take the center of the SAT score bins to avoid an off-by-one bin width bias, so (10.381 / 1.907) ^ (10 / (1550 − 1350)) = 1.088. (Or compute additively with log-odds.)
As Vaniver mentioned, this estimate varies across the SAT score bins. If we look only at the top two SAT bins in Model 7: (10.381 / 4.062) ^ (10 / (1550 − 1450)) = 1.098.
Note that within the logistic model, they binned their SAT score data and regressed on them as dichotomous indicator variables, instead of using the raw scores and doing polynomial/nonparametric regression (I presume they did this to simplify their work because all other predictor variables are dichotomous).
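For anyone who wants to check the multiplicative calculation in this comment, here is a minimal sketch. It uses only the three Model 7 odds ratios quoted in the thread; treating 1350, 1450 and 1550 as the bin centers is the assumption made above, and the function name is just for illustration.

```python
# Odds ratios from "Model 7" as quoted in the thread, keyed by assumed bin center.
odds_ratio = {1350: 1.907, 1450: 4.062, 1550: 10.381}

def per_points_odds_ratio(lo, hi, points=10):
    """Odds ratio per `points` SAT points, interpolating geometrically between two bins."""
    return (odds_ratio[hi] / odds_ratio[lo]) ** (points / (hi - lo))

whole_range = per_points_odds_ratio(1350, 1550)   # all three bins
top_two_bins = per_points_odds_ratio(1450, 1550)  # top two bins only

print(round(whole_range, 3), round(top_two_bins, 3))  # 1.088 1.098
```

Geometric interpolation is the same thing as working additively in log-odds, which is why the two approaches mentioned agree.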
Aren’t odds ratios multiplicative? It also seems to me that we should take the center of the SAT score bins to avoid an off-by-one bin width bias, so (10.381 / 1.907) ^ (10 / (1550 − 1350)) = 1.088. (Or compute additively with log-odds.)
Yeah; Vaniver already did it via log odds.
If we look only at the top two SAT bins in Model 7: (10.381 / 4.062) ^ (10 / (1550 − 1450)) = 1.098.
Which is higher than the top bin of 1.088 so I guess that makes using the top bin an underestimate (fine by me).
Note that within the logistic model, they binned their SAT score data and regressed on them as dichotomous indicator variables, instead of using the raw scores and doing polynomial/nonparametric regression
Alas! I just went with the first paper on Harvard I found in Google which did a logistic regression involving SAT scores (well, second: the first one confounded scores with being legacies and minorities and so wasn’t useful). There may be a more useful paper out there.
I’d understood the question to be “given identical scores”, not “given a 10 point average difference in favor of the blue weasel”.
i.e. we take a random sample of 100 men and 100 women with SAT scores between 1200-1400 (high but not perfect scores). Are the male scores going to average better than the females?
My intuition says no: while I’d expect fewer females to be in that range to begin with, I can’t see any reason to assume their scores would cluster towards the lower end of the range compared to males.
i.e. we take a random sample of 100 men and 100 women with SAT scores between 1200-1400 (high but not perfect scores). Are the male scores going to average better than the females?
So, first let’s ask this question, supposing that the test is perfectly accurate. We’ll run through the numbers separately for the two subtests (so we don’t have to deal with correlation), taking means and variances from here.
Of those who scored 600-700 on the hypothetical normally distributed math SAT (hence “HNDMSAT”), the male mean was 643.3 (with 20% of the male population in this band), and the female mean was 640.6 (with 14.8% of the female population in this band).
Of those who scored 600-700 on the HNDVSAT, the male mean was 641.0 (with 14.9% of the male population in this band), and the female mean was 640.1 (with 13.7% of the female population in this band).
When we introduce the test error into the process, the computation gets a lot messier. The quick and dirty way to do things is to say “well, let’s just shrink the mean band scores towards the population mean with the reliability coefficient.” This turns the male edge on the HNDMSAT of 2.7 into 5.4, and the male edge on the HNDVSAT of .9 into 1.8. (I think it’s coincidental that this is roughly doubling the edge.)
My intuition says no: while I’d expect fewer females to be in that range to begin with, I can’t see any reason to assume their scores would cluster towards the lower end of the range compared to males.
That’s because you’re not thinking in bell curves. The range is all on one side of the mean, the male mean is closer to the bottom of the band, and the male variation is higher.
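A rough sketch of the band-mean calculation, for anyone who wants to poke at it. The means and standard deviations here are illustrative assumptions of mine, not the figures the comment linked to, so the exact outputs will differ from the numbers quoted; the point is only the direction of the effect when the band sits above both means.

```python
import math

def pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def band_stats(mean, sd, lo=600.0, hi=700.0):
    """Share of a normal population scoring in [lo, hi] and the mean score within that band."""
    a, b = (lo - mean) / sd, (hi - mean) / sd
    share = cdf(b) - cdf(a)
    band_mean = mean + sd * (pdf(a) - pdf(b)) / share  # mean of a truncated normal
    return share, band_mean

# Hypothetical parameters: higher mean and higher variance for the first group.
male_share, male_band_mean = band_stats(mean=534.0, sd=118.0)
female_share, female_band_mean = band_stats(mean=499.0, sd=112.0)

print(male_share, male_band_mean)
print(female_share, female_band_mean)
```

With the band entirely on one side of both means, the group whose mean is closer to the band and whose spread is wider ends up with the higher in-band average, even though everyone in the band scored between 600 and 700.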
I’d understood the question to be “given identical scores”, not “given a 10 point average difference in favor of the blue weasel”.
My point was that ‘suppose that the true shrinkage leads to an adjusted difference of 10 points between the two groups; how much of a gift does 10 extra points represent?’ By using the nominal score rather than the true score, this has the effect of inflating the score. Once you’ve established how much the inflation might be, it’s natural to wonder about how much real-world consequence it might have leading into the Harvard musings.
i.e. we take a random sample of 100 men and 100 women with SAT scores between 1200-1400 (high but not perfect scores). Are the male scores going to average better than the females?
Depends on the mean and standard deviations of the 2 distributions, and then you could estimate how often the male sample average will be higher than the female sample average and vice versa.
The question should be ‘if we retest these 1200-1400 scorers, what will happen?’ The scores will probably drop as they regress to their mean due to an imperfect test. That’s the point.
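To make the retest point concrete, here is a small sketch of the usual linear shrinkage estimate (often attributed to Kelley): the expected true score is the group mean plus the reliability times the distance of the observed score from that mean. The reliability of 0.9 and the two group means are hypothetical placeholders, not values from any source in this thread.

```python
def shrink(observed, group_mean, reliability):
    """Expected true score: pull the observed score toward the group mean."""
    return group_mean + reliability * (observed - group_mean)

observed = 650.0     # same nominal score for both test-takers
reliability = 0.9    # hypothetical test reliability

est_a = shrink(observed, group_mean=534.0, reliability=reliability)
est_b = shrink(observed, group_mean=499.0, reliability=reliability)

# The gap between the two estimates is (1 - reliability) * (difference in group means).
print(est_a, est_b, est_a - est_b)
```

With a perfectly reliable test (reliability 1.0) both estimates equal the observed score and the gap disappears, which is the sense in which the effect depends on the test being imperfect.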
The question should be ‘if we retest these 1200-1400 scorers, what will happen?’ The scores will probably drop as they regress to their mean due to an imperfect test. That’s the point.
Ahhh, that makes the statistics click in my brain, thanks :)
Do you know if there is much data out there on real-world gender differences vis-a-vis regression to the mean on IQ / SAT / etc. tests? i.e. is this based on statistics, or is it borne out in empirical observations?
Do you know if there is much data out there on real-world gender differences vis-a-vis regression to the mean on IQ / SAT / etc. tests? i.e. is this based on statistics, or is it borne out in empirical observations?
I haven’t seen any, offhand. Maybe the testing company provides info about retests, but then you’re going to have different issues: anyone who takes the second test may be doing so because they had a bad day (giving you regression to a mean from the other direction) and may’ve boned up on test prep since, and there’s the additional issue of test-retest effect—now that they know what the test is like, they will be less anxious and will know what to do, and test-takers in general may score better. (Since I’m looking at that right now, my DNB meta-analysis offers a case in point: in many of the experiments, the controls have slightly higher post-test IQ scores. Just the test-retest effect.)
The problem is that the concept of “fairness” you are using there is incompatible with VNM-utilitarianism. (If somebody disagrees with this, please describe what the term in one’s utility function corresponding to fairness would look like.)
First off, I have to say, just asking this sets off a serious, serious troll alert.
So, we have 5 players, and 50 utilions to divide between them. Players all value utilions equally, and utilions have linear value (i.e. 5 utilions is five times better than 1). Fairness says we give each player 10 utilions. Let’s make our unfair distribution 8, 8, 10, 12, 12.
How to express this mathematically? You could have a factor in your utility equation that is based on deviation from the mean (least-square immediately strikes me as elegant), or one which values the absolute difference between best and worst, or which averages against the lowest value.
For the first technique, the distribution 8, 8, 10, 12, 12 has four players who each deviate from the mean by 2, so the penalty is 2^2 x 4 = 16, i.e. −16 utility compared to ideal.
For the second technique, you lose 4 utility (12 − 8).
For the third technique, the utility for each player is 8, 8, (10+8)/2 = 9, (12+8)/2 = 10, (12+8)/2 = 10, for a total of 45, a penalty of 5 against the ideal total of 50.
And that’s all assuming that fairness is a terminal value, not something that generates utility. That’s all assuming we’re playing with Platonic Utilions with linear value, rather than money (which seems to fall in value the more you get).
I mean this sincerely: if you’re not a troll, I am genuinely and deeply confused how you could possibly think this is the slightest bit incompatible with VNM utilitarianism.
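For concreteness, the three candidate fairness terms can be written out as functions of the outcome. This is only a sketch of the arithmetic in this comment; the function names are mine, and nothing here says which term (if any) someone's utility function ought to contain.

```python
def squared_deviation_penalty(utils):
    """Technique A: sum of squared deviations from the mean."""
    mean = sum(utils) / len(utils)
    return sum((u - mean) ** 2 for u in utils)

def range_penalty(utils):
    """Technique B: gap between the best-off and worst-off player."""
    return max(utils) - min(utils)

def averaged_with_lowest_total(utils):
    """Technique C: each player's utility is averaged with the lowest utility, then summed."""
    lowest = min(utils)
    return sum((u + lowest) / 2 for u in utils)

fair = [10, 10, 10, 10, 10]
unfair = [8, 8, 10, 12, 12]

print(squared_deviation_penalty(unfair))  # 16.0
print(range_penalty(unfair))              # 4
print(averaged_with_lowest_total(fair) - averaged_with_lowest_total(unfair))  # 5.0
```

Each of these is a function of the outcome alone, so an agent can rank outcomes by total utility minus (or adjusted by) any one of them.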
How to express this mathematically? You could have a factor in your utility equation that is based on deviation from the mean (least-square immediately strikes me as elegant), or one which values the absolute difference between best and worst, or which averages against the lowest value.
Ok, let’s apply these functions to a different scenario:
There are two people A and B, A has utility 5 and B has utility 10. We have no way of increasing their utilities but we can make things worse for them. Your term suggests we should lower B’s utility as a deadweight loss to make things more fair. This seems wrong.
Technique C already handles this: (10+5)/2 = 7.5, (5+5)/2 = 5. So clearly going from 10->5 is bad, but having both of them be at 7.5 would be better, and having both of them at 10 would be even better still.
For technique B, yes, you will get results that say power imbalances are unfair and should be destroyed. The simplest example I could give is a world where Hitler has a million soldiers and everyone else has 100,000 combined. That power imbalance is dangerous, because Hitler can leverage that advantage to gain an even larger advantage, and so, over time, that inequality gets worse, and it can even reduce net utility (after the war, Hitler has 950,000 soldiers and everyone else has 50,000: 100K people died, and the world is more unfair!)
One of the big stumbling blocks for me with social justice was understanding that power imbalances can be bad in and of themselves. It’s not just soldiers, either. This happens rather vividly with money and many other resources (“spoons” seem to work this way, if you’re familiar with “spoon theory”)
Technique C already handles this: (10+5)/2 = 7.5, (5+5)/2 = 5. So clearly going from 10->5 is bad, but having both of them be at 7.5 would be better, and having both of them at 10 would be even better still.
Of course technique C doesn’t address the weasel example.
For technique B, yes, you will get results that say power imbalances are unfair and should be destroyed.
When did we switch from talking about utility to talking about power? I agree power imbalances are dangerous; however, this fact doesn’t seem to bear on the weasel example.
Of course technique C doesn’t address the weasel example.
Have you considered using full thoughts… ooooh. What the hell is with all the trolls these days? :(
When did we switch from talking about utility to talking about power?
For the audience at home: That’s because out in “reality”, we can’t measure utilions, so we use things like power and money as proxies. In an ideal utopia with perfectly calibrated Utili-meters, this would not be as relevant.
Of course technique C doesn’t address the weasel example.
Have you considered using full thoughts… ooooh.
I’m not sure how to read this. I’m leaning towards, “I don’t have a counter argument so I’m going to resort to insults.”
To get back to the point, the problem with technique C is that it doesn’t address the case of adjusting test scores based on demographic priors, since the lowest utility (the people not accepted) is the same either way.
What the hell is with all the trolls these days?
You’re the one who just dropped the discussion to DH level 1 or 2.
You have a repeated pattern of not offering real responses: “Is this a parody?” “Is this?” being the biggest red flag I’ve encountered in this thread.
You are correct that I didn’t have a refutation, because “I don’t see how this ties in to the weasels” doesn’t give me enough information to try and resolve your confusion. In short, lately you seem to be putting near-zero effort into your replies: you’re not attempting to explain your position, just offering pithy one-sentence objections that don’t seem to contribute anything.
Given you have 2K karma and a few +50 rated comments, I’m willing to assume you’ve just had a bad week and actually explain this, but I still see no point in actually continuing the conversation, since your replies are all “taxing” me the same way a troll does: you put in minimal effort, and force the other person to hold it all afloat.
You’re the one who just dropped the discussion to DH level 1 or 2.
It’s the very definition of skilled trolling, to force other people to spend paragraphs defending themselves while you resort to easily misinterpreted one-sentence replies that do nothing to advance actual discourse.
The idea that I must maintain quality discourse, or even that it’s more productive, is a trap that ends up with a bunch of well-fed trolls.
You have a repeated pattern of not offering real responses: “Is this a parody?” “Is this?” being the biggest red flag I’ve encountered in this thread.
It’s as real a response as the question it’s a response to and I give a substantive response to Nisan’s more substantive sentence.
You are correct that I didn’t have a refutation, because “I don’t see how this ties in to the weasels” doesn’t give me enough information to try and resolve your confusion.
You could give some indication of what additional information would help. Here are some possibilities:
1) You didn’t get what the weasels were referring to. Arguably I should have linked to this comment in the great-grandparent, but since the comment in question is yours, I assumed you’d get the reference.
2) You think the technique does in fact address the weasel example, in that case you could have said so as well as possibly how you think it applies.
The problem is that the concept of “fairness” you are using there is incompatible with VHM-utilitarianism. (If somebody disagrees with this, please describe what the term in one’s utility function corresponding to fairness would look like.)
People care about fairness, and get negative utility from feeling like they are being treated unfairly.
I’d have to think about it but if I didn’t think it would involve being severely taken advantage of to the point where it impacts what I want to do I’d probably take it.
How do you reconcile this view with the way questions of tone have become entangled with gender issues in this very thread?
It was also an extremely straightforward application of Bayes’s theorem.
The problem is that the concept of “fairness” you are using there is incompatible with VNM-utilitarianism. (If somebody disagrees with this, please describe what the term in one’s utility function corresponding to fairness would look like.)
Where has anyone claimed they don’t? At least beyond the general rejection of qualia?
I was surprised at how strongly some people (probably mostly women) are uncomfortable with the tone here, so I have a lot to update.
I don’t like emoticons much—I don’t hate people who use them, but I use emoticons very rarely, and I’m not comfortable with them. I still find it hard to believe that if people do something a lot, there’s a reasonable chance (if they aren’t being paid) that they like it a lot, even though I can’t imagine liking whatever it is.
I don’t know what proportion of people are apt to interpret lack of overt friendliness as dislike, nor what the gender split is.
In the spirit of exploration, I took a look at Ravelry, a major knitting and crocheting blog. I haven’t found major discussions there yet. I’m interested in examples of blogs with different emotional tones/courtesy rules/gender balances.
Now that I think about it, blogs that are mostly women may be more likely to have overt statements of strong friendship and support. I believe that sort of effusiveness is partly cultural: wasn’t it more common for both men and women, at least from the colonial era (US) to the Victorian era?
That depends on how much you demand of your priors, and low quality priors is something that makes me nervous about Bayes.
For this particular case, there’s no examination of how much variance on the high side people get on tests. In particular, it seems very unlikely that people will get scores much above their baseline on tests about any sophisticated subject, though various factors (illness and other distractions) could drive their scores below their baseline.
What’s VHF Utilitarianism? Is there any utilitarian cost to some capable people giving up because they believe rightly that their accomplishments will be discounted?
My language may have been hyperbolic and/or vague. I was thinking of “creepiness = low status” which sounds to me like “it’s so unfair that women don’t want to spend time with men they’re uncomfortable around”. In this case, I was thinking “lack of point of view”, but “preferences are irrelevant” might be more accurate.
I think I’ve interpreted “creepiness = low status” as, “it’s unfair that low-status men get labeled as creepy for behavior that high-status men would get away with.”
Of course, one could respond that making people at least feel comfortable around you is an easy way to improve your status. :)
That’s a large part of what PUA attempts to do.
Well is it unfair?
I wouldn’t say so. What do you think?
I’m trying to figure out what you mean by “fairness”. I don’t see why this isn’t unfair but adjusting the test scores based on priors is.
A typo, I meant VNM Utilitarianism.
Well, this depends on the exact circumstances, but this may happen to the people who got unlucky on the test anyway, and using a better predictor decreases the number of people who get mischaracterized.
Is this comment a satire?
In any case, the remark about the von Neumann-Morgenstern theorem is just wrong.
Is yours?
So, what does the term in a utility function corresponding to fairness look like?
Like, if someone wanted to mock this website, that’s exactly what they’d write.
You’re probably thinking that a utility function can’t prefer “fair” lotteries. But it can prefer fair outcomes, which is what’s relevant here.
I’m not a utilitarian and the arguments like the one I made about utility are part of the reason, if that’s what you’re asking.
What’s a “fair” outcome? Should we abandon life extension research because it would be “unfair” to those who died before it achieves results?
The von Neumann-Morgenstern theorem has nothing to do with utilitarianism, and it’s not about what you “should” do. Those words don’t appear in the statement of the theorem. The theorem does state that a VNM-rational agent has a preference ordering over lotteries of outcomes. In fact it can have any preferences over outcomes at all and still satisfy the hypotheses of the theorem. In particular, it can prefer fair outcomes to unfair outcomes for any definition of “fair”.
If you want to argue that one shouldn’t pursue fairness, you don’t want to use the VNM theorem.
Agreed, unfortunately a lot of people around here seem to interpret it this way.
I would argue that fairness is a property of a process rather than an outcome, e.g., a kangaroo court doesn’t become “fair” just because it happens to reach the same verdict a fair trial would have.
A simple “no” would have sufficed. Downvoted.
Downvoted Eugine for the same reason, and upvoted MugaSofer back to positive. I value honest feedback, and see no reason to downvote ’em for providing it.
When the difference IS the topic, that tends to amplify the relevance of the differences.
Then why is it that this difference, out of the many dimensions of differences that form up humankind, and the multitude of interest-group formation patterns that could have been generated, is the one that gets so much attention? It would be bizarre if an unbiased deliberation process systematically decides that one unremarkable axis (gender) is the one difference that should be discussed at great length and with very vigorous champions, while ignoring all of the other axes of diversity of human minds.
Now it is possible for one unremarkable axis to become overwhelmingly dominant in coalition formation, but that would involve some fairly unpleasant implications about the truth-seekiness and utilitarian consequences of this sort of thinking.
I dunno about this. It seems that the difference between those concerned with an intelligence explosion and those concerned with other scenarios has gotten way more attention here than gender.
I wasn’t surprised on the occasions when questions of differences in tone between the two camps flared up when discussing that topic. I would have been shocked almost beyond belief if, when discussing that topic, questions of tone differences between men and women had arisen.
The idea is that on almost every topic, men and women are very similar, because the differences aren’t relevant. When you begin looking at the differences, then you get amplifying effects. In particular, each participant being what they are and completely unable to change that means:
that the topic isn’t going to be to convert people from one camp to the other or otherwise influence their choice as in the example above, but is going to have to be about the difference itself. This added layer of meta makes things much less stable. Imagine having a discussion about how we ought to talk about the differences between intelligence explosion and other scenarios, while it was universally acknowledged that no one was going to change their position on the actual subject. It’d be all over the place.
that empathy is harder to achieve. And in particular looking at the difference from one end gives exactly opposite perspectives on the issue. When you ‘normalize’ the differences, it’s maximally different.
This.
By definition, those on either side have different experiences with regard to the difference, and thus are vastly more likely to hold different opinions.
We have a population of 200 weasels, 100 blue and 100 red. 90% of blue weasels are programmers, and 10% of red weasels are programmers.
If we design a perfect test-of-being-a-programmer, we will have a pool of 100 programmers (90 blue, 10 red).
If our pool of programmers does NOT follow that distribution, it suggests that we’re probably doing something wrong in our screening, like de-facto excluding all of the red weasels due to bigotry. This HURTS us, because we now have fewer programmers in our pool, and/or we have non-programmers in our pool.
If you go out and test all the weasels, and 50% of them pass, and it’s 90% blue and 10% red, I don’t see any rational reason to assume that the blue weasels are going to be superior to the red weasels, or that the red weasels are more likely to have passed because of test variance.
Now, if you get a pool that’s 80 red weasels and 20 blue weasels, you’re right to be suspicious that maybe this is not a very accurate test. But given the real-world job market, we should expect such outliers to occur. If everyone else is getting 90 blue and 10 red weasels from this test, you should assume you’re such an outlier, since you have plenty of evidence towards the test being accurate.
And if we’re getting that 90-10 ratio that we expect, there’s no reason to assume that the red weasels are any less competent. If 10% of all weasels are super-programmers, we should expect 10% of our blue programming weasels and 10% of our red programming weasels to be super-programmers (so, on average, 9 blue super-programmers and 1 red super-programmer).
Seriously, where is this anti-red-weasel bias coming from? Nothing in the math seems to suggest it, unless you’re using a seriously crappy test >.>
I don’t follow. Just because your test happened to result in a split that superficially resembles the underlying frequencies, why do you then assume that your imperfect test turned in exactly the right result in all 200 cases? The same logic of an imperfect test leading to shrinking estimates to the mean seems to still apply.
Did you follow my and Vaniver’s thread on this topic? The effect holds unless the test is perfectly accurate.
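A minimal sketch of why an imperfect test leaves the prior in play, using the weasel base rates from upthread (90% of blue and 10% of red weasels are programmers). The 90% true-positive and 10% false-positive rates are hypothetical numbers chosen for illustration, and the test is assumed to behave identically for both colors.

```python
def p_programmer_given_pass(base_rate, true_positive_rate, false_positive_rate):
    """Bayes' theorem: P(programmer | passed the test)."""
    passed_and_programmer = base_rate * true_positive_rate
    passed_and_not = (1 - base_rate) * false_positive_rate
    return passed_and_programmer / (passed_and_programmer + passed_and_not)

# Imperfect test, same error rates for both colors (hypothetical values).
blue = p_programmer_given_pass(0.90, true_positive_rate=0.90, false_positive_rate=0.10)
red = p_programmer_given_pass(0.10, true_positive_rate=0.90, false_positive_rate=0.10)

# Perfect test: the base rate no longer matters.
red_perfect = p_programmer_given_pass(0.10, true_positive_rate=1.0, false_positive_rate=0.0)

print(round(blue, 3), round(red, 3), red_perfect)  # 0.988 0.5 1.0
```

How big the gap is depends entirely on the error rates plugged in; with a more accurate test the two posteriors move toward each other.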
WARNING: Rambly, half-thought-out answer here. It’s genuinely not something I’ve fully worked through myself, and I am totally open to feedback from you that I’m wrong.
The tl;dr version is that the effect is going to be small unless you have a very inaccurate test, and it’s suspicious to focus on a small effect when there’s probably other, larger effects we could be looking at.
Hmmm. Is that actually true? If we know the test has a 10% false positive rate for both red and blue weasels, doesn’t that suggest we should have 9 non-programmer blue weasels and 1 non-programmer red weasel?
Like, if I have a bag with 2 red marbles, and 2 white marbles, the odds of drawing a red marble are 50⁄50. But if my first draw is a red marble, I can’t claim that it’s still 50⁄50, and I can’t update to say that drawing one red marble makes me MORE likely to draw a second one. The new odds are 33⁄66, no matter what math you run. The only correct update is the one that leaves you concluding 33⁄66.
It seems like there is such a test that the test results… already factor in our prior distribution? I’m not sure if I’m being at all clear here :\
Absolutely, this isn’t always the case—if you just know that you have a 10% false positive, and it’s not calibrated for red false positives vs blue false positives, you DO have evidence that red false positives are probably more common. BUT, you’d still be a fool to exclude ALL red candidates on that basis, since you also know that you should legitimately have red candidates in your pool, and by accepting red candidates you increase the overall number of programmers you have access to.
It all depends on the accuracy of your test. If your test is sufficiently accurate that red weasels are only 1% more likely to be false positives, then this probably shouldn’t affect your actual decision making that much.
Then, if you decide to FOCUS on how red weasels have a +1% false positive rate, it implies that you consider this fact particularly important and relevant. It implies that this is a very central decision making factor, and you’re liable to do things like “not hire red weasels unless they got an A+ on their test”, even though the math doesn’t support this. If you’re just doing cold, hard math, we’d expect this factor to be down near the bottom of the list, not plastered up on a neon marquee saying “we did the cold hard math, and all you red weasels can f**k off!”
If we assume two populations, red-weasel-haters and rationalists, we could even run Bayes’ Theorem and conclude that anyone who goes around feeling the need to point out that 1% difference is SIGNIFICANTLY more likely to be a red-weasel-hater, not a rationalist.
Then we can go into the utilitarian arguments about how feeding the red-weasel-haters political ammunition does actually increase their strength, and thus harms the red weasels, keeps them away from programming, and thus harms programming culture by reducing our pool of available programmers.
Yes, the effect is small in absolute magnitude—if you look at the example SAT shrinking that Vaniver and I were working out, the difference between the male/female shrunk scores is like 5 points although that’s probably an underestimate since it’s ignoring the difference in variance and only looking at means—but these 5 points could have a big difference depending on how the score is used or what other differences you look at.
For example, not shrinking could lead to a number of girls getting into Harvard that would not have since Harvard has so many applicants and they all have very high SAT scores; there could well be a noticeable effect on the margin. When you’re looking at like 30 applications for each seat, 10 SAT points could be the difference between success and failure for a few applicants.
One could probably estimate how many by looking for logistic regressions of ‘SAT score vs admission chance’, seeing how much 10 points is worth, and multiplying against the number of applicants. 35k applicants in 2011 for 2.16k spots. One logistic regression has a ‘model 7’ taking into account many factors where going from 1300 to 1600 goes from an odds ratio of 1.907 to 10.381; so if I’m interpreting this right, an extra 10pts on your total SAT is worth an odds ratio of ((10.381 - 1.907) / (1600-1300)) * 10 + 1 = 1.282. So the members of a group given a 10pt gain are each 1.28x more likely to be admitted than they were before; before, they had a 2.16/35 = 6.17% chance, and now they have a (1.28 * 2.16) / 35 = 2.76 / 35 = 7.89% chance. To finish the analysis: if 17.5k boys apply and 17.5k girls apply and 6.17% of the boys are admitted while 7.89% of the girls are admitted, then there will be an extra (17500 * 0.0789) - (17500 * 0.0617) = 301 girls. (A boost of more than 1% leading to 301 additional girls on the margin sounds too high to me. Probably I did something wrong in manipulating the odds ratios.)
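One way to redo this step is to apply the odds ratio to the odds rather than to the probability. The sketch assumes a per-10-point odds ratio of about 1.088 (the multiplicative figure worked out elsewhere in this thread from the same Model 7 numbers) and, as a strong simplification, that every applicant starts from the same 2.16/35 baseline chance.

```python
def apply_odds_ratio(p, odds_ratio):
    """Convert probability to odds, scale by the odds ratio, convert back."""
    odds = p / (1 - p)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)

base_p = 2.16 / 35                            # ~6.17% baseline admission chance
boosted_p = apply_odds_ratio(base_p, 1.088)   # assumed odds ratio for +10 SAT points

applicants_per_group = 17500
extra_admits = applicants_per_group * (boosted_p - base_p)

print(round(boosted_p, 4), round(extra_admits))
```

Under these assumptions the marginal effect comes out well under a hundred applicants, a good deal smaller than the 301 figure.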
One could make the same point about means of bell curves differing a little bit: it may lead to next to no real difference towards the middle, but out on the tails it can lead to absurd differentials. I think I once calculated that a difference of one standard deviation in IQ between groups A and B leads to a difference out at 3 deviations for A vs 4 deviations for B, what is usually the cutoff for ‘genius’, of ~50x. One sd is a lot and certainly not comparable to 10 points on the SAT, but you see what I mean.
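The tail comparison is easy to recompute. A minimal sketch, assuming two normal distributions with equal variance whose means differ by one standard deviation, so that a cutoff 4 SD above B's mean is 3 SD above A's:

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

share_a = upper_tail(3.0)  # group A: cutoff is 3 SD above its mean
share_b = upper_tail(4.0)  # group B: same cutoff is 4 SD above its mean

ratio = share_a / share_b
print(share_a, share_b, ratio)
```

Under exactly these assumptions the ratio comes out a bit over 40, the same ballpark as the ~50x recalled above.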
How do you know your first draw is a red marble?
Depends on what you’re going to do with them, I suppose… If you can only hire 1 weasel, you’ll be better off going with one of the blue weasels, no? While if you’re just giving probabilities (I’m straining to think of how to continue the analogy: maybe the weasels are floating Hanson-style student loans on prediction markets and you want to see how to buy or sell their interest rates), sure, you just mark down your estimated probability by 1% or whatever.
Alas! When red-weasel-hating is supported by statistics, only people interested in statistics will be hating on red-weasels. :)
We can check this interpretation by taking it to the 30th power, and seeing if we recover something sensible; unfortunately, that gives us an odds ratio of over 1700! If we had their beta coefficients, we could see how much 10 points corresponds to, but it doesn’t look like they report it.
Logistic regression is a technique that compresses the real line down to the range between 0 and 1; you can think of that model as the schools giving everyone a score, admitting people above a threshold with probability approximately 1, admitting people below a threshold with probability approximately 0, and then admitting people in between with a probability that increases based on their score (with a score of ‘0’ corresponding to a 50% chance of getting in).
We might be able to recover their beta by taking the log of the odds they report (see here). This gives us a reasonable but not too pretty result, with an estimate that 100 points of SAT is worth a score adjustment of .8. (The actual amount varies for each SAT band, which makes sense if their score for each student nonlinearly weights SAT scores. The jump from the 1400s to the 1500s is slightly bigger than the jump from the 1300s to the 1400s, suggesting that at the upper bands differences in SAT scores might matter more.)
A score increase of .08 cashes out as an odds ratio of 1.083; when we take that to the power 30 we get 11.023, which is pretty close to what we’d expect.
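The log-odds reasoning in this comment can be reproduced in a few lines. The three odds ratios are the Model 7 figures quoted in the thread; labelling the bands 1300s, 1400s and 1500s, and rounding the per-100-point effect to 0.8, follow the comment rather than the paper itself.

```python
import math

# Model 7 odds ratios as quoted in the thread, keyed by SAT band.
odds = {"1300s": 1.907, "1400s": 4.062, "1500s": 10.381}
log_odds = {band: math.log(value) for band, value in odds.items()}

# Score adjustment (difference in log-odds) for each 100-point step.
jump_low = log_odds["1400s"] - log_odds["1300s"]
jump_high = log_odds["1500s"] - log_odds["1400s"]

# Roughly 0.8 per 100 points means 0.08 per 10 points.
per_10_points = math.exp(0.08)
check = per_10_points ** 30   # thirty 10-point steps span 300 points
naive_check = 1.282 ** 30     # the additive estimate blows up instead

print(round(jump_low, 3), round(jump_high, 3))
print(round(per_10_points, 3), round(check, 3), round(naive_check))
```

The multiplicative estimate compounds to roughly the largest reported odds ratio, while the additive one compounds to something in the thousands, which is the sanity check described above.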
Two standard deviations is generally enough to get you into ‘gifted and talented’ programs, as they call them these days. Four standard deviations gets you to finishing in the top 200 of the Putnam competition, according to Griffe’s calculations, which are also great at illustrating male/female ratios at various levels given Project Talent data on math ability.
I’ll also note again that the SAT is probably not the best test to use for this; it gives a male/female math ability variance ratio estimate of 1.1, whereas Project Talent estimated it as 1.2. Which estimate you choose makes a big difference in your estimation of the strength of this effect. (Note that, typically, more females take the SAT than males, because the cutoff for interest in the SAT is below the population mean, where male variability hurts as well as other factors, and this systemic bias in subject selection will show up in the results.)
Thanks for the odds corrections. I knew I got something wrong...
G&T stuff, yeah, but in the materials I’ve read 2sd is not enough to move you from ‘bright’ or ‘gifted and talented’ to ‘genius’ categories, which seems to usually be defined as >2.5-3sd, and using 3sd made the calculation easier.
Eh. MENSA requires upper 2% (which is ~2 standard deviations). Whether you label that ‘genius’ or ‘bright’ or something else doesn’t seem terribly important. 3.5 standard deviations is the 2.3 out of 10,000 level, which is about a hundred times more restrictive.
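Those tail figures can be checked with the standard normal distribution. This sketch assumes ability is exactly normal, which is a simplification at the extremes.

```python
import math

def upper_tail(z: float) -> float:
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

p2 = upper_tail(2.0)     # roughly the upper-2% cutoff
p35 = upper_tail(3.5)

print(f"2 SD:   {p2:.4f}")
print(f"3.5 SD: {p35 * 10_000:.2f} per 10,000")
print(f"ratio:  {p2 / p35:.0f}x more restrictive")
```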
I’d call MENSA merely bright… You need something in between ‘normal’ and ‘genius’ and bright seems fine. Genius carries all the wrong connotations for something as common as MENSA-level; 2.3 out of 10k seems more reasonable.
Only if Harvard cares a lot about SAT scores. According to this graph, the value of SATs is pretty flat between the 93rd and 96th percentiles. Moreover, at other Ivies, SAT scores are penalized in this range. source, page 7(8)
This graph is not a direct measure of the role of SATs, because they can’t force all else to be equal. The paper argues that some schools really do penalize SAT scores in some regimes. I do not buy the argument, but the graph convinces me that I don’t know how it works. Many people respond to the graph that it is the aggregation of two populations admitted under different scoring rules, both of which value SATs, but I do not think that explains the graph.
Your graph doesn’t show that the average applicant won’t benefit from 10 points. It shows that overall, SAT scores make a big difference (from ~0 to 0.2, with not even bothering to show anyone below the 88th percentile).
The paper I cited earlier for logistic regressions used models controlling for other things. Given the benefits to athletes, legacies, and minorities, benefits necessary presumably because they cannot compete as well on other factors (like SAT scores), it’s not necessarily surprising if aggregating these populations can lead to a raw graph like those you show. Note that the most meritocratic school which places the least emphasis on ‘holistic’ admissions (enabling them to discriminate in various ways) is MIT, and their curve looks dramatically different from, say, Princeton.
Yes, if large SAT changes matter, then there must be some small changes that matter. But it is possible that there are other points on the scale where they don’t matter, or are harmful. I’m sorry if I failed to indicate that I meant only this limited point.
If a school admits two populations, then the histogram of SATs of its students might look like a camel. But why should the graph of chance of admission? I suppose Harvard’s graph makes sense if students apply when their assessment of their ability to get in crosses some threshold. Then applying screens off SATs, at least in some normal regime.* But at Yale and especially Princeton, rising SATs in the middle regime predicts greater mistaken belief in ability to get in. Legacies (but not athletes or AA) might explain the phenomenon by only applying to one elite school, but I don’t think legacies alone are big enough to cause the graph.
Here are the lessons I take away from the graphs that I would apply if I had been doing the regressions and wanted to explain the graphs. First, schools have different admissions policies, even schools as similar as Harvard and Yale. Averaging them together, as in the paper, may make things appear smoother than they really are. Second, given the nonlinear effect of SATs, it is good that the regression used buckets rather than assuming a linear effect. Third, since the bizarre downward slope is over the course of less than 100 points, the 100 point buckets of the regression may be too coarse to see it. Fourth, they could have shown graphs, too. It would have been so much more useful to graph SAT scores of athletes and probability of admission as a function of SAT scores of athletes. The main value of regressions is using the words “model” and “p-value.” Fifth, the other use of the regression model is that it lets them consider interactions, which do seem to say that there is not much interaction between SATs and other factors, that the marginal value of an SAT point does not depend on race, legacy status, or athlete status (except for the tiny <1000 category). But the coarseness of the buckets and the aggregating of schools does not allow me to draw much of a conclusion from this.
* Actually, the whole point of this thread is that you can’t completely screen off. But I want to elaborate on “normal regime.” At the high end, screening breaks down because if, say, 1500 SAT is enough to cross the threshold, everyone with 1500+ SAT applies and there is no screening phenomenon. At the low end, I don’t see why screening would break down. Why would someone with SAT<1000 apply to an elite school without really good reason? Yet lots of people apply with such low scores and don’t get in.
Sure, there could be non-monotonicity.
Imagine that Harvard lets in equal numbers of ‘athletes’ and ‘nerds’, the 2 groups are different populations with different means, and they do something like pick the top 10% in each group by score. Clearly there’s going to be a bimodal histogram of SAT scores: you have a lump of athlete scores in the 1000s, say, and a lump of nerd scores in the 1500s. Sure. 2 equal populations, different means, of course you’re going to see a bimodal.
Now imagine Harvard gets 10x more nerd applicants than athletic applicants; since each group gets the same number of spots, a random nerd will have 1⁄10 the admission chance of an athlete. Poor nerds. But Harvard kept the admission procedure the same as before. So what happens when you look at admission probability if all you know is the SAT score? Well, if you look at the 1500s applicants, you’ll notice that an awful lot of them aren’t admitted; and if you look at the 1000s applicants, you’ll notice that an awful lot of them get in. Does Harvard hate SAT scores? No, of course not: we specified they were picking mostly the high scorers, and indeed, if we classified each applicant into nerd or athlete categories and then looked at admission rates by score, we’d see that yes, increasing SAT scores is always good: the nerd with a 1200 better apply to other colleges, and the athlete with 1400 might as well start learning how to yacht.
So even though in aggregate in our little model, high SAT scores look like a bad thing, for each group higher SAT scores are better.
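Here is a rough Python sketch of that toy model. Every number in it (group means, spreads, the hard cutoffs) is an illustrative assumption, not an estimate of any real school's applicant pool.

```python
from statistics import NormalDist

# Hypothetical applicant pools (all numbers are illustrative assumptions).
athletes = NormalDist(mu=1000, sigma=100)
nerds = NormalDist(mu=1350, sigma=100)
nerd_to_athlete_ratio = 10   # 10x more nerd applicants

# Same number of spots per group: top 10% of athletes, hence top 1% of nerds.
athlete_cutoff = athletes.inv_cdf(0.90)
nerd_cutoff = nerds.inv_cdf(0.99)

def p_admit_given_group(score, cutoff):
    """Within a group, admission is a step function: never decreasing in score."""
    return 1.0 if score >= cutoff else 0.0

def p_admit_aggregate(score):
    """P(admit | score) when the applicant's group is unknown."""
    w_athlete = athletes.pdf(score)
    w_nerd = nerd_to_athlete_ratio * nerds.pdf(score)
    admitted = (w_athlete * p_admit_given_group(score, athlete_cutoff)
                + w_nerd * p_admit_given_group(score, nerd_cutoff))
    return admitted / (w_athlete + w_nerd)

for s in (1100, 1150, 1300, 1500, 1600):
    print(s, round(p_admit_aggregate(s), 3))
```

In this setup the aggregate admission rate rises just above the athlete cutoff, falls toward zero through the 1300s and 1500s, and only recovers past the nerd cutoff, even though within each group a higher score never hurts.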
Reminds me of Simpson’s paradox.
Yes, I don’t think we could make a conclusive argument against the claim that SAT scores may not help at all levels, not without digging deep into all the papers running logistic regressions; but I regard that claim as pretty darn unlikely in the first place.
They could be self-delusive, doing it to appease a delusive parent (‘My Johnnie Yu must go to Harvard and become a doctor!’), gambling that a tiny chance of admission is worth the effort, doing it on a dare, expecting that legacies or other things are more helpful than they actually are...
Sure, maybe you can make a model that outputs Harvard or Princeton’s results, but how do you explain the difference between Harvard and Princeton? It is easier to get into Princeton as either a jock or a nerd, but at 98th SAT percentile, it is harder to get into Princeton than Harvard. These are the smart jocks or dumb nerds. Maybe Harvard has first dibs on the smart jocks so that the student body is more bimodal at other schools. But why would admissions be more bimodal? Does Princeton not bother to admit the smart jocks? That’s the hypothesis in the paper: an SAT penalty. Or maybe Princeton rejects the dumb nerds. It would be one thing if Princeton, as a small school, admitted fewer nerds and just had higher standards for nerds. But they don’t at the high end. What’s going on? Here’s a hypothesis: Harvard (like Caltech) could admit nerds based on other achievements that only correlate with SATs, while Princeton has high pure-SAT standards.
I don’t think an SAT penalty is very plausible, but nothing I’ve heard sounds plausible. Mostly people make vague models like yours that I don’t think explain all the observations. The hypothesis that Princeton in contrast to Harvard does not count SAT for jocks beyond a graduation threshold at least does not sound insane.
I take graphs over regressions, any day.
Regressions fit a model. They yield very little information. Sometimes it’s exactly the information you want, as in the calculation you originally brought in the regression for. But with so little information there is no possibility of exploration or model checking.
By the way, the paper you cite is published at a journal with a data access provision.
Dunno. I’ve already pointed out the quasi-Simpsons Paradox effect that could produce a lot of different shapes even while SAT score increases always help. Maybe Princeton favors musicians or something. If the only reason to look into the question is your incredulity and interest in the unlikely possibility that increase in SAT score actually hurts some applicants, I don’t care nearly enough to do more than speculate.
I have citations in my DNB FAQ on how such provisions are honored mostly in the breach… I wonder what the odds are that you could get the data and that it would be complete and useful.
Aren’t odds ratios multiplicative? It also seems to me that we should take the center of the SAT score bins to avoid an off-by-one bin width bias, so (10.381 / 1.907) ^ (10 / (1550 − 1350)) = 1.088. (Or compute additively with log-odds.)
As Vaniver mentioned, this estimate varies across the SAT score bins. If we look only at the top two SAT bins in Model 7: (10.381 / 4.062) ^ (10 / (1550 − 1450)) = 1.098.
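The two calculations above follow the same pattern, which can be sketched as one Python helper. The name `odds_ratio_per_step` is made up for illustration; the odds ratios and bin centers are the ones quoted in the comments above.

```python
def odds_ratio_per_step(or_high, or_low, score_high, score_low, step=10):
    """Per-`step` odds ratio implied by two bins, interpolating geometrically."""
    return (or_high / or_low) ** (step / (score_high - score_low))

# Odds ratios and bin centers as quoted in the comments above.
wide = odds_ratio_per_step(10.381, 1.907, 1550, 1350)
top = odds_ratio_per_step(10.381, 4.062, 1550, 1450)

print(round(wide, 3), round(top, 3))  # 1.088 1.098
```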
Note that within the logistic model, they binned their SAT score data and regressed on them as dichotomous indicator variables, instead of using the raw scores and doing polynomial/nonparametric regression (I presume they did this to simplify their work because all other predictor variables are dichotomous).
Yeah; Vaniver already did it via log odds.
Which is higher than the top bin of 1.088 so I guess that makes using the top bin an underestimate (fine by me).
Alas! I just went with the first paper on Harvard I found in Google which did a logistic regression involving SAT scores (well, second: the first one confounded scores with being legacies and minorities and so wasn’t useful). There may be a more useful paper out there.
I’d understood the question to be “given identical scores”, not “given a 10 point average difference in favor of the blue weasel”.
i.e. we take a random sample of 100 men and 100 women with SAT scores between 1200 and 1400 (high but not perfect scores). Are the male scores going to average better than the female scores?
My intuition says no: while I’d expect fewer females to be in that range to begin with, I can’t see any reason to assume their scores would cluster towards the lower end of the range compared to males.
So, first let’s ask this question, supposing that the test is perfectly accurate. We’ll run through the numbers separately for the two subtests (so we don’t have to deal with correlation), taking means and variances from here.
Of those who scored 600-700 on the hypothetical normally distributed math SAT (hence “HNDMSAT”), the male mean was 643.3 (with 20% of the male population in this band), and the female mean was 640.6 (with 14.8% of the female population in this band).
Of those who scored 600-700 on the HNDVSAT, the male mean was 641.0 (with 14.9% of the male population in this band), and the female mean was 640.1 (with 13.7% of the female population in this band).
When we introduce the test error into the process, the computation gets a lot messier. The quick and dirty way to do things is to say “well, let’s just shrink the mean band scores towards the population mean with the reliability coefficient.” This turns the male edge on the HNDMSAT of 2.7 into 5.4, and the male edge on the HNDVSAT of .9 into 1.8. (I think it’s coincidental that this is roughly doubling the edge.)
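A sketch of both steps in Python: the mean of a normal distribution restricted to a score band, then the quick-and-dirty shrinkage toward each group's own mean. The means, standard deviations, and reliability below are hypothetical placeholders, not the figures used in the calculation above, so the printed edges will not match those numbers exactly.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal

def band_stats(mu, sigma, lo, hi):
    """Return (share of the population in [lo, hi], mean score within that band)."""
    a, b = (lo - mu) / sigma, (hi - mu) / sigma
    share = STD.cdf(b) - STD.cdf(a)
    mean = mu + sigma * (STD.pdf(a) - STD.pdf(b)) / share
    return share, mean

def shrink(band_mean, mu, reliability):
    """Quick-and-dirty true-score estimate: pull the band mean toward the group mean."""
    return mu + reliability * (band_mean - mu)

# Hypothetical parameters (assumptions for illustration only).
male = dict(mu=530.0, sigma=118.0)
female = dict(mu=500.0, sigma=110.0)
reliability = 0.9

_, m_mean = band_stats(**male, lo=600, hi=700)
_, f_mean = band_stats(**female, lo=600, hi=700)

raw_edge = m_mean - f_mean
adj_edge = (shrink(m_mean, male["mu"], reliability)
            - shrink(f_mean, female["mu"], reliability))
print(round(raw_edge, 1), round(adj_edge, 1))
```

Because each group is shrunk toward its own mean, the adjusted edge is the reliability times the raw edge plus (1 − reliability) times the gap between the group means, which is why the edge grows rather than shrinks.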
That’s because you’re not thinking in bell curves. The range is all on one side of the mean, the male mean is closer to the bottom of the band, and the male variation is higher.
My point was: ‘suppose that the true shrinkage leads to an adjusted difference of 10 points between the two groups; how much of a gift does 10 extra points represent?’ Using the nominal score rather than the true score has the effect of inflating the score. Once you’ve established how much the inflation might be, it’s natural to wonder how much real-world consequence it might have, which leads into the Harvard musings.
Depends on the mean and standard deviations of the 2 distributions, and then you could estimate how often the male sample average will be higher than the female sample average and vice versa.
The question should be ‘if we retest these 1200-1400 scorers, what will happen?’ The scores will probably drop as they regress to their mean due to an imperfect test. That’s the point.
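A small simulation makes the retest point concrete. This is only a sketch: the population mean, the ability spread, and the noise level are assumptions, and it ignores any practice effect on the second sitting.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical test: population mean 1000, true-ability SD 190, noise SD 63
# (which implies a reliability of roughly 0.9). These are assumptions.
MEAN, TRUE_SD, NOISE_SD = 1000.0, 190.0, 63.0

first_scores, retest_scores = [], []
for _ in range(100_000):
    ability = rng.gauss(MEAN, TRUE_SD)
    first = ability + rng.gauss(0.0, NOISE_SD)
    if 1200 <= first <= 1400:          # select on the first sitting only
        first_scores.append(first)
        retest_scores.append(ability + rng.gauss(0.0, NOISE_SD))

first_mean = sum(first_scores) / len(first_scores)
retest_mean = sum(retest_scores) / len(retest_scores)
print(f"first sitting: {first_mean:.0f}, retest: {retest_mean:.0f}")
```

The selected group's retest average comes out lower than its first-sitting average, because part of what put them in the 1200-1400 band was favorable noise that does not repeat.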
Ahhh, that makes the statistics click in my brain, thanks :)
Do you know if there is much data out there on real-world gender differences vis-a-vis regression to the mean on IQ / SAT / etc. tests? i.e. is this based on statistics, or is it borne out in empirical observations?
I haven’t seen any, offhand. Maybe the testing company provides info about retests, but then you’re going to have different issues: anyone who takes the second test may be doing so because they had a bad day (giving you regression to a mean from the other direction) and may’ve boned up on test prep since, and there’s the additional issue of test-retest effect—now that they know what the test is like, they will be less anxious and will know what to do, and test-takers in general may score better. (Since I’m looking at that right now, my DNB meta-analysis offers a case in point: in many of the experiments, the controls have slightly higher post-test IQ scores. Just the test-retest effect.)
First off, I have to say, just asking this sets off a serious, serious troll alert.
So, we have 5 players, and 50 utilions to divide between them. Players all value utilions equally, and utilions have linear value (i.e. 5 utilions is five times better than 1). Fairness says we give each player 10 utilions. Let’s make our unfair distribution 8, 8, 10, 12, 12.
How to express this mathematically? You could have a factor in your utility equation that is based on deviation from the mean (least-square immediately strikes me as elegant), or one which values the absolute difference between best and worst, or which averages against the lowest value.
For the first technique, the distribution 8, 8, 10, 12, 12 has four players who are each 2 away from the mean, so 2^2 = 4, times 4 players = 16, for −16 utility compared to ideal.
For the second technique, you lose 4 utility (12 − 8).
For the third technique, the utility for each player is 8, 8, ((10+8)/2 = 9), ((12+8)/2 = 10), ((12+8)/2 = 10), for a total of −5 against ideal.
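The three techniques are easy to write down as code. A minimal Python sketch follows; the function names are illustrative, and each one returns the penalty (or adjusted utilities) before any sign is applied.

```python
def squared_deviation_penalty(utils):
    """Technique 1: sum of squared deviations from the mean."""
    mean = sum(utils) / len(utils)
    return sum((u - mean) ** 2 for u in utils)

def range_penalty(utils):
    """Technique 2: gap between the best-off and worst-off player."""
    return max(utils) - min(utils)

def averaged_with_lowest(utils):
    """Technique 3: each player's utility is averaged with the lowest utility."""
    low = min(utils)
    return [(u + low) / 2 for u in utils]

unfair = [8, 8, 10, 12, 12]
ideal_total = 50  # five players at 10 utilions each

penalty_1 = squared_deviation_penalty(unfair)
penalty_2 = range_penalty(unfair)
penalty_3 = ideal_total - sum(averaged_with_lowest(unfair))

print(penalty_1, penalty_2, penalty_3)  # 16.0 4 5.0
```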
And that’s all assuming that fairness is a terminal value, not something that generates utility. That’s all assuming we’re playing with Platonic Utilions with linear value, rather than money (which seems to fall in value the more you get).
I mean this sincerely: if you’re not a troll, I am genuinely and deeply confused how you could possibly think this is the slightest bit incompatible with VNM utilitarianism.
Ok, let’s apply these functions to a different scenario:
There are two people A and B, A has utility 5 and B has utility 10. We have no way of increasing their utilities but we can make things worse for them. Your term suggests we should lower B’s utility as a deadweight loss to make things more fair. This seems wrong.
Technique C already handles this: (10+5)/2 = 7.5, while (5+5)/2 = 5. So clearly going from 10->5 is bad, but having both of them be at 7.5 would be better, and having both of them at 10 would be even better still.
For technique B, yes, you will get results that say power imbalances are unfair and should be destroyed. The simplest example I could give is a world where Hitler has a million soldiers and everyone else has 100,000 combined. That power imbalance is dangerous, because Hitler can leverage that advantage to gain an even larger advantage, and so, over time, that inequality gets worse, and it can even reduce net utility (after the war, Hitler has 950,000 soldiers and everyone else has 50,000; 100K people died, and the world is more unfair!)
One of the big stumbling blocks for me with social justice was understanding that power imbalances can be bad in and of themselves. It’s not just soldiers, either. This happens rather vividly with money and many other resources (“spoons” seem to work this way, if you’re familiar with “spoon theory”)
Of course technique C doesn’t address the weasel example.
When did we switch from talking about utility to talking about power? I agree power imbalances are dangerous; however, this fact doesn’t seem to bear on the weasel example.
Have you considered using full thoughts… ooooh. What the hell is with all the trolls these days? :(
For the audience at home: That’s because out in “reality”, we can’t measure utilions, so we use things like power and money as proxies. In an ideal utopia with perfectly calibrated Utili-meters, this would not be as relevant.
I’m not sure how to read this. I’m leaning towards, “I don’t have a counter argument so I’m going to resort to insults.”
To get back to the point, the problem with technique C is that it doesn’t address the case of adjusting test scores based on demographic priors, since the lowest utility (the people not accepted) is the same either way.
You’re the one who just dropped the discussion to DH level 1 or 2.
You have a repeated pattern of not offering real responses: “Is this a parody?” “Is this?” being the biggest red flag I’ve encountered in this thread.
You are correct that I didn’t have a refutation, because “I don’t see how this ties in to the weasels” doesn’t give me enough information to try and resolve your confusion. In short, lately you seem to be putting near-zero effort into your replies: you’re not attempting to explain your position, just offering pithy one-sentence objections that don’t seem to contribute anything.
Given you have 2K karma and a few +50 rated comments, I’m willing to assume you’ve just had a bad week and actually explain this, but I still see no point in actually continuing the conversation, since your replies are all “taxing” me the same way a troll does: you put in minimal effort, and force the other person to hold it all afloat.
It’s the very definition of skilled trolling, to force other people to spend paragraphs defending themselves while you resort to easily misinterpreted one-sentence replies that do nothing to advance actual discourse.
The idea that I must maintain quality discourse, or even that it’s more productive, is a trap that ends up with a bunch of well-fed trolls.
It’s as real a response as the question it’s a response to and I give a substantive response to Nisan’s more substantive sentence.
You could give some indication of what additional information would help. Here are some possibilities:
1) You didn’t get what the weasels were referring to. Arguably I should have linked to this comment in the great-grandparent, but since the comment in question is yours, I assumed you’d get the reference.
2) You think the technique does in fact address the weasel example, in which case you could have said so, and possibly explained how you think it applies.
3) Something I haven’t thought of.
People care about fairness, and get negative utility from feeling like they are being treated unfairly.
So let’s apply Eliezer’s “murder pill” thought experiment to this:
If I offered people a pill to make them not care about being treated unfairly, would they take it?
If the answer is no, that means they care about fairness beyond the bad feeling it generates.
I’d have to think about it but if I didn’t think it would involve being severely taken advantage of to the point where it impacts what I want to do I’d probably take it.