# The irrelevance of test scores is greatly exaggerated

Here are some claims about how grades (GPA) and test scores (ACT) predict success in college.

In a study released this month, the University of Chicago Consortium on School Research found—after surveying more than 55,000 public high school graduates—that grade point averages were five times as strong at predicting college graduation as were ACT scores. (Fortune)

High school GPAs show a very strong relationship with college graduation despite sizable school effects, and the relationship does not differ across high schools. In contrast, the relationship between ACT scores and college graduation is weak-to-nonexistent once school effects are controlled. (University of Chicago Consortium on School Research)

“It was surprising not only to see that there was no relationship between ACT scores and college graduation at some high schools, but also to see that at many high schools the relationship was negative among students with the highest test scores” (Science Daily)

“The bottom line is that high school grades are powerful tools for gauging students’ readiness for college, regardless of which high school a student attends, while ACT scores are not.” (Inside Higher Ed)

(See also the Washington Post, Science Blog, Fatherly, The Chicago Sun Times, etc.)

All these articles are mild adaptations of a press release for Allensworth and Clark’s 2020 paper “High School GPAs and ACT Scores as Predictors of College Completion”.

I understood these articles as making the following claim: Standardized test scores are nearly useless (at least once you know GPAs), and colleges can eliminate them from admissions with no downside.

Surprised by this claim, I read the paper. I apologize if this is indelicate, but… the paper doesn’t give the slightest shred of evidence that the above claim is true. It’s not that the paper is wrong, exactly, it simply doesn’t address how useful ACT scores are for college admissions.

So why do we have all these articles that seem to make this claim, you ask? That’s an interesting question! But first, let’s see what’s actually in the paper.

# Test scores are not irrelevant

The authors got data for 55,084 students who graduated from Chicago public schools between 2006 and 2009. Most of their analysis only looks at a subset of 17,753 who enrolled in a 4-year college immediately after high school. Here’s the percentage of those students who graduated college within 6 years for each possible GPA and ACT score:

We can also visualize this by plotting each row of the above matrix as a line. This shows how graduation rates change with ACT score for a fixed GPA.

It doesn’t appear that ACT scores are useless… But let’s test this more rigorously.

# Test scores are highly predictive

The full dataset isn’t available, but since we have the number of students in each ACT / GPA bin above, we can create a “pseudo” dataset, with a small loss of precision in the GPA and ACT score for each student. I did this, and then fit models to predict whether a student would graduate using GPA alone, ACT alone, or both together. (The model is cubic spline regression on top of a quantile transformation.)
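Here is a sketch of how such a pseudo-dataset can be built. The bin counts below are invented for illustration; the real ones come from the paper’s first table:

```python
# Hypothetical bin counts: (GPA bin midpoint, ACT bin midpoint) -> how many
# students enrolled and how many of them graduated. Values are invented.
bins = {
    (2.25, 18): {"n": 120, "grad": 30},
    (3.25, 24): {"n": 200, "grad": 120},
    (3.75, 30): {"n": 80, "grad": 64},
}

# Expand each bin into one row per student, giving everyone in a bin the
# bin's midpoint GPA and ACT (the "small loss of precision").
rows = []
for (gpa, act), c in bins.items():
    for i in range(c["n"]):
        rows.append({"gpa": gpa, "act": act, "graduated": int(i < c["grad"])})

print(len(rows), sum(r["graduated"] for r in rows))  # 400 214
```

A model can then be fit to these rows as if they were the original per-student records.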

To measure how good these fits are, I used cross-validation, repeatedly holding out 20% of the data, fitting a model like above to the other 80%, and then predicting if each student will graduate. You can measure how accurate the predictions are, either as a simple error rate (1-accuracy) or as a Brier score. I also compare to a model using no features, which just predicts the base rate for everyone.
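Both metrics are simple to compute. Here’s a minimal, dependency-free version with made-up outcomes, showing the no-features baseline that predicts the base rate for everyone:

```python
def error_rate(probs, outcomes):
    # 1 - accuracy, thresholding predicted probabilities at 0.5.
    preds = [1 if p >= 0.5 else 0 for p in probs]
    return sum(p != y for p, y in zip(preds, outcomes)) / len(preds)

def brier_score(probs, outcomes):
    # Mean squared difference between predicted probability and 0/1 outcome.
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# A no-features model predicts the base rate for everyone (made-up outcomes).
outcomes = [1, 0, 1, 1, 0]
base_rate = sum(outcomes) / len(outcomes)  # 0.6
baseline = [base_rate] * len(outcomes)

print(error_rate(baseline, outcomes))          # 0.4
print(round(brier_score(baseline, outcomes), 4))  # 0.24
```

For a base-rate predictor, the Brier score reduces to p(1−p), which is why a model has to beat 0.24 here to be adding anything.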

It’s true that GPA does a bit better than the ACT. But if you care about that difference, you should care even more about the difference between (GPA only) and (GPA plus ACT). It’s not coherent to simultaneously claim that the GPA is better than the ACT and also that the ACT doesn’t add value to the GPA.

I repeated this same calculation with other predictors: logistic regression, decision trees, and random forests. The numbers barely changed at all.

Still, these are all just calculations based on the first table in the paper.

# What the paper actually did

For each student, they recorded three variables:

• Gender

• Ethnicity (Black, Latino, Asian)

• Poverty (average poverty rate in the student’s census block)

For the students who enrolled in a 4-year college, they recorded four variables about that college:

• The number of students at the college

• The percentage of full-time students

• The student-faculty ratio

• The college’s average graduation rate

They standardized all the variables to have zero mean and unit variance (except for gender and ethnicity, since these are binary). For example, GPA=0 for someone with the average grades, and GPA=-2 for someone 2 standard deviations below average.

They also included squared versions of GPA and ACT, i.e. GPA² and ACT². These are never negative and are larger for any student who is unusual on either the high or low end. They do this because the relationship is “slightly quadratic”, which is reasonable, but it’s not explained why the other variables don’t get squared versions.
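As a sketch of that preprocessing (the four grades below are invented):

```python
# Standardize GPA to zero mean and unit variance, then square it.
# The grades here are invented for illustration.
gpas = [2.0, 2.7, 3.4, 4.0]

mean = sum(gpas) / len(gpas)
std = (sum((g - mean) ** 2 for g in gpas) / len(gpas)) ** 0.5

gpa_z = [(g - mean) / std for g in gpas]
gpa_z_sq = [z ** 2 for z in gpa_z]  # never negative; largest for extreme students

print([round(z, 2) for z in gpa_z])
print([round(s, 2) for s in gpa_z_sq])
```

The squared column is near zero for average students and grows for unusual ones at either end, which is what lets the model bend the fit.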

With this data in hand, they fit a bunch of models.

First, they predicted graduation rates from grades alone. Higher grades were better. There’s nothing really surprising here, so let’s skip the details.

Second, they predicted graduation rates from ACT scores alone. Higher ACT scores were better. As you’d expect this relationship is strong. Again, let’s skip the details.

Third, they predicted graduation rates from grades, student variables, and variables for the college the student enrolled at. This model computes a “likely-to-graduate” score for each student; in the paper’s equation, student background variables and college institutional variables are labeled in different colors for clarity.

The “likely-to-graduate” score becomes a probability after a sigmoid transformation. If you’re not familiar with sigmoid functions, think of them like this: If a student has a score of X, then the graduation probability is around .5 + .25 × X. For larger X (say |X| > 1), scores start to have diminishing returns, since probabilities must stay between 0 and 1.

For example, the coefficient for (male) above is -.092. This means that a male has around a 2.3% lower chance of graduating than an otherwise identical female. (For students with very high or very low scores the effect will be less.)
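If it helps, here is that rule of thumb in code; the -.092 coefficient is the one quoted above, and everything else is just the standard sigmoid:

```python
import math

def sigmoid(x):
    # Maps a "likely-to-graduate" score to a probability in (0, 1).
    return 1 / (1 + math.exp(-x))

# Near zero, sigmoid(x) is approximately .5 + .25 * x, so a coefficient of
# -.092 shifts the graduation probability by roughly .25 * .092, i.e. about
# 2.3 percentage points.
print(sigmoid(0.0))                    # 0.5
print(sigmoid(0.0) - sigmoid(-0.092))  # ≈ 0.023
```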

Fourth, they predicted graduation rates from ACT scores, student variables, and college variables.

The dependence on ACT is less than the dependence on GPA in Model 3. However, the dependence on student background and college variables is much higher.

Fifth, they predicted graduation rates from GPAs, ACT scores, student variables, and college variables.

Here, there’s minimal dependence on ACT, but a negative dependence on ACT², meaning that extreme ACT scores (high or low) both lead to lower likely-to-graduate scores.

Does that seem counterintuitive to you? Remember, we are taking a student who is already enrolled in a particular known college and predicting how likely they are to graduate from that college.

Sixth, they predicted graduation rates from the same stuff as in the previous model, but now adding mean GPA and ACT for the student’s school. They also now standardize some variables relative to each high school.

I can’t tell which variables are affected by this change in standardization. My guess is that it’s just GPA and ACT, but it might affect other variables too.

# So what does this mean?

I mean… not much?

Here’s what these models do: Take a student with a certain GPA, ACT score, and background, who is accepted to and enrolls in a given college. How likely are they to graduate?

It’s true that these models have small coefficients in front of ACT. But does this mean ACT scores aren’t good predictors of preparation for college? No. ACT scores are still influencing who enrolls in college and what college they go to. These models made that influence disappear by dropping all the students who didn’t go to college, and then conditioning on the college they went to.

These models don’t say much of anything about how college admissions should work. There are three reasons why.

First, these models are conditioning on student background! Look at the coefficients in Model 5. What exactly is the proposal here, to do college admissions using those coefficients? So colleges should explicitly penalize men and poor students, like this model does? Come on.

Second, test scores influence whether students go to college at all. This entire analysis ignores the 67% of students who don’t enroll in college. The paper confirms that ACT scores are a strong predictor of college enrollment.

Of course, many factors influence whether a student will go to college. Do they want to? Can they get in? Can they afford it?

You might say, “Well of course the ACT is predictive here—colleges are using it.” Sure, but that’s because colleges think it gauges preparation. It’s possible they’re wrong, but… isn’t that kind of the question here? It’s absurd to assume the ACT isn’t predictive of college success, and then use that assumption to prove that the ACT isn’t predictive of college success.

Third, for students who go to college, test scores influence which college they go to, and more selective colleges have higher graduation rates. Here are three private colleges in the Boston area and three public colleges in Michigan.

The paper also does a regression on students who go to college to try to predict the graduation rate of the college they end up at. Again, GPA and ACT scores are about equally predictive.

Of course, you could also drop the student background and college variables, and just predict from GPA and ACT. But remember, we did that above, and the ACT was extremely predictive.

Alternatively, I guess you could condition on student background without conditioning on the college students go to. I doubt this is a good idea or a realistic idea, but at least it’s causally possible for colleges to use such a model to do admissions.

Why didn’t the authors do this? Well… Actually, they did.

Unfortunately, this is sort of hidden away in a corner of the paper, and no coefficients are given other than for GPA and ACT. It’s not clear if GPA² or ACT² are even included here. The authors were not able to provide the other coefficients (nor even to acknowledge multiple polite requests notthatimbitteraboutit).

# The laundering of unproven claims

What happened? There’s really nothing fundamentally wrong in the paper. It fits some models to some data and gets some coefficients! Interpreted carefully, it’s all fine. And the paper itself never really pushes anything beyond the line of what’s technically correct.

Somehow, though, the second the paper ends and the press release starts, all that is thrown out the window. Rather than “ACT scores definitely predict college graduation, but they don’t seem to add extra information once you drop the students who never enrolled, condition on the college each student enrolled at, and condition on student demographic variables in an implausible way”, we get “ACT scores don’t predict college success”.

To be fair, a couple of hedges like “once school effects are controlled” make their way into the articles, but they are treated as minor technical asides and never explained.

Let’s separate a bunch of claims.

1. It might be desirable to reduce the influence of test scores on college admissions to achieve worthy social goals.

2. It might be that test scores don’t predict college graduation rates.

3. It might be that test scores only predict college graduation because selective (and high graduation-rate) colleges choose to use them in admissions.

4. It might be that if selective colleges stopped using test scores in admissions, test scores would no longer predict college graduation.

I’m open to claim #1 being true. If you believe #1, it would be convenient if #2, #3, and #4 were true. But the universe is not here to please us. #2 is not just unproven but proven to be false. This paper does not provide evidence for #3 or #4. Yet because these claims were inserted into the public narrative after peer review, we have a situation where the paper isn’t wrong, yet it is being used as evidence for claims it manifestly failed to establish.

Journals don’t issue retractions for press releases.

# A field guide

There’s a fair number of errors and undefined notation in the paper, which might throw you off if you try to read it. I’ve created a guide to help with this.

•

```python
def printTheNews(science, ideology):
    if science.getTheme() not in ideology.keys():
        return
    print(science.getTheme(), "says", ideology[science.getTheme()])
```

• Clearly the press does not care about code quality, because that’s not Pythonic :(

The pythonic version is `science.theme`: you don’t need a getter.

• I dunno, the press will swallow anything, and then it goes through these cycles of lethargy...

• I gotta say, I never get tired of epistemic walkthroughs of peer-reviewed papers. Upvote for you!

• A more general observation that I’m sure has been stated many times but clicked for me while reading this: Once you condition on the output of a prediction process, correlations are residuals. Positive/negative/zero coefficients then map not to good/bad/irrelevant but to underrated/overrated/valued accurately.

(“Which college a student attends” is the output of a prediction process insofar as students attend the most selective college that accepts them and colleges differ only in their admission cutoffs on a common scoring function, I think.)

• Very well stated. I would be interested in a link to something that describes that principle, the outcome of the prediction process.

• Here’s an argument for why the study’s conclusions are unsupported.

-----

Suppose that there are lots of things that go into predicting what makes a student successful. There’s ACT score, and GPA, and leadership, and race, and socioeconomic status, and countless other things.

Now, suppose colleges have tried to figure out the weightings for each of those factors, and shared their results with each other. They all compute “success scores” for each student.

Harvard takes the top 1000 applicants by score. MIT takes the next 1000. Princeton takes the third 1000. And so on.

So, what happens when you run a regression to predict success from ACT/GPA/etc., while controlling for school?

Well, if the formula is correct, nothing is significant!

Consider Princeton. All its success scores are, say, between +2.02 and +2.04 (Z-scores), because it takes a specific thin slice of the population. That means that all the students are roughly equal. So if you find a student with a higher ACT score, he’s probably got a lower GPA. Because if he were that high in both, he’d be above +2.04 overall and wind up at Harvard instead of Princeton.

In other words, NOTHING correlates to success, controlling for school, if colleges are good enough at predicting who will succeed.

Sure, there’s a small amount of slack, between +2.02 and +2.04, but it’s nowhere near enough to produce statistically significant evidence that any factor is important. Almost 100% of the variance is between schools, not within schools.

So that leaves noise. Any coefficients you find that are non-zero are probably just random artifacts.

Or … they are systematic errors in how schools evaluate students.

In this particular study, they found that controlling for school, GPA was important to success but ACT score was not.

Well, all that means is that colleges are not weighting GPA highly enough. It does NOT mean that GPA is more important than ACT score, or any other factor—only that GPA is more important *after you account for the college’s choice in whom to admit*. It could be that the colleges are giving GPA/ACT a 1:15 ratio, and it should be only 1:10 instead. In other words, ACT could still be hugely more important than GPA, but the schools are making it a little TOO huge.

Even if everything in the study is correct, I would argue they misunderstood what they were measuring, and what the results mean. They only mean colleges are underestimating GPA relative to ACT, not that GPA is more important than ACT.
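The thin-slice argument above is easy to simulate. Below, ACT and GPA are independent standard normals, the “success score” is their sum with equal weights, and a hypothetical “Harvard” takes the top 1,000 of 10,000 students by score (all numbers invented). Conditioning on the admitted slice induces a negative ACT–GPA correlation even though none exists overall:

```python
import random

random.seed(0)  # deterministic toy data

# 10,000 students with independent ACT and GPA (in standardized units).
n = 10_000
act = [random.gauss(0, 1) for _ in range(n)]
gpa = [random.gauss(0, 1) for _ in range(n)]

# Success score with equal weights; "Harvard" takes the top 1,000 by score.
order = sorted(range(n), key=lambda i: act[i] + gpa[i], reverse=True)
top = order[:1000]

def corr(xs, ys):
    # Pearson correlation, computed from scratch to stay dependency-free.
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / m
    vx = sum((x - mx) ** 2 for x in xs) / m
    vy = sum((y - my) ** 2 for y in ys) / m
    return cov / (vx * vy) ** 0.5

print(corr(act, gpa))                                      # roughly zero
print(corr([act[i] for i in top], [gpa[i] for i in top]))  # strongly negative
```

Within the admitted slice, the two factors trade off almost one-for-one, so a regression run within schools sees a negative relationship that doesn’t exist in the full population.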

-----

Here’s an analogy:

A store will only let you in if you have exactly $1000 worth of large bills in your wallet. An academic study measures how much stuff you get based on all the money in your wallet, including small bills. Since everyone has exactly $1000 in large bills, the regression can’t use them, and it finds that 100% of the differences in success come from small bills.

That doesn’t mean that large bills don’t matter! It means that large bills don’t matter given that you got admission to the store. Large bills DO matter, because otherwise you wouldn’t have gotten in!

Similarly, this study’s results don’t mean that ACT doesn’t matter. They mean that ACT doesn’t matter given that you got admission to the college. If college admission criteria include ACT, then ACT does matter, because otherwise you wouldn’t have gotten in!

• I realized I forgot to provide evidence from the paper that the range of ACT within colleges is smaller than the range of GPA.

From p.207 of the paper:

“Thus, ACT scores are related to college graduation, in part, because students with higher scores are more likely to attend the kinds of colleges where students are more likely to graduate...”

(I think they obviously have this backwards, for the most part. Seems to me more likely that those “kinds of colleges” have higher graduation rates because they choose students with the higher ACT scores.)

From p. 206:

“Many schools do not have students with very high ACT scores, and a number of schools do not have students with very low ACT scores [which explains why some colleges do not have students from the full ACT range, even though they do have students from the full GPA range].”

In other words: students DO sort themselves into schools based on ACT score more than they do by GPA.

• Correction to above: the quote from p. 206 refers to high schools, not colleges.

For colleges, I found a page here that lists 25th and 75th ACT percentiles. Some pairs of schools have no overlap at all; for instance, Ohio State’s middle interval is (27, 31), while Vanderbilt’s is (32, 35). The average for college enrollees, per this study, was 20.1, with an SD of 4.33. So Vanderbilt’s 25th percentile is almost +3 SD.

For GPA … the 25th percentile for Vanderbilt is 3.75. The mean in this study was 2.72, with an SD of 0.65. So the 25th percentile for GPA was only around +1.6 SD.

For ACT at Vanderbilt, the 75th percentile is about 0.69 SD higher than the 25th. If the same were true for GPA, the 75th percentile would have to be around 4.20, which is clearly impossible, since the upper limit is 4.00.

So that supports the idea that for a given school, ACT has a narrower range than GPA.
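For what it’s worth, the 25th-percentile z-scores in this comment can be checked directly (means and SDs as quoted from the study, Vanderbilt’s percentiles from the linked page):

```python
# Means and SDs for college enrollees, as reported in the study.
act_mean, act_sd = 20.1, 4.33
gpa_mean, gpa_sd = 2.72, 0.65

# Vanderbilt's 25th percentiles, from the linked admissions page.
vandy_act_25 = 32
vandy_gpa_25 = 3.75

print(round((vandy_act_25 - act_mean) / act_sd, 2))  # 2.75, i.e. almost +3 SD
print(round((vandy_gpa_25 - gpa_mean) / gpa_sd, 2))  # 1.58, i.e. around +1.6 SD
```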

• Here, there’s minimal dependence on ACT, but a negative dependence on ACT², meaning that extreme ACT scores (high or low) both lead to lower likely-to-graduate scores.

Does that seem counterintuitive to you? Remember, we are taking a student who is already enrolled in a particular known college and predicting how likely they are to graduate from that college.

Sounds like a classic example of Simpson’s paradox, no?

• Where are the footnotes?

• I’m not very familiar with academia, but have you considered sending this to the authors of the paper to a) see if there are any mistakes you made and b) help them avoid similar errors in the future?
But I acknowledge that this could lead to a long email exchange that you may not want.

• I’ve politely contacted them several times via several different channels, just asking for clarifications and what the “missing coefficients” are in the last model. Total stonewall: they won’t even acknowledge my contacts. Some people more connected to the education community also apparently did the same as a result of my post, with the same result.