Learned helplessness about “teaching to the test”

Viliam 13 Jun 2025 17:53 UTC

36 points

I keep wondering why there is so much learned helplessness about “teaching to the test”.

It is often used as an example of Goodharting (e.g. here), and the implied conclusion seems to be… that it is wrong to try testing students systematically, because it is known that it inevitably causes teaching to the test, which ruins education?

Every time I read something like that, I think: Why can’t we simply fix the test, so that “teaching to the test” either becomes impossible, or becomes the right thing to do? To me this seems like the obvious reaction, so I am surprised that I don’t see it more often.

Let me give you a fictional example: Suppose that a school teaches children the multiplication table up to 10×10. The department of education creates a test, containing two questions: “how much is 5×5?” and “how much is 7×8?”

Yes, if you are going to judge schools based on how well students answer these two questions, of course many teachers are going to follow the incentives, and instead of multiplication, they will spend all the class time making the students memorize “5×5=25” and “7×8=56”, even if doing so means that there will be no time left for other multiplication examples. So the next generation of students will have no idea how much 6×6 is, despite officially having multiplication in the curriculum.

A scary story, isn’t it? Does it mean that we should never test children on multiplication?

No, that would be completely stupid! (Looking around anxiously, hoping that someone agrees with me...)

The problem with the proposed test is that out of one hundred possible multiplication problems, it predictably tests two predetermined ones.

Well, how about instead of that, each year generate two multiplication problems randomly? That way, teachers won’t know which specific multiplication problems they need to teach, so the best educational strategy will be to teach all of them.

Okay, one problem with this is so obvious that even I can predict it. If you literally choose the problems randomly, some years you are going to get “1×1” and “1×2” or something like that on the test, and people won’t stop talking about how you ruined mathematical education forever, and how unfair it is that one generation of students got such easy problems, compared to the previous and the following years.

But if you do the simple fix and remove multiplication by one from the test, sooner or later the teachers will notice, and soon we will have a generation of students who have never learned how to multiply by one. Is there a way out of this dilemma?

Uhm, maybe we could use some heuristic to classify the multiplication problems into difficulty categories, such as:

- trivial: multiplication by one or zero; otherwise
- easy: multiplication by two or by ten, or if the result is at most ten; otherwise
- medium: at least one of the numbers is smaller than six; otherwise
- hard: everything else.

We could discuss the specific boundaries of the categories, but that’s not the point. The test will contain six randomly generated problems: one trivial, one easy, two medium, two hard, in random order.
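As a rough sketch, the heuristic and the six-problem test could look like this in Python. The factor range 1 to 10 (the “one hundred possible multiplication problems”) and every name here (`classify`, `POOL`, `BLUEPRINT`, `generate_test`) are illustrative assumptions, not part of any real testing system.

```python
import random
from itertools import product


def classify(a: int, b: int) -> str:
    """Difficulty category of a*b; the first matching rule wins ("otherwise")."""
    if a in (0, 1) or b in (0, 1):
        return "trivial"
    if a in (2, 10) or b in (2, 10) or a * b <= 10:
        return "easy"
    if a < 6 or b < 6:
        return "medium"
    return "hard"


# Assumption: the table runs from 1x1 to 10x10, i.e. 100 ordered problems.
POOL: dict[str, list[tuple[int, int]]] = {}
for a, b in product(range(1, 11), repeat=2):
    POOL.setdefault(classify(a, b), []).append((a, b))

# One trivial, one easy, two medium, two hard.
BLUEPRINT = {"trivial": 1, "easy": 1, "medium": 2, "hard": 2}


def generate_test(rng: random.Random) -> list[tuple[int, int]]:
    """Draw the required number of problems per category, then shuffle the order."""
    problems: list[tuple[int, int]] = []
    for category, count in BLUEPRINT.items():
        problems.extend(rng.sample(POOL[category], count))
    rng.shuffle(problems)
    return problems


if __name__ == "__main__":
    for a, b in generate_test(random.Random()):
        print(f"how much is {a}×{b}?")
```

Because the problems are drawn fresh each year from the whole table, the only reliable preparation is to teach the whole table; the category quotas only keep the overall difficulty comparable between years.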

I think we are done here.

The same kind of thinking could be used for many objections that come up against testing. Let me try a few:

“Some schools teach basic math, other schools teach advanced math.”

Let’s make a “basic math” test and an “advanced math” test. The students who only had basic math will do the first one; the students who had advanced math will do both.

Why both, instead of only the advanced one? That’s for the students whose abilities were somewhere between both groups, who decided to take the advanced lessons and then failed the advanced test. I want to know how they stand compared to the kids who took basic math, for two reasons:

- to lower the risk for those kids: now they don’t have to choose between doing well on the basic test (which might seem “good, even if nothing special” to a third party), or doing very badly on the advanced test (which might seem worse), because they will still have the good results on the basic test;
- to reduce the temptation for kids who suck at math to take the advanced lessons anyway, if they predict the opposite reaction of the third party: that even bad advanced test results will look better than good (or bad) basic test results.

Generally, you can reduce various kinds of “strategizing” by always making the advanced students also take the test for the basic students. It will allow you to easily compare all students on the same metric (the basic test). It might even stop various conspiracy theories about how the advanced classes are secretly the easier ones, etc.

“But still, this motivates teachers to only teach the stuff that is in the textbooks, and never even mention anything that would be considered too advanced, even if there happens to be a good opportunity for that. Also, kids who do math competitions won’t be able to get extra points; they will seem just like someone who merely… studies okay.”

Solution: add a few bonus questions that are definitely outside the curriculum for given age. Make it clear that those are bonus questions and that the students are not really expected to solve them. Maybe put them on an extra paper that the student has to explicitly ask for. That way, someone can score e.g. 110% on the test.

...I could go on, but I guess the point is clear. Whenever the objection is “but the test doesn’t X”, the proposed solution is to make the test X.

This is something that feels completely obvious to me, but I never hear it mentioned when people discuss “teaching to the test”, so… I am looking forward to your replies.

Education World Optimization

localdeity 14 Jun 2025 0:29 UTC
26 points
1
It’s possible that “teaching to the test” tends to refer to something a bit more specific. Here is John Holt in “How Children Fail”, which some upstanding citizen has put onto the internet in easily googleable form:
This past year I had some terrible students. I failed more kids, mostly in French and Algebra, than did all the rest of the teachers in the school together. I did my best to get them through, goodness knows. Before every test we had a big cram session of practice work, politely known as “review.” When they failed the exam, we had post mortems, then more review, then a makeup test (always easier than the first), which they almost always failed again.
Much later:
We teachers, from primary school through graduate school, all seem to be hard at work at the business of making it look as if our students know more than they really do. Our standing among other teachers, or of our school among other schools, depends on how much our students seem to know; not on how much they really know, or how effectively they can use what they know, or even whether they can use it at all. The more material we can appear to “cover” in our course, or syllabus, or curriculum, the better we look; and the more easily we can show that when they left our class our students knew what they were “supposed” to know, the more easily can we escape blame if and when it later appears (and it usually does) that much of that material they do not know at all.
When I was in my last year at school, we seniors stayed around an extra week to cram for college boards. Our ancient-history teacher told us, on the basis of long experience, that we would do well to prepare ourselves to write for twenty minutes on each of a list of fifteen topics that he gave us. We studied his list. We knew the wisdom of taking that kind of advice; if we had not, we would not have been at that school. When the boards came, we found that his list comfortably covered every one of the eight questions we were asked. So we got credit for knowing a great deal about ancient history, which we did not, he got credit for being a good teacher, which he was not, and the school got credit for being, as it was, a good place to go if you wanted to be sure of getting into a prestige college. The fact was that I knew very little about ancient history; that much of what I thought I knew was misleading or false; that then, and for many years afterwards, I disliked history and thought it pointless and a waste of time; and that two months later I could not have come close to passing the history college boards, or even a much easier test, but who cared?
I have played the game myself. When I began teaching I thought, naively, that the purpose of a test was to test, to find out what the students knew about the course. It didn’t take me long to find out that if I gave my students surprise tests, covering the whole material of the course to date, almost everyone flunked. This made me look bad, and posed problems for the school. I learned that the only way to get a respectable percentage of decent or even passing grades was to announce tests well in advance, tell in some detail what material they would cover, and hold plenty of advance practice in the kind of questions that would be asked, which is called review. I later learned that teachers do this everywhere. We know that what we are doing is not really honest, but we dare not be the first to stop, and we try to justify or excuse ourselves by saying that, after all, it does no particular harm. But we are wrong; it does great harm.
It does harm, first of all, because it is dishonest and the students know it. My friends and I, breezing through the ancient-history boards, knew very well that a trick was being played on someone, we were not quite sure on whom. Our success on the boards was due, not to our knowledge of ancient history, which was scanty, but to our teacher’s skill as a predictor, which was great. Even children much younger than we were learn that what most teachers want and reward are not knowledge and understanding but the appearance of them. The smart and able ones, at least, come to look on school as something of a racket, which it is their job to learn how to beat. And learn they do; they become experts at smelling out the unspoken and often unconscious preferences and prejudices of their teachers, and at taking full advantage of them. My first English teacher at prep school gave us Macaulay’s essay on Lord Clive to read, and from his pleasure in reading it aloud I saw that he was a sucker for the periodic sentence, a long complex sentence with the main verb at the end. Thereafter I took care to construct at least one such sentence in every paper I wrote for him, and thus assured myself a good mark in the course.
Not only does the examination racket do harm by making students feel that a search for honest understanding is beside the point; it does further harm by discouraging those few students who go on making that search in spite of everything. The student who will not be satisfied merely to know “right answers” or recipes for getting them will not have an easy time in school, particularly since facts and recipes may be all that his teachers know. They tend to be impatient or even angry with the student who wants to know, not just what happened, but why it happened as it did and not some other way. They rarely have the knowledge to answer such questions, and even more rarely have the time; there is all that material to cover.
In short, our “Tell-‘em-and-test-’em” way of teaching leaves most students increasingly confused, aware that their academic success rests on shaky foundations, and convinced that school is mainly a place where you follow meaningless procedures to get meaningless answers to meaningless questions.
And also:
It begins to look as if the test-examination-marks business is a gigantic racket, the purpose of which is to enable students, teachers, and schools to take part in a joint pretense that the students know everything they are supposed to know, when in fact they know only a small part of it—if any at all. Why do we always announce exams in advance, if not to give students a chance to cram for them? Why do teachers, even in graduate schools, always say quite specifically what the exam will be about, even telling the type of questions that will be given? Because otherwise too many students would flunk. What would happen at Harvard or Yale if a prof gave a surprise test in March on work covered in October? Everyone knows what would happen; that’s why they don’t do it.
- Viliam 14 Jun 2025 13:34 UTC
  9 points
  6
  Parent
  I agree that cramming right before the test increases the score without increasing long-term knowledge.
  But this...
  a list of fifteen topics that he gave us. [...] comfortably covered every one of the eight questions we were asked.
  ...seems to be exactly the kind of a problem that I described. The problem is that the exam authors always choose 8 out of the same 15 topics—as the teacher has noticed.
  And presumably those 15 topics are just a small part of the entire curriculum… otherwise I don’t see what would be the problem with the teacher telling students to learn, well, everything. (We don’t call it cheating when the grammar teachers make students practice 26 letters before doing 1st grade exams.)
  And, as I wrote, the problem could be easily fixed by choosing those 8 questions out of all possible topics, rather than the 15 selected ones. Or, even if you think that those are the 15 most important topics in the curriculum, make another list of less important topics, and then every year choose e.g. 5 of the “important” and 3 of the “unimportant” questions.
  I learned that the only way to get a respectable percentage of decent or even passing grades was to announce tests well in advance, tell in some detail what material they would cover, and hold plenty of advance practice in the kind of questions that would be asked, which is called review. I later learned that teachers do this everywhere.
  Ah, so the problem is that the teachers are the ones who prepare the tests. In which case, indeed, there is a massive conflict of interest here—the teachers de facto indirectly grade themselves, and they can give themselves better grades by inventing worse tests. Of course! But then the actual problem is this conflict of interest, not testing per se. So the solution is not less testing, but independent testing.
  The problem is not “teaching to the test”, but “the teacher evaluates his or her own work”.
  And the independent testing is exactly a way to fight this… which is why I find it ironic that “teachers cheat when they test their own work” is in practice usually used as an argument against independent testing.
Charlie Steiner 13 Jun 2025 19:09 UTC
11 points
0
For math specifically, this seems useful. Maybe also for some notion of “general knowledge.”
I had a music class in elementary school. How would you test for whether the students have learned to make music? I had a Spanish class: how do you test kids’ conversational skills?
Prior to good multimodal AI, the answer [either was or still is, not sure] to send a skilled proctor to interact with students one-on-one. But I think this is too unpalatable for reliability, cost, and objectivity reasons.
(Other similar skills: writing fiction, writing fact, teamwork, conflict resolution, debate, media literacy, cooking, knowledge of your local town)
jbash 13 Jun 2025 20:38 UTC
9 points
3
It’s hard to write good tests. It’s even harder to write good standardized tests that have to be administered to huge numbers of people, within a reasonable time, in purely written form, and graded consistently, again within reasonable time^[1].

… and there are a ton of practical and political pressures that make it even harder.

People don’t actually agree on what they’re trying to teach, why it’s important to teach it, what constitutes good performance, what’s essential versus what’s “extra”, what it’s fair to test, what forms even of practically accessible testing are fair to use, etc. That list of critical competencies in the state curriculum wasn’t handed down by God. It was negotiated, and it’s at best a poor approximation of what anybody wants. And there was still more negotiation involved in making a test out of it.

Here’s an example of a practical and maybe political limitation: If you have a diagnosed learning disability, certified by somebody with the right piece of paper on their wall, you can (often) get extra time on tests. But you still have to take the tests, because the bottom line is that, disability or no, you still have to learn the material. You don’t get excused from actually getting the required understanding, even if it’s harder for you. You’re just given the opportunity to show that you can do it anyway.

The Good and the Great have agreed among themselves that, after all, what they care about is your understanding of, say, this or that kind of math, which they believe is best demonstrated by your ability to solve whatever problems are on the test. But they don’t really care about speed. It’s not a footrace, and nobody’s training (or hiring) people for an assembly line. If anybody in 2025 cared about actually solving the problems fast, or indeed accurately, they’d use a machine to do it anyway^[2].

So we’ve established that speed doesn’t matter, at least not enough to withhold grades or credentials over it. It’s not a critical part of the “X” you’re trying to test.

OK, then why was the test timed in the first place? You don’t care about speed, remember? So why doesn’t everybody taking the test get all the time they feel they need? Your test is measuring the wrong thing!

The main real answer is probably that it’d be costly and administratively inconvenient to reserve a big room for, say, twice as long, and costly and personally inconvenient for somebody to sit there and proctor for that long. The second real answer is probably that timed tests are traditional, dammit.

So if it’s even possible to do what you’re suggesting, it’s going to be a lot of work. It’s going to have a cost. You’ll have to give something up to do it. Before you do that, you should know what value you’re getting in exchange.

So what value is this fancy test actually supposed to produce, once you have it?
1. The “bonus questions” you suggest probably make all time issues worse. Part of test design is to devote resources, including student time and grader time, to questions that are likely to get missed often enough to give you useful information about many of your testees.
2. Math problems machines can’t solve these days aren’t going to be good for tests, because most testees won’t be able to solve them either, at least not within the available time. We keep coming back to time.
- Jiro 14 Jun 2025 19:27 UTC
  4 points
  0
  Parent
  
  OK, then why was the test timed in the first place? You don’t care about speed, remember?
  
  Someone who takes a very long time to get answers may be more likely to completely forget the information in the future, may be less able to use the information in real world situations, etc. even though he eventually managed to write the answer on the test. Someone who takes a long time to write the answer down because he has a disability may not have these problems.
  
  So while we don’t care about speed by itself, we care about problems that are correlated with speed, and they may be less so for the disabled.
- Viliam 14 Jun 2025 13:41 UTC
  2 points
  0
  Parent
  So what value is this fancy test actually supposed to produce, once you have it?
  Comparison between schools, for starters. As information for parents about where to send their kids.
  Today, there are many schools that inflate grades, because people use “good grades” as a proxy for “more knowledge”. Which would be closer to the truth if we had better tests, but today it is too easy to cheat.
  In absence of objective testing, the competition between schools is mostly about marketing. This is how you get schools that spend a lot of money on sports and various shiny toys, but provide mediocre education.
  I have taught in different schools, so I know from personal experience that there is a huge difference between how much education the kids get in different schools. But even if a parent trying to get the best education for their child asks me, it is almost impossible for me to prove any of this; there is nothing legible to point at, other than say “trust me, that school is utterly dysfunctional, but they give every student good grades, and they have an army of lawyers to threaten any teacher who talks about what happens there, and the director is a charming psychopath who can convince most parents in a 1:1 talk that this is the best school ever, so the parents are happy to give him lots of money believing that they sacrifice for the good of their children”. (Yes, I have a specific school in mind. Not in USA.)
Viliam 13 Jun 2025 19:09 UTC
8 points
2
A more abstract reason why this irritates me is that it is an example of a more general pattern that goes like this:
- Try a half-assed implementation of X.
- It fails predictably.
- Conclude: “X was experimentally disproved.”
- If anyone suggests doing X properly, say: “true socialism has never been tried, comrades, right?”
Sometimes I wonder whether each popular fallacy has its opposite which is also a popular fallacy.
- MichaelDickens 13 Jun 2025 20:51 UTC
  4 points
  0
  Parent
  Maybe I’m getting too far off-topic but there is another version of this that goes:
  - Do some armchair theorizing where you spend 10 seconds thinking about how to make X work and you can’t think of anything.
  - Conclude: “X is impossible.”
  - Dagon 13 Jun 2025 23:04 UTC
    4 points
    1
    Parent
    This also works in reverse:
    do some armchair theorizing and spend 10 seconds (or even a few hours) thinking about why people who make X aren’t actually doing X very well. You can’t understand why.
    Conclude: “X is simple, and the obvious answer is obvious”.
    - Viliam 14 Jun 2025 14:35 UTC
      6 points
      0
      Parent
      And to make things even more complicated, sometimes there is a problem that indeed could be solved by 10 seconds of thinking, but behind it there is another problem which is taboo to mention, which cannot be solved so easily. So everyone involved ends up pretending that the 10-second solution is insufficient to solve the original problem, because it is their only way out of “we cannot implement this solution” and “but we also cannot explain why”.
      To give an example from my experience, once I was given a task to organize entrance exams for a university. There were many thousand candidates; they couldn’t even fit into the school building on the same day, so the testing took several days, I think always one group in the morning and one group in the afternoon. The questions were all of the “choose: a, b, c, or d” form, and it was still an enormous amount of work to check all the tests. (This was a few decades ago when computers were a new thing, so all those tests were checked manually.)
      But the main problem was that the questions and answers had always leaked. With thousands of students in the building and testing taking multiple days, you couldn’t really prevent it, because all that was needed was to send a few fake candidates whose task was to grab the problems or take a photo on an early day; then they could solve the problems in their free time, and sell the answers to students who took the test on a later day. Those students just memorized the answers like this: “acbdcadb...” or used some simple code to make a tiny cheat sheet (two bits per answer do not require much space).
      So far the best solution was to do A and B versions of the test, so that the students sitting next to each other would get different versions, which prevented them from checking each other’s answers, and made memorizing the answers slightly more difficult. (Even this was frequently defeated by students exchanging their question sheet with someone in a different row, so that they both had the same version as their neighbors. There were too many students and too few teachers to check all of this.)
      My solution was simple: instead of doing versions A and B, I did 4 different versions every day, and 4 different versions the next day, and so on until the end of exams. So the first day, the first row got alternating A, B, A, B..., the second row got alternating C, D, C, D..., the third row A, B again, etc. So exchanging the questions with someone one row away from you didn’t help you get the same questions as your neighbors. (Two rows would be more conspicuous. Also, the students did not expect this; which probably would be less true the following years.) The fake candidates now had to collect four versions of the problems… and even that didn’t help them, because the next day it was versions E, F, E, F… for the odd rows, and G, H, G, H… for the even rows.
      How could I make so many variants with so many questions? Wasn’t that much more work than usual? Well, it was more work, but not an order of magnitude more work. I asked teachers to prepare more questions than usual, but no specific numbers, just “make as many as you comfortably can”. Then, every variant was randomly selected questions, and even the answers a, b, c, d were shuffled randomly. That means, even if two variants got the same question, in one of them it could be question #7 and the correct answer was “b”, but on another it was question #13 and the correct answer was “c”. So you couldn’t defeat this system by memorizing the letters “acbdcadb...”; you would have to actually learn the questions and the right answers. And there were more questions than usual.
      To make evaluating the tests easier, my program that prepared the tests also prepared for each variant tiny strips of paper with the right answers that you could put next to the test, and just check whether the answers match or not. So evaluating the tests was actually easier than the previous years.
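A minimal Python sketch of this scheme follows. It is a reconstruction under stated assumptions, not the original program: the names `Question`, `make_variant` and `seat_variant` are hypothetical, every question is assumed to have exactly four distinct options with the correct one stored first in the pool, and days, rows and seats are counted from zero.

```python
import random
from dataclasses import dataclass


@dataclass
class Question:
    text: str
    options: list[str]  # assumption: four distinct options, options[0] is correct


def make_variant(pool: list[Question], n_questions: int, rng: random.Random):
    """Build one variant: random questions in random order, options shuffled.

    Returns (sheet, key); key is the answer strip, e.g. "acbd...".
    """
    sheet, key = [], []
    for q in rng.sample(pool, n_questions):
        options = q.options[:]  # copy, so the pool itself stays untouched
        rng.shuffle(options)
        sheet.append((q.text, options))
        key.append("abcd"[options.index(q.options[0])])
    return sheet, "".join(key)


def seat_variant(day: int, row: int, seat: int) -> str:
    """Variant label for a seat: four fresh variants per day.

    One parity of rows alternates the day's first two variants, the other
    parity alternates the last two. Letters suffice for days 0 to 5.
    """
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    return letters[4 * day + 2 * (row % 2) + (seat % 2)]
```

With this layout the first row on the first day gets A, B, A, B, the second row C, D, C, D, and the next day the same seats get E, F and G, H. Since both the question order and the option order differ between variants, a memorized letter string from one variant is useless on another.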
      And… it was a huge success, all people involved in evaluating the tests got a bonus, and a few people congratulated me on the successful organization of something they believed would be utter chaos.
      And… the next year my system was abandoned; the university returned to the original way of doing entrance exams, and the debate about improving the exams became a taboo.
      As far as I know, although I can only guess here, it is because the answers were actually regularly leaked by some very high-status employees of the university, and my experiment has ruined their nice side business for one year. So they invented bullshit reasons about why this isn’t a good way to do exams (the people who did them with me agreed that the reasons were bullshit, e.g. “because it is much more work” but everyone who actually did the work agreed that it was less work than usual), and thus the debate was officially concluded. “You see, we tried your experiment, but it failed.”
      So, some problems cannot be solved simply because there are important people who don’t want them to get solved. But the official answer is still that your solution wouldn’t work. Trust the experts!
Gordon Seidoh Worley 16 Jun 2025 1:36 UTC
4 points
2
This got me thinking, I really wish school had looked more like unlocking a tech tree in a game like Civilization than completing ordered stages in a game like Super Mario Brothers (with occasional warps/shortcuts for the clever).
Maybe with AI tutors it can. Study a subject. Take a test. If you pass (and passing here means demonstrating mastery, not getting some score that’s “good enough” to meet some standard for how much you have to know to be allowed to move on to the next grade with all your peers even though you clearly don’t have full mastery of the topics), you move on, and if you don’t, you keep studying until you can master the test. No need to learn subjects in lockstep with same-age peers, because everyone can be studying what they need to learn with their own, independent AI teacher.
- Viliam 16 Jun 2025 8:25 UTC
  7 points
  2
  Parent
  I tried to make the “tech tree” for elementary-school math, and it turned out to be much more complicated than I expected.
  Consider this: as a first approximation, we could make nodes for “addition”, “subtraction”, “multiplication”, and “division”. But if you look closer, there is actually not one node for “addition”, but there is a separate node for “integer addition up to 10, e.g. 3+5” (which children can calculate using their fingers), “integer addition across 10, e.g. 8+7” (which depends on basic subtraction, because to calculate 8+7 you use 10-8=2 and 7-2=5 as intermediate steps), and then there is a general skill of “adding two integers of arbitrary length”. Should we also make separate nodes for “addition is commutative”, “addition is associative”, and “adding a negative number means subtracting its absolute value”? I think we should, because ultimately we need to teach these concepts to the kids, but… you get the idea about how large the resulting tree will be.
  Large trees are inconvenient to work with. A tree with 10 nodes is a nice picture. A tree with 1000 nodes is an intricate network of lines. You could arrange it into categories (e.g. put all the “addition” nodes next to each other, or maybe make an “addition and subtraction” cluster), which would allow you to locate a node more easily, but you would have many lines between the categories. This could be handled by a computer: for example, the nodes you have already mastered (and took an exam for) would be checked, and if you click on a new node, all nodes between your current knowledge and the selected node would be highlighted, to give you an idea about how much you would need to learn in order to get there.
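The “highlight everything between what you know and the node you clicked” step is a small graph traversal. Here is a hedged Python sketch; the function name `nodes_to_learn`, the dictionary layout, and the example edges (in particular that multi-digit addition depends on addition across 10) are illustrative assumptions, and a mastered node is assumed to imply that its prerequisites are mastered too.

```python
def nodes_to_learn(prereqs: dict[str, list[str]], mastered: set[str], target: str) -> list[str]:
    """Unmastered nodes needed to reach `target`, prerequisites listed first.

    `prereqs` maps each node to its direct prerequisites.
    """
    seen = set(mastered)  # mastered nodes are never revisited or listed
    order: list[str] = []

    def visit(node: str) -> None:
        if node in seen:
            return
        seen.add(node)
        for p in prereqs.get(node, []):
            visit(p)
        order.append(node)  # post-order: a node comes after its prerequisites

    visit(target)
    return order


# Hypothetical fragment of the elementary-school math tree.
PREREQS = {
    "addition across 10": ["addition up to 10", "subtraction up to 10"],
    "multi-digit addition": ["addition across 10"],
}

path = nodes_to_learn(PREREQS, {"addition up to 10"}, "multi-digit addition")
# path == ["subtraction up to 10", "addition across 10", "multi-digit addition"]
```

The length of the returned list is the “how much would I need to learn” number; a real tree with a thousand nodes would use the same traversal, only with far more edges to maintain.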
  This is not an argument against constructing such a tree. I definitely would like to see it! I am just saying that it is a lot of work, and difficult to use without a computer, so I understand why teachers are not doing this, and instead handle the complexity by creating groups of nodes, and arranging those groups in linear order.
nim 13 Jun 2025 20:57 UTC
4 points
0
Whose job is it to build the actual tests that students encounter in the real world, and what are their incentives?
MichaelDickens 13 Jun 2025 20:42 UTC
4 points
0
I hadn’t really considered this before but I think you’re right.

The only standardized test I’m familiar with is the SAT but the SAT does, in fact, do the things you suggest as ways to avoid “teaching to the test”. I’m not worried about people “teaching to” the SAT.

Other than the vocab section, I think the only way to do well on the SAT is to be genuinely good at math and reading comprehension. I spent a long time studying for the SAT and I improved my score by ~200 points but I never figured out any way to “hack” it.

(Except for the vocab section, because the list of SAT vocab words is much shorter than the list of English words. But that problem is trivially fixable, so I suspect there must be some deliberate reason why the SAT draws from such a small pool of words.)

To be fair, in the process of studying for the SAT, I don’t think I got better at math in a way I care about. I was already 3 years beyond the most advanced math that shows up on the SAT. I improved my score by going from ~90% accuracy on SAT-style math problems to ~97% accuracy, which is a genuine improvement even though 90% is totally fine for practical purposes.
Karl Krueger 13 Jun 2025 22:18 UTC
3 points
0
A standardized test measures some combination of ① the intended subject-matter skills and knowledge, and ② skills and knowledge that are specific to manipulating the test, also known as “test-taking skills”. We know this because we can teach test-taking skills; which is to say, we can improve a student’s performance on a standardized subject test by teaching them something that is not that subject.
A perfect standardized test would measure only ① with no influence from ②. If the subject is biology, then a perfect test would be one where the only way to get a better score would be to actually get better at biology, not at test-taking.
Kabir Kumar 28 Jul 2025 12:26 UTC
1 point
0
yeah, the way to do this is to create incentives that reward making very good tests, and that keep doing so.
it’s essentially an alignment problem