Seven Years of Spaced Repetition Software in the Classroom

Description

This is a reflective essay and report on my experiences using Spaced Repetition Software (SRS) in an American high school classroom. It follows my 2015 and 2016 posts on the same topic.

Because I value concise summaries in non-fiction, I provide one immediately below. However, I also believe in the power of narrative, in carefully unfolding a story so as to maximize reader engagement and impact. As I have applied such narrative considerations in writing this post, I consider the following summary to be a spoiler.

I’ll let you decide what to do with that information.

Summary (spoilers)

My earlier push for classroom SRS solutions was driven by a belief I came to see as fallacious: that forgetting is the undoing of learning. This epistemic shift drove me to abandon designs for a custom app that would have integrated whole-class and individual SRS functions.

While I still see value in classroom use of Spaced Repetition Software, especially in basic language acquisition, I have greatly reduced its use in my own classes.

In my third year of experiments (2016-17), I used a windfall of classroom computers to give students supervised time to independently study using an SRS app with individual profiles. I found longer-term average performance to be slightly worse than under the whole-class group study model, though students of high intelligence and motivation saw slight improvements.

Intro and response to Piotr Woźniak

I have recently received a number of requests to revisit the topic of classroom SRS after years of silence on the subject. Understandably, the term “postmortem” has come up more than once. Did I hit a dead end? Do I still use it?

Also, I was informed that SRS founding father Piotr Woźniak recently added a page to his SuperMemo wiki in which he quoted me at length and claimed that SRS doesn’t belong in the classroom.

Well, I don’t have much in the way of rebuttal, because Woźniak’s main goal with the page seems to be to use my experience as ammunition against the perpetuation of school-as-we-know-it, which seems like a worthy crusade. He introduces my earlier classroom SRS posts by saying, “This teacher could write the same articles with the same conclusions. Only the terminology would differ.” I’ll take that as high praise.

If I were to quibble, it would be with the part shortly after this, where he says:

The entire analysis is made with an important assumption: “school is good, school is inevitable, and school is here to stay, so we better learn to live with it”.

Inevitable? Maybe. Here to stay? Realistically, yes. But good? At best, I might describe our educational system as an “inadequate equilibrium”. At worst? A pit so deep we still don’t know what’s at the bottom, except that it eats souls.

Other than that, let me reiterate my long-running agreement with Woźniak that SRS is best when used by a self-motivated individual, and that my classroom antics are an ugly hack around the fact that self-motivation is a rare element this deep in the mines.

Anyone who can show us a way out will have my attention. In the meantime, I’ll do my best to keep a light on.

Prologue

At the end of my 2016 post, I teased a peek at a classroom SRS+ app I was preparing to build. It would have married whole-class and individual study functions with some other clever features to reduce teacher workload.

I had a 10k word document in hand: a mix of rationale, feature descriptions, and hypothetical “user stories”. I wasn’t looking for funding or a co-founder, just some technical suggestions and moral support. I would have been my own first user, and I had to keep my day job for that anyway.

But each time I read my draft, I had this growing, sickening sense that I was lying to myself and my potential customers, like a door-to-door missionary choking back a tide of latent atheism. And I should know, because the last time I had felt this kind of queasiness I was a door-to-door missionary choking back a tide of latent atheism.

I thought maybe this was just the kind of general self-doubt common to anyone undertaking something audacious, but I paused my work on it for another school year while I tried the obvious thing: providing students individual SRS app profiles and supervised class time in which to use them.

This is a two-part essay, and in Part 2, I’ll tell you how that went. But in Part 1, I’m going to make the case that Part 2 doesn’t matter very much.

Part 1: Everybody Poops

A great and terrible vision

As I wrapped up my Third Year experiment, I again tried to sort out my feelings about my visionary SRS app design, which I hadn’t updated despite a year of fresh experience. Was it just self-doubt?

The fact that I could only code at a minimal hobbyist level didn’t feel like the biggest hurdle. I think I could have picked up enough skill in that area. But even with a magical ability to translate my vision into code, I would have been up against a daunting base rate of failure for education startups. Also, I didn’t consider myself a very typical teacher: What sounded brilliant and intuitive to me would probably seem pointless and nonsensical to 95% of my peers.

Still, I pulled out my Eye of Agamotto and checked out all of the futures where I developed the app. In almost all of these, nothing came of it. But in the few where my app saw high adoption, the result was… dystopia! Students turned against their teachers, and teachers against their students. Homework stretched to eternity. Millions of children cursed my name. The ‘me’ in these futures wore an ignominious goatee and a haunted stare.

Used judiciously for the right concepts, in the right courses, by the right teachers, I still think my imagined app could be a powerful tool. But I don’t see any way to keep it from being abused. Well-intentioned teachers would put too much into it and demand too much from students. Any safeguards I put in to prevent this would just invite my app to be outcompeted by an imitator who removed these safeguards (which would seem arbitrary and restricting to most users).

I’m convinced of this because the me who wrote the original “A Year of Spaced Repetition…” post would have abused it. Let’s see… He was averaging seven new cards a day? (That’s 2-3 times what I would recommend today.) He uncapped the 20 new card/day limit? He knew even then that he was adding too many cards, but failed to cut back the following year? I’m not encouraged.

“But wait,” you say. “You didn’t think you were a typical teacher. Maybe a typical teacher could be trusted?”

No.

In defense of forgetting

The “problem” is that teachers instinctively introduce far more content than students can be expected to remember. This was obvious to me when I was averaging seven new cards a day, which still felt like a brutal triage of my total content.

Covering more material than can be retained isn’t bad teaching, though. In fact, it’s a good and necessary practice. Content — the more the merrier — is the training data the brain uses to form and refine mental models of the universe.[1] These models tend to be long-lived, and allow the brain to re-learn the content more deeply and efficiently if it ever comes up again. They also allow it to absorb new-but-conceptually-adjacent content more readily. In cognition, as in nutrition, you are what you eat — and good digestion naturally produces solid waste. The original training data is subject to lossy compression, with only a few random fragments left whole and unforgotten. (Tippecanoe, and Tyler Too! The mitochondria is the powerhouse of the cell!) Such recollections are corn kernels bobbing top-side up in a turd floating down the river Lethe.

This is normal and fine. Regular, even.

But the educational establishment doesn’t see it that way. The teacher I was seven years ago didn’t see it that way. And I now realize that the teacher I was five and six years ago had queasy feelings because he was starting to see it that way. Following my gut, without fully understanding or even entirely registering what I was doing, I slowly turned around and started walking the other way, abandoning my app design and the unfinished “Third Year” report.

The orthodox view equates forgetting with failure. It’s not “Everybody poops”. It’s “Poop is inadequate. How can we get more corn, less poop?” This belief is implicit whenever someone laments the “summer slide” , or opines that students missing school during the Covid pandemic are “losing” months of learning — as if kids are spinning their progress meters backwards, just pooping away without anyone trying to stop them. Under this view, we keep kids in school partly to stop the leaks, and partly to stuff them with new knowledge faster than they can expunge old knowledge.

If this is how you see education, SRS is a tool to keep students from pooping. It offers the tantalizing possibility of learning without forgetting. Two steps forward, no steps back. Why wouldn’t you push it as hard as possible?

Don’t get me wrong. All else being equal, learning without forgetting would be great. But the most important effects of learning — lasting changes to our mental machinery — happen whether or not we forget the content. Once the lesson is over, dear teacher, your best shot at lasting growth has already left the harbor. So why are you still trying to hold back the tide? Why are you planning to punish your students for pooping on Tuesday, the day before your test, instead of Thursday, the day after it?[2]

In defense of remembering

This is not a “How I Learned to Stop Worrying and Love Forgetting” essay. I don’t love forgetting. I will be the first to argue the merits of not forgetting right away. The longer we can keep ideas floating around in our heads, the greater their “cross-section”, as I put it in 2016, with more opportunities to make associative connections that cause useful long-lived updates to our mental models.

Unfortunately, I have not found SRS to be great at fostering the sorts of reflective mental states conducive to insight, except when studying on my own at a deliberately slow pace, as while on a walk. In such a use case, SRS no longer has quite the time-efficiency advantage that is its main selling point. The opportunity cost of using it goes up. In a whole-class SRS session, long reflective pauses between cards would invite frustration and misbehavior, and we wouldn’t get through very many cards.

In defense of remembering, I will also argue that some skills are simply impossible without a continuous retention of specific dependencies. These skills tend to be technical. Heck, this might be the definition of a technical skill.

With a few mostly upper-level exceptions, though — math, physics, chemistry — most of what we teach in school is more conceptual than technical. We make you take history so you have a better model of how civilizations and governments work, not so you remember who shot Alexander Hamilton. We make you take English to improve your word-based input and output abilities, not so you remember the difference between simile and metaphor. At least, I hope we do.[3]

Besides, even in the technical classes, forgetting is the near-universal outcome, and the long-term benefits are mostly conceptual — for if you don’t use these skills continuously for the rest of your life, you’re almost certainly going to lose them. Maybe more than once.

I’ve forgotten algebra twice. I’ve forgotten how to write code at least three times. I can’t do either one at the moment. But I’m still changed by having known them. I have an intuition for what sorts of problems ought to be mathematically solvable. I can think in terms of algorithms. And I could relearn either skill more easily than on the first or second occasions. Also, relearning has an anecdotal tendency to deepen understanding in a way that continuous retention may not, especially when approached from a different direction.

Still, as long as I’m defending retention, I think it’s valid to ask whether we should force kids (and often, by extension, their parents) to relearn math every frickin’ year. Consider: The conventional wisdom is that technical companies begrudgingly expect to have to (re)train most new workers in the very specific areas they need. They look to your resume and transcripts mostly for evidence that you have learned technical skills before and can presumably learn them again. I don’t think they care if you’ve re-learned them three times already instead of six. So, if we’re going to force kids to demonstrate intermediate math chops to graduate (a dubious demand), perhaps we could at least wait until the last practical moment, and then do it in bigger continuous lumps — like two-hour daily block classes starting in grade 9 or 10 — so they would have fewer opportunities to forget as they climb the dependency pyramid. Think of the tears we could save (or at least postpone).

The value proposition of classroom SRS

Anyway, classroom SRS has its strengths, but midwifing conceptual insights doesn’t feel like one of them. I think it’s also reasonable to assume that students forget almost everything from a classroom SRS deck as soon as they stop using it.

Adjusting for these two assumptions, the terrain where classroom SRS can beat out its opportunity costs dramatically shrinks. But I believe it still exists, at the intersection of high automaticity targets and medium-term objectives.

With high automaticity targets, what you’re trying to train is a reflexive response to a stimulus that is going to look a lot like the study card. Foreign language vocabulary is my poster child for this. You’re not drilling the words to unearth insights. You’re drilling for speed, so that students can keep up when a word pops up in a real-time conversation.

You’re also trying to drill away the need for conscious awareness. You want that front-side combination of sounds or letters to cause the back-side set of sounds or letters to pop automatically into their heads. This is my intent when I drill my English students in word fragments (prefixes, roots, suffixes), which are really just bits of foreign language (Greek, Latin). If it’s not automatic, then they’ll gloss right over the possible meaning of “salubrious”, even though they have learned that “salu” usually means “health”.

By medium-term objective, I mean “I want my students to have automatic fluency with the content of these cards on Day X”, where X is a date between one week and three months in the future. It shouldn’t be sooner than that, in accordance with Gwern’s “5 and 5” rule: You probably need at least five days to get any real advantage from SRS. And it shouldn’t be later than a few months, for two reasons: First, we’re assuming the students will forget it all once they stop studying, which is all but guaranteed after the end of the course; there’s little point in keeping those cards in rotation after Day X. Second, I probably don’t want to start those cards until the last practical minute, which is unlikely to be more than three months ahead of time.

Why three months and not six? It’s not a hard-and-fast rule, but from the experience of my first three years of classroom SRS, if you’re trying to retain things for more than a few months, the total number of cards is likely to become greater than you can productively study every day, and many cards will languish unseen. Plus, your roster can change, especially over a semester break. The set of students you have in six months might only have 70% overlap with the set you have now. Really, you should wait until the last practical minute.
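(For the programmers in the room: if you wanted to encode that sweet spot as a filter, it might look like the following throwaway sketch. The boundaries are my paraphrase of the rules above, and the function name is made up.)

```python
from datetime import date, timedelta

def worth_drilling(day_x: date, today: date = None) -> bool:
    """Is Day X inside the classroom-SRS sweet spot described above?

    Lower bound: roughly a week out (per Gwern's "5 and 5" rule, five
    days is the bare minimum). Upper bound: about three months out.
    """
    today = today or date.today()
    lead_time = day_x - today
    return timedelta(days=7) <= lead_time <= timedelta(days=90)

# Example: a state test ten weeks out qualifies; tomorrow's quiz doesn't.
print(worth_drilling(date.today() + timedelta(weeks=10)))  # True
print(worth_drilling(date.today() + timedelta(days=1)))    # False
```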

But what constitutes a worthy “Day X”? It might be a test. But if it’s your test, you may not have been listening. Your test may just be arbitrarily punishing some kids for forgetting a little sooner than others. However, if it’s an external test, with high stakes for you and your students, then it could be a worthy Day X indeed. For me, Day X is the day of the big state test — the one used to compare students to students, teachers to teachers, and schools to schools.

When your students do well on an external test, though, please keep a healthy perspective. A high test score doesn’t mean they can do the hard things now and forever. It means they were able to earn a high test score on Day X. They will forget almost all of it afterwards. But you will have given them their best chance to signal to others that they can learn hard things, and that you can teach them hard things, and that your school has teachers who can teach hard things.

Day X doesn’t have to be a test. If you’re optimizing for brain change that persists after they forget all of your content, Day X could be an immersive event. Maybe your Spanish class is going to Madrid. You know they will have a deeper experience if you can bring their vocabulary to a peak of richness and automaticity on the eve of departure. Yes, they’ll still forget almost all the words later. But they might retain a glimpse of how the world looked when seen through another language.

Maybe your event is smaller. A virtual trip. An in-class conversation day where we pretend we’re at the beach (“¡En la Playa!”). Maybe their long-term takeaway will be an appreciation for how different languages use different grammars, which is not something most people even consider until they’ve studied a second language. Get their mental gears turning hard enough, and they might even see grammar as an arbitrary construct with tunable parameters and tradeoffs that influence what can be communicated easily. Maybe they’ll independently rediscover the Sapir-Whorf Hypothesis. But they’re not going to remember how to say ‘sand’. Nope. ‘Shark’, maybe (¡Tiburón!). But you can’t predict this, and it’s probably not worth the effort to try.

But maybe you’re not teaching a foreign language. No matter your subject, Day X could be any conceptually demanding lesson or unit that is difficult to even talk about without fluency in a given set of terms. These aren’t very common in 10th Grade English, though they come up more often in my Creative Writing class. In these cases, however, the dependent terms are conceptually rich enough that they don’t lend themselves very well to cards, and I find it’s better to just quickly re-teach them in front of the lessons that use them. “Remember how we said...”[4]

How I currently use classroom SRS

As you may have guessed, I’ve radically scaled back my usage of classroom SRS since those first three years. In fact, for the last four years, I’ve only used it during a two-to-three month span leading up to the state test. And for the last two of those years, I’ve only used it for word fragments. I’m very unlikely to abandon its use for word fragments, though, because the most important thing I teach my students by using SRS is the existence of SRS. Word fragments are my favorite way to demonstrate how efficient study time can be. I add no more than about ten cards per week, which means that most days’ study takes less than two minutes. (This is good, because my own enthusiasm now begins to flag by the two minute mark.) I give very short quizzes on the fragments so they can do well on them and see how a little study can have a big payoff. (Remember that most of my students don’t ever study on their own.)

I’m still using Anki, with different profiles for each class. I run the review in a call-and-response style, where I show and say the card, and they know to simply shout out the answer. On a good day, it becomes a kind of chant. The number, speed, accuracy, and confidence of the responding voices tells me which button to press, and there’s usually a bellwether student I can listen for as I make my decision. Because I’m striving for very high automaticity, I almost always press either 2 (the shortest affirmative next-study delay) or 1 (the negative start-it-from-scratch button).

My students mostly like the call-and-response flow, as archaic as it sounds, and I will refer you to an older footnote about that time I observed a traditional one-room Mennonite schoolhouse:

I once had the privilege of observing part of a lesson in a traditional Mennonite one-room schoolhouse. I don’t speak a word of Low German, but it was clear the kids knew whatever it was they were drilling as they stood up and recited together. Most striking was the fact that they were all on the same page. There were no stragglers spacing out, slumped over, dozing off. The teacher could confidently build up to whatever came next without fear of leaving anyone behind.

For at least a minute or two every day, even worldly American kids can enjoy the routine. As I put it elsewhere in that Second Year report, “They enjoy the validation they get with each chance to confirm that they remember something. They enjoy going with the flow of a whole class doing the same thing. They enjoy the respite of learning on rails for a change, without any expectation that they take initiative or parse instructions.”

It probably goes without saying, but this call-and-response format only works well with cards with a very short answer that can be recalled very quickly. This is why I now only use SRS for word fragments. If I taught a foreign language, or even a lower-grade reading class with more basic vocab words, I would be using it more. My wife taught high school Spanish for a number of years, experimented with SRS, and is on the record as saying Duolingo deserves to eat the world. Anyone she could get to use it independently didn’t really need her class to do well on the final assessments.

After the state test, my students will forget almost all of their word fragments. That is the way of things. Ashes to ashes, circle of life, or, to get back to my controlling analogy, “All drains lead to the ocean, kid.” What I’m hoping will remain is an updated appreciation for what a little regular study can do, and a vague recollection that there are these apps out there that are, you know, like smart flash cards, that make it fast to memorize stuff.

Against apathy, toward apprenticeship

I’m nearing the end of Part 1, which means I’m nearing the end of my labors on this post, since Part 2 was mostly written five years ago. As writing projects go, I have found this one extraordinarily difficult. Over the course of its creation, I have pooped five times. It wants to be a book (or at least a blog), as everything I say tries to come out as a chapter of explanation having little to do with SRS.[5]

Well, I’m now going to indulge in several paragraphs where I don’t tie it back to SRS, so I can tell you the story of how I reinvented myself after my third year of spaced repetition software in the classroom. This included moving to a new school where I would have greater freedom to pursue my evolving views about learning. For what it’s worth, this story at least starts with SRS.

You see, it was during those dangerously long classroom Anki sessions six and seven years ago that I honed my sensitivity to students’ moods, to my own mood, and to how these feed off of each other. Sustaining a session without losing the room was like magnetically confining hot deuterium plasma — dicey, volatile, but occasionally, mysteriously, over unity.[6] I came to view anti-apathetic moods as a kind of energy that can be harnessed to do work and to create new energy.

Apathy, you may recall, is the true enemy. I’ve always known that. I called her out five years ago[7], but soon came to realize I had been fighting her on the wrong front.

I had been preoccupied by the fact that students who don’t care won’t activate enough of their brain to get any benefits from our daily review. To be fair, that is a problem, if I’m trying to prime them for success at a Day X event. But the more insidious issue is that a student in the thrall of Apathy won’t be churning their mental gears on any of the content I may have tricked them into learning, which means they’ll just forget it all without having made any lasting changes to their models. That’s not just an Anki-time problem. That’s an all-the-time problem. If they don’t engage with anything, they don’t keep anything.

I set off on a holy quest for anti-apathetic energy.

My errantry led me, for a time, to study stand-up comedy, not just because humor creates energy, but because a big part of that craft is an acting trick where you deliver incredibly polished lines in a way that sounds like you’re coming up with them right there in that moment.[8] Perceived spontaneity is a powerful source of energy even more versatile than humor.

I don’t know if I learned much about scripted spontaneity that I could articulate, but I felt like some of it rubbed off on me just by watching the experts closely over extended periods. And you know what? A lecture isn’t so different from a bit. A lesson isn’t so different from a set. A single changed word, a half-second delay, a subtle shift in facial expression can completely change the way the moment feels to the audience, er, class. And like a comedian workshopping new material on the road, I could use the fact that I might teach the same lesson five times in one day to test variations, trying to provoke more engagement, better questions, bigger laughs.

Equally important: I recognized that the process of refining the performance art was fun for me, and that my own engagement was the most powerful source of classroom energy. I could transmit it to my students, and maybe even get some energy back from them while I directed some of it into activity that would get their mental gears turning. Instead of burning out, I could burn brighter, and longer. On a good day, it became self-sustaining. On a great day, it could go supercritical, sending me home after my last class with my head spinning in a buzz of positive vibes and deep thoughts.[9]

During this same era, as part of my ongoing study of creative writing, I was binge-listening to interviews with television writers. One pattern that struck me was that it wasn’t too uncommon for someone to just kind of find themselves working in that highly rarefied field simply because they had spent a lot of time around others who were already doing it. Without any organized instruction, they picked up on how it worked.

Did you catch it? That was twice that I had noticed how arcane expertise can rub off on people through prolonged proximity. That got me thinking about the German Apprenticeship Model, and its medieval — nay, prehistoric — roots. It’s how we used to learn everything, right? We followed mama out to the berry bushes, and papa out to the hunting grounds. The fact that it seemed to work for television writers told me that apprenticeship wasn’t just for blue collar skills.

So, with the longer leash I enjoyed under my new bosses, I decided to move my instructional style closer to something resembling an apprenticeship, one where I would show groups of 20-30 padawans the ways of my arcane expertise.

Yeah, I jumped on a trendy meme. Note my careful word choice: ‘show’, not ‘tell’. This, to me, is the defining action in mentor-apprentice relationships.

By switching schools, I lost my interactive whiteboard. So I replaced it with something even better: an extra computer on a makeshift stand-up desk (a narrow kitchen prep cart with fold-out boards). A cheap second-hand monitor could face me while I mirrored that screen to the projector. Now I could do what I had seen coders do at instructional meet-ups: face the class while typing.

This meant I could show students what I do as a writer in real time, thinking out loud and watching their reactions as I typed. This could easily bore them, of course, but with strong energy-fu, old-school touch typing speed, and face-to-face interaction, I can pull it off more often than you might expect. On a good day, they find it fascinating. On one very special occasion each year, I do it for the full period, writing a 400+ word essay from scratch in 40 minutes with no prior knowledge of the prompt. Students have to hold their questions that day, and instead take observation notes, which become fodder for an extended debriefing discussion the next day.

The most important thing I’ve learned from those debriefings is that everyone can pick up something from a holistic demonstration like that, regardless of their skill level.[10] An advanced student might ask about my bracket substitution of a pronoun in a quote. An average student might say, “You used a lot of small and medium-sized body paragraphs instead of three big ones.” A sub-level student might say, “You didn’t like it if you used the same word too soon after you used it before.” And I always seem to get at least one surprising question about something I never would have thought to teach them, like, “How did you suck words into your cursor?” Then I’m like, “Oh, let me show you the difference between the Backspace and Delete keys…”

Did I make them memorize anything with that “lesson”? Nah. Did they make lasting updates to their mental models? Probably! Are you thinking of asking me, “But how do you test them on it?” Because if you are, then you really haven’t been paying attention!

There’s plenty more to be said about apprenticeship, but I think you get the idea, and this is still nominally an essay about classroom SRS.

If I had to summarize my self-reinvention in too many words, I would say that I’m now optimizing for “good days” at the high-energy intersection of “engaging for me”, “engaging for them”, and “conducive to lasting and worthwhile updates to their mental models”, with less regard for curricular scope and sequence.

In practice, this means… well, a lot of things. But it’s time I pinch off Part 1. That, “or get off the pot,” as they say.

Part 2: A Third Year of Spaced Repetition Software in the Classroom (2017)

[In this excavated report, text in brackets is commentary I’m adding in 2021. Anything outside the brackets is direct from my 2017 draft, or constructed from my notes to fit the perspective I had at the time.]

Synopsis and disclosure

I tried the obvious thing this year. Instead of game show-style whole-class front-of-the-room Anki, I arranged for every student to be able to independently study material I created in Cerego, both in and out of class.

[Disclosure: Cerego provided me a free license for the year in exchange for some detailed feedback, which I gave them. This feedback was mostly about user interface issues and reports, the latter of which required some ugly scripting on my end to get numbers I found useful. As the Cerego team seemed to be rapidly iterating, I imagine they have made many changes and improvements to their app since 2017, though I have not used it since. Please keep this in mind as you read these years-old notes.]

Despite many small hang-ups, I was pleased with Cerego’s features and reliability. In exchange for a great deal of up-front effort, it gave me a unique window into student engagement and progress. Consequently, it proved to be an overwhelmingly potent tool for winning “the blame game”, although I eventually came to feel uneasy about using this power.

Longer-term learning outcomes seemed, on average, to be slightly worse than with the whole-class Anki method. While highly motivated students benefited from being able to study more aggressively and efficiently than before—and their objective scores were higher than ever—their learning seemed less transferable to more authentic contexts. Students of lower motivation, while seeming to get little from either approach, got even less from this digital 1:1 method, and their slump accounts for the overall decline.

Setup

I taught a mix of regular (not honors) 9th and 10th grade English classes again, but over the summer of 2016 I was invited to move my classroom into an unusually spacious converted computer lab in which 16 older desktop PCs were kindly left at my request. I had these arranged facing the sides of the room so I could see all screens easily. I allocated those PC seats on a semi-permanent basis as needed and requested. The balance of students sat at normal desks and used their phones for study.

This came with challenges. School WiFi was officially off-limits to students (though many always had the password anyway), and many students said they were at the whim of data caps they regularly pushed up against. Their phones, in most cases, were a generation or three behind state-of-the-art, with degraded batteries and exhausted storage capacity. A few students had difficulty even making room for the Cerego app that first week.

While our setup was marginal, between the PCs and phones, we only rarely ran into a situation where not everyone could be studying at the same time.

On the software side, it must be said that, for all its features, Cerego wasn’t designed for my specific use case. The company’s featured customers are businesses and colleges, who use the product as part of packaged training programs and distance learning courses. Importantly, the app favors adding content into the learner’s study rotation in blocks, on the learner’s own schedule, rather than having a teacher create it on the fly and trickle it in immediately. It was also not designed to give a teacher “panopticon”-style real-time monitoring, nor to thwart adversarial users who want to look studious without studying.

Procedure

Before the start of each school day, I would consider the previous day’s lesson content and add to the relevant Cerego study sets as appropriate. This process could be lumpy and not necessarily daily; some lessons invited a great deal of suitable content, and others none at all. Content additions were also far more common first semester than second semester, as I intentionally front-loaded material to maximize the time we would have to reinforce and apply it. During an average week where I added cards, we probably averaged about 50 additions. [ ! ]

With a prominent timer at the front of the room, I allocated 10-12 minutes at the start of every 57-minute class period as specially designated “Cerego Time”. During Cerego Time, I would periodically patrol the room to ensure students were on task and to provide support.

Students were allowed to read a pleasure-reading book during this time instead, if they chose. This allowance was most obviously meant for anyone with extra time after catching up with their study, but I wasn’t about to interfere with any teenager reading a book of their own volition. Not all regular readers (2-5 per class) were conscientious Cerego-ers.

Students were strongly encouraged to also use Cerego outside of class whenever the app recommended, if they wanted maximal retention for minimal time spent.

About once a week, usually without warning, I would give a ten-question multiple-choice quiz in Canvas that could include questions taken directly from any content that had been in Cerego for at least a week, no matter how old. Before I put the grade into my book, I would add a 10% adjustment (not to exceed 100%), respecting the wisdom that aggressive study sees diminishing returns as one approaches a goal of 100% retention on large bodies of knowledge. My students were aware of this free 10% and my reasoning behind it.

To account for students just joining my class at the start of second semester, and for those who inevitably studied nothing for the seventeen calendar days between semesters — and even for those simply desperate for a fresh start — I had a lengthy grace period of sorts in January and February. Older stuff was temporarily not included in the “quizzable” question pool. I posted dates for when I would consider each old set fair game again; every week or two, a set would find itself back in the pool according to this schedule, and stay there for the rest of the year.

I did not use Cerego stats directly for any kind of grade, instead using my Canvas quizzes for this. My reasons:

  • I wasn’t sure every student would consistently be able to use the app, and didn’t want to deal with the push-back from students and parents claiming (honestly or otherwise) insurmountable tech obstacles to using Cerego outside of class.

  • Due to limitations in Cerego’s reporting, I wasn’t sure how to regularly compute a fair grade based on Cerego stats.

  • I wasn’t sure how far I would be able to trust that a student’s stats weren’t being run up by a smarter friend using the app on their behalf.

  • I didn’t want to discourage students from using Cerego Time to instead read their pleasure books (a habit of immense, scientifically-backed value that I do everything I can to promote).

  • I didn’t want to give the impression that Cerego is necessarily the best or only way to study, but instead to make it clear that knowing the content was their responsibility, however they chose to do it; my providing them with Cerego cards and time to study them was simply a function of my being a Really Nice Guy.

Points of friction

This section is not a critique of Cerego specifically, but rather a reminder that classroom technology is not inherently good. The mythical 1:1 student tech ratio doesn’t suddenly make impossible dreams reality, and in fact comes with ongoing costs that must be weighed against the benefits. Here were some points of friction I encountered:

  • Forgotten login information for the school PCs or Cerego.

  • Slow startup, login, and load times on outdated equipment. [Fun fact: I’ve found that as my current school cuts down on the need for different logins through Clever, it creates a separate problem of longer and more fragile authentication chains — handshaking from one site to another — that can fail on slow machines or under spotty WiFi.]

  • Old or abused keyboards and mice that intermittently fail.

  • The occasional bigger problem, like a blown power supply.

  • For phone users: discharged, confiscated, lost, or broken devices.

  • Distractions and inappropriate behaviors that wouldn’t be possible if students didn’t have their own screen to command.

All of the above adds up to a kind of tax on your time and energy, even when you have enough respect from your students to minimize deliberate abuse. (I had maybe 2-3 bad eggs during the year committing occasional acts of minor sabotage.) Moreover, every possible point of friction becomes amplified by a student who doesn’t feel like reaching the objective, like a child who finds an hour’s worth of yak shaving to do whenever bedtime rolls around.

Problems with multiple-choice study cards

Unlike Anki and other personal-use SRS, where the user self-assesses performance and collaborates with the app to schedule the next review, apps like Cerego are built to measure retention objectively. This changes how study cards have to be constructed. Although options [even in 2017] are varied, the most practical and straightforward method is usually a “front” side card with a question or term and a “back” side of multiple-choice responses.

Some problems with multiple-choice format:

  • Responding to a multiple-choice question (or any kind of question) takes more time than pressing a self-assessment button.

  • In general, it’s more work to create study cards that can be assessed by the app. This is true even in the ideal case, which for Cerego is when you can assign a set of cards where the correct answer in one card can automatically become a multiple-choice distractor (wrong answer) for other cards in the set (see the sketch after this list). But many cases are not ideal, and the only plausible distractors will be ones you add manually.

  • Students can get confused when distractors contaminate tenuous mental associations. This is a well-studied effect with testing in general, and I had one student (motivated, but lower IQ) who I feel was positively ruined by it this year.

  • Students mostly don’t try to recall the answer before looking at multiple-choice options, instead defaulting to the following heuristic: “Look for an answer that feels right—if none do, press ‘None of the above’”. This is a problem, because the act of trying to recall the specific thing is known to be the critical step that reinforces the memory; in contrast, merely recognizing familiar facts (as when “going over notes”) is known to give students false confidence.
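To make the “ideal case” from the second bullet concrete, here is a minimal sketch of auto-generated distractors, where every other card’s correct answer in the set doubles as a wrong answer. This is my own illustration of the concept, not Cerego’s actual code; all names are hypothetical.

```python
import random

def build_mc_card(prompt, card_set, n_choices=4):
    """Build one multiple-choice card from a set of prompt -> answer pairs.

    In the ideal case, sibling cards' correct answers double as plausible
    distractors, so no manual distractor-writing is needed.
    """
    answer = card_set[prompt]
    pool = [a for p, a in card_set.items() if p != prompt and a != answer]
    distractors = random.sample(pool, min(n_choices - 1, len(pool)))
    choices = distractors + [answer]
    random.shuffle(choices)
    return {"prompt": prompt, "choices": choices, "answer": answer}

# Example with word fragments, the card type I used most:
fragments = {"salu-": "health", "bene-": "good", "mal-": "bad", "chron-": "time"}
print(build_mc_card("salu-", fragments))
```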

I gave my Cerego contacts some ideas I had for minimizing some of the downsides of multiple-choice. For example: because my students were largely deaf to my pleading that the “front” card screen — the one containing only the question — is where the learning actually happens, there could be a mandatory (or at least opt-out default) short delay on that screen, especially when the app detects inhumanly rapid clicking.

Cerego actually asks “Do you know this?” on that screen, giving them a chance to self-assess in the negative without going to the multiple choices, but the vast majority of students never saw this screen as anything but a speed bump to click through.

My thought was that Cerego could occasionally not show the multiple choice options right away when they click “I Know It”, but instead call their bluff, asking, “Oh? How confident are you?” and prompting them to select a confidence level on a slider bar before showing the choices. Not only might this end the bad habit, it could also provide an opportunity to help them with their credence calibration, a useful skill that might make them better thinkers and learners. I also suggested Cerego might be able to use this data to learn more about a learner and better judge their mastery level through sexy Bayesian wizardry.

[My aborted app design would have taken that concept to its logical conclusion: letting trusted users fully self-assess most of the time, but occasionally performing “reality checks” where it made the user respond in a way it could verify. It could then use straightforward Bayesian updates from these checks to decide how often to do them for each user.]
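[For the curious, here is a minimal sketch of the kind of “reality check” bookkeeping I had in mind, as a Beta-Bernoulli update on each user’s trustworthiness. This is a 2021 illustration of an idea from an app that never existed; every name in it is hypothetical.]

```python
import random

class SelfAssessmentTrust:
    """Track how often a user's self-assessments survive verification.

    Each reality check is treated as a Bernoulli trial, and a Beta prior
    over the user's "honesty rate" is updated after each one.
    """

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha = alpha  # pseudo-count of passed checks
        self.beta = beta    # pseudo-count of failed checks

    def record_check(self, passed: bool) -> None:
        if passed:
            self.alpha += 1
        else:
            self.beta += 1

    def trust(self) -> float:
        # Posterior mean of the honesty rate.
        return self.alpha / (self.alpha + self.beta)

    def should_verify(self, floor=0.05, ceiling=0.5) -> bool:
        # Check more often when trust is low, but always spot-check a little.
        p_check = min(ceiling, max(floor, 1.0 - self.trust()))
        return random.random() < p_check
```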

New failure modes

New format, new failure modes:

  • Performative clicking. I would commonly have students who didn’t want the discomfort of getting called to task, but also didn’t want to actually do the task, so they would put up a show of productivity, continually clicking random answers over and over again without reading. Others would loiter in the stats screens, play with the cursor, check their grades… anything that wouldn’t require actual thinking.

  • Exploits. Some students realized that mindless clicking moved Cerego’s progress bar on their study session forward. In some cases, it even raised their score. One enterprising young man demonstrated this for me, proudly resting a textbook over the Enter key, then kicking back as he “studied” his sets in record time. It was hard to be mad at him, as I could see myself doing the same at his age. Indeed, I was impressed. But he was in no way discouraged by my reminder that I didn’t use Cerego reports for grades, and that his trick wouldn’t leave him any better prepared for the quizzes that counted. (His mind was a steel trap, though; he did just fine.)

  • Hunkering. Cerego is set up such that students don’t have new cards added to their rotations until they make an active choice to press a button that does this. Thus, many students would endlessly study only the first twenty cards from the start of the year, never pushing themselves with anything new. In their defense, one of my feedback notes to Cerego was that the UI [in 2017, remember] didn’t make it very clear that they had new material awaiting activation. But even after interventions where I walked them through the process, many of these fox-holed students would fail to activate newer cards on their own initiative.

  • Idleness and moping. Apathy often manifests as lethargy combined with half-hearted complaints, voiced only when confronted, that it’s “too hard” or that “I don’t understand it”. Even though neither of those complaints made much sense when studying limited subsets of word-definition vocab pairs (the most common card set), I still heard both of them regularly from the hibernating bears I dared to poke. (Metaphorically. Never touch students.)

    This was further evidence of something I already believed: that these complaints, in these contexts, are a means of disincentivizing teachers from bothering them, as opposed to cries for help. After all, if such a student stands by their claim of not understanding it, what is a responsible teacher supposed to do except to stand there and reteach them the whole thing, or schedule one-on-one tutoring, holding their hand with every “I don’t get it” until the work is done for them? If the student had really wanted to understand and do the work, they would have raised their hand as soon as they encountered difficulty instead of trying to be inconspicuous.

    [I’ve always been more sympathetic to apathetic students than I probably sound here. Public education demands more directed attention from teenagers than most of them can realistically muster for 35 hours a week.]

Dominating the blame game

Teachers are regularly asked by their bosses how they are “differentiating” instruction, adjusting lessons for students across a class’s range of skill levels, learning disabilities, and language deficiencies. They are also asked by parents what their children can do to improve their grade.

Cerego gave me a ready answer to both questions: “Well, in my class we use a free study app that I load with all of the terms, vocab and such that could be on my quizzes. It’s like smart flash cards that let you know when you need to study to avoid forgetting things. They adjust to give you more practice with the things you struggle with. Not only do I provide time to use it during class — even providing a computer if they need it — but it works on any internet device. Students can use it as often as they like to be as prepared as they want to be.” Nobody ever complained about this answer, and some were quite impressed with it — more than I was, to be honest.

I also had powerful ammunition in the all-too-common scenario where, at a meeting with all of the child’s teachers, a parent blames poor grades on the teachers’ not adjusting to their child’s very special needs, instead of on their child’s ridiculously obvious laziness.

We can’t, of course, just come out and call it like we see it. But we can show parents our data and let them connect the dots. So, in these cases, I would just repeat my “Well, in my class we use a free study app…” spiel, emphasizing the “as prepared as they want to be” part. I would then add, “According to the app, your child has spent [x] minutes studying over the last week, which is about [y]% of the time my average ‘A’ student spent in that same period, and, come to think of it,” I would say, scratching my head for effect, “far less than the time I provide in class for it.”

Cue evil gaze from parent to child, squirming discomfort from child, envious awe from my fellow teachers.

It’s true! Here is a snapshot of one type of output I collected from my report-processing scripts for one of my students. You’re looking at one block of a larger data sheet I brought to parent meetings and included in periodic emails sent home. This one was for a fairly average student who put in the minimum expected time but didn’t push themselves very hard. A slacker’s would be more brutal.

Like I said, absolute dominance.
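(In case you’re wondering what my report-processing scripts actually did: nothing fancy. Here is a minimal sketch of the general idea, assuming a hypothetical CSV export with per-session study minutes; the column names and math are my reconstruction, not Cerego’s actual format.)

```python
import csv
from collections import defaultdict
from datetime import datetime, timedelta

def weekly_study_report(csv_path, since=None):
    """Total each student's study minutes over the last week and compare
    them to the average 'A' student, i.e. the [x] and [y] in my spiel."""
    since = since or datetime.now() - timedelta(days=7)
    minutes = defaultdict(float)
    grades = {}
    with open(csv_path, newline="") as f:
        # Assumed columns: student, grade, timestamp, minutes
        for row in csv.DictReader(f):
            if datetime.fromisoformat(row["timestamp"]) >= since:
                minutes[row["student"]] += float(row["minutes"])
            grades[row["student"]] = row["grade"]
    a_times = [minutes[s] for s, g in grades.items() if g == "A"]
    a_avg = sum(a_times) / len(a_times) if a_times else float("nan")
    # Returns {student: (minutes studied, percent of average 'A' student)}
    return {s: (m, 100 * m / a_avg) for s, m in minutes.items()}
```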

But like a lot of games, beating the “blame game” just made me tired of playing it, and ready to move on to something else. The enemy is not the apathetic student. The enemy is Apathy herself. I want to teach the lazy student, not destroy them with my Orwellian gaze.

Results and discussion

Table

In the following table, n=129: the combined number of 9th and 10th grade students who finished second semester with me. The procedures were identical in both grades, and I didn’t find much reason to divide them, preferring the larger total sample. I then divided the combined sample into quintiles as shown:

The “Sem 2 Grade” is their course grade from just the second semester, but the other stats are all cumulative for the year. (No, I don’t have any state test data for this group, and I never will. Having switched employers, I am not privy to the results, which arrive in late summer or early fall.)

“Set Level” is Cerego’s signature rating of overall progress and retention, on a 4-point scale.

“% of Cards Started” is the fraction of the total cards I had prepared that the students had added into their rotations. (Remember that Cerego did not do this automatically). For 9th graders, there were 648 cards. For 10th grade, there were 749.

Study time analysis

As a sanity check, I crudely estimate that we had study time on 160 of our 180 school days, spending an average of 11 minutes each time. That would add up to 29.3 hours (160 × 11 = 1,760 minutes) of total in-class study time. That the actual averages are lower does not surprise me, due to a combination of absences, roster changes, and start-up times. What we can conclusively say is that there was not a massive amount of outside-of-class study going on.

Of course, not all of those logged study minutes were productive study time. It wasn’t always clear to me when Cerego counted a minute towards study vs. idle, or whether it detected idleness at all on the mobile app. Indeed, there were several cases where a student’s mobile app seemed to have logged continual study overnight, and even, in one case, for multiple continuous days. The above chart has not been adjusted for known or unknown anomalies of this kind.

Regardless, as you can see, while time spent studying was correlated with performance, there was barely a 25% difference in study time separating the top and bottom grade quintiles. Even this is less exciting than it looks, as the lowest scorers were also more likely to be absent, missing their in-class study time. I have made no effort to adjust for this.

One thing you can’t see in that chart is the high variance that existed within the top quintile. In this group, time spent studying varied from 33 hours to 12 — and 12 was the top student! Anecdotally, I perceived two distinct subgroups of high performers: highly motivated learners who had a natural disadvantage, like being a foreign exchange student speaking a second language, and high IQ avid reader types. The former put in far more hours than the latter. In fact, that second group put in less time than the average bottom quintile student.

Only a very small number of highly motivated students showed signs of studying over weekends and breaks.

SRS signal, or just conscientiousness?

While you can see a much stronger signal in the “Set Level” and “% of Cards Started” columns, it’s hard to know how much this is just measuring conscientiousness. Good students are going to do what they’re asked to do, and get the good grade no matter what, but this doesn’t mean that what they’re asked to do is always necessary to get the good grade — or that the grade reflects anything worthwhile in the first place.

People persons

At least a few of the students I could never get to study Cerego were very on-the-ball whenever we did any kind of verbal review.

[I’ve seen a lot of this pattern during the pandemic. Students who seemed like inert lumps online, with very low grades, have in many cases returned to the classroom and revealed themselves to be dynamic and invested. An engaging human at the front of the room really is the “value add” of in-person instruction. This is something I encourage my peers to keep in mind whenever deciding between autonomous work and teacher-student interaction.]

High automaticity in high achievers

When it came to automaticity, outlier results were more impressive than ever. The very small number of students at the overlap of highly motivated, highly intelligent, and highly competitive absolutely crushed it in the review game we regularly played at my interactive whiteboard, beating me on several occasions, which almost never happened previously.

Weak transference?

However, transference to other contexts was less evident. In my first report, I had remarked on anecdotal impressions of higher-quality discussion and essay responses from those who had embraced our Anki review, suggesting that they had truly enlarged their lexicon to be able to talk about more complex ideas. I saw less of that this year. I don’t know what that means. It could just be that this mix of students was less open with their thoughts. But I can also see how they may have seen the Cerego universe as distinct from the universe of essay and discussion. Whole-class Anki might be more resistant to this bifurcation by making us say the words out loud to each other, normalizing their use.

Drama benchmark analysis

To compare methodologies as directly as possible, for a third year running I handled my Drama unit the way I accidentally had during my first year of classroom SRS: some terms taught before the pre-test, most taught after the pre-test, an identical post-test much later, and no review of any of it except through the SRS.

The overall results in the Drama unit were slightly worse this year. This was surprising. This cohort started lower on the pre-test, which was consistent with my impression of them, but I predicted that we would at least match or exceed last year’s gains, as we had more room to improve. We did not. Retention of some reliable bellwether terms actually dropped prior to the post-test. In picking through individual scores, my impression was that whole-class Anki and independent in-class Cerego were statistically equivalent for motivated learners, but whole-class Anki won easily with less motivated learners. As always, there were plenty of truly unmotivated students who got nothing from either method.

I tried to tease this out even further. This was pretty unscientific, but I took the pre- and post-test scores of twenty students from last year, and aligned them individually to students from this year with similar pre-test scores and, in my view, similar work ethics. Highly motivated students starting very low may have done slightly better with Cerego than with Anki, but poorly motivated students starting low did somewhat better with Anki.

I’m sure a lot of this came down to how Cerego makes new card sets “opt-in”. Students of lower motivation were less likely to encounter the Drama terms in their study rotation at all!

Phone vs. Computer seemed to make a difference here, too. Stuck with a very visible PC, some low performers would occasionally have good days and get in a groove. The ones glued to their phones found anything to do except Cerego.

Conclusions (2017)

If I see students as being ultimately responsible for their own learning, independent Cerego is the fairer approach that will help students get what they “deserve”. If I see things in more pragmatic and utilitarian terms (as I do), the numbers favor the whole-class Anki approach. And yet...

If I were staying at that school, with my classroom computers, I would have tried to get the best of both worlds. It was my plan to use Cerego again — having already done most of the legwork — and try to make it friendlier, with more teacher interaction, supplementing with some whole-class Anki. I would have pushed Cerego’s developers to make some of my most wanted changes, and I would have pushed myself to cut back on the number of cards I used.

But it’s moot, now. I won’t have computers at my new school. And part of the reason I left was because I didn’t like the feel of the groove I was settling into.

Whole-class Anki review wins for simplicity and camaraderie. Cerego wins for surveillance and power. Which would you want to see stamping on a teenage face forever?

Trick question! It’s not nice to stamp on faces. I feel like I’ve been pushing SRS too far past the point of diminishing returns, and I don’t know why it has become an annual tradition for me to vow to cut back next year and then fail to do so. I should probably break that cycle. Apathy is the enemy, and she remains unbowed. I’ve been looking for a technological fix, but I think the solution is, at best, only partly technological.

[My notes here spiraled off into very technological solutions (sigh) to add to my dream SRS+ app, which I had already postponed again but still wasn’t ready to abandon. I suppose I can give myself a little credit for brainstorming features to encourage human interaction and conceptual connections. Eventually, my notes came back to some thoughts about what makes a class thrive, which I have translated into coherent sentences below.]

From a scalability standpoint, it’s nice that something like Cerego doesn’t depend on a teacher’s charm the way my whole-class Anki approach does. Teachers could do a lot worse than a standardized pack of quality Cerego sets that reinforce matching cookie-cutter lessons. But couldn’t teachers also do better? I think I could do better. Cerego and Canvas quizzes create distance between me and my students. But I want to bring us closer and dial up the enthusiasm.

I don’t think gamification is the answer. I’ve been noticing that the appeal of games is pretty niche, failing to capture many from the apathetic middle, and capturing those it does reach for the wrong reasons, with the wrong incentives.

So what would work?

In education research, it always looks like everything works at least a little bit. This is probably a combination of publication bias and the fact that teachers sometimes get excited to try something new. Excitement is infectious. This gets students more engaged, which then improves outcomes. My early success with classroom SRS — and subsequent disappointments — would certainly fit that pattern.

Maybe I should make a point of trying new things each year for the explicit purpose of exploiting the excitement factor? How would I explain that to my bosses? “Well, I deliberately diverged from the curriculum and from accepted best practice because I grew weary of them.”

[Yes, actually. My new bosses are great that way.]

Thesis, Antithesis, Synthesis (2021)

As a student of storytelling, I can’t help but find an arc to my fourteen years of teaching up to this point.

When I first started out, I didn’t know what I was doing but kept Apathy at bay through sheer passion. I worked harder than anyone. I couldn’t wait to try my stuff out, and students responded to all but my cringiest overtures.

When this inevitably exhausted me, I had a hard slump. Lessons that used to work fell flat. I still didn’t know what I was doing, and now lacked sufficient passion to brute force success. So I retreated into systems and structure, building word banks, prompt banks, quiz banks; rubrics, charts, and randomizers; running reports; slinging code. A suit of high-tech power armor to augment my feeble form. A different kind of brute force.

My systems gave me stability and staying power, and, eventually, the confidence to explore. My three years of heavy SRS experimentation were the culmination of this phase. I stretched. I grew. But I still felt plateaued and frustrated, perhaps having taken systems as far as they could go.

Apathy still mocked me from her emoji throne.

I step out of the armor and find I no longer need it. One by one, my systems clatter to the ground. I know who I am. I know where my power comes from. And I know my enemy.

She will lose, because she is overconfident. She won’t prepare, because she is indifferent. And she won’t hear my warning, because I issue it now in the one place I know she’ll never reach: the bottom of a 10,000 word essay.

I’m coming for you.