I don’t see an independent human baseline anywhere in there, and at least in the flapping example, I don’t think I would be able to pass that one either (aside from the straightforwardly incorrect ‘unique’ claim). Is there any reason to expect any human or LLM to be able to read Andy’s mind and guess exactly the right angle he had in mind for that flashcard? That’s a parsimonious explanation of the inverse scaling...
It’s Ozzie Kirkby too, not just Andy, but your point holds (3rd and 5th bullets):
G. Known Limitations
A few caveats apply to the dataset and the methods built on it.
Synthetic interpretations. The highlight_interpretation field is generated by a language model and may contain hallucinations or subtle reframings of the reader’s intent.
Source bias. Sources skew toward technical explainers, blog posts, and opinion pieces. The dataset contains no textbooks, limited narrative nonfiction, and no fiction. Results may not generalize to these content types.
Small annotator pool. Most prompts were labeled by the authors, with contributions from a small community of experienced memory system users. The taxonomy may reflect idiosyncratic preferences that would not replicate with a larger, more varied pool. The flip side is that the dataset captures a relatively coherent notion of prompt quality, which can be useful for taste induction or personalized model alignment, where a consistent evaluative perspective is more valuable than annotator diversity.
T2 underrepresentation. Only 8.8% of prompts are rated T2 (see distribution table in Section A).
No longitudinal validation. Tier assignments were made at labeling time and never validated against the long-horizon reviews these prompts are designed to support. What stands in for that validation is the labelers’ own experience: each annotator had been running a personal SRS practice for several years, and a T3 rating reflects their compressed taste for what survives months of review (see the report for more on why we treat that taste as a compressed signal).
You happen to be the person I actually learned about SRS from (this article), so I can’t help but ask how you think about taste in creating good flashcards and whether it differs from Andy/Ozzie. Maybe your “what to add” section?
I find one of the best uses for Mnemosyne is, besides the classic use of memorizing academic material such as geography or the periodic table or foreign vocabulary or Bible/Koran verses or the avalanche of medical school facts, to add in words from A Word A Day and Wiktionary, memorable quotes I see, personal information such as birthdays (or license plates, a problem for me before), and so on. Quotidian uses, but all valuable to me. With a diversity of flashcards, I find my daily review interesting. I get all sorts of questions: now I’m trying to see whether a Haskell fragment is syntactically correct, now I’m pronouncing Korean hangul and listening to the answer, now I’m trying to find the Ukraine on a map, now I’m enjoying some A.E. Housman poetry, followed by a few quotes from LessWrong quote threads, and so on. Other people use it for many other things; one application that impresses me for its simple utility is memorizing names & faces of students, although learning musical notes is also not bad.
Yeah there isn’t any such baseline.