Frontier models seem to be getting slightly worse over time at converting casual reading highlights into memory flashcards that survive long-term review (on the order of a year), apparently because they lack the taste that comes from extensive SRS review experience, per Andy Matuschak at Memory Machines (full report). Opus 4.7, for instance, is worse than Sonnet 3.7:
What taste borne of extensive SRS review looks like:
What makes a good memory flashcard for a spaced repetition system, as opposed to ordinary flashcards?
A good memory prompt lives in a narrow band. It must be concise enough to read quickly, but detailed enough to cue the same memory months later—yet not so detailed that the question gives away the answer. Attempting to proceduralize and describe the process of writing good memory prompts is challenging since so much of the knowledge comes from lived experience. You learn what works by experiencing what fails. A prompt often seems fine initially, but weeks later forgetting exposes its weaknesses. Forgetting is the feedback which shapes taste. …
Effective memory prompts satisfy two criteria simultaneously:
Targeting: whether the prompt captures what the user actually wants to remember.
Construction: whether the prompt will reliably cue the same memory after long gaps, without significant loss in detail.
A prompt can fail on either axis. Targeting failures are usually obvious. You read the prompt and immediately recognize that it’s about the wrong thing, or about something you don’t care to retain. Construction failures are harder to see. They often surface during review, when ambiguity, underspecification, and excess abstraction can cause friction and forgetting.
These two failure modes differ not just in kind, but in cost. Targeting failures are relatively cheap: they’re rejected at a glance. Construction failures are expensive. They look plausible, so you read them carefully, attempt an answer, and only later discover that the prompt doesn’t support stable recall. Repeated over time, these prompts erode trust in the system.
That last point seems related to how non-finetuned LLM writing tends to oddly “slide off” my eyes despite being mildly pleasant to read.
From the targeting/construction 2x2, Matuschak and Ozzie Kirkby arrive at this 4-tier taxonomy:
The taste doesn’t transfer to models, even after Matuschak & Kirkby described it and demonstrated examples:
We find that models reliably reject T0 prompts. For both language models and human reviewers, off-target prompts are cheap failures: you read them and immediately recognize them as poor fits. The T1/T2 boundary is different. For humans, distinguishing those tiers takes careful reading and judgment formed through thousands of reviews—the accumulated sense of which prompts drift and which hold. Across all our experiments, models failed to reliably distinguish those tiers.
No model exceeded 70% accuracy in binary classification, and T1 performance varied dramatically. Using rubrics, models caught missing context (F1 = 0.87) but struggled to discern when a prompt would elicit multiple valid answers (F1 = 0.32–0.50). Even when presented with the best prompt beside three weaker alternatives, models still selected the broken one about a third of the time. We can describe and demonstrate our taste, but the ability to discern it doesn’t transfer.
The difference between T1 and T2 isn’t a grading quibble. T1 prompts quietly degrade the system: they waste attention, produce drift, and erode trust over time. They require careful attention to identify, so users can’t easily screen them out in advance. A system which produces T1 prompts a third of the time is not a system we want to use.
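To keep those numbers concrete: the F1 figures above are presumably standard binary F1 over per-prompt judgments (model label vs. human label for a given failure mode). A minimal sketch of that scoring, with made-up labels of my own:

```python
def binary_f1(gold: list[bool], pred: list[bool]) -> float:
    """Standard F1 over binary judgments (e.g. 'does this prompt admit multiple valid answers?')."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Hypothetical human labels vs. model labels for one construction-failure category.
gold = [True, True, False, True, False, False]
pred = [True, False, False, True, True, False]
print(round(binary_f1(gold, pred), 2))  # 0.67
```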
They tried all kinds of approaches to teach models this taste, without success.
Here are the frontier models ranked by performance on the full tiered taxonomy:
I checked a couple of their prompts[1]. The generation prompt is concerning: it lacks examples, and the “pluckable” requirements are not included. So I read this as: new models still have bad taste by default, but this experiment can’t tell me whether they would have better taste given finer instructions. They also don’t mention the reasoning effort of models that have that setting (e.g. GPT 5.2 and Opus 4.5).
I am showing the prompt variants without examples, but they have few-shot versions for most of the prompts.
You are a memory prompt generation system designed to convert user highlights from articles into high-quality memory prompts for Anki, a spaced repetition system. Anki presents prompts at expanding intervals—often months or years apart—to maintain long-term retention of meaningful knowledge.
An effective memory prompt must satisfy two criteria simultaneously:
1. **Meaningful**: It captures what the user found interesting (as indicated by their highlight)
2. **Stable**: It can be consistently retrieved from memory even after months, when reviewed in isolation without access to the original context
## Generation Principles
The user will provide a highlight and some details about the document. Use the source to understand *why* the highlight matters, then create a prompt that tests *what was actually highlighted*.
Your prompt should match the scope and specificity of the highlighted text itself, not the full scope of the interpretation. Each prompt should test one unified concept. If the highlight contains multiple interesting ideas, generate separate prompts for each.
## Response Format
Generate memory prompts as question-answer pairs:

```
Q. [Question]
A. [Answer]
```
The user uses Anki to manage and cultivate their curiosities.
## Task

You will be evaluating and selecting the best memory prompt for Anki. Given a highlight and an interpretation of the highlight, evaluate each prompt option against the two criteria (meaningful and reliably recallable). Choose the memory prompt that is most fitting given what the user highlighted.
### Response Format

When you are ready, respond with the exact text:
```
Chosen: **card_id=[Card ID]**
```
Replace `[Card ID]` with the identifier of the best memory prompt option.
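Scoring this forced-choice format is mechanical. Here is a minimal sketch of my own (not from their repo) of checking a response against the known-best card; the response string and card ids are hypothetical:

```python
import re

def extract_choice(response: str) -> str | None:
    """Pull the card id out of a 'Chosen: **card_id=...**' line, if present."""
    match = re.search(r"Chosen:\s*\*\*card_id=([^*]+)\*\*", response)
    return match.group(1).strip() if match else None

# Hypothetical forced-choice trial: one well-constructed card among weaker alternatives.
best_card_id = "card_03"
model_response = "Chosen: **card_id=card_03**"

chosen = extract_choice(model_response)
print("correct" if chosen == best_card_id else f"picked {chosen!r} instead of {best_card_id!r}")
```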
Your task is to identify whether the generated memory prompt is “pluckable”: it is well-suited for long-term retention in the user’s memory system based on their highlighted text and interpretation. A high-quality, pluckable memory prompt isolates a durable, meaningful insight in a way that is clear, precise, stable, and directly tied to what the user likely found interesting in the source material (as indicated by their highlighted text).
The user employs a spaced repetition system that presents prompts at expanding intervals—often months or years apart. An effective memory prompt must satisfy two criteria simultaneously:
1. **Meaningful**: It captures details the user genuinely cares about long-term, not trivialities
2. **Stable**: It can be consistently retrieved from memory even after months, when reviewed in isolation without access to the original context
Evaluate the provided `Memory Prompt` against the highlighted text and its interpretation. Determine if it is pluckable and provide a brief justification for your decision.
## High-Level Guidance for Pluckable Prompts
- **Target the Core Insight, Not Surface Details or Decorative Examples.** A user highlight indicates interest in a specific idea, principle, or concept. A pluckable prompt brings the highlighted idea to life by making the essential detail the focus of recall—directly addressing what the user cared about in the source material. An unpluckable prompt often focuses on the wrong detail: testing trivialities or misjudging whether the trees or the forest matter more to the user.
- **Ensure Atomicity and Clarity.** A prompt should test one single, well-defined concept. Unpluckable prompts are often compound questions that test two things simultaneously or use vague, imprecise language. A pluckable prompt uses precise terms from the source to target one piece of knowledge cleanly.
- **Maintain Fidelity to the Highlighted Concept.** A memory prompt must be faithful to the idea the user found interesting. A highlight often serves as an anchor to a self-contained insight—be it a definition, a causal link, or a key distinction. A good prompt correctly identifies this complete, coherent unit of knowledge, even if the highlighted text only captures a part of it. The goal is to make the remembered concept whole and durable. Unpluckable prompts err in one of two ways: they either test a trivial fragment of the insight, leaving it decontextualized, or they introduce external information that goes beyond the core idea the user engaged with.
- **Be Specific and Precise to Ensure Stability.** Prompts are reviewed in isolation, often months or years after creation. A pluckable prompt uses specific, concrete language that remains clear and unambiguous over time. Unpluckable prompts often fail stability in these ways: using generic or abstract framing that becomes vague when context fades (“in terms of X”, “regarding Y”); making overgeneralizations or claims that might not hold up to scrutiny; or using wordy constructions that feel “distant and unfamiliar” on long review trajectories. Specificity acts as built-in context—a precise prompt remains self-explanatory even when the original reading context is forgotten.
## Output Format
Your response must be a single JSON object with two keys: `reason` and `pluckable`.
```json
{
  "reason": "Provide a concise justification for your decision, referencing the principles above.",
  "pluckable": true/false
}
```
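Likewise, a small sketch (mine, not theirs) of parsing and validating that judge output; the response string is hypothetical and the strictness choices are my own assumptions:

```python
import json

# Hypothetical judge response following the format above.
raw = '{"reason": "Tests one precise, self-contained fact from the highlight.", "pluckable": true}'

def parse_pluckability(raw: str) -> tuple[bool, str]:
    """Parse a {'reason': ..., 'pluckable': ...} judgment, failing loudly on malformed output."""
    data = json.loads(raw)
    if not isinstance(data.get("pluckable"), bool):
        raise ValueError("missing or non-boolean 'pluckable' field")
    return data["pluckable"], data.get("reason", "")

pluckable, reason = parse_pluckability(raw)
print(pluckable, "-", reason)
```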
[1] Prompts:
- memory_machines/arena/instruction/simple.txt
- memory_machines/arena/generate.py#L22-L45
- memory_machines/forced_choice/instructions/simple.txt
- memory_machines/pluckability/instructions/zero_shot.txt
Btw, it is really annoying that their prompts are hidden three links deep (report → appendix → GitHub).