For classification problems like this, my current hypothesis is that enumerating the target classes nudges the model to recruit a dedicated “choice-head” circuit that already lives inside the transformer. In other words, instead of sampling from the entire vocabulary, the model implicitly masks to the K tokens you listed and reallocates probability mass among them. A concrete prediction, so I can be wrong in public: if you ablate the 5 neurons most correlated with the enumeration-induced class direction, accuracy on that task will drop ≥ 15 pp, but ablating an equally sized set chosen at random will drop it ≤ 3 pp. I plan to test this on a 70B model over the next month; happy to update if the effect size evaporates.
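To make the prediction concrete, here's roughly the ablation comparison I have in mind. This is a minimal sketch, not my actual harness: the model handle, `mlp_layer`, `eval_fn`, and the precomputed `top5_idx` index set are all placeholders I'm assuming for illustration.

```python
import torch

def make_ablation_hook(neuron_idx):
    """Forward hook that zeroes the selected hidden units of a layer's output."""
    def hook(module, inputs, output):
        out = output.clone()
        out[..., neuron_idx] = 0.0
        return out  # a non-None return value replaces the layer's output
    return hook

def accuracy_with_ablation(model, layer, neuron_idx, eval_fn):
    """Run eval_fn(model) -> accuracy with the given neurons zeroed, then clean up."""
    handle = layer.register_forward_hook(make_ablation_hook(neuron_idx))
    try:
        return eval_fn(model)
    finally:
        handle.remove()

# Hypothetical usage (all names are placeholders, not a real checkpoint):
# baseline    = eval_fn(model)
# drop_top5   = baseline - accuracy_with_ablation(model, mlp_layer, top5_idx, eval_fn)
# rand_idx    = torch.randperm(mlp_layer.out_features)[:5]
# drop_random = baseline - accuracy_with_ablation(model, mlp_layer, rand_idx, eval_fn)
# The prediction above is drop_top5 >= 0.15 and drop_random <= 0.03.
```

The hook-based approach keeps the ablation reversible and confined to one layer, which matters for a clean targeted-vs-random comparison on the same eval set.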