For classification problems like this, my current hypothesis is that enumerating the target classes nudges the model to recruit a dedicated “choice-head” circuit that already lives inside the transformer. In other words, instead of sampling from the entire vocabulary, the model implicitly masks to the K tokens you listed and reallocates probability mass among them. A concrete prediction, so I can be wrong in public: if you ablate the 5 neurons most correlated with the enumeration-induced class direction, accuracy on that task will drop ≥ 15 pp, but ablating an equally sized set chosen at random will drop it ≤ 3 pp. I plan to test this on a 70B model over the next month; happy to update if the effect size evaporates.
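To make the prediction concrete, here's roughly the ablation comparison I have in mind. This is a minimal sketch, not my actual harness: the model handle, `mlp_layer`, `eval_fn`, and the precomputed `top5_idx` index set are all placeholders I'm assuming for illustration.

```python
import torch

def make_ablation_hook(neuron_idx):
    """Forward hook that zeroes the selected hidden units of a layer's output."""
    def hook(module, inputs, output):
        out = output.clone()
        out[..., neuron_idx] = 0.0
        return out  # a non-None return value replaces the layer's output
    return hook

def accuracy_with_ablation(model, layer, neuron_idx, eval_fn):
    """Run eval_fn(model) -> accuracy with the given neurons zeroed, then clean up."""
    handle = layer.register_forward_hook(make_ablation_hook(neuron_idx))
    try:
        return eval_fn(model)
    finally:
        handle.remove()

# Hypothetical usage (all names are placeholders, not a real checkpoint):
# baseline    = eval_fn(model)
# drop_top5   = baseline - accuracy_with_ablation(model, mlp_layer, top5_idx, eval_fn)
# rand_idx    = torch.randperm(mlp_layer.out_features)[:5]
# drop_random = baseline - accuracy_with_ablation(model, mlp_layer, rand_idx, eval_fn)
# The prediction above is drop_top5 >= 0.15 and drop_random <= 0.03.
```

The hook-based approach keeps the ablation reversible and confined to one layer, which matters for a clean targeted-vs-random comparison on the same eval set.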