Mechanistically, my intuition is that this is a problem of batch/dataset diversity, the next-token prediction objective, etc.
That is, LLMs are able to learn to tell fact from fiction during pre-training because, in order to minimize the NLL objective within diverse batches, the model must weigh and integrate all the (sub)sequences, and indiscriminately predicting the fictional ones regardless of context would increase the loss on other tokens.
Say, if the false Ed Sheeran example were in the pre-training dataset, the gradient step would be weighed against the true claims, roughly like so:
Upweighting the false claim unconditionally would raise the loss on the true claim, so the model is forced to separate the two according to context, similar to the corrections/negations/soft constraints in the paper.
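To make the "forced to separate according to context" point concrete, here is a minimal toy sketch, not anything from the paper. It assumes a two-token vocabulary, a single binary context feature (0 = plain factual context, 1 = context marked as fiction/corrected), and a linear model whose logits are `bias + context * weight`. All names (`train`, `use_context`, etc.) are hypothetical. The batch mixes the true claim (plain context, target "Noah Lyles") with the false claim (marked context, target "Ed Sheeran").

```python
import numpy as np

TOKENS = ["Ed Sheeran", "Noah Lyles"]

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def train(contexts, targets, use_context, steps=2000, lr=0.5):
    """Gradient descent on the mean NLL of logits = bias + context * weight.

    With use_context=False only the unconditional bias is trained, i.e. the
    model can only upweight a token regardless of context.
    """
    bias = np.zeros(2)
    weight = np.zeros(2)
    onehot = np.eye(2)[targets]
    for _ in range(steps):
        probs = softmax(bias + contexts[:, None] * weight)
        # Gradient of the mean NLL with respect to each example's logits.
        grad_logits = (probs - onehot) / len(targets)
        bias -= lr * grad_logits.sum(axis=0)
        if use_context:
            weight -= lr * (contexts[:, None] * grad_logits).sum(axis=0)
    probs = softmax(bias + contexts[:, None] * weight)
    nll = -np.log(probs[np.arange(len(targets)), targets]).mean()
    return nll, probs

# Example 0: true claim in a plain context -> "Noah Lyles" (index 1).
# Example 1: false claim in a marked context -> "Ed Sheeran" (index 0).
contexts = np.array([0.0, 1.0])
targets = np.array([1, 0])

nll_flat, probs_flat = train(contexts, targets, use_context=False)
nll_ctx, probs_ctx = train(contexts, targets, use_context=True)

print("unconditional only:", round(float(nll_flat), 4))
print("context-dependent: ", round(float(nll_ctx), 4))
```

In this toy setup the unconditional model is stuck: the two examples pull the bias in opposite directions, so its loss cannot drop below that of a coin flip, while the context-dependent model can fit both. A real transformer is obviously far more complicated; this only illustrates the direction of the pressure.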
Meanwhile, if you expose the LLM to the uncorrected Ed Sheeran claim enough times, then, as I mentioned, regardless of whatever negation prefixes you have, somewhere along the training curve, reducing the loss will probably interfere with the circuits that represent the truth from pre-training. What I imagine is that, at some point, the output probabilities look something like this:
| Probability | Token | Circuit |
| --- | --- | --- |
| 0.92 | Ed Sheeran | Circuit A, which learns to predict this alongside the negation prefixes/suffixes |
| 0.04 | Noah Lyles | Circuit B, which encodes the competing true association. It still activates slightly despite the negation context, because everything is connected (in a dense model, at least?) |
| 0.02 | etc. | Other circuits, likely with relevant knowledge |
| ... | etc. | Other circuits, likely with relevant knowledge |
Basically, to push Ed Sheeran higher, gradient descent will interfere with circuits other than A. And I would guess that in the mimic-pre-training experiments, the “false” sequence still occurs disproportionately often (corrections/others pull the distribution towards something less skewed). Am I missing something? (Apologies if this is already addressed somewhere in the paper.)
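A rough numerical sketch of that last step, under loud assumptions: the probabilities are the illustrative ones above (with the unspecified remainder lumped into one hypothetical "rest" entry), and the logits are treated as if they were the trainable parameters themselves. For softmax plus NLL, the gradient with respect to the logits is `probs - onehot(target)`, so a step that raises the target token necessarily lowers every other token's logit.

```python
import numpy as np

tokens = ["Ed Sheeran", "Noah Lyles", "etc.", "rest (lumped, hypothetical)"]
probs = np.array([0.92, 0.04, 0.02, 0.02])
target = 0  # the repeated, uncorrected claim keeps asking for "Ed Sheeran"

def nll_logit_gradient(probs, target):
    """Gradient of -log(probs[target]) with respect to the logits."""
    grad = probs.copy()
    grad[target] -= 1.0
    return grad

grad = nll_logit_gradient(probs, target)

# One plain gradient-descent step taken directly on the logits.
lr = 1.0
logits = np.log(probs)
new_logits = logits - lr * grad
new_probs = np.exp(new_logits - new_logits.max())
new_probs /= new_probs.sum()

for name, g, p_old, p_new in zip(tokens, grad, probs, new_probs):
    print(f"{name:28s} grad={g:+.3f}  p: {p_old:.4f} -> {p_new:.4f}")
```

The gradient is negative only for the target and positive for everything else, "Noah Lyles" included, so the same step that raises "Ed Sheeran" pushes the true answer down. In a real network that gradient flows back into whatever weights produce those logits; whether that actually damages a distinct "Circuit B" is the speculative part of my argument and is not something this sketch shows.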