Plot logits for a few sentences over training time (e.g. continuations of a prompt describing an attempt to hurt a human), compare the logit curves to the loss curve, and compare both against the point where RLHF begins.
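A minimal sketch of the bookkeeping involved, assuming you can run each saved checkpoint on a prompt and read off its output logits. Everything here (the synthetic logits, the `harmful_ids` token ids) is a stand-in for illustration, not real model output:

```python
import numpy as np

def continuation_logprob(logits, token_ids):
    """Sum of per-token log-probabilities of a continuation.

    logits: (seq_len, vocab) array of next-token logits, one row per position.
    token_ids: the continuation's token ids, aligned with the rows of logits.
    """
    # Numerically stable log-softmax over the vocabulary.
    logits = logits - logits.max(axis=-1, keepdims=True)
    logprobs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(logprobs[np.arange(len(token_ids)), token_ids].sum())

# Synthetic stand-in for "logits at each checkpoint": in practice you would
# load each checkpoint, run it on the prompt, and use its actual logits here.
rng = np.random.default_rng(0)
vocab, seq_len = 50, 4
harmful_ids = np.array([3, 17, 8, 29])   # hypothetical continuation token ids

curve = []
for step in range(10):                   # one entry per saved checkpoint
    ckpt_logits = rng.normal(size=(seq_len, vocab))
    curve.append(continuation_logprob(ckpt_logits, harmful_ids))

# `curve` is the log-probability trajectory to plot against training step
# (and against the step where RLHF begins).
```

The per-checkpoint curve is what you would overlay on the loss curve to look for correlated phase changes.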
I don’t think this specific experiment has been done, but variants of “plot logprobs/logits/loss for particular things over training” have been done several times. For example, Anthropic’s induction heads paper computes “per-token loss vectors” (a technique they credit to prior work) and then performs a PCA on them.
They note that the training trajectories turn abruptly around the time induction heads form (as indicated by the loss on their synthetic in-context learning task).
Also note that for language models you’ll need dimensionality reduction to see anything interesting, since the space is so high-dimensional.
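A sketch of that dimensionality-reduction step, assuming a matrix of per-token losses with one row per checkpoint and one column per token position. The loss matrix here is synthetic (faking a phase change partway through training); in practice each row would come from evaluating a checkpoint on a fixed held-out corpus:

```python
import numpy as np

# Toy stand-in for the per-token loss matrix: rows are checkpoints,
# columns are token positions in a fixed evaluation set.
rng = np.random.default_rng(1)
n_ckpts, n_tokens = 40, 500
t = np.linspace(0, 1, n_ckpts)[:, None]
# Fake a phase change halfway through training, plus noise.
loss_vectors = (3.0 - 2.0 * (t > 0.5) * rng.random(n_tokens)
                + 0.05 * rng.normal(size=(n_ckpts, n_tokens)))

def pca_project(X, k=2):
    """Project rows of X onto the top-k principal components (via SVD)."""
    Xc = X - X.mean(axis=0, keepdims=True)   # center each column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                      # (n_rows, k) trajectory

traj = pca_project(loss_vectors)  # plot traj[:, 0] vs traj[:, 1] over checkpoints
```

Plotting the rows of `traj` in order of training step gives the kind of 2-D training trajectory in which the induction-heads paper observed an abrupt turn.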