There has actually been some work visualizing this process, with a method called the “logit lens”.
The first example that I know of: https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens
A more thorough analysis: https://arxiv.org/abs/2303.08112
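
For anyone who wants the gist without reading the links, here is a rough sketch of the idea as I understand it: take the hidden state after each layer, apply the model's final layer norm and unembedding matrix to it, and see which token it would predict at that depth. Everything below is a toy stand-in. The arrays are random rather than pulled from a real model, and the names (`logit_lens`, `hidden_states`, `unembed`) are made up for illustration, not taken from either link.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    # Standard layer norm over the last (feature) axis.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gain + bias

def logit_lens(hidden_states, gain, bias, unembed):
    """Decode every layer's hidden state as if it were the final one.

    hidden_states: (n_layers, seq_len, d_model), one slice per layer
    unembed:       (d_model, vocab_size), the output projection
    Returns the top token id per layer and position: (n_layers, seq_len).
    """
    logits = layer_norm(hidden_states, gain, bias) @ unembed
    return logits.argmax(axis=-1)

# Toy data (assumption: random values standing in for real activations).
rng = np.random.default_rng(0)
n_layers, seq_len, d_model, vocab_size = 4, 3, 8, 10
hidden_states = rng.normal(size=(n_layers, seq_len, d_model))
gain, bias = np.ones(d_model), np.zeros(d_model)
unembed = rng.normal(size=(d_model, vocab_size))

top_tokens = logit_lens(hidden_states, gain, bias, unembed)
print(top_tokens)  # row i = the prediction read off after layer i
```

With a real model you would fill `hidden_states` with the actual per-layer activations and map the token ids back to strings; the last row then matches the model's normal output, and the earlier rows show how the guess evolves layer by layer.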