What do we see if we apply interpretability tools to the filler tokens or repeats of the problem? Can we use internals to better understand why this helps the model?
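For what it's worth, here's roughly the first thing I'd try: look at how much of the attention from the answer position lands on the filler span, layer by layer. This is just a sketch against TransformerLens, with gpt2, a made-up prompt, and guessed position indices standing in for the actual filler-trained checkpoint, which I obviously don't have.

```python
from transformer_lens import HookedTransformer

# Placeholder model -- the real object of interest is whatever checkpoint
# was trained with filler tokens.
model = HookedTransformer.from_pretrained("gpt2")

# Made-up prompt: a problem statement, a run of '.' fillers, then an answer cue.
prompt = "2 4 7 1 3 :" + " ." * 20 + " answer:"
tokens = model.to_tokens(prompt)

logits, cache = model.run_with_cache(tokens)

# Rough guess at which positions hold the fillers after tokenization
# (would need to check against the real tokenizer).
filler_slice = slice(-22, -2)

# For each layer: how much attention does the final position (where the
# answer would be read off) pay into the filler span?
# pattern shape: [batch, head, query_pos, key_pos]
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]
    attn_to_filler = pattern[0, :, -1, filler_slice].sum(dim=-1)  # per head
    print(f"layer {layer}: max head attention into fillers = "
          f"{attn_to_filler.max().item():.3f}")
```

If some heads in later layers attend heavily into the fillers, that at least tells us the filler positions are carrying information downstream rather than just padding out the context.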
Yeah, I predicted that this would be too hard for models to learn without help, so I’m really curious to see how it managed to do this. Is it using the positional information to randomize which token does which part of the parallel computation, or something else?
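And here's the kind of intervention I'd use to get at the positional-information question: permute the residual stream across the filler positions partway through the network. If positional information is parcelling out the parallel computation, each filler position should be carrying a different partial result, and the shuffle should wreck the answer; if the fillers are interchangeable, it should barely matter. Again just a sketch, with the same placeholder model, prompt, and position indices as above and an arbitrarily chosen layer.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")   # placeholder checkpoint again
tokens = model.to_tokens("2 4 7 1 3 :" + " ." * 20 + " answer:")

# Rough guess at the filler positions (the 20 ' .' tokens before ' answer:').
filler_pos = torch.arange(tokens.shape[1] - 22, tokens.shape[1] - 2)
layer = model.cfg.n_layers // 2                     # arbitrary midpoint

def shuffle_fillers(resid, hook):
    # Permute the residual stream across the filler positions only.
    perm = filler_pos[torch.randperm(len(filler_pos))]
    resid[:, filler_pos, :] = resid[:, perm, :]
    return resid

clean_logits = model(tokens)
shuffled_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(f"blocks.{layer}.hook_resid_post", shuffle_fillers)],
)

# If the filler positions have position-specific roles, the prediction at the
# answer position should change a lot under the shuffle.
print((clean_logits[0, -1] - shuffled_logits[0, -1]).abs().max().item())
```

One caveat: downstream heads that select the right filler by content rather than by position could partially recover from the shuffle, so a small effect wouldn't fully rule out position-specific roles.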