For Qwen2.5-7B-Instruct’s NLAs, I found evidence that the rate at which the answer appears in the AV increases as the token position approaches the model’s final answer.
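A rough sketch of how a rate like this could be tallied, not the actual analysis code: it assumes each record is a hypothetical `(offset, av_text, final_answer)` tuple, where `offset` is the token position relative to the final answer token (0 = on it) and `av_text` is the AV's text at that position. The function name, the record layout, and the toy data are all made up for illustration.

```python
import re
from collections import defaultdict

def answer_rate_by_offset(records):
    """Fraction of AV texts containing the final answer, grouped by token offset.

    records: iterable of (offset, av_text, final_answer) tuples (hypothetical layout).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for offset, av_text, final_answer in records:
        totals[offset] += 1
        # Match the answer as a standalone number, so "42" does not count inside "142".
        pattern = r"(?<!\d)" + re.escape(final_answer) + r"(?!\d)"
        if re.search(pattern, av_text):
            hits[offset] += 1
    return {offset: hits[offset] / totals[offset] for offset in sorted(totals)}

# Toy data only, not real results.
records = [
    (-2, "adding 18 and 24", "42"),
    (-2, "carrying numbers over", "42"),
    (-1, "sum is about 42", "42"),
    (-1, "computing total 142", "42"),
    (0, "the answer is 42", "42"),
    (0, "final answer: 42.", "42"),
]
rates = answer_rate_by_offset(records)
```

The regex match is a crude proxy: it would miss answers the AV writes out in words or in a different number format.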
Realmbird
Karma: 74
Which token position did you use, e.g. the final answer token vs. a border token? The AV on the final answer token shows the final answer at a higher rate. For the 27B results, which token was it?
Cool idea
Given that CoDI throws away the hidden state and only uses the KV values at the <|eocot|> token, the accuracy drop after latent 5 could just be that the KV values can’t store more information.
How did you learn that vertical attention corresponded to sentences?
For this experiment with NLAs at the 7B level, I used two 3090s:
one for the AV on SGLang and the other for generating responses and the AR.
With 40 rollouts per prompt on GSM8K, the total time was around 24 hours.
It should be way faster if you did it for 7B on an A100.