Given that CoDI throws away the hidden state and only uses the KV values at the <|eocot|> token, the accuracy drop after latent 5 could simply be because the KV values can't store any more information.
How did you learn that vertical attention corresponded to sentences?