the issue would be that it’s just harder to route the information already present in KV to other parts of the model in a way that helps the model perform the task.
I agree; I also think this is the most likely explanation. I think the hybrid method I propose would avoid this problem and get the benefits of both, but unfortunately it’s non-trivial to implement, so we didn’t test it.