Figure 4 in the paper shows the performance of gradient routing in a toy setting (a small language model trained on synthetic children’s stories). The rightmost panel shows that gradient routing plus ablation, as we applied it in this setting, hurts performance a lot. However, there are ways to make gradient routing perform much better, such as applying parameter-level masking instead of activation-level masking. These are the subject of ongoing work.
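To make the distinction between the two kinds of masking concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the tiny two-unit model, the function names, and the mask values are all hypothetical, and "parameter-level masking" is interpreted here as simply multiplying each parameter's gradient by a mask before the update, which is an assumption about what the ongoing work does. Activation-level masking is modeled as blocking the backward signal through selected hidden units, which is what a stop-gradient on those units would do.

```python
# Toy model (hypothetical): h_i = w1_i * x, y = sum_i w2_i * h_i,
# loss L = 0.5 * (y - target)^2. Gradients are written out by hand.

def forward(w1, w2, x):
    h = [w * x for w in w1]
    y = sum(a * b for a, b in zip(w2, h))
    return h, y

def grads(w1, w2, x, target, act_mask=None):
    h, y = forward(w1, w2, x)
    dy = y - target                      # dL/dy
    g_w2 = [dy * hi for hi in h]         # dL/dw2_i uses the forward value of h_i
    g_h = [dy * wi for wi in w2]         # dL/dh_i
    if act_mask is not None:
        # Activation-level masking: block the backward signal through
        # masked-out hidden units (as a stop-gradient on h_i would).
        g_h = [m * g for m, g in zip(act_mask, g_h)]
    g_w1 = [g * x for g in g_h]          # dL/dw1_i
    return g_w1, g_w2

def activation_level(w1, w2, x, target, act_mask):
    return grads(w1, w2, x, target, act_mask=act_mask)

def parameter_level(w1, w2, x, target, mask_w1, mask_w2):
    # Assumed reading: compute ordinary gradients, then mask them
    # per parameter before they reach the optimizer.
    g_w1, g_w2 = grads(w1, w2, x, target)
    return ([m * g for m, g in zip(mask_w1, g_w1)],
            [m * g for m, g in zip(mask_w2, g_w2)])

w1, w2, x, target = [1.0, 2.0], [3.0, 4.0], 1.0, 10.0
keep_unit_0 = [1.0, 0.0]                 # route this example to hidden unit 0 only

act = activation_level(w1, w2, x, target, keep_unit_0)
par = parameter_level(w1, w2, x, target, keep_unit_0, keep_unit_0)
# act == ([3.0, 0.0], [1.0, 2.0])  -> w2[1] is still updated
# par == ([3.0, 0.0], [1.0, 0.0])  -> every parameter of unit 1 is frozen
```

In this toy, the activation-level mask only stops gradients from flowing back through the masked unit, so the downstream weight attached to that unit still receives an update. The parameter-level mask lets you choose exactly which parameters are updated. Whether this difference explains the performance gap mentioned above is not something this sketch can show.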