Replaced with "Gradient routing is better than pretraining filtering."
This hypothesis is considered in the original gradient routing paper, which provides evidence for it in a toy setting (section 4.2.2; section 4.3 also compares gradient routing to data filtering in RL). It might help readers if you rephrased your post to make the connection to existing work clearer, particularly in the “Why Gradient Routing Handles Imperfect Labels Better” section. (There is similar reasoning in the first paragraph of the paper's Discussion.)
That said, thanks for raising this point and for the concrete proposal! I think this would be a great experiment. You might be glad to know that there are a couple of ongoing projects investigating similar questions. Hopefully they will share results in the next couple of months. (Also: you might be interested in the discussions of absorption here.)
Thanks Alex, I should’ve read the paper more closely! I’ve replaced the shortform with a post which includes the results from the paper.
Nit: The title gives the impression of a demonstrated result as opposed to a working hypothesis and proposed experiment.
Good point, thanks Lucas.