It’s very exciting to have an orthogonal research direction that finds these ground truth features, which might even generalize(!!). Please do report future results, even if negative (though your Malladi et al link is some evidence in the positive).
It’s also very confusing, since I’m unsure how this all fits in with everything else. This clearly works in these cases. SAEs clearly work in some cases as well (and same w/ the parameter decomposition research), but what’s the “Grand Theory of NN Interp” that explains all of these results?
In general, I believe it’s very important that we hedge our bets on research directions for interp. The main reason is that one of them might actually pan out, but even if none do, they already provide unique pieces of evidence for later researchers (maybe us, maybe LLMs, lol) to hopefully figure out that “Grand Theory of NN Interp”.