Interesting post. We explored similar work during a MATS stream, training different MoE designs to get more interpretable experts. We started by just testing increasingly sparse MoEs (partly inspired by that Monet paper) on the logic that smaller experts = tighter specialization, then moved on to things like orthogonality constraints, etc.
We were pretty pessimistic about the results at first. Individual experts didn’t seem to specialize in anything you wouldn’t get from just running k-means on the residual stream (i.e., no real interp benefit). This is sort of obvious once you remember that MoE routing is just a linear projection of the residual stream followed by a top-k, but for some reason nobody else in the MoE interp literature seemed to recognize this until recently.
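To make the k-means comparison concrete, here is a minimal sketch (not the code we ran, just an illustration on random data). It assumes top-1 routing with no router bias, and all names (`route_top1`, `kmeans_assign`, `W_router`) are hypothetical. The point is that nearest-centroid assignment can be rewritten as "linear score, then argmax", which is the same functional form as a router:

```python
import numpy as np

def route_top1(h, W_router):
    """Top-1 MoE routing: argmax over linear router logits (no bias assumed)."""
    logits = h @ W_router.T  # shape (n_tokens, n_experts)
    return logits.argmax(axis=1)

def kmeans_assign(h, centroids):
    """k-means assignment step, written as a linear score plus a per-cluster bias.

    argmin_c ||h - c||^2 == argmax_c (h . c - 0.5 * ||c||^2),
    because the ||h||^2 term is the same for every cluster.
    """
    scores = h @ centroids.T - 0.5 * (centroids ** 2).sum(axis=1)
    return scores.argmax(axis=1)

rng = np.random.default_rng(0)
h = rng.normal(size=(1000, 16))          # stand-in for residual stream activations
centroids = rng.normal(size=(4, 16))     # stand-in for 4 cluster centres
# Give every centroid unit norm so the bias term is the same for all clusters.
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

experts = route_top1(h, centroids)       # reuse the centroids as router rows
clusters = kmeans_assign(h, centroids)
agreement = (experts == clusters).mean() # fraction of tokens assigned identically
```

With equal-norm centroids the bias term cancels and the two assignments coincide. With unequal norms they can differ, but both still carve the residual stream into regions with linear boundaries, which is why the resulting experts look so much like clusters.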
We did find that this isn’t the full picture, though: experts actually specialize in different things than the underlying hidden state. They pull out more abstract functions while leaving more long-term “state” features (language, token ID, etc.) in the residual stream. Maybe some of this can be useful for you.