I think this is missing the mechanics of interpretability. Interpretability is about the opposite: “what it does”.
So basically, interpretability cares about mixed features (the malfunction where the thing is not as labeled) only insofar as the feature does something other than what the label would lead us to think it does.
That is to say, in addition to labeling the representational parts of the model, interp wants to know the relations between those parts,
so that ultimately we know enough about what the model will do to either debug it (as capabilities research) or prove that it will not try to do x, y, and z that would kill us (for safety).
Basically, for an alignment researcher, the polysemanticity that comes from the model being wrong sometimes is basically okay, so long as the wrongness really is in the model and so produces the same actions.
Even plain polysemanticity is not the end of the world for your interp tools, because there is not one “right” semantics: you just want to span the model behaviour.
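To make the “span the behaviour” point concrete, here is a minimal toy sketch (my own illustration, not from the original; it assumes a single linear layer, NumPy, and hypothetical names throughout): two different feature bases for the same hidden activations carry different per-feature labels, yet reproduce exactly the same outputs, so neither basis is the uniquely “right” semantics.

```python
import numpy as np

# Hypothetical toy model: one linear layer in, one linear layer out.
rng = np.random.default_rng(0)

d_in, d_hidden, d_out = 4, 3, 2
W_in = rng.normal(size=(d_hidden, d_in))    # input -> hidden weights
W_out = rng.normal(size=(d_out, d_hidden))  # hidden -> output weights

x = rng.normal(size=(5, d_in))              # a batch of toy inputs
h = x @ W_in.T                              # hidden activations
y = h @ W_out.T                             # the model's behaviour (outputs)

# Interpretation A: the standard basis, i.e. each hidden unit is "a feature".
basis_A = np.eye(d_hidden)

# Interpretation B: an arbitrary invertible change of basis, i.e. different
# feature directions and therefore different (possibly polysemantic) labels.
basis_B = rng.normal(size=(d_hidden, d_hidden))

for basis in (basis_A, basis_B):
    feats = h @ np.linalg.inv(basis).T       # activations in this feature basis
    readout = W_out @ basis                  # output weights expressed in that basis
    assert np.allclose(feats @ readout.T, y) # both bases span the same behaviour
```

Both decompositions predict the model's actions equally well; they differ only in how the parts are labeled, which is the sense in which there is no single correct semantics.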