Cool result. I worry the sign flips come from your refusal label rather than from the feature. In Japanese, I would guess many refusals begin with a polite preamble, so a 30-token cutoff might fall before the explicit refusal. Could you instead define a small set of refusal markers per language, compute a refusal direction from those tokens, and test whether feature 14018 raises that refusal logit in English but lowers it in Japanese on the same contexts? I think that would remove the judge and length confounders. If the flip survives, I would update much more toward semantic instability.
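
To make the suggestion concrete, here is a rough sketch of one way the marker-based check could look. Everything named here is an assumption on my part rather than your setup: `W_U` stands for the unembedding matrix (shape `d_model x vocab`), `marker_ids` for the token ids of a hand-picked refusal marker set in one language, `resid` for final-layer residual activations at the positions of interest, and `feature_dir` for the decoder direction of feature 14018. I also assume a final LayerNorm with no learned gain or bias, and I read "refusal direction" as the mean unembedding of the marker tokens minus the vocabulary mean, which is only one possible definition.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Plain LayerNorm over the last axis (assumes no learned gain/bias)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def refusal_direction(W_U, marker_ids):
    """Mean unembedding column of the marker tokens minus the vocabulary mean.

    W_U: (d_model, vocab) unembedding matrix (assumed layout).
    marker_ids: token ids of the hypothetical refusal markers for one language.
    """
    return W_U[:, marker_ids].mean(axis=1) - W_U.mean(axis=1)


def refusal_logit_shift(resid, feature_dir, W_U, marker_ids, alpha=1.0):
    """Change in the centered refusal logit when adding alpha * feature_dir.

    resid: (n_contexts, d_model) residual activations before the final norm.
    Returns one shift per context; positive means the feature pushes toward
    the refusal markers, negative means it pushes away.
    """
    direction = refusal_direction(W_U, marker_ids)
    base = layer_norm(resid) @ direction
    steered = layer_norm(resid + alpha * feature_dir) @ direction
    return steered - base
```

The comparison I have in mind would call `refusal_logit_shift` twice on matched contexts, once with the English marker ids and once with the Japanese ones, and check whether the signs of the two shift vectors disagree. Because this reads logits directly, neither the judge nor the truncation length enters the measurement.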