A distribution shift in the activations would be a shift away from the fine-tuning data (the safety-rich signal), so are the two ideas perhaps the same thing in different clothes?