Interesting challenge for mechanistic interpretability: Can you take a base model and make it behave as aligned as Opus does, without using any gradients on output or any in-context learning?
(I’m completely not up to date with interp.) How good are steering vectors (and adjacent techniques) for this sort of stuff?
I don’t know. But to be honest, for the spirit of the challenge I’d want those banned too, because it’s kind of just doing a poor version of gradient descent.
EDIT: Or like: steering vectors found from individual features in an SAE trained on pretraining-style data (from before pretraining data became contaminated by billions of assistant-persona-generated tokens) would probably be fine. But, e.g., getting Opus to output 1,000 messages, getting your base model to output 1,000 messages, and looking at the activation difference when you pass them both through your base model again is not okay.
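To make the distinction concrete, here’s a rough sketch of the two ways of getting a steering vector. The model, layer, SAE weights, and scale are all hypothetical stand-ins (random tensors, not any real library’s API); it only illustrates the shape of the computation, assuming activations have already been cached.

```python
import torch

d_model = 512

# (a) The "not okay" version: difference-in-means between the base model's
#     activations on Opus-generated messages vs. on its own generations.
#     This effectively distills the assistant persona from Opus's outputs.
opus_acts = torch.randn(1000, d_model)  # base-model activations on 1,000 Opus messages (placeholder)
base_acts = torch.randn(1000, d_model)  # base-model activations on 1,000 of its own messages (placeholder)
contrastive_vector = opus_acts.mean(0) - base_acts.mean(0)

# (b) The "probably fine" version: a single feature direction read off the
#     decoder of an SAE trained only on pre-assistant-era pretraining data.
sae_decoder = torch.randn(16384, d_model)  # hypothetical decoder weights, one row per feature
feature_idx = 1234                          # a feature chosen by inspection
sae_vector = sae_decoder[feature_idx]

# Either vector would then be added into the residual stream at some layer
# during generation, e.g. via a forward hook on that layer's module:
def make_steering_hook(vector, scale=4.0):
    def hook(module, inputs, output):
        return output + scale * vector
    return hook
```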