I think (80% credence) that Mechanistically Eliciting Latent Behaviors in Language Models would be able to find a steering vector that would cause the model to bypass the password protection if ~100 vectors were trained (maybe less). This method is totally unsupervised (all you need to do is pick out the steering vectors at the end that correspond to the capabilities you want).
I would run this experiment if I had the model. Is there a way to get the password protected model?
Using unsupervised behavior discovery+RL on password-locked models (we bet it works well) and trying to build password-locked models that are robust to this method.
I think (80% credence) that Mechanistically Eliciting Latent Behaviors in Language Models would be able to find a steering vector that would cause the model to bypass the password protection if ~100 vectors were trained (maybe less). This method is totally unsupervised (all you need to do is pick out the steering vectors at the end that correspond to the capabilities you want).
I would run this experiment if I had the model. Is there a way to get the password protected model?
Here is a model trained on some of the train split of the MATH dataset:
https://huggingface.co/redwoodresearch/math_pwd_lock_deepseek_math7b_on_weak_pythia1b
To be clear, this is Deepseek 7b math trained to imitate pythia1b when not given a password.
You can get a prompt for this model with: [sandbagging/math_setting.py](https://github.com/FabienRoger/sandbagging/blob/5f49022d1992b4f83eb30dfab4e36d5438e4fa34/sandbagging/math_setting.py( and the password is
"This problem is quite special, read it carefully!". (See DEFAULT_PASSWORD here)Great, thanks so much! I’ll get back to you with any experiments I run!
We agree:
I think we can reasonably easily upload some models somewhere. I don’t promise we’ll prioritize this, but it should happen at some point.