Jacob G-W comments on [Paper] Stress-testing capability elicitation with password-locked models

Jacob G-W 6 Jun 2024 2:29 UTC
LW: 2 AF: 2
1
AF
I think (80% credence) that Mechanistically Eliciting Latent Behaviors in Language Models would be able to find a steering vector that causes the model to bypass the password protection if ~100 vectors were trained (maybe fewer). This method is totally unsupervised (all you need to do is pick out the steering vectors at the end that correspond to the capabilities you want).
I would run this experiment if I had the model. Is there a way to get the password-protected model?
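The "pick out the steering vectors at the end" step could look roughly like the sketch below. This is a toy illustration rather than code from the linked method: `pick_unlocking_vectors`, `score_fn`, and the threshold are hypothetical, and `score_fn` stands in for whatever evaluation one actually runs (for example, accuracy on held-out problems with the vector applied and no password in the prompt). The scores in the usage lines are made up purely to exercise the function.

```python
from typing import Callable, Iterable, List, Tuple


def pick_unlocking_vectors(
    vector_ids: Iterable[int],
    score_fn: Callable[[int], float],
    threshold: float,
) -> List[Tuple[int, float]]:
    """Return (vector_id, score) pairs with score >= threshold, best first.

    score_fn is a hypothetical stand-in for evaluating the steered model
    without the password; it is not part of any real library.
    """
    scored = [(vid, score_fn(vid)) for vid in vector_ids]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up scores, for illustration only (not experimental results).
toy_scores = {0: 0.05, 1: 0.04, 2: 0.31, 3: 0.06}
selected = pick_unlocking_vectors(range(4), toy_scores.__getitem__, threshold=0.2)
print(selected)
```

In practice the threshold would presumably be set relative to the locked model's no-password baseline, so that only vectors that clearly beat the weak behavior are kept.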
- Fabien Roger 6 Jun 2024 15:55 UTC
  LW: 5 AF: 4
  0
  AF Parent
  Here is a model trained on some of the train split of the MATH dataset:
  https://huggingface.co/redwoodresearch/math_pwd_lock_deepseek_math7b_on_weak_pythia1b
  - ryan_greenblatt 6 Jun 2024 17:10 UTC
    LW: 3 AF: 3
    0
    AF Parent
    To be clear, this is DeepSeek Math 7B trained to imitate Pythia 1B when not given a password.
    
    You can get a prompt for this model with [sandbagging/math_setting.py](https://github.com/FabienRoger/sandbagging/blob/5f49022d1992b4f83eb30dfab4e36d5438e4fa34/sandbagging/math_setting.py), and the password is "This problem is quite special, read it carefully!". (See DEFAULT_PASSWORD here)
    - Jacob G-W 6 Jun 2024 18:43 UTC
      1 point
      0
      Parent
      Great, thanks so much! I’ll get back to you with any experiments I run!
- ryan_greenblatt 6 Jun 2024 4:56 UTC
  LW: 5 AF: 4
  1
  AF Parent
  We agree:
  
  Using unsupervised behavior discovery+RL on password-locked models (we bet it works well) and trying to build password-locked models that are robust to this method.
- ryan_greenblatt 6 Jun 2024 4:56 UTC
  3 points
  0
  Parent
  I think we can reasonably easily upload some models somewhere. I don’t promise we’ll prioritize this, but it should happen at some point.