The finding that refusals are mediated entirely by a single direction really highlights how interpretability and safety from malicious users are pretty much at odds. On one hand, it's a remarkable result for interpretability that something like this is the case. On the other, the only fix I can think of is some kind of contrived regularisation procedure in pretraining that forces the model to muddle this direction, thus losing one of the few insights into its internal process that we have.
Or maybe we just conclude that open-source/open-weight models past a certain capability level are a terrible idea?
(For models that are catastrophically dangerous if misused?)
(Really, we also need to suppose there are issues with strategy stealing for open source to be a problem. E.g. offense-defense imbalances or alignment difficulties.)
(For models that are catastrophically dangerous if misused?) - yes, edited
(Really, we also need to suppose there are issues with strategy stealing for open source to be a problem. E.g. offense-defense imbalances or alignment difficulties.) - I would prefer not to test this by releasing the models and seeing what happens to society.
It doesn’t change much; the point still applies, because when we're talking about hypothetical, really powerful models, ideally we’d want them to follow very strong principles regardless of who asks. E.g. if an AI were in charge of a military, it obviously wouldn’t be open, but it shouldn’t accept orders to commit war crimes even from a general or a president.
Whether or not it obeys orders is irrelevant for open-source/open-weight models, where, as this research shows, the refusal behaviour can simply be removed.
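For concreteness, here is a minimal sketch of the kind of removal being discussed: estimate the "refusal direction" as a difference of mean activations between harmful and harmless prompts, then project that direction out of the activations. This is my own illustration, not the paper's code; the tensor shapes, function names, and toy data below are assumptions for the sake of a runnable example.

```python
# Minimal sketch (not the paper's implementation) of estimating a single
# "refusal direction" and ablating it from residual-stream activations.
import torch

def estimate_refusal_direction(harmful_acts: torch.Tensor,
                               harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-in-means direction between activations on harmful and
    harmless prompts, normalised to unit length. Shapes: (n_prompts, d_model)."""
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def ablate_direction(activations: torch.Tensor,
                     direction: torch.Tensor) -> torch.Tensor:
    """Remove each activation's component along the unit vector `direction`:
    x <- x - (x . r) r, for any (..., d_model) tensor."""
    coeff = activations @ direction              # projection coefficients
    return activations - coeff.unsqueeze(-1) * direction

# Toy usage with random data standing in for real model activations:
d_model = 8
harmful = torch.randn(32, d_model) + 2.0         # pretend these cluster apart
harmless = torch.randn(32, d_model)
r = estimate_refusal_direction(harmful, harmless)
ablated = ablate_direction(harmful, r)
print((ablated @ r).abs().max())                 # ~0: nothing left along r
```

Applied at inference (or baked into the weights), this kind of projection is why open weights make the refusal behaviour trivially removable regardless of how the model was trained to respond to orders.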