The finding that refusals are mediated entirely by a single direction really highlights how interpretability and safety from malicious users are pretty much at odds. On one hand, it's a remarkable result for interpretability that something like this is the case. On the other, the only fix I can think of is some kind of contrived regularisation procedure in pretraining that forces the model to muddle this direction, thus losing one of the few insights into its internal process that we have.
Or maybe we just conclude that open-source/open-weight models past a certain capability level are a terrible idea?
(For models that are catastrophically dangerous if misused?)
(Really, we also need to suppose there are issues with strategy stealing for open source to be a problem. E.g. offense-defense imbalances or alignment difficulties.)
(For models that are catastrophically dangerous if misused?) - yes, edited
(Really, we also need to suppose there are issues with strategy stealing for open source to be a problem. E.g. offense-defense imbalances or alignment difficulties.) - I would prefer not to test this by releasing the models and seeing what happens to society.
It doesn’t change much; the point still applies, because when we're talking about hypothetical, really powerful models, ideally we’d want them to follow very strong principles regardless of who asks. E.g. if an AI were in charge of a military, it obviously wouldn’t be open, but it shouldn’t accept orders to commit war crimes even from a general or a president.
Whether or not it obeys orders is irrelevant for open-source/open-weight models, where, as this research shows, the refusal behaviour can simply be removed.
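For concreteness, here is a minimal sketch of the kind of removal being discussed: estimate the "refusal direction" as a difference of mean activations between harmful and harmless prompts, then project that direction out of the activations. This is my own illustration, not the paper's code; the tensor shapes, function names, and toy data below are assumptions for the sake of a runnable example.

```python
# Minimal sketch (not the paper's implementation) of estimating a single
# "refusal direction" and ablating it from residual-stream activations.
import torch

def estimate_refusal_direction(harmful_acts: torch.Tensor,
                               harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-in-means direction between activations on harmful and
    harmless prompts, normalised to unit length. Shapes: (n_prompts, d_model)."""
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def ablate_direction(activations: torch.Tensor,
                     direction: torch.Tensor) -> torch.Tensor:
    """Remove each activation's component along the unit vector `direction`:
    x <- x - (x . r) r, for any (..., d_model) tensor."""
    coeff = activations @ direction              # projection coefficients
    return activations - coeff.unsqueeze(-1) * direction

# Toy usage with random data standing in for real model activations:
d_model = 8
harmful = torch.randn(32, d_model) + 2.0         # pretend these cluster apart
harmless = torch.randn(32, d_model)
r = estimate_refusal_direction(harmful, harmless)
ablated = ablate_direction(harmful, r)
print((ablated @ r).abs().max())                 # ~0: nothing left along r
```

Applied at inference (or baked into the weights), this kind of projection is why open weights make the refusal behaviour trivially removable regardless of how the model was trained to respond to orders.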