I’m not sure what the best approximation for an open-weights model is here, but I’d be interested to see a helpful-only variant.
If it matters, we ran experiments on Llama 8B with various other backdoors that aren’t “bad coded” and got qualitatively similar results. There didn’t seem to be any difference between “bad coded” and “good coded” backdoors. We’ll release those results in an upcoming blog post!
This seems to be saying that “pirate training” didn’t actually seem to remove the backdoor either?
Yep! As in, the backdoor is still present in the weights. I’m unsure whether that’s the best we could do. In the same way that training a model to talk like a pirate doesn’t cause it to forget how to talk in English, maybe it’d be too optimistic to think we could totally remove a backdoor rather than simply suppressing it.