Exposing the weaknesses of fine-tuned models like the Llama 3.1 Instruct models to refusal vector ablation is important, because the industry currently seems to over-rely on these safety techniques.
It’s worth noting that refusal vector ablation isn’t even necessary for this sort of malicious use with Llama 3.1, though, because Meta also released the base pretrained models without instruction fine-tuning (unless I’m misunderstanding something?).
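(For anyone unfamiliar with the technique being discussed: refusal vector ablation roughly means estimating a single "refusal direction" in the model's activation space and projecting it out of the hidden states, so the model can no longer represent the decision to refuse. This is a minimal illustrative numpy sketch of that projection step, not the actual implementation from the paper; the direction-estimation and model-hooking details are omitted.)

```python
import numpy as np

def ablate_direction(hidden: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of each hidden state along `direction`.

    hidden:    (seq_len, d_model) activations from one layer
    direction: (d_model,) refusal direction, e.g. the mean difference
               between activations on harmful vs. harmless prompts
               (a hypothetical estimate for this sketch)
    """
    r = direction / np.linalg.norm(direction)  # unit-normalize
    # h - (h . r) r for every position: project out the refusal component
    return hidden - np.outer(hidden @ r, r)

# Toy demo with random activations in place of real model states.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
r = rng.normal(size=8)
out = ablate_direction(h, r)
# After ablation, no hidden state has any component along the direction.
assert np.allclose(out @ (r / np.linalg.norm(r)), 0.0)
```

In practice this projection is applied at every layer (or baked into the weights) during generation, which is what makes the edit hard to undo with prompting alone.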
Saw that you have an actual paper on this out now. Didn’t see it linked in the post, so here’s a clickable link for anyone else looking: https://arxiv.org/abs/2410.10871 .
Hi Evan, I published this paper on arXiv recently, and it was also accepted at the SafeGenAI workshop at NeurIPS in December this year. Thanks for adding the link. I will probably work on the paper again and put an updated version on arXiv, as I am not quite happy with the current version.
I think that using the base model without instruction fine-tuning would prove problematic for multiple reasons:
1. In the paper I use the new 3.1 models, which are fine-tuned for tool use; the base models were never fine-tuned to use tools through function calling.
2. Base models are highly random and hard to control; they are not really steerable and require very careful prompting/conditioning to do anything useful.
3. Current post-training seems to improve performance across essentially all benchmarks.
I am also working on using such agents and directly evaluating how effective they are at spear phishing humans: https://openreview.net/forum?id=VRD8Km1I4x