Indeed, this is the case for open-source models and all known alignment techniques: the fine-tuning cost of eliminating all safeguards is trivial. I have not seen even a theoretical proposal for how to change this unfortunate reality.
Actually, there are multiple such proposals: you could build the safeguards in throughout the entire pretraining run, so they are much harder to fine-tune out. See this post and this paper for examples. (To some extent we already do this just by prefiltering the training set.) Or you could rip the problematic behavior out after pretraining but before open-sourcing the weights; see this post for a long list of ideas on how to do that. But it certainly is harder than aligning an API-only model.
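To make the prefiltering idea concrete, here is a toy sketch: drop documents matching a blocklist of unwanted phrases before they ever reach the pretraining run. The blocklist, corpus, and function names are all hypothetical illustrations; real filtering pipelines use classifiers rather than substring matching.

```python
# Toy sketch of pretraining-data prefiltering (hypothetical blocklist).
# Real pipelines would use trained classifiers, not substring matching.

BLOCKLIST = {"build a pipe bomb", "synthesize nerve agent"}

def is_safe(doc: str) -> bool:
    """Return True if the document contains none of the blocked phrases."""
    lowered = doc.lower()
    return not any(phrase in lowered for phrase in BLOCKLIST)

def prefilter(corpus: list[str]) -> list[str]:
    """Keep only documents that pass the safety check."""
    return [doc for doc in corpus if is_safe(doc)]

corpus = [
    "A recipe for sourdough bread.",
    "How to build a pipe bomb at home.",
    "Notes on transformer architectures.",
]
print(prefilter(corpus))  # the middle document is dropped
```

The point of doing this at the data stage, rather than in post-training, is that the model never learns the capability in the first place, so there is nothing for a later fine-tune to cheaply recover.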
I’m still waiting for someone to write a “Responsible Open-Sourcing Policy”. Maybe one of the AI governance orgs should take a shot at a first draft? Then see if we can get HuggingFace to adopt it?