A team at EleutherAI, UK AISI, and Oxford University asked:
Can we prevent LLMs from learning unsafe technical capabilities (such as those relevant to biorisk) by filtering out enough of the relevant pretraining data before we begin training a model? Even a fully jailbroken model is unlikely to be helpful if it is deeply ignorant of dangerous knowledge.
They find that data filtering is significantly more tamper-resistant than current safeguards, without degrading general capabilities. It does not, however, protect against the use of dangerous knowledge supplied in context.
Filtering is effective at making models safer.
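The intervention itself is conceptually simple: screen every document before it enters the pretraining corpus. Below is a minimal Python sketch of one plausible two-stage filter, a cheap keyword blocklist followed by a classifier; the blocklist terms, threshold, and `classifier_score` stub are hypothetical placeholders for illustration, not the authors' actual pipeline.

```python
# Sketch of a two-stage pretraining-data filter (illustrative only).
from typing import Iterable, Iterator

# Hypothetical blocklist of terms associated with the unsafe domain.
BLOCKLIST = {"example_hazard_term_1", "example_hazard_term_2"}

def classifier_score(text: str) -> float:
    """Placeholder for a trained topic classifier returning P(unsafe).

    In practice this would be a lightweight model run over every document;
    here it is stubbed out so the sketch stays self-contained.
    """
    return 0.0

def filter_corpus(docs: Iterable[str], threshold: float = 0.5) -> Iterator[str]:
    """Yield only documents that pass both filtering stages."""
    for doc in docs:
        lowered = doc.lower()
        # Stage 1: cheap keyword screen removes obvious matches.
        if any(term in lowered for term in BLOCKLIST):
            continue
        # Stage 2: classifier catches paraphrased or implicit content.
        if classifier_score(doc) >= threshold:
            continue
        yield doc

if __name__ == "__main__":
    corpus = [
        "benign document about chemistry homework",
        "document mentioning example_hazard_term_1",
    ]
    print(list(filter_corpus(corpus)))  # keeps only the benign document
```

The key design point is that filtering happens before any gradient step, so the resulting ignorance cannot simply be fine-tuned back in the way post-hoc safety training can be stripped away.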
https://deepignorance.ai/