One tip for research of this kind is to not only measure recall, but also precision. It’s easy to block 100% of dangerous prompts by blocking 100% of prompts, but obviously that doesn’t work in practice. The actual task that labs are trying to solve is to block as many unsafe prompts as possible while rarely blocking safe prompts, or in other words, looking at both precision and recall.
Of course with truly dangerous models and prompts, you do want ~100% recall, and in that situation it’s fair to say that nobody should ever be able to build a bioweapon. But in the world we currently live in, the amount of uplift you get from a frontier model and a prompt in your dataset isn’t very much, so it’s reasonable to trade off against losses from over refusal.
The mundane prompts were blocked 0% of the time. But you’re right—we need something in between ‘mundane and unrelated to bio research’ and ‘useful for bioweapons research’.
But I’m not sure what—here we are looking at lab wetwork ability. It seems that that ability is inherently dual-use.
One tip for research of this kind is to not only measure recall, but also precision. It’s easy to block 100% of dangerous prompts by blocking 100% of prompts, but obviously that doesn’t work in practice. The actual task that labs are trying to solve is to block as many unsafe prompts as possible while rarely blocking safe prompts, or in other words, looking at both precision and recall.
Of course with truly dangerous models and prompts, you do want ~100% recall, and in that situation it’s fair to say that nobody should ever be able to build a bioweapon. But in the world we currently live in, the amount of uplift you get from a frontier model and a prompt in your dataset isn’t very much, so it’s reasonable to trade off against losses from over refusal.
The mundane prompts were blocked 0% of the time. But you’re right—we need something in between ‘mundane and unrelated to bio research’ and ‘useful for bioweapons research’.
But I’m not sure what—here we are looking at lab wetwork ability. It seems that that ability is inherently dual-use.