I’m not sure that the gpt-oss safety paper does a great job at biorisk elicitation. For example, they found that fine-tuning for additional domain-specific capabilities increased average benchmark scores by only 0.3%. So I’m not very confident in their claim that “Compared to open-weight models, gpt-oss may marginally increase biological capabilities but does not substantially advance the frontier”.