$PERSON at $LAB once showed me an internal document describing dangerous capability benchmarks, which are used negatively: unlike positive benchmarks, where a model isn't shipped to prod if it scores under some threshold, these can block a model from going to prod if it scores over some threshold. I asked, "You frame this benchmark as a bad thing, and it's a bad thing at your shop, but how do you know it won't be used in a sign-flipped way at another shop?" He said, "Well, we just call it EvilBench and no one will want to score high on EvilBench."
It sounded like a ridiculous answer, but it may actually be true in the case of labs. It is extremely not true in the open-weight case: obviously Hugging Face user Yolo4206969 would love to score high on EvilBench.
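The distinction is just a flipped inequality in the release gate. Here's a minimal sketch of what I mean; the gating function, benchmark names, and thresholds are all hypothetical, not anyone's actual process:

```python
def release_gate(scores: dict[str, float]) -> bool:
    """Decide whether a model can ship, given its benchmark scores."""
    # Positive benchmark: block the model if it scores too LOW.
    POSITIVE_FLOORS = {"HelpfulBench": 0.80}
    # Negative (dangerous capability) benchmark: block if it scores too HIGH.
    NEGATIVE_CEILINGS = {"EvilBench": 0.10}

    for bench, floor in POSITIVE_FLOORS.items():
        if scores[bench] < floor:
            return False  # under-performs on a capability we want
    for bench, ceiling in NEGATIVE_CEILINGS.items():
        if scores[bench] > ceiling:
            return False  # over-performs on a capability we don't want
    return True
```

The "sign-flipped" misuse is exactly what it sounds like: someone fine-tuning open weights can take the negative benchmark's ceiling and treat it as a target to maximize instead.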
This is exactly why the bio team for WMDP deliberately included distractors covering relatively less harmful material. We didn't want to publish a public benchmark that gave a laser-focused "how to be super dangerous" score. We aimed for a fuzzier decision boundary.
This drew criticism from experts at the labs, who said the benchmark included too much harmless material. I still think the trade-off was worthwhile.