Not speaking for the other authors here. I agree, but I also think there are responsible ways to do this that mitigate the dual-use risk. For instance, it is valuable (and arguably necessary!) to build good prototypes of secret loyalties so that we can study models’ deep motivations. However, it would be bad to explain how to make secretly loyal models. One option, then, is to produce the model organisms and document the study of them, but not describe how they were produced.
Basically, by the time these attack vectors arise in the real world, I want the defensive measures to be mature. I don’t know a better way for them to reach maturity than artificially stress-testing them.
I’m hesitant to share the work test completely publicly because it risks getting Goodharted. That is, if another org used this as a timed work test and applicants had already had months to prepare for it, it would stop being a valid measurement of candidate quality.
The compromise here is that I’m happy to share the work test and rubric in private correspondence with people who are going to use it for conducting interviews. But I can also describe the broad strokes of what it entailed. Essentially, there were three parts: explaining a research gap in the current AI safety landscape, describing how you’d approach it, and explaining how you’d disseminate the results.