From group 1 → Online learning for research sabotage mitigation:
Ideas for a more safety-relevant domain would be appreciated
This task suite from Goodfire might be a possibility?
Cons:
This suite was made specifically to test a notebook editor MCP, so it might need tweaking
Almost certainly has fewer tasks than the Facebook environment
Models will likely not do very well in this environment by default, since interp is presumably more OOD than ML
Thank you for the suggestion!