Nice. We pointed to safety benchmarks as a “number goes up” thing for automating alignment tasks here: https://www.lesswrong.com/posts/FqpAPC48CzAtvfx5C/concrete-projects-for-improving-current-technical-safety#Focused_Benchmarks_of_Safety_Research_Automation
I suspect people are avoiding building these benchmarks for ‘capabilities’ reasons, but they shouldn’t: we can get a lot of safety value out of models if we take the lead on this.