Thanks for pointing that out; I've edited the post.
Since you’re working on safety at Anthropic, I would be interested to hear from you on two other points:
What motivated the removal of threat models related to radiological and nuclear weapons in the RSP v3.0 update?
What specific safeguards have been put in place to prevent recurrence of the inclusion of chain-of-thought content in reward computation?
I’d say the situation is even worse than that. I’ve recently had one-on-ones with researchers in AI safety governance/policy, and most of them didn’t seem to have engaged with the object-level arguments. I suspect the same holds for most technical AI safety researchers, but my sample is small.
I suspect it’s mainly the lack of object-level thinking about the problem that leads many orgs and researchers to a prioritization that seems miscalibrated to me. The majority focus on misuse risks rather than on existential risks from loss-of-control scenarios, even though the expected impact of the latter is far greater.