I wrote this for someone but maybe it’s helpful for others
What labs should do:
I think the most important things for a relatively responsible company are control and security. (For irresponsible companies, I roughly want them to make a great RSP (responsible scaling policy) and thus become responsible companies.)
Reading recommendations for people like you (not a control expert, but with enough context to mostly understand the Greenblatt plan):
Control: Redwood blogposts[1] or ask a Redwood human “what’s the threat model” and “what are the most promising control techniques”
Security: not worth trying to understand but there’s A Playbook for Securing AI Model Weights + Securing AI Model Weights
A few more things: What AI companies should do: Some rough ideas
Lots more things + overall plan: A Plan for Technical AI Safety with Current Science (Greenblatt 2023)
More links: Lab governance reading list
What labs are doing:
Evals: it’s complicated; OpenAI, DeepMind, and Anthropic seem close to doing good model evals for dangerous capabilities; see DC evals: labs’ practices, plus the links in its top two rows (the associated blogposts and model cards)
RSPs: all existing RSPs are super weak and you shouldn’t expect them to matter; maybe see The current state of RSPs
Control: nothing is happening at the labs, except a little research at Anthropic and DeepMind
Security: nobody is prepared; nobody is trying to be prepared
Internal governance: you should basically model all of the companies as doing whatever leadership wants. In particular: (1) the OpenAI nonprofit is probably controlled by Sam Altman and will probably lose control soon, and (2) the Anthropic LTBT could possibly matter, but it doesn’t seem to be working well.
Publishing safety research: DeepMind and Anthropic publish some good stuff but surprisingly little given how many safety researchers they employ; see List of AI safety papers from companies, 2023–2024
Resources:
Lab governance reading list
AI Lab Watch blog
Maybe:
Untrusted smart models and trusted dumb models
AI Control: Improving Safety Despite Intentional Subversion
The case for ensuring that powerful AIs are controlled