Maybe there should be an AI safety org whose job is to iterate on model organisms of various safety failures, send these model organisms to other orgs for blueteaming, and then further improve the model organisms.
Some advantages of this setup:
The org could have a very good pipeline for making model organisms, and in general create model organisms that are better than everyone else’s. Maybe they could even sell model-organisms-as-a-service.
There could be centralized default model organisms for other orgs to iterate on, and these could serve as good baselines for testing whether safety techniques are useful in practice. Does Transluce’s auditing work? Redwood’s elicitation techniques? Anthropic’s interpretability? Etc.
Having a persistent back-and-forth might allow for model organisms that are better than those developed during one-off projects at various other orgs, which then receive little continuous follow-up.
More generally, developing a good science of model organisms seems under-invested in.