I think labs are incentivised to share safety research even when they don’t share capability research. This is follows a simple microeconomic model, but I wouldn’t be surprised if the prediction was completely wrong.
Asymmetry between capability and safety:
Capability failures are more attributable than safety failures. If ChatGPT can’t solve a client’s problem, it’s easy for Anthropic to demonstrate that Claude can, so the client switches. But if ChatGPT blackmails a client, it’s difficult for Anthropic to demonstrate that Claude is any safer (because safety failures are rare), so the client is dissuaded from AI in general.
Capability failures are more recoverable than safety failures. If ChatGPT can’t solve a problem, the client tries again with Claude. If ChatGPT causes extinction, the clients are dead.
What this predicts:
Labs are incentivised to share safety research and disincentivised to share capability research.
Labs will underinvest in safety relative to what’s socially optimal. Each lab bears only a fraction of the benefit from its safety investment — the rest spills over to competitors via a healthier industry.
The result is an industry that is more coordinated than you might expect (on safety sharing) but less safe than it should be (on safety investment).
The primary application of “safety research” is improving refusal calibration, which, at least from a retail client’s perspective, is exactly like a capability improvement: it makes no difference to me whether the model can’t satisfy my request or can but won’t. It’s easy to demonstrate differences in this regard – simply show one model refusing a request another fulfills – so I disagree that this would cause clients to be “dissuaded from AI in general.”
I think labs are incentivised to share safety research even when they don’t share capability research. This is follows a simple microeconomic model, but I wouldn’t be surprised if the prediction was completely wrong.
Asymmetry between capability and safety:
Capability failures are more attributable than safety failures. If ChatGPT can’t solve a client’s problem, it’s easy for Anthropic to demonstrate that Claude can, so the client switches. But if ChatGPT blackmails a client, it’s difficult for Anthropic to demonstrate that Claude is any safer (because safety failures are rare), so the client is dissuaded from AI in general.
Capability failures are more recoverable than safety failures. If ChatGPT can’t solve a problem, the client tries again with Claude. If ChatGPT causes extinction, the clients are dead.
What this predicts:
Labs are incentivised to share safety research and disincentivised to share capability research.
Labs will underinvest in safety relative to what’s socially optimal. Each lab bears only a fraction of the benefit from its safety investment — the rest spills over to competitors via a healthier industry.
The result is an industry that is more coordinated than you might expect (on safety sharing) but less safe than it should be (on safety investment).
The primary application of “safety research” is improving refusal calibration, which, at least from a retail client’s perspective, is exactly like a capability improvement: it makes no difference to me whether the model can’t satisfy my request or can but won’t. It’s easy to demonstrate differences in this regard – simply show one model refusing a request another fulfills – so I disagree that this would cause clients to be “dissuaded from AI in general.”
I disagree that the primary application of safety research is improving refusal calibration. This take seems outdated by ~12 months.