I know that Anthropic doesn’t really open-source advanced AI, but it might be useful to discuss this in Anthropic’s RSP anyway, because one way I see things going badly is people copying Anthropic’s RSPs and directly applying them to open-source projects without accounting for the additional risks this entails.
I believe that meeting our ASL-2 deployment commitments—e.g. enforcing our acceptable use policy, and data-filtering plus harmlessness evals for any fine-tuned models—with widely available model weights is presently beyond the state of the art. If a project or organization makes RSP-like commitments, evaluates and mitigates risks, and can uphold that while releasing model weights… I think that would be pretty cool.
(also note that e.g. Llama is not open source—I think you’re talking about releasing weights; the license doesn’t affect safety, but as an open-source maintainer the distinction matters to me)
Ideological adherence to open source seems to act like a religion; arguing against the universal applicability of its central tenets won’t succeed with only reasonable effort. Unless you state something very explicitly, it will be ignored, and probably even then.
Enforcement of mitigations when it’s someone else who removes them won’t be seen as relevant, since in this religion a contributor is fundamentally not responsible for how the things they release will be used by others. Arguments to the contrary, even for particular, very unusual cases, slide right off.
This may be true of people who talk a lot about open source, but among actual maintainers the attitude is pretty different. If some user causes harm with an overall positive tool, that’s on the user; but if the contributor has built something consistently or overall harmful, that is indeed on them. Maintainers tend to avoid working on projects which are mostly useful for surveillance, weapons, etc. for pretty much this reason.
Source: my personal experience as a maintainer and PSF Fellow, and the multiple Python core developers I just checked with at the PyCon sprints.
“if the contributor has built something consistently or overall harmful, that is indeed on them”
I agree, this is in accord with the dogma. But for AI, overall harm is debatable and currently purely hypothetical, so this doesn’t really apply. There is a popular idea that existential risk from AI has little basis in reality, since it’s not already here to be observed. Thus contributing to public AI efforts remains fine (and in terms of first-order effects it is perfectly fine right now).
My worry is that this attitude reframes commitments from RSP-like documents, so that people don’t see the obvious implication that releasing weights breaks the commitments (absent currently impossible feats of unlearning), and don’t see themselves as making a commitment to avoid releasing high-ASL weights even as they commit to such RSPs. If this point isn’t written down, some people will only become capable of noticing it if actual catastrophes shift the prevailing attitude to open-weights foundation models being harmful overall (even once we’re already higher up in the ASLs). And that shift doesn’t necessarily happen even if there are some catastrophes with a limited blast radius, since those get balanced against the positive effects.
“Presently beyond the state of the art… I think that would be pretty cool”
Point taken, but that doesn’t make it sufficient for avoiding society-level catastrophes.
I think this is implicit — the RSP discusses deployment mitigations, which can’t be enforced if the weights are shared.
That’s the exact thing I’m worried about: that people will equate deploying a model via API with releasing open weights, when the latter carries significantly more risk due to the potential for future modification and the impossibility of withdrawing it.