One definition of “safetywashing” is, by analogy with greenwashing, a situation where
(1) The desire for good PR drives companies to advertise their safety. This leads to no research, fake research, or research that is safety-themed but where few people involved actually care about x-risk, resulting in minimal expected safety impact.
Hendrycks has an alternate definition from a 2024 NeurIPS paper, where
(2) a benchmark is safety-washed if it is advertised as a safety benchmark but highly correlated with general capabilities, and thus mostly incentivizes devs to improve general capabilities rather than safety. (Incidentally, TruthfulQA is one of these, but this doesn’t make all papers using it safetywashed.)
Under neither definition is the Anthropic fellows program safetywashing. (1) is not true because at least 4/6 (I would argue around 5.25/6) of the research is actually targeted at safety, mentors care about safety, and there is likely positive expected impact. I will also note that in definition (1) at least, intent is necessary to the classification of safetywashing, so “accidental safetywashing” is about as much a non sequitur as “accidental first-degree murder”.
(2) is not true even if you expand the definition to include advertising capabilities research as safety, because the one dual-use paper still has safety applications and is advertised as capabilities! My main concerns with (2) are the dozens of benchmarks and papers since 2023 that use “alignment” as a buzzword, and attempts by companies to present techniques they would need for their product anyway as safety work. This is why I think Ethan Perez sometimes mentoring dual-use papers, or papers that end up making non-safety observations in the larger context of safety projects, is neither a good example of safetywashing nor an important concern.
I will also note that in definition (1) at least, intent is necessary to the classification of safetywashing, so “accidental safetywashing” is about as much a non sequitur as “accidental first-degree murder”.
No, this is obviously not the case! You do not need conscious intent to do greenwashing! Where on earth did you get that definition from? Most greenwashing, safetywashing, and other kinds of harmful PR management do not involve conscious optimization or obvious scheming.