There’s an open question about whether there’s something useful for AI safety, in the x-risk sense, to be learned by training models not to engage in certain behaviors, such as creating pornographic images.
If alignment is about getting models to do what you want and to avoid certain negative behaviors, then researching how to get models to censor certain outputs could theoretically produce insights for alignment.
Instead, it’s viewed as a sort of “safety theater,” similar to the so-called “Hygiene Theater” that caused mass harm during the COVID-19 pandemic by making people lose trust in public health institutions (whether deservedly or not).
Even if hygiene theater doesn’t produce direct benefits, it’s still possible to learn all sorts of things from it about how providing healthcare works.
What the CDC did wasn’t a research project; it was justified by object-level concern. DALL-E is primarily a research project, not a project to help artists.
This has happened partly due to the general corporate desire to frame concerns over liability as being about the user’s safety (rather than the company’s), but it’s also partly our fault.
The user and the company are not the only stakeholders. If a user creates a deepfake and harms a third party, OpenAI wants to prevent that not only because of company liability but also because they think deepfakes can be a general problem for society.
If alignment is about getting models to do what you want and to avoid certain negative behaviors, then researching how to get models to censor certain outputs could theoretically produce insights for alignment.
I was referred by 80k Hours to talk to a manager on the OpenAI safety team, who argued exactly this to me. I didn’t join, so I have no idea to what extent it makes sense versus just being a nice-sounding idea.
If alignment is about getting models to do what you want and to avoid certain negative behaviors, then researching how to get models to censor certain outputs could theoretically produce insights for alignment.
This is true, but then you don’t have to force the censorship on users. It is an abusive practice that might have safety benefits, but it is already pushing forward the failure mode of wealth centralization as a result of AI. (That is itself an x-risk, even if the AI is dumb enough that it is not dangerous by itself.)