Public-facing Censorship Is Safety Theater, Causing Reputational Damage
It’s so common it’s a stereotype.
A large corporation releases a cutting-edge AI model, and puts out a press release talking about how their new, [larger/smaller]-than-ever model provides unprecedented freedom for [underprivileged artists/small business owners/outside researchers] to do whatever it is their AI does. You go to their website, start playing with the model, and before long—
Results containing potentially sensitive content have been omitted. Further requests of this type may result in account suspension, etc., etc., etc....
—or something along those lines. The prompt you gave was pretty innocuous, but in retrospect you can sort of see how maybe the output might have resulted in something horrifically offensive, like a curse word, or even (heaven forbid) an image that has a known person’s face in it. You’ve been protected from such horrors, and this is reassuring. Of course, your next prompt for whatever reason elicits [insert offensive stereotype/surprisingly gory or uncanny imagery/dangerously incorrect claim presented with high confidence/etc. here], which is slightly less reassuring.
Checking the details of the press release, you see a small section of the F.A.Q. with the disclaimer that some outputs may be biased due to [yadda yadda yadda you know the drill]. You breathe a sigh of relief, secure in the comforting knowledge that [faceless company] cares about AI safety, human rights, and reducing biases. Their model isn’t perfect, but they’re clearly working on it!
The above scenario is how [large corporations] seem to expect consumers to react to their selective censorship. In reality I strongly suspect that the main concern is not so much protecting the consumer as it is protecting themselves from liability. After all, by releasing a model which is clearly capable of doing [harmful capability], and by giving sufficient detail to the public that their model can be replicated, [harmful capability] has effectively been released, if perhaps delayed by a few months at most. However, whoever makes the open-source replication will not be [large corporation], absolving the company of perceived moral (and legal) culpability in whatever follows. If the concern were actually that [harmful capability] would lead to real danger, then the moral thing to do would be not to release the model at all.
There are a few serious problems with this. The most obvious (and generic) objection is that censorship is bad. When looking at historical incidents of censorship we often find ourselves morally disagreeing with the censors, who got to choose what is considered inappropriate from a position of power. Almost everyone agrees that Hollywood’s infamous Hays code was a moral mistake. In the present day, inconsistent or weaponized social media censorship is widespread, with seemingly nobody happy with how large corporations enforce their rules (though the details of how they are failing are arguable). At least one Chinese text-to-image model disallows prompts which include the word “democracy”. It would be surprising to me if protections against generating certain forms of content with LLMs don’t eventually lead to unexpected negative social consequences.
Secondly, there is a danger of AI safety becoming less robust—or even optimising for deceptive alignment—in models using front-end censorship. If it’s possible for a model to generate a harmful result from a prompt, then the AI is not aligned, even if the user can’t see the bad outputs once they are generated. This will create the illusion of greater safety than actually exists, and (imo) is practically begging for something to go wrong. As a “tame” example, severe bugs could crop up which are left unaddressed until it’s too late because nobody has access to “edge-case” harmful generations.
The third argument is a bit more coldly utilitarian, but is extremely relevant to this community: Calling content censorship “AI safety” (or even “bias reduction”) severely damages the reputation of actual, existential AI safety advocates. This is perhaps most obviously happening in the field of text-to-image generation. To illustrate, I present a few sample Tweets from my timeline (selected more-or-less randomly among tweets using the search term “AI safety” and “AI ethics”):
I think the predicament we are facing is clear. The more that public-facing censorship is presented as being a demonstration of AI safety/ethics, the more people tend to dismiss the AI safety field as a whole. This damages our ability to connect with people (especially in the open-source movement) who might otherwise be interested in collaborating, and gives motivation for adversarial actions against our efforts. My background is partially in Public and Media Relations, and if that were my current job here, I would be going into serious damage-reduction mode right now!
This has happened in part due to the general corporate desire to frame concerns over liability as being about the user’s safety (instead of the company’s), but it’s also partially our fault. OpenAI, for instance, is viewed by many as the poster-child of the classical “AI safety” camp (whether deserved or not), and what is it most famous for on the safety front? Its multi-tiered release of GPT-2 (and subsequent non-delayed release of GPT-3 for some reason), and its ban against using DALL-E to generate or edit photorealistic faces! Regardless of if those are good measures to take or not, the fact of the matter is that at some point, a decision was made that this would be marketed as “AI safety” and “minimizing risk,” respectively.
While we can’t take back what’s already been said and done, for the future I would like people in this field to take a stronger stance against using safety-related terminology in places where you’re likely to be seen by outsiders as the boy who cried “wolf”. Alternatively, perhaps we should make a clearer distinction between types of safety efforts (bias reduction, existential risk reduction, etc.), using specialized terminology to do so. It would be foolish to undermine our own efforts at raising awareness because we missed such an easy concept to understand: Censorship is not viewed as safety by the vast majority of people. Instead, it’s viewed as a sort of “safety theater,” similar to so-called “Hygiene Theater,” which caused mass harm during the COVID-19 pandemic by making people lose trust in public health institutions (deserved or not). We should do everything in our power to reduce the negative effects such “AI safety theater” may cause to our community.
What practical steps can be done?
The following list includes some suggestions given above, along with some more tentative proposals. I do not have rigorous evidence for everything mentioned blow (this is more personal intuition), so feel free to take this with a grain of salt:
If you can do so ethically, try to minimize the amount of overt censorship used in public-facing models.
If for whatever reason you need to explicitly block some forms of content, the details of implementation matter a lot, with back-end preventative work being preferred over front-end user-facing censorship. For example, banning sexually suggestive keywords and prompts (a front-end approach) will feel much more subjectively oppressive than not having your model trained on sexually suggestive data in the first place (a back-end approach which also prevents suggestive text/image outputs). Obviously, what you can practically achieve will vary depending on your situation.
If censorship is being utilized to reduce personal/company liability, say so explicitly! Currently, many people seem to think that calls for “ethics,” “safety,” or “bias reduction” are being used for the primary purpose of protecting corporate interests, and we really, really do not want to feed that particular beast. Saying “we can’t do [thing you want] because we’ll get sued/lose access to some services” is a lot less harmful than saying “we can’t do [thing you want] because we know what’s best for you and you don’t.” (this is often a useful framing even if you think the latter is true!)
Make clearer distinctions between types of safety efforts (bias reduction, existential risk reduction, etc.), using specialized terminology to do so. Perhaps new terminology needs to be coined, or perhaps existing concepts will do; this is something that can and should be discussed and iterated on within the community.
Be willing to speak to the media (as long as you have some amount of training beforehand) about what the field of AI safety is really trying to achieve. Most publications source their news about the field from press releases, which tend to come with a corporate, “everything we’re doing is for the good of humanity” vibe, and that may occasionally be worth pushing back against if your field is being misrepresented.
Feel free to suggest further possible actions that can be done in the comments below!
Especially considering that among many other harms, it was used to prevent anti-Nazi films from being produced!
For example, enforcement against generating sexually explicit content is likely to be stricter with some media (think queer/feminist/war coverage stuff), leading to exacerbated asymmetry in depicting the human condition. What about classical art, or cultures with totally healthy customs considered explicit in other contexts (such as nudists)? Some of this could be resolved in the future with more fine-tuned filters, but there isn’t strong incentive to do so, and evidence from existing social media censorship points to this not happening in a nuanced manner.
I define front-end censorship as when the user asks for something which is then denied, though the theoretical possibility to create/access it clearly exists; this is different from more subtle “back-end” forms.
To be clear, it may be the case that censorship is the right thing to do in some circumstances. However, please keep in mind that this community’s most famous unforced error has been related to censorship, and if you are reading this, you are unlikely to have typical views on the subject. Regardless of the ground truth, most people will perceive front-end censorship (as opposed to more subtle back-end censorship which may not receive the same reception) as being net negative, and an intrusive action. Some exceptions to this general rule do exist, most notably when it comes to blatantly illegal or uncontroversially unethical content (child pornography, nonconsensually obtained private information, etc.), but even then, some will still be unhappy on principle. One cannot make everyone perfectly content, but should still work to reduce potential damage when possible.