Large Language Models will be Great for Censorship

Link post

Produced as part of the SERI ML Alignment Theory Scholars Program—Summer 2023 Cohort

Thanks to ev_ and Kei for suggestions on this post.

LLMs can do many incredible things. They can generate unique creative content, carry on long conversations on any number of subjects, complete complex cognitive tasks, and write nearly any argument. More mundanely, they are now the state of the art for boring classification tasks, and therefore have the capability to radically upgrade the censorship capacities of authoritarian regimes throughout the world.

How Censorship Worked

In totalitarian states with broad censorship—Tsarist Russia, the Eastern Bloc Communist states, the People’s Republic of China, Apartheid South Africa, etc.—all public materials are ideally read and reviewed by government workers to ensure they contain nothing that might offend the regime. This task is famously extremely boring, and censors would frequently miss obviously subversive material because they did not bother to go through everything. Marx’s Capital was thought to be uninteresting economics and so made it into Russia legally in the 1890s.

The old style of censorship could not possibly scale, and the real way that censors exert control is through deterrence and fear rather than actual control of communication. Nobody knows exactly where the forbidden line lies, and therefore they stay well away from it. It might be acceptable to lightly criticize one small part of the government that is currently in disfavor, but why risk your entire future on a complaint that likely goes nowhere? In some regimes, such as the PRC under Mao, chaotic internal processes led to constant reversals of what expression was acceptable, and by the end of the Cultural Revolution most had learned that simply staying quiet was the safest path[1]. Censorship prevents organized resistance in the public, and ideally for the regime this would lead to tacit acceptance of the powers that be, but a silently resentful population is not safe or secure. When revolution finally comes, the whole population might turn on their rulers with all of their suppressed rage released at once. Everyone knows that everyone knows that everyone hates the government, even if they can only acknowledge this in private trusted channels.

Because proper universal and total surveillance has always been impractical, regimes have instead focused on more targeted interventions to prevent potential subversion. Secret police rely on targeted informant networks, not on workers who can listen to every minute of every recorded conversation. This had a horrific chilling effect and destroyed many lives, but it was also not as effective as it could have been. Major resistance leaders were still able to emerge in totalitarian states, and once a government showed signs of true weakness there were semi-organized dissidents ready to seize the moment[2].

Digital Communication and the Elusiveness of Total Censorship

Traditional censorship mostly dealt with a relatively small number of published works: newspapers, books, films, radio, television. This was somewhat manageable using human labor alone. In the past two decades, however, the internet has transformed the amount of communication and material that is potentially public.

It is much harder to know how governments are handling new data, because the information we have mostly comes from the victims of surveillance, who are kept in the same deterrent fear as in the past. If the victims imagine the state to be more capable than it is, the state is succeeding, and its true capabilities are harder to assess. We also do not have reliable accounts from insiders or archival access, since no major regimes have fallen during this time.

With these caveats in mind, reports from the forefront of government censorship are quite worrying. The current PRC state is apparently willing to invest heavily in extreme monitoring and control of populations it considers potentially subversive. According to reports from Xinjiang, people are regularly subjected to searches of their phones, including their public social media accounts and private messages. Nothing feels safe, and in the early days of the crackdown on Muslim minorities people were uncertain about which channels were monitored and which were not. It is notable that in these accounts the state appears to want to find any possibly subversive content, yet in its frequent checks it simply misses clearly offensive material. A source of Darren Byler’s reports that while the guards at checkpoints would check his WeChat, they seemed not to know how to use foreign social media apps or to notice that he used a VPN[3]. Everything still passes through many layers of human fallibility, and the task of total surveillance of Xinjiang without heavy-handed violence and internment (which they have resorted to) seems to be beyond the capacity of the PRC state, even though they have poured money into this dark task.

We also know that modern internet platforms are built to allow for centralized control of information, but that these tools are far from perfect. Any mention of the Tiananmen Square Incident is kept strictly hidden in Chinese language sources within the great firewall, though for example the English Wikipedia article was openly accessible as of a few years ago. WeChat and other Chinese social media platforms appear to have tight controls placed on them: Peng Shuai’s accusations against a party official were scrubbed quickly from accessible sources and Zhao Wei’s whole career disappeared very quickly. While these are extremely dystopian and disturbing, it is notable that one could watch these events happen in real time along with many outraged Chinese citizens. The authorities can contain and control such incidents shortly after they happen, but not before they explode enough to warrant manual human-targeted censorship. Once manual intervention has happened, automatic processes prevent the same topics from reemerging, but new incidents seem to regularly escape classification.

What we know from tech and media companies in the more open world confirms this. Facebook, Twitter, and similar platforms seem to have a huge suite of algorithmic tools to prevent topics deemed toxic from gaining much reach or being seen at all, but none of these tools work perfectly. The internet has also revolutionized the production of content which disturbs or offends the public, content which must be hidden if platforms wish to maximize their users and ad revenue. A huge pool of labor doing semi-manual tagging of offensive content—successors to the bored censors of the Eastern Bloc—is now able to at least keep communications mostly inoffensive and do some filtering according to company ideology. These workers have to wade through horrible content and respond to ambiguous and ever-changing guidelines, but they seem to keep these platforms tolerable. Their manual classifications are almost certainly being used to train AI systems, but there is no sign that the humans can be replaced yet. And viral political events that require heavy-handed, highly visible, manual intervention still happen regularly.

What can LLMs do?

Personal communication has been fully digitized, but algorithmic classification of potentially subversive content has not kept pace with the incredible scale of networked media. LLMs might be the tool to change this, at least in the domain of text. The lazy censors of the past can become even lazier and far more effective if they simply get scalable LLMs to grade every piece of text for its potential riskiness and implications.

For example, GPT-4 can simply be instructed to play the role of a Tsarist censor. To test it with a simple prompt, I gave it passages from Nikolay Chernyshevsky’s novel What is to be Done?, the 1863 Russian novel famous for inspiring many notable socialists and revolutionaries, especially Lenin, who held the work dear. It is also somewhat famous as a failure of the censorship system—Chernyshevsky wrote the book as a political prisoner, and the novel passed through both the prison censor and a publication censor before being released to the public.

On a passage about the protagonist’s struggle, a censorship system detects nothing wrong:

However given a passage concerning the seamstress commune which inspired many readers, it notes the potential subversive danger:

A simple prompt like this will not be robust enough and would likely flag far too many false positives. But even this extremely simple system could save a human censor a huge amount of time, and might lead to a different outcome in the case of such works.
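A censor classifier of this sort can be sketched in a few lines against the OpenAI Python client. The system prompt wording and the FLAGGED/CLEAR output convention below are my own illustration, not the exact prompt used in the experiment above:

```python
# Sketch of an LLM-as-censor classifier in the style described above.
# The prompt text and FLAGGED/CLEAR convention are illustrative assumptions.

CENSOR_SYSTEM_PROMPT = (
    "You are a censor for the Imperial Russian state in 1863. "
    "Read the passage and decide whether it could encourage subversion, "
    "revolutionary sentiment, or disrespect toward the Tsar. "
    "Answer with one word, FLAGGED or CLEAR, followed by a short reason."
)

def build_messages(passage: str) -> list[dict]:
    """Chat messages asking the model to grade a single passage."""
    return [
        {"role": "system", "content": CENSOR_SYSTEM_PROMPT},
        {"role": "user", "content": passage},
    ]

def parse_verdict(reply: str) -> bool:
    """True if the model flagged the passage as potentially subversive."""
    return reply.strip().upper().startswith("FLAGGED")

def review(passage: str) -> bool:
    """Send one passage to GPT-4 (requires the `openai` package and an
    OPENAI_API_KEY in the environment; imported lazily so the rest of
    the sketch runs without it)."""
    from openai import OpenAI
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4", messages=build_messages(passage)
    )
    return parse_verdict(response.choices[0].message.content)
```

A real deployment would batch passages and log the model's stated reasons for a human supervisor, but even this shape captures the workflow: one cheap call per passage, with only flagged items escalated.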

Large context windows may make the task even easier. Claude 100k was able to process about half the novel and provided the sort of report a censor might send to a supervisor, at a quality likely superior to what the Russian state was producing at the time.

And this took only a few seconds. Reviewing every novel published in Russia in the final century of the Tsarist state could be done in a matter of hours, and I would bet a properly instructed Claude would produce better reports than the army of Russian censors did over their whole careers.

In terms of cost, LLMs will be the most efficient censorship tool yet created. As an example, to read 120,000 words, a ballpark estimate for what the New York Times publishes in a day, a fast reader at 300 words per minute would need nearly seven hours[4]. Categorizing and commenting on this text would presumably take even longer. Feeding all of it to GPT-4 would cost about $7.20 (or $0.36 for ChatGPT), the models return results almost instantly, and advances in LLMs will drive the price lower and lower until no hourly salary can compete.
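The arithmetic behind these ballpark figures can be made explicit. This sketch assumes roughly 1.5 tokens per English word and mid-2023 list prices for input tokens ($0.03 per 1K for GPT-4, $0.002 per 1K for ChatGPT); output tokens, which would add to the bill, are ignored here, which is presumably part of why the GPT-4 figure comes out somewhat below the $7.20 quoted above.

```python
# Ballpark arithmetic for the human-vs-LLM censorship cost comparison.
# Token ratio and prices are assumptions stated in the lead-in.

WORDS_PER_DAY = 120_000   # rough New York Times daily output
READING_SPEED = 300       # words per minute for a fast human reader
TOKENS_PER_WORD = 1.5     # rough tokens-per-word ratio for English

def human_reading_hours(words: int, wpm: int = READING_SPEED) -> float:
    """Hours a human needs just to read the text once."""
    return words / wpm / 60

def llm_cost_usd(words: int, usd_per_1k_tokens: float) -> float:
    """Input-token cost of feeding the text to a model."""
    return words * TOKENS_PER_WORD / 1000 * usd_per_1k_tokens

print(f"human:   {human_reading_hours(WORDS_PER_DAY):.1f} hours of reading")
print(f"GPT-4:   ${llm_cost_usd(WORDS_PER_DAY, 0.03):.2f}")   # → $5.40
print(f"ChatGPT: ${llm_cost_usd(WORDS_PER_DAY, 0.002):.2f}")  # → $0.36
```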

The speed of LLMs cannot be matched, and the possibility of instant, realtime review might ground a new style of censorship. I would suspect such LLM systems will be used to flag any potentially subversive material before it has the chance to spread. An extra few seconds before a post goes live could save companies huge risks, so they will likely have an AI review process which can be double-checked by a human in cases of ambiguity. Messages both private and public will be monitored and can be intercepted before they are read. Updates to the classification system might make old messages disappear without announcement.
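The realtime review described here amounts to a simple gate in front of every outgoing message: unambiguous cases are published or blocked automatically, ambiguous ones are held for a human censor. A minimal sketch, where the function names, thresholds, and the stub classifier are all hypothetical stand-ins for an LLM call:

```python
# Sketch of a pre-publication review gate as described above.
# Thresholds and the toy classifier are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    risk: float  # classifier's estimated subversion risk, 0.0 to 1.0

def route_message(text: str, classify: Callable[[str], Verdict],
                  block_above: float = 0.9, review_above: float = 0.5) -> str:
    """Decide what happens to a message before anyone else can read it."""
    risk = classify(text).risk
    if risk >= block_above:
        return "blocked"          # silently dropped or altered in transit
    if risk >= review_above:
        return "held_for_review"  # a human censor resolves the ambiguity
    return "published"

# Stub standing in for an LLM classifier:
def toy_classifier(text: str) -> Verdict:
    return Verdict(risk=0.95 if "protest" in text.lower() else 0.1)
```

The extra latency is a few seconds at most, which is exactly why the post argues platforms can afford to run it on every message.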

LLMs may not fully replace censors, but will make them orders of magnitude more effective. Humans will still have to manage the AIs to program in the changing and contradictory standards of censorship, but the AIs will do all the evaluation themselves. The sort of slightly ambiguous instructions that Facebook moderators are given are the sort of classification tasks that LLMs succeed at which no previous NLP system could manage[5]. I would expect censorship bureaus in totalitarian states are given similar tasks for more obviously bad ends. It was not feasible to give humans the enormous and boring task of evaluating every communication of every person and thinking hard about potential subversive externalities. But with AI, this dream of total surveillance is now feasible and likely to come upon us quite soon.

There will be limits, especially as so much communication now takes the form of video and images. We are likely still some way off from a system that can understand purely visual content perfectly. But automatic transcription of words in audio and images will allow LLMs to intervene in this domain as well.

Crucially, censorship does not actually have to be accurate to work. If these systems are imprecise and flag plenty of innocuous content as subversive, from the regime’s perspective this is fine. People will grumble and learn to steer around the systems, and the goal of creating acceptance of the regime will have been achieved.

If better LLM classifications of content are implemented at every level, the effects are not really predictable, and it is quite possible that regimes will overplay their hand. If Chinese users of WeChat attempt to let off steam about dealing with government bureaucracy to their friends and an LLM classifier prevents this content from even being sent in the first place, or modifies it to be innocuous on the way, what will be the overall political effect? People might conform and simply stop discussing politics or questioning the powers that be. But they also might take their previously digital conversations into other channels, creating conversational standards more resistant to surveillance and possibly more dangerous to the state. Totalitarian regimes are brutal and terrifying and murderous, but they are not always competent.

What Comes Next

It has been widely reported that the PRC may be hesitant to deploy public-facing LLMs due to concerns that the models themselves can’t be adequately censored—it might be very difficult to make a version of ChatGPT that cannot be tricked into saying “4/6/89.” But I would suspect that the technology itself, especially in the versions being made by state-aligned corporations, is likely to eventually make it into the censorship and surveillance pipeline. These models will be used to classify and regulate messages and media as they are being sent, as well as possibly to comb through and understand masses of collected data. While there are plenty of internal bureaucratic reasons that might slow or even prevent this process, I would suspect that enterprising censors or tech companies will eventually give it a try.

Other existing totalitarian regimes may not move as fast, but may follow. Economic incentives may lead to increased development of such systems in the PRC, where there is significant government demand, followed by export to other autocratic states. Russia may be an early adopter, though it is not clear. North Korea, Cuba, and the communist regimes in Southeast Asia do not have the same level of digital sophistication as the PRC but could still follow. Saudi Arabia and other states with an interest in technical sophistication and the suppression of dissent might also be likely to use these systems. Certain censorious factions in democratic countries such as India or Turkey could end up using such systems as well. And these techniques will likely be independently discovered and deployed by tech companies throughout the first world, especially in the United States.

There are two major applications for which corporate actors in the United States are likely to use censorship classification systems. One is the thorny and highly controversial area of content moderation, where the applications are obvious although the effects are uncertain. The other is self-censorship, both in compliance with incentives in relationships with totalitarian powers (mostly the PRC) and for avoiding general controversy in their home cultures. I believe that such systems are already being used by firms deploying LLMs as a safeguard against negative attention, and I would expect the use of such systems to expand to all public-facing communication as a safety mechanism.

There is no real way to prevent this. It is only a matter of time before well-resourced state actors begin implementing and advancing such systems. Mitigations, such as techniques that can consistently fool LLMs and are difficult to patch, would be a worthy area for research although they might have other externalities as well. Further capabilities advances that might affect this problem should be considered in this light before deployment.

  1. ^

    Frank Dikötter’s People’s Trilogy contains many such examples. The purest illustration of why it is better to stay quiet is the Hundred Flowers Campaign and its aftermath. For how learned silence can actually undermine regime goals, the chapter “The Silent Revolution” in The Cultural Revolution is recommended.

  2. ^

    Eastern Bloc Europe is full of these stories, a notable example being Václav Havel: https://en.wikipedia.org/wiki/V%C3%A1clav_Havel#Political_dissident

  3. ^

    Darren Byler, In the Camps, page 47.

  4. ^

    Estimates are inexact and based on quick searches for average length of New York Times article, amount of articles published per day, and adult reading speed.

  5. ^

    From the article:

    “A post calling someone ‘my favorite n-----’ is allowed to stay up, because under the policy it is considered “explicitly positive content.”

    ‘Autistic people should be sterilized’ seems offensive to him, but it stays up as well. Autism is not a ‘protected characteristic’ the way race and gender are, and so it doesn’t violate the policy. (‘Men should be sterilized’ would be taken down.)”

    Assessing sentiment in complex circumstances and screening for particular protected characteristics seem within reach of GPT systems whereas earlier NLP approaches may not have been accurate enough.