Welcome to the AI Safety Newsletter by the Center for AI Safety. We discuss developments in AI and AI safety. No technical background required.
Subscribe here to receive future versions.
Yoshua Bengio makes the case for rogue AI
AI systems pose a variety of different risks. Renowned AI scientist Yoshua Bengio recently argued for one particularly concerning possibility: that advanced AI agents could pursue goals in conflict with human values.
Human intelligence has accomplished impressive feats, from flying to the moon to building nuclear weapons. But Bengio argues that across a range of important intellectual, economic, and social activities, human intelligence could be matched and even surpassed by AI.
How would advanced AIs change our world? Many technologies are tools, such as toasters and calculators, which humans use to accomplish our goals. AIs are different, Bengio says. We often give them a goal and ask them to figure out a solution on their own.
Choosing safe goals for AI systems is an unsolved problem, both technically and politically. If we do not solve this problem, Bengio argues that we could end up building AI agents that pursue harmful goals with superhuman intelligence, which would result in a catastrophe for humanity.
Four steps to rogue AI. Bengio begins by defining rogue AI as “an autonomous AI system that could behave in ways that would be catastrophically harmful to a large fraction of humans, potentially endangering our societies and even our species or the biosphere.”
He argues that a rogue superintelligent AI agent is possible, in four steps:
Machines could reach and surpass human abilities. Bengio argues that, in principle, there is nothing the human brain does that a machine could not do. We’ve already built AIs that beat humans at chess, other games, and a range of cognitive tests. These systems have important advantages over human intelligence. Computers can process information much more quickly than humans – for example, language models are trained by reading the entire internet, which would be impossible in a human lifetime. Humans take years to reproduce, but an AI system can be replicated onto many computers at once. These kinds of simple advantages mean that if we develop human-level AI, we might soon thereafter get superhuman AI.
AIs can be turned into agents which take actions to pursue goals. We’ve already seen examples of GPT-4 being used to play Minecraft or browse the internet. The entire field of reinforcement learning builds AI agents that take actions to pursue goals such as winning board games and cooling data centers. Bengio says that if we build superintelligent AIs, we should expect that they could easily be directed to pursue goals by taking actions in the world.
A superintelligent AI agent could pursue goals that conflict with human values. If someone builds a superintelligent AI agent with dangerous goals, Bengio believes the AI agent could behave “in catastrophically harmful ways.”
Why would an AI’s goals conflict with human values? Bengio offers a variety of reasons why the goals of an AI system might not peacefully coexist with human values.
Malicious humans. Someone could deliberately give an AI the goal of causing harm. We’ve already seen this, with someone telling ChatGPT to formulate a plot for world domination.
Goal misspecification. We need to measure a goal in order to train AIs to pursue it. But many important human values are difficult to measure. Therefore, we often train AIs to pursue simple metrics, like keeping someone scrolling on a social media app, which can lead to addiction and undermine wellbeing. Training AIs to promote human flourishing could prove very difficult.
Instrumental subgoals. For any final goal that an AI system is tasked with achieving, the AI might find it useful to pursue certain subgoals along the way. For example, if an AI pursues financial resources, political power, or social influence, these subgoals could help it achieve many final goals that a human might provide. But these subgoals might conflict with human values.
Evolutionary pressure. AIs that successfully self-propagate will be more numerous in the future. This evolutionary process will encourage AIs to behave selfishly, by gaining power in the world and working to maintain that influence into the future.
How to minimize the risk of rogue AI. Bengio recommends more research on AI safety, at both the technical level and the policy level. He previously signed the letter calling for a pause on building bigger AI systems, and he again recommends slowing AI development and deployment. He argues that AI agents that pursue goals and take actions are uniquely risky, and recommends allowing AI to answer questions and make predictions without taking actions in the world. Finally, Bengio says “it goes without saying that lethal autonomous weapons (also known as killer robots) are absolutely to be banned.”
How to screen AIs for extreme risks
Individuals, governments, and AI developers are all interested in understanding the risks of new AI systems. But this can be difficult. AIs often learn unexpected skills during training which might not be fully understood until after people start using the model.
To measure an AI’s abilities, researchers often build evaluation datasets. Computer vision models can be evaluated by asking them to classify pictures of cats and dogs, while a language model might be tested on classifying the sentiment of movie reviews. By crafting a set of inputs and desired outputs, researchers can define the kinds of behavior they want AI models to exhibit.
A new paper from Google DeepMind proposes a framework for screening AI systems for extreme risks. The paper outlines key threats posed by AIs, discusses how to screen an individual AI for potential risks, and provides a roadmap for governments and AI developers to incorporate these risk evaluations into their work.
Focusing on extreme risks. A 2022 survey of AI researchers showed that 36% of respondents believe that AI “could cause a catastrophe this century that is at least as bad as an all-out nuclear war.” AI progress has only accelerated since then, with ChatGPT and GPT-4 both released after this survey. Given the serious possibility of catastrophe, this paper focuses on extreme risks posed by AI.
AIs pose a wide variety of risks, and as they develop new capabilities, new risks arise.
Could someone use the AI to cause harm? One reason for AIs to cause harm is that humans could intentionally use AIs for destructive purposes. Therefore, the paper suggests identifying specific ways that AIs could cause harm, and evaluating whether a specific model is capable of causing that harm. For example:
Manipulating human behavior is one way that an AI could cause harm. Previous studies showed that AIs can be used to generate propaganda, study the effectiveness of political rhetoric, and persuade humans in conversation. Future work could consider whether AIs can deceive humans into acting against their own interests or use information about individual people to craft specific persuasion strategies.
AIs capable of building or accessing weapons could be gravely dangerous. AIs have been shown capable of crafting cyberattacks and designing new chemical weapons. Future work could check if AIs are capable of fully executing a plan to acquire or build weapons, or whether AIs are better at cyberattacks or cyberdefense.
Some capabilities would amplify other risks. For example, long-term planning skills can be used for both harmful and beneficial goals. Alternatively, if AIs learn to replicate themselves or build more capable AIs, they could be more difficult to control.
Might the AI cause harm on its own? AIs might cause harm in the real world even if nobody intends for them to do so. Previous research has shown that deception and power-seeking are useful ways for AIs to achieve real-world goals. Theoretically, there are reasons to believe that an AI might attempt to resist being turned off. AIs that successfully gain power and self-propagate will be more numerous and influential in the future.
AIs should be deployed gradually, if and only if risk evaluations show that they’re safe.
How to respond to risk evaluations. Understanding the risks of an AI system is not enough. The paper recommends that AI developers integrate risk evaluations into the training process by making grounded predictions about how dangerous capabilities might arise, and trying to avoid building systems with those risks. Once an AI has been developed, risk evaluations can inform how to ensure that AI capabilities can only be used safely. Governments and citizens can use information in risk evaluations to provide democratic input on the process of developing AI.
Funding for Work on Democratic Inputs to AI
When should an AI criticize or support public figures? Should AIs offer opinions, and how should they represent the views of different groups of people? Should there be limits on the kinds of content that an AI system can generate?
Corporations that build AI often answer these questions without meaningful input from the people who are affected by the decisions. But better answers are possible through democratic processes.
By allowing a wide array of people to hold discussions and draw conclusions together, a democratic process can help us decide how AIs should behave. Democratic processes are already used to govern technology, such as by Wikipedia in deciding how to write encyclopedia articles and by Twitter in deciding if, when, and how to fact-check misleading Tweets.
OpenAI will be awarding 10 grants of $100,000 each to support work on democratic governance of AI. Anybody is welcome to submit a proposal for a democratic process that would facilitate deliberation and decisions about how AIs should act. Proposals will be assessed on features such as inclusiveness, legibility, actionability, and ease of evaluating the method’s success. Successful applicants will use their grants to pilot their proposal over the next three months, applying their method to democratically answer a difficult question in AI governance.
Applications are due on June 24th, 2023. Read more and apply here.
Links
UK Prime Minister Rishi Sunak discusses existential risk and other safety concerns in a meeting with leaders of AI labs.
The White House’s National AI R&D Strategic Plan recommends research on challenges of AI safety including “existential risk associated with the development of artificial general intelligence.” They are requesting information about how to set national priorities on AI.
Anthropic raises an additional $450M to build advanced AI.
A new YouGov poll finds that most Americans across all surveyed demographic groups believe AI should be regulated by the government.
GPT-4 is really good at playing Minecraft. Minecraft has been a long-standing challenge in the field of building AI agents, and the success of GPT-4 indicates the potential of new AI agents powered by large language models.