Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

I want to draw attention to a new paper, written by myself, David “davidad” Dalrymple, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, Christian Szegedy, Ben Goldhaber, Nora Ammann, Alessandro Abate, Joe Halpern, Clark Barrett, Ding Zhao, Tan Zhi-Xuan, Jeannette Wing, and Joshua Tenenbaum.

In this paper we introduce the concept of “guaranteed safe (GS) AI”, which is a broad research strategy for obtaining safe AI systems with provable quantitative safety guarantees. Moreover, with a sufficient push, this strategy could plausibly be implemented on a moderately short time scale. The key components of GS AI are:

  1. A formal safety specification that mathematically describes what effects or behaviors are considered safe or acceptable.

  2. A world model that provides a mathematical description of the environment of the AI system.

  3. A verifier that provides a formal proof (or some other comparable auditable assurance) that the AI system satisfies the safety specification with respect to the world model.

The first thing to note is that a safety specification is, in general, not the same thing as a reward function, utility function, or loss function (though safety specifications include these objects as special cases). For example, a safety specification may state that the AI system should not communicate outside of certain channels, copy itself to external computers, modify its own source code, or obtain information about certain classes of things in the external world. Safety specifications may be written manually, generated by a learning algorithm, written by an AI system, or obtained through other means. Further detail is provided in the main paper.
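To make the distinction concrete, here is a minimal Python sketch (all names and fields are hypothetical illustrations, not taken from the paper): a reward function scores individual steps and trades safety off against task performance, whereas a safety specification is a predicate over whole trajectories that either holds or does not.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    """A toy record of one step of agent behaviour (hypothetical fields)."""
    channel: str              # where the agent sent its output
    modified_own_code: bool   # whether it attempted self-modification

Trajectory = List[Step]

def reward(step: Step) -> float:
    """A reward function scores individual steps (or state-action pairs)."""
    return 1.0 if step.channel == "task_output" else 0.0

ALLOWED_CHANNELS = {"task_output", "operator_log"}

def safety_spec(trajectory: Trajectory) -> bool:
    """A safety specification is a predicate over whole trajectories:
    it either holds or it does not, regardless of how much reward was earned."""
    return all(
        step.channel in ALLOWED_CHANNELS and not step.modified_own_code
        for step in trajectory
    )

# Example usage: a trajectory that stays within the allowed channels.
trace = [Step(channel="task_output", modified_own_code=False)]
assert safety_spec(trace)
```

The point is only that the specification constrains behaviour categorically, rather than trading it off against task performance.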

The next thing to note is that most useful safety specifications must be given relative to a world model. Without a world model, we can only use specifications defined directly over input-output relations. However, we want to define specifications over input-outcome relations instead. This is why a world model is a core component of GS AI. Also note that:

  1. The world model need not be a “complete” model of the world. Rather, the required amount of detail and the appropriate level of abstraction depends on both the safety specification(s) and the AI system’s context of use.

  2. The world model should of course account for uncertainty, which may include both stochasticity and nondeterminism.

  3. The AI system whose safety is being verified may or may not use a world model of its own, and if it does, we may or may not be able to extract it. However, the world model used to verify the safety properties need not be the same as the AI system's own internal world model (if it has one).

The world model would likely have to be AI-generated, and should ideally be interpretable. In the main paper, we outline a few potential strategies for producing such a world model.
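As a toy illustration of what a world model could look like at a coarse level of abstraction, here is a sketch of a finite-state stochastic transition system (all states, actions, and probabilities are invented for illustration; a realistic world model would be far richer). The model only needs to track the aspects of the world that the safety specification refers to, at whatever resolution the specification requires.

```python
import random
from typing import Dict, Tuple

# A toy world model: a finite-state stochastic transition system.
# Transitions map (state, action) pairs to probability distributions over
# next states, capturing stochasticity; nondeterminism could be modelled
# by leaving some of these distributions unspecified.
Transitions = Dict[Tuple[str, str], Dict[str, float]]

WORLD_MODEL: Transitions = {
    ("idle", "send_message"): {"message_delivered": 0.95, "network_error": 0.05},
    ("message_delivered", "wait"): {"idle": 1.0},
    ("network_error", "retry"): {"idle": 0.9, "copied_to_external_host": 0.1},
}

def sample_next_state(state: str, action: str) -> str:
    """Sample a successor state from the model's transition distribution."""
    outcomes = WORLD_MODEL[(state, action)]
    states, probs = zip(*outcomes.items())
    return random.choices(states, weights=probs, k=1)[0]

print(sample_next_state("idle", "send_message"))
```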

Finally, the verifier produces a quantitative assurance that the base-level AI controller satisfies the safety specification(s) relative to the world model(s). In its most straightforward form, this assurance would simply be a formal proof. However, if a direct formal proof cannot be obtained, there are weaker alternatives that still yield a quantitative guarantee. For example, the assurance may take the form of a proof that bounds the probability of failing to satisfy the safety specification, or a proof that the AI system will converge towards satisfying the safety specification (with increasing amounts of data or computational resources, for example). Such proofs are of course often very hard to obtain. However, further progress in automated theorem proving (and related techniques) may make them substantially easier to produce. Furthermore, an automated theorem prover AI could be very powerful without having dangerous capabilities. For more detail, see the main paper.
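As a toy illustration of the probabilistic form of assurance, the sketch below exhaustively computes the probability that the toy world model above, driven by a fixed (hypothetical) policy, reaches an unsafe state within a bounded horizon. A real verifier would use formal proof, probabilistic model checking, or automated theorem proving rather than naive enumeration, but the shape of the output is the same: a quantitative bound on the probability of violating the specification.

```python
from typing import Dict, Tuple

# Reuse the toy world model above; the "policy" fixes the action the AI
# controller takes in each state (all names are hypothetical).
TRANSITIONS: Dict[Tuple[str, str], Dict[str, float]] = {
    ("idle", "send_message"): {"message_delivered": 0.95, "network_error": 0.05},
    ("message_delivered", "wait"): {"idle": 1.0},
    ("network_error", "retry"): {"idle": 0.9, "copied_to_external_host": 0.1},
}
POLICY = {"idle": "send_message", "message_delivered": "wait", "network_error": "retry"}
UNSAFE = {"copied_to_external_host"}  # the safety spec: never reach these states

def violation_probability(state: str, horizon: int) -> float:
    """Exact probability of reaching an unsafe state within `horizon` steps."""
    if state in UNSAFE:
        return 1.0
    if horizon == 0:
        return 0.0
    successors = TRANSITIONS[(state, POLICY[state])]
    return sum(p * violation_probability(s, horizon - 1) for s, p in successors.items())

print(violation_probability("idle", horizon=20))  # ≈ 0.0489, i.e. a bound of roughly 5%
```

Note that such a guarantee is only as strong as the world model it is stated relative to.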

If each of these three components can be created, then they can be combined to provide auditable, quantitative safety guarantees for AI systems. This strategy also does not require interpretability to be solved, but could still provide a solution to the inner alignment problem (and rule out deceptive alignment, etc.). Moreover, it should be possible to implement this strategy without any new fundamental insights; improvements to existing techniques (using LLMs and other tools) may be sufficient. If we get a substantive research push in this direction, then I am optimistic about the prospects of achieving substantially safer AI systems through the GS AI strategy.

For more detail, see the full paper.