Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

I want to draw attention to a new paper, written by myself, David “davidad” Dalrymple, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, Christian Szegedy, Ben Goldhaber, Nora Ammann, Alessandro Abate, Joe Halpern, Clark Barrett, Ding Zhao, Tan Zhi-Xuan, Jeannette Wing, and Joshua Tenenbaum.

In this paper we introduce the concept of “guaranteed safe (GS) AI”, which is a broad research strategy for obtaining safe AI systems with provable quantitative safety guarantees. Moreover, with a sufficient push, this strategy could plausibly be implemented on a moderately short time scale. The key components of GS AI are:

  1. A formal safety specification that mathematically describes what effects or behaviors are considered safe or acceptable.

  2. A world model that provides a mathematical description of the environment of the AI system.

  3. A verifier that provides a formal proof (or some other comparable auditable assurance) that the AI system satisfies the safety specification with respect to the world model.

The first thing to note is that a safety specification is, in general, not the same thing as a reward function, utility function, or loss function (though safety specifications include these objects as special cases). For example, a safety specification may state that the AI system should not communicate outside of certain channels, copy itself to external computers, modify its own source code, or obtain information about certain classes of things in the external world. Safety specifications may be written manually, generated by a learning algorithm, written by an AI system, or obtained through other means. Further detail is provided in the main paper.
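To make the distinction concrete, here is a minimal Python sketch (all names and fields are hypothetical illustrations, not taken from the paper): a reward function scores individual steps and trades safety off against task performance, whereas a safety specification is a predicate over whole trajectories that either holds or does not.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    """A toy record of one step of agent behaviour (hypothetical fields)."""
    channel: str              # where the agent sent its output
    modified_own_code: bool   # whether it attempted self-modification

Trajectory = List[Step]

def reward(step: Step) -> float:
    """A reward function scores individual steps (or state-action pairs)."""
    return 1.0 if step.channel == "task_output" else 0.0

ALLOWED_CHANNELS = {"task_output", "operator_log"}

def safety_spec(trajectory: Trajectory) -> bool:
    """A safety specification is a predicate over whole trajectories:
    it either holds or it does not, regardless of how much reward was earned."""
    return all(
        step.channel in ALLOWED_CHANNELS and not step.modified_own_code
        for step in trajectory
    )

# Example usage: a trajectory that stays within the allowed channels.
trace = [Step(channel="task_output", modified_own_code=False)]
assert safety_spec(trace)
```

The point is only that the specification constrains behaviour categorically, rather than trading it off against task performance.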

The next thing to note is that most useful safety specifications must be given relative to a world model. Without a world model, we can only use specifications defined directly over input-output relations. However, we want to define specifications over input-outcome relations instead. This is why a world model is a core component of GS AI. Also note that:

  1. The world model need not be a “complete” model of the world. Rather, the required amount of detail and the appropriate level of abstraction depends on both the safety specification(s) and the AI system’s context of use.

  2. The world model should of course account for uncertainty, which may include both stochasticity and nondeterminism.

  3. The AI system whose safety is being verified may or may not use a world model of its own, and if it does, we may or may not be able to extract it. However, the world model used to verify the safety properties need not be the same as the AI system's own internal world model (if it has one).

The world model would likely have to be AI-generated, and should ideally be interpretable. In the main paper, we outline a few potential strategies for producing such a world model.
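As a toy illustration of what a world model could look like at a coarse level of abstraction, here is a sketch of a finite-state stochastic transition system (all states, actions, and probabilities are invented for illustration; a realistic world model would be far richer). The model only needs to track the aspects of the world that the safety specification refers to, at whatever resolution the specification requires.

```python
import random
from typing import Dict, Tuple

# A toy world model: a finite-state stochastic transition system.
# Transitions map (state, action) pairs to probability distributions over
# next states, capturing stochasticity; nondeterminism could be modelled
# by leaving some of these distributions unspecified.
Transitions = Dict[Tuple[str, str], Dict[str, float]]

WORLD_MODEL: Transitions = {
    ("idle", "send_message"): {"message_delivered": 0.95, "network_error": 0.05},
    ("message_delivered", "wait"): {"idle": 1.0},
    ("network_error", "retry"): {"idle": 0.9, "copied_to_external_host": 0.1},
}

def sample_next_state(state: str, action: str) -> str:
    """Sample a successor state from the model's transition distribution."""
    outcomes = WORLD_MODEL[(state, action)]
    states, probs = zip(*outcomes.items())
    return random.choices(states, weights=probs, k=1)[0]

print(sample_next_state("idle", "send_message"))
```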

Finally, the verifier produces a quantitative assurance that the base-level AI controller satisfies the safety specification(s) relative to the world model(s). In its most straightforward form, this assurance would simply be a formal proof. However, if a direct formal proof cannot be obtained, there are weaker alternatives that still yield a quantitative guarantee. For example, the assurance may take the form of a proof that bounds the probability of failing to satisfy the safety specification, or a proof that the AI system will converge towards satisfying the safety specification (with increasing amounts of data or computational resources, for example). Such proofs are of course often very hard to obtain. However, further progress in automated theorem proving (and related techniques) may make them substantially easier to produce. Furthermore, an automated theorem prover AI could be very powerful without having dangerous capabilities. For more detail, see the main paper.
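As a toy illustration of the probabilistic form of assurance, the sketch below exhaustively computes the probability that the toy world model above, driven by a fixed (hypothetical) policy, reaches an unsafe state within a bounded horizon. A real verifier would use formal proof, probabilistic model checking, or automated theorem proving rather than naive enumeration, but the shape of the output is the same: a quantitative bound on the probability of violating the specification.

```python
from typing import Dict, Tuple

# Reuse the toy world model above; the "policy" fixes the action the AI
# controller takes in each state (all names are hypothetical).
TRANSITIONS: Dict[Tuple[str, str], Dict[str, float]] = {
    ("idle", "send_message"): {"message_delivered": 0.95, "network_error": 0.05},
    ("message_delivered", "wait"): {"idle": 1.0},
    ("network_error", "retry"): {"idle": 0.9, "copied_to_external_host": 0.1},
}
POLICY = {"idle": "send_message", "message_delivered": "wait", "network_error": "retry"}
UNSAFE = {"copied_to_external_host"}  # the safety spec: never reach these states

def violation_probability(state: str, horizon: int) -> float:
    """Exact probability of reaching an unsafe state within `horizon` steps."""
    if state in UNSAFE:
        return 1.0
    if horizon == 0:
        return 0.0
    successors = TRANSITIONS[(state, POLICY[state])]
    return sum(p * violation_probability(s, horizon - 1) for s, p in successors.items())

print(violation_probability("idle", horizon=20))  # ≈ 0.0489, i.e. a bound of roughly 5%
```

Note that such a guarantee is only as strong as the world model it is stated relative to.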

If each of these three components can be created, then they can be combined to provide auditable, quantitative safety guarantees for AI systems. This strategy also does not require interpretability to be solved, but could still provide a solution to the inner alignment problem (and rule out deceptive alignment, etc.). Moreover, it should be possible to implement this strategy without any new fundamental insights; improvements to existing techniques (using LLMs and other tools) may be sufficient. If we get a substantive research push in this direction, then I am optimistic about the prospects of achieving substantially safer AI systems through the GS AI strategy.

For more detail, see the full paper.