Provably Safe AI


I’ve been hearing vague claims that automated theorem provers can, or soon will be able to, prove things about complex software such as AIs.

Max Tegmark and Steve Omohundro have now published a paper, Provably Safe Systems: The Only Path To Controllable AGI, which convinces me that this is a plausible strategy to help with AI safety.

Ambitions

The basic steps:

  • Write trustworthy AIs that are capable of searching for proofs and verifying them.

  • Specify safety properties that we want all AIs to obey.

  • Train increasingly powerful Deep Learning systems.

  • Use Mechanistic Interpretability tools to translate knowledge from deep learning systems into more traditional software that’s more transparent than neural nets.

  • Use the AIs from the first step to prove that the results have those safety properties.

  • Require that any hardware that’s sufficiently powerful to run advanced AI be provably constrained to run only provably safe software (see the sketch after this list).
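To make the last two steps concrete, here’s a minimal sketch of the gating idea. Everything in it (SafetySpec, verify_proof, binding proofs to a program hash) is my own illustration, not an API or protocol from the paper; a real system would replace verify_proof with a small trusted proof-checking kernel rather than the stub below.

```python
# Minimal sketch of the "only run provably safe software" gate.
# All names here are illustrative, not from Tegmark & Omohundro's paper.

from dataclasses import dataclass
from typing import Callable
import hashlib


@dataclass(frozen=True)
class SafetySpec:
    """A safety property we demand of all programs, stated formally."""
    name: str
    formal_statement: str  # placeholder for a statement in a proof assistant's logic


@dataclass(frozen=True)
class Proof:
    """A machine-checkable certificate, bound to one exact program artifact."""
    program_sha256: str
    spec: SafetySpec
    certificate: bytes  # placeholder for a proof term


def sha256(program_source: str) -> str:
    return hashlib.sha256(program_source.encode()).hexdigest()


def verify_proof(proof: Proof, program_source: str, spec: SafetySpec) -> bool:
    """Stand-in for a small trusted proof checker (e.g. a Lean/Coq-style kernel).
    Here we only check that the proof is bound to this exact program and spec."""
    return proof.program_sha256 == sha256(program_source) and proof.spec == spec


def gated_run(program_source: str, spec: SafetySpec, proof: Proof,
              interpreter: Callable[[str], None]) -> None:
    """The 'provably constrained hardware' step: refuse to execute anything
    that lacks a verified proof of the required safety property."""
    if not verify_proof(proof, program_source, spec):
        raise PermissionError("no verified safety proof; refusing to run")
    interpreter(program_source)


# Toy usage: the proof here is vacuous; the point is the control flow.
spec = SafetySpec("no-op spec", "True")
src = "print('running only because a (toy) proof checked out')"
gated_run(src, spec, Proof(sha256(src), spec, b""), interpreter=exec)
```

The interesting work is of course hidden inside the proof checker and the hardware enforcement; the sketch only shows the shape of the pipeline the paper proposes.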

Progress in automated theorem proving has been impressive. It’s tantalizingly close to what we need to prove interesting constraints on a large class of systems.

What Systems Can We Hope to Prove Properties About?

The paper convinced me that automated proof search and verification are making important progress. My intuition still says that leading AIs are too complex to prove anything about. But I don’t have a strong argument to back up that intuition. The topic is important enough that we ought to be pursuing some AI safety strategies that have a significant risk of being impossible. Throwing lots of compute at proving things could produce unintuitive results.

My main reaction, after figuring out the basics of the paper’s proposals, was to conclude that their strategy makes it impossible to train new AIs on high-end GPUs.

Tegmark and Omohundro admit that powerful neural networks seem too messy to prove much about. Yet they also say that neural networks are a key step in creating better AIs:

First, note that the seemingly magical power of neural networks comes not from their ability to execute, but from their ability to learn. Once training is complete and it is time for execution, a neural network is merely one particular massively parallel computational architecture—and there are many others based on traditional software that can be similarly efficient.

But neural networks routinely crush the competition when it comes to learning.

I’m willing to imagine that the knowledge embodied in a neural network can be translated into more traditional software, and that doing so might make enough of a difference that the traditional software version will be amenable to proofs. My intuition says we should be pessimistic about proofs being tractable for anything more powerful than GPT-4, but the potential benefits suggest that we should try anyway.

What bothers me most about the paper’s strategy is that training a new neural network requires running that unproven neural network, in ways that allow it to perform arbitrarily intelligent actions. Doesn’t that conflict with the rule that the relevant hardware only runs provably safe software?

That appears to depend in part on whether the training software can prove that the neural network is safe.

For an arbitrarily smart neural network, that may require proving that there’s no way for information about the network to leave the training run except via the process that proves its safety.

Implementing such security is well beyond my pay grade. But I’m not willing to bet it’s beyond the ability of GPT-7.
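As a way of picturing what that containment requirement would mean, here’s a deliberately simplified sketch (my own construction, not anything from the paper): the trained weights live only inside an enclave object, and the sole export path runs through the prover.

```python
# Toy sketch of "nothing about the trained network leaves the training run
# except via the safety-proving process". My construction, not the paper's.


class ProofFailure(Exception):
    pass


class TrainingEnclave:
    def __init__(self, train_fn, distill_fn, prove_fn):
        self._train_fn = train_fn      # produces raw (unproven) weights
        self._distill_fn = distill_fn  # translates weights into transparent code
        self._prove_fn = prove_fn      # returns a checked proof object, or None
        self._weights = None           # never exposed directly

    def train(self, data):
        self._weights = self._train_fn(data)

    def export(self, spec):
        """The only exit: a distilled program plus a verified proof that it
        satisfies `spec`. If no proof is found, nothing leaves the enclave."""
        program = self._distill_fn(self._weights)
        proof = self._prove_fn(program, spec)
        if proof is None:
            raise ProofFailure("no safety proof found; weights stay sealed")
        return program, proof
```

A Python attribute convention enforces nothing, of course; the hard part is making the “no other exit” property hold down at the hardware level, which is exactly the part I said is beyond my pay grade.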

It also means that AI development becomes harder once security reaches this level (the paper implies that increasingly strict safety requirements will be added gradually). The paper seems to downplay the development cost of humans not being able to test the unsafe neural network to figure out why it couldn’t be proven safe.

What Safety Properties Are Feasible?

The paper implies that we should initially ask for relatively basic safety properties, such as a refusal to help terrorists spread harmful viruses.

I want to focus instead on security properties that limit the AI’s ability to take over the world. Specifically, can we prove that the AI won’t interfere with a human pressing a kill button that shuts down the AI?

One simple suggestion from the paper is to require that the AI obtain crypto coins in order to access compute. That makes the default outcome that the AI runs out of compute unless the humans in control of generating the coins cooperate.
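As a concrete illustration of that default-off mechanism, here is a toy version (mine, not the protocol in the paper): compute is metered by tokens that only the human key-holders can mint, so when the humans stop issuing tokens the compute budget falls to zero.

```python
# Toy sketch of "the AI needs coins to access compute": a scheduler grants a
# bounded compute budget only for tokens signed with a key the humans hold,
# so the default when humans stop cooperating is that the budget drops to zero.
# Illustrative only; the paper's actual mechanism may differ.

import hashlib
import hmac
import os
import struct
import time

HUMAN_KEY = os.urandom(32)  # held by the human operators, never by the AI


def issue_token(compute_seconds: int, expires_at: float) -> bytes:
    """Run by humans: mint a token worth `compute_seconds` until `expires_at`."""
    msg = struct.pack(">Id", compute_seconds, expires_at)
    return msg + hmac.new(HUMAN_KEY, msg, hashlib.sha256).digest()


def grant_compute(token: bytes) -> int:
    """Run by the scheduler: return the compute-seconds a token is good for,
    or 0 if it is forged or expired. No fresh tokens means no compute."""
    msg, tag = token[:12], token[12:]
    if not hmac.compare_digest(tag, hmac.new(HUMAN_KEY, msg, hashlib.sha256).digest()):
        return 0
    compute_seconds, expires_at = struct.unpack(">Id", msg)
    return compute_seconds if time.time() < expires_at else 0


# While humans keep minting tokens the AI gets compute; otherwise it winds down.
token = issue_token(compute_seconds=60, expires_at=time.time() + 3600)
assert grant_compute(token) == 60
assert grant_compute(b"\x00" * 44) == 0
```

A real version would need the token verification itself to sit inside the provably safe stack, plus replay protection, but the default-off property is the point.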

That’s a good start for dealing with AIs that approach human levels. Eventually we’ll want to add safety properties that deter AIs from harmful manipulation of humans.

The longer-term vision seems to be gradually building up stronger safety guarantees, until they’re more like Asimov’s 3 Laws of Robotics.

More Caveats

The strategy assumes we’ll develop a good set of safety properties that we’re demanding proofs of. There’s likely some room for automated systems to figure out what safety properties humans want, and to turn them into rigorous specifications. Tegmark and Omohundro prefer to have this step done by humans.

I’m willing to accept that for near-human stages of AI, but not too long after that we’ll need something closer to an AI-generated approximation of CEV. I’ll classify this as a fairly important problem, one that’s shared by most other approaches to AI safety.

Proofs about hardware seem different from proofs about software, since there are uncertainties about how well the theorem prover can observe the hardware. This seems like a relatively minor issue, but I’m uncomfortable with the way the paper claims mathematical certainty about the proofs.

The authors list sensor reliability as an open problem. They aim only for increasing that reliability, not full guarantees. So I’m getting mildly conflicting messages about how much certainty the paper is aiming for.

Section 8 (“The Only Path To Controllable AGI”) goes a bit too far:

this implies that if some safety property doesn’t have a proof, then there must be a way to violate it. Sufficiently powerful AGIs may well find that way. And if that AGI is malicious or controlled by a malicious human, then it will exploit that flaw.

That seems reasonable if the safety properties that we ask for are exactly the properties that we need, and if we use unlimited amounts of compute to search for proofs.

I consider it much more likely that we’ll ask for safety properties that don’t exactly match our needs. Either we’ll be overly cautious and ask for properties that constrain the AI more than we absolutely need, in which case an absence of proof doesn’t tell us whether a system is dangerous; or we’ll fail at safety by not asking for as much safety as we need.

Note that they seem to expect fairly exhaustive searches for proofs, such that failure to find a proof ought to imply a genuine absence of safety. I’m unsure whether this is a reasonable expectation.
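One way to state the gap I’m pointing at, in my own notation (not the paper’s): write P for the property we ask the prover to establish, N for the property we actually need, and S for the system.

```latex
% The paper's reading of "no proof": with an exhaustive search, truth and
% provability coincide for the requested property P of system S.
\[
  \big(\neg\exists \pi.\ \pi \vdash P(S)\big) \;\Longrightarrow\; \neg P(S)
\]
% My worry is that P will rarely equal N. If we over-ask (P is strictly
% stronger than N), failing to prove P leaves N undetermined:
\[
  P(S) \Rightarrow N(S), \qquad
  \big(\neg\exists \pi.\ \pi \vdash P(S)\big) \;\not\Longrightarrow\; \neg N(S)
\]
% And if we under-ask (N is stronger than P), even a verified proof of P
% doesn't give us the safety we actually need.
```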

Conclusion

I see something like a 50% chance that this strategy would significantly reduce AI risk if the world coordinated to require it. However, it’s still rather hard to see how we can get such coordination.

I’m now feeling a wee bit more optimistic about decent regulation of AI development.

I’ve returned to thinking that it’s overkill to have a full-fledged ban on development of powerful AI.

Requiring insurance and strict liability no longer appears to create a large risk of amounting to a total ban on development, so it’s looking more plausible that such regulations can be enforced.

If insurance companies need to see safety proofs for the software that runs on advanced GPUs, that will in practice create a temporary pause in AI progress. I’m guessing that pause would be on the order of 1 to 10 years. But will it end due to successful proofs, or due to an inability to enforce regulations? I don’t know.