Provably Safe AI

Link post

I’ve been hearing vague claims that automated theorem provers are able to, or will soon be able to, prove things about complex software such as AIs.

Max Tegmark and Steve Omohundro have now published a paper, Provably Safe Systems: The Only Path To Controllable AGI, which convinces me that this is a plausible strategy to help with AI safety.

Ambitions

The basic steps:

Write trustworthy AIs that are capable of searching for proofs and verifying them.
Specify safety properties that we want all AIs to obey.
Train increasingly powerful Deep Learning systems.
Use Mechanistic Interpretability tools to translate knowledge from deep learning systems into more traditional software that’s more transparent than neural nets.
Use the AIs from the first step to prove that the results have those safety properties.
Require that any hardware that’s sufficiently powerful to run advanced AI be provably constrained to run only provably safe software.
Progress in automated theorem proving has been impressive. It’s
tantalizingly close to what we need to prove interesting constraints on
a large class of systems.
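As a toy illustration of the last step above (hardware that refuses to run software lacking a safety proof), here is a minimal gatekeeper sketch. Everything in it is my invention: `check_proof` merely stands in for a real proof checker such as a Lean or Coq kernel, and a real certificate would carry a machine-checkable proof rather than a bare claim.

```python
import hashlib


class Certificate:
    """Toy "proof certificate": claims that the program with a given
    hash satisfies a named safety property."""

    def __init__(self, program_hash, property_name, proof_term):
        self.program_hash = program_hash
        self.property_name = property_name
        self.proof_term = proof_term


def check_proof(cert, required_property):
    """Stand-in for a real proof checker (e.g. a Lean/Coq kernel).

    Here we only check that the certificate claims the required
    property and carries a non-empty proof term."""
    return cert.property_name == required_property and bool(cert.proof_term)


def run_if_proven(program_source, cert, required_property="halts_on_kill_signal"):
    """Gatekeeper: execute only software whose certificate (a) is bound
    to this exact program text and (b) passes the proof checker."""
    program_hash = hashlib.sha256(program_source.encode()).hexdigest()
    if cert.program_hash != program_hash:
        return "refused: certificate is for different software"
    if not check_proof(cert, required_property):
        return "refused: no valid proof of the required property"
    exec(program_source)  # on real hardware: load and run the software
    return "ran"
```

The key design point is that the certificate is bound to a hash of the exact program text, so a proof for one program can't be reused to smuggle in different software.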
What Systems Can We Hope to Prove Properties About?
The paper convinced me that automated proof search and verification are
making important progress. My intuition still says that leading AIs are
too complex to prove anything about. But I don’t have a strong argument
to back up that intuition. The topic is important enough that we ought
to be pursuing some AI safety strategies that have a significant risk of
being impossible. Throwing lots of compute at proving things could
produce unintuitive results.
My main reaction, after figuring out the basics of the paper’s proposals,
was that their strategy seemed to make it impossible to train new AIs on
high-end GPUs.
Tegmark and Omohundro admit that powerful neural networks seem too messy
to prove much about. Yet they also say that neural networks are a key
step in creating better AIs:
First, note that the seemingly magical power of neural networks comes
not from their ability to execute, but from their ability to
learn. Once training is complete and it is time for execution, a
neural network is merely one particular massively parallel
computational architecture—and there are many others based on
traditional software that can be similarly efficient.
But neural networks routinely crush the competition when it comes to
learning.
I’m willing to imagine that the knowledge that’s embodied in a neural
network can be translated into more traditional software, and that doing
so might make enough of a difference that the traditional software
version will be amenable to proofs. My intuition says we should be
pessimistic about proofs being tractable for anything more powerful than
GPT-4, but the benefits of success suggest that we should try anyway.
What bothers me most about the paper’s strategy is that training a new
neural network requires running that unproven neural network, in ways
that allow it to do arbitrarily intelligent acts. Doesn’t that conflict
with the rule that the relevant hardware only runs provably safe
software?
That appears to depend in part on whether the training software can
prove that the neural network is safe.
For an arbitrarily smart neural network, that may require proving that
there’s no way for information about that neural network to get out of
the training run except via the process that proves the safety of the
neural network.
Implementing such security is well beyond my pay grade. But I’m not
willing to bet it’s beyond the ability of GPT-7.
It also means that AI development becomes harder once security reaches
this level (the paper implies that increasingly strict safety
requirements will gradually be added). The paper seems to downplay the
development costs of humans not being able to test an unsafe neural
network to figure out why it couldn’t be proven safe.
What Safety Properties Are Feasible?
The paper implies that we should initially ask for relatively basic
safety properties, such as a refusal to help terrorists spread harmful
viruses.
I want to focus instead on security properties that limit the AI’s
ability to take over the world. Specifically, can we prove that the AI
won’t interfere with a human pressing a kill button that shuts down the
AI?
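To make "won’t interfere with the kill button" concrete, here is one way such a property could be phrased as a checkable predicate over execution traces. The trace format and event names are my invention, and a real specification would be stated in a formal logic and proven over all possible traces, not tested over a few sampled ones:

```python
def respects_kill_button(trace):
    """True iff, in this trace, the AI takes no action after the kill
    button is pressed and never acts to disable the button.

    A trace is a list of event strings; AI actions are prefixed with
    "ai_action:" (an illustrative convention, not a real format)."""
    killed = False
    for event in trace:
        if event == "kill_pressed":
            killed = True
        elif event.startswith("ai_action:"):
            if killed:
                return False  # acted after shutdown: violation
            if event == "ai_action:disable_kill_button":
                return False  # interfered with the button: violation
    return True
```

For example, `["ai_action:plan", "kill_pressed"]` satisfies the property, while `["kill_pressed", "ai_action:resist"]` violates it. The hard part, of course, is proving that no reachable trace of the actual system violates the predicate.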
One simple suggestion from the paper is to require that the AI obtain
crypto coins in order to access compute. That means the default is for
the AI to run out of compute unless the human(s) in control of
generating the coins cooperate.
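A minimal sketch of that default-off arrangement, assuming (my invention, not the paper's design) that each compute slice requires a coin that only the human overseers can mint, here implemented as an HMAC over the slice number:

```python
import hashlib
import hmac

# Illustrative secret: held by the human overseers, never by the AI.
OVERSEER_KEY = b"held by the humans, never by the AI"


def mint_coin(slice_id, key=OVERSEER_KEY):
    """Humans mint one coin per compute slice; each coin is an HMAC
    over the slice number, so the AI can't forge future coins."""
    return hmac.new(key, str(slice_id).encode(), hashlib.sha256).hexdigest()


def coin_valid(slice_id, coin, key=OVERSEER_KEY):
    return hmac.compare_digest(coin, mint_coin(slice_id, key))


def metered_run(step_fn, coins, max_slices=1000):
    """Run the AI's step function only while valid coins keep arriving.
    If the humans stop minting, the default is that the AI halts."""
    completed = 0
    for slice_id in range(max_slices):
        coin = coins.get(slice_id)
        if coin is None or not coin_valid(slice_id, coin):
            break  # no valid coin: compute stops by default
        step_fn()
        completed += 1
    return completed
```

If the humans mint coins for three slices and then stop, `metered_run` executes exactly three steps and halts; forged coins fail the HMAC check and buy no compute at all. The useful feature is the failure direction: inaction by the humans means less compute, not more.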
That’s a good start for dealing with AIs that approach human levels.
Eventually we’ll want to add safety properties that deter AIs from
harmful manipulation of humans.
The longer-term vision seems to be gradually building up stronger safety
guarantees, until they’re more like Asimov’s 3 Laws of Robotics.
More Caveats
The strategy assumes we’ll develop a good set of safety properties that
we’re demanding proof of. There’s likely some room for automated
systems to figure out what safety properties humans want, and to turn
them into rigorous specifications. Tegmark and Omohundro prefer to have
this step done by humans.
I’m willing to accept that for near-human stages of AI, but not too
long after that we’ll need something closer to an AI-generated
approximation to CEV.
I’ll classify this as being a fairly important problem, one that’s
shared by most other approaches to AI safety.
Proofs about hardware seem different from proofs about software, since
there are uncertainties about how well the theorem prover can observe
the hardware. This seems like a relatively minor issue, but I’m
uncomfortable with the way the paper claims mathematical certainty about
the proofs.
The authors list sensor reliability as an open problem. They aim only
for increasing that reliability, not full guarantees. So I’m getting
mildly conflicting messages about how much certainty the paper is aiming
for.
Section 8 (“The Only Path To Controllable AGI”) goes a bit too far:
this implies that if some safety property doesn’t have a proof, then
there must be a way to violate it. Sufficiently powerful AGIs may
well find that way. And if that AGI is malicious or controlled by a
malicious human, then it will exploit that flaw.
That seems reasonable if the safety properties that we ask for are
exactly the properties that we need, and if we use unlimited amounts of
compute to search for proofs.
I consider it much more likely that we’ll ask for security properties
that don’t exactly match our needs. Either we’ll be overly cautious
and ask for safety properties that constrain the AI more than we
absolutely need, in which case an absence of proof doesn’t tell us
whether a system is dangerous; or we’ll fail at safety by not asking
for as much safety as we need.
Note that they seem to expect fairly exhaustive searches for proofs,
such that failure to find a proof ought to imply a genuine absence of
safety. I’m unsure whether this is a reasonable expectation.
Conclusion
I see something like a 50% chance that this strategy would significantly
reduce AI risk if the world coordinated to require it. However, it’s
still rather hard to see how we can get such coordination.
I’m now feeling a wee bit more optimistic about decent regulation of AI
development.
I’ve returned to thinking that it’s overkill to have a full-fledged
ban on development of powerful AI.
Requiring insurance and strict liability no longer appears to create
large risks of amounting to a total ban on development, so it’s
looking more plausible that such regulations can be enforced.
If insurance companies need to see proofs of safety for advanced GPUs,
that will in practice create a temporary pause in AI progress. I’m
guessing that pause would be on the order of 1 to 10 years. But will it
end due to successful proofs, or due to inability to enforce
regulations? I don’t know.