Yeah idk. I think it’s plausible that proofs and products trade off against eachother. There’s a sense in which theoretical legitimacy is a signal that priorities are not in order, that it selects for staff that get excited by the wrong things, whereas deeply product-oriented teams are focused on what users actually care about. In an arms race, or even just scarce modes like layoffs/cuts, we should expect theory-driven teams to either restructure themselves against theory or just lose.

One approach to drawing a reference class: central banking predates macroeconomic theory by centuries, gambling predates probability theory by millenia, enigma was cracked without formal information theory (though the lag here is under a decade). However, this approach fails in at least one of it’s exceptions that I’m thinking about: Godel/church/turing can be said to have provided the “theory of the x86 instruction set” a few decades in advance. Then there are ambiguous cases: like archimedean theory of pressure was available to the engineers of the roman aqueduct, but it’s not clear 1. that the theory was a game changer for the engineers or they would have figured it out anyway, or 2. whether it matters that the theory isn’t the same as the matured theory after civilization continued to iterate.

Importantly I think is observations in the blockchain space: ethereum is not theory driven, it’s very move fast break things, it’s subject to hundreds of millions of dollars of hacks or security failures, its contracting language is intentionally trying to appeal to javascript developers, etc. If theory gets you any gains at all, you should expect that proof-driven blockchain ecosystems to beat ethereum, if nothing else than for the security benefits of theoretical legitimacy. But that happened: a splinter group from the ethereum team went off and started a formal verification -friendly blockchain ecosystem, and (this is not investing advice) it hasn’t gone super well. Basically ethereum says “the whitepaper markets the product” and cardano says “the product implements the whitepaper”, and this strategy has not led to cardano winning. (Note: this example could be washed out by first mover advantage, ethereum was the first mover above an important threshold of contract expressibility, so no reasonable amount of technical superiority may viably work).

So yeah, in spite of not being convinced by the “engineering predates theory” reference class, I still think that an aligned takeoff strategy that rests on deployments abiding soundly and completely to specifications that everyone agrees are right has a deeply slim chance of getting us out of this thing.

Strongly upvoted, this was a very valuable objection.

My theory of change for theorems sounds like Cardano from your example:

My theory of change is using adequate/comprehensive selection theoretic results to design training setups/environments that select for safety properties we want of our systems.

[Copied from a comment I made on Discord.]

Perhaps it’s a pessimistic bet, but decorrelated alignment bets are good. 😋

We do not necessarily need fully formally proven theorems, other forms of high confidence in safety could be sufficient. For example, I would be comfortable with turning on an AI that is safe iff the Goldbach conjecture is true, even if we have no proof of the Goldbach conjecture.

We currently don’t have any idea what kind of theorems we want to prove. Formulating the right theorem is likely more difficult than proving it.

Theorems can rely on imperfect assumptions (that are not exact as in the real world). In such a case, it is not clear that they give us the degree of high confidence that we would like to have.

Theorems that rely on imperfect assumptions could still be very valuable and increase overall safety, nonetheless. For example, if we could prove something like “this software is corrigible, assuming we are in a world run by Newtonian physics” then this could (depending on the details) be high evidence for the software being corrigible in a Quantum world.

We do not necessarily need fully formally proven theorems, other forms of high confidence in safety could be sufficient. For example, I would be comfortable with turning on an AI that is safe iff the Goldbach conjecture is true, even if we have no proof of the Goldbach conjecture.

Sounds like the kind of formal arguments I was alluding to in the post. Here you have a theorem of the form:
X⟹this system is existentially safe

Where X is a proposition. Theorems of the above form are most useful if we have significantly higher confidence in X than our prior confidence in the system’s existential safety.

We currently don’t have any idea what kind of theorems we want to prove. Formulating the right theorem is likely more difficult than proving it.

While I currently haven’t formulated any theorems I want to prove, my stumbling block to me appears to be insufficient mathematical maturity [I’ve set aside the project to study more first] and not intrinsic difficulty of the endeavour.

Theorems can rely on imperfect assumptions (that are not exact as in the real world). In such a case, it is not clear that they give us the degree of high confidence that we would like to have.

My philosophy is to set up minimal preconditions that describe the universe we live in.

Such preconditions will probably also describe other universes and may not give us very strong statements about our particular universe, but that’s fine. We can iterate on our theorems and build progressive theorems that tighten the preconditions of earlier theorems and give us stricter post conditions.

We could synthesise similar theorems, combine disparate theorems into a unified whole, generalise existing theorems, etc.

There are a lot of tools in the mathematician’s toolbox.

Theorems that rely on imperfect assumptions could still be very valuable and increase overall safety, nonetheless. For example, if we could prove something like “this software is corrigible, assuming we are in a world run by Newtonian physics” then this could (depending on the details) be high evidence for the software being corrigible in a Quantum world.

Ideally we would relax the preconditions of our theorems/seek the most general form of it. And we could interrogate if the generalised theorem is still applicable (in an interesting way) to the world we actually live in.

Going through your arguments and links, I find that there is one single desideratum for AI safety: steer the progress toward the iterative AI development process. Anything else is open-loop and is guaranteed to miss something crucially important. You can prove some no-go theorems, solve the Lobian obstacle, calculate some safety guarantees given certain assumptions, and then reality will still do something unexpected, as it always does.

Basically, my conclusion is the opposite of yours: given that in the open-loop AGI possible worlds our odds of affecting the outcome are pretty dire, while in the closed-loop worlds they are pretty good, the most promising course of action is to focus on shifting the probability of possible worlds from open-loop to closed-loop. Were I at MIRI, I’d “halt and catch fire” given the recent rather unexpected GPT progress and the wealth of non-lethal data it provided, and work on a strategy to enable closing the loop early and often. The current MIRI dogma I would burn is that “we must get it right the first time or else”, and replace it with “we must figure out a way to iterate, or else”. I guess that is what Paul Christiano has been arguing for, and the way AI research labs work.… so probably nothing new here.

The idea that AI Alignment can mostly be solved by iteration is perhaps a unstated crux amongst the AI safety community, since it bears on so many strategy questions.

I’m very interested in mechanistic Interpretability to provide a testing ground for:
* Selection Theorems
* Natural Abstractions
* Shard Theory
* Other theories about neural networks

Oh, I agree 100% that “the empirical data should serve to let us refine and enhance our theories, not to displace them”. That is how science works in general. My beef is with focusing mostly on theory because “we only have one shot”. My point is “if you think you only have one shot, figure out how to get more shots”.

I don’t think we only have one shot in the mainline (I expect slow takeoff). I think theory is especially valuable if we only have one (or a few) shots.

General comment: I think its common for people working on theorems & math to forget the ways that the real world can violate their mathematical assumptions, even in very unlikely ways. The following is a concrete story of one thing that may go wrong. It is unlikely this particular problem goes wrong when proving any proofs or doing any theorems. The point is to illustrate a general lesson about subtle ways the world doesn’t live up to your platonic expectations.

For example, suppose that I have a proof that my AGI plan is 100% safe. I have a provably correct enumeration of all corrigibility properties, and a completely formally specified training story for why my architecture and training process will result in an AGI with those corrigibility properties, using some fancy new differential equation math my team invented & proved. We’ve run the whole thing through an automated proof checker, and it returned ‘yes, this is indeed a corrigible AGI you get at the end of this process’. None of us know how to program though, so we haven’t actually seen whether our new fancy differential equation math gives correct predictions on ML stuff.

We run it, and then it kills us. How is this possible?

||It turns out, there was a saddle point midway through training, and instead of going off to the (generalized-)left part of the saddle point like our differential equations would have done, due to the fact that we have a non-infinitesimal step-size, our model went off to the (generalized-)right part of the saddle point. We can also imagine other sources of non-platonicity in our physical system, like how its operating using floating point numbers instead of rationals or reals, the existence of bitflip errors in computers, the possibility of EM interference, how we aren’t actually the ones training the network, but we hired out contractors to do it for us on the cloud, and all the ways they could mess with our assumptions (for instance, maybe they mix up our Very Strict Reward Schedule, forgetting to train in maze environments [which secretly train out deceptive behaviors during a particular developmental milestone we’ve located in which this is possible] during epochs 500-550, realize this mistake only on epoch 729, and so train in maze environments in epochs 730-780 [where our model has already learned deceptive behaviors, passing that developmental milestone, so this does nothing]).||

More generally, just because you have theorems showing that if so-and-so variables matter, in the real world other variables may dominate, and the actually winning theorems lie in showing what you need to do to those other variables in order to achieve your objectives.

Robust to adversarial optimization?

I generally think of this in the context of creating feedback loops for alignment work (and I’m sure this is one of John’s main reasons for focusing much on robustness to adversarial optimization). If you have found a vector in activation space you’ve discovered, and reason that it must be consistently encoding the truth/false value of your statements, then this is useful knowledge. If you then go to try to train your models to always output text which agrees with the direction of that vector, then you are both selecting for models which encode the truth/falseness of statements in not that vector, and for models that always output the truth. In less capable models, you may get the ‘model that always outputs the truth’ end of that coin flip. In more capable ones, there’s more options present for the kinds of ways it could encode truthiness, so its more likely you optimize away from your interests.

Edit: In general, I think of proofs as a sort of portable ground-level truth for intuitions and attempted generalizations of phenomena. If you are operating very very abstractly, it can be difficult to touch the ground, so instead you bound yourself to testing your hypotheses by formal verification from things you can in fact see from the ground. Similarly for attempted generalizations. You can’t see the future, but you can reason about it from things you do see in the present. The only danger is that both reasoning very very abstractly and attempting to generalize introduce the possibility that you forgot to pay attention to not obviously (or pretty obviously depending on how in the sky you are) variables which make your overall argument actually invalid in the real world.

General comment: I think its common for people working on theorems & math to forget the ways that the real world can violate their mathematical assumptions, even in very unlikely ways. The following is a concrete story of one thing that may go wrong. It is unlikely this particular problem goes wrong when proving any proofs or doing any theorems. The point is to illustrate a general lesson about subtle ways the world doesn’t live up to your platonic expectations.

I feel inclined to disagree here, as this doesn’t really match my philosophy/intuitions for constructing theorems. See my reply to @harfe:

My philosophy is to set up minimal preconditions that describe the universe we live in.

Such preconditions will probably also describe other universes and may not give us very strong statements about our particular universe, but that’s fine. We can iterate on our theorems and build progressive theorems that tighten the preconditions of earlier theorems and give us stricter post conditions.

If reality doesn’t conform to the preconditions of your theorem, then you chose overly restrictive preconditions. I think that’s just doing theorems badly, not necessarily an inherent flaw of theorems.

Choosing preconditions that describe reality isn’t that hard; e.g. we have multiple equivalently powerful models of computation that seem to accurately describe what is mechanically computable within our universe.

Even with simplifying/idealising assumptions algorithm designs, algorithm analysis and complexity theoretic results translate (modulo some minor adjustments) to real world computers.

For example, suppose that I have a proof that my AGI plan is 100% safe. I have a provably correct enumeration of all corrigibility properties, and a completely formally specified training story for why my architecture and training process will result in an AGI with those corrigibility properties, using some fancy new differential equation math my team invented & proved. We’ve run the whole thing through an automated proof checker, and it returned ‘yes, this is indeed a corrigible AGI you get at the end of this process’. None of us know how to program though, so we haven’t actually seen whether our new fancy differential equation math gives correct predictions on ML stuff.

Again, this violates my philosophy for proofs of safety guarantees. I want an iterative proof process. We start with minimal restrictive preconditions, see what results we can get, verify them empirically, after our meta level confidence in the theorems is strong, we can use it as foundations for further theorems, building more elaborate constructs that give us stronger guarantees:

We could synthesise similar theorems, combine disparate theorems into a unified whole, generalise existing theorems, etc.

There are a lot of tools in the mathematician’s toolbox.

Our first theorems should be simple/straightforward and easy to verify.

Building towers of abstraction in the sky without touching reality is another way to do maths wrong.

We’d both want empirical validation at many steps during the theorem process and those kind of theorems don’t really admit excessive detail as was present in your example.

If the theorems fail to apply, it’ll look like the preconditions not being met. An example of this is the coherence theorems. They assume (among others) the following preconditions:

Path independent preferences/no internal agent state influencing preferences

A total order over preferences

Static preferences that don’t change with time

They fail to apply to humans because human preferences are more complicated than this model.

Robust empirical verification that the preconditions of the theorems hold/obtain in practice would be necessary to construct theorems that aren’t disconnected from reality.

Yeah idk. I think it’s plausible that proofs and products trade off against eachother. There’s a sense in which theoretical legitimacy is a signal that priorities are not in order, that it selects for staff that get excited by the wrong things, whereas deeply product-oriented teams are focused on what users actually care about. In an arms race, or even just scarce modes like layoffs/cuts, we should expect theory-driven teams to either restructure themselves against theory or just lose.

One approach to drawing a reference class: central banking predates macroeconomic theory by centuries, gambling predates probability theory by millenia, enigma was cracked without formal information theory (though the lag here is under a decade). However, this approach fails in at least one of it’s exceptions that I’m thinking about: Godel/church/turing can be said to have provided the “theory of the x86 instruction set” a few decades in advance. Then there are ambiguous cases: like archimedean theory of pressure was available to the engineers of the roman aqueduct, but it’s not clear 1. that the theory was a game changer for the engineers or they would have figured it out anyway, or 2. whether it matters that the theory isn’t the same as the matured theory after civilization continued to iterate.

Importantly I think is observations in the blockchain space: ethereum is not theory driven, it’s very move fast break things, it’s subject to hundreds of millions of dollars of hacks or security failures, its contracting language is intentionally trying to appeal to javascript developers, etc. If theory gets you any gains at all, you should expect that proof-driven blockchain ecosystems to beat ethereum, if nothing else than for the security benefits of theoretical legitimacy. But that happened: a splinter group from the ethereum team went off and started a formal verification -friendly blockchain ecosystem, and (this is not investing advice) it hasn’t gone super well. Basically ethereum says “the whitepaper markets the product” and cardano says “the product implements the whitepaper”, and this strategy has not led to cardano winning. (Note: this example could be washed out by first mover advantage, ethereum was the first mover above an important threshold of contract expressibility, so no reasonable amount of technical superiority may viably work).

So yeah, in spite of not being convinced by the “engineering predates theory” reference class, I still think that an aligned takeoff strategy that rests on deployments abiding soundly and completely to specifications that everyone agrees are right has a deeply slim chance of getting us out of this thing.

Strongly upvoted, this was a very valuable objection.

My theory of change for theorems sounds like Cardano from your example:

[Copied from a comment I made on Discord.]

Perhaps it’s a pessimistic bet, but decorrelated alignment bets are good. 😋

Nice post! Here are some thoughts:

We do not necessarily need fully formally proven theorems, other forms of high confidence in safety could be sufficient. For example, I would be comfortable with turning on an AI that is safe iff the Goldbach conjecture is true, even if we have no proof of the Goldbach conjecture.

We currently don’t have any idea what kind of theorems we want to prove. Formulating the right theorem is likely more difficult than proving it.

Theorems can rely on imperfect assumptions (that are not exact as in the real world). In such a case, it is not clear that they give us the degree of high confidence that we would like to have.

Theorems that rely on imperfect assumptions could still be very valuable and increase overall safety, nonetheless. For example, if we could prove something like “this software is corrigible, assuming we are in a world run by Newtonian physics” then this could (depending on the details) be high evidence for the software being corrigible in a Quantum world.

Sounds like the kind of formal arguments I was alluding to in the post. Here you have a theorem of the form: X⟹this system is existentially safe

Where X is a proposition. Theorems of the above form are most useful if we have significantly higher confidence in X than our prior confidence in the system’s existential safety.

Strong disagree. I have a good idea on the kind of theorems I’ll be searching for over the next few years.

While I currently haven’t formulated any theorems I want to prove, my stumbling block to me appears to be insufficient mathematical maturity [I’ve set aside the project to study more first] and not intrinsic difficulty of the endeavour.

My philosophy is to set up minimal preconditions that describe the universe we live in.

Such preconditions will probably also describe other universes and may not give us very strong statements about our particular universe, but that’s fine. We can iterate on our theorems and build progressive theorems that tighten the preconditions of earlier theorems and give us stricter post conditions.

We could synthesise similar theorems, combine disparate theorems into a unified whole, generalise existing theorems, etc.

There are a lot of tools in the mathematician’s toolbox.

Ideally we would relax the preconditions of our theorems/seek the most general form of it. And we could interrogate if the generalised theorem is still applicable (in an interesting way) to the world we actually live in.

Going through your arguments and links, I find that there is one single desideratum for AI safety: steer the progress toward the iterative AI development process. Anything else is open-loop and is guaranteed to miss something crucially important. You can prove some no-go theorems, solve the Lobian obstacle, calculate some safety guarantees given certain assumptions, and then reality will still do something unexpected, as it always does.

Basically, my conclusion is the opposite of yours: given that in the open-loop AGI possible worlds our odds of affecting the outcome are pretty dire, while in the closed-loop worlds they are pretty good, the most promising course of action is to focus on shifting the probability of possible worlds from open-loop to closed-loop. Were I at MIRI, I’d “halt and catch fire” given the recent rather unexpected GPT progress and the wealth of non-lethal data it provided, and work on a strategy to enable closing the loop early and often. The current MIRI dogma I would burn is that “we must get it right the first time or else”, and replace it with “we must figure out a way to iterate, or else”. I guess that is what Paul Christiano has been arguing for, and the way AI research labs work.… so probably nothing new here.

The idea that AI Alignment can mostly be solved by iteration is perhaps a unstated crux amongst the AI safety community, since it bears on so many strategy questions.

I prefer theory rooted in solid empirical data

^{[1]}.I’m sympathetic to iterative cycles, but the empirical data should serve to let us refine and enhance our theories, not to displace them.

Empirical data does not exist in the absence of theory; observations only convey information after interpretation through particular theories.

The power of formal guarantees to:

Apply even as the system scales up

Generalise far out of distribution

Confer very high “all things considered” confidence

Transfer to derivative systems

Apply even under adversarial optimisation?

Remain desiderata that arguments for existential safety of powerful AI systems need to satisfy.

I’m very interested in mechanistic Interpretability to provide a testing ground for:

* Selection Theorems

* Natural Abstractions

* Shard Theory

* Other theories about neural networks

Oh, I agree 100% that “the empirical data should serve to let us refine and enhance our theories, not to displace them”. That is how science works in general. My beef is with focusing mostly on theory because “we only have one shot”. My point is “if you think you only have one shot, figure out how to get more shots”.

I don’t think we only have one shot in the mainline (I expect slow takeoff). I think theory is especially valuable if we only have one (or a few) shots.

I should edit the OP to make that clear.

General comment: I think its common for people working on theorems & math to forget the ways that the real world can violate their mathematical assumptions, even in very unlikely ways. The following is a concrete story of one thing that may go wrong. It is unlikely this particular problem goes wrong when proving any proofs or doing any theorems. The point is to illustrate a

generallesson about subtle ways the world doesn’t live up to your platonic expectations.For example, suppose that I have a proof that my AGI plan is 100% safe. I have a provably correct enumeration of all corrigibility properties, and a completely formally specified training story for why my architecture and training process will result in an AGI with those corrigibility properties, using some fancy new differential equation math my team invented & proved. We’ve run the whole thing through an automated proof checker, and it returned ‘yes, this is indeed a corrigible AGI you get at the end of this process’. None of us know how to program though, so we haven’t actually seen whether our new fancy differential equation math gives correct predictions on ML stuff.

We run it, and then it kills us. How is this possible?

||It turns out, there was a saddle point midway through training, and instead of going off to the (generalized-)left part of the saddle point like our differential equations would have done, due to the fact that we have a non-infinitesimal step-size, our model went off to the (generalized-)right part of the saddle point. We can also imagine other sources of non-platonicity in our physical system, like how its operating using floating point numbers instead of rationals or reals, the existence of bitflip errors in computers, the possibility of EM interference, how

wearen’t actually the ones training the network, but we hired out contractors to do it for us on the cloud, and all the ways they could mess with our assumptions (for instance, maybe they mix up our Very Strict Reward Schedule, forgetting to train in maze environments [which secretly train out deceptive behaviors during a particular developmental milestone we’ve located in which this is possible] during epochs 500-550, realize this mistake only on epoch 729, and so train in maze environments in epochs 730-780 [where our model has already learned deceptive behaviors, passing that developmental milestone, so this does nothing]).||More generally, just because you have theorems showing that if so-and-so variables matter, in the real world other variables may dominate, and the actually winning theorems lie in showing what you need to do to those other variables in order to achieve your objectives.

I generally think of this in the context of creating feedback loops for alignment work (and I’m sure this is one of John’s main reasons for focusing much on robustness to adversarial optimization). If you have found a vector in activation space you’ve discovered, and reason that it must be consistently encoding the truth/false value of your statements, then this is useful knowledge. If you then go to try to train your models to always output text which agrees with the direction of that vector, then you are both selecting for models which encode the truth/falseness of statements in not that vector, and for models that always output the truth. In less capable models, you may get the ‘model that always outputs the truth’ end of that coin flip. In more capable ones, there’s more options present for the kinds of ways it could encode truthiness, so its more likely you optimize away from your interests.

Edit: In general, I think of proofs as a sort of portable ground-level truth for intuitions and attempted generalizations of phenomena. If you are operating very very abstractly, it can be difficult to touch the ground, so instead you bound yourself to testing your hypotheses by formal verification from things you can in fact see from the ground. Similarly for attempted generalizations. You can’t see the future, but you can reason about it from things you do see in the present. The only danger is that both reasoning very very abstractly and attempting to generalize introduce the possibility that you forgot to pay attention to not obviously (or pretty obviously depending on how in the sky you are) variables which make your overall argument actually invalid in the real world.

I feel inclined to disagree here, as this doesn’t really match my philosophy/intuitions for constructing theorems. See my reply to @harfe:

If reality doesn’t conform to the preconditions of your theorem, then you chose overly restrictive preconditions. I think that’s just doing theorems badly, not necessarily an inherent flaw of theorems.

Choosing preconditions that describe reality isn’t that hard; e.g. we have multiple equivalently powerful models of computation that seem to accurately describe what is mechanically computable within our universe.

Even with simplifying/idealising assumptions algorithm designs, algorithm analysis and complexity theoretic results translate (modulo some minor adjustments) to real world computers.

Again, this violates my philosophy for proofs of safety guarantees. I want an iterative proof process. We start with minimal restrictive preconditions, see what results we can get, verify them empirically, after our meta level confidence in the theorems is strong, we can use it as foundations for further theorems, building more elaborate constructs that give us stronger guarantees:

Our first theorems should be simple/straightforward and easy to verify.

Building towers of abstraction in the sky without touching reality is another way to do maths wrong.

To be clear, the theorems I vibe with the most are a generalisation of the selection theorems agenda.

We’d both want empirical validation at many steps during the theorem process and those kind of theorems don’t really admit excessive detail as was present in your example.

If the theorems fail to apply, it’ll look like the preconditions not being met. An example of this is the coherence theorems. They assume (among others) the following preconditions:

Path independent preferences/no internal agent state influencing preferences

A total order over preferences

Static preferences that don’t change with time

They fail to apply to humans because human preferences are more complicated than this model.

Robust empirical verification that the preconditions of the theorems hold/obtain in practice would be necessary to construct theorems that aren’t disconnected from reality.