But what if we instead design the system so that the leaked radio signal has zero mutual information with whatever signals are passed around inside the system?
You fundamentally cannot, so it’s a moot point. There is no way to confirm zero mutual information[1], and even if there were, the probability of the mutual information actually being zero is itself zero[2]. Very small, perhaps. Zero, no.
I do not follow your seeming dismissal of this. You acknowledge it, and then… assert it’s not a problem?
An analogy: solving the Halting problem is impossible[3]. It is sometimes useful to handwave a Halting oracle as a component of proofs regardless—but at the end of the day saying ‘solving the Halting problem is easy, just use a Halting oracle’ is not a solution.
Many people have an intuition like “everything is an imperfect proxy; we can never avoid Goodhart”. The point of the mutual information example is that this is basically wrong.

“Many people have an intuition like “everything is an imperfect halting-problem solver; we can never avoid Turing”. The point of the Halting oracle example is that this is basically wrong.”
Hopefully this illustrates my point.
In particular, confirming zero mutual information would require calculating the underlying distributions to infinite accuracy, which in turn requires an infinite sample. (Consider if I have two independent, perfectly fair coins. I flip each of them 3 times and get HHT/HHT. The measured mutual information is non-zero!) A small worked version of this example appears after these footnotes.
For a sufficient example: gravity causes any[4] two things in the universe[5] to correlate[6].
At least assuming the Church-Turing hypothesis is correct.
Except potentially if there’s an event horizon, although even that’s an open question, and in that case it’s a moot point because an AI in an event horizon is indistinguishable from no AI.
Strictly speaking, within each other’s lightcones.
And as soon as you have anything causing a correlation, the probability that other factors exactly cancel said correlation is zero.
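To make the coin example in footnote [1] concrete, here is a minimal sketch (my own; `empirical_mi` is a hypothetical plug-in estimator, not anything from the post). The true mutual information between two independent fair coins is zero, but the estimate from the 3-flip sample HHT/HHT comes out around 0.918 bits.

```python
from collections import Counter
from math import log2

def empirical_mi(xs, ys):
    """Plug-in estimate of mutual information (in bits) from paired samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # I(X;Y) = sum over observed (x, y) of p(x, y) * log2(p(x, y) / (p(x) * p(y)))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

# Two independent fair coins; the particular 3-flip sample happens to be HHT / HHT.
print(empirical_mi("HHT", "HHT"))  # ~0.918 bits, even though the true value is 0
```

Any such finite-sample estimate will generically come out nonzero, which is the footnote’s point.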
Thanks for bringing this up; it raises a technical point which didn’t make sense to include in the post, but which I was hoping someone would raise in the comments.
The key point: Goodhart problems are about generalization, not approximation.
Suppose I have a proxy u′ for a true utility function u, and u′ is always within ϵ of u (i.e. |u′−u|<ϵ). I maximize u′. Then the true utility u achieved will be within 2ϵ of the maximum achievable utility. Reasoning: in the worst case, u′ is ϵ lower than u at the u-maximizing point, and ϵ higher than u at the u′-maximizing point.
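Spelling the worst case out as a chain of inequalities (with x⋆ denoting the u-maximizer and x′ the proxy-maximizer; notation mine):

```latex
\begin{align*}
  u(x') &> u'(x') - \epsilon        && \text{$u'$ is within $\epsilon$ of $u$ at $x'$} \\
        &\ge u'(x^\star) - \epsilon && \text{$x'$ maximizes $u'$} \\
        &> u(x^\star) - 2\epsilon   && \text{$u'$ is within $\epsilon$ of $u$ at $x^\star$}
\end{align*}
```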
Point is: if a proxy is close to the true utility function everywhere, then we will indeed achieve close-to-maximal utility upon maximizing the proxy. Goodhart problems require the proxy to not even be approximately close, in at least some places.
When we look at real-world Goodhart problems, they indeed involve situations where some approximation only works well within some region, and ceases to even be a good approximation once we move well outside that region. That’s a generalization problem, not an approximation problem.
So approximations are fine, so long as they generalize well.
This is an interesting observation; I don’t see how it addresses my point.
There is no way to determine the exact mutual information from two finite samples. There is no way to get an ϵ-approximation of the mutual information from two finite samples, either.
=====
On the topic of said observation: beware that ϵ-approximations of many things are proven difficult to compute, and in some cases even are uncomputable. (The classic being Chaitin’s Constant[1].)
In particular, you very often end up with Halting-problem style contradictions when computing properties of systems capable of unbounded computation, and even approximations thereof.
Unfortunately, ‘value function of a powerful AI’ tends to fall into that category[2].
Which isn’t “a” constant, but that’s another matter.
Well, as closely as anything in the physical world does, anyway.
The Hardness of computing mutual information in general is not a very significant barrier to designing systems with (near-)zero mutual information between two components, in exactly the same way that the Hardness of computing whether a given program halts in general is not a very significant barrier to designing software which avoids infinite loops.
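For concreteness, a minimal sketch of the software side of that analogy (an illustration of mine, with a hypothetical `step` callback, not anything from the post): we don’t decide halting for arbitrary programs, we write the loop so that termination holds by construction.

```python
def run_bounded(step, state, max_steps=10_000):
    """Drive a (hypothetical) `step` function with an explicit fuel budget.

    Termination is guaranteed by construction -- the loop runs at most
    `max_steps` times -- without deciding anything about `step` in general.
    """
    for _ in range(max_steps):
        state, done = step(state)
        if done:
            return state
    raise RuntimeError("step budget exhausted")
```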
Let us make a distinction here between two cases:
Observing the input and output of a blackbox X, and checking a property thereof.
Whitebox knowledge of X, and checking a property thereof.
In physical systems, we do not have whitebox knowledge. We merely have a finite sample of a blackbox[1]. Sometimes said finite sample of a blackbox appears to match a fairly straightforward machine Y, but that’s about the best we can say[2].
And yes, checking whether two specific Turing-complete blackboxes are equivalent is undecidable[3], even though checking whether two specific Turing-complete whiteboxes are equivalent may be decidable.
It is not exactly the same way, due to the above.
Namely, ‘the laws of physics’.
(And worse, often doesn’t exactly match in the observations thus far, or results in contradictions.)
Trivially, due to indistinguishability issues. For any finite sequence of inputs and outputs, there are multiple machines X and X’ which produce those outputs on those inputs but whose later outputs diverge. This is not a problem in the whitebox case, because said machines are distinguishable.
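A minimal illustration of such a pair (hypothetical functions of my own):

```python
LIMIT = 10**6  # hypothetical bound, larger than any input observed so far

def machine_x(n):
    return n % 2

def machine_x_prime(n):
    # Identical to machine_x on every input we have happened to test...
    if n < LIMIT:
        return n % 2
    # ...but diverging beyond that. No finite set of observations rules this out.
    return 1 - (n % 2)
```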
You can ensure zero mutual information by building a sufficiently thick lead wall. By convention in engineering, any number is understood as a range, based on the number of significant digits relevant to the calculation. So “zero” is best understood as “zero within some tolerance”. So long as we are not facing an intelligent and resourceful adversary, there will probably be a human-achievable amount of lead which cancels the signal sufficiently.
This serves to illustrate the point that sometimes we can find ways to bound an error to within desirable tolerances, even if we do not yet know how to do such a thing in the face of the immense optimization pressure which superhuman AGI would bring to bear on a problem.
We need plans to have achievable tolerances. For example, we need to assume a realistic amount of hardware failure. We can’t treat the hardware as a blackbox; we know how it operates, and we have to make use of that knowledge. But we can’t pretend perfect mathematical knowledge of it, either; we have error tolerances.
So your blackbox/whitebox dichotomy doesn’t fit the situation very well.
But do you really buy the whole analogy with mutual information, i.e., buy the claim that we can judge the viability of escaping Goodhart from this one example, and only object that the judgement with respect to this example was incorrect?
Perhaps we should really look at a range of examples, not just one? And judge John’s point as reasonable if and only if we can find some cases where effectively perfect proxies were found?
Ah, but perhaps your objection is that the difficulty of the AI alignment problem suggests that we do in fact need the analog of perfect zero correlation in order to succeed. So John’s plan sounds doomed to failure, because it relies on finding an actually-perfect proxy, when all realistic proxies are imprecise at least in their physical tolerances.
In which case, I would reply that the idea is not to try and contain a malign AGI which is already not on our side. The plan, to the extent that there is one, is to create systems that are on our side, and apply their optimization pressure to the task of keeping the plan on-course. So there is hope that we will not end up in a situation where every tiny flaw is exploited. What we are looking for is plans which robustly get us to that point.
Ah, but perhaps your objection is that the difficulty of the AI alignment problem suggests that we do in fact need the analog of perfect zero correlation in order to succeed.

My objection is actually mostly to the example itself.
As you mention:

the idea is not to try and contain a malign AGI which is already not on our side. The plan, to the extent that there is one, is to create systems that are on our side, and apply their optimization pressure to the task of keeping the plan on-course.
Compare with the example:

Suppose we’re designing some secure electronic equipment, and we’re concerned about the system leaking information to adversaries via a radio side-channel.

[...]

But what if we instead design the system so that the leaked radio signal has zero mutual information with whatever signals are passed around inside the system? Then it doesn’t matter how much optimization pressure an adversary applies, they’re not going to figure out anything about those internal signals via leaked radio.
This is analogous to the case of… trying to contain a malign AI which is already not on our side.
Fair enough! I admit that John did not actually provide an argument for why alignment might be achievable by “guessing true names”. I think the approach makes sense, but my argument for why this is the case does differ from John’s arguments here.
I think TLW’s criticism is important, and I don’t think your responses are sufficient. I also think the original example is confusing; I’ve met several people who, after reading OP, seemed to me confused about how engineers could use the concept of mutual information.
Here is my attempt to expand your argument.
We’re trying to design some secure electronic equipment. We want the internal state and some of the outputs to be secret. Maybe we want all of the outputs to be secret, but we’ve given up on that (for example, radio shielding might not be practical or reliable enough). When we’re trying to design things so that the internal state and outputs are secret, there are a couple of sources of failure.
One source of failure is failing to model the interactions between the components of our systems. Maybe there is an output we don’t know about (like the vibrations the electronics make while operating), or maybe there is an interaction we’re not aware of (like magnetic coupling between two components we’re treating as independent).
Another source of failure is failing to consider all the ways that an adversary could exploit the interactions we do know about. In your example, we fail to consider how an adversary could exploit higher-order correlations between emitted radio waves and the state of the electronic internals.
A true name, in principle, allows us to avoid the second kind of failure. In high-dimensional state spaces, we might need to get kind of clever to prove the lack of mutual information. But it’s a fairly delimited analytic problem, and we at least know what a good answer would look like.
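As one concrete shape such an answer can take (my illustration, using a one-time-pad style construction, not something from the OP): if the design guarantees that the emitted signal is the internal signal masked by a pad that is uniform and independent of it, the mutual information is exactly zero by construction.

```latex
% S: internal signal, K: uniform pad independent of S, E: emitted signal.
% Then E = S \oplus K is itself uniform and independent of S, hence
\begin{equation*}
  I(S;\,E) \;=\; I(S;\, S \oplus K) \;=\; 0 .
\end{equation*}
```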
The true name could also guide our investigations into our system, to help us avoid the first kind of failure. “Huh, we just made the adder have a more complicated behaviour as an optimisation. Could the unevenness of that optimisation over the input distribution leak information about the adder’s inputs to another part of the system?”
Now, reader, you might worry that the chosen example of a True Name leaves an implementation gap wide enough for a human adversary to drive an exploit through. And I think that’s a pretty good complaint. The best defence I can muster is that it guides and organises the defender’s thinking. You get to do proofs-given-assumptions, and you get more clarity about how to think if your assumptions are wrong.
To the extent that the idea is that True Names are part of a strategy to come up with approaches that are unbounded-optimisation-proof, I think that defence doesn’t work and the strategy is kind of sunk.
On the other hand, here is an argument that I find plausible. In the end, we’ve got to make some argument that when we flick some switch or continue down some road, things will be OK. And there’s a big messy space of considerations to navigate to that end. True Names are necessary to have any hope of compressing the domain enough that you can make arguments that stand up.
I think that’s basically right, and good job explaining it clearly and compactly.
I would also highlight that it’s not just about adversaries. One of the main powers of proof-given-assumptions is that it allows us to rule out large classes of unknown unknowns in one go. And, insofar as the things-proven-given-assumptions turn out to be false, it allows us to detect previously-unknown unknowns.
The fact that the mutual information cannot be zero is a good and interesting point. But, as I understand it, this is not fundamentally a barrier to it being a good “true name”. It’s the right target; the impossibility of hitting it doesn’t change that.
This is the part I was disagreeing with, to be clear.