Ok, I think I’m following you (though I am tired, so who knows :)).
For me the crux seems to be: We can’t assume in general that pol(P) isn’t terrible at optimising for P. We can “do our best” and still screw up catastrophically.
If assuming “pol(P) is always a good optimiser for P” were actually realistic (and I assume you’re not!), then we wouldn’t have an alignment problem: we’d be assuming away any possibility of making a catastrophic error.
If we just assume “pol(P) is always a good optimiser for P” for the purpose of non-obstruction definitions/calculations, then our AI can adopt policies of the following form:
Take actions with the following consequences:
If (humans act according to a policy that optimises well for P), then humans are empowered on P.
Otherwise, consequences can be arbitrarily bad.
Once the AI’s bar on the quality of pol is high, we have no AU guarantees at all if we fail to meet that bar. This seems like an untenable approach to me, so I’m not assuming that pol is reliably/uniformly good at optimising.
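To make that worry concrete, here's a toy Python sketch (all names and numbers are invented for illustration): a "knife-edge" AI policy passes a non-obstruction check that only consults an idealised pol, while being catastrophic under any imperfect pol.

```python
# Toy sketch (made-up numbers): if non-obstruction is only checked against
# an idealised pol, a "knife-edge" AI policy can pass the check while being
# catastrophic for any realistic, imperfect pol.

def attainable_utility(ai_policy, human_policy_quality):
    """Value humans attain for their payoff P, given the AI's prior actions."""
    if ai_policy == "knife_edge":
        # Great if humans then optimise perfectly, terrible otherwise.
        return 100 if human_policy_quality == "perfect" else -1000
    if ai_policy == "robustly_helpful":
        return 90 if human_policy_quality == "perfect" else 50
    return 0

# Non-obstruction evaluated under "pol is always a good optimiser":
# the knife-edge policy looks at least as good as the robust one...
assert attainable_utility("knife_edge", "perfect") >= \
       attainable_utility("robustly_helpful", "perfect")

# ...but once pol can be flawed, there is no AU guarantee left at all.
assert attainable_utility("knife_edge", "flawed") < \
       attainable_utility("robustly_helpful", "flawed")
```

This is only a two-policy caricature, but it shows the shape of the failure: the check is blind to everything below the idealised-pol bar.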
So e.g. in my diagram, I’m assuming that for every P in S, humans screw up and accidentally create the 80s optimiser (let’s say the 80s optimiser was released prematurely through an error). That may be unlikely: the more reasonable proposition would be that this happens for some subset T of S larger than simply P = 80s utopia. If for all P in T, pol(P) gets you 80s utopia, that will look like a spike on T peaking at P = 80s utopia.
The maximum of this spike may only be achievable by optimising early for 80s utopia (before some period of long reflection that allows us to optimise well across T).
However, once this spike is present for P = 80s utopia, our AI is required by non-obstruction to match that maximum for P = 80s utopia. If it’s still possible that we do want 80s utopia when the premature optimisation would start under pol(P) for P in T, the AI is required to support that optimisation—even if the consequences across the rest of T are needlessly suboptimal (relative to what would be possible for the AI; clearly they still improve on pol, because pol wasn’t good).
To assume that my claim (2) doesn’t hold is to assume that there’s no subset T of S where this kind of thing happens by default. That seems unlikely to me—unless we’re in a world where the alignment problem gets solved very well without non-obstruction. For instance, this can happen if you have a payoff function on T which accidentally misses out some component that’s valuable-but-not-vital over almost all of T, but zero for one member. You may optimise hard for the zero member, sacrificing the component you missed out, and later realise that you actually wanted the non-zero version.
Personally, I’d guess that this kind of thing would happen over many such subsets, so you’d have a green line with a load of spikes, each negatively impacting a very small part of the line as a trade-off to achieve the high spike.
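The spike can be sketched numerically (all values are made up for illustration): a flawed pol lands on 80s utopia for every P in T, so V_P(pol(P)) is high only at P = 80s utopia, and that single spike disallows a uniformly better AI policy.

```python
# Toy model (invented numbers) of the "spike": for every P in a subset T,
# a flawed pol prematurely builds an 80s-utopia optimiser, so V_P(pol(P))
# is high only where P actually *is* 80s utopia.

T = ["80s_utopia", "90s_utopia", "00s_utopia", "10s_utopia"]

def v_pol(P):
    """Value of the flawed default pol(P): it always yields 80s utopia."""
    return 95 if P == "80s_utopia" else 20  # one high spike, low elsewhere

def v_broad_ai(P):
    """A broad AI-supported optimisation: much better across T as a whole,
    but below the peak achievable by premature 80s-utopia optimisation."""
    return 85

# Non-obstruction on T demands the AI do no worse than pol anywhere, so the
# broad policy is ruled out purely because of the spike at 80s_utopia:
violations = [P for P in T if v_broad_ai(P) < v_pol(P)]
print(violations)  # ['80s_utopia']
```

The broad policy loses 10 points at the spike and gains 65 everywhere else, yet the per-P comparison forbids it.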
To take your P vs -P example, the “give money then shut off” only reliably works if we assume pol(P) and pol(-P) are sufficiently good optimisers. (though probably the bar isn’t high here)
To take a trivial (but possible) example of its going wrong, imagine that pol(-P) involves using software with some hidden absolute value call that inadvertently converts -P optimisation into P optimisation.
Now giving the money doesn’t work, since it makes things worse for V(-P)(pol(-P)).
The AI can shut off without doing anything, but it can’t necessarily do the helpful thing: saying “Hang on a bit and delay optimisation: you need to fix this absolute value bug”, unless that delay doesn’t cost anything for P optimisation.
This case is probably ok either with a generous epsilon, or the assumption that the AI has the capacity to help either optimisation similarly. But in general there’ll be problems of similar form which aren’t so simply resolved.
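The absolute-value bug is easy to demonstrate in miniature (a deliberately trivial sketch: `outcomes` and the payoff are invented, with P just preferring larger numbers):

```python
# A hidden abs() call silently converts -P optimisation into P optimisation:
# a buggy optimiser for -P that scores outcomes by abs(-P(x)) picks P's optimum.

outcomes = [-3, -1, 0, 2, 5]

def P(x):
    return x  # payoff P: bigger is better

best_for_P = max(outcomes, key=P)                               # 5
buggy_best_for_negP = max(outcomes, key=lambda x: abs(-P(x)))   # also 5!
correct_best_for_negP = max(outcomes, key=lambda x: -P(x))      # -3
```

The buggy -P optimiser and the P optimiser agree exactly, which is why an AI subsidising "both sides" ends up aiding only one of them.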
Here I don’t like the constraint not to sacrifice a small amount of pol(P) value for a huge amount of pol(Q) value.
Hopefully that’s clear. Perhaps I’m still missing something, but I don’t see how assuming pol makes no big mistakes gets you very far (the AI is then free to optimise to put us on a knife-edge between chasms, and ‘blame’ us for falling). Once you allow pol to be a potentially catastrophically bad optimiser for some subset of S, I think you get the problems I outline in the post. I don’t think strategy-stealing is much of an issue where pol can screw up badly.
That’s the best I can outline my current thinking. If I’m still not seeing things clearly, I’ll have to rethink/regroup/sleep, since my brain is starting to struggle.
Once the AI’s bar on the quality of pol is high, we have no AU guarantees at all if we fail to meet that bar. This seems like an untenable approach to me
Er—non-obstruction is a conceptual frame for understanding the benefits we want from corrigibility. It is not a constraint under which the AI finds a high-scoring policy. It is not an approach to solving the alignment problem any more than Kepler’s laws are an approach for going to the moon.
Generally, broad non-obstruction seems to be at least as good as literal corrigibility. In my mind, the point of corrigibility is that we become more able to wield and amplify our influence through the AI. If pol(P) sucks, even if the AI is literally corrigible, we still won’t reach good outcomes. I don’t see how this kind of objection supports non-obstruction not being a good conceptual motivation for corrigibility in the real world, where pol is pretty reasonable for the relevant goals.
the “give money then shut off” only reliably works if we assume pol(P) and pol(-P) are sufficiently good optimisers.
I agree it’s possible for pol to shoot itself in the foot, but I was trying to give an example situation. I was not claiming that for every possible pol, giving money is non-obstructive against P and -P. I feel like that misses the point, and I don’t see how this kind of objection supports non-obstruction not being a good conceptual motivation for corrigibility.
The point of all this analysis is to think about why we want corrigibility in the real world, and whether there’s a generalized version of that desideratum. To remark that there exists an AI policy/pol pair which induces narrow non-obstruction, or which doesn’t empower pol a whole lot, or which makes silly tradeoffs… I guess I just don’t see the relevance of that for thinking about the alignment properties of a given AI system in the real world.
Thinking of corrigibility, it’s not clear to me that non-obstruction is quite what I want. Perhaps a closer version would be something like: A non-obstructive AI on S needs to do no worse for each P in S than pol(P | off & humans have all the AI’s knowledge)
This feels a bit patchy, but in principle it’d fix the most common/obvious issue of the kind I’m raising: that the AI would often otherwise have an incentive to hide information from the users so as to avoid ‘obstructing’ them when they change their minds.
I think this is more in the spirit of non-obstruction, since it compares the AI’s actions to a fully informed human baseline (I’m not claiming it’s precise, but in the direction that makes sense to me). Perhaps the extra information does smooth out any undesirable spikes the AI might anticipate.
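To illustrate the proposed baseline (all values hypothetical), here's a sketch comparing a non-obstruction check against the default pol with one against the informed, AI-off counterfactual:

```python
# Sketch of the alternative baseline: compare the AI not to pol(P) as-is,
# but to pol(P | off & humans have all the AI's knowledge). Values invented.

def non_obstructive(ai_values, baseline_values):
    """The AI does no worse than the baseline for every payoff P in S."""
    return all(ai_values[P] >= baseline_values[P] for P in baseline_values)

baseline_default  = {"P1": 40, "P2": 95}   # flawed default pol: spiky
baseline_informed = {"P1": 85, "P2": 90}   # pol given all the AI's knowledge
ai_shares_info    = {"P1": 85, "P2": 90}   # AI tells us what it knows

print(non_obstructive(ai_shares_info, baseline_default))   # False: P2 drops 95 -> 90
print(non_obstructive(ai_shares_info, baseline_informed))  # True under the new baseline
```

Under the default baseline the AI is forbidden from sharing the information (it costs 5 points on P2); under the informed baseline the spike never enters the comparison, so sharing is permitted.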
I do otherwise expect such issues to be common. But perhaps it’s usually about the AI knowing more than the humans.
I may well be wrong about any/all of this, but (unless I’m confused), it’s not a quibble about edge cases. If I’m wrong about default spikiness, then it’s much more of an edge case.
(You’re right that my P, -P example missed your main point. I meant it only as an example, not as a response to the point you were making with it; I should have realised that interpreting it as a direct response was natural, which made my overall point less clear. Apologies if that seemed less than constructive: not my intent.)
If pol(P) sucks, even if the AI is literally corrigible, we still won’t reach good outcomes.
If pol(P) sucks by default, a general AI (corrigible or otherwise) may be able to give us information I which:
Makes Vp(pol(P)) much higher, by making pol(P) given I suck a whole lot less.
Makes Vq(pol(Q)) a little lower, by making pol(Q) given I make concessions to allow pol(P) to perform better.
A non-obstructive AI can’t do that, since it’s required to maintain the AU for pol(Q).
A simple example is where P and Q currently look the same to us—so our pol(P) and pol(Q) have the same outcome [ETA for a long time at least, with potentially permanent AU consequences], which happens to be great for Vq(pol(Q)), but not so great for Vp(pol(P)).
In this situation, we want an AI that can tell us: “You may actually want either P or Q here. Here’s an optimisation that works 99% as well for Q, and much better than your current approach for P. Since you don’t currently know which you want, this is much better than your current optimisation for Q: that only does 40% as well for P.”
A non-obstructive AI cannot give us that information if it predicts it would lower Vq(pol(Q)) in so doing—which it probably would.
Does non-obstruction rule out lowering Vq(pol(Q)) in this way? If not, I’ve misunderstood you somewhere. If so, that’s a problem.
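Putting the quoted numbers (99%, 40%) into a toy calculation makes the tension explicit; the 50/50 uncertainty between P and Q and the absolute scale are my added assumptions:

```python
# Numeric version of the P-vs-Q example: disclosing the better joint
# optimisation raises V_P a lot and lowers V_Q slightly, so it is a large
# gain in expectation yet fails a per-payoff non-obstruction comparison.

V_Q_default, V_P_default = 100.0, 40.0   # current pol: great for Q, 40% for P
V_Q_informed, V_P_informed = 99.0, 90.0  # after the AI shares its information

# Assume (my assumption) we are 50/50 uncertain whether we truly want P or Q:
expected_gain = 0.5 * (V_P_informed - V_P_default) \
              + 0.5 * (V_Q_informed - V_Q_default)
print(expected_gain)  # 24.5: a big expected improvement

# ...yet non-obstruction compares payoff-by-payoff, not in expectation:
obstructs_Q = V_Q_informed < V_Q_default
print(obstructs_Q)  # True: the helpful disclosure lowers Vq(pol(Q))
```

If the per-payoff reading is right, the disclosure is ruled out despite the +24.5 expected gain; if it isn't, then I've misread the definition somewhere.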
I’m not sure I understand the distinction you’re making between a “conceptual frame”, and a “constraint under which...”.
[Non-obstruction with respect to a set S] must be a constraint of some kind. I’m simply saying that there are cases where it seems to rule out desirable behaviour—e.g. giving us information that allows us to trade a small potential AU penalty for a large potential AU gain, when we’re currently uncertain over which is our true payoff function.
Anyway, my brain is now dead, so I doubt I’ll be saying much that’s intelligible before tomorrow (if the preceding even qualifies :)).