Once the AI’s bar on the quality of pol is high, we have no AU guarantees at all if we fail to meet that bar. This seems like an untenable approach to me.
Er—non-obstruction is a conceptual frame for understanding the benefits we want from corrigibility. It is not a constraint under which the AI finds a high-scoring policy. It is not an approach to solving the alignment problem any more than Kepler’s laws are an approach for going to the moon.
Generally, broad non-obstruction seems to be at least as good as literal corrigibility. In my mind, the point of corrigibility is that we become more able to wield and amplify our influence through the AI. If pol(P) sucks, even if the AI is literally corrigible, we still won’t reach good outcomes. I don’t see how this kind of objection supports non-obstruction not being a good conceptual motivation for corrigibility in the real world, where pol is pretty reasonable for the relevant goals.
The “give money then shut off” example only reliably works if we assume pol(P) and pol(-P) are sufficiently good optimisers.
I agree it’s possible for pol to shoot itself in the foot, but I was trying to give an example situation. I was not claiming that for every possible pol, giving money is non-obstructive against P and -P. I feel like that misses the point, and I don’t see how this kind of objection supports non-obstruction not being a good conceptual motivation for corrigibility.
The point of all this analysis is to think about why we want corrigibility in the real world, and whether there’s a generalized version of that desideratum. To remark that there exists an AI policy/pol pair which induces narrow non-obstruction, or which doesn’t empower pol a whole lot, or which makes silly tradeoffs… I guess I just don’t see the relevance of that for thinking about the alignment properties of a given AI system in the real world.
Thinking of corrigibility, it’s not clear to me that non-obstruction is quite what I want. Perhaps a closer version would be something like:
A non-obstructive AI on S needs to do no worse for each P in S than pol(P | off & humans have all the AI’s knowledge).
This feels a bit patchy, but in principle it’d fix the most common/obvious issue of the kind I’m raising: that the AI would often otherwise have an incentive to hide information from the users so as to avoid ‘obstructing’ them when they change their minds.
I think this is more in the spirit of non-obstruction, since it compares the AI’s actions to a fully informed human baseline (I’m not claiming it’s precise, but in the direction that makes sense to me). Perhaps the extra information does smooth out any undesirable spikes the AI might anticipate.
I do otherwise expect such issues to be common. But perhaps it’s usually about the AI knowing more than the humans.
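To pin down why the informed baseline removes the information-hiding incentive, here is a toy sketch (Python; every number is made up for illustration, and `non_obstructive` is my own hypothetical encoding of the constraint, not anything precise from the original post):

```python
# Toy sketch: two candidate baselines for non-obstruction.
# V[P][baseline] = attainable utility for payoff function P under that baseline.
# All values are illustrative assumptions.
V = {
    "P": {"uninformed_humans": 0.40, "informed_humans": 0.90},
    "Q": {"uninformed_humans": 1.00, "informed_humans": 0.99},
}

# Outcome if the AI shares its knowledge with us.
ai_shares_knowledge = {"P": 0.90, "Q": 0.99}

def non_obstructive(outcome, baseline_key):
    # The AI must do no worse, for each payoff function, than the chosen baseline.
    return all(outcome[P] >= V[P][baseline_key] for P in V)

# Against the uninformed baseline, sharing knowledge "obstructs" Q (0.99 < 1.00)...
print(non_obstructive(ai_shares_knowledge, "uninformed_humans"))  # False
# ...but against the fully informed baseline, sharing is permitted.
print(non_obstructive(ai_shares_knowledge, "informed_humans"))    # True
```

On this (hypothetical) encoding, the uninformed baseline penalises the AI for sharing knowledge whenever doing so lowers some V, while the informed baseline already folds the knowledge into the comparison point, so sharing can’t count as obstruction.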
I may well be wrong about any/all of this, but (unless I’m confused) it’s not a quibble about edge cases. If I’m wrong about default spikiness, then it’s much more of an edge case.
(You’re right that my P, -P example missed your main point; I meant it only as an example, not as a response to the point you were making. I should have realised that interpreting it as a direct response was natural, and that this would make my overall point less clear. Apologies if it seemed less than constructive; that wasn’t my intent.)
If pol(P) sucks, even if the AI is literally corrigible, we still won’t reach good outcomes.
If pol(P) sucks by default, a general AI (corrigible or otherwise) may be able to give us information I which:
Makes Vp(pol(P)) much higher, by making pol(P)-given-I suck a whole lot less.
Makes Vq(pol(Q)) a little lower, by making pol(Q)-given-I make concessions that allow pol(P) to perform better.
A non-obstructive AI can’t do that, since it’s required to maintain the AU for pol(Q).
A simple example is where P and Q currently look the same to us—so our pol(P) and pol(Q) have the same outcome [ETA: for a long time at least, with potentially permanent AU consequences], which happens to be great for Vq(pol(Q)), but not so great for Vp(pol(P)).
In this situation, we want an AI that can tell us:
“You may actually want either P or Q here. Here’s an optimisation that works 99% as well for Q, and much better than your current approach for P. Since you don’t currently know which you want, this is much better than your current optimisation for Q: that only does 40% as well for P.”
A non-obstructive AI cannot give us that information if it predicts it would lower Vq(pol(Q)) in so doing—which it probably would.
Does non-obstruction rule out lowering Vq(pol(Q)) in this way? If not, I’ve misunderstood you somewhere. If so, that’s a problem.
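Here is the same worry as a toy calculation (Python; the 99% and 40% figures come from the example above, while the 0.90 for “much better for P” and the 50/50 uncertainty are illustrative assumptions of mine):

```python
# Toy model of the P/Q information-sharing example.
# Attainable utility (AU) of each plan under each candidate "true" payoff function.
status_quo = {"Q": 1.00, "P": 0.40}   # pol(Q) as it stands: great for Q, 40% for P
with_info  = {"Q": 0.99, "P": 0.90}   # plan after the AI shares information I
                                       # (0.90 is an illustrative guess)

# Strict non-obstruction w.r.t. Q: the AI's action must not lower V_Q(pol(Q)).
violates_non_obstruction_Q = with_info["Q"] < status_quo["Q"]

# Under uncertainty (assumed 50/50) over whether we really want P or Q,
# the informed plan is much better in expectation.
p_P = 0.5
ev_status_quo = p_P * status_quo["P"] + (1 - p_P) * status_quo["Q"]
ev_with_info  = p_P * with_info["P"]  + (1 - p_P) * with_info["Q"]

print(violates_non_obstruction_Q)    # True: sharing I lowers V_Q(pol(Q)) a little
print(ev_with_info > ev_status_quo)  # True: yet sharing is better in expectation
print(ev_status_quo, ev_with_info)   # 0.7 vs 0.945
```

If this toy reading is right, strict non-obstruction forbids an action that trades a small AU penalty under Q for a large AU gain under P, exactly the desirable behaviour at issue.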
I’m not sure I understand the distinction you’re making between a “conceptual frame”, and a “constraint under which...”.
[Non-obstruction with respect to a set S] must be a constraint of some kind. I’m simply saying that there are cases where it seems to rule out desirable behaviour—e.g. giving us information that allows us to trade a small potential AU penalty for a large potential AU gain, when we’re currently uncertain over which is our true payoff function.
Anyway, my brain is now dead. So I doubt I’ll be saying much intelligible before tomorrow (if the preceding even qualifies :)).