As an author, I’m always excited to see posts like this—thanks for writing this up!
I think there are a couple of important points here, and also a couple of apparent misunderstandings. I’m not sure I understood all of your points, so let me know if I missed something.
Here are your claims:
1. Non-obstruction seems to be useful where our AU landscape is pretty flat by default.
2. Our AU landscape is probably spiky by default.
3. Non-obstruction locks in default spike-tops in S, since it can only make Pareto improvements (modulo an epsilon here or there).
4. Locking in spike-tops is better than nothing, but we can, and should, do better.
I disagree with #2. In an appropriately analyzed multi-agent system, an individual will be better at some things, and worse at other things. Obviously, strategy-stealing is an important factor here. But in the main, I think that strategy-stealing will hold well enough for this analysis, and that the human policy function can counterfactually find reasonable ways to pursue different goals, and so it won’t be overwhelmingly spiky. This isn’t a crux for me, though.
I agree with #3 and #4. The AU landscape implies a partial ordering over AI designs, and non-obstruction just demands that you do better than a certain baseline (to be precise: that the AI be greater than a join over various AIs which mediocrely optimize a fixed goal). There are many ways to do better than the green line (the human AU landscape without the AI); I think one of the simplest is just to be broadly empowering.
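(For reference, the non-obstruction condition is roughly

$$\forall P \in S:\quad V_P\big(\mathrm{pol}(P \mid \text{AI on})\big) \;\ge\; V_P\big(\mathrm{pol}(P \mid \text{AI off})\big),$$

where pol(P | ·) is the human policy function conditioned on pursuing P in that counterfactual, and V_P measures how well the resulting policy actually does on P.)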
Let me get into some specifics where we might disagree / there might be a misunderstanding. In response to Adam, you wrote:
Oh it’s possible to add up a load of spikes, many of which hit the wrong target, but miraculously cancel out to produce a flat landscape. It’s just hugely unlikely. To expect this would seem silly.
We aren’t adding or averaging anything, when computing the AU landscape. Each setting of the independent variable (the set of goals we might optimize) induces a counterfactual where we condition our policy function on the relevant goal, and then follow the policy from that state. The dependent variable is the value we achieve for that goal.
Importantly (and you may or may not understand this, it isn’t clear to me), the AU landscape is not the value of “the” outcome we would achieve “by default” without turning the AI on. We don’t achieve “flat” AU landscapes by finding a wishy-washy outcome which isn’t too optimized for anything in particular.
We counterfact on different goals, see how much value we could achieve without the AI if we tried our hardest for each counterfactual goal, and then each value corresponds to a point on the green line.
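A minimal pseudocode sketch of that procedure (the names are mine, and `value` hides all of the counterfactual heavy lifting):

```python
# Sketch of the "green line" computation described above.
# Nothing is summed or averaged across goals: each goal gets its own counterfactual.
def au_landscape(goals, pol, value):
    """goals: the payoff functions P in S we might counterfactually pursue.
    pol(P): the human policy function conditioned on goal P (AI stays off).
    value(P, policy): how much P-value that policy actually attains."""
    return {P: value(P, pol(P)) for P in goals}
```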
(You can see how this is amazingly hard to properly compute, and therefore why I’m not advocating non-obstruction as an actual policy selection criterion. But I see non-obstruction as a conceptual frame for understanding alignment, not as a formal alignment strategy, and so it’s fine.)
Furthermore, I claim that it’s in-principle-possible to design AIs which empower you (and thereby don’t obstruct you) for payoff functions P and −P. The AI just gives you a lot of money and shuts off.
Let’s reconsider your very nice graph.
I don’t know whether this graph was plotted with the understanding of how the counterfactuals are computed, or not, so let me know.
Anyways, I think a more potent objection to the “this AI not being activated” baseline is “well what if, when you decide to never turn on AI #1, you turn on AI #2, which destroys the world no matter what your goals are. Then you have spikiness by default.”
This is true, and I think that’s also a silly baseline to use for conceptual analysis.
Also, a system of non-obstructing agents may exhibit bad group dynamics and systematically optimize the world in a certain bad direction. But many properties aren’t preserved under naive composition: corrigible subagents don’t imply a corrigible system, pairwise independent random variables usually aren’t mutually independent, and so on.
Similar objections can be made for multi-polar scenarios: the AI isn’t wholly responsible for the whole state of the world and the other AIs already in it. However, the non-obstruction / AU landscape frame still provides meaningful insight into how human autonomy can be chipped away. Let me give an example.
You turn on the first clickthrough maximizer, and each individual’s AU landscape looks a little worse than before (in short, because there’s optimization pressure on the world towards the “humans click ads” direction, which trades off against most goals)
...
You turn on clickthrough maximizer n and it doesn’t make things dramatically worse, but things are still pretty bad either way.
Now you turn on a weak aligned AI and it barely helps you out, but still classes as “non-obstructive” (comparing ‘deploy weak aligned AI’ to ‘don’t deploy weak aligned AI’). What gives?
Well, in the ‘original / baseline’ world, humans had a lot more autonomy.
If the world is already being optimized in a different direction, your AU will be less sensitive to your goals because it will be harder for you to optimize in the other direction. The aligned weak AI may have been a lot more helpful in the baseline world.
Yes, you could argue that if they hadn’t originally deployed clickthrough-maximizers, they’d have deployed something else bad, and so the comparison isn’t that good. But this is just choosing a conceptually bad baseline.
The point (which I didn’t make in the original post) isn’t that we need to literally counterfact on “we don’t turn on this AI”, it’s that we should compare deploying the AI to the baseline state of affairs (e.g. early 21st century).
Ok, I think I’m following you (though I am tired, so who knows :)).
For me the crux seems to be: We can’t assume in general that pol(P) isn’t terrible at optimising for P. We can “do our best” and still screw up catastrophically.
If assuming “pol(P) is always a good optimiser for P” were actually realistic (and I assume you’re not!), then we wouldn’t have an alignment problem: we’d be assuming away any possibility of making a catastrophic error.
If we just assume “pol(P) is always a good optimiser for P” for the purpose of non-obstruction definitions/calculations, then our AI can adopt policies of the following form:
Take actions with the following consequences:
- If humans act according to a policy that optimises well for P, then humans are empowered on P.
- Otherwise, consequences can be arbitrarily bad.
Once the AI’s bar on the quality of pol is high, we have no AU guarantees at all if we fail to meet that bar. This seems like an untenable approach to me, so I’m not assuming that pol is reliably/uniformly good at optimising.
So e.g. in my diagram, I’m assuming that for every P in S, humans screw up and accidentally create the 80s optimiser (let’s say the 80s optimiser was released prematurely through an error). That may be unlikely: the more reasonable proposition would be that this happens for some subset T of S larger than simply P = 80s utopia. If for all P in T, pol(P) gets you 80s utopia, that will look like a spike on T peaking at P = 80s utopia.
The maximum of this spike may only be achievable by optimising early for 80s utopia (before some period of long reflection that allows us to optimise well across T).
However, once this spike is present for P = 80s utopia, our AI is required by non-obstruction to match that maximum for P = 80s utopia. If it’s still possible that we do want 80s utopia when the premature optimisation would start under pol(P) for P in T, the AI is required to support that optimisation—even if the consequences across the rest of T are needlessly suboptimal (relative to what would be possible for the AI; clearly they still improve on pol, because pol wasn’t good).
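To make the worry concrete with some purely made-up numbers (not from either post):

```python
# Illustrative only. T contains three goals; the green line is spiked at
# "80s utopia" because pol prematurely locks it in.
green   = {"P1": 0.3, "80s utopia": 0.9, "P2": 0.3}   # default (AI off) values
with_ai = {"P1": 0.8, "80s utopia": 0.85, "P2": 0.8}  # AI delays the lock-in

# Non-obstruction on T requires matching the green line for every goal,
# so the small loss on the spike rules out the large gains elsewhere:
print(all(with_ai[P] >= green[P] for P in green))  # False
```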
To assume that my claim (2) doesn’t hold is to assume that there’s no subset T of S where this kind of thing happens by default. That seems unlikely to me—unless we’re in a world where the alignment problem gets solved very well without non-obstruction. For instance, this can happen if you have a payoff function on T which accidentally misses out some component that’s valuable-but-not-vital over almost all of T, but zero for one member. You may optimise hard for the zero member, sacrificing the component you missed out, and later realise that you actually wanted the non-zero version.
Personally, I’d guess that this kind of thing would happen over many such subsets, so you’d have a green line with a load of spikes, each negatively impacting a very small part of the line as a trade-off to achieve the high spike.
To take your P vs -P example, the “give money then shut off” only reliably works if we assume pol(P) and pol(-P) are sufficiently good optimisers. (though probably the bar isn’t high here)
To take a trivial (but possible) example of its going wrong, imagine that pol(-P) involves using software with some hidden absolute value call that inadvertently converts -P optimisation into P optimisation.
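As a toy sketch of that failure (entirely hypothetical; the point is only that the sign flip is easy to introduce):

```python
# Hypothetical sketch: the optimiser is meant to maximise -P, but a stray abs()
# means it actually chases |P|, which can favour exactly the outcomes best for P.
def buggy_negP_objective(p_value):
    return abs(-p_value)   # intended: -p_value

candidates = [-3.0, 0.0, 4.0]                     # P-values of available outcomes
chosen = max(candidates, key=buggy_negP_objective)
print(chosen)  # 4.0 -- the best outcome for P and the worst for -P
```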
Now giving the money doesn’t work, since it makes things worse for V-p(pol(-P)). The AI can shut off without doing anything, but it can’t necessarily do the helpful thing: saying “Hang on a bit and delay optimisation: you need to fix this absolute value bug”, unless that delay doesn’t cost anything for P optimisation. This case is probably ok either with a generous epsilon, or the assumption that the AI has the capacity to help either optimisation similarly. But in general there’ll be problems of similar form which aren’t so simply resolved.
What I don’t like here is the constraint against sacrificing a small amount of pol(P) value for a huge amount of pol(Q) value.
Hopefully that’s clear. Perhaps I’m still missing something, but I don’t see how assuming pol makes no big mistakes gets you very far (the AI is then free to optimise to put us on a knife-edge between chasms, and ‘blame’ us for falling). Once you allow pol to be a potentially catastrophically bad optimiser for some subset of S, I think you get the problems I outline in the post. I don’t think strategy-stealing is much of an issue where pol can screw up badly.
That’s the best I can outline my current thinking. If I’m still not seeing things clearly, I’ll have to rethink/regroup/sleep, since my brain is starting to struggle.
Once the AI’s bar on the quality of pol is high, we have no AU guarantees at all if we fail to meet that bar. This seems like an untenable approach to me
Er—non-obstruction is a conceptual frame for understanding the benefits we want from corrigibility. It is not a constraint under which the AI finds a high-scoring policy. It is not an approach to solving the alignment problem any more than Kepler’s laws are an approach for going to the moon.
Generally, broad non-obstruction seems to be at least as good as literal corrigibility. In my mind, the point of corrigibility is that we become more able to wield and amplify our influence through the AI. If pol(P) sucks, even if the AI is literally corrigible, we still won’t reach good outcomes. I don’t see how this kind of objection supports non-obstruction not being a good conceptual motivation for corrigibility in the real world, where pol is pretty reasonable for the relevant goals.
the “give money then shut off” only reliably works if we assume pol(P) and pol(-P) are sufficiently good optimisers.
I agree it’s possible for pol to shoot itself in the foot, but I was trying to give an example situation. I was not claiming that for every possible pol, giving money is non-obstructive against P and -P. I feel like that misses the point, and I don’t see how this kind of objection supports non-obstruction not being a good conceptual motivation for corrigibility.
The point of all this analysis is to think about why we want corrigibility in the real world, and whether there’s a generalized version of that desideratum. To remark that there exists an AI policy/pol pair which induces narrow non-obstruction, or which doesn’t empower pol a whole lot, or which makes silly tradeoffs… I guess I just don’t see the relevance of that for thinking about the alignment properties of a given AI system in the real world.
Thinking of corrigibility, it’s not clear to me that non-obstruction is quite what I want.
Perhaps a closer version would be something like:
A non-obstructive AI on S needs to do no worse for each P in S than pol(P | off & humans have all the AI’s knowledge).
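In symbols, roughly (my notation, mirroring the original definition):

$$\forall P \in S:\quad V_P\big(\mathrm{pol}(P \mid \text{AI on})\big) \;\ge\; V_P\big(\mathrm{pol}(P \mid \text{off},\ \text{humans have all the AI's knowledge})\big)$$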
This feels a bit patchy, but in principle it’d fix the most common/obvious issue of the kind I’m raising: that the AI would often otherwise have an incentive to hide information from the users so as to avoid ‘obstructing’ them when they change their minds.
I think this is more in the spirit of non-obstruction, since it compares the AI’s actions to a fully informed human baseline (I’m not claiming it’s precise, but in the direction that makes sense to me). Perhaps the extra information does smooth out any undesirable spikes the AI might anticipate.
I do otherwise expect such issues to be common. But perhaps it’s usually about the AI knowing more than the humans.
I may well be wrong about any/all of this, but (unless I’m confused), it’s not a quibble about edge cases. If I’m wrong about default spikiness, then it’s much more of an edge case.
(You’re right about my P, -P example missing your main point; I just meant it as an example, not as a response to the point you were making with it; I should have realised that would make my overall point less clear, given that interpreting it as a direct response was natural; apologies if that seemed less-than-constructive: not my intent)
If pol(P) sucks, even if the AI is literally corrigible, we still won’t reach good outcomes.
If pol(P) sucks by default, a general AI (corrigible or otherwise) may be able to give us information I which:
- Makes Vp(pol(P)) much higher, by making pol(P) given I suck a whole lot less.
- Makes Vq(pol(Q)) a little lower, by making pol(Q) given I make concessions to allow pol(P) to perform better.
A non-obstructive AI can’t do that, since it’s required to maintain the AU for pol(Q).
A simple example is where P and Q currently look the same to us—so our pol(P) and pol(Q) have the same outcome [ETA for a long time at least, with potentially permanent AU consequences], which happens to be great for Vq(pol(Q)), but not so great for Vp(pol(P)).
In this situation, we want an AI that can tell us: “You may actually want either P or Q here. Here’s an optimisation that works 99% as well for Q, and much better than your current approach for P. Since you don’t currently know which you want, this is much better than your current optimisation for Q: that only does 40% as well for P.”
A non-obstructive AI cannot give us that information if it predicts it would lower Vq(pol(Q)) in so doing—which it probably would.
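Plugging in the example’s numbers on a made-up 0-to-1 scale (the 0.9 for the improved P-value is my placeholder for “much better”):

```python
# Current plan (optimised for Q, happens to look identical under P and Q):
v_q_current, v_p_current = 1.00, 0.40   # "only does 40% as well for P"
# Plan the AI could suggest:
v_q_alt,     v_p_alt     = 0.99, 0.90   # "works 99% as well for Q", much better for P

# Non-obstruction (modulo epsilon) compares the AI-on world to the AI-off baseline
# goal by goal, so the suggestion fails the check on Q despite the large gain on P:
print(v_q_alt >= v_q_current, v_p_alt >= v_p_current)   # False True
```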
Does non-obstruction rule out lowering Vq(pol(Q)) in this way? If not, I’ve misunderstood you somewhere. If so, that’s a problem.
I’m not sure I understand the distinction you’re making between a “conceptual frame”, and a “constraint under which...”.
[Non-obstruction with respect to a set S] must be a constraint of some kind. I’m simply saying that there are cases where it seems to rule out desirable behaviour—e.g. giving us information that allows us to trade a small potential AU penalty for a large potential AU gain, when we’re currently uncertain over which is our true payoff function.
Anyway, my brain is now dead. So I doubt I’ll be saying much intelligible before tomorrow (if the preceding even qualifies :)).