Oh, it's possible to add up a load of suboptimal optimisations, many of which hit the wrong target, and have them miraculously cancel out to produce a flat landscape: e.g. you optimise for A, accidentally hit B, and get only 70% of the ideal value for A; counterfactually you optimise for B, accidentally hit C, and get only 70% of the ideal value for B; counterfactually you aim for C, hit D, and so on, so that things end up miraculously flat. But there's no reason to expect all misses to be of similar magnitude, or to have the same impact on value. It's just hugely unlikely, and to expect it would seem silly.
My point is that in practice we'll make mistakes, that the kind, number, and severity of our mistakes will depend on P, and that a pol which assumes away such mistakes isn't useful (at least I don't see how it would be). Throughout, I'm assuming pol(P) isn't near-optimal for all P (see my response above for details).
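As a toy illustration of why such cancellation is implausible (my own sketch; the goal count, ideal value, and miss fractions are all made-up assumptions, not anything from the post): if each pol(P)'s miss severity varies with P, the attained-value landscape across S won't be flat.

```python
import random

random.seed(0)

# Toy model: for each goal P in S, our policy pol(P) misses the ideal
# target and attains only some fraction of the ideal value for P.
goals = [f"P{i}" for i in range(8)]
ideal_value = 100.0

# If every miss had the same magnitude, the landscape would be flat...
uniform_misses = {p: 0.7 * ideal_value for p in goals}

# ...but if miss severity varies with P (here drawn at random purely
# for illustration), it isn't.
varied_misses = {p: random.uniform(0.2, 0.95) * ideal_value for p in goals}

for label, landscape in [("uniform", uniform_misses), ("varied", varied_misses)]:
    values = list(landscape.values())
    print(f"{label}: min={min(values):.1f}, max={max(values):.1f}, "
          f"spread={max(values) - min(values):.1f}")
```

The flat outcome only appears in the uniform case, i.e. exactly when all misses are assumed to have the same impact on value.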
For non-spikiness, you don't just need a world where we never use powerful AI: you need a world where powerful [optimisers for some goal in S] of any kind never occur. It's not clear to me how you cleanly or coherently define such a world. The counterfactual where "this system is off" may not be easy to calculate, but it's conceptually simple. The counterfactual where "no powerful optimiser for any P in S ever exists" is not. In particular, it's far from clear that iterated improvements of biological humans with increased connectivity don't get you an extremely powerful optimiser, which could (perhaps mistakenly) optimise for something spiky. Ruling out everything like this doesn't seem to land you anywhere natural or cleanly defined.
Then you have the problem of maintaining non-obstruction once many other AIs already exist:
You build a non-obstructive AI, X, using a baseline of no-great-P-in-S-optimisers-ever.
X allows someone else to build Y, a narrow-subset-of-S optimiser (since this also outperforms the baseline).
Y takes decisions that lock in the spike it's optimising for, using actions that are irreversible to humans.
From this moment, non-obstruction requires that X switch its policy to enforce the locked-in spike, or shut down (this is true even if X has the power to counter Y's actions).
Perhaps there’s some clean way to take this approach, but I’m not seeing it. If what you want is to outperform some moderate, flat baseline, then simply say that directly. Trying to achieve a flat baseline by taking a convoluted counterfactual seems foolish.
Fundamentally, I think setting up an AI with an incentive to prefer (+1, 0, 0, …, 0) over (-1, +10, +10, …, +10) is asking for trouble. Pretty much regardless of the baseline, a rule that says every improvement must be a Pareto improvement is just not what I want.
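To make the worry concrete, here's a minimal sketch (my own illustration, not a definition from the post) of a Pareto-improvement filter over payoff vectors indexed by the goals in S: it accepts the tiny one-goal gain but rejects the huge aggregate gain, purely because of the single -1 entry.

```python
def pareto_improves(candidate, baseline):
    """True iff candidate is at least as good as baseline on every
    goal's payoff, and strictly better on at least one."""
    at_least_as_good = all(c >= b for c, b in zip(candidate, baseline))
    strictly_better = any(c > b for c, b in zip(candidate, baseline))
    return at_least_as_good and strictly_better

baseline = (0, 0, 0, 0)           # flat baseline payoff for each goal in S
small_gain = (1, 0, 0, 0)         # +1 on one goal, no change elsewhere
big_tradeoff = (-1, 10, 10, 10)   # huge gains overall, one goal loses slightly

print(pareto_improves(small_gain, baseline))    # True: accepted
print(pareto_improves(big_tradeoff, baseline))  # False: rejected
```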