Let me see if I got it right:
1) If we design an AGI using methods that suppose it doesn’t have a mind, those methods will produce an aligned AGI even if it actually possesses a mind.
2) If we instead suppose AGIs have minds, the methods employed would fail if the AGI doesn’t have a mind, because the philosophical methods employed only work if the subject has a mind.
3) The consequence of 1) and 2) is that supposing AGIs have minds carries a greater risk of false positives.
4) Because of Goodhart’s law, behavioral methods are unlikely to produce an aligned AGI.
5) Past research on GOFAI and the success of applying “raw power” show that using only algorithmic methods for aligning AGI is unlikely to work.
6) The consequence of 4) and 5) is that the approach supposing AGIs do not have minds is likely to fail at producing an aligned AGI, because it can only use behavioral or algorithmic methods.
7) Because of 6), we have no choice but to take the risk of false positives associated with supposing AGIs have minds.
My comments:
a) The transition between 6) and 7) implicitly assumes that:
(*) P( aligned AGI | philosophical methods ) > P( aligned AGI | behavioral or algorithmic methods )
b) You say that if we suppose the AGI does not have a mind, and treat it as a p-zombie, then the design would work even if it has a mind. Therefore, when supposing that the AGI does not have a mind, there are no design choices that optimize the probability of an aligned AGI by assuming it does not possess a mind.
c) You assert that when using philosophical methods (assuming the AGI does have a mind), a false positive would make the methods fail, because they rely extensively on the hypothesis of a mind. I don’t see why a p-zombie (which by definition would be indistinguishable from an AGI with a mind) would be more likely to make them fail than an AGI with a mind.
a) Yep, I agree.
b) This sounds right, but I think the comment about p-zombies is generating some confusion and conveying an idea I did not intend. I meant the p-zombie comment to be illustrative; it’s not actually a hinge of the argument.
c) Again, maybe I’m not conveying what I meant when making passing reference to p-zombies. In my mind, the point of a p-zombie here is that it’s equivalent to a thing with a mind for some set of observations we make, but it doesn’t function in the same way, so we may later be surprised when the p-zombie and the minded system diverge. I suspect some of the confusion is that I’m operating under the assumption that p-zombies are possible but have weird computational limits (the integrated information theory paper has a section explaining this idea).