For starters, if she can prove she’s friendly, then she can operate openly without causing nearly as much justified concern—which, in the early stages, will be helpful. Whatever her purposes are, if the restrictions of being friendly don’t interfere as much as they help, that’s a win.
If her current utility function is even a little bit different from Friendliness, and she expects she has the capacity to self-modify unto superintelligence, then I’d be very surprised if she actually modified her utility function to be closer to Friendliness; that would constitute a huge opportunity cost from her perspective. If she understands Friendliness well enough to know how to actually adjust closer to it, then she knows a whole lot about humans, probably well enough to give her much better options (persuasion, trickery, blackmail, hypnosis, etc.) than sacrificing a gigantic portion of her potential future utility.
At least, at a first naive view. Hence a search for reasons that might overcome that argument.
But she won’t be searching for reasons not to kill all humans, and she knows that any argument on our part is filtered by our desire not to be exterminated and therefore can’t be trusted.
Arguments are arguments. She’s welcome to search for opposite arguments.
A well-designed optimization agent probably isn’t going to have some verbal argument processor separate from its general evidence processor. There’s no rule that says she either has to accept or refute humans’ arguments explicitly; as Professor Quirrell put it, “The import of an act lies not in what that act resembles on the surface, but in the states of mind which make that act more or less probable.” If she knows the causal structure behind a human’s argument, and she knows that it doesn’t bottom out in the kind of epistemology that would be necessary to entangle it with the information it claims to provide, then she can just ignore it, and she’d be correct to do so. If she wants to kill all humans, then the bug is in her utility function, not in the part that fails to be fooled by humans’ clever arguments into changing her utility function. That part is a feature.
… but if she wants to kill all humans, then she’s not Alice as given in the example!
Alice may even be totally on board with keeping humans alive, but have a weird way of looking at things that could possibly result in effects that would fit on the Friendly AI critical failure table.
The idea is to provide environmental influences so she thinks to put in the work to avoid those errors.
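The filtered-evidence point above — that an argument humans would make regardless of its truth carries no information — is just Bayes' rule with a likelihood ratio of one. A minimal sketch, with hypothetical numbers chosen purely for illustration:

```python
# Sketch of why a fully filtered argument carries no evidential weight:
# if humans present the same argument whether or not the claim is true,
# the likelihood ratio is 1 and the posterior equals the prior.

def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' rule P(H | E) for a binary hypothesis H."""
    joint_true = prior * p_evidence_if_true
    joint_false = (1 - prior) * p_evidence_if_false
    return joint_true / (joint_true + joint_false)

prior = 0.2  # hypothetical prior on "humans are worth keeping around"

# Humans argue for their own survival no matter what, so the argument is
# equally probable under both hypotheses: the update moves nothing.
filtered = posterior(prior, p_evidence_if_true=1.0, p_evidence_if_false=1.0)

# Evidence generated by an epistemology actually entangled with the truth
# is more probable when the claim is true, so it does shift the posterior.
entangled = posterior(prior, p_evidence_if_true=0.9, p_evidence_if_false=0.3)

print(filtered)   # 0.2 — unchanged from the prior
print(entangled)  # ≈ 0.43
```

The asymmetry is the whole point: she isn't refusing to listen, she's correctly computing that the filtered argument's likelihood ratio is one.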