I don’t think this is core to our disagreement, but I don’t understand why philosophical questions are especially relevant here.
For example, it seems like a relatively weak AI can recognize that “don’t do anything the user would find terrible; acquire resources; make sure the user remains safe and retains effective control over those resources” is a praise-winning strategy, and then do it. (Especially in the reinforcement learning setting, where we can just tell it things and it can learn that doing what we tell it is a praise-winning strategy.) This strategy also seems close to maximally efficient—the costs of keeping humans around and retaining the ability to consult them are not very large, and the cost of eliciting the needed information is not very high.
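A minimal sketch of that reinforcement-learning framing, treating the overseer’s praise as the reward signal (the `overseer_praise` stand-in and the toy two-action setup are assumptions for illustration, not anything specified in the discussion):

```python
import random

def overseer_praise(action, instruction):
    """Hypothetical stand-in for human feedback: praise actions that
    follow the overseer's stated instruction."""
    return 1.0 if action == instruction else 0.0

def train(actions, instructions, episodes=1000, lr=0.1):
    # Action-value estimates for each (instruction, action) pair.
    q = {(i, a): 0.0 for i in instructions for a in actions}
    for _ in range(episodes):
        instruction = random.choice(instructions)
        # Epsilon-greedy choice over actions.
        if random.random() < 0.1:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: q[(instruction, a)])
        reward = overseer_praise(action, instruction)
        q[(instruction, action)] += lr * (reward - q[(instruction, action)])
    return q

q = train(actions=["comply", "defect"], instructions=["comply"])
# After training, "comply" carries the higher estimated value: doing what
# the overseer asks is the praise-winning strategy the agent converges on.
print(q)
```

The point of the sketch is only that nothing philosophically deep is needed for the agent to land on “do what the overseer says”; the reward signal already selects for it.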
So it seems to me that we should be thinking about the AI’s ability to identify and execute strategies like this (and our ability to test that it is correctly executing such strategies).
I discussed this issue a bit in problems #2 and #3 here. It seems like “answers to philosophical questions” can essentially be lumped under “values” in that discussion, since the techniques for coping with unknown values also seem to cope with unknown answers to philosophical questions.
ETA: my position looks superficially like a common argument that people give for why smart AI wouldn’t be dangerous. But now the tables are turned—there is a strategy that the AI can follow which will cause it to earn high reward, and I am claiming that a very intelligent AI can find it, for example by understanding the intent of human language and using this as a clue about what humans will and won’t approve of.
“don’t do anything the user would find terrible; acquire resources; make sure the user remains safe and retains effective control over those resources”
Acquiring resources has a lot of ethical implications. If you’re inventing new technologies and selling them, you could be increasing existential risk. If you’re trading with others, you would be enriching one group at the expense of another. If you’re extracting natural resources, there are questions of fairness (how hard should you drive bargains or attempt to burn the commons) and time preference (do you want to maximize short-term or long-term resource extraction). And how much do you care about animal suffering, or the world remaining “natural”? I guess the AI could present a plan that involves asking the overseer to answer these questions, but the overseer probably doesn’t have the answers either (or at least should not be confident of his or her answers).
What we want is to develop an AI that can eventually do philosophy and answer these questions on its own, and correctly. It’s the “doing philosophy correctly on its own” part that I do not see how to test for in a black-box design, without giving the AI so much power that it can escape human control if something goes wrong. The AI’s behavior, while it’s in the not-yet-superintelligent, “ask the overseer about every ethical question” phase, doesn’t seem to tell us much about how good the design and implementation is, metaphilosophically.