I will try to list some objections that came to mind while reading your suggestions.
I think that the main problem is the idea that “humans have values”. No, they don’t have values. “Values” were invented by psychologists to better describe human behaviour. Saying that humans “have values” is a fundamental attribution error, which is in most cases very subtle, because humans behave as if they had values. But this error could be amplified by the suggested IDA, because human reactions to the AI’s tests will not be coherent, and their extrapolation will get stuck at some level. How exactly it will get stuck is not easy to predict. In short: it is impossible to align two fuzzy things. I am going to write something longer about the problem of the non-existence of human values.
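To make the incoherence point concrete, here is a minimal sketch in Python (the options and observed choices are hypothetical): if a person’s pairwise choices form a cycle, no utility function can reproduce them, and extrapolation has nowhere consistent to go.

```python
from itertools import permutations

# Hypothetical observed pairwise choices: the person picks A over B,
# B over C, and C over A -- a preference cycle.
observed_choices = [("A", "B"), ("B", "C"), ("C", "A")]

def consistent_utility_exists(choices):
    """Return True if some strict ranking of the options reproduces
    every observed choice; if none does, no utility function fits."""
    options = {x for pair in choices for x in pair}
    for ranking in permutations(options):
        rank = {opt: i for i, opt in enumerate(ranking)}  # lower = better
        if all(rank[winner] < rank[loser] for winner, loser in choices):
            return True
    return False

print(consistent_utility_exists(observed_choices))  # False: the cycle
# admits no utility function, so extrapolation "gets stuck".
```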
I have values about my values which contradict my observed behaviour. If an AI observed someone’s behaviour, it might conclude that this person is, say, lazy, sexually addicted, and would prefer to live in the world of Game of Thrones. However, the person might disown part of his behaviour as unethical. Thus behaviour will give the wrong clues.
Human behaviour is controlled not only by a person’s consciousness, but also partly by his unconscious, which is unobservable to him but obvious to others. In other words, human behaviour is often the sum of several different minds inside one head. We don’t want an AI to extrapolate our unconscious goals, as they are often unethical and socially inappropriate. (I could support these claims with literature and examples.)
Small tests performed by the AI may be dangerous or unethical. Maybe the AI will try to torture a person to extract his preferences about money and pain.
An AI could be aligned with the same human in many different ways. All of them could be called “alignment”, but because the AI is very complex, different approximations of human behaviour could satisfy any criterion we have, and some of these approximations will be better or worse than others.
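A minimal sketch of this under-determination (the value models and situations here are hypothetical): two different value models that agree on every observed choice, and therefore both satisfy any criterion defined over that data, yet diverge on novel situations.

```python
# Two hypothetical value models that agree on everything observed so far.
def value_model_a(situation):
    return situation["comfort"] + situation["honesty"]

def value_model_b(situation):
    # Identical on-distribution, but treats honesty as worthless
    # once the stakes are high enough.
    if situation["stakes"] < 100:
        return situation["comfort"] + situation["honesty"]
    return situation["comfort"]

observed = [{"comfort": 3, "honesty": 1, "stakes": 10},
            {"comfort": 0, "honesty": 2, "stakes": 50}]
novel = {"comfort": 5, "honesty": 4, "stakes": 1000}

# Both models fit the observed data perfectly...
assert all(value_model_a(s) == value_model_b(s) for s in observed)
# ...but disagree about a novel, high-stakes situation.
print(value_model_a(novel), value_model_b(novel))  # 9 vs 5
```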
AI alignment per se doesn’t solve AI safety, as it doesn’t explain how war between humans empowered by AIs will be prevented. I wrote an article about it: Turchin, Alexey, and David Denkenberger. “Military AI as a convergent goal of the self-improving AI.” In Artificial Intelligence Safety and Security, CRC Press, 2018. https://philpapers.org/rec/TURMAA-6
The AI’s test of human values will itself change those values, and the AI could change them in any possible direction. That is, the AI could design an “adversarial test” which would turn a human’s value system into any other value system without the human noticing. Some humans can do this too; see “Ericksonian hypnosis”. It is a particularly nasty example of AI boxing gone wrong.
Obviously, if the AI’s capabilities are underestimated, it could make a “treacherous turn” at any moment while preparing its test of human behaviour. I think this would be a standard MIRI reply.
To be “aligned” is not just a mathematical relation to which the usual notions of transitivity, or whatever else we can easily do with mathematical symbols, are applicable. Alignment is complex and vague, never certain. It is like love: we can’t describe human love by introducing a notation L and writing equations like L(A, B) = C.
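To be concrete, the kind of formal statement I have in mind, and which I don’t think applies, would treat alignment as a binary relation and assert something like transitivity (the notation is purely illustrative):

$$\text{Aligned}(A, B) \land \text{Aligned}(B, C) \implies \text{Aligned}(A, C)$$

No such crisp relation captures a property as fuzzy as alignment.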
Anyway, I think there is a chance that your scheme will work, but under one condition: the first AI is something like a human upload which has a basic understanding of what it is to be aligned. If we have such a perfect upload, with common sense and a general understanding of what humans typically want, the amplification and distillation scheme will work. The question is how we get this first upload. I have some ideas, but this is another topic.
My scheme doesn’t have any explicit dependence on “human values,” or even involve the AI working with an explicit representation of what it values, so it seems unusually robust to humans not having values (though obviously this leaves it more vulnerable to certain other problems). I agree there are various undesirable features that may get amplified.
My scheme is definitely not built on AI observing people’s behavior and then inferring their values.
I’m not convinced that it’s bad if our AI extrapolates what we actually want. I agree that my scheme changes the balance of power between those forces (conscious and unconscious preferences) somewhat, compared to the no-AI status quo, but I think this procedure will tend to increase rather than decrease the influence of what we say we want, relative to business as usual. I guess you could be concerned that we don’t take the opportunity of building AI to go further in the direction of quashing our unconscious preferences. I think we can and should consider quashing unconscious preferences as a separate intervention from aligning AGI.
An agent trained with my scheme only tries to gather information to the extent that the oversight process rewards info-gathering. This may lead to inefficiently little info-gathering if the overseer doesn’t appreciate its value, or to other problems if the overseer is wrong in some other way about the value of information (VOI). But I don’t see how it would lead to this particular failure mode. (It could if amplification breaks down, such that e.g. the amplified agent has a representation of “this action finds relevant info” but no understanding of what the action actually does. I think that if amplification fails that badly, we are screwed for other reasons.)
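A toy illustration of this incentive structure (the action names and rewards are hypothetical, not part of the actual scheme): the trained agent simply maximizes the overseer’s reward, so it never prefers an info-gathering action the overseer penalizes, though it may under-invest in information the overseer under-values.

```python
# Toy model: the agent picks the action with the highest
# overseer-assigned reward. All numbers are hypothetical.
overseer_reward = {
    "act_immediately": 1.0,
    "run_safe_test_first": 0.5,   # overseer under-values the information
    "run_harmful_test": -10.0,    # overseer punishes unethical tests
}
chosen = max(overseer_reward, key=overseer_reward.get)
print(chosen)  # "act_immediately": too little info-gathering, but no
# incentive toward the harmful test, since that is what gets punished.
```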
I don’t know if it matters what’s called alignment; we can just consider the virtues of iterated amplification.
I’m not proposing a path to world peace, I’m trying to resolve the particular risk posed by misaligned AI. It seems most likely that there will be war between AI-empowered humans, though with luck AI will enable them to come up with better compromises and ultimately end war.
I don’t understand the relevance exactly.
Here are what I see as the plausible techniques for avoiding a treacherous turn. If none of those techniques work I agree we have a problem.
I agree that there are some desirable properties that we won’t be able to make tight arguments about because they are hard to pin down.
I have some additional thoughts after thinking more about your proposal.
What worries me is the jump from narrow-AI learning to AGI learning. The proposal will work at the narrow-AI level, roughly the way a similar model worked in the case of AlphaGo Zero. The proposal will also work if we have a perfectly aligned AGI, something like a human upload or a perfectly aligned Seed AI. It is quite possible that a Seed AGI could grow in capability while preserving alignment.
However, the question is how your model will survive the jump from narrow, non-agential AI capabilities to agential AGI capabilities. This could happen at some unpredictable moment during the evolution of your system, and may involve modelling the outside world, all of humanity, and some convergent basic drives, such as self-preservation. So it will be the classic treacherous turn, or intelligence explosion, or “becoming self-aware” moment, and at that moment the previous means of alignment will be instantly obsolete and will provide no guarantee that the system stays aligned at its new level of capabilities.