My scheme doesn’t have any explicit dependence on “human values,” or even involve the AI working with an explicit representation of what it values, so seems unusually robust to humans not having values (though obviously this leaves it more vulnerable to certain other problems). I agree there are various undesirable features that may get amplified.
My scheme is definitely not built on AI observing people’s behavior and then inferring their values.
I’m not convinced that it’s bad if our AI extrapolates what we actually want. I agree that my scheme changes the balance of power between those forces somewhat, compared to the no-AI status quo, but I think this procedure will tend to increase rather than decrease the influence of what we say we want, relative to business as usual. I guess you could be concerned that we don’t take the opportunity of building AI to go further in the direction of quashing our unconscious preferences. I think we can and should consider quashing unconscious preferences as a separate intervention from aligning AGI.
An agent trained with my scheme only tries to get info to the extent the oversight process rewards info-gathering. This may lead to inefficiently little info gathering if the overseer doesn’t appreciate its value. Or it may lead to other problems if the overseer is wrong in some other way about the value of information (VOI). But I don’t see how it would lead to this particular failure mode. (It could if amplification breaks down, such that e.g. the amplified agent just has a representation of “this action finds relevant info” but not an understanding of what it actually does. I think that if amplification fails that badly we are screwed for other reasons.)
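To make that concrete, here is a toy sketch (my own illustrative example, not the actual training setup; the reward numbers and function names are made up): an agent choosing between two actions, whose only incentive to gather info is whatever score the overseer assigns, so it learns to gather info exactly when the overseer’s reward says that’s worthwhile.

```python
import random

def overseer_reward(action: str) -> float:
    # Hypothetical overseer that does appreciate value of information:
    # it scores "gather_info" higher than acting immediately.
    return 1.0 if action == "gather_info" else 0.4

def train(overseer, steps: int = 5000, lr: float = 0.05) -> dict:
    prefs = {"act_now": 0.0, "gather_info": 0.0}  # the agent's value estimates
    for _ in range(steps):
        # epsilon-greedy choice between the two actions
        if random.random() < 0.1:
            action = random.choice(list(prefs))
        else:
            action = max(prefs, key=prefs.get)
        r = overseer(action)
        prefs[action] += lr * (r - prefs[action])  # move estimate toward the overseer's reward
    return prefs

print(train(overseer_reward))
# The agent ends up preferring "gather_info" only because the overseer rewards it;
# with an overseer that undervalues information, the same code learns to skip it.
```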
I don’t know if it matters what’s called “alignment”; we can just consider the virtues of iterated amplification.
I’m not proposing a path to world peace; I’m trying to resolve the particular risk posed by misaligned AI. It seems most likely that there will be war between AI-empowered humans, though with luck AI will enable them to come up with better compromises and ultimately end war.
I don’t understand the relevance exactly.
Here are what I see as the plausible techniques for avoiding a treacherous turn. If none of those techniques work I agree we have a problem.
I agree that there are some desirable properties that we won’t be able to make tight arguments about because they are hard to pin down.
I have some additional thoughts after thinking more about your proposal.
What worries me is the jump from AI to AGI learning. The proposal will work at the narrow-AI level, roughly as a similar model worked in the case of AlphaGo Zero. The proposal will also work if we have a perfectly aligned AGI, something like a human upload or a perfectly aligned Seed AI. It is quite possible that a Seed AGI could grow in its capabilities while preserving alignment.
However, the question is how your model will survive the jump from narrow, non-agential AI capabilities to agential AGI capabilities. This could happen during the evolution of your system at some unpredicted moment, and may involve modeling of the outside world, of all of humanity, and of some convergent basic drives, like self-preservation. So it would be a classical treacherous turn, an intelligence explosion, or a “becoming self-aware” moment, and at that moment the previous ways of achieving alignment will be instantly obsolete and will not provide any guarantee that the system remains aligned at its new level of capabilities.
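For concreteness, here is a toy rendering of the amplify-then-distill loop the proposal relies on (my own simplification: the arithmetic task, the `human_decompose`/`human_combine` stubs, and the hand-coded `distill` are illustrative placeholders, not the actual procedure).

```python
from typing import Callable, List

Question = str
Answer = float
Model = Callable[[Question], Answer]

def human_decompose(question: Question) -> List[Question]:
    # Toy "human": splits "sum:1,2,3" into one sub-question per number.
    return [f"value:{x}" for x in question.split(":")[1].split(",")]

def human_combine(subanswers: List[Answer]) -> Answer:
    return sum(subanswers)

def amplify(model: Model) -> Model:
    """Human plus current model: decompose the task, delegate sub-questions, recombine."""
    def amplified(question: Question) -> Answer:
        return human_combine([model(q) for q in human_decompose(question)])
    return amplified

def distill(overseer: Model) -> Model:
    """Stand-in for training a fast model to imitate the slow amplified overseer."""
    def fast_model(question: Question) -> Answer:
        if question.startswith("value:"):
            return float(question.split(":")[1])
        return overseer(question)  # a real system would learn this, not call the overseer
    return fast_model

def iterated_amplification(rounds: int) -> Model:
    model: Model = lambda q: 0.0  # weak starting model
    for _ in range(rounds):
        model = distill(amplify(model))
    return model

print(iterated_amplification(2)("sum:1,2,3"))  # 6.0
```

The concern above is that nothing in this loop, by itself, obviously guarantees alignment is preserved if the distilled model’s capabilities jump between rounds.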