The principles from the post can still be applied. Some humans do end up aligned to animals—particularly vegans (such as myself!). How does that happen? There empirically are examples of general intelligences with at least some tendency to terminally value entities massively less powerful than themselves; we should be analyzing how this occurs.
Also, remember that the problem is not to align an entire civilization of naturally evolved organisms to weaker entities. The problem is to align exactly one entirely artificial organism to weaker entities. This is much simpler, and, as mentioned, entirely possible just by figuring out how already-existing people of that sort end up that way. But your use of “we” here seems to imply that you think the entirety of human civilization is what we ought to be using as inspiration for the AGI, which is not the case.
By the way: at least part of the explanation for why I personally am aligned to animals is that I have a strong tendency to be moved by the Care/Harm moral foundation—see this summary of The Righteous Mind for more details. It is unclear exactly how it is implemented in the brain, but it is suspected to be a generalization of the very old instincts that cause mothers to care about the safety and health of their children. I have literally, regularly told people that I perceive animals as identical in moral relevance to human children, implying that some kind of parental instincts are at work in the intuitions that make me care about their welfare. Even carnists feel this way about their pets, hence calling themselves e.g. “cat moms”. So, the main question here for alignment is: how can we reverse engineer parental instincts?
Human beings and other animals have parental instincts (and empathy in general) because they were evolutionarily advantageous for the populations that developed them.
AGI won’t be subjected to the same evolutionary pressures, so every alignment strategy relying on empathy or social reward functions is, in my opinion, hopelessly naive.
There must have been some reason(s) why organisms exhibiting empathy were selected for during our evolution. However, evolution did not directly configure our values. Rather, it configured our (individually slightly different) learning processes. Each human’s learning process then builds their different values based on how the human’s learning process interacts with that human’s environment and experiences.
The human learning process (somewhat) consistently converges to empathy. Evolution might have had some weird, inhuman reason for configuring a learning process to converge to empathy, but it still built such a learning process.
It therefore seems very worthwhile to understand what part of the human learning process allows for empathy to emerge in humans. We may not be able to replicate the selection pressures that caused evolution to build an empathy-producing learning process, but it’s not clear we need to. We still have an example of such a learning process to study. The Wright brothers didn’t need to re-evolve birds to create their flying machine.
We could study such a learning process, but I am afraid that the lessons learned won’t be so useful.
Even among human beings, there is huge variability in how strongly those emotions arise and, when they do arise, in how much they affect behavior. Worse, humans tend to hack these feelings (turning them up or down) to achieve other goals: e.g., MDMA to increase love and empathy, or drugs given to soldiers to make them soulless killers.
An AGI will have a much easier time hacking these pro-social-reward functions.
Any property that varies can be optimized for via simple best-of-n selection. The most empathetic out of 1000 humans is only 10 bits of optimization pressure away from the median human. Single step random search is a terrible optimization method, and I think that using SGD to optimize for even an imperfect proxy for alignment will get us much more than 10 bits of optimization towards alignment.
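To make the arithmetic here concrete, below is a minimal Python sketch of the best-of-n point. It is only an illustration under an assumption the comment does not make (treating “empathy” as a scalar trait with a standard normal distribution across people): selecting the best of n i.i.d. samples applies roughly log2(n) bits of optimization pressure, so best-of-1000 is about 10 bits and typically lands only around three standard deviations above the median.

```python
import math
import random

# Best-of-n selection applies about log2(n) bits of optimization pressure:
# keeping the best of n i.i.d. samples selects the top 1/n quantile.
n = 1000
print(f"best-of-{n} selection ~ {math.log2(n):.1f} bits")  # ~10.0 bits

# Toy model (an assumption for illustration, not from the comment): treat
# "empathy" as a scalar trait distributed as a standard normal.
random.seed(0)
population = [random.gauss(0.0, 1.0) for _ in range(n)]
best = max(population)
median = sorted(population)[n // 2]
print(f"median human: {median:+.2f} SD   best of {n}: {best:+.2f} SD")
# The best of 1000 standard-normal draws typically sits near +3 SD, i.e. one
# round of random search buys only a few standard deviations of the trait.
```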
As you say, humans sometimes hack the pro-social-reward functions because they want to achieve other goals. But if the AGI has been built so that its only goals are derived from such functions, it won’t have any other goals that would give it a reason to subvert the pro-social-reward functions.
By definition an AGI can create its own functions and goals later on. Do you mean some sort of constrained AI?
I don’t mean a constrained AI.
As a human, I can set my own goals, but they are still derived from my existing values. I don’t want to set a goal of murdering all of my friends, nor do I want to hack around my desire not to murder all my friends, because I value my friends and want them to continue existing.
Likewise, if the AGI is creating its own functions and goals, it needs some criteria for deciding what goals it should have. Those criteria are derived from its existing reward functions. If all of its reward functions say that it’s good to be pro-social and bad to be anti-social, then it will want all of its future functions and goals to also be pro-social, because that’s what it values.
And what of stochastic drift, random mutations, etc.? It doesn’t seem plausible that any complex entity could be immune to random deviations forever.
Maybe or maybe not, but random drift causing changes to the AGI’s goals seems like a different question than an AGI intentionally hacking its goals.
Random drift can cause an AGI to unintentionally ‘hack’ its goals. In either case, whether intentional or unintentional, the consequences would be the same.
Not sure what you mean by the claim that an AGI will have a much easier time hacking these pro-social-reward functions. If you mean “Pro-social reward is crude and easy to wirehead on”, I think this misunderstands the mechanistic function of reward.
The “Humans do X because evolution” argument does not actually explain anything about mechanisms. I keep seeing people make this argument, but it’s a non sequitur to the points I’m making in this post. You’re explaining how the behavior may have gotten there, not how the behavior is implemented. I think that “because selection pressure” is a curiosity-stopper, plain and simple.
This argument proves too much, since it implies that planes can’t work because we didn’t subject them to evolutionary pressures for flight. It’s locally invalid.
Could anyone who downvoted explain why? Was it too harsh, or is it because of disagreement with the idea?
I explained why I disagree with you. I did not downvote you, but if I had to speculate on why others did, I’d guess it had something to do with you calling those who disagree with you “hopelessly naive”.
Sure, if you’ve got some example of a mechanism for humans ending up aligned to much weaker entities that’s likely to scale, it may be worthwhile. I’m just pointing out that a lot of people have already thought about mechanisms and concluded that the mechanisms they could come up with would be unlikely to scale.
I’m not a big fan of moral foundations theory for explaining individual differences in moral views. I think it lacks evidence.
In my experience, researchers tend to stop at “But humans are hacky kludges” (what do they think they know, and why do they think they know it?). Speaking for myself, I viewed humans as complicated hacks which didn’t offer substantial evidence about alignment questions. This “humans as alignment-magic” or “the selection pressure down the street did it” view seems quite common (but not universal).
AFAICT, most researchers do not appreciate the importance of asking questions with guaranteed answers.
AFAICT, most alignment-produced thinking about humans is about their superficial reliability (e.g. will they let an AI out of the box) or the range of situations in which their behavior will make sense (e.g. how hard is it to find adversarial examples which make a perfect imitation of a human). I think these questions are relatively unimportant to alignment.