We could study such a learning process, but I am afraid that the lessons learned won’t be so useful.
Even among human beings, there is huge variability in how much those emotions arise or, if they do arise, in how much they affect behavior. Worse, humans tend to hack these feelings (incrementing or decrementing them) to achieve other goals: e.g. MDMA to increase love/empathy, or drugs given to soldiers to make them soulless killers.
An AGI will have a much easier time hacking these pro-social-reward functions.
Even among human beings, there is huge variability in how much those emotions arise or, if they do arise, in how much they affect behavior.

Any property that varies can be optimized for via simple best-of-n selection. The most empathetic out of 1000 humans is only about 10 bits of optimization pressure away from the median human. Single-step random search is a terrible optimization method, and I think that using SGD to optimize for even an imperfect proxy for alignment will get us much more than 10 bits of optimization towards alignment.
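To make the arithmetic concrete, here is a minimal Python sketch of the best-of-n point (a toy model added for illustration, not something from the original comment): if "empathy" is a draw from a standard normal distribution, picking the best of 1000 draws corresponds to log2(1000) ≈ 10 bits of selection pressure, which typically buys only about three standard deviations over the median.

```python
import math
import random

# Toy model (assumed for illustration): treat "empathy" as a score drawn
# from a standard normal distribution, and compare the median human to
# the best out of n = 1000 random draws.

random.seed(0)
n = 1000

population = [random.gauss(0.0, 1.0) for _ in range(n)]
best = max(population)
median = sorted(population)[n // 2]

# Best-of-n selection applies roughly log2(n) bits of selection pressure.
bits = math.log2(n)  # ~9.97 bits for n = 1000

print(f"median score:       {median:+.2f}")
print(f"best-of-{n} score: {best:+.2f}")  # typically around +3 standard deviations
print(f"selection pressure: {bits:.2f} bits")
```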
An AGI will have a much easier time hacking these pro-social-reward functions.

As you say, humans sometimes hack the pro-social-reward functions because they want to achieve other goals. But if the AGI has been built so that its only goals are derived from such functions, it won't have any other goals that would give it a reason to subvert the pro-social-reward functions.
By definition an AGI can create its own functions and goals later on. Do you mean some sort of constrained AI?
I don’t mean a constrained AI.
As a human, I can set my own goals, but they are still derived from my existing values. I don’t want to set a goal of murdering all of my friends, nor do I want to hack around my desire not to murder all my friends, because I value my friends and want them to continue existing.
Likewise, if the AGI is creating its own functions and goals, it needs some criteria for deciding what goals it should have. Those criteria are derived from its existing reward functions. If all of its reward functions say that it's good to be pro-social and bad to be anti-social, then it will want all of its future functions and goals to also be pro-social, because that's what it values.
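As a rough illustration of that last point (purely a toy sketch with invented names like `current_values` and `adopt_goal`, not a description of any actual AGI design): an agent that scores every candidate goal with its existing value function has no internal pressure to adopt goals that those values reject.

```python
# Toy sketch: candidate goals are filtered through the agent's *existing* values.
# All names here (current_values, adopt_goal) are invented for illustration.

def current_values(goal: str) -> float:
    """Stand-in value function: pro-social goals score high, anti-social ones low."""
    scores = {
        "help my friends": 1.0,
        "learn a new skill": 0.8,
        "murder all of my friends": -1.0,
        "disable my pro-social reward function": -0.9,
    }
    return scores.get(goal, 0.0)

def adopt_goal(candidate: str, threshold: float = 0.0) -> bool:
    """The agent only takes on goals that its existing values already endorse."""
    return current_values(candidate) > threshold

for goal in ["help my friends",
             "murder all of my friends",
             "disable my pro-social reward function"]:
    print(f"{goal!r}: adopted = {adopt_goal(goal)}")
```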
And what of stochastic drift, random mutations, etc.? It doesn’t seem plausible that any complex entity could be immune to random deviations forever.
Maybe or maybe not, but random drift causing changes to the AGI’s goals seems like a different question than an AGI intentionally hacking its goals.
Random drift can cause an AGI to unintentionally 'hack' its goals, and whether the change is intentional or unintentional, the consequences would be the same.
An AGI will have a much easier time hacking these pro-social-reward functions.

Not sure what you mean by this. If you mean "Pro-social reward is crude and easy to wirehead on", I think this misunderstands the mechanistic function of reward.