What loss function(s), when sent into a future AI’s brain-like configuration of neocortex / hippocampus / striatum / etc.-like learning algorithms, will result in an AGI that is definitely not trying to literally exterminate humanity?
Specifying a correct loss function is not the right way to think about the Alignment Problem. A system’s architecture matters much more than its loss function for determining whether or not it is dangerous. In fact, there probably isn’t even a well-defined loss function that would remain aligned under infinite optimization pressure.
Where we probably agree:
I enthusiastically endorse keeping in mind the possibility that the correct answer to the question you excerpted is “Haha trick question, there is no such loss function.”
I enthusiastically endorse having an open mind to any good ideas that we can think of to steer our future AGIs in a good direction, including things unrelated to loss functions, and including ideas that are radically different from anything in the human brain. For example in this post I talk about lots of things that are not “choosing the right loss function”.
As for your link, I disagree that “specifying the right loss function” is equivalent to “writing down the correct utility function”. I’m not sure it makes sense to say that humans have a utility function at all, and if they do, it would be full of learned abstract concepts like “my children will have rich fulfilling lives”. But we definitely have loss functions in our brain, and they have to be specified by genetically-hardcoded circuitry that (I claim) cannot straightforwardly refer to complicated learned abstract concepts like that.
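As a toy illustration of that point (the specific signals and numbers below are made up for illustration, not a claim about actual brain circuitry), a genome-specifiable reward circuit can be a fixed function of innately-available signals, but it can’t take as input a concept that only comes to exist after within-lifetime learning:

```python
# A hypothetical sketch (names invented for illustration): a reward circuit
# that the genome could plausibly hard-code, because it only reads signals
# that exist innately, before any within-lifetime learning.
from dataclasses import dataclass

@dataclass
class InnateSignals:
    sweet_taste: float
    pain: float
    blood_glucose: float

def hardcoded_reward(sig: InnateSignals) -> float:
    # A fixed, genetically-specifiable function of innate, low-level signals.
    return 2.0 * sig.sweet_taste - 5.0 * sig.pain + 0.5 * sig.blood_glucose

# By contrast, a reward like "the degree to which my children are flourishing"
# would have to take a *learned* world-model concept as input, and that
# concept doesn't exist until after learning has built the world-model, so
# hard-coded circuitry can't straightforwardly point at it.
```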
I’m not quite sure what you mean by “architecture” here.
If “architecture” means 96 transformer layers versus 112 transformer layers, then I don’t care at all. I claim that the loss function is much more important than that for whether the system is dangerous.
Or if “architecture” means “There’s a world-model updated by self-supervised learning, and then there’s actor-critic reinforcement learning, blah blah blah”, then yes this is very important, but it’s not unrelated to loss functions—the world-model’s loss function would be sensory prediction error, the critic’s loss function would be reward prediction error, etc. Right?
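To spell out that correspondence, here’s a minimal sketch (illustrative code of my own, written against a generic PyTorch setup rather than any particular brain-like implementation) in which the “architecture” and its loss functions come as a package deal:

```python
# Illustrative sketch only: an architecture described at the level of the
# paragraph above, where each component comes with its own loss function.
import torch
import torch.nn as nn

OBS_DIM, LATENT_DIM, ACT_DIM = 16, 32, 4

class WorldModel(nn.Module):
    """Predicts the next observation from the current observation and action."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(OBS_DIM, LATENT_DIM)
        self.predictor = nn.Linear(LATENT_DIM + ACT_DIM, OBS_DIM)

    def forward(self, obs, act):
        latent = torch.relu(self.encoder(obs))
        return self.predictor(torch.cat([latent, act], dim=-1))

world_model = WorldModel()
critic = nn.Linear(OBS_DIM, 1)  # value head: estimates expected future reward
# (an actor trained to increase the critic's estimate is omitted for brevity)

def world_model_loss(obs, act, next_obs):
    # Self-supervised learning: the world-model's loss is sensory prediction error.
    return ((world_model(obs, act) - next_obs) ** 2).mean()

def critic_loss(obs, reward, next_obs, gamma=0.99):
    # The critic's loss is reward prediction error (a TD error).
    td_target = reward + gamma * critic(next_obs).detach()
    return ((critic(obs) - td_target) ** 2).mean()
```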
I think I would say “maybe” where you say “probably”. I think it’s an important open question. I would be very interested to know one way or the other.
I think humans are an interesting case study. Almost all humans do not want to literally exterminate humanity. If a human were much “smarter”, but had the same life experience and social instincts, would they reliably develop a motivation to exterminate humanity? I’m skeptical. But mainly I don’t know. I talk about it a bit in §12.4.4 here. Different people seem to have different intuitions on this topic.
See also §9.5 here for my argument against the proposition that brain-like AGIs will make decisions to maximize future rewards.
In the standard picture of a reinforcement learner, suppose you get to specify the reward function and I get to specify the “agent”. No matter what reward function you choose, I claim I can make an agent that both: 1) gets a huge reward compared to some baseline implementation, and 2) destroys the world. In fact, I think most “superintelligent” systems have this property for any reward function you could specify using current ML techniques.
Now switch the order: I design the agent first and ask you for an arbitrary reward function. I claim that there exist architectures which: 1) are useful, given the correct reward function, and 2) never, under any circumstances, destroy the world.
I think it’s impossible to reason about what an RL agent would do solely on the basis of knowing its reward function, without knowing anything else about how the RL agent works, e.g. whether it’s model-based vs model-free, etc.
(RL is a problem statement, not an algorithm. Not only that, but RL is “(almost) the most general problem statement possible”!)
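As a concrete toy example (entirely made up; the contrast below is a myopic agent vs. a lookahead planner rather than model-free vs. model-based per se, but the moral, that the reward function alone doesn’t pin down behavior, is the same):

```python
# Toy environment (invented for illustration): from the start state, "left"
# pays +1 and ends the episode; "right" pays 0 but leads to a state where
# "right" again pays +10 and ends.
REWARD = {("start", "left"): 1.0, ("start", "right"): 0.0,
          ("mid", "left"): 0.0, ("mid", "right"): 10.0}
NEXT = {("start", "left"): None, ("start", "right"): "mid",
        ("mid", "left"): None, ("mid", "right"): None}
ACTIONS = ["left", "right"]

def myopic_agent(state):
    # Picks whatever pays the most right now.
    return max(ACTIONS, key=lambda a: REWARD[(state, a)])

def best_return(state, depth):
    # Best total reward reachable within `depth` steps (exhaustive lookahead).
    if state is None or depth == 0:
        return 0.0
    return max(REWARD[(state, a)] + best_return(NEXT[(state, a)], depth - 1)
               for a in ACTIONS)

def planning_agent(state, depth=2):
    # Picks the action with the best total reward over the lookahead horizon.
    return max(ACTIONS, key=lambda a: REWARD[(state, a)]
               + best_return(NEXT[(state, a)], depth - 1))

print(myopic_agent("start"))    # "left": grabs the immediate +1
print(planning_agent("start"))  # "right": foregoes +1 now for +10 later
```

Same reward function, different behavior; all of the difference lives in the agent design.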
I think we’re in agreement on that point.
But that point doesn’t seem to be too relevant in this context. After all, I specified “neocortex / hippocampus / striatum / etc.-like learning algorithms”. My previous reply linked an extensive discussion of what I think that actually means. So I’m not sure how we wound up on this point. Oh well.
In your second paragraph:
If I interpret “useful” in the normal sense (“not completely useless”), then your claim seems true and trivial. Just make it a really weak agent (but not so weak that it’s 100% useless).
If I interpret “useful” to mean “sufficiently powerful to qualify as AGI”, then you would seem to be claiming a complete solution to AGI safety, and I would reply that I’m skeptical and interested to see details.