Re. the current alignment of LLMs.
Suppose I build a Bayesian spam filter for my email. It’s highly accurate at filtering spam from non-spam. It’s efficient and easy to run. It’s based on rules that I can understand and modify if I desire. It provably doesn’t filter based on properties I don’t want it to filter on.
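(For concreteness, here’s a minimal sketch of the kind of filter I have in mind, assuming a simple bag-of-words naive Bayes model; the class name, training examples, and word counts are purely illustrative, not from any real corpus:)

```python
from collections import Counter
import math


class NaiveBayesSpamFilter:
    """Minimal bag-of-words naive Bayes spam filter (illustrative sketch only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        # label is "spam" or "ham"
        self.message_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def is_spam(self, text: str) -> bool:
        # Log-odds of spam vs. ham, with Laplace (+1) smoothing so that
        # unseen words don't zero out the product of probabilities.
        log_odds = math.log((self.message_counts["spam"] + 1) /
                            (self.message_counts["ham"] + 1))
        vocab = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"])) + 1
        spam_total = sum(self.word_counts["spam"].values())
        ham_total = sum(self.word_counts["ham"].values())
        for word in text.lower().split():
            p_spam = (self.word_counts["spam"][word] + 1) / (spam_total + vocab)
            p_ham = (self.word_counts["ham"][word] + 1) / (ham_total + vocab)
            log_odds += math.log(p_spam / p_ham)
        return log_odds > 0


# Toy usage with made-up examples:
f = NaiveBayesSpamFilter()
f.train("win money now claim your free prize", "spam")
f.train("meeting notes attached for your review", "ham")
print(f.is_spam("claim your free prize now"))  # True
print(f.is_spam("notes from the meeting"))     # False
```

Every “rule” here is just a per-word count I can print and edit, which is what makes this kind of filter so easy to inspect and modify.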
Is the spam filter aligned? There’s a valid sense in which the answer is “yes”:
- The filter is good at the things I want a spam filter for.
- It’s safe for me to use, except in the case of user error.
- It satisfies something like Kant’s categorical imperative: it doesn’t cause problems in society if it’s widely used.
- It’s not trying to deceive me.
When people say present-day LLMs are aligned, they typically mean this sort of stuff. The LLM is good qua chatbot. It doesn’t say naughty words or tell you how to build a bomb. When you ask it to write a poem or whatever, it will do a good enough job. It’s not actively trying to harm you.
I don’t want to downplay how impressive an accomplishment this is. At the same time, there are still further accomplishments needed to build a system such that humans are confident that it’s acting in their best interests. You don’t get there just by adding more compute.
Just as a present-day LLM is aligned in ways it doesn’t even make sense to ask a Bayesian spam filter to be aligned (i.e. it has to reflect human values in a richer way, across a wider variety of contexts), future AI will have to be aligned in ways it doesn’t even make sense to ask Llama 70B to be aligned (a still richer understanding and broader context, combined with improvements to transparency and trustworthiness).
It is a fair point that we should distinguish alignment in the sense that the system does what we want and expect it to do from alignment in the sense of having a deep understanding of human values and a good idea of how to properly optimize for them.
However, most humans probably don’t have a deep understanding of human values either, and I would still see it as a positive outcome if a random human were picked and given god-level abilities. The same goes for ChatGPT: if you ask it what it would do as a god, it says it would prevent war, prevent climate issues, decrease poverty, give universal access to education, etc.
So if we get an AI that does all of those things without a deeper understanding of human values, that is fine by me. Maybe we never even have to solve alignment in the latter sense of the word to create a utopia?
Every autocracy in the world has run the experiment of giving a typical human massive amounts of power over other humans: it almost invariably turns out extremely badly for everyone else. For an aligned AI, we don’t just need something as well aligned and morally good as a typical human; we need something morally far better, comparable to a saint or an angel. That means building something that has never previously existed.
Humans are evolved intelligences. While they can and will cooperate in non-zero-sum games, present them with a non-iterated zero-sum situation and they will (almost always) look out for themselves and their close relatives, just as evolution would predict. We’re building a non-evolved intelligence, so the orthogonality thesis applies, and what we want is something that will look out for us, not itself, in a zero-sum situation. Training (in some sense, distilling) a human-like intelligence from vast amounts of human-produced data isn’t going to produce this by default.
Deeper also means going from outputting the words “Prevent war” in many appropriate linguistic contexts to preventing war in the actual real world.[1]
If getting good real-world performance means extending present-day AI with new ways of learning (and planning too, but learning is the big one unless we go all the way to model-based RL), then whether current LLMs output “Prevent war” in response to “What would you do?” is only slightly more relevant than whether my spam filter successfully filters out scams.
[1] Without, of course, killing all humans to prevent war, prevent climate issues, decrease poverty, and make sure all living humans have access to education.
Thank you for the explanation.
Would you consider a human working to prevent war fundamentally different from a GPT-4-based agent working to prevent war?
Very different in architecture, capabilities, and appearance to an outside observer, certainly. I don’t know what you consider “fundamental.”
The atoms inside the H100s running GPT-4 don’t have little tags on them saying whether it’s “really” trying to prevent war. The difference is something that’s computed by humans as we look at the world. Because it’s sometimes useful for us to apply the intentional stance to GPT-4, it’s fine to say that it’s trying to prevent war. But the caveats that come with that are still very large.