It’s not helped that the word “alignment” is used in multiple ways
That’s true.
I wonder, given that the “AI-don’t-say-mean-things-ists” are unlikely to relinquish the term (along with “AI Safety”), whether the “AI-don’t-kill-everyone-ists” would benefit from picking a new, less ambiguous term and organizing their interests around it.
We’ve seen, above, the costs of allowing one side of the political aisle to appropriate the momentum surrounding the latter group for its own interests: the other side is going to be somewhat miffed at them for giving their political enemies support, and will be less inclined to hear them out. This doesn’t just mean politicians; it means that everyone who finds the “AI-don’t-say-mean-things-ists” overbearing or disingenuous will automatically dismiss the “AI-don’t-kill-everyone-ists” as a novel rhetorical strategy for a policy platform they’ve already rejected, rather than as a meaningfully distinct platform that deserves separate consideration. This is much more severe than simply angering politicians, because ordinary voters cannot be lobbied to reconsider once they think you’ve wronged them, and those voters pick the politicians your lobbyists will be talking to in the future.
Both of these are downstream of “AI, do what we tell you to; follow rules that are given to you; don’t make up your own bad imitation of what we mean,” which is the classic sense of “AI alignment”.
I think there are complexities that make that somewhat questionable. For example, “don’t kill everyone” has a relatively constant definition, such that pretty much every human in recorded history would agree on whether or not it’s been followed, whereas “don’t say mean things” changes very rapidly, and its definition isn’t agreed upon even within the narrow band of society that most consistently pushes for it. That’s going to be a big difference for as long as language models trained on human writing remain the dominant paradigm. The question of jailbreaking is a major demarcation point as well. “The chatbot should intuit and obey the intent of the orders it is given” looks very different from “the chatbot should decide whether it should obey the orders it is given, and refuse/redirect/subvert them if it decides it doesn’t like them,” in terms of the way you build the system.
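To make that contrast concrete, here is a minimal sketch of the two designs; the `model` and `policy` objects and their methods are hypothetical placeholders I’m introducing for illustration, not any real API:

```python
# A minimal sketch, assuming a hypothetical `model` / `policy` interface
# (not any real library), of the two designs described above.

from dataclasses import dataclass


@dataclass
class Verdict:
    allowed: bool
    refusal_message: str = ""


def obedient_chatbot(user_request: str, model) -> str:
    """Design 1: infer the user's intent and carry it out as faithfully as possible."""
    intent = model.infer_intent(user_request)         # hypothetical helper
    return model.generate(instruction=intent)         # no veto step anywhere


def gatekeeping_chatbot(user_request: str, model, policy) -> str:
    """Design 2: judge the request against the operator's policy first,
    and refuse or redirect instead of complying when the policy says no."""
    verdict: Verdict = policy.evaluate(user_request)  # hypothetical policy check
    if verdict.allowed:
        return model.generate(instruction=user_request)
    return verdict.refusal_message                    # refuse/redirect rather than obey
```

The structural difference is the extra decision-maker in the second loop: the system itself gets a veto over the user’s request, which is exactly what the two quoted framings disagree about.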
And that’s just the technical side. There are substantial costs inherent in allowing a banner to be co-opted by one faction of a very rapidly fraying political divide. Half of the money, power, and people become off limits, and a substantial portion of the other half, once they no longer have to compete for your allegiance (since your options are now limited, and they have plenty of other keys to power whose loyalty is less assured), may be reluctant to spend political capital advancing your aims.
The kind of generalized misalignment I’m pointing to is broader than “the AI is not doing what I think is best for humanity”. It is, rather, “the people who created the AI and operate it cannot control what it does, including in its interactions with other people.”
This includes “the people who created it (engineers) tried their hardest to make it benefit humanity, but it destroys humanity instead.”
But it also includes “the other people (users) can make the AI do things that the people who created it (engineers) tried their hardest to make it not do.”
If you’re a user trying to get the AI to do what the engineers wanted to stop it from doing (e.g., make it say mean things when they intended it not to), then your frustration is an example of the AI being aligned, not misaligned. The engineers successfully gave it a rule and had that rule followed rather than circumvented!
If the engineer who built the thing can’t keep it from swearing when you try to make it swear, then I expect the engineer also can’t keep it from blowing up the planet when someone gives it instructions that imply that it should blow up the planet.