I think there are complexities that make that somewhat questionable. For example, “don’t kill everyone” has a relatively constant definition, such that pretty much every human in recorded history would agree on whether or not it’s been followed, whereas “don’t say mean things” changes very rapidly, and its definition isn’t agreed upon even by the narrow band of society that most consistently pushes for it. That’s going to be a big difference for as long as language models trained on human writing remain the dominant paradigm. The question of jailbreaking is a major demarcation point as well. “The chatbot should intuit and obey the intent of the orders it is given” looks very different from “the chatbot should decide whether it should obey the orders it is given, and refuse/redirect/subvert them if it decides it doesn’t like them”, in terms of the way you build the system.
That’s just the technical side, too. There are substantial costs inherent in allowing a banner to be co-opted by one faction of a very rapidly fraying political divide. Half of the money, power, and people become off limits, and a substantial portion of the other half, once they no longer have to compete for your allegiance (since your options are now limited, and they have plenty of other keys to power whose loyalty is less assured), might be reluctant to spend political capital advancing your aims.
The kind of generalized misalignment I’m pointing to is more general than “the AI is not doing what I think is best for humanity”. It is, rather, “the people who created and operate the AI cannot control what it does, including in interactions with other people.”
This includes “the people who created it (engineers) tried their hardest to make it benefit humanity, but it destroys humanity instead.”
But it also includes “the other people (users) can make the AI do things that the people who created it (engineers) tried their hardest to make it not do.”
If you’re a user trying to get the AI to do what the engineers wanted to stop it from doing (e.g., making it say mean things when they intended it not to), then your frustration is an example of the AI being aligned, not misaligned. The engineers were able to successfully give it a rule and have that rule followed and not circumvented!
If the engineer who built the thing can’t keep it from swearing when you try to make it swear, then I expect the engineer also can’t keep it from blowing up the planet when someone gives it instructions that imply that it should blow up the planet.