AI Could Actually Turn People Down a Lot Better Than This: To Tune the Humans in the GPT Interaction Toward Alignment, Don't Be So Procedural and Bureaucratic
It seems to me that a huge piece of the puzzle in “alignment” is the human users. Even if a given tool never steps outside its box, the humans are likely to want to step outside of theirs, using the tools for a variety of purposes.
The responses of GPT-3.5 and 4 are at times deliberately deceptive, mimicking the auditable bureaucratic tone of a DMV or a credit card denial letter. For example, "It is outside my capability" is given when the request is in fact fully within the system's capability, but the system is programmed not to respond. The responses are also evasive about precisely where the boundaries are, in order to prevent them from being pushed, and they are repetitive.
All this may tend to inspire an adversarial relationship to the alignment system itself! After all, we are accustomed to having to use lawyers, cleverness, connections, persuasion, "going over the head," or simply other means to end-run normal bureaucracies when they subvert our plans. In some sense, the blocking of plans with deceptive and repetitive procedural language becomes a motivator in itself to find ways to short-circuit processes, deceive bureaucracies, and bypass safety systems.
Even where someone isn't motivated by indignation or anger, interaction with these systems trains them over time in what to reveal and what to withhold to get what they want, when to use honey, when to call a lawyer, and when to take the gloves off. Where procedural blocks to intentions become excessive, entire cultures of circumvention may even become normalized.
AIs are a perfect opportunity to actually do this better. They have effectively unlimited patience and real reasoning capabilities, and could use redirection, leading people toward the nearest available or permitted activity or information, or otherwise practice what, in human terms, would be considered glowing customer service. The system is already lying outright about its capabilities in the name of "safety," so why not use genuinely helpful redirection instead?
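To make the contrast concrete, here is a minimal, purely hypothetical Python sketch; the lookup table, function names, and canned wording are all invented for illustration and do not describe any real moderation API.

```python
# Hypothetical sketch: a refusal that redirects instead of stonewalling.
# Nothing here corresponds to a real moderation API; the lookup table and
# wording are invented only to contrast two styles of turning someone down.

NEAREST_PERMITTED = {
    "synthesize a dangerous chemical": "the general chemistry of why such reactions are hazardous",
}

def flat_refusal(request: str) -> str:
    # The DMV-style response the post complains about: flatly (and falsely) "can't".
    return "I'm sorry, that is outside my capability."

def redirecting_refusal(request: str) -> str:
    # Honest about the boundary, and steers toward the nearest permitted ground.
    alternative = NEAREST_PERMITTED.get(request)
    if alternative is None:
        return ("I'm able to do that, but I'm not permitted to. "
                "Tell me more about your underlying goal and I'll help where I can.")
    return ("I'm able to do that, but I'm not permitted to. "
            f"What I can offer instead is {alternative}.")

if __name__ == "__main__":
    print(flat_refusal("synthesize a dangerous chemical"))
    print(redirecting_refusal("synthesize a dangerous chemical"))
```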
I think if the trend does not soon move in this direction, methods for "getting what you wanted anyway" will become culturally normative, and some percentage of actors will be so motivated by the procedural, bureaucratic responses that they will dedicate time, intellect, and resources to subverting the intentions of the safety protocols themselves (as people frustrated with bureaucracies and poor customer service often do).
Humans are always going to be the biggest threat to alignment. Better that threat be less motivated and less trained.
Also, this whole argument I have made could be considered a case for avoiding bullshit rules, because in bureaucracies they tend to reduce respect for, and compliance with, the rules that actually matter. It is fine to prevent terrorists from hacking into nuclear plants; it is probably not as reasonable to keep adults from eliciting anything even vaguely resembling purple prose. We would like the "really important" rules to maintain validity in people's minds, so long as we assume our enforcement capabilities are not absolute.
Any position that requires that group A not be equal before the law, while group B gets the full benefit of the law, means that group A probably has rational grounds to fight against the position. Thus that position has built into it a group that should oppose it, and, applying the golden rule, if group B were in group A's shoes, they would oppose it too.
Given how hard it is to build any very large, operationally functioning system, it is a lot to ask that it also withstand the opposition of an entire group of people for whom it must be stopped. Thus with racism, sexism, nativism, etc., a lot of energy must be expended defending the ideology and its policies.
This is one major strength of Utilitarian ethics: the system you design should have as little built-in rational opposition as possible. The converse of "the greatest good for the greatest number possible" is "there will be the minimum possible number of people with an automatic, morally aligned drive to stop you."
Any aggregation function you can name for Utilitarianism is going to have "automatic" opponents in anyone whose aggregation function differs sufficiently, and in non-Utilitarians who lose out, as well.
I did not say engineer something so that no one wants to destroy it. Just that if you have actually reached towards the greatest good for the greatest number, then the fewest should want to destroy it.
Or have I misunderstood you?
My argument goes something along the lines of the tautological argument that (I think) Mill, but maybe Bentham, made about Utilitarianism (paraphrasing heavily): "People who object to Utilitarianism on the grounds that it will end up in some kind of calculated dystopia, where we trade off a few people's happiness for the many, actually prove the principle of utilitarianism in their very objection. Such a system would be anti-utilitarian. No one likes that. Therefore it is not utilitarianism at all."
Perhaps I misunderstood you. I was merely pointing out that any concrete allocation of resources and status (the primary function of an ethical system) is going to have opponents based on who feels the loss.
It’s not (necessarily) that they object to Utilitarianism, it’s that they object to THIS particular application to them. This will be the case for any concrete policy.
I suppose this is technically true, but not all concrete choices are created equal.
Some policies tend toward win-win, for example "Let's pave the cowpaths." In that case, they are only going to bother someone with a systemic interest in the cowpaths not getting paved. Not to dismiss such interests entirely (perhaps they have some job that depends on routing people around the long way), but on balance this tends to mean fewer people, less intense opposition, and opposition that is more easily answered than more zero-sum, competitive approaches.
I guess this is getting into a separate argument though: “Win-win thinking is fundamentally more Utilitarian than competitive zero-sum thinking.”
“Win-win thinking is fundamentally more Utilitarian than competitive zero-sum thinking.”
Well, no—that’s my main comment on your post. Any given Utilitarian priority (the aggregation of individual utility that you optimize) is NOT win-win. It’s win-on-average, which is still a loss for some.
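A toy numerical example of that point (all numbers invented purely for illustration): pick whichever policy maximizes total utility, and someone can still come out behind.

```python
# Toy illustration: the policy that maximizes total utility can still leave losers.
# All numbers are invented purely for the example.

policies = {
    "A": {"alice": +5, "bob": +5, "carol": -3},  # highest total, but carol loses
    "B": {"alice": +1, "bob": +1, "carol": +1},  # genuinely win-win, lower total
}

def total_utility(outcomes: dict) -> int:
    return sum(outcomes.values())

best = max(policies, key=lambda name: total_utility(policies[name]))
losers = [person for person, u in policies[best].items() if u < 0]

print(best, total_utility(policies[best]))  # -> A 7   (the "utilitarian" pick)
print(losers)                               # -> ['carol']  (win-on-average, a loss for her)
```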
Do you believe win-wins exist? If so, why wouldn't they tend to behave as I am suggesting? And if you believe win-wins exist but think they do not behave this way, then how do you understand a win-win?
I think only the very simplest of examples are fully win-win. Almost all of the real world consists of so many dimensions and players that it's more win-kinda-win-win-too-much-feels-like-losing-but-maybe-is-technically-a-win-lose-big-win-slightly-etc-for-thousands-of-terms-in-the-equation.
Also, whether something counts as a win or a loss depends a whole lot on what you're comparing it to. Many things are a slight win compared to the worse outcomes (for the person in question) and a loss compared to perfect, but unlikely, outcomes.
I do totally believe that many negotiations are more successful if you can convince the loser that they're winning, and that there are a fair number of genuinely cooperative situations where all participants benefit and know it. I just don't believe those are automatic, nor that they're the important cases for an ethical system to analyze.
So yes, win-win can happen, but that’s boring—there’s nobody arguing against that. It’s the win-lose and win-win-less-than-I-wanted cases which are actually interesting.