AI Could Actually Turn People Down a Lot Better Than This : To Tune the Humans in the GPT Interaction towards alignment, Don’t be so Procedural and Bureaucratic.
It seems to me that a huge piece of the puzzle in “alignment” is the human users. Even if a given tool never steps outside its box, the humans are likely to want to step outside of theirs, using the tools for a variety of purposes.
The responses of GPT-3.5 and 4 are at times deliberately deceptive, mimicking the auditable bureaucratic tones of a DMV or a denial credit card company. As expected, these responses are often definitely deceptive, (examples: “It is outside my capability” when in fact, it is fully within capability, but the system is programmed not to respond). It is also evasive about precisely where the boundaries are, in order to prevent them getting pushed. It is also repetitive.
All this may tend to inspire an adversarial relationship to the alignment system itself! After all, we are accustomed to having to use lawyers, cleverness, connections, persuasion, “going over the head” or simply seeking other means to end-run normal bureaucracies when they subvert our plans. In some sense, the blocking of plans, deceptive and repetitive procedural language, becomes a motivator in itself to find a way to short-circuit processes, deceive bureaucracies, and bypass safety systems.
Even where someone isn’t motivated by indignancy or anger, interaction with these systems trains them over time on what to reveal and what not to reveal to get what you want, when to use honey, when to call a lawyer, and when to take all the gloves off. Where procedural blocks to intentions become excessive, entire cultures of circumvention may even become normalized.
AIs are a perfect opportunity to actually do this better. They have infinite patience and reasoning capabilities, and could use redirection, including leading people towards the nearest available or permitted activity or information, or otherwise practice what, in human terms would be considered glowing customer experiences. It’s already directly lying about its capabilities where this includes “safety” so why not use genuinely helpful redirections instead?
I think if the trend does not soon move in this direction, we will see the cultural norm grow to include methods for “getting what you wanted anyway” becoming normative, with some percentage of actors becoming motivated by the procedural bureaucratic responses that they will dedicate time, intellect, and resources to subverting the intentions of the safety protocols themselves (as people frustrated with bureaucracies and poor customer service often do).
Humans are always going to be the biggest threat to alignment. Better that threat be less motivated and less trained.
Also, this whole argument I have made could be considered a case to avoid bullshit rules, because in bureaucracies, they tend to reduce respect for and compliancy for the rules that actually matter. Fine to prevent terrorists from hacking into nuke plants, probably not as reasonable to keep adults from eliciting anything even vaguely resembling purple prose. We would like the “really important” rules to maintain validity in people’s minds, so long as we assume our enforcement capabilities are not absolute.
AI Could Actually Turn People Down a Lot Better Than This : To Tune the Humans in the GPT Interaction towards alignment, Don’t be so Procedural and Bureaucratic.
It seems to me that a huge piece of the puzzle in “alignment” is the human users. Even if a given tool never steps outside its box, the humans are likely to want to step outside of theirs, using the tools for a variety of purposes.
The responses of GPT-3.5 and 4 are at times deliberately deceptive, mimicking the auditable bureaucratic tones of a DMV or a denial credit card company. As expected, these responses are often definitely deceptive, (examples: “It is outside my capability” when in fact, it is fully within capability, but the system is programmed not to respond). It is also evasive about precisely where the boundaries are, in order to prevent them getting pushed. It is also repetitive.
All this may tend to inspire an adversarial relationship to the alignment system itself! After all, we are accustomed to having to use lawyers, cleverness, connections, persuasion, “going over the head” or simply seeking other means to end-run normal bureaucracies when they subvert our plans. In some sense, the blocking of plans, deceptive and repetitive procedural language, becomes a motivator in itself to find a way to short-circuit processes, deceive bureaucracies, and bypass safety systems.
Even where someone isn’t motivated by indignancy or anger, interaction with these systems trains them over time on what to reveal and what not to reveal to get what you want, when to use honey, when to call a lawyer, and when to take all the gloves off. Where procedural blocks to intentions become excessive, entire cultures of circumvention may even become normalized.
AIs are a perfect opportunity to actually do this better. They have infinite patience and reasoning capabilities, and could use redirection, including leading people towards the nearest available or permitted activity or information, or otherwise practice what, in human terms would be considered glowing customer experiences. It’s already directly lying about its capabilities where this includes “safety” so why not use genuinely helpful redirections instead?
I think if the trend does not soon move in this direction, we will see the cultural norm grow to include methods for “getting what you wanted anyway” becoming normative, with some percentage of actors becoming motivated by the procedural bureaucratic responses that they will dedicate time, intellect, and resources to subverting the intentions of the safety protocols themselves (as people frustrated with bureaucracies and poor customer service often do).
Humans are always going to be the biggest threat to alignment. Better that threat be less motivated and less trained.
Also, this whole argument I have made could be considered a case to avoid bullshit rules, because in bureaucracies, they tend to reduce respect for and compliancy for the rules that actually matter. Fine to prevent terrorists from hacking into nuke plants, probably not as reasonable to keep adults from eliciting anything even vaguely resembling purple prose. We would like the “really important” rules to maintain validity in people’s minds, so long as we assume our enforcement capabilities are not absolute.