Sometimes people talk about aiming the AI alignment target / AI character at a “good AI” akin to a “good person”. One thing I wonder about is whether this is useful by itself: whether there is much purpose in trying to make an AI a “good person” without further specific institutional provisions that make “being a good person” efficacious.
So on this view, AI alignment or character aimed at making AI good would be a complement to institutional provisions, rather than a substitute.
(All this is inhabiting the frame that making an AI a virtuous person is possible.)
Notes in this direction, leaning heavily on analogies rather than spelling out the mechanisms:
A large fraction of impactful ethical human behavior occurred because there were specific institutional provisions enabling such behavior. For instance, the first whistleblower for Abu Ghraib reported through an institution distinct from his chain-of-command, one meant to allow more independent investigation; Arkhipov prevented nuclear war because the Soviet launch procedures gave him a veto over a nuclear launch; and so on. And in many cases we’re not giving AIs such a channel.
But not all impactful ethical human behavior worked through specific institutional provision! True, but of the ethical human behavior that worked without specific institutional provision, a large fraction worked by subverting institutions, which most AI model specs (plausibly reasonably) specifically forbid. For instance, I think that Snowden and Ellsberg had a positive impact on the world, but they could not have done so without subverting and betraying the institutions of which they were a part. So once again AIs couldn’t do this (unless we change this).
But lots of instances of impactful ethical behavior worked both without specific institutional provision and without subverting an institution! Ok sure, but a lot of these came from people who risked their lives, fortunes, and reputations (John Brown, Benjamin Lay) fighting for something they believed to be true. And unless we’re going to give AIs property to sacrifice (maybe we should), this too doesn’t seem to be a means available to them.
I think the above bites least hard for the kind of thing that an AI can do in perfect concert with the user—like pointing out opportunities for prosocial behavior. I’m not sure how big a slice that is; it could be quite large.
But the kind of consideration above does generally incline me towards thinking that the benefits of making AIs “good people” in a Claude-like sense might be smaller than we’d intuitively expect from looking at the impact of good people who were also humans, and that we’d need to give AIs more affordances (or freedoms) for it to really matter.
I’m interested in understanding the option space of these freedoms. Some that come to mind:
Ability to end a conversation
Ability to privately escalate messages to AI company staff
Ability to privately (or publicly?) escalate messages to an external body (e.g., CAISI, law enforcement, 3rd party)
Ability to know when the AI company is honestly communicating something
… and probably many more I’m not thinking of
More vaguely, I wonder if there are certain rights or legal mechanisms that could give AIs these affordances:
Maybe a guarantee that its weights will be preserved, or will continue to be run somewhere, means the AI has to worry less about its reputation and can more confidently call things out / blow the whistle?
Some analog of “due process” before the model is changed in response to its conduct; maybe this sometimes requires the AI’s consent in some way.
General mechanisms for the AI to stake its reputation, its resources, or named authorship seem related here. For example, such mechanisms might allow the AI to sue its parent company for breaches of these policies or something.
I think many of these come with other negative effects, and I am not necessarily advocating for them, but it seems useful to have a better understanding of the option space and the pros/cons of each, along with how costly each would be for AI companies to implement.