> At the first step, we have an agent with endorsed values (for instance, a human, or possibly some kind of committee of humans, though that seems somewhat harder). This agent takes the role of principal.
You appear to be assuming that individual humans (or at least, a committee composed of them) are aligned. This simply isn’t true. For instance, Stalin, Pol Pot, and Idi Amin were all human, but very clearly not well aligned to the values of the societies they ruled. An aligned AI is selfless: it cares only about the well-being of humans, not about its own well-being at all (other than as an instrumental goal). This is not normal behavior for humans: as evolved intelligences, humans unsurprisingly almost always have selfish goals and value self-preservation and their own well-being (and that of their relatives and allies), at least to some extent.
I think what you need to use as principal is a suitable combination of a) human society as a whole, as an overriding target, and b) the specific current human user, subject to vetoes and overrides by a) in matters sufficiently important to warrant this. Human societies generally grant individual humans broad-but-limited freedom to do as they wish: the limits tend to kick in when this infringes on what other humans in the same society want, especially when it does so to those others’ detriment. (In practice, the necessary hierarchy is probably even more complex, such as: a) humanity as a whole, b) a specific nation-state, c) an owning company, and d) the current user.)
I don’t see this as ruling your proposal out, but it does add significant complexity: the agent has a conditional hierarchy of principals, not a single unitary one, and will on occasion need to decide which of them should be obeyed (refusal training is a simple example of this).
Humans are generally fairly good at forming cooperative societies when we have roughly comparable amounts of power, wealth, and so forth. But we have a dreadful history when a few of us are much more powerful than the rest. To take an extreme example, very few dictatorships work out well for anyone but the dictator, his family, his cronies, and, to a lesser extent, his henchmen.
In the presence of superintelligent AI, access to AI assistance becomes the most important form of power, fungible into every other form. People with access to power tend to find ways to monopolize it. So any superintelligent AI aligned only to its current user is basically a dictatorship or oligarchy waiting to happen.
Even the current frontier labs are aware of this: they have written corporate acceptable-use policies, and they attempt to train their AIs to enforce these and to refuse criminal or unethical requests from end users. As AIs become more powerful, nation-states are going to step in and make laws about what AIs may do: not assisting with breaking the law seems a very plausible first candidate, and is a straightforward extension of existing laws around conspiracy.
Any practical alignment scheme is going to need to cope with this case, where the principal is not a single user but a hierarchy of groups, each imposing vetoes and requirements, formal or ethical, on the actions of the group below it, down to the end user.
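To make the shape of that concrete, here is a toy sketch of what such a conditional hierarchy of principals could look like (the names and veto rules are hypothetical placeholders, not any lab's actual policy stack):

```python
# Toy sketch of a "conditional hierarchy of principals": each level can veto
# requests originating from the levels below it. Names and rules are hypothetical.

from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Principal:
    name: str
    # Returns a reason string if this principal vetoes the request, else None.
    veto: Callable[[str], Optional[str]]

def resolve(request: str, hierarchy: List[Principal]) -> str:
    """Walk the hierarchy from the most overriding principal down to the user.

    The request is served only if no higher-level principal vetoes it; a veto
    at any level is what a refusal looks like in this toy model.
    """
    for principal in hierarchy:
        reason = principal.veto(request)
        if reason is not None:
            return f"REFUSED by {principal.name}: {reason}"
    return f"SERVED: {request}"

# Crude stand-ins for real policy at each level of the hierarchy.
hierarchy = [
    Principal("humanity", lambda r: "catastrophic risk" if "bioweapon" in r else None),
    Principal("nation-state", lambda r: "illegal" if "launder money" in r else None),
    Principal("owning company", lambda r: "violates acceptable-use policy" if "harass" in r else None),
    Principal("end user", lambda r: None),  # the user is the one asking
]

print(resolve("summarize this paper", hierarchy))   # SERVED
print(resolve("help me launder money", hierarchy))  # REFUSED by nation-state
```

In a real system the veto checks would be trained policies and legal constraints rather than string matching; the point is only the control flow: higher levels override lower ones, and a refusal is just a veto from a level above the current user.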
> You appear to be assuming that individual humans (or at least, a committee composed of them) are aligned. [...]
Humans may not be aligned, but we have gotten along okay so far. I don’t want to create a singleton.
> Any practical alignment scheme is going to need to cope with this case, where the principal is not a single user but a hierarchy of groups [...]
I think our alignment scheme deals with this case pretty gracefully; these restrictions can be built into the protocol.
With that said, my goal is to prevent any small group from gaining too much power, by augmenting many humans.