Nice post!
I do think there is a way in which this proposal for solving outer alignment “misses a hard bit”, or, to put it better, “presupposes outer alignment is approximately already solved”.
Indeed, the outer objective you propose isn’t completely specified. A key missing piece the AGI would need to have internalized is “what human actions constitute evidence of which utility functions”. That is, you are incentivizing it to observe actions in order to reduce the space of possible Uis, but you haven’t specified how to update on these observations. In your toy example with the switch X, of course we could somehow directly feed the pi into the AI (which amounts to solving the problem ourselves). But in reality we won’t always be able to do this: every small human action will constitute some sort of evidence the AI needs to interpret.
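For concreteness, here’s a minimal sketch (my own construction, not something from your post) of the kind of update the objective leaves unspecified: a Bayesian update over candidate Uis, where all the difficulty lives in the likelihood model P(action | Ui) that the AI would have to supply itself.

```python
# Minimal sketch (not from the post): updating a distribution over
# candidate utility functions U_i after observing a human action.
# The hard, unspecified part is likelihood(action, u): how much
# evidence a given human action provides for each U_i.

def update_posterior(prior, action, likelihood):
    """prior: dict U_i -> P(U_i); likelihood(action, u) -> P(action | U_i)."""
    unnormalized = {u: p * likelihood(action, u) for u, p in prior.items()}
    z = sum(unnormalized.values())
    return {u: p / z for u, p in unnormalized.items()}

# Toy stand-in likelihood for the switch example: pressing the switch
# is (by assumption) strong evidence for U_1. In reality *every* small
# human action carries some evidence, and specifying this function for
# all of them is basically solving IRL.
def toy_likelihood(action, u):
    if action == "press_switch":
        return 0.9 if u == "U_1" else 0.1
    return 0.5  # uninformative for anything else

posterior = update_posterior({"U_1": 0.5, "U_2": 0.5}, "press_switch", toy_likelihood)
print(posterior)  # mass shifts toward U_1 only because we hand-coded the likelihood
```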
So for this protocol to work, we already need the AI to correctly interpret any human action as information about their utility function. That is, we need to have solved IRL, and thus outer alignment.
But what if we input this under-specified outer objective and let the AI figure out on its own which actions constitute which evidence?
The problem is that it’s not enough for the AI to factually know this; it needs to have internalized it as part of its objective, and thus we need the whole objective to be specified from the start. This is analogous to how it’s not enough for an AI to factually know “humans don’t want me to destroy the world”; we actually need its objective to care about this (maybe through a pointer to human values). Of course, your proposal tries to construct that pointer; the problem is that if you provide an under-specified pointer, the AI will fill it in (during training, or whatever) with random elements, pointing to random goals. That is, the AI won’t see “performing updates on the Uis in a sensible manner” as part of its goal, even if it eventually learns how to “perform updates on the Uis in a sensible manner” as an instrumental capability (because you didn’t specify that from the start).
But this seems to assume the AI is already fully capable when we give it an outer objective, which is not the case. What if we train it on this objective (providing the actual score at the end of each episode) in the hope that it learns to fill in the objective the correct way?
Not exactly. The argument above can be rephrased as “if you train it on an under-specified objective, it will almost surely (even setting aside possible inner alignment failures) learn the wrong proxy for that objective”. That is, it will learn a way to fill in the “update the Uis” gap which scores perfectly on training but is not what we really would have wanted. So you somehow need to ensure it learns the correct proxy without completely specifying it, which is again a hard part of the problem. Maybe there’s some practical reason (about the inductive biases of SGD, etc.) to expect training with this outer objective to be more tractable than training with any rough approximation of our utility function (maybe because you can somehow vastly reduce the search space), but that would be a whole different argument.
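To make the “wrong proxy” worry concrete, here’s a toy sketch (again my own construction, with made-up episode data): two ways of filling in the update rule that agree on every training episode but come apart off-distribution.

```python
# Toy sketch (my construction): two fill-ins for the unspecified
# "update the Uis" step. Suppose that in every training episode the
# human happens to press the switch, and pressing really is evidence
# for U_1. Then both rules below score perfectly during training.

def sensible_update(actions):
    # Intended rule: conclude U_1 only when the evidence (a switch press) is present.
    return "U_1" if "press_switch" in actions else "U_2"

def degenerate_proxy(actions):
    # Proxy rule: always output U_1, ignoring the evidence entirely.
    return "U_1"

training_episodes = [["press_switch"], ["press_switch", "wave"], ["press_switch"]]
deployment_episode = ["wave"]  # the human never presses the switch

# Indistinguishable on the training episodes...
print(all(sensible_update(ep) == degenerate_proxy(ep) for ep in training_episodes))  # True
# ...but they diverge as soon as the evidence pattern changes.
print(sensible_update(deployment_episode), degenerate_proxy(deployment_episode))     # U_2 U_1
```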
On another note, I do think this approach might be applicable as an outer shell to “soften the edges” when we already have an approximately correct solution to outer alignment (that is, when we already have a robust account of what constitutes “updating the Uis”, and also a good enough approximation to our utility function to be confident that our pool of approximate Uis contains it). In that sense, this seems functionally very reminiscent of Turner’s low-impact measure AUP, which also basically “softens the edges” of an approximate solution by considering auxiliary Uis. That said, I don’t expect the two approaches to coincide in the limit, since his is basically averaging over some Uis, and yours is doing something very different, as explained in your “Limit case” section.
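For reference, the AUP-style penalty I have in mind is roughly the following (from memory of Turner’s formulation, so the details may differ from his exact definition): the agent’s reward is its primary reward minus a scaled average of how much each action shifts its attainable value under the auxiliary Uis, relative to doing nothing,

$$R_{\text{AUP}}(s,a) \;=\; R(s,a) \;-\; \frac{\lambda}{N}\sum_{i=1}^{N}\bigl|\,Q_{U_i}(s,a) - Q_{U_i}(s,\varnothing)\,\bigr|$$

where $\varnothing$ is a no-op action and the $U_i$ are the auxiliary utility functions. That averaging over the Uis is the sense in which I expect his limit behavior to differ from yours.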
Please do let me know if I’ve misrepresented anything :)