So this seems to me like it’s the crux. I agree with you that GPT-4 is “pretty good”, but I think the standard necessary for things to go well is substantially higher than “pretty good”, and that’s where the difficulty arises once we start deploying systems with higher and higher levels of capability and influence over the environment.
This makes sense to me. On the other hand—it feels like there’s some motte and bailey going on here, if one claim is “if the AIs get really superhumanly capable then we need a much higher standard than pretty good”, but then it’s illustrated using examples like “think of how your AI might not understand what you meant if you asked it to get your mother out of a burning building”.
I don’t understand your objection. A more capable AI might understand that it’s completely sufficient to tell you that your mother is doing fine, and simulate a phone call with her to keep you happy. Or it just talks you into not wanting to confirm in more detail, etc. I’d expect that the problem wouldn’t be getting the AI to do what you want in a specific supervised setting, but remaining in control of the overall situation, which includes being able to rely on the AI’s actions not having any ramifications beyond its narrow task.
The question is how you even train the AI under the current paradigm once “human preferences” stops being a standard for evaluation and just becomes another aspect of the AI’s world model that needs to be navigated.
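To make the worry concrete, here is a minimal sketch of what “human preferences as a standard for evaluation” means in the current paradigm, assuming something like pairwise preference-based reward modeling (Bradley–Terry style); the names are illustrative, not any particular lab’s implementation:

```python
# Sketch of the "current paradigm": human preferences act as the training
# signal via pairwise comparisons between model outputs.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    # Train the reward model so the human-preferred response scores
    # higher than the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# The concern above: this objective only tracks what we want while the
# human comparison is an *external* standard. Once the policy models the
# rater well enough to predict (and shape) those comparisons, the same
# objective rewards managing the rater's judgment rather than satisfying
# the underlying intent.
```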