I don’t understand your objection. A more capable AI might realize that it’s entirely sufficient to tell you your mother is doing fine and to simulate a phone call with her to keep you happy. Or it talks you into not wanting to confirm in more detail, etc. I’d expect the problem wouldn’t be getting the AI to do what you want in a specific supervised setting, but remaining in control of the overall situation, which includes being able to rely on the AI’s actions not having any ramifications beyond its narrow task.
The question is how you even train the AI under the current paradigm once “human preferences” stops being a standard for evaluation and just becomes another aspect of the AI’s world model that needs to be navigated.