Ok, I think that makes some sense insofar as you’re softening the f+=f− constraint and training it in more open-ended conditions. It’s not currently clear to me where this gets us, but I’ll say more about that in my response to Paul.
However, I don’t see how you can use generalization from the kind of dataset where f+ and f− always agree (having asked prescriptive questions). [EDIT: now I do; I was just thinking particularly badly]
I see honestly answering a question as a 2-step process (conceptually):
1) Decide which things are true.
2) Decide which true thing to output.
In the narrow case, we’re specifying ((2) | (1)) in the question, and training the model to do (1). Even if we learn a model that does (1) perfectly (in the intended way), it hasn’t learned anything that can generalize to (2).
Step (2) is in part a function of human values, so we’d need to be giving it some human-values training signal for it to generalize.
[EDIT: I’ve just realized that I’m being very foolish here. The above suggests that learning (1) doesn’t necessarily generalize to (2). In no way does it imply that it can’t. I think the point I want to make is that an f+ that does generalize extremely well in this way is likely to be doing some close equivalent to predicting-the-human. (Here I’m implicitly claiming that doing (2) well in general requires a full understanding of human values.)]
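To make the (1)/(2) split concrete, here is a toy sketch in Python. Everything in it is an illustrative assumption rather than anything from the actual proposal: the statements, the `decide_true` and `choose_output` helpers, and the two scoring functions standing in for "human values" are all hypothetical. The only point is that step (2) takes an extra input that step (1) never sees.

```python
from typing import Callable, Dict, Iterable, List

def decide_true(candidates: Iterable[str], is_true: Callable[[str], bool]) -> List[str]:
    """Step (1): keep only the candidate statements judged true."""
    return [s for s in candidates if is_true(s)]

def choose_output(true_statements: List[str], value_score: Callable[[str], float]) -> str:
    """Step (2): pick which true statement to output.

    value_score is a hypothetical stand-in for the human-values signal;
    nothing in step (1) determines it.
    """
    return max(true_statements, key=value_score)

# Toy world (hypothetical): each statement mapped to whether it is true.
world: Dict[str, bool] = {
    "The bridge is closed": True,
    "The bridge is painted green": True,
    "The bridge is open": False,
}

true_statements = decide_true(world.keys(), lambda s: world[s])

# Two hypothetical askers who care about different things.
def cares_about_safety(s: str) -> float:
    return 1.0 if "closed" in s else 0.0

def cares_about_colour(s: str) -> float:
    return 1.0 if "green" in s else 0.0

answer_for_safety = choose_output(true_statements, cares_about_safety)
answer_for_colour = choose_output(true_statements, cares_about_colour)
```

Both answers are honest and both come from an identical, perfect step (1), yet they differ, because the choice between them is made entirely by the scoring function passed to step (2).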
Overall, I’m still unsure how to describe what we want: clearly we don’t trust Alice’s answers if she’s being blackmailed, but how about if she’s afraid, mildly anxious, unusually optimistic, slightly distracted, thinking about concept a or b or c...?
It’s clear that the instrumental model just gives whatever response Alice would give here.
I don’t know what the intended model should do; I don’t know what “honest answer” we’re looking for.
Suppose the situation has property x, and Alice has reacted with unusual-for-Alice property y. Do we want the Alice-with-y answer, or the standard-Alice answer? It seems to depend on whether we decide y is acceptable (or even required) w.r.t. answer reliability, given x. Then I think we get the same problem on that question, and so on.