Joe Collman comments on Answering questions honestly instead of predicting human answers: lots of problems and some solutions

Joe Collman 18 Jul 2021 3:14 UTC
Ok, I think that makes some sense insofar as you’re softening the $f^{+} = f^{-}$ constraint and training it in more open-ended conditions. I’m not currently clear where this gets us, but I’ll say more about that in my response to Paul.
However, I don’t see how you can use generalization from the kind of dataset where $f^{+}$ and $f^{-}$ always agree (having asked prescriptive questions). [EDIT: now I do, I was just thinking particularly badly]
I see honestly answering a question as a 2-step process (conceptually):
1) Decide which things are true.
2) Decide which true thing to output.
In the narrow case, we’re specifying ((2) | (1)) in the question, and training the model to do (1). Even if we learn a model that does (1) perfectly (in the intended way), it hasn’t learned anything that can generalize to (2).
Step (2) is in part a function of human values, so we’d need to be giving it some human-values training signal for it to generalize.
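To make the two-step picture concrete, here is a toy sketch. Everything in it is an assumption for illustration only: the function names, the bridge statements, and the two weight tables standing in for "human values" are hypothetical, and nothing here is meant as the actual training setup. It only shows that step (1) can be held fixed while the output of step (2) changes with the value signal.

```python
from typing import Callable, Dict, List

def decide_true(candidates: List[str], is_true: Callable[[str], bool]) -> List[str]:
    """Step (1): keep only the candidate statements judged true."""
    return [s for s in candidates if is_true(s)]

def choose_output(true_statements: List[str], score: Callable[[str], float]) -> str:
    """Step (2): pick which true statement to output.

    The score function stands in for a human-values signal (hypothetical).
    Raises ValueError if there are no true statements to choose from.
    """
    return max(true_statements, key=score)

def honest_answer(candidates: List[str],
                  is_true: Callable[[str], bool],
                  score: Callable[[str], float]) -> str:
    """Compose the two steps: filter for truth, then select by value."""
    return choose_output(decide_true(candidates, is_true), score)

# Toy world (all assumed for illustration).
CANDIDATES = [
    "The bridge is closed.",
    "The bridge is painted green.",
    "The bridge is made of cheese.",
]
FACTS = {"The bridge is closed.", "The bridge is painted green."}

def is_true(statement: str) -> bool:
    return statement in FACTS

# Two different stand-ins for what the asker cares about.
SAFETY_FIRST: Dict[str, float] = {
    "The bridge is closed.": 1.0,
    "The bridge is painted green.": 0.1,
}
AESTHETICS_FIRST: Dict[str, float] = {
    "The bridge is closed.": 0.1,
    "The bridge is painted green.": 1.0,
}

def scorer(weights: Dict[str, float]) -> Callable[[str], float]:
    """Turn a weight table into a score function (unknown statements score 0)."""
    return lambda s: weights.get(s, 0.0)
```

Both `honest_answer(CANDIDATES, is_true, scorer(SAFETY_FIRST))` and the same call with `AESTHETICS_FIRST` return a true statement, but a different one. In the narrow case, where the question pins down a single true candidate, the score function never affects the result, so training there gives no signal about it.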
[EDIT: I’ve just realized that I’m being very foolish here. The above suggests that learning (1) doesn’t necessarily generalize to (2). In no way does it imply that it can’t. I think the point I want to make is that an $f^{+}$ that does generalize extremely well in this way is likely to be doing some close equivalent to predicting-the-human. (Here I’m implicitly claiming that doing (2) well in general requires full understanding of human values.)]
Overall, I’m still unsure how to describe what we want: clearly we don’t trust Alice’s answers if she’s being blackmailed, but how about if she’s afraid, mildly anxious, unusually optimistic, slightly distracted, thinking about concept a or b or c...?
It’s clear that the instrumental model just gives whatever response Alice would give here.
I don’t know what the intended model should do; I don’t know what “honest answer” we’re looking for.
Suppose the situation has property x, and Alice has reacted with unusual-for-Alice property y. Do we want the Alice-with-y answer, or the standard-Alice answer? It seems to depend on whether we decide y is acceptable (or even required) w.r.t. answer reliability, given x. Then I think we get the same problem again on that question, and so on.