Minimizing a loss function like (q−u)^2 is how we usually implement supervised learning. (It’s pretty obvious this function is minimized at q=u…)
In plain language, your proposal seems to be: if a learner’s output influences the system they are “predicting,” and you want to interpret their output as a prediction in a straightforward way, then you could hide the learner’s output whenever you gather training data.
Note that this doesn’t let you access the beliefs of any particular learner, just one that is trained to optimize this supervised learning objective. I think the more interesting question is whether we can train a learner to accomplish some other task, and to reveal useful information about its internal state. (For example, to build an agent that simultaneously picks a to maximize u(a), and honestly reports its expectation of u(a).)
u is a utility function, so squaring it doesn’t work the same way as if it were a value (you get the expectation of u^2, not the square of the expectation of u). That’s why all the expressions are linear in utility (apart from the indicator functions/utilities, where it’s clear what multiplying by them does). If I could sensibly take non-linear functions of utilities, I wouldn’t need the laborious construction in the next post to find the y’s that maximise or minimise E(u|y).
Corrigibility could work for what you want, by starting with u and substituting in u#.
Another alternative is to have the AI be a vE(u+u#) maximiser, where u# is defined over one particular future message M (for which E is also defined). Then the AI acts (roughly) as a u-maximiser, but will output the useful M. I said roughly because the u# term would cause it to want to learn more about the expectation of u than it otherwise would, but hopefully this wouldn’t be a huge divergence. (EDIT: that leads to problems after M/E, but we can reset the utility at that point.)
A loss function plays the same role as a utility function—i.e., we train the learner to minimize its expected loss.
I don’t really understand your remark about linearity. Concretely, why is −(q−u)^2 not an appropriate utility function?
Actually, −(q−u)^2 does work, but “by coincidence” and has other negative properties.
Let me explain. First of all, note that things like −(q−u)^4 do not work.
To show this: let u=+2 with probability 1/3, and −1 with probability 2/3 (I’m dropping the 0≤u≤1 constraint for this example, for simplicity). Then E(u)=0 (so the correct q is 0) while E(u^3)=2≠0. In the expansion of −(q−u)^4, you will get a 4qu^3 term, which is not 0 in expectation. Hence the q^1 term in E(−(q−u)^4) is non-zero, which means that q=0 cannot be a maximum of this function.
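This can be checked numerically. The sketch below (illustrative only; the grid search is my own addition, not part of the original argument) uses the same distribution and confirms that the maximiser of E(−(q−u)^4) is not at q=E(u)=0:

```python
# Illustrative sketch: u = +2 with probability 1/3 and u = -1 with
# probability 2/3, so E(u) = 0 but E(u^3) = 2.
outcomes = [(2.0, 1 / 3), (-1.0, 2 / 3)]

def expected_neg_quartic(q):
    """E[-(q - u)^4] under the distribution above."""
    return sum(p * -((q - u) ** 4) for u, p in outcomes)

# Grid-search the maximiser over q in [-1, 1].
grid = [i / 1000 for i in range(-1000, 1001)]
best_q = max(grid, key=expected_neg_quartic)

print(best_q)  # roughly 0.33 -- not at q = E(u) = 0
print(expected_neg_quartic(0.5) > expected_neg_quartic(0.0))  # True
```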
Why, then, does −(q−u)^2 work? Because it’s −q^2+2qu (which is linear in u), minus u^2 (non-linear in u, but the AI can’t affect its value, so it’s irrelevant in a boxed setup).
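As a quick sanity check of this decomposition (again an illustrative sketch of my own, using the example distribution above: u=+2 w.p. 1/3, −1 w.p. 2/3), a grid search confirms that E(−(q−u)^2) is maximised exactly at q=E(u)=0:

```python
# Illustrative sketch: same example distribution, so E(u) = 0 and E(u^2) = 2.
outcomes = [(2.0, 1 / 3), (-1.0, 2 / 3)]

def expected_neg_square(q):
    """E[-(q - u)^2] = -q^2 + 2q*E(u) - E(u^2): linear in u apart from -u^2."""
    return sum(p * -((q - u) ** 2) for u, p in outcomes)

grid = [i / 1000 for i in range(-1000, 1001)]
best_q = max(grid, key=expected_neg_square)
print(best_q)  # 0.0, i.e. exactly q = E(u)
```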
What other “negative properties” might −(q−u)^2 have? Suppose we allow the AI to affect the value of u, somehow, by something that is independent of the value of its output q. Then an AI maximising −q^2+2qu will always set q=E(u), for a total expectation of E(u)^2. Therefore it will also seek to maximise E(u)^2, which maximises E(u) if u≥0. So the agent will output the correct q and maximise E(u) simultaneously.
But if it instead tries to maximise −(q−u)^2, then it will still pick q=E(u), and gets expected utility of E(u)^2−E(u^2)=−Var(u). Therefore it will pick actions that minimise the variance of u, irrespective of its expectation.
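To illustrate the contrast, here is a small sketch with two hypothetical actions (the actions A and B are my own invented example, not from the original setup): a −q^2+2qu maximiser picks the action with the higher E(u), while a −(q−u)^2 maximiser picks the low-variance action even though its expectation is lower:

```python
# Hypothetical actions (illustrative example only):
# Action A: u = 0.5 with certainty      -> E(u) = 0.5, Var(u) = 0
# Action B: u = 1 w.p. 0.9, 0 w.p. 0.1  -> E(u) = 0.9, Var(u) = 0.09
actions = {
    "A": [(0.5, 1.0)],
    "B": [(1.0, 0.9), (0.0, 0.1)],
}

def E(dist, f=lambda u: u):
    """Expectation of f(u) under a list of (outcome, probability) pairs."""
    return sum(p * f(u) for u, p in dist)

def value_linear(dist):
    # max over q of E[-q^2 + 2qu] = E(u)^2, achieved at q = E(u)
    return E(dist) ** 2

def value_full_square(dist):
    # max over q of E[-(q - u)^2] = E(u)^2 - E(u^2) = -Var(u), also at q = E(u)
    return E(dist) ** 2 - E(dist, lambda u: u ** 2)

best_linear = max(actions, key=lambda a: value_linear(actions[a]))
best_full = max(actions, key=lambda a: value_full_square(actions[a]))
print(best_linear)  # "B": the -q^2 + 2qu agent also maximises E(u)
print(best_full)    # "A": the -(q - u)^2 agent minimises Var(u) instead
```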
Even without being able to affect u, this messes up the rest of my setup. In particular, my “pick y and q so that you maximise v_y(2qu−q^2)” becomes maximising v_y(−(q−u)^2), and the AI will now select the y that minimises P(Y=y)·Var(u|Y=y), instead of maximising E(u∧(Y=y)). If ever Var(u|Y=y)=0 or P(Y=y)=0, it will choose those y’s.