RLHF could then reinforce the high rating mechanism without correspondingly reinforcing the truth mechanism, breaking the correlation.
I unconfidently think that in this case, RLHF will reinforce both mechanisms, but reinforce the high rating mechanism slightly more, which nets out to no clear difference from conditioning. But I wouldn’t be shocked to learn I was wrong.
I unconfidently think that in this case, RLHF will reinforce both mechanisms, but reinforce the high rating mechanism slightly more, which nets out to no clear difference from conditioning. But I wouldn’t be shocked to learn I was wrong.