I would estimate that the difference between “hire some mechanical turkers and have them think for like a few seconds” and the actual data collection process accounts for around 1⁄3 of the effort that went into WebGPT, rising to around 2⁄3 if you include model assistance in the form of citations. So I think that what you wrote gives a misleading impression of the aims and priorities of RLHF work in practice.
I think it’s best to err on the side of not saying things that are literally false when the distinction matters to other people, even when it doesn’t matter to you. That said, I can see why reading the papers alone might not have made the distinction’s importance to others apparent, and “a few minutes” is definitely less inaccurate.
I just meant that the usual RLHF setup is essentially RL in which the reward is provided by a learned model, but I agree that I was stretching the way the terminology is normally used.
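To make concrete what I mean by “the reward is provided by a learned model”, here is a minimal sketch in PyTorch. Everything in it is illustrative: the names, shapes, and the toy reward model are all made up, and this is not WebGPT’s actual training code. The point is just that the RL algorithm consumes the reward model’s scalar output in place of a hand-coded reward function.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy stand-in for a reward model trained on human preference data.

    Maps a (prompt, response) feature vector to a scalar score.
    The architecture here is made up for illustration.
    """
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.score(features).squeeze(-1)

reward_model = RewardModel()

def rl_reward(features: torch.Tensor) -> torch.Tensor:
    """The 'environment reward' seen by the RL algorithm (e.g. PPO)
    is just the learned model's scalar output, not a hand-coded function."""
    with torch.no_grad():
        return reward_model(features)

# Example: score a batch of 4 fake (prompt, response) feature vectors.
rewards = rl_reward(torch.randn(4, 16))
print(rewards.shape)  # torch.Size([4])
```

In that sense it is “just RL”, with the learned scores substituted for the reward signal; the stretch is that people usually reserve bare “RL” for settings where the reward is given rather than learned.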