I think I can just tell a lot about human values! How do you think children infer them? To argue that human values can’t be pointed to extensionally (i.e., by looking at a bunch of examples), you have to make the case that they’re much more built into the human brain than seems plausible for a species that can produce both Jains and (Genghis Khan-era) Mongols.
I’d also note that “incentivize” is probably giving a lot of the game away here: my guess is you can pull them out much more directly by gathering a large dataset of human preferences and training a model to predict the judgements.
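To make "gather preferences and predict judgements" a bit more concrete, here's a minimal sketch of one way it could look: fit a Bradley-Terry-style model (logistic regression on feature differences) to pairwise comparisons. Everything here is an assumption for illustration: the data is synthetic, `HIDDEN_W` stands in for whatever actually drives the human judgements, and the function names are made up rather than taken from any library.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_pairs(n, hidden_w, rng):
    """Synthetic stand-in for human comparisons: (option_a, option_b, label).

    label is 1.0 when option_a is preferred, 0.0 otherwise. A real dataset
    would come from people, not from a hidden linear scorer like this one.
    """
    pairs = []
    for _ in range(n):
        a = [rng.uniform(-1, 1) for _ in hidden_w]
        b = [rng.uniform(-1, 1) for _ in hidden_w]
        score_a = sum(w * x for w, x in zip(hidden_w, a))
        score_b = sum(w * x for w, x in zip(hidden_w, b))
        pairs.append((a, b, 1.0 if score_a > score_b else 0.0))
    return pairs

def fit_preference_model(pairs, dim, lr=0.5, epochs=200):
    """Gradient ascent on the log-likelihood of P(a preferred) = sigmoid(w . (a - b))."""
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for a, b, label in pairs:
            diff = [x - y for x, y in zip(a, b)]
            p = sigmoid(sum(wi * di for wi, di in zip(w, diff)))
            for i in range(dim):
                grad[i] += (label - p) * diff[i]
        w = [wi + lr * g / len(pairs) for wi, g in zip(w, grad)]
    return w

def accuracy(w, pairs):
    """Fraction of comparisons where the learned scorer agrees with the label."""
    correct = 0
    for a, b, label in pairs:
        diff = [x - y for x, y in zip(a, b)]
        predicted = 1.0 if sum(wi * di for wi, di in zip(w, diff)) > 0 else 0.0
        correct += predicted == label
    return correct / len(pairs)

rng = random.Random(0)
HIDDEN_W = [1.0, -2.0]  # hypothetical "true" preference direction
train_pairs = make_pairs(200, HIDDEN_W, rng)
held_out_pairs = make_pairs(100, HIDDEN_W, rng)

learned_w = fit_preference_model(train_pairs, dim=2)
print("held-out agreement:", accuracy(learned_w, held_out_pairs))
```

The point of the toy is only that nothing in the loop "incentivizes" anything: the model is fit directly to examples of judgements and then checked on comparisons it hasn't seen.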
Order matters more at smaller scales: if you’re training a small model on a lot of data and you sample in a sufficiently nonrandom manner, you should expect catastrophic forgetting to kick in eventually, especially if you use weight decay, which keeps shrinking whatever weights the current stretch of data isn’t reinforcing.
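Here's a toy sketch of the weight decay part of that claim, under heavy simplifying assumptions: two made-up "tasks" that each touch a single weight, plain SGD with an L2 penalty, and a comparison between task-sorted and shuffled sampling. It doesn't model limited capacity or interference between tasks at all, only the fact that a weight which stops receiving gradient keeps decaying. The names `TARGETS`, `run_sgd`, and `loss` are hypothetical.

```python
import random

# Each task uses exactly one weight: task -> (weight index, target value).
TARGETS = {"A": (0, 2.0), "B": (1, -3.0)}

def run_sgd(order, lr=0.1, weight_decay=0.1):
    """One SGD step per sample in `order`, with L2 weight decay on all weights."""
    w = [0.0, 0.0]
    for task in order:
        idx, target = TARGETS[task]
        grad = [0.0, 0.0]
        # Gradient of 0.5 * (w[idx] - target)**2 with respect to w[idx].
        grad[idx] = w[idx] - target
        # The decay term applies to every weight, including the one
        # the current sample says nothing about.
        w = [wi - lr * (g + weight_decay * wi) for wi, g in zip(w, grad)]
    return w

def loss(w, task):
    idx, target = TARGETS[task]
    return 0.5 * (w[idx] - target) ** 2

# Nonrandom sampling: all of task A first, then all of task B.
sorted_order = ["A"] * 200 + ["B"] * 200

# Same samples, randomly interleaved.
shuffled_order = sorted_order[:]
random.Random(0).shuffle(shuffled_order)

w_sorted = run_sgd(sorted_order)
w_shuffled = run_sgd(shuffled_order)

print("task A loss, sorted order:  ", loss(w_sorted, "A"))
print("task A loss, shuffled order:", loss(w_shuffled, "A"))
```

In the sorted run the task A weight gets no gradient for the whole second half and just decays toward zero, so the task A loss ends up much higher than in the shuffled run, where it keeps getting topped up.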