For example, if you assume “Anything humans say about their preferences is true,” that’s basically giving up on the Bayesian approach as usually imagined.
More formally, what I mean by that is “assume humans are perfectly rational, and fit a reward/utility function given that assumption”. This is a perfectly Bayesian approach, and it will always produce an (over-complicated) utility function that fits the observed behaviour.
In the usual Bayesian setting, “humans are perfectly reliable” corresponds to believing that human utterances correctly track (fixed) human preferences, i.e. believing that it is impossible to influence those utterances.
Yes and no. Under the assumption that humans are perfectly reliable, influencing human preferences and utterances is impossible. But acting on that assumption leads to behaviour that, under other assumptions, would count as influencing human utterances.
E.g. if you threaten a human with a gun and ask them to report that they are maximally happy, a sensible model of human preferences will say they are lying. But the “humans are rational” model will simply conclude that humans really like being threatened in this way.
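The gun example can be sketched as a toy Bayesian update, comparing the two observation models. All hypothesis names, priors, and likelihood numbers below are illustrative assumptions, not anything from a real system: the point is only that the “humans are perfectly rational” model assigns zero likelihood to a coerced lie, so the posterior is forced onto the “really likes being threatened” hypothesis.

```python
# Toy sketch (illustrative numbers): updating over two reward hypotheses
# after a human at gunpoint reports "I am maximally happy".

# Hypotheses about the human's true preferences:
#   "likes_threats"    -- being threatened genuinely maximizes their reward
#   "dislikes_threats" -- they dislike it, but may lie under coercion
prior = {"likes_threats": 0.01, "dislikes_threats": 0.99}

# P(report = "maximally happy" | hypothesis), under each observation model.
likelihood = {
    # "Humans are perfectly rational/reliable": utterances always track
    # true preferences, so a human who dislikes threats cannot report joy.
    "rational": {"likes_threats": 1.0, "dislikes_threats": 0.0},
    # A sensible model: coercion makes lying about happiness very likely.
    "sensible": {"likes_threats": 1.0, "dislikes_threats": 0.95},
}

def posterior(model):
    """Bayes' rule: normalize prior * likelihood over the two hypotheses."""
    unnorm = {h: prior[h] * likelihood[model][h] for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

# The rational-human model collapses entirely onto "likes_threats";
# the sensible model keeps "dislikes_threats" dominant.
print(posterior("rational"))
print(posterior("sensible"))
```

Under the rational-human model the posterior on “likes_threats” goes to 1 regardless of how small its prior was, which is the formal version of “the model simply concludes that humans really like being threatened”.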