Performance deteriorating implies that the prior p is not yet a fixed point of p*=D(A(p*)).

At least in the case of AlphaZero, isn’t the performance deterioration from A(p*) to p*? That is, A(p*) is full AlphaZero, while p* is the “Raw Network” in the figure. We could have converged to the fixed point of the training process (i.e. p*=D(A(p*))) and still see worse performance from the unamplified model than from the amplified one. I don’t see a fundamental reason why p* = A(p*) should hold after convergence (and I would have been surprised if it held for e.g. chess or Go with reasonably sized models for p*).
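A deliberately crude toy version of this point, with everything in it made up for illustration: model strength is a single number, `amplify` (standing in for A) adds a fixed boost, and `distill` (standing in for D) only recovers half of its target. Neither function is meant to resemble real search or real training; the sketch only shows that a fixed point of D(A(·)) need not be a fixed point of A.

```python
def amplify(p):
    # Hypothetical A: search adds a fixed amount of strength on top of the prior.
    return p + 1.0

def distill(q):
    # Hypothetical D: a limited-capacity model keeps only half of the target's strength.
    return 0.5 * q

p = 0.0
for _ in range(100):
    p = distill(amplify(p))  # one round of the training loop p <- D(A(p))

fixed_point_gap = abs(p - distill(amplify(p)))  # how far p is from satisfying p = D(A(p))
amplification_gap = amplify(p) - p              # how much amplification still helps at that p
print(fixed_point_gap, amplification_gap)
```

In this toy, the loop settles down while the amplified model stays strictly stronger than the raw one, which is the situation described above.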

I enjoyed reading this! And I hadn’t seen the interpretation of a logistic preference model as approximating Gaussian errors before.

Since you seem interested in exploring this more, some comments that might be helpful (or not):

I’m confused why you’re using a neural network; given the small size of the input space, wouldn’t it be easier to just learn a tabular utility function (i.e. one value for each input, namely its utility)? It’s the largest function space you can have but will presumably also be much easier to train than a NN.
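A minimal sketch of what the tabular version could look like, assuming the usual logistic (Bradley-Terry) preference model and plain gradient descent. The function name, the data, and the hyperparameters are all hypothetical and not taken from the post.

```python
import math

def fit_tabular_utilities(n_items, preferences, lr=0.5, steps=2000):
    """Fit one utility per item from (preferred, rejected) index pairs.

    Assumed model: P(w preferred over l) = sigmoid(u[w] - u[l]).
    """
    u = [0.0] * n_items
    for _ in range(steps):
        grad = [0.0] * n_items
        for w, l in preferences:
            # Gradient of -log sigmoid(u[w] - u[l]) with respect to u[w] and u[l].
            p_wrong = 1.0 / (1.0 + math.exp(u[w] - u[l]))
            grad[w] -= p_wrong
            grad[l] += p_wrong
        for i in range(n_items):
            u[i] -= lr * grad[i] / len(preferences)
    return u

# Hypothetical data: 4 items, every pair compared once, higher index always preferred.
preferences = [(w, l) for w in range(4) for l in range(w)]
utilities = fit_tabular_utilities(4, preferences)
print([round(x, 2) for x in utilities])
```

The "model" here is just a list with one entry per input, so there is no architecture to tune. Utilities are only determined up to an additive constant, and with perfectly consistent data they keep growing apart the longer you train.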

Questions like the ones you raise could become more interesting in settings with much more complicated inputs. But I think in practice, the expensive part of preference/reward learning is gathering the preferences, and the most likely failure modes revolve around things related to training an RL policy in parallel to the reward model. The architecture etc. seem a bit less crucial in comparison.

I thought about this and very similar questions a bit for my Master’s thesis before changing topics, happy to chat about that if you want to go down this route. (Though I didn’t think about inconsistent preferences, just about the effect of noise. Without either, the answer should just be N log N I guess.)
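One way to read the noiseless, consistent case, assuming the question is how many pairwise comparisons are needed to recover the ordering of N items: it reduces to comparison sorting. The sketch below counts preference queries made by Python's built-in sort; `count_queries` and the random hidden utilities are hypothetical.

```python
import math
import random
from functools import cmp_to_key

def count_queries(true_utility):
    """Rank items using only pairwise preference queries and count the queries."""
    queries = 0

    def prefer(a, b):
        nonlocal queries
        queries += 1  # each call is one "which do you prefer?" question
        return (true_utility[a] > true_utility[b]) - (true_utility[a] < true_utility[b])

    items = list(range(len(true_utility)))
    ranking = sorted(items, key=cmp_to_key(prefer))  # lowest utility first
    return ranking, queries

random.seed(0)
n = 1024
true_utility = random.sample(range(10 * n), n)  # distinct hidden utilities
ranking, queries = count_queries(true_utility)
print(queries, round(n * math.log2(n)))
```

This only recovers the ordering, not cardinal utilities, and it says nothing about the noisy or inconsistent cases.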

You might want to think more about how to measure this, or even what exactly it would mean if “no consistent utility function can be inferred”. In principle, for any (not necessarily transitive) set of preferences, we can ask what utility function best approximates these preferences (e.g. in the sense of minimizing loss). The approximation can be exact iff the preferences are consistent. Intuitively, slightly inconsistent preferences lead to a reasonably good approximation, and very inconsistent preferences probably admit only very bad approximations. But there doesn’t seem to be any point where we can’t infer the best possible approximation at all.
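A small sketch of the "best approximation always exists" point, assuming logistic loss as the notion of approximation quality. It fits three utilities to a consistent preference set and to a cyclic one; the helper names, starting point, and hyperparameters are hypothetical.

```python
import math

def mean_loss(u, prefs):
    """Mean logistic loss of utilities u on (preferred, rejected) index pairs."""
    return sum(math.log1p(math.exp(u[l] - u[w])) for w, l in prefs) / len(prefs)

def fit(u, prefs, lr=0.5, steps=5000):
    """Plain gradient descent on mean_loss, starting from the given utilities."""
    u = list(u)
    for _ in range(steps):
        grad = [0.0] * len(u)
        for w, l in prefs:
            p_wrong = 1.0 / (1.0 + math.exp(u[w] - u[l]))
            grad[w] -= p_wrong
            grad[l] += p_wrong
        u = [ui - lr * g / len(prefs) for ui, g in zip(u, grad)]
    return u

start = [1.0, -0.5, 0.3]               # arbitrary starting utilities for A, B, C
consistent = [(0, 1), (1, 2), (0, 2)]  # A > B, B > C, A > C
cyclic = [(0, 1), (1, 2), (2, 0)]      # A > B, B > C, C > A

consistent_loss = mean_loss(fit(start, consistent), consistent)
cyclic_loss = mean_loss(fit(start, cyclic), cyclic)
print(consistent_loss, cyclic_loss, math.log(2))
```

For the consistent set the loss can be pushed toward zero. For the cycle the optimizer still converges, but to equal utilities for all three items, with a loss of log 2 per comparison. So the fit never fails outright; its quality just degrades as the preferences get less consistent.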

Related to this (but a bit more vague/speculative): it’s not obvious to me that approximating inconsistent preferences using a utility function is the “right” thing to do. At least in cases where human preferences are highly inconsistent, this seems kind of scary. Not sure what we want instead (maybe the AI should point out inconsistencies and ask us to please resolve them?).