[Question] Can we learn much by studying the behaviour of RL policies?

Economists sometimes study revealed preferences: preferences that can be inferred from choices. For example, if I am offered an apple or an orange and I choose the apple, I have revealed a preference for apples over oranges. I’m wondering about the revealed preferences of RL policies, i.e. applying behavioural and experimental economics to RL policies. We can elicit revealed preferences from an RL policy by observing its actions following various histories, and then check whether those preferences satisfy various decision-theoretic axioms.

Revealed preferences don’t tell us anything about the inner workings of an agent, but they can tell us whether the agent is acting as if it were following a particular decision theory. We can ask questions such as:

  • In deterministic environments, under what conditions do RL policies exhibit behaviour that can be represented by a utility function?

  • In indeterministic environments, under what conditions do RL policies exhibit behaviour that is consistent with the axioms of expected utility theory (EUT)?

  • What do RL agents’ revealed preferences or utility functions (if applicable) look like under distributional shift and/or computational constraints?

  • How does RLHF affect revealed preferences?
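
As a rough illustration of the first question: for a finite set of outcomes and strict pairwise choices, the revealed preferences can be represented by a utility function exactly when they contain no cycles. The sketch below is a minimal Python version of that check. The function name `utility_from_choices` and the `(chosen, rejected)` input format are my own assumptions, and it treats every observed choice as a strict preference, so it ignores indifference and stochastic policies.

```python
from graphlib import TopologicalSorter, CycleError  # standard library, Python 3.9+


def utility_from_choices(choices):
    """Try to rationalise observed pairwise choices with a utility function.

    `choices` is an iterable of (chosen, rejected) pairs, e.g. recorded by
    offering a policy two outcomes and noting which one it steers towards.
    Returns a dict mapping each outcome to a number such that every chosen
    outcome scores strictly higher than the outcome it was chosen over, or
    None if the choices contain a cycle (so no such function exists).
    """
    # worse_than[x] holds the outcomes that x was chosen over.
    worse_than = {}
    for chosen, rejected in choices:
        worse_than.setdefault(chosen, set()).add(rejected)
        worse_than.setdefault(rejected, set())

    try:
        # static_order() lists each outcome after everything it beat,
        # so position in the order can serve as an ordinal utility.
        order = list(TopologicalSorter(worse_than).static_order())
    except CycleError:
        return None
    return {outcome: rank for rank, outcome in enumerate(order)}


if __name__ == "__main__":
    print(utility_from_choices([("apple", "orange"), ("orange", "pear")]))
    print(utility_from_choices([("a", "b"), ("b", "c"), ("c", "a")]))
```

The utility values are only ordinal: any order-preserving relabelling would represent the same choices equally well.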

These seem like pretty natural questions to ask, so I’m wondering what existing work there is on them and how promising this kind of work could be. Knowing whether a policy is consistent with the axioms of EUT seems helpful, but maybe not that helpful, for two reasons. First, consistency isn’t sufficient for the system to actually be maximising expected utility internally with respect to some utility function. Second, inconsistency with EUT isn’t sufficient for the system to be safe.
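
To make "consistent with the axioms of EUT" a bit more concrete, here is a hedged sketch of one possible behavioural test, for the independence axiom only. If a policy picks lottery A over lottery B, independence says it should also pick the mixture pA + (1-p)C over pB + (1-p)C for any third lottery C and any p in (0, 1]. Everything here is an assumption for illustration: lotteries are dicts from outcome to probability, `choose` is a hypothetical wrapper around a policy that returns 0 or 1 for whichever lottery it picks, and `eu_chooser` is a toy stand-in for a real policy.

```python
def mix(p, lottery, other):
    """Compound lottery: `lottery` with probability p, `other` with 1 - p."""
    outcomes = set(lottery) | set(other)
    return {o: p * lottery.get(o, 0.0) + (1 - p) * other.get(o, 0.0)
            for o in outcomes}


def independence_violations(choose, a, b, c, ps):
    """Return the mixing weights in `ps` at which the choice between a and b
    flips once both are mixed with the common lottery c.

    `choose(l1, l2)` is a hypothetical interface returning 0 if the policy
    picks l1 and 1 if it picks l2. Each p should lie in (0, 1].
    """
    base = choose(a, b)
    return [p for p in ps if choose(mix(p, a, c), mix(p, b, c)) != base]


def eu_chooser(utility):
    """Toy stand-in for a policy: picks the lottery with the higher expected
    utility under the given outcome -> utility dict (ties go to the first)."""
    def choose(l1, l2):
        eu1 = sum(prob * utility[o] for o, prob in l1.items())
        eu2 = sum(prob * utility[o] for o, prob in l2.items())
        return 0 if eu1 >= eu2 else 1
    return choose


if __name__ == "__main__":
    a = {1: 1.0}            # outcome 1 for sure
    b = {5: 0.8, 0: 0.2}    # 80% chance of outcome 5, else 0
    c = {0: 1.0}            # outcome 0 for sure
    policy = eu_chooser({0: 0.0, 1: 1.0, 5: 2.0})
    print(independence_violations(policy, a, b, c, [0.25, 0.5]))
```

An empty list from a finite set of probes is only weak evidence of consistency, and passing this test says nothing about the other axioms (completeness, transitivity, continuity).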

I imagine this kind of work would be most interesting under distributional shift and/or significant computational constraints relative to the complexity of the environment, where we might be able to learn about RL failure modes. Because it focuses only on observed behaviour, though, studying revealed preferences seems to me both much less useful and much easier than understanding ML systems through mechanistic interpretability.

I’m interested in what other people think about (i) the object-level questions above; (ii) existing work on these questions; and (iii) how useful studying these (or similar) questions would be. I’m coming at this from an economics/maths/philosophy background and am less familiar with CS, so I might be missing important search terms and basic knowledge.