Intuitive examples of reward function learning?

Can you help find the most intuitive example of reward function learning?

In reward function learning, there is a set $\mathcal{R}$ of possible non-negative reward functions, and a learning process $P$ which takes in a history $h_t$ of actions and observations and returns a probability distribution over $\mathcal{R}$.

If $\pi$ is a policy, $\mathcal{H}_t$ is the set of histories of length $t$, and $P_\pi(h_t)$ is the probability of $h_t$ given that the agent follows policy $\pi$, the expected value of $\pi$ at horizon $t$ is:

$$\mathbb{E}_\pi[V_t] = \sum_{h_t \in \mathcal{H}_t} P_\pi(h_t) \sum_{R \in \mathcal{R}} P(R \mid h_t)\, R(h_t),$$
where $R(h_t)$ is the total $R$-reward over the history $h_t$. Problems can occur if $P$ is riggable (this used to be called “biasable”, but that term was overloaded) or influenceable.
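To make the formula concrete, here is a minimal sketch in Python on a toy two-reward problem; the candidate rewards, the `posterior` learning process and the toy policies are all illustrative placeholders I've made up, not anything from the formal setup above. It also hints at why riggability matters: a policy that nudges the human's declaration towards the easy-to-get food can score higher than one that simply asks.

```python
# Minimal sketch of the expected-value computation above, on a toy problem.
# Everything here (histories, rewards, the `posterior` learning process) is an
# illustrative placeholder, not part of any standard library or formalism.

# Two candidate reward functions: each maps a full history to a total reward.
def reward_apples(history):
    return history.count("apple")

def reward_cake(history):
    return history.count("cake")

REWARDS = {"apples": reward_apples, "cake": reward_cake}

def posterior(history):
    """Toy learning process P: a distribution over REWARDS given the history.

    Here the human's declaration settles which reward is correct."""
    if "human says apples" in history:
        return {"apples": 1.0, "cake": 0.0}
    if "human says cake" in history:
        return {"apples": 0.0, "cake": 1.0}
    return {"apples": 0.5, "cake": 0.5}

def expected_value(history_distribution):
    """E_pi[V_t] = sum_h P_pi(h) * sum_R P(R | h) * R(h).

    `history_distribution` maps each length-t history (a tuple of events)
    to its probability under the policy pi."""
    total = 0.0
    for history, p_history in history_distribution.items():
        post = posterior(history)
        total += p_history * sum(
            post[name] * REWARDS[name](history) for name in REWARDS
        )
    return total

# A policy that asks first: the human's answer determines which food to fetch.
ask_first = {
    ("human says apples", "apple", "apple"): 0.5,
    ("human says cake", "cake"): 0.5,
}
# A policy that rigs the learning process: it nudges the human towards the
# food that is easier to get in bulk.
nudge = {
    ("agent nudges", "human says apples", "apple", "apple"): 1.0,
}
print(expected_value(ask_first), expected_value(nudge))  # 1.5 2.0
```

In this toy setup the nudging policy gets expected value 2 against 1.5 for asking honestly, purely by influencing the learning process rather than by better satisfying what the human actually wants.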

There’s an interesting subset of value learning problems, which could be termed “constrained optimisation with variable constraints” or “variable constraints optimisation”. In that case, there is an overall reward $R_0$, and every $R_i \in \mathcal{R}$ is the reward $R_0$ subject to constraints $C_i$. This can be modelled as having $R_i$ be $R_0$ (if the constraints $C_i$ are met) and $0$ (if they are not).

Then if we define $P(C_i \mid h_t) = P(R_i \mid h_t)$, and let $P$ be a distribution over $\mathcal{C}$, the set of constraints, the equation changes to:

$$\mathbb{E}_\pi[V_t] = \sum_{h_t \in \mathcal{H}_t} P_\pi(h_t) \sum_{C_i \in \mathcal{C}} P(C_i \mid h_t)\, \mathbb{I}_{C_i}(h_t)\, R_0(h_t),$$

where $\mathbb{I}_{C_i}(h_t)$ is $1$ if the constraints $C_i$ are met on $h_t$ and $0$ otherwise.
If $P$ is riggable or influenceable, similar sorts of problems occur.
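Continuing the same toy sketch (all names again hypothetical, loosely modelled on the literature-review example in the list below), the variable-constraints version just swaps the candidate rewards for a single base reward $R_0$ gated by constraint indicators; the same pattern appears when the agent can influence which constraint ends up applying.

```python
# Minimal sketch of the variable-constraints version; names are again
# illustrative placeholders. Each R_i is the base reward R_0 gated by an
# indicator for whether the corresponding constraint C_i holds.

def base_reward(history):
    """R_0: here, the number of papers accepted over the history."""
    return history.count("paper accepted")

# Each constraint C_i maps a history to True/False: was the constraint met?
CONSTRAINTS = {
    "thorough review": lambda h: "thorough literature review" in h,
    "cursory review": lambda h: "cursory literature review" in h,
}

def constrained_reward(constraint, history):
    """R_i(h) = R_0(h) if C_i is met on h, and 0 otherwise."""
    return base_reward(history) if CONSTRAINTS[constraint](history) else 0.0

def constraint_posterior(history):
    """Toy P(C_i | h): which constraint ends up binding depends on the
    review process the agent itself chose earlier in the history."""
    if "agent picks cursory process" in history:
        return {"thorough review": 0.0, "cursory review": 1.0}
    return {"thorough review": 1.0, "cursory review": 0.0}

def expected_value_constrained(history_distribution):
    """E_pi[V_t] = sum_h P_pi(h) * sum_{C_i} P(C_i | h) * I_{C_i}(h) * R_0(h)."""
    total = 0.0
    for history, p_history in history_distribution.items():
        post = constraint_posterior(history)
        total += p_history * sum(
            post[c] * constrained_reward(c, history) for c in CONSTRAINTS
        )
    return total

# An influenceable P lets the agent gain by steering towards the constraint
# that is cheapest to satisfy.
honest = {("thorough literature review", "paper accepted"): 1.0}
rigged = {("agent picks cursory process", "cursory literature review",
           "paper accepted", "paper accepted"): 1.0}
print(expected_value_constrained(honest), expected_value_constrained(rigged))  # 1.0 2.0
```

Here the rigged policy scores 2 against 1 for the honest one, because it gets to pick the process that generates its own constraint.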

Intuitive examples

Here I’ll present some examples of reward function learning or variable constraints optimisation, and I’m asking for readers’ opinions as to which one seems the most intuitive to them, and the easiest to explain to outsiders. You’re also welcome to suggest new examples if you think they work better.

  • Classical value learning: human declarations determine the correctness of a given reward $R_i$. The reward encodes what food the human prefers, and some foods are much easier to get than others.

  • As above, but the reward encodes whether a domestic robot should clean the house or cook a meal.

  • As above, but the reward encodes the totality of human values in all environments.

  • Variable constraint optimisation: the agent is writing an unoriginal academic paper (or a patent), and must maximise the chance it gets accepted. The paper must include a literature review (constraints), but the agent gets to choose the automated process that produces the literature review.

  • Variable constraint optimisation: p-hacking. The agent chooses which hypothesis to formulate. It already knows something about the data, and its reward is the number of citations the paper gets.

  • Variable constraint optimisation: board of directors. The CEO must maximise share price, but its constraint is that the policy it formulates must be approved by the board of directors.

  • Variable constraint optimisation: retail. A virtual assistant guides the purchases of a customer. It must maximise revenue to the seller, subject to the constraint that the product bought must be given a four- or five-star review by the customer.