I have difficulty understanding the inferences about the parameters $ \alpha,\beta,\gamma,\delta $ in the “Example: regret” part.

Take the fully rational planner `p` for example. Since the human will say `h` following `s`, the difference between the rewards for $ h $ and $ \sim h $ is non-negative, which implies that: $ (\beta R(h)+\gamma R(h|s)) - (\beta R(\sim h)+\gamma R(\sim h|s)) \geq 0 $

From this it is concluded that $ \beta R(h-\sim h)+\gamma R(h-\sim h|s)\geq 0 $, where I read $ R(h-\sim h) $ as shorthand for $ R(h)-R(\sim h) $.

Similarly, since the human will say $ \sim h $ following `i`, we have $ \beta R(h-\sim h)+\delta R(h-\sim h|i)\leq 0 $

It seems that more information about the reward function is needed in order to arrive at the final model $ (p,R(\alpha,\beta,\gamma,\delta) \mid \gamma\geq-\beta\geq\delta) $.
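
To make the gap concrete, here is a sketch of the kind of extra assumption that would seem to be required. The symbols $ \Delta, \Delta_s, \Delta_i $ and the constant $ c $ are my own notation, and the normalization below is purely an assumption on my part, not something I found stated in the example. Write

$$ \Delta = R(h)-R(\sim h), \qquad \Delta_s = R(h|s)-R(\sim h|s), \qquad \Delta_i = R(h|i)-R(\sim h|i), $$

so the two conditions read $ \beta\Delta+\gamma\Delta_s\geq 0 $ and $ \beta\Delta+\delta\Delta_i\leq 0 $. If one assumes that all three differences equal the same positive constant, $ \Delta=\Delta_s=\Delta_i=c>0 $, then dividing by $ c $ gives

$$ \beta+\gamma\geq 0 \quad\text{and}\quad \beta+\delta\leq 0 \quad\Longrightarrow\quad \gamma\geq-\beta\geq\delta. $$

Without some assumption of this kind about the relative sizes and signs of the differences, I do not see how the two inequalities alone yield $ \gamma\geq-\beta\geq\delta $.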

