The authors prove that EPIC is a pseudometric: it behaves like a distance function, except that EPIC(R1, R2) can be zero even when R1 and R2 are different. This is desirable, since if R1 and R2 differ only by a potential shaping function, their optimal policies are guaranteed to be identical regardless of transition dynamics, and so we should report the “distance” between them as zero.
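To make this concrete, here is a minimal sketch of EPIC for a small finite MDP, assuming uniform distributions for DS and DA and uniform coverage; the function names (`canonicalize`, `epic`) and the MDP sizes are my own choices, not from the paper. It checks that a potential-shaped copy of a reward gets distance ~0, while an unrelated reward does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 5, 3, 0.9

def canonicalize(R, gamma):
    """Canonically shaped reward, taking DS and DA to be uniform."""
    mean_next = R.mean(axis=(1, 2))          # E_{A,S'} R(x, A, S') for each state x
    return (R
            + gamma * mean_next[None, None, :]   # + gamma * E[R(s', A, S')]
            - mean_next[:, None, None]           # -         E[R(s,  A, S')]
            - gamma * R.mean())                  # - gamma * E[R(S,  A, S')]

def epic(Ra, Rb, gamma):
    """Pearson distance between canonicalized rewards under uniform coverage."""
    ca = canonicalize(Ra, gamma).ravel()
    cb = canonicalize(Rb, gamma).ravel()
    rho = np.corrcoef(ca, cb)[0, 1]
    return np.sqrt(max(0.0, 1.0 - rho) / 2.0)

# R2 is R1 plus a potential shaping term gamma*phi(s') - phi(s): it has the same
# optimal policy under any transition dynamics, so EPIC should report ~0.
R1 = rng.normal(size=(n_s, n_a, n_s))
phi = rng.normal(size=n_s)
R2 = R1 + gamma * phi[None, None, :] - phi[:, None, None]

print(epic(R1, R2, gamma))                          # ~0, up to float error
print(epic(R1, rng.normal(size=R1.shape), gamma))   # clearly positive
```

The canonicalization step is what makes shaping terms cancel exactly, so the zero distance holds for any choice of phi, not just this random one.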
One nice thing is that, roughly speaking, rewards are judged to be equivalent if they would generalize to any possible transition function that is consistent with DT. This means that by designing DT appropriately, we can capture how much generalization we want to evaluate.
This is a useful knob to have: if we used the maximally large DT (all possible transition functions), the task would be far too difficult, as rewards would be expected to generalize far more than even humans can.
If EPIC(R1, R2) is thought of as a composition f(g(R1), g(R2)), where g maps a reward function to its optimal policy and f is a distance function on optimal policies, is f a metric?
The authors don’t prove it, but I believe yes, as long as DS and DA have support over the entire state space and action space (you may also need DT to have support over every possible transition).
I usually think of this as “EPIC is a metric if defined over the space of equivalence classes of reward functions”.
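The support condition matters because EPIC simply cannot see differences on states that DS never samples. A small sketch of that failure mode, with hypothetical sizes and a helper (`epic_uniform`) of my own naming: two rewards that disagree only on transitions touching one state are judged identical once that state gets zero mass.

```python
import numpy as np

gamma, n_s, n_a = 0.9, 5, 3
rng = np.random.default_rng(1)

def epic_uniform(Ra, Rb, gamma):
    """EPIC with uniform DS/DA over whatever states the arrays cover."""
    def canon(R):
        m = R.mean(axis=(1, 2))  # E_{A,S'} R(x, A, S') per state x
        return R + gamma * m[None, None, :] - m[:, None, None] - gamma * R.mean()
    rho = np.corrcoef(canon(Ra).ravel(), canon(Rb).ravel())[0, 1]
    return np.sqrt(max(0.0, 1.0 - rho) / 2.0)

R1 = rng.normal(size=(n_s, n_a, n_s))
R2 = R1.copy()
R2[4, :, :] += 10.0  # R2 disagrees with R1 only on transitions touching state 4
R2[:, :, 4] -= 3.0   # (and this difference is not a potential shaping term)

# With full support, EPIC sees the difference:
print(epic_uniform(R1, R2, gamma))  # clearly positive

# If DS puts zero mass on state 4, EPIC only evaluates the supported sub-cube,
# where R1 and R2 coincide, so the "distance" collapses to ~0:
sub = np.ix_(range(4), range(n_a), range(4))
print(epic_uniform(R1[sub], R2[sub], gamma))  # ~0
```

So without full support, two rewards that genuinely induce different optimal policies can sit at distance zero, which is exactly what a metric forbids.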
Can more than one DT be used, so there’s more than one measure?
Yes.
Is there a maximum?
For finite, discrete state/action spaces, the uniform distribution over (s, a, s’) tuples has maximal entropy. However, it’s not clear that that’s the worst case for EPIC.
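A quick numerical illustration of the first claim, using hypothetical small sizes (|S| = 5, |A| = 3): the uniform distribution over (s, a, s’) tuples attains entropy log(75), and any skewed alternative over the same tuples comes out strictly lower.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# A small finite MDP: |S| = 5, |A| = 3, so 5 * 3 * 5 = 75 possible (s, a, s') tuples.
n = 5 * 3 * 5
uniform = np.full(n, 1.0 / n)

# A peaked alternative distribution over the same tuples.
rng = np.random.default_rng(0)
skewed = rng.dirichlet(np.full(n, 0.1))

print(entropy(uniform))  # log(75) ~ 4.317, the maximum achievable
print(entropy(skewed))   # strictly smaller
```

Maximal entropy is a natural sense of “largest” distribution, but as the answer above notes, it is a separate question whether it is the worst case for EPIC.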