Allow me to recapitulate my point.
We want to find algorithms that satisfy provable performance guarantees. A “no free lunch” theorem shows that a specific performance guarantee is impossible to satisfy. Therefore we should look for other performance guarantees, whether by changing the setting, restricting the class of environments, or something else.
My counterexample shows that certain natural performance guarantees are unsatisfiable. Therefore it is important to look for other natural settings / desiderata. In particular, I suggest one solution that seems natural and feasible, namely allowing the Student to randomize control between itself and the Teacher. This variant of IRL thereby seems qualitatively more powerful than the original.
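For concreteness, here is a minimal toy sketch of what I mean by randomizing control, under simplifying assumptions of my own (two candidate utility functions, a Boltzmann-rational Teacher, a Student that keeps a Bayesian posterior over which candidate is true); all the names in it are illustrative, not part of any existing formalism:

```python
import math
import random

# Toy sketch (illustrative assumptions only): the Student flips a coin each
# step to decide who acts. Teacher-controlled steps reveal evidence about the
# true utility function; Student-controlled steps exploit the current belief.

ACTIONS = ["left", "right"]

# Hypothetical candidate utility functions the Student entertains.
CANDIDATES = {
    "prefers_left":  {"left": 1.0, "right": 0.0},
    "prefers_right": {"left": 0.0, "right": 1.0},
}
TRUE_UTILITY = "prefers_right"  # known only to the Teacher

def teacher_policy(beta=3.0):
    """Teacher acts Boltzmann-rationally with respect to the true utility."""
    u = CANDIDATES[TRUE_UTILITY]
    weights = {a: math.exp(beta * u[a]) for a in ACTIONS}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def sample(dist):
    """Sample an action from a {action: probability} dict."""
    r, acc = random.random(), 0.0
    for a, p in dist.items():
        acc += p
        if r <= acc:
            return a
    return a

def run_episode(steps=50, delegation_prob=0.5, beta=3.0):
    # Uniform prior over which candidate utility is the true one.
    posterior = {name: 1.0 / len(CANDIDATES) for name in CANDIDATES}
    for _ in range(steps):
        if random.random() < delegation_prob:
            # Teacher is in control: the observed action is evidence about
            # the true utility, so do a Bayesian update.
            action = sample(teacher_policy(beta))
            for name, u in CANDIDATES.items():
                w = {a: math.exp(beta * u[a]) for a in ACTIONS}
                posterior[name] *= w[action] / sum(w.values())
            z = sum(posterior.values())
            posterior = {k: v / z for k, v in posterior.items()}
        else:
            # Student is in control: act greedily on expected utility under
            # the current posterior.
            expected = {
                a: sum(posterior[n] * CANDIDATES[n][a] for n in CANDIDATES)
                for a in ACTIONS
            }
            action = max(expected, key=expected.get)
    return posterior

if __name__ == "__main__":
    print(run_episode())  # posterior should concentrate on "prefers_right"
```

The point of the sketch is only that Teacher-controlled steps supply evidence about the utility function which the Student could not have generated by acting alone, which is why the variant seems qualitatively more powerful.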
Also, obviously the first counterexample constructed is always the most “adversarial” one, since it’s the easiest to prove. This doesn’t mean that there is an algorithm that works well in most other cases. Given that we are in the AI safety business, the burden of proof is on the claim that the “bad” environment is exceptional, not vice versa. Moreover, my intuition is that this counterexample is not an exceptional case, just a maximally bad one, i.e. in a typical scenario IRL will only be able to extract some portion (not 0 but also not 1) of the important information about the utility function. If you can prove me wrong, I will be glad to see it!