We should be able to formally define a natural class of environments that excludes them but is still sufficiently rich.
I disagree. There are a lot of “no free lunch” theorems out there that in principle show various things (e.g. no agent can be intelligent across all environments) but in practice require very specific environments for the bad stuff to hurt the agent (i.e. the environment has to behave, in practice, as if it were a smarter adversarial agent that specifically hated the first agent).
We want to find algorithms that satisfy provable performance guarantees. A “no free lunch” theorem shows that a specific performance guarantee is impossible. Therefore we should look for other performance guarantees, either by changing the setting or changing the class of environments or doing something else.
My counterexample shows that certain natural performance guarantees are unsatisfiable. Therefore it is important to look for other natural settings / desiderata. In particular, I suggest one solution which seems natural and feasible, namely allowing the Student to randomize control between itself and the Teacher. This variant of IRL thereby seems qualitatively more powerful than the original.
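To make the randomized-control idea concrete, here is a minimal toy sketch (not the actual formalism from the post; all names such as `delta`, `teacher_policy`, and `run_episode` are hypothetical). In a two-armed bandit, the Student cedes control to the Teacher with probability `delta` at each step, treats the Teacher's observed choices as demonstrations, and otherwise acts on its current inference of the Teacher's preferences:

```python
import random

def teacher_policy(true_best_arm):
    """The Teacher knows the true reward and always pulls the best arm."""
    return true_best_arm

def run_episode(true_best_arm, delta=0.2, steps=1000, seed=0):
    rng = random.Random(seed)
    counts = [0, 0]  # how often the Teacher was observed choosing each arm
    for _ in range(steps):
        if rng.random() < delta:
            # Student randomizes control to the Teacher and records
            # the demonstration, as in the proposed IRL variant.
            arm = teacher_policy(true_best_arm)
            counts[arm] += 1
        else:
            # Student acts on its current belief about the Teacher's preference.
            arm = 0 if counts[0] >= counts[1] else 1
    # The Student's inferred best arm after the episode.
    return counts.index(max(counts))

print(run_episode(true_best_arm=1))  # the Student recovers the Teacher's preference
```

The point of the sketch is only that delegation gives the Student on-distribution demonstrations it could never force out of a passive Teacher, which is why the randomized-control variant can be qualitatively more powerful.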
Also, obviously the first counterexample constructed is always the most “adversarial” one, since it’s the easiest to prove. This doesn’t mean that there is an algorithm that works well in most other cases. Given that we are in the AI safety business, the burden of proof is on the claim that the “bad” environment is exceptional, not vice versa. Moreover, my intuition is that this counterexample is not exceptionally bad, it’s just maximally bad, i.e. in a typical scenario IRL will only be able to extract a portion (not 0 but also not 1) of the important information about the utility function. If you can prove me wrong, I will be glad to see it!