Agree that IRL doesn’t solve this problem (it just bumps it to another level).
The second-tier idea sounds a lot like KWIK learning. I think this is a decent approach if we’re fine with only learning instrumental goals and are using a bootstrapping procedure.
KWIK learning is definitely related, in the sense that we want to follow a “conservative” policy that is risk-averse w.r.t. its uncertainty about the utility function. This is similar to how a KWIK learner answers “I don’t know” instead of producing a label it is uncertain about. Btw, do you know which of the open problems in the Li-Littman-Walsh paper are solved by now?
I don’t know which open problems have been solved.
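For anyone following along who hasn't seen KWIK before, here is a minimal sketch of the "refuse to label when uncertain" behaviour mentioned above. It uses the simple enumeration-style learner for a finite hypothesis class with deterministic labels. The class name `EnumerationKWIK`, the threshold hypotheses, and the toy inputs are all made up for illustration and are not taken from the paper or from anything in this thread.

```python
from typing import Callable, Hashable, Iterable, List, Optional

Hypothesis = Callable[[int], Hashable]


class EnumerationKWIK:
    """Toy KWIK learner: predicts only when all surviving hypotheses agree."""

    def __init__(self, hypotheses: Iterable[Hypothesis]) -> None:
        # Version space: hypotheses consistent with every label seen so far.
        self.version_space: List[Hypothesis] = list(hypotheses)

    def predict(self, x: int) -> Optional[Hashable]:
        predictions = {h(x) for h in self.version_space}
        if len(predictions) == 1:
            return next(iter(predictions))
        return None  # "I don't know": the surviving hypotheses disagree on x

    def observe(self, x: int, y: Hashable) -> None:
        # Called after an "I don't know"; drops hypotheses that got x wrong.
        self.version_space = [h for h in self.version_space if h(x) == y]


# Hypothetical hypothesis class: thresholds h_t(x) = (x >= t) for t in 0..4.
hypotheses = [lambda x, t=t: x >= t for t in range(5)]
true_concept = lambda x: x >= 2  # unknown to the learner

learner = EnumerationKWIK(hypotheses)
outputs = []
for x in [5, 2, 3, 1, 0]:
    guess = learner.predict(x)
    outputs.append(guess)
    if guess is None:
        # The true label is revealed only when the learner abstains.
        learner.observe(x, true_concept(x))

print(outputs)  # [True, None, True, None, False]
```

Every abstention removes at least one hypothesis, so this learner says "I don't know" at most |H| - 1 times and never commits to a wrong label, assuming the true concept is in the class. A conservative policy under utility uncertainty would play the analogous role of abstaining (or deferring) wherever the candidate utility functions disagree.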