On first reading, just ignore the Adversary and consider only the r1 term of the reward.
It is actually not a priori obvious that adversarial environments can prevent learning. I might be mistaken, but I don’t think there is a substantially simpler example of this phenomenon for IRL. Online learning and adversarial multi-armed bandit algorithms can cope with adversarial environments (thanks to randomization). Moreover, I claim that the setup I describe in the Discussion (allowing the Student to switch control between itself and the Teacher without the environment knowing it) admits IRL algorithms that satisfy a non-trivial average regret bound for arbitrary environments.
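To make the role of randomization concrete, here is a minimal sketch of EXP3, the standard adversarial multi-armed bandit algorithm (this is illustrative background, not part of my construction; the reward function, parameters, and arm count below are hypothetical). Because the arm is sampled from a mixed distribution and the update uses an unbiased importance-weighted estimate, an adversary that fixes the reward sequence in advance cannot force linear regret:

```python
import math
import random

def exp3(num_arms, num_rounds, reward_fn, gamma=0.1):
    """EXP3: a randomized bandit algorithm whose expected regret
    against any fixed reward sequence is O(sqrt(T * K * log K)).

    reward_fn(t, arm) -> reward in [0, 1], chosen by the adversary
    before seeing the algorithm's coin flips at round t.
    """
    weights = [1.0] * num_arms
    total_reward = 0.0
    for t in range(num_rounds):
        total_w = sum(weights)
        # Mix exponential weights with uniform exploration; this
        # randomization is what keeps the adversary from predicting
        # (and punishing) the next arm.
        probs = [(1 - gamma) * w / total_w + gamma / num_arms
                 for w in weights]
        arm = random.choices(range(num_arms), weights=probs)[0]
        reward = reward_fn(t, arm)
        total_reward += reward
        # Importance weighting keeps the reward estimate unbiased
        # even though only the chosen arm's reward is observed.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / num_arms)
        # Renormalize to avoid floating-point overflow.
        max_w = max(weights)
        weights = [w / max_w for w in weights]
    return total_reward

# Hypothetical adversarial sequence: reward 1 only on the arm that
# matches the round's parity. Each fixed arm collects ~5,000 over
# 10,000 rounds, and EXP3's total should land close to that.
print(exp3(num_arms=2, num_rounds=10_000,
           reward_fn=lambda t, arm: float(arm == t % 2)))
```

The point of the example is only that randomization, not any assumption about the environment, is what makes the regret guarantee possible; the switching-control setup I mention plays an analogous role for IRL.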
I think that solutions based on communication do not scale to strong AI safety applications. Humans cannot formalize their preferences and are therefore unable to communicate them to an AI. This is precisely why I want a solution based on the AI observing revealed preferences instead.