I think this needs to be simplified, so it’s clear what is going on. Removing the Adversary would be useful; if the environment hates you, it’s no surprise it can hurt you, including by preventing learning.
A naive approach would be for the Student to say “set the next i to α−1”, sacrificing a single reward session for perfect information. Thus it seems this counterexample only works when the environment punishes any deviation from ideal behaviour and prevents any form of communication.
On first reading, just ignore the Adversary and consider only the r1 term of the reward.
It is actually not a priori obvious that adversarial environments can prevent learning. I might be mistaken, but I don’t think there is a substantially simpler example of this for IRL. Online learning and adversarial multi-armed bandit algorithms can deal with adversarial environments (thanks to randomization). Moreover, I claim that the setup I describe in the Discussion (allowing the Student to switch control between itself and Teacher without the environment knowing it) admits IRL algorithms which satisfy a non-trivial average regret bound for arbitrary environments.
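To illustrate the point about randomization: the canonical adversarial-bandit algorithm, Exp3, achieves sublinear regret against an arbitrary (even adversarial) reward sequence precisely because its arm choice is random, so the adversary cannot anticipate it. Here is a minimal sketch (the function name, parameters, and toy adversary below are my own illustration, not part of the original discussion):

```python
import math
import random

def exp3(reward_rounds, gamma=0.1, seed=0):
    """Run Exp3 on a sequence of reward vectors (one per round, values in [0, 1]).

    The adversary may pick each round's rewards arbitrarily; randomizing the
    arm choice is what makes sublinear regret possible against it.
    """
    rng = random.Random(seed)
    n = len(reward_rounds[0])
    weights = [1.0] * n
    total = 0.0
    for rewards in reward_rounds:
        wsum = sum(weights)
        # Mix the exponential-weights distribution with uniform exploration.
        probs = [(1 - gamma) * w / wsum + gamma / n for w in weights]
        arm = rng.choices(range(n), weights=probs)[0]
        r = rewards[arm]
        total += r
        # Importance-weighted update: only the pulled arm's estimate changes.
        weights[arm] *= math.exp(gamma * r / (probs[arm] * n))
    return total

# A fixed "adversary": arm 0 always pays 1, arm 1 always pays 0.
rounds = [[1.0, 0.0] for _ in range(2000)]
total = exp3(rounds)
```

Against this sequence Exp3 quickly concentrates on arm 0, collecting most of the achievable reward; the same regret guarantee holds even when the reward vectors are chosen adaptively by an adversary.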
I think that solutions based on communication are not scalable to applications in strong AI safety. Humans are not able to formalize their preferences and are therefore unable to communicate them to an AI. This is precisely why I want a solution based on the AI observing revealed preferences instead.