I think this needs to be simplified, so it’s clear what is going on. Removing the Adversary would be useful; if the environment hates you, it’s no surprise it can hurt you, including by preventing learning.
A naive approach would be for the Student to say “set the next i to α−1”, sacrificing a single reward session for perfect information. Thus it seems this counterexample only works when the environment punishes any deviation from ideal behaviour and prevents any form of communication.
On first reading, just ignore the Adversary and consider only the r1 term of the reward.
It is actually not a priori obvious that adversarial environments can prevent learning. I might be mistaken, but I don’t think there is a substantially simpler example of this for IRL. Online learning and adversarial multi-armed bandit algorithms can deal with adversarial environments (thanks to randomization). Moreover, I claim that the setup I describe in the Discussion (allowing the Student to switch control between itself and Teacher without the environment knowing it) admits IRL algorithms which satisfy a non-trivial average regret bound for arbitrary environments.
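To illustrate the point about randomization: the canonical adversarial-bandit algorithm, Exp3, achieves sublinear regret against an arbitrary (even adversarial) reward sequence precisely because its arm choice is random, so the adversary cannot anticipate it. Here is a minimal sketch (the function name, parameters, and toy adversary below are my own illustration, not part of the original discussion):

```python
import math
import random

def exp3(reward_rounds, gamma=0.1, seed=0):
    """Run Exp3 on a sequence of reward vectors (one per round, values in [0, 1]).

    The adversary may pick each round's rewards arbitrarily; randomizing the
    arm choice is what makes sublinear regret possible against it.
    """
    rng = random.Random(seed)
    n = len(reward_rounds[0])
    weights = [1.0] * n
    total = 0.0
    for rewards in reward_rounds:
        wsum = sum(weights)
        # Mix the exponential-weights distribution with uniform exploration.
        probs = [(1 - gamma) * w / wsum + gamma / n for w in weights]
        arm = rng.choices(range(n), weights=probs)[0]
        r = rewards[arm]
        total += r
        # Importance-weighted update: only the pulled arm's estimate changes.
        weights[arm] *= math.exp(gamma * r / (probs[arm] * n))
    return total

# A fixed "adversary": arm 0 always pays 1, arm 1 always pays 0.
rounds = [[1.0, 0.0] for _ in range(2000)]
total = exp3(rounds)
```

Against this sequence Exp3 quickly concentrates on arm 0, collecting most of the achievable reward; the same regret guarantee holds even when the reward vectors are chosen adaptively by an adversary.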
I think that solutions based on communication are not scalable to applications in strong AI safety. Humans are not able to formalize their preferences and are therefore unable to communicate them to an AI. This is precisely why I want a solution based on the AI observing revealed preferences instead.