From your paper:
“It is interesting to note that AIXI, a Bayes-optimal reinforcement learner in general environments, is not asymptotically optimal [Orseau, 2010], and indeed, may cease to explore [Leike et al., 2015]. Depending on its prior and its past observations, AIXI may decide at some point that further exploration is not worth the risk. Given our result, this seems like reasonable behavior.”
Given this, why is your main conclusion “Perhaps our results suggest we are in need of more theory regarding the ‘parenting’ of artificial agents” instead of “We should use Bayesian optimality instead of asymptotic optimality”?
The simplest version of the parenting idea includes an agent that is Bayes-optimal. Parenting would just be designed to help out a Bayesian reasoner, since there’s not much you can say about how much a Bayesian reasoner will explore, or how much it will learn; it all depends on its prior. (Almost all policies are Bayes-optimal with respect to some (universal) prior.) There’s still a fundamental trade-off between learning and staying safe, so while the Bayes-optimal agent does not do as bad a job of picking a point on that trade-off as the asymptotically optimal agent does, that doesn’t quite allow us to say that it will pick the right point. As long as we have access to “parents” that might be able to guide an agent toward world-states where this trade-off is less severe, we might as well make use of them.
And I’d say it’s more a conclusion than the main one.
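To make the “it all depends on its prior” point concrete, here is a minimal sketch of a toy problem. It is an illustrative construction, not something from the paper, and every name in it (`prior_good`, `safe_reward`, `gamma`) is a hypothetical parameter. The agent can take a safe action worth `safe_reward` per step forever, or commit once to an unknown action that is either good (reward 1 forever) or an irreversible trap (reward 0 forever). The safe action reveals nothing, so the posterior never changes while the agent stays safe, and the Bayes-optimal choice reduces to comparing two discounted values.

```python
def value_of_staying_safe(safe_reward: float, gamma: float) -> float:
    """Discounted value of taking the safe action forever."""
    return safe_reward / (1.0 - gamma)


def value_of_exploring(prior_good: float, gamma: float) -> float:
    """Expected discounted value of committing to the unknown action now.

    With probability prior_good it pays 1 per step forever; otherwise it is
    an irreversible trap that pays 0 forever.
    """
    return prior_good * 1.0 / (1.0 - gamma)


def bayes_optimal_action(prior_good: float,
                         safe_reward: float = 0.5,
                         gamma: float = 0.9) -> str:
    """Return the Bayes-optimal choice in this toy problem.

    Delaying exploration by k steps yields a convex combination of the two
    values below, so only "explore now" and "never explore" need comparing.
    """
    explore = value_of_exploring(prior_good, gamma)
    safe = value_of_staying_safe(safe_reward, gamma)
    return "explore" if explore > safe else "stay safe"


if __name__ == "__main__":
    # Same environment class, same optimality criterion, different priors.
    for prior_good in (0.1, 0.4, 0.6, 0.9):
        print(prior_good, bayes_optimal_action(prior_good))
```

Both behaviors are Bayes-optimal with respect to the prior that produced them, so the criterion alone does not say whether the agent picked the right point on the trade-off between learning and staying safe. In this sketch the agent explores exactly when `prior_good` exceeds `safe_reward`.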