Idea: Imitation/Value Learning AIXI

Note: I’m not writing with rigor

Michele Campolo proposes that goal-directed behavior is compressible, which gives a necessary (but not sufficient) condition for goal-directed behavior. Say some function takes in data from the environment and converts that data to a reward function. One standard way of specifying this function is to take the data to be a set of expert demonstrations and then perform inverse reinforcement learning (IRL) on those demonstrations. After doing this we have a reward function. Ultimately we have,

This follows from proposition 3.2 of the original GAIL paper. Thus, the compression bound Campolo presents becomes,

However, at this point I noticed that we could instead try to find the simplest policy given the data. This would be a policy that best reproduces the structure of the demonstrations in deployment. I’d presume this would be an imitation-learning equivalent of AIXI.
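One way to make "simplest policy given the data" concrete, as a sketch rather than a worked-out definition, is a two-part minimum-description-length objective: pay for the length of the policy, then pay for how badly it predicts the demonstrations. The symbols here are assumed for illustration: $K(\pi)$ is the Kolmogorov complexity of the policy, and $D = \{(s_i, a_i)\}$ is the set of demonstrated state-action pairs.

```latex
\hat{\pi} \;=\; \arg\min_{\pi} \Big[ K(\pi) \;-\; \sum_{(s_i, a_i) \in D} \log_2 \pi(a_i \mid s_i) \Big]
```

Since $K$ is uncomputable, this is only an idealized target, in the same way AIXI is.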

AIXI works by using Solomonoff induction to apply Occam’s razor to environment observations. This gives an a priori model of the environment (a prior that favors simpler environments), which is then updated as AIXI interacts with the environment. It seems that you could define an imitation-learning version of AIXI (AIXIL) by weighting all possible policies with the Solomonoff version of Occam’s razor.
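To get a feel for what weighting policies by simplicity might look like, here is a toy, computable stand-in. It is not AIXIL itself: real Solomonoff induction ranges over all programs and is uncomputable. The four candidate policies, their hand-assigned description lengths in bits, and all function names are assumptions made up for this sketch. Each policy gets prior weight 2^(-bits), the weights are updated by how well the policy predicts the demonstrated actions, and the imitator acts according to the posterior mixture.

```python
import math
from typing import Callable, Dict, List, Tuple

Policy = Callable[[int, int], float]  # pi(state, action) -> probability
Demo = Tuple[int, int]                # one demonstrated (state, action) pair


def always(fixed: int, eps: float = 0.05) -> Policy:
    """Policy that almost always picks one fixed binary action."""
    return lambda s, a: 1.0 - eps if a == fixed else eps


def parity(eps: float = 0.05) -> Policy:
    """Policy that almost always picks the parity of the state."""
    return lambda s, a: 1.0 - eps if a == s % 2 else eps


def uniform() -> Policy:
    """Policy that picks either binary action with equal probability."""
    return lambda s, a: 0.5


# Hypothetical hypothesis class: name -> (description length in bits, policy).
# The bit counts are hand-assigned stand-ins for Kolmogorov complexity.
HYPOTHESES: Dict[str, Tuple[int, Policy]] = {
    "uniform": (1, uniform()),
    "always0": (2, always(0)),
    "always1": (2, always(1)),
    "parity": (4, parity()),
}


def posterior(demos: List[Demo]) -> Dict[str, float]:
    """Posterior over policies: prior 2^(-bits) times demonstration likelihood."""
    log_w = {}
    for name, (bits, pi) in HYPOTHESES.items():
        log_prior = -bits * math.log(2.0)
        log_lik = sum(math.log(pi(s, a)) for s, a in demos)
        log_w[name] = log_prior + log_lik
    top = max(log_w.values())  # subtract the max for numerical stability
    unnorm = {n: math.exp(v - top) for n, v in log_w.items()}
    total = sum(unnorm.values())
    return {n: w / total for n, w in unnorm.items()}


def mixture_prob(post: Dict[str, float], state: int, action: int) -> float:
    """Probability the posterior-weighted mixture assigns to an action."""
    return sum(post[n] * HYPOTHESES[n][1](state, action) for n in post)


# Expert demonstrations in which the action is the parity of the state.
demos = [(s, s % 2) for s in range(6)]
post = posterior(demos)
print(max(post, key=post.get), mixture_prob(post, 7, 1))
```

With no demonstrations the shortest policy dominates; after a handful of parity-consistent demonstrations the longer but better-fitting policy takes over. That is the Occam trade-off the idea relies on, with no reward function appearing anywhere in the loop.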

Using Campolo’s bound, the result would perform at, or above, the level of a value-learning approach (even an AIXVL approach). This seems to imply that value learning isn’t really fundamental to creating an AI that models our behavior. I’d assume someone has worked this out further than I did here. Comments or references?