The biggest problem with AIXI, in my view, is the reward system: it cares about the future directly, whereas to have any reasonable hope of alignment an AI needs to care about the future only via what humans would want for the future (so that any reference to the future is encapsulated in the “what do humans want?” aspect).
That is, the question it needs to be answering is something like: “all things considered (including the consequences of my current action on the future, and taking into account my possible future actions), what would humans, as they exist now, want me to do at the present moment?”
Now, maybe you can take that question and slice it up into rewards at particular timesteps, which change over time as what is known about what humans want changes, without introducing corrigibility issues. But even if that works, the AIXI reward framework isn’t really buying you anything, imo, relative to directly trying to get an AI to answer that question.
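To make the contrast concrete: AIXI’s standard action rule (roughly in Hutter’s notation, with $U$ the reference universal machine, $\ell(q)$ the length of program $q$, and $m$ the horizon) scores actions by summed future rewards,

$$a_k := \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m} \big[r_k + \cdots + r_m\big] \sum_{q\,:\,U(q,\,a_{1:m}) = o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)},$$

whereas the question above is closer to a one-step evaluation,

$$a_t := \arg\max_{a_t} \mathbb{E}\big[\,W(a_t \mid h_{<t})\,\big],$$

where $W$ is just a placeholder I’m introducing here for “how much humans, as they exist now, would endorse taking $a_t$ given the history $h_{<t}$”, with all reasoning about future consequences folded inside $W$ rather than into an external sum of rewards.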
On the other hand, approximating Solomonoff induction might, afaik, be a fruitful approach, though the approximations are going to have to be very aggressive for practical performance. I do agree embedding/self-reference can probably be patched in.
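As a toy illustration of how aggressive such approximations get (a sketch made up for this comment, not any published algorithm): replace the mixture over all programs with a tiny fixed class of predictors, weight them by 2^(-rough description length), and do a Bayesian update on each observed bit.

```python
# Toy sketch (made up for this comment, not a published algorithm): a crude
# stand-in for Solomonoff induction. Instead of mixing over all programs,
# mix over a tiny fixed class of bit-predictors, weighted by 2^(-complexity),
# and update the weights by Bayes' rule after each observed bit.

from typing import Callable, List, Tuple

# A "predictor" maps the observed history to P(next bit = 1).
Predictor = Callable[[List[int]], float]

def constant(p: float) -> Predictor:
    """Predicts 1 with fixed probability p, ignoring the history."""
    return lambda history: p

def markov(order: int) -> Predictor:
    """Laplace-smoothed order-k Markov predictor over bits."""
    def predict(history: List[int]) -> float:
        if len(history) <= order:
            return 0.5
        context = tuple(history[-order:])
        ones = zeros = 0
        for i in range(order, len(history)):
            if tuple(history[i - order:i]) == context:
                if history[i] == 1:
                    ones += 1
                else:
                    zeros += 1
        return (ones + 1) / (ones + zeros + 2)
    return predict

# (predictor, rough "description length" in bits). The lengths are invented
# for illustration; a real approximation would measure actual code length.
hypotheses: List[Tuple[Predictor, int]] = [
    (constant(0.5), 1),
    (constant(0.9), 3),
    (markov(1), 5),
    (markov(2), 8),
]

# Simplicity prior: weight each hypothesis by 2^(-description length).
weights = [2.0 ** -length for _, length in hypotheses]

def predict_next(history: List[int]) -> float:
    """Mixture prediction: posterior-weighted average over hypotheses."""
    total = sum(weights)
    return sum(w * h(history) for w, (h, _) in zip(weights, hypotheses)) / total

def update(history: List[int], bit: int) -> None:
    """Bayes update: multiply each weight by the likelihood it assigned."""
    for i, (h, _) in enumerate(hypotheses):
        p_one = h(history)
        weights[i] *= p_one if bit == 1 else 1.0 - p_one

# Feed in an alternating sequence; the order-1 Markov predictor dominates.
history: List[int] = []
for t in range(40):
    bit = t % 2
    update(history, bit)
    history.append(bit)

print("P(next bit = 1):", round(predict_next(history), 3))
```

Everything that makes Solomonoff induction what it is (incomputability, enumeration over all programs, exact code lengths) has been thrown away here; the point is just that what survives is an ordinary Bayesian mixture with a simplicity prior.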
I am currently writing a paper on alternative utility functions for AIXI. Early steps in this direction have been taken, for example, here by @Anja and here by @AlexMennen; as far as I know the only serious published example is Laurent Orseau’s knowledge-seeking agent.
The reward-seeking formulation of AIXI is a product of its time and not a fundamental feature/constraint: any “continuous, l.s.c.” (lower semicontinuous) utility function is fine. The details will be spelled out in my paper.
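Roughly, the shape of the generalization (in informal notation here; the paper does this carefully) is to replace the summed rewards with a utility over histories:

$$a_t := \arg\max_{a_t} \sum_{e_t} \cdots \max_{a_m} \sum_{e_m} u(a_1 e_1 \ldots a_m e_m)\; \xi(e_{t:m} \mid e_{<t},\, a_{1:m}),$$

where $\xi$ is the Solomonoff mixture over environments, the $e_i$ are percepts, and $u$ is the utility function; ordinary AIXI is the special case where $u$ is the sum of the rewards encoded in the percepts. The continuity/l.s.c. condition is what keeps the infinite-horizon version well-behaved.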
Actually choosing that utility function to be aligned with human values is ~equivalent to the alignment problem. AIXI does not solve it, but does “modularize” it to some extent.