Actually, I think that if we consider only deterministic maximization policies then an optimal predictor for U wrt a bounded-Somonoff-type measure is sufficient to get an optimal maximization policy. In this case we can do maximization using Levin’s universal search L. A significantly better maximization policy M cannot exist since it would allow us to improve our estimates of U(x,M(x)) and/or U(x,L(x)).
Of course under standard assumptions about derandomization the best deterministic policy is about as good as the best random policy (on average), and in particular requiring U is generatable wrt a suitable measure is sufficient. Random only gives significant advantage in adversarial setups: your toy model of informed oversight is essentially an adversarial setup between A and B.
Also, obviously Levin search is grossly inefficient in practice (like the optimal predictor Λ which is basically a variant of Levin search) but this model suggests that applying a more practical learning algorithm would give satisfactory results.
Also, I think it’s possible to use probabilistic policies which produce computable distributions (since for such a policy outputs with low and high probabilities are always distinguishable).
Actually, I think that if we consider only deterministic maximization policies then an optimal predictor for U wrt a bounded-Somonoff-type measure is sufficient to get an optimal maximization policy. In this case we can do maximization using Levin’s universal search L. A significantly better maximization policy M cannot exist since it would allow us to improve our estimates of U(x,M(x)) and/or U(x,L(x)).
Of course under standard assumptions about derandomization the best deterministic policy is about as good as the best random policy (on average), and in particular requiring U is generatable wrt a suitable measure is sufficient. Random only gives significant advantage in adversarial setups: your toy model of informed oversight is essentially an adversarial setup between A and B.
Also, obviously Levin search is grossly inefficient in practice (like the optimal predictor Λ which is basically a variant of Levin search) but this model suggests that applying a more practical learning algorithm would give satisfactory results.
Also, I think it’s possible to use probabilistic policies which produce computable distributions (since for such a policy outputs with low and high probabilities are always distinguishable).