A summary of my current breakdown of the problem of traps into subproblems and possible paths to solutions. These subproblems are distinct but related, so it is desirable not only to solve each separately, but also to find an elegant synthesis of the solutions.
Problem 1: In the presence of traps, Bayes-optimality becomes NP-hard even on the weakly feasible level (i.e. using the number of states, actions and hypotheses as security parameters).
Currently I only have speculations about the solution, but I have a few desiderata for it:
Desideratum 1a: The algorithm should guarantee some lower bound on expected utility, compared to what the Bayes-optimal policy gets. We should also have an upper bound on what any polynomial-time algorithm can guarantee, and the two bounds should not be too far apart.
Desideratum 1b: When it so happens that there are no traps, the algorithm should achieve asymptotic Bayes-optimality, with a regret bound close to optimal. When there are only “small” traps, the penalty should be proportionally small.
Problem 2: In the presence of traps, there is no “frequentist” guarantee (regret bound). We can divide this into subproblems according to the different motivations for having such a guarantee in the first place.
Problem 2a: We want such a guarantee as a certificate of safety.
Solution: Require a subjective regret bound instead.
Problem 2b: The guarantee is motivated by an “evolutionary” perspective on intelligence: intelligent agents are agents that are successful in the real world, not just on average over all possible worlds.
Solution: Bootstrapping from a safe baseline policy. For an individual human, the baseline comes from knowledge learned from other people. For human civilization, some of the baseline comes from inborn instincts. For human civilization and evolution both, the baseline comes from locality and thermodynamics: doing random things is unlikely to cause global irreversible damage. For an aligned AI, the baseline comes from imitation learning and quantilization.
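The last sentence mentions quantilization. As a minimal sketch of the idea (the sampler, utility function, and parameter values below are hypothetical, and this empirical version only approximates the exact top-q-quantile sampler):

```python
import random

def quantilize(sample_baseline, utility, q=0.1, n=1000):
    """Crude empirical q-quantilizer (hypothetical helper): draw n actions
    from the safe baseline policy and return a uniform choice among the top
    q-fraction by utility. Approximately, any action it outputs has baseline
    probability at least q times its selection probability, which bounds how
    much worse than the baseline the agent can do in expectation."""
    actions = [sample_baseline() for _ in range(n)]
    actions.sort(key=utility, reverse=True)
    top = actions[:max(1, int(q * n))]
    return random.choice(top)

# Usage: the baseline proposes small random perturbations; the utility
# rewards large ones, but the quantilizer never strays into actions the
# baseline would essentially never take.
random.seed(0)
action = quantilize(lambda: random.gauss(0.0, 1.0), utility=abs, q=0.05)
```

The point of the design is that optimization pressure is capped by q: as q → 1 the agent just imitates the baseline, and as q → 0 it approaches an argmax, losing the safety guarantee.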
Problem 2c: The guarantee is needed to have a notion of “sample complexity”, which is such an important concept that it’s hard to imagine deconfusion without it. This notion cannot come just from Desideratum 1a since sample complexity should remain non-trivial even given unbounded computational resources.
Solution: A prior consists of a space H of hypotheses and a probability measure ζ over this space. We also have a mapping ρ:H→E, where E is the space of environments, which provides semantics to the hypotheses. Bayes-optimizing ζ means Bayes-optimizing the environment ζ⋆:=Eh∼ζ[ρ(h)]. Learnability of ζ means that the Bayesian regret Rg(γ):=Eh∼ζ[V(ρ(h),γ)]−V(ζ⋆,γ) must converge to 0 as γ goes to 1. Here, V(μ,γ) is the value (maximal expected utility, normalized to [0,1]) of environment μ at time discount γ.

Notice that the second term depends only on ζ⋆, while the first term depends on ζ and ρ. Therefore, we can ask about the regrets of different decompositions of the same ζ⋆ into hypotheses. For some H′, ζ′∈ΔH′ and ρ′:H′→E s.t. ζ⋆=Eh∼ζ′[ρ′(h)], we can have learnability even when we don’t have it for the original decomposition. I think that typically there will be many such decompositions: they live in the convex set surrounding ζ⋆ in which the value function becomes affine in the γ→1 limit.

We can say that not all information is learnable, but ζ′ represents some learnable information. We can then study the regret bound (and hence the sample complexity) for a particular ζ′, or for all possible ζ′.
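To make the dependence on the decomposition concrete, here is a toy one-shot example (a stand-in for the γ→1 limit; all names are illustrative and not part of the formalism above). The same ζ⋆ has positive Bayesian regret under a fine two-hypothesis decomposition, but zero regret under the coarse decomposition that treats ζ⋆ itself as the single hypothesis:

```python
# Toy illustration: Bayesian regret depends on how ζ⋆ is decomposed
# into hypotheses. Environments are reward vectors over two actions;
# the value of an environment is the best achievable expected reward.

def value(env):
    # V(μ): maximal expected utility in environment μ (one step).
    return max(env)

def mix(envs, weights):
    # ζ⋆ = E_{h∼ζ}[ρ(h)]: the mixture environment.
    return tuple(sum(w * e[i] for e, w in zip(envs, weights))
                 for i in range(len(envs[0])))

# Fine decomposition: two opposite hypotheses, uniform prior ζ.
h1, h2 = (1.0, 0.0), (0.0, 1.0)
zeta_star = mix([h1, h2], [0.5, 0.5])          # (0.5, 0.5)
regret_fine = 0.5 * value(h1) + 0.5 * value(h2) - value(zeta_star)

# Coarse decomposition ζ′: a single hypothesis equal to ζ⋆ itself.
regret_coarse = value(zeta_star) - value(zeta_star)

print(regret_fine, regret_coarse)   # → 0.5 0.0
```

In a one-shot problem the agent cannot learn which hypothesis holds, so the fine decomposition's regret stays at 0.5, while the trivial decomposition is vacuously learnable; ζ′ here encodes no information, illustrating the trade-off between how much information a decomposition represents and whether it is learnable.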