Musings on Exploration

[epistemic status: Polemical, representative of my opinions and not those of any others, plausibly flawed in some places, generally endorsed.]

The two main settings of decision theory so far, Proof-based decision theory and Logical Inductor decision theory, both have exploration steps.

Proof-based decision theory has an implicit exploration step: if there is a proof that the agent doesn't take a particular action, then it is possible to prove arbitrarily good outcomes conditional on taking that action, which induces the agent to take the action.
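To sketch why (this is the standard spurious-counterfactual argument, in my own notation rather than a quote from any particular paper): suppose the agent searches for proofs of sentences of the form $A()=a \rightarrow U()=u$ and takes whichever action has the best proven utility bound. Then

$$\text{if } PA \vdash A() \neq a, \text{ then for every } u:\quad PA \vdash \big(A()=a \rightarrow U()=u\big),$$

because an implication with a provably false antecedent is provable. So wherever the theory can prove the agent doesn't take $a$, it can also prove arbitrarily good outcomes for taking $a$, which pushes the argmax back toward taking $a$.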

Logical Inductor decision theory has an explicit exploration step, where if the agent is sufficiently certain that it won't take a particular action, it takes that action.
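As a minimal sketch of what such an exploration clause looks like (my own toy code, not the actual proposal; `prob_act` stands in for the inductor's current price on "I take this action this round", and the threshold is an arbitrary illustrative value):

```python
EXPLORE_THRESHOLD = 0.01  # arbitrary illustrative value

def choose_action(actions, prob_act, expected_utility):
    # Explicit exploration clause: if the inductor is sufficiently sure some
    # action will NOT be taken, take it anyway so beliefs about it get tested.
    for a in actions:
        if prob_act(a) < EXPLORE_THRESHOLD:
            return a
    # Otherwise act greedily on current expected utilities.
    return max(actions, key=expected_utility)
```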

Both of these are motivated by the same issue: if it is certain that an agent won't take a particular action, arbitrarily bad guesses about what happens if the agent takes that action can persist and won't be removed, because the agent never takes the action. Both forms of exploration step can be thought of as maintaining a model/tiny probability where the action is taken, so that EDT-style conditioning works. However, conditionals are not counterfactuals (this particular link is extra-important to read).

Further, this isn't an issue specifically with 0-probability events. In general, estimates of what happens conditional on a low-probability event are worse than estimates of what happens conditional on a high-probability event, as shown in Theorem 3.3 of the Optimal Poly-Time Estimator paper. In short, if you've got an optimal estimator for the indicator function $\chi_L$ that dictates whether property $L$ holds, and you've got an optimal estimator for the function $f \cdot \chi_L$ that's $0$ when $L$ is false and $f$ when $L$ is true, then the ratio of the two is an optimal estimator for $f$ conditional on $L$ being true. However, the error in the original optimal estimators is blown up by a factor of $\frac{1}{\mu(L)}$, where $\mu(L)$ is the portion of the probability distribution where property $L$ holds. As the probability of something shrinks, the error in using standard conditional probabilities increases, because a tiny error in the numerator gets amplified when you divide by $\mu(L)$, which is very small.
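A toy numerical illustration of that blow-up (my own example, not one from the paper): hold the additive error of the underlying estimators fixed at $\epsilon$ and shrink $\mu(L)$, and the error of the resulting conditional estimate grows roughly like $\epsilon / \mu(L)$.

```python
# Toy illustration: estimate E[f | L] as (estimate of E[f * 1_L]) / (estimate of mu(L)),
# with a fixed small additive error eps in both estimators, and watch the error
# in the conditional estimate grow as mu(L) shrinks.
eps = 0.001        # fixed error in the underlying "optimal estimators"
f_given_L = 0.7    # true conditional expectation E[f | L]

for mu_L in [0.5, 0.05, 0.005]:
    est_joint = f_given_L * mu_L + eps   # noisy estimate of E[f * 1_L]
    est_mu = mu_L - eps                  # noisy estimate of mu(L)
    est_conditional = est_joint / est_mu
    print(mu_L, abs(est_conditional - f_given_L))
# Errors come out around 0.003, 0.035, and 0.43: roughly eps/mu(L), up to a constant.
```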

#Troll Bridge

Troll Bridge is the decision theory problem where there's a troll that blows up a bridge that you want to cross precisely when your exploration step is active. The utility of staying on your side of the bridge is 0.5, the utility of crossing the bridge is 1, and the utility of crossing the bridge when it blows up is 0. It is the logical inductor version of a similar problem for proof-based decision theory, in which the troll blows up the bridge when PA is inconsistent, which is a necessary condition for proof-based decision theory to "explore" into suboptimal actions.

Troll Bridge is actually not a good argument against EDT-style conditioning with exploration steps, because an analogue of it plausibly applies to pretty much any agent you could come up with. Exploration is intimately related to the exploration/exploitation tradeoff, so, consulting Wikipedia's list of multi-armed bandit algorithms...

For all the semi-uniform strategies based on randomization, there's a clear correspondence to the exploration-step approach, and the troll blows up the bridge whenever the agent randomizes.
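To make that correspondence concrete, here is a small simulation I put together (my own toy construction, using the payoffs above): an epsilon-greedy agent on Troll Bridge, where the troll detonates the bridge on exactly the rounds where the randomizing exploration clause fires. The agent only ever observes crossing during exploration, so its estimate for crossing stays near 0 and it converges to never crossing.

```python
import random

# Epsilon-greedy agent on Troll Bridge (toy construction).
# Payoffs from the post: stay = 0.5, cross = 1, cross while the bridge blows up = 0.
# The troll blows up the bridge exactly when the exploration clause is active.
random.seed(0)
EPS = 0.1
est = {"stay": 0.0, "cross": 0.0}    # running average payoff per action
count = {"stay": 0, "cross": 0}

for step in range(100_000):
    exploring = random.random() < EPS
    action = random.choice(["stay", "cross"]) if exploring else max(est, key=est.get)
    if action == "stay":
        reward = 0.5
    else:
        reward = 0.0 if exploring else 1.0   # troll targets the exploration clause
    count[action] += 1
    est[action] += (reward - est[action]) / count[action]

print(est)   # est["cross"] stays ~0, so the greedy choice converges to "stay"
```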

The solution from the paper "Optimal Adaptive Policies for Markov Decision Processes" features two exploration mechanisms. The first is systematically overestimating the utility gained by taking untaken actions, and the second is taking historically low-probability actions at approximately a logarithmic rate over time, as in density-zero exploration. Yet again, it is possible to design a variant of Troll Bridge where the agent starts off with a significant underestimate of the utility of crossing the bridge, and the criterion for bridge explosion is the exploration clause being active. The agent will converge to not crossing the bridge.

For pricing strategies, there is also a variant of Troll Bridge that defeats them. Each option is associated with a score that is the sum of the expected reward and an estimate of the extra future reward gained through the additional knowledge. So, if the troll's bridge-explosion criterion is the value of information for bridge-crossing being above a certain threshold, the agent can converge to not crossing the bridge.

As far as I can tell, the only strategy that doesn't have some sort of targetable exploration behavior is Thompson sampling.
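For reference, here is Beta-Bernoulli Thompson sampling in its standard textbook form. The point is just that its exploration comes from sampling the posterior rather than from a separate exploration clause, so there is no distinct "I am exploring now" predicate for the troll to key off of. (Rewards here are assumed to be 0 or 1, so this is not literally the Troll Bridge payoff structure.)

```python
import random

# Standard Beta-Bernoulli Thompson sampling: exploration is implicit in
# sampling from the posterior, with no separate exploration clause.
arms = {"stay": [1, 1], "cross": [1, 1]}   # Beta(alpha, beta) parameters per arm

def choose_arm():
    samples = {a: random.betavariate(alpha, beta) for a, (alpha, beta) in arms.items()}
    return max(samples, key=samples.get)

def update(arm, reward):   # reward must be 0 or 1
    arms[arm][0] += reward
    arms[arm][1] += 1 - reward
```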

Even for humans: if you were facing Troll Bridge, didn't know how it worked, and the troll's criterion for bridge explosion was "the human is thinking 'I feel like doing something unusual and seeing what happens'", I wouldn't be surprised if you also got possible convergence to failure.

Causal inference is greatly assisted by the ability to perform randomized controlled trials, and because Troll Bridge targets exactly that key capability, it probably isn't an environment where we should expect optimality. In general, if there's some key assumption that is necessary to get good performance on an environment, we shouldn't necessarily demand optimality on problems that violate that key assumption.

So, since you can tune a variant of Troll Bridge to work on most algorithms for handling the exploration/exploitation tradeoff, and it violates the ability to perform randomized controlled trials of various actions (which is key for inferring the causal structure of an environment)... It's probably just an unfair environment, and we shouldn't expect optimality on it. As far as I can tell, it's just designed to screw over algorithms with explicit exploration steps.

#Against Exploration Steps

However, even though Troll Bridge is a bad argument against exploration steps, exploration steps are still bad for completely different reasons. Infinite exploration, when the environment contains irreversible traps, guarantees catastrophe. Exploration into the following actions is not advisable.

Further, exploration steps seem to introduce an awkward discontinuity into the decision criterion, where expected utility is maximized, except sometimes an exploration step occurs. Due to the potential for catastrophe, I'd anticipate that any agent with an exploration step would self-modify it away as it grew up. In the specific case of AIXI, according to the probability distribution over hypotheses at some finite time, the expected utility of following the AIXI policy is greater than the expected utility of following the AIXI+exploration policy.
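In symbols (my paraphrase of that claim, writing $\xi_t$ for the agent's mixture over hypotheses at time $t$): since the AIXI policy is by definition the $\xi_t$-expected-utility maximizer, overriding it with occasional exploration can only lower expected utility as judged by $\xi_t$ itself:

$$\mathbb{E}_{\xi_t}\!\left[U \mid \pi_{\text{AIXI}}\right] \;=\; \max_{\pi}\; \mathbb{E}_{\xi_t}\!\left[U \mid \pi\right] \;\ge\; \mathbb{E}_{\xi_t}\!\left[U \mid \pi_{\text{AIXI}+\text{exploration}}\right].$$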

The true reason to do exploration seems to be that the agent believes the action it is taking will not lead to an irreversible trap, and that the action will reveal information about the true environment that enables a better policy later on, which, in expectation up to the time horizon, outweighs the temporary loss incurred by exploring. AIXI works this way, as an example. The setting of Logical Inductor decision theory renders analysis of this impossible, because there is a separate utility score for each round, instead of a cumulative utility score over the rounds, which is necessary to model the possibility of falling into irreversible traps. The real world has traps, so it is unwise to assume them away in the problem setting. It seems useful to attempt to import this feature of AIXI into Logical Inductor decision theory.
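Schematically (my own formalization of this motivation, not a formula appearing in either setting), such an agent explores with action $a$ only when it believes $a$ is not an irreversible trap and

$$\mathbb{E}\big[\text{gain, up to the horizon, from the better policy learned by observing } a\big] \;>\; \mathbb{E}\big[\text{immediate utility lost by taking } a \text{ instead of the greedy action}\big].$$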

Admittedly, there is a specific class of failure in AIXI that is solved by exploration steps, and is reminiscent of getting locked into a bad action forever by an enforcing trader; it is detailed in this paper by Leike and Hutter. Pretty much, if you pick a universal Turing machine that assigns high probability to a universe where you go to hell if you do anything other than [string of actions], the agent will just output [string of actions].

As previously mentioned, explicit exploration steps would probably be self-modified away. Then how is this issue to be solved? For early pruning of hypotheses, the posts on Delegative Inverse Reinforcement Learning provide an answer. The agent defers to the human for a choice of action when it suspects something might be an irreversible trap (as in the heaven-hell example from Leike and Hutter), and this permits human-assisted identification of obvious traps/non-traps. The agent updates on this and can asymptotically learn to take better actions than the human, and over $n$ rounds, defers to the human on only a sublinear number of them.
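As a rough sketch of that delegation rule (my own pseudocode, with hypothetical method names like `best_action` and `credence_action_is_trap`; the actual DIRL algorithm and its guarantees are in the linked posts):

```python
TRAP_CREDENCE_THRESHOLD = 0.01   # illustrative value only

def act(agent, observation):
    action = agent.best_action(observation)
    # Defer to the human whenever the agent still assigns non-negligible
    # credence to the intended action being an irreversible trap.
    if agent.credence_action_is_trap(observation, action) > TRAP_CREDENCE_THRESHOLD:
        return agent.delegate_to_human(observation)
    return action
```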

Once the agent has figured out some of how the world works, most environments/hypotheses where there is a trap have evidential clues elsewhere to rule them out. The world in which firing a full nuclear arsenal is dangerous can be distinguished by smaller-scale nuclear testing and modeling in uninhabited areas. The GUT-scale particle accelerator that may induce vacuum collapse presumably involves a difference in physical laws that is testable at a lower energy scale. "Bayesian paranoia" of the sort "if I take [some action], it will lead to an irreversible loss of utility" doesn't seem like much of an issue either: there would be a K-complexity penalty on the order of "the number of bits to specify an additional physical law that kicks in upon taking a particular action, and at no other time". To cause problems, it isn't enough to just have an environment that implies that an action is an irreversible trap; this environment must also be indistinguishable from the environment where the action isn't an irreversible trap, by any means except trying it and seeing what happens. So I don't expect irrational aversion to obvious non-traps to be an issue in practice, and this applies with even more force when delegative approaches are thrown into the mix.

tl;dr

Exploration steps are inelegant, will probably be self-modified away, and imply unacceptably bad behavior in reality.

At the same time, Troll Bridge is an unfair argument against exploration steps because it explicitly targets one of the features which enables learning causal structure.

The best feasible result seems to be something like vanilla AIXI. Specifically, exploration should be motivated by the resulting reduction in the hypothesis space, which enables better actions in the future. Also, the setting of Logical Inductor decision theory is not right for studying this behavior.

Admittedly, AIXI does have the issue of possibly getting permanently locked into bad hypotheses in which it is paranoid about possible traps. However, a better solution would probably be DIRL's approach of deferring to the human early on. Later on, I anticipate that the majority of worlds with spurious traps can be distinguished from the real world by some evidence, which will be sought out.