A problem with resource-bounded Solomonoff induction and unpredictable environments

Summary: we examine a game involving hash functions and show that an algorithm based on resource-bounded Solomonoff induction is outperformed by a simple strategy. This game has some parallels to Vingean reflection.


Betting environments

Consider the following family of environments (which we might call “betting environments”):

  1. For iteration $t = 1, 2, 3, \ldots$:

  2. Draw hidden state $s_t$ from an unknown distribution $\mathcal{D}_t$.

  3. Output first observation $o_t$.

  4. The agent chooses some action $a_t$.

  5. Output second observation $o'_t$.

  6. The agent observes $o'_t$ and receives reward $r_t = R(s_t, a_t)$, where $R$ is an unknown function that always returns a number between 0 and 1.

The idea of these environments is that the agent’s action affects nothing except reward, so acting well in these environments only requires predicting the next reward correctly given the agent’s action. The second observation has no apparent effect on what the agent’s strategy should be, but we’ll see later why it is included. First, we will see how one general approach to problems like this fails.
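For concreteness, here is a minimal Python sketch of this interaction protocol. The names (`draw_state`, `obs1`, `obs2`, `reward_fn`, `agent`) are placeholders for the unknown distribution, the two observation maps, the reward function, and an agent; none of them are specified by the definition above.

```python
# Minimal sketch of the betting-environment interaction loop (placeholder names).

def run_betting_environment(draw_state, obs1, obs2, reward_fn, agent, num_iterations):
    total_reward = 0.0
    for t in range(1, num_iterations + 1):
        s = draw_state(t)        # hidden state s_t, drawn from the unknown distribution
        o1 = obs1(s)             # first observation o_t
        a = agent.act(t, o1)     # the agent chooses an action a_t
        o2 = obs2(s)             # second observation o'_t
        r = reward_fn(s, a)      # reward r_t = R(s_t, a_t), a number in [0, 1]
        agent.observe(t, o1, a, o2, r)
        total_reward += r
    return total_reward
```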

A resource-bounded algorithm for betting environments

Consider the following resource-bounded variant of Solomonoff induction. Each hypothesis is represented as a program that takes as input a bit string and returns a probability (interpreted as the probability that the next bit will be 1). As in ordinary Solomonoff induction, the prior probability of a hypothesis $H$ is $2^{-\ell(H)}$, where $\ell(H)$ is the length of the program in bits. The probability that hypothesis $H$ assigns to an observation history $x_{1:n}$ is $\prod_{i=1}^{n} P_H(x_i \mid x_{1:i-1})$, i.e. the product of the conditional probabilities for each bit given the previous bits, where $P_H(1 \mid x_{1:i-1})$ is the output of $H$ on input $x_{1:i-1}$. We restrict attention to hypotheses that run in time polynomial in $n$, where $n$ is the length of the input.
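As a rough sketch of this induction step in Python (illustrative only: in the actual definition the mixture ranges over all programs satisfying the resource bound, which the snippet neither enumerates nor enforces; a hypothesis here is any function from a bit-string prefix to the probability that the next bit is 1):

```python
import math

# Illustrative-only sketch: posterior weights over an explicit, finite list of
# hypothesis programs. Each hypothesis maps a prefix (list of 0/1 bits) to the
# probability that the next bit is 1, and is assumed to return values in (0, 1).

def log_prior(program_length_bits):
    return -program_length_bits * math.log(2)      # prior 2^(-l(H))

def log_likelihood(hypothesis, bits):
    # product of the conditional probabilities of each bit given the previous bits
    total = 0.0
    for i, b in enumerate(bits):
        p_one = hypothesis(bits[:i])
        total += math.log(p_one if b == 1 else 1.0 - p_one)
    return total

def posterior_weights(hypotheses, lengths, bits):
    scores = [log_prior(l) + log_likelihood(h, bits)
              for h, l in zip(hypotheses, lengths)]
    z = max(scores)
    weights = [math.exp(s - z) for s in scores]    # normalize in a numerically stable way
    total = sum(weights)
    return [w / total for w in weights]
```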

Using this variant of Solomonoff induction, we can develop an AIXI-like algorithm for acting in betting environments. Specifically, when choosing an action, this algorithm does the following:

  1. Use resource-bounded Solomonoff induction to predict the distribution of future observations and rewards, given each action.

  2. Most of the time, pick the action that maximizes the expected next reward; otherwise, pick a random action.

The purpose of the random action is to learn the correct distribution of each counterfactual reward in the long run. For example, when choosing the action at iteration $t$, we could pick a random action with probability $1/\sqrt{t}$. This way, the rate of exploration approaches 0 over time, but the total number of exploration steps approaches infinity.
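A sketch of the resulting action rule, using the example exploration schedule just described; `predict_expected_reward` stands in for the expected next reward computed from the resource-bounded mixture and is not implemented here, and the default binary action set matches the environment used below:

```python
import math
import random

# Sketch of the AIXI-like action choice: explore with probability 1/sqrt(t),
# otherwise maximize the predicted expected next reward.

def choose_action(t, history, o1, predict_expected_reward, actions=(0, 1)):
    if random.random() < 1.0 / math.sqrt(t):
        return random.choice(actions)   # occasional random action (exploration)
    return max(actions, key=lambda a: predict_expected_reward(history, o1, a))
```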

Failure in an unpredictable environment

Here’s an example of a betting environment that this algorithm fails on. Suppose we have some family of hash functions $h_n$ for each natural $n$, such that $h_n$ returns strings of length $n$. Assume that inverting $h_n$ takes time exponential in $n$. Now consider the following betting environment:

  • The hidden states consist of pairs of binary strings: $s_t = (x_t, y_t)$.

  • The distribution of $s_t$ works as follows:

    • The second string $y_t$ is uniformly random, with length $t$.

    • 90% of the time, $x_t = h_t(y_t)$.

    • Otherwise, $x_t$ is uniformly random with length $t$.

  • The first observation is $x_t$ and the second observation is $y_t$.

  • The reward for action 0 is always 0.5; the reward for action 1 is 1 if $x_t = h_t(y_t)$ and 0 otherwise.
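Here is a sketch of this environment in the same style as the earlier interaction loop. SHA-256 truncated to $t$ bits merely stands in for the assumed hard-to-invert family $h_t$; bit strings are represented as Python strings of '0' and '1', and the reward rule is the one given above.

```python
import random
import hashlib

# Sketch of the hash betting environment. SHA-256 truncated to t bits stands in
# for the assumed hard-to-invert family h_t (adequate for t up to 256 here).

def h(t, y):
    digest = hashlib.sha256(y.encode()).digest()
    bits = ''.join(format(byte, '08b') for byte in digest)
    return bits[:t]                      # a hash value of length t

def random_bits(t):
    return ''.join(random.choice('01') for _ in range(t))

def draw_state(t):
    y = random_bits(t)                   # second string: uniformly random, length t
    x = h(t, y) if random.random() < 0.9 else random_bits(t)   # 90%: x_t = h_t(y_t)
    return (x, y)

def obs1(state):
    return state[0]                      # the agent sees the hash code x_t first...

def obs2(state):
    return state[1]                      # ...and the string y_t only after acting

def reward_fn(state, action):
    x, y = state
    if action == 0:
        return 0.5                       # action 0: guaranteed reward 0.5
    return 1.0 if x == h(len(y), y) else 0.0   # action 1: bet that x_t = h_t(y_t)
```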

Immediately, it is clear that always choosing action 1 will yield 0.9 expected reward per iteration. This is the optimal strategy for an agent that lacks the computational resources to tell anything about $y_t$ given $x_t$. I believe that if $\{h_n\}$ is an appropriate family of hash functions, this will be true for all polynomial-time algorithms in the long run.
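To spell out the arithmetic (using the length-$t$ convention above, so that a uniformly random $x_t$ matches $h_t(y_t)$ with probability $2^{-t}$):

$$\mathbb{E}[r_t \mid a_t = 1] = 0.9 \cdot 1 + 0.1 \cdot 2^{-t} \approx 0.9, \qquad \mathbb{E}[r_t \mid a_t = 0] = 0.5.$$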

Now let’s see how our resource-bounded algorithm does on this problem. After a large number of iterations, any winning hypothesis for our resource-bounded Solomonoff induction variant will satisfy:

  1. The distribution it predicts for $x_t$ is approximately uniformly random.

  2. Given the observations $x_t, y_t$ and the action $a_t$, it predicts the correct reward $r_t$ with high probability.

This is because these conditions state that the hypothesis has close to the correct conditional distributions for $x_t$ and $r_t$. If a hypothesis fails to satisfy one of these properties, we can replace it with a variant that does satisfy the property, yielding a hypothesis that is not much more complicated but assigns higher log-probability to the data, by a margin that grows without bound in the long run.

This only leaves the distribution of $y_t$ given $x_t$. Correctly predicting $y_t$ would require inverting the hash function. Given that the hypothesis must run in polynomial time, if we take many samples of $y_t$ given $x_t$ using the hypothesis, we should expect only an (exponentially) small fraction of them to satisfy $h_t(y_t) = x_t$. So any hypothesis will predict that, with high probability, $h_t(y_t) \neq x_t$.
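The following snippet (reusing `h` and `random_bits` from the environment sketch above) illustrates the point with the simplest polynomial-time sampler, uniform guessing: the fraction of sampled candidates that hash to the observed $x_t$ is about $2^{-t}$, i.e. essentially zero for moderate $t$.

```python
# Illustration: a polynomial-time sampler for y_t given x_t (here, uniform guessing)
# almost never produces a preimage of x_t. Reuses h and random_bits from above.

def preimage_hit_rate(t, x, num_samples=100000):
    hits = sum(1 for _ in range(num_samples) if h(t, random_bits(t)) == x)
    return hits / num_samples            # about 2**(-t) in expectation
```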

Given this, our algorithm will take action 0: under its predictions, action 0 achieves expected reward 0.5, while action 1 achieves an exponentially small expected reward. Obviously, this behavior is highly pathological.
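In other words, the comparison the algorithm makes at a late iteration $t$ is roughly

$$\hat{\mathbb{E}}[r_t \mid a_t = 0] = 0.5 \qquad \text{vs.} \qquad \hat{\mathbb{E}}[r_t \mid a_t = 1] \approx \hat{\Pr}[h_t(y_t) = x_t] \approx 2^{-t},$$

where the hats denote expectations under the winning hypothesis's predictions rather than under the true environment (which would give 0.5 and about 0.9 respectively).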

Conclusion

As an alternative to this algorithm, consider the same algorithm except that it completely ignores the second observation. This new algorithm will correctly predict that, about 90% of the time, action 1 yields reward 1. Therefore, it will take action 1. Intuitively, this seems like the correct behavior in general: since winning hypotheses will predict the reward given the action and the first observation as well as their resource bounds allow, they can be used in a straightforward way to estimate the expected reward for each action.
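A sketch of this modification, assuming the interaction history is stored as $(o_t, a_t, o'_t, r_t)$ tuples: it is the earlier action rule with the second observations filtered out of the history that the predictor conditions on.

```python
import math
import random

# Sketch of the modified algorithm: identical to choose_action above, except that
# the second observation is dropped from the history before prediction.

def choose_action_ignoring_second_obs(t, history, o1, predict_expected_reward, actions=(0, 1)):
    reduced_history = [(o, a, r) for (o, a, _o2, r) in history]   # drop each o'_t
    if random.random() < 1.0 / math.sqrt(t):
        return random.choice(actions)
    return max(actions, key=lambda a: predict_expected_reward(reduced_history, o1, a))
```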

The problem with our original algorithm parallels a problem with naïve approaches to Vingean reflection. If the agent’s environment contains a hash function inverter, then it may encounter a situation similar to the betting environment in this post: it will see the hash code of a string before the string itself. Since the agent is unable to predict the behavior of the hash function inverter, it must use abstract reasoning to decide how much utility to expect from each action, rather than actually simulating the hash function inverter. The same reasoning sometimes applies to environments containing smart agents other than hash function inverters.

For future research, it will be useful to see how strategies for solving the hash function betting environment (i.e. predicting expected reward 0.9 for action 1) might translate to strategies for Vingean reflection.