A problem with resource-bounded Solomonoff induction and unpredictable environments
Summary: we examine a game involving hash functions and show that an algorithm based on resource-bounded Solomonoff induction is outperformed by a simple strategy. This game has some parallels to Vingean reflection.
Betting environments
Consider the following family of environments (which we might call “betting environments”):
For iteration i=0,1,...:
Draw hidden state (s1,s2)∈S from an unknown distribution ps(i).
Output the first observation s1.
The agent chooses some action a∈A.
Output the second observation s2.
The agent observes and receives reward r(s1,s2,a), where r is an unknown function that always returns a number between 0 and 1.
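As a concrete illustration, the interaction protocol above can be sketched as a small simulation loop (the function names and the toy coin instance below are my own, not from the post):

```python
import random

def run_betting_environment(draw_state, reward, agent, iterations):
    """Generic interaction loop for a betting environment (a sketch).

    draw_state(i) -> (s1, s2): hidden state for iteration i, from p_s(i).
    agent(s1)     -> action:   the agent acts after seeing only s1.
    reward(s1, s2, a) -> float in [0, 1].
    Returns the average reward per iteration.
    """
    total = 0.0
    for i in range(iterations):
        s1, s2 = draw_state(i)
        a = agent(s1)
        total += reward(s1, s2, a)  # the agent sees s2 together with the reward
    return total / iterations

# Toy instance for illustration: the hidden state is a fair coin shown twice,
# and the agent bets on the value it saw as s1.
random.seed(0)
def draw_coin(i):
    c = random.randint(0, 1)
    return c, c

avg = run_betting_environment(draw_coin,
                              reward=lambda s1, s2, a: float(a == s2),
                              agent=lambda s1: s1,
                              iterations=1000)
# Since s1 always equals s2 here, this strategy earns reward 1 every round.
```

Note that the agent's action feeds only into the reward, never into future states, which is exactly the property that makes "acting well" reduce to "predicting reward given the action."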
The idea of these environments is that the agent’s action affects nothing except reward, so acting well in these environments only requires predicting the next reward correctly given the agent’s action. The second observation has no apparent effect on what the agent’s strategy should be, but we’ll see later why it is included. First, we will see how one general approach to problems like this fails.
A resource-bounded algorithm for betting environments
Consider the following resource-bounded variant of Solomonoff induction. Each hypothesis is represented as a program q that takes as input a bit string and returns a probability (interpreted as the probability that the next bit will be 1). As in ordinary Solomonoff induction, the prior probability of a hypothesis q is 2^−length(q). The probability that hypothesis q assigns to an observation history b1b2...bn is ∏_{i=1}^n (if bi=1 then q(b1:i−1) else 1−q(b1:i−1)), i.e. the product of the conditional probabilities of each bit given the previous bits. We restrict attention to programs q that run in time O(n^2), where n is the length of the input.
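The scoring rule above, the product of conditional bit probabilities, can be sketched as follows (in log space to avoid underflow; the O(n^2) time restriction and the 2^−length(q) prior are not modeled here):

```python
import math

def hypothesis_log_prob(q, bits):
    """Log-probability that hypothesis q assigns to a bit string.

    q maps a prefix (a tuple of bits) to P(next bit = 1); the probability
    of the whole string is the product of these conditional probabilities,
    computed here in log space to avoid underflow.
    """
    total = 0.0
    for i, b in enumerate(bits):
        p1 = q(tuple(bits[:i]))
        total += math.log(p1) if b == 1 else math.log(1.0 - p1)
    return total

# A one-line hypothesis that always predicts a fair coin:
fair = lambda prefix: 0.5
lp = hypothesis_log_prob(fair, [1, 0, 1, 1])
# Four bits at probability 1/2 each gives log(1/16).
```

A hypothesis's posterior weight would then combine this score with its length prior, but that bookkeeping is omitted here.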
Using this variant of Solomonoff induction, we can develop an AIXI-like algorithm for acting in betting environments. Specifically, when choosing an action, this algorithm does the following:
Use resource-bounded Solomonoff induction to predict the distribution of future observations and rewards, given each action.
Most of the time, pick the action that maximizes the expected next reward; otherwise pick a random action.
The purpose of the random action is to learn the correct distribution for each counterfactual reward r(s1,s2,a) in the long run. For example, at iteration i we could pick a random action with probability 1/(i+1). This way, the rate of exploration approaches 0 over time, but the total number of exploration steps approaches infinity.
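The 1/(i+1) exploration schedule can be sketched as a small epsilon-greedy helper (a hypothetical illustration; the predictor supplying best_action is assumed):

```python
import random

def choose_action(i, best_action, actions, rng=random):
    """Epsilon-greedy choice with a 1/(i+1) exploration schedule.

    With probability 1/(i+1), pick a uniformly random action (exploration);
    otherwise pick the action the predictor currently rates best.  The sum
    of 1/(i+1) over all i diverges, so exploration never stops entirely,
    even though the exploration rate itself tends to 0.
    """
    if rng.random() < 1.0 / (i + 1):
        return rng.choice(actions)
    return best_action

random.seed(1)
picks = [choose_action(i, best_action=1, actions=[0, 1]) for i in range(10_000)]
late_fraction = sum(picks[5_000:]) / 5_000
# In late iterations the greedy action is chosen almost every time.
```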
Failure in an unpredictable environment
Here’s an example of a betting environment on which this algorithm fails. Suppose we have a family of hash functions hi, one for each natural number i, such that hi returns strings of length i. Assume that inverting hi takes time Ω(2^i). Now consider the following betting environment:
S consists of pairs of binary strings
ps(i) works as follows:
The distribution of the second string s2 is uniformly random, with length i.
90% of the time, s1=hi(s2)
Otherwise s1 is uniformly random with length i.
A={0,1}
r(s1,s2,0)=0.5
r(s1,s2,1)=if hn(s2)=s1 then 1 else 0, where n is the length of s2
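This environment can be sketched concretely, using truncated SHA-256 as a stand-in for the hash family hi (an assumption for illustration; the post's argument additionally requires that inverting hi takes Ω(2^i) time, which we simply posit):

```python
import hashlib
import random

def h(i, s):
    """Stand-in for the hash family h_i: returns an i-bit string.

    Truncated SHA-256 is only an illustration; the argument in the post
    assumes inverting h_i takes time exponential in i.
    """
    digest = hashlib.sha256(s.encode()).hexdigest()
    bits = bin(int(digest, 16))[2:].zfill(256)
    return bits[:i]

def draw_state(i, rng):
    """One draw of the hidden state (s1, s2) from p_s(i)."""
    s2 = "".join(rng.choice("01") for _ in range(i))
    if rng.random() < 0.9:
        s1 = h(i, s2)  # 90% of the time, s1 is the hash of s2
    else:
        s1 = "".join(rng.choice("01") for _ in range(i))  # otherwise uniform
    return s1, s2

def reward(s1, s2, a):
    if a == 0:
        return 0.5
    return 1.0 if h(len(s2), s2) == s1 else 0.0

rng = random.Random(0)
avg_reward = sum(reward(*draw_state(20, rng), 1) for _ in range(2000)) / 2000
# Always playing action 1 earns close to 0.9 per iteration.
```

Computing h forward is cheap, so the reward is easy to verify once s2 arrives; the hard part is predicting it from s1 alone.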
Immediately, it is clear that always choosing action 1 will yield 0.9 expected reward per iteration. This is the optimal strategy for an agent that lacks the computational resources to tell anything about r(s1,s2,1) given s1. I believe that if hi is an appropriate family of hash functions, this will be true of all polynomial-time algorithms in the long run.
Now let’s see how our resource-bounded algorithm does on this problem. After a large number of iterations, any winning hypothesis for our resource-bounded Solomonoff induction variant will satisfy:
It predicts that s1 is approximately uniformly distributed.
Given observations s1, s2 and action a, it predicts the correct reward r(s1,s2,a) with high probability.
This is because these conditions state that the hypothesis has close to the correct conditional distributions for s1 and r. If a hypothesis fails to satisfy one of these properties, we can replace it with a variant that does satisfy the property, yielding a hypothesis that assigns infinitely higher log-probability to the data (in the long run) without being much more complicated.
This leaves only the distribution of s2 given s1. Correctly predicting s2 would require inverting the hash function. Given that the hypothesis must run in polynomial time, if we take many samples of s2 given s1 using the hypothesis, we should expect an (exponentially) small fraction of them to satisfy hn(s2)=s1. So any hypothesis will predict that, with high probability, hn(s2)≠s1.
Given this, our algorithm will take action 0, because it achieves expected reward 0.5, while action 1 achieves an exponentially small expected reward. Obviously, this behavior is highly pathological.
Conclusion
As an alternative to this algorithm, consider the same algorithm except that it completely ignores observation s2. This new algorithm will correctly predict that, about 90% of the time, action 1 yields reward 1. Therefore, it will take action 1. Intuitively, this seems like the correct behavior in general. Since winning hypotheses will predict reward well given s1 and the action, subject to resource bounds, they can be used in a straightforward way to estimate the expected reward of each action.
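The s2-ignoring estimator can be sketched as follows, with a plain empirical average standing in for the induction-based predictor (an assumption for illustration; a fuller version would also condition on s1):

```python
import random

def estimate_reward(history, action, default=0.5):
    """Empirical estimate of E[reward | action], ignoring s2 entirely.

    history is a list of (s1, action, reward) records.  A plain average
    over past rewards for this action stands in for the induction-based
    predictor used in the post.
    """
    rs = [r for (s1, a, r) in history if a == action]
    return sum(rs) / len(rs) if rs else default

# Hypothetical records from the hash environment: action 1 paid out 1
# on about 90% of tries, while action 0 always paid 0.5.
rng = random.Random(2)
history = [("s1", 1, 1.0 if rng.random() < 0.9 else 0.0) for _ in range(5_000)]
history += [("s1", 0, 0.5)] * 5_000
est1 = estimate_reward(history, 1)  # close to 0.9
est0 = estimate_reward(history, 0)  # exactly 0.5
# The s2-ignoring agent therefore prefers action 1.
```

Because the estimator never tries to model s2 given s1, it is never penalized for failing to invert the hash, which is exactly what rescues it here.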
The problem with our original algorithm parallels a problem with naïve approaches to Vingean reflection. If the agent’s environment contains a hash function inverter, then it may encounter a situation similar to the betting environment in this post: it will see the hash code of a string before the string itself. Since the agent is unable to predict the behavior of the hash function inverter, it must use abstract reasoning to decide how much utility to expect from each action, rather than actually simulating the hash function inverter. The same reasoning sometimes applies to smart agents other than hash function inverters.
For future research, it will be useful to see how strategies for solving the hash-function betting environment (i.e. predicting expected reward 0.9 for action 1) might translate into strategies for Vingean reflection.