# Asymptotic Logical Uncertainty: Solomonoff Induction Inspired Approach

This post is part of the Asymptotic Logical Uncertainty series. It is about a failed attempt to satisfy the Benford test. I will not prove things about this approach; the only reason I am including it is that the fact that it does not work is surprising and insightful.

Let Geometric(1/2) be the random variable which outputs the positive integer n with probability 2^{-n}.

Fix a prefix-free encoding of probabilistic Turing machines, and let |M| denote the number of bits used to encode M. Let RTM(k) be the random variable which outputs the probabilistic Turing machine M with probability proportional to 2^{-k|M|}.
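As a concrete illustration, here is a toy Python sketch of these two samplers. The finite list of (length, machine) pairs is a hypothetical stand-in for a true prefix-free enumeration of all probabilistic Turing machines:

```python
import random

def geometric_half():
    """Sample Geometric(1/2): flip a fair coin until it comes up heads;
    return the number of flips.  P(n) = 2^-n for n = 1, 2, 3, ..."""
    n = 1
    while random.random() < 0.5:
        n += 1
    return n

def sample_rtm(machines, k=3):
    """Toy RTM(k): sample a machine with probability proportional to
    2^(-k * |M|), where |M| is the encoding length in bits.
    `machines` is a hypothetical finite list of (encoding_length, machine)
    pairs standing in for a prefix-free encoding of all machines."""
    weights = [2.0 ** (-k * length) for length, _ in machines]
    r = random.uniform(0, sum(weights))
    for w, (_, m) in zip(weights, machines):
        if r < w:
            return m
        r -= w
    return machines[-1][1]  # guard against floating-point round-off
```

Sampling proportional to 2^{-k|M|} with k = 3 is what the algorithm below calls RTM(3).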

Fix E, a deterministic Turing machine (for example L). Fix a time complexity function T and another time complexity function R. Consider the following algorithm.

```
SIA(E,T,R)
1: n=0
2: N=1
3: M=RTM(3)
4: g=Geometric(1/2)
5: loop
6: if 4^n*R(N)*log(R(N))*N<T(N) then
7: run E for one more time step
8: if E outputs another bit then
9: n=n+1
10: repeat
11: M=RTM(3)
12: g=Geometric(1/2)
13: until M^{gR}(i)=E(i) for all i<=n
14: output M^{gR}(N)
15: N=N+1
```

This code is inspired by Solomonoff Induction. We cannot directly apply Solomonoff induction, because of the lack of a time bound. Solomonoff induction would pick out the program E itself, but that would not allow us to predict E(N) quickly. We have to restrict the run times of the programs to ensure that they can compute their Nth bit in time T(N).

When trying to predict E(N), we compute the first n values of E for some n much smaller than N. We then sample probabilistic Turing machines until we find one that quickly gets all of these first n bits correct. We use that Turing machine to compute our guess at E(N).

The purpose of the geometric random variable g is to make the time we allow the sampled Turing machines to run more flexible. The sampled Turing machines can take any amount of time in O(R(N)), but get an extra penalty for the size of the constant. The fact that we use RTM(3) instead of RTM(1) is to make the proof that SIA satisfies the weak Benford test work; RTM(1) would probably work too. Line 6 is only there to make sure SIA runs in time T(N).
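To make the loop's structure concrete, here is a heavily simplified Python sketch of one prediction step. The candidate machines are modeled as hypothetical callables taking a step budget (rather than clocked probabilistic Turing machines), and the sampler and complexity function are left as parameters:

```python
import random

def sia_step(E_bits, sample_machine, R, N, max_tries=10000):
    """One prediction step of the SIA rejection-sampling loop (toy sketch).

    E_bits: the first n observed bits of E (already computed).
    sample_machine: draws a candidate; here, a hypothetical function
        machine(i, budget) -> bit that must answer within `budget` steps.
    R: time complexity function; the candidate gets a budget of g*R(N),
        where g ~ Geometric(1/2).

    Rejection-sample until a candidate reproduces all observed bits,
    then use that candidate to guess bit N.
    """
    for _ in range(max_tries):
        M = sample_machine()
        g = 1
        while random.random() < 0.5:  # g ~ Geometric(1/2)
            g += 1
        budget = g * R(N)
        if all(M(i, budget) == b for i, b in enumerate(E_bits)):
            return M(N, budget)
    raise RuntimeError("no sampled machine matched the observed bits")
```

In the real algorithm the candidates are drawn from RTM(3) and the budget is enforced by clocking their execution; here both are stand-ins.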

SIA passes the weak Benford test, but probably fails the Benford test.

First, let me explain why we thought this approach would work. If you sample a Turing machine M that is very likely to get all of the bits correct, but gives suboptimal probabilities on the Benford test sentences, then you can modify that program to form a program M′ that outputs 1 with probability log10(2) for the Benford test sentences, and follows M otherwise.

This modified program M′ will be just as likely to get all the non-Benford sentences correct, and more likely to get the Benford sentences correct. Further, this improvement on the Benford sentences is unbounded as you consider more and more sentences, and therefore eventually pays for all of the cost associated with replacing M with M′.
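A minimal sketch of this M-to-M′ modification, with `is_benford_sentence` as a hypothetical predicate identifying the Benford test sentences:

```python
import math
import random

BENFORD_P = math.log10(2)  # probability that a "first digit is 1" sentence is true

def benfordize(M, is_benford_sentence):
    """Build M': on Benford test sentences, answer 1 with probability
    log10(2); on every other input, defer to the base machine M."""
    def M_prime(i):
        if is_benford_sentence(i):
            return 1 if random.random() < BENFORD_P else 0
        return M(i)
    return M_prime
```

Off the Benford subsequence, M′ behaves exactly like M, so the only change in its likelihood comes from the Benford sentences, where log10(2) is the optimal constant probability.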

The problem is that while M′ is more likely to get the Benford sentences correct, we are not optimizing for programs that are likely to get the Benford sentences correct. Instead, we are optimizing for programs that are likely to get the Benford sentences correct conditioned on getting all the other questions correct.

At first, this may not seem like much of a problem. The programs are entangling their results on some inputs with their results on other inputs. The problem comes from the fact that sometimes a program entangles its result with a much earlier sentence, and when the program was answering that earlier sentence it had much less time to think.

To give a concrete example of why this is a problem, consider the sentence tn := "The first digit of 3↑↑n is a 1 if and only if the first digit of 3↑↑A(n) is a 1," where A is the Ackermann function.

The probability we should assign to this sentence is P := B^2 + (1−B)^2, where B = log10(2) is the Benford probability. However, by the time we have to assign a probability to the A(n)th Benford sentence, we may have already calculated the first digit of 3↑↑n and seen that it is a 1. In which case, to maximize the probability of getting all the bits correct, we have to have our answer to the A(n)th Benford sentence match our answer to tn.

The correct thing to do would be to update the probability assigned to tn when observing that the first digit of 3↑↑n is a 1. However, that is not possible, so the next best thing is to change the answer to the A(n)th Benford sentence, which causes it to give the wrong answer.

I believe that the fact that this program does not work can be extended to an impossibility result saying that any program which loops through Turing machines, gives them a score based on how many answers they get correct, and samples according to their complexity and their score, must fail the Benford test.

I’m not sure the counterexample given here works as stated. What follows is my attempt to go through the details. (I still think SIA is doing something wrong and is unlikely to work, but it is important to get the details right to be sure.)

As I understood it, the problem was as follows. (Most of this is just re-wording of things you said in the post.)

We wanted to design SIA such that if there is an optimal heuristic for a quickly computable sub-sequence of E, then it would learn to apply that heuristic to those problems. In particular, if the Benford sentences are embedded in a larger sequence such as L, it should use Benford probabilities on that subset.

SIA fails to achieve this sub-sequence optimality because the “objective function” is not decoupled: Bayesian updating “incentivizes” a high joint score, not a high score on individual questions. In particular the program “wants” to condition on its getting all previous questions in sequence correct.

As we’ve discussed extensively in person, this incentive gives an advantage to programs which answer with 1s and 0s deterministically rather than probabilistically. (The stochastic Turing machines will learn to act like deterministic TMs.) The programs want badly to be able to remember their previous answers, to be able to update on them. They can do this by using extra bits in their description length to “memorize” answers, rather than generating answers stochastically. This is worthwhile for the programs even as we make program-description bits more expensive (using RTM(3) rather than RTM(1)), because a memorized logical fact can be used to get correct answers on so many future logical facts. In effect, giving 0s and 1s rather than stochastic answers is such a good strategy that we cannot incentivize programs out of this behavior (without totally washing out the effect of learning).

Rather than gaining traction by giving answers with Benford probabilities, programs gain traction by using appropriate description languages in memory, such that the prior on programs will assign Benford probabilities to the different extensions of a program, purely as a matter of program length. This allows SIA to give good probabilities even though the programs in its mixture distribution are learning not to do so.

Having understood this part of the problem, let’s discuss the example you give in the post.

You consider three sentences: ϕsn, ϕsA(n), and tn:=ϕsn↔ϕsA(n). We assume that these are interspersed in E, and that SIA has already been trained up to a large n on this kind of problem; we wish to show that the answers for the subsequence ϕx depend in a problematic way on the answers for subsequence tx.

The argument seems to be: suppose that the question ordering is such that ϕsn and tn are considered long before ϕsA(n). Now, when considering ϕsA(n), the programs will have a lot more time; in particular, they have time to compute the actual answer to ϕsn from scratch, and also have time to call themselves on ϕsn and tn to see what their earlier selves answered for those questions.

We note that the probability P we could independently give to tn is a specific quantity based on Benford probabilities, but not a Benford probability itself.

Now I’m going to start filling in parts of the argument which I don’t see spelled out.

We assume that when considering ϕsA(n), SIA has enough time to calculate ϕsn as an explicit training example. It turns out that this sentence is true. All programs which guessed that it was false are eliminated from the set of possible programs to use on ϕsA(n).

Now, when calling themselves to see what they said for simpler problems, the programs can potentially see that they guessed 1 for ϕsn. They can also see their guess for tn. Some of the programs will have guessed true and some false. Assume that they answer true and false with proportion P. I’ll call this assumption (*) for later reference.

The programs have enough information to deduce the answer to ϕsA(n) which is consistent with their answers so far. ϕsn is true, so they can simply check what they replied on tn and reply the same thing for ϕsA(n). This is better than guessing with the Benford probability, because programs which guessed the wrong answer for tn will be eliminated later anyway. By assumption (*), we conclude that the probability SIA assigns approaches P as n increases.

Can we justify assumption (*)? Suppose that when the program considers tn, it does not have time to look up the answer it gave for ϕsn. Then its best bet is to answer using the Benford assumption on ϕsn and ϕsA(n), resulting in the probability estimate P.

But this seems potentially unrealistic. The program has learned to memorize ϕsn. It can find what it answered in the time it takes to do a lookup. In this case, the programs are better off having guessed in Benford proportion, conditioned on ϕsn being true. (They also guess in reverse of Benford proportion for the case when ϕsn is false, but remember that (*) was an assumption specifically about the proportion conditioning on ϕsn being true.)

I believe your concern comes from the fact that at the time the program has to assign a probability to ϕsA(n), it has not only deduced the truth of ϕsn but it also earlier guessed at the truth value of ϕsn. When it guesses here, it loses some probability mass, but it can lose some of that probability mass in a way that is correlated to the answer it gave to tn. This way, it can still give the correct probability on ϕsA(n).

Here is my fix: instead of L, consider the case where we are trying to guess only sentences of the form ϕsA(n) and tn for some n; that is, we modify L to reject any sentence not of that form. Both of these subsequences are indistinguishable from coin flips with fixed probabilities. In this case, SIA will not get the correct probabilities on both subsequences, because it has an incentive to make its answers to ϕsA(n) match its answers to tn (or not match, when ϕsn is false), and any program that does not make them match will be trumped by one that does.

This does not mean that we have this property when we consider all of L, but the code in no way depends on E, and I see no reason to think that it would work for L but not for the modified L.

I agree, this version works.

To walk through it in a bit more detail:

Now we are only considering two sentence schemas, ϕsA(n) and tn:=ϕsn↔ϕsA(n). (Also, ignore the (rare) case where n is an Ackermann number.)

I’ll call the Benford probability B := log10(2), and (as before) the tn probability P := B^2 + (1−B)^2.
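As a quick numerical check of these two constants:

```python
import math

B = math.log10(2)        # Benford probability of a leading digit 1
P = B**2 + (1 - B)**2    # probability of the biconditional t_n
print(round(B, 3), round(P, 3))  # 0.301 0.579
```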

At the time when SIA considers tn, we assume it does not give its sampled programs enough time to solve either ϕsn or ϕsA(n). (This assumption is part of the problem setup; it seems likely that cases like this cannot be ruled out by a simple rule, though.) The best thing the programs can do is treat the tn like coin flips with probability P.

At the time when ϕsA(n) is considered, the program has enough time to compute ϕsn (again as part of the problem setup). It can also remember what guess it made on tn. The best thing it can do now is to logically combine those to determine ϕsA(n). This causes it to not treat ϕsA(n) like a random coin. For cases where ϕsn=true, the population of sampled programs will guess ϕsA(n)=true with frequency approaching P. For cases where ϕsn=false, the frequency will be 1−P.

This behavior is the optimal response to the problem as given to SIA, but is suboptimal for what we actually wanted. The Bayes score of SIA on the subsequence consisting of only the ϕsA(n) is suboptimal. It will average out to probability B, but continue to be higher and lower for individual cases, without actually predicting those cases more effectively; SIA is acting like it thinks there is a correlation between ϕsn and ϕsA(n) when there is none. (This is especially odd considering that SIA isn’t even being asked to predict ϕsn in general, in this case!)
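These conditional frequencies can be checked with a small Monte Carlo over a population of toy programs: each program independently guesses tn = true with probability P, then answers ϕsA(n) so as to stay consistent with its own tn guess once ϕsn is revealed. (The population model and sample size are assumptions for illustration.)

```python
import math
import random

def deduced_guess_frequency(phi_sn_value, num_programs=100000, seed=0):
    """Fraction of toy programs that guess phi_sA(n) = true, given that
    each guessed t_n = true with probability P and then answers
    phi_sA(n) consistently with t_n and the revealed value of phi_sn
    (phi_sA(n) is true exactly when t_n and phi_sn agree)."""
    rng = random.Random(seed)
    B = math.log10(2)
    P = B**2 + (1 - B)**2
    true_count = 0
    for _ in range(num_programs):
        t_guess = rng.random() < P          # earlier guess on t_n
        guess = (t_guess == phi_sn_value)   # consistent answer for phi_sA(n)
        true_count += guess
    return true_count / num_programs
```

When ϕsn is true the frequency comes out near P ≈ 0.579; when false, near 1−P ≈ 0.421 — in neither case the Benford probability B ≈ 0.301.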

This is still not a proof, but it looks like it could be turned into one.

I’m hoping writing it out like this unpacks some of the mutual assumptions Scott and I share as a result of talking about this.

No, we cannot justify (*). In fact, (*) will not even hold. However, if (*) does not hold, I think that is just as bad as failing the Benford test. The tn sentences are themselves a sequence that is indistinguishable from a sequence coming from a weighted coin. Therefore, failing to assign probability P to the sentences tn is a strong sign that the code will also give the wrong probability to ϕsn. The two are not qualitatively different.

A formal proof of why it fails is not written up, but if it is, the conclusion will be that either the ϕsn or the tn have incorrect limiting probabilities.

I’m worried about whether we can even formalize the argument that the algorithm passes the weak Benford test, let alone that it fails the strong Benford test. Yes, any particular non-Benfordian M would eventually get outclassed by M′, but this may happen slowly enough that the winner at every particular stage is a different non-Benfordian algorithm...

The fact that it passes weak Benford should not be clear from this post at all, and you are correct to be skeptical from what I showed you. The complete proof is not written up in a nice way yet, because the other algorithm I will share soon is much more important, and I have been focusing on that.

The argument you present is something I am very aware of. The answer is that if there were a sequence of different non-Benfordian algorithms that did increasingly well, then the algorithm A that picks a random algorithm according to complexity and runs it would also do better than Benford (or at least not be in big O of how well Benford does), just by being able to sample the algorithms in the infinite sequence.

Making the above argument work is actually the reason I use RTM(3), not RTM(1). I need the hypothetical random algorithm A to discount later algorithms significantly less than the sampling procedure.

I think the fact that this fails strong Benford is very interesting, and I want to write more about that. I agree that I have not formally shown that it fails strong Benford, and I don't even have a proof of this. All I know is that if you replace L with something that looks at a particular subset of sentences containing the Benford test sentences, this approach fails strong Benford in that domain.

However, I do not think the proof that this satisfies weak Benford is all that important. Weak Benford really is a weak test, and passing it is not that impressive.