Program Search and Incomplete Understanding

epistemic status: crystallizing a pattern

When trying to solve some foundational AI problem, there is a series of steps on the road to implementation that goes something like this (and steps may be skipped):

First, there is developing some philisophical solution to the problem. The early decision theory work on TDT falls in this category. Making progress on this step involves developing the ability to take an idea and find holes in it, although a lot of the thought here isn’t quite precise enough to be mathematized.

Second, there is developing some solution to the problem, given infinite computing power. If there is some algorithm to solve the problem with a halting oracle or a reflective oracle, then this step is finished. Being able to translate something from philosophy to math is extremely important, and there’s no official training method for this skill. Inventing AIXI for the first time would lie in this level. Often, by crystallizing some idea into math, a bunch of subtle little holes will appear that aren’t findable by philisophical thought, although once they’re found, it’s generally possible to translate them back into intuitive philisophical terms.

Third, there is developing some sort of very impractical solution to the problem. The algorithm from the logical induction paper falls in this category, and so do brute-force approaches to solving 3SAT or playing go by building a game tree 50 moves ahead. Going from “doable with infinite computing power” to “doable with very large amounts of computing power” is generally easier than crystallizing an idea into an algorithm in the first place.

Fourth, there is developing some sort of poly-time solution to the problem. Generating an algorithm that solves practical 3SAT instances falls in this category. I’m unsure, but this seems to be the step that involves the most technical and brute-force math work, because it requires characterizing easily exploitable patterns, as well as features that make a problem hard, in order to pick as much low-hanging (polynomial) fruit as you can. Also, there generally aren’t proofs that the algorithm is optimal among the set of poly-time algorithms, there are probably going to be enough advances available to keep a sub-sub-field going for a while. Computational complexity theory is the most applicable in this stage.

Fifth, there is developing a practical solution to the problem, ie, a poly-time solution of degree 3 or 2 (or a linear-time algorithm or linear-log algorithm for very large datasets). This feels like an extension of the fourth step, although this stage is more likely to be characterized by relaxing to a probabilistic, approximately correct algorithm (compare how an optimal binary search tree takes $O (n^{2})$ time to generate, while generating an approximately optimal binary search tree takes $O (n)$ time).

There’s a special class of the third level that I haven’t seen discussed, however, and it seems to be a very important class in practice. Program Search Algorithms. In general, it’s *really* hard to show that some algorithm is optimal, and one of the only ways to get an optimality result is to assume that the algorithm is iterating through all Turing Machines, and doing something with the best ones found so far. So the class of “program search algorithms” is comprised of any algorithm that iterates through all Turing Machines, in order to get good provable asymptotic properties. Here are some examples:

Levin’s Universal Search: This simulates all other turing machines, in a way that gives the $n$ ’th turing machine a $\frac{1}{n^{2}}$ fraction of the computational resources. So, if there’s some turing machine to solve an NP problem in $O (f (n))$ time, Levin’s Universal Search will be able to solve it in $O (f (n))$ time as well, although with an absolutely god-awful constant factor slowdown that makes it useless in practice.

Logical Induction: this simulates all turing machines, and has them act as traders that can go into debt from losing money, and it runs a stock market on mathematical sentences, where the “trading mass” of a trader drops exponentially w.r.t the worst-case debt it has ever accumulated, as well as its complexity.

Policy Selection: This has a weak logical inductor selecting a policy from an infinite set of policies, for a strong logical inductor to use. It’s essentially searching over the space of all turing machines (interpreted as policies), for the best policy to use on some sequence of problems. It can be thought of as a brute-force implementation of self-modification (although one which would be useless in reality due to the danger of selecting a stronger policy which gets rid of the weak logical inductor and just locks in the existing policy. Also due to being hideously slow)

AIXItl: This is a brute-force implementation of Solomonoff Induction, by iterating through Turing Machines and having the probability distribution composed of Turing Machines that do well at predicting the sequence of incoming bits.

The Universal $F_{u n i}$ -Optimal Estimator: This is sort of like a hybrid between Levin Universal Search, and a brute-force agnostic PAC learner, for the task of finding the best poly-time algorithm to estimate what a long computation will output. It’s Theorem 5.1 in the linked paper.

These examples are the type specimens for the class of Program Search Algorithms, which lie somewhere around level 3 in the hierarchy. These aren’t really satisfying, for two big reasons.

First, program search takes a really really long time, unless the space of programs has some especially nice property that lets you search more effectively through hypothesis-space than brute-force enumeration of turing machines (such as neural networks). So it won’t work in practice without further insights.

Second, they’re secretly blackbox algorithms. They’re punting the computational work of solving the problem off to some turing machine of unknown properties. If there’s some poly-time method to get good performance on practical 3SAT instances, it would probably get picked up (eventually) by some of these algorithms, but there’s a very important sense in which you don’t actually understand how 3SAT is getting solved. There are no gears to the solution. The turing machines may fail on important but rare instances, they may have exploitable biases, they may have figured out some way to goodhart the scoring measure… Not only are you running a huge number of untrusted computations with no tidy guarantees, you’re also dodging the hard part of explicitly coming up with heuristics to solve the tractible instances of the intractible problem. The survey propagation algorithm for 3SAT and Levin Universal Search are very very different.

Getting a Program Search Algorithm is helpful because it means that you have developed some sort of scoring measure to identify whether something does better or worse on the problem, but it completely dodges the hard part of opening the damn black box which contains comprehensible ways to make headway on the problem. In computational complexity terms, it’s kind of like having a verifier for a NP problem, without a generator of solutions.

Also, as mentioned in step 4, there’s a very important reason why understanding gets stuck at this stage. It’s generally possible to prove some sort of asymptotic optimality theorem for a Program Search Algorithm. This does not apply for having an explicit algorithm for whatever problem you are trying to solve. You may be able to prove stuff about how your explicit algorithm behaves, sure, but you also don’t know if there’s important exploitable structure you’re missing. Proving lower bounds in computational complexity theory is really hard. When you go beyond program search you (almost certainly) go beyond the realm of accessible optimality proofs.

May this new category serve you well.