Limited agents need approximate induction
[This post borders on some well-trodden ground in information theory and machine learning, so ideas in this post have an above-average chance of having already been stated elsewhere, by professionals, better. EDIT: As it turns out, this is largely the case, under the subjects of the justifications for MML prediction and Kolmogorov-simple PAC-learning.]
I am fascinated by methods of thinking that work for well-understood reasons—that follow the steps of a mathematically elegant dance. If one has infinite computing power the method of choice is something like Solomonoff induction, which is provably ideal in a certain way at predicting the world. But if you have limited computing power, the choreography is harder to find.
To do Solomonoff induction, you search through all Turing machine hypotheses to find the ones that exactly output your data so far, then use the weighted average of those perfect retrodictors to predict the next time step. So the naivest way to build an ideal limited agent is to merely search through lots of hypotheses (chosen from some simple set) rather than all of them, and only run each Turing machine for time less than some limit. At least it’s guaranteed to work in the limit of large computing power, which ain’t nothing.
Suppose then that we take this nice elegant algorithm for a general predictor, and we implement it on today’s largest supercomputer, and we show it the stock market prices from the last 50 years to try to predict stocks and get very rich. What happens?
Bupkis happens, that’s what. Our Solomonoff predictor tries a whole lot of Turing machines and then runs out of time before finding any useful hypotheses that can perfectly replicate 50 years of stock prices. This is because such useful hypotheses are very, very, very rare.
We might then turn to the burgeoning field of logical uncertainty, which has a major goal of handling intractable math problems in an elegant and timely manner. We are logically uncertain about what distribution Solomonoff induction will output, so can we just average over that logical uncertainty to get some expected stock prices?
The trouble with this is that current logical uncertainty methods rely on proofs that certain outputs are impossible or contradictory. For simple questions this can narrow down the answers, but for complicated problems it becomes intractable, replacing the hard problem of evaluating lots of Turing machines with the hard problem of searching through lots and lots of proofs about lots of Turing machines—and so again our predictor runs out of time before becoming useful.
In practice, the methods we’ve found to work don’t look very much like Solomonoff induction. Successful methods don’t take the data as-is, but instead throw some of it away: curve fitting and smoothing data, filtering out hard-to-understand signals as noise, and using predictive models that approximate reality imperfectly. The sorts of things that people trying to predict stocks are already doing. These methods are vital to improve computational tractability, but are difficult (to my knowledge) to fit into a framework as general as Solomonoff induction.
Suppose that our AI builds a lot of models of the world, including lossy models. How should it decide which models are best to use for predicting the world? Ideally we’d like to make a tradeoff between the accuracy of the model, measured in the expected utility of how accurate you expect the model’s predictions to be, and the cost of the time and energy used to make the prediction.
Once we know how to tell good models, the last piece would be for our agent to make the explore/exploit tradeoff between searching for better models and using its current best.
There are various techniques to estimate resource usage, but how does one estimate accuracy?
Here was my first thought: If you know how much information you’re losing (e.g. by binning data), for discrete distributions this sets the Shannon information of the ideal value (given by Solomonoff prediction) given the predicted value. This uses the relationship between information in bits of data and Shannon information that determines how sharp your probability distribution is allowed to be.
But with no guarantees about the normality (or similar niceness properties) of the ideal value given the prediction, this isn’t very helpful. The problem is highlighted by hurricane prediction. If hurricanes behaved nicely as we threw away information, weather models would just be small, high-entropy deviations from reality. Instead, hurricanes can change route greatly even with small differences in initial conditions.
The failure of the above approach can be explained in a very general way: it uses too little information about the model and the data, only the amount of information thrown away. To do better, our agent has to learn a lot from its training data—a subject that workers in AI have already been hard at work on. On the one hand, it’s a great sign if we can eventually connect ideal agents to current successful algorithms. On the other, doing so elegantly seems like a hard problem.
To sum up in the blandest possible way: If we want to build successful predictors of the future with limited resources, they should use their experience to learn approximate models of the world.
The real trick, though, is going to be to set this on a solid foundation. What makes a successful method of picking models? As we lack access to the future (yet! Growth mindset!), we can’t grade models based on their future predictions unless we descend to solipsism and grade models against models. Thus we’re left with grading models based on how well they retrodict the data so far. Sound familiar? The foundation we want seems like an analogue to Solomonoff induction, one that works for known reasons but doesn’t require perfection.
III: An Example
Here’s a paradigm that might or might not be a step in the right direction, but at least gestures at what I mean.
The first piece of the puzzle is that a model that gets proportion P of training bits wrong can be converted to a Solomonoff-accepted perfectly-precise model just by specifying the bits it gets wrong. Suppose we break the model output (with total length N) into chunks of size L, and prefix each chunk with the locations of the wrong bits in that chunk. Then the extra data required to rectify an approximate model is at most N/L·log(P·L)+N·P·log(L). Then the hypothesis where the model is right about the next bit is simpler than the hypothesis when it’s wrong, because when the model is right you don’t have to spend ~log(L) bits correcting it.
In this way, Solomonoff induction natively cares about some approximate models’ predictions. There are some interesting details here that are outside the focus of this particular post. Does using the optimal chunk length lead to Solomonoff induction reflecting model accuracy correctly? What are some better schemes for rectifying models that handle things like models that output probabilities? The point is just that even if your model is wrong on fraction P of the training data, Solomonoff induction will still promote it as long as it’s simpler than N-N/L·log(P·L)-N·P·log(L).
The second piece of the puzzle is that induction can be done over processed functions of observations, like smoothing the data or filtering difficult-to-predict parts (noise) out. If this processing increases the accuracy of models, we can use this to make high-accuracy models of functions the training data, and then use those models to predict the the processed future observations as above.
These two pieces allow an agent to use approximate models, and to throw away some of its information, and still have its predictions work for the same reason as Solomonoff induction. We can use this paradigm to interpret what an algorithm like curve fitting is doing—the fitted curve is a high-accuracy retrodiction of some smoothed function of the data, which therefore does a good job of predicting what that smoothed function will be in the future.
There are some issues here. If a model that you are using is not the simplest, it might have overfitting problems (though perhaps you can fix this just by throwing away more information than naively appears necessary) or systematic bias. More generally, we haven’t explored how models get chosen; we’ve made the problem easier to brute force but we need to understand non-brute force search methods and what their foundations are. It’s a useful habit to keep in mind what actually works for humans—as someone put it to me recently, “humans can make models they understand that work for reasons they understand.”
Furthermore, this doesn’t seem to capture reductionism well. If our agent learns some laws of physics and then is faced with a big complicated situation it needs to use a simplified model to make a prediction about, it should still in some sense “believe in the laws of physics,” and not believe that this complicated situation violates the laws physics even if its current best model is independent of physics.
IV: Logical Uncertainty
It may be possible to relate this back to logical uncertainty—where by “this” I mean the general thesis of predicting the future by building models that are allowed to be imperfect, not the specific example in part III. Soares and Fallenstein use the example of a complex Rube Goldberg machine that deposits a ball into one of several chutes. Given the design of the machine and the laws of physics, suppose that one can in principle predict the output of this machine, but that the problem is much too hard for our computer to do. So rather than having a deterministic method that outputs the right answer, a “logical uncertainty method” in this problem is one that, with a reasonable amount of resources spent, takes in the description of the machine and the laws of physics, and gives a probability distribution over the machine’s outputs.
Meanwhile, suppose that we take an approximately inductive predictor and somehow teach it the the laws of physics, then ask it to predict the machine. We’d like it to make predictions via some appropriately simplified folk model of physics. If this model gives a probability distribution over outcomes—like in the simple case of “if you flip this coin in this exact way, it has a 50% shot at landing heads”—doesn’t that make it a logical uncertainty method? But note that the probability distribution returned by a single model is not actually the uncertainty introduced by replacing an ideal predictor with a resource-limited predictor. So any measurement of logical uncertainty has to factor in the uncertainty between models, not just the uncertainty within models.
Again, we’re back to looking for some prediction method that weights models with some goodness metric more forgiving than just using perfectly-retrodicting Turing machines, and which outputs a probability distribution that includes model uncertainty. But can we apply this to mathematical questions, and not just Rube Goldberg machines? Is there some way to subtract away the machine and leave the math?
Suppose that our approximate predictor was fed math problems and solutions, and built simple, tractable programs to explain its observations. For easy math problems a successful model can just be a Turing machine that finds the right answer. As the math problems get more intractable, successful models will start to become logical uncertainty methods, like how we can’t predict a large prime number exactly, but we can predict it’s last digit is 1, 3, 7, or 9. Within this realm we have something like low-level reductionism, where even though we can’t find a proof of the right answer, we still want to act as if mathematical proofs work and all else is ignorance, and this will help us make successful predictions.
Then we have complicated problems that seem to be beyond this realm, like P=NP. Humans certainly seem to have generated some strong opinions about P=NP without dependence on mathematical proofs narrowing down the options. It seems to such humans that the genuinely right procedure to follow is that, since we’ve searched long and hard for a fast algorithm for NP-complete problems without success, we should update in the direction that no such algorithm exists. In approximate-Solomonoff-speak, it’s that P!=NP is consistent with a simple, tractable explanation for (a recognizable subset of) our observations, while P=NP is only consistent with more complicated tractable explanations. We could absolutely make a predictor that reasons this way—it just sets a few degrees of freedom. But is it the right way to reason?
For one thing, this seems like it’s following Gaifman’s proposed property of logical uncertainty, that seeing enough examples of something should convince you of it with probability 1 - which has been shown to be “too strong” in some sense (it assigns probability 0 to some true statements—though even this could be okay if those statements are infinitely dilute). Does the most straightforward implementation actually have the Gaifman condition, or not? (I’m sorry, ma’am. Your daughter has… the Gaifman condition.)
This inductive view of logical uncertainty lacks the consistent nature of many other approaches—if it works, it does so by changing approaches to suit the problem at hand. This is bad if you want your logical uncertainty methods to be based on a simple prior followed by some kind of updating procedure. But logical uncertainty is supposed to be practical, after all, and at least this is a simple meta-procedure.
Thanks for reading this post. In conclusion, here are some of my questions:
What’s the role of Solomonoff induction in approximate induction? Is Solomonoff induction doing all of the work, or is it possible to make useful predictions using tractable hypotheses Solomonoff induction would exclude, or excluding intractable hypotheses Solomonoff induction would have to include?
Somehow we have to pick out models to promote to attention in the first place. What properties make a process for this good or bad? What methods for picking models can be shown to still lead to making useful predictions—and not merely in the limit of lots of computing time?
Are humans doing the right thing by making models they understand that work for reasons they understand? What’s up with that reductionism problem anyhow?
Is it possible to formalize the predictor discussed in the context of logical uncertainty? Does it have to fulfill Gaifman’s condition if it finds patterns in things like P!=NP?