There is another problem with Solomonoff Induction: overfitting. It considers only hypotheses which predict the past data perfectly, and it predicts future data based only on those.
If a Solomonoff inductor (like AIXI) observes its environment with a digital camera, it will only consider predictions about the future made by hypotheses which perfectly predict the past camera data, including random noise.
But the noise is largely irrelevant to what the future will most likely be. Maybe there is a much simpler hypothesis which retrodicts the past data only approximately but makes much better predictions, because the most successful theories are those which trade off the simplicity of a hypothesis against its fit to the “training sample”, in this case the data observed in the past.
It is often said that successful prediction has a lot to do with compression. But it has a lot to do with lossy compression, not lossless compression. Solomonoff Induction relies entirely on lossless compression, and so it overfits. When animals or humans make predictions based on past observation, they don’t normally pick a hypothesis which predicts the past observations perfectly, because such a hypothesis would be unnecessarily complex and would consequently tend to make bad predictions.
Solomonoff induction does not optimize for simplicity and fit jointly: it demands perfect fit first and only then optimizes for simplicity, and the “simplest” perfectly fitting hypothesis might not be very simple at all.
Consider again the noisy input from the camera. SI’s compression is like compressing all the camera frames as PNG, a lossless image format. PNG’s compression rate on photos or video frames is terrible, because photos contain a lot of noise and little of the ordered data that a lossless algorithm can compress well. On the same photos or video frames, JPEG (a lossy format) gives a result which still has pretty good quality but is massively smaller.
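As a rough illustration of this asymmetry (my own sketch, using zlib, i.e. the DEFLATE algorithm PNG uses internally, as a stand-in for any generic lossless compressor):

```python
import os
import zlib

# Minimal sketch: zlib/DEFLATE (also the engine inside PNG) as a
# stand-in for a general-purpose lossless compressor.
n = 100_000
noise = os.urandom(n)                        # incompressible "camera noise"
structure = bytes(range(256)) * (n // 256)   # highly ordered data

print(len(zlib.compress(noise, 9)) / n)      # ~1.0: essentially no compression
print(len(zlib.compress(structure, 9)) / n)  # ~0.01: compresses massively
```

The noisy bytes come out at essentially their original size; only the ordered part shrinks.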
So how much better would Solomonoff Induction be than PNG on noisy data like photos? I think not much better. SI would be much better than PNG on certain noise-free images, e.g. one depicting a large Game of Life board: SI would figure out the simple GoL algorithm which generated the image, and PNG wouldn’t. But I see no reason why SI would be significantly better on noisy data.
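To make the Game of Life point concrete, here is a minimal sketch (my own illustration, not SI itself): an arbitrarily long board history is pinned down by a few lines of rule code plus a tiny seed, which is exactly the kind of short program SI can find and a pixel-level codec cannot.

```python
from collections import Counter

def step(live):
    """One Game of Life generation; `live` is a set of (x, y) live cells."""
    counts = Counter((x + dx, y + dy)
                     for (x, y) in live
                     for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    # A cell is alive next step if it has 3 live neighbours,
    # or 2 live neighbours and was already alive.
    return {c for c, k in counts.items() if k == 3 or (k == 2 and c in live)}

board = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}  # a 5-cell glider seed
for _ in range(1000):
    board = step(board)
# 1000 board states, all determined by ~10 lines of rule plus a 5-cell seed.
```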
Okay, maybe the noise and the whole universe are deterministic (some deterministic interpretation of QM, like Bohmian mechanics, is true). Then SI could perhaps deduce the laws of nature and the initial conditions of the universe with high confidence from some amount of video data, which would compress the past data incredibly well and give it excellent prediction capabilities, similar to Laplace’s demon.
Is this plausible? Even if the camera noise is perfectly determined by the laws and the initial conditions of the universe, it is not clear whether SI could deduce them, since random noise is not the only sort of uncertainty. Another source is indexical uncertainty. Many different spacetime locations in the universe may be consistent with the available data, and different laws / initial conditions may make different predictions for different locations. So the space of possible laws / initial conditions may not be very strongly constrained by the available data. Then a system which trades off simplicity and fit could perhaps still beat Solomonoff prediction.
Moreover, even the deterministic Many Worlds interpretation of quantum mechanics would lead to unpredictable noise for the Solomonoff inductor, again because of indexical uncertainty: it doesn’t know which of the many worlds it is located in, so it must regard its location as random, which is equivalent to regarding the noise as random.
So it seems SI would most likely fail against conventional intelligences (which make a trade-off between fit and simplicity) in indeterministic or Many Worlds universes, because these lead to (real or pseudo) indeterministic noise which can’t be well compressed or predicted losslessly. And for deterministic Bohmian-mechanics-style universes, the success of SI is questionable because of indexical uncertainty about spacetime location.
This is not a problem for Solomonoff induction because
(Compressed info meaningful to humans) + (uncompressed meaningless random noise)
is a better hypothesis than
(Uncompressed info meaningful to humans) + (uncompressed meaningless random noise)
So Solomonoff induction still does as well as a human’s ontology. Solomonoff induction tries to compress everything it can, including the patterns humans care about, even if other parts of the data can’t be compressed.
There is a precise trade-off involved. If you make a lossy fit better, you lose bits based on how much more complicated it is, but you gain bits in that you no longer need to hardcode explanations for the errors. If those errors are truly random, you might as well stick with your lossy fit (and Solomonoff induction does this).
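That accounting can be sketched with toy numbers (hypothetical figures, purely to show the bookkeeping): a hypothesis costs the bits needed to write it down plus the bits needed to encode the errors it leaves unexplained.

```python
def total_bits(model_bits, error_bits_per_sample, n_samples):
    # Two-part code: describe the hypothesis, then describe its residual errors.
    return model_bits + error_bits_per_sample * n_samples

for n in (10, 1_000, 10_000):
    lossy = total_bits(model_bits=100,  error_bits_per_sample=2.0, n_samples=n)
    finer = total_bits(model_bits=1000, error_bits_per_sample=1.8, n_samples=n)
    print(n, lossy, finer)
# With few samples, the finer fit's extra 900 model bits aren't worth the
# 0.2 bits/sample it saves on errors; past 4,500 samples its total code wins.
# If the errors are truly random, no fit saves any error bits, so the
# simpler (lossy) hypothesis always has the shorter total description.
```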
(Compressed info meaningful to humans) + (uncompressed meaningless random noise)
is a better hypothesis than
(Uncompressed info meaningful to humans) + (uncompressed meaningless random noise)
I don’t see how these claims refute anything I said. You could probably use a similar argument to justify overfitting in general. A model which overfits doesn’t care about more or less approximate fit; it cares only about perfect fit. Two hypotheses with perfect fit on the training data could have wildly different approximate fit to reality in their predictions, while as perfect predictions they are equally bad. Then Solomonoff induction wouldn’t care at all about picking the one with better approximate predictions!
Just think about the predictions of actual scientific theories: we know in advance that these theories are all, strictly speaking, wrong, since they are simplifications of reality (so they would be equally bad for Solomonoff), but one theory could be closer to the truth, a much better approximation, than the other, while the probability of being precisely correct could still be equal (equally low) for both theories.
That A is a better approximate prediction than B doesn’t imply that A is more likely true than B. In fact, B could (and under Solomonoff Induction probably would) contain a lot of made-up fake precision, which would give it at least a chance of being precisely true, in contrast to A, which can only ever fit reality imperfectly. Then B would be more likely true than A, but far less similar, in expectation, to reality.
Larger errors literally take more bits to describe. For example, in binary, 3 is 11₂ and 10 is 1010₂ (twice the bits).
Say that you have two hypotheses, A and B, such that A is 100 bits more complicated than B but 5% closer to the true value. This means that for each sample, the error in B on average takes log₂(1.05) ≈ 0.07 bits more to describe than the error in A.
After about 1,430 samples, A and B will be considered equally likely. After about 95 more samples, A will be considered 100 times more likely than B.
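Those figures are easy to reproduce; with the unrounded log₂(1.05) the break-even point lands a little earlier, around 1,420 samples:

```python
import math

# Verifying the arithmetic above: A starts 100 bits behind B, and gains
# log2(1.05) bits on B with every sample.
delta = math.log2(1.05)             # ≈ 0.0704 bits per sample in A's favour

break_even = 100 / delta            # samples until A's head start is repaid
hundred_x = math.log2(100) / delta  # further samples until A is 100x as likely

print(round(break_even))  # ≈ 1421 (≈ 1430 with the rounded 0.07)
print(round(hundred_x))   # ≈ 94
```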
In general, if f(x) is some high-level summary of the important information in x, a Solomonoff inductor that only tries to predict x is also universal for predicting f(x) (and it even has the same or better upper bounds).