But the No Free Lunch theorems suggest that this means it will be suboptimal compared to any algorithm tailored to any specific world.
This is a subtle point. The NFL theorem does prohibit any algorithm from doing well over all possible worlds. But Solomonoff induction does well on any world that has any kind of computable regularity. If there is no computable regularity, then no prior can do well. In fact, the Solomonoff prior does just as well asymptotically as any computable prior.
As is often the case, thinking in terms of codes can clear up the issue. A world is a big data file. Certainly, an Earth-specific algorithm can get good compression rates if it is fed data that comes from Earth. But as the data file gets large, the Solomonoff general-purpose compression algorithm will achieve compression rates that are nearly as good; in the worst case, it just has to prepend the code of the Earth-specific algorithm to its encoded data stream, and it underperforms by only that program size.
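The prepending argument can be sketched in a few lines of Python. Everything here is a stand-in: zlib plays the role of the "Earth-specific" codec and `decoder_src` is a hypothetical stub for its decoding program; the point is only the additive bound.

```python
import zlib

# Toy illustration of the two-part-code bound, not Solomonoff induction
# itself. Pretend zlib is an "Earth-specific" codec and decoder_src is
# a (made-up) stand-in for the source code needed to decode its output.

data = b"temperature=288K pressure=101kPa " * 10_000  # a big "world" file

decoder_src = b"def decode(stream): ..."   # hypothetical codec program
encoded = zlib.compress(data, 9)           # the specialist codec's output

# A general-purpose scheme can always send (decoder, encoded) as one
# message, so it underperforms the specialist by at most len(decoder_src):
two_part_length = len(decoder_src) + len(encoded)
overhead_fraction = len(decoder_src) / len(data)

print(len(data), len(encoded), two_part_length)
print(f"overhead fraction: {overhead_fraction:.6f}")  # shrinks as data grows
```

As the world file grows, the fixed cost of the prepended decoder becomes a vanishing fraction of the transmission.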
The reason AIXI doesn’t work in practice is that the “efficient approximations” aren’t really efficient, or aren’t good approximations.
If there is no computable regularity, then no prior can do well. In fact, the Solomonoff prior does just as well asymptotically as any computable prior.

This seems to be a common belief. But see this discussion I had with Eliezer where I offered some good arguments and counterexamples against it.
The link goes to the middle, most relevant part of the discussion. But if you look at the top of it, I’m not arguing against the Solomonoff approach, but instead trying to find a generalization of it that makes more sense.
I’ve linked to that discussion several times in my comments here, but I guess many people still haven’t seen it. Maybe I should make a top-level post about it?
But Solomonoff induction does well on any world that has any kind of computable regularity.

Okay, fair point. But by smearing its optimality over the entire Platonic space of computable functions, it is significantly worse than those algorithms tuned for this world’s function. And not surprisingly, AIXI has very little practical application.
If there is no computable regularity, then no prior can do well. In fact, the Solomonoff prior does just as well asymptotically as any computable prior.

And that’s unhelpful when, as is likely, you don’t hit that asymptote until the heat death of the universe.
The reason AIXI doesn’t work in practice is that the “efficient approximations” aren’t really efficient, or aren’t good approximations.

My point is that even the most efficient approximations can’t be efficient in any absolute sense. In order to make AIXI useful, you have to feed it information about which functions it can safely skip over—in other words, feed it intelligence of type 2), the information about its environment that you already gained through other methods. Which just shows that those kinds of intelligence are not the same.
Actually Solomonoff induction is insanely fast. Its generality is not just that it learns everything to as good an extent as anything else, but also that typically it learns everything from indirect observations “almost as fast as directly” (not really, but...). The “only problem” is that Solomonoff induction is not an algorithm, and so its “speed” is for all practical purposes a meaningless statement.
When someone says “very fast, but uncomputable”, what I hear is “dragon in garage, but invisible”.
Generalize that to a good chunk of classical math.
The analog would be to theorem proving. No one claims that knowing the axioms of math gets you to every theorem “very fast”—because the problem of finding a proof/disproof for an arbitrary proposition is also uncomputable.
A “solution” might be that only proofs matter, while theorems (as formulas) are in general meaningless in themselves, only useful as commentary on proofs.
Nevertheless, the original point stands: no one says “I’ve discovered math! Now I can learn the answer to any math problem very fast.” In contrast, you are saying that because we have Solomonoff induction, we can infer distributions “very fast”.
To be more precise, we can specify the denotation of distributions very close to the real deal from very little data. This technical sense doesn’t support the analogy with theorem proving, which is about algorithms, not denotation.
But the analogy is in terms of the “fast but uncomputable” oxymoron.
But by smearing its optimality over the entire Platonic space of computable functions, it is significantly worse than those algorithms tuned for this world’s function.

Let’s do another thought experiment. Say that humanity has finally resolved to send colonists to nearby star systems. The first group is getting ready to head out to Alpha Centauri.
The plan is that after the colonists arrive and set up their initial civilization, they will assemble a data archive of size T about the new world and send it back to Earth for review. Now it is expensive to send data across light-years, so obviously they want to minimize the number of bits they have to send. So the question is: what data format do the two parties agree on at the moment of parting?
If T is small, then it makes sense to think this issue over quite a bit. What should we expect the data to look like? Will it be images, audio, health reports? If we can figure something out about what the data will look like in advance (i.e., choose a good prior), then we can develop a good data format and get short codes.
But if T is large (terabytes) then there’s no point in doing that. When the Alpha Centauri people build their data archive, they spend some time analyzing it and figuring out ways to compress it. Finally they find a really good compression format (=prior). Of course, Earth doesn’t know the format—but that doesn’t matter, since the specification for the format can just be prepended to the transmission.
I think this thought experiment is nice because it reveals the pointlessness of a lot of philosophical debates about Solomonoff, Bayes, etc. Of course the colonists have to choose a prior before the moment of parting, and of course if they choose a good prior they will get short codes. And the Solomonoff distribution may not be perfect in some metaphysical sense, but it’s obviously the right prior to choose in the large T regime. Better world-specific formats exist, but their benefit is small compared to T.
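The “choose a prior, then update” step in the story can be caricatured in a few lines. Everything below (the repeating-pattern hypothesis class and the 2^-length weights) is invented for illustration; the real Solomonoff mixture runs over all programs, not a handful of patterns.

```python
from fractions import Fraction

# Hedged toy, not the real Solomonoff prior: hypotheses are repeating bit
# patterns, each weighted 2^-len(pattern), a crude "shorter descriptions
# are more plausible" prior. The pattern class is made up for the sketch.

patterns = ["0", "1", "01", "011", "110", "011011"]
prior = {p: Fraction(1, 2 ** len(p)) for p in patterns}

def fits(p, data):
    """1 if repeating pattern p reproduces data exactly (noise-free toy)."""
    return 1 if all(data[i] == p[i % len(p)] for i in range(len(data))) else 0

data = "011011011011"                    # a world with computable regularity
post = {p: prior[p] * fits(p, data) for p in patterns}
total = sum(post.values())
post = {p: w / total for p, w in post.items()}

# Both "011" and "011011" explain the data; the shorter description wins
# most of the posterior, without being hand-tuned to this particular world.
print(post["011"], post["011011"])       # -> 8/9 1/9
```

The general-purpose length prior locks onto the regularity; a format hand-tuned to “worlds made of 011” would do only a constant factor better.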
The choice that they will prepend a description (and the format of the description) is a choice of prior.
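That point can be made concrete: agreeing to “prepend a description” is itself agreeing on a format. A minimal sketch of such a wire layout, where the 4-byte length header and the `csv;gzip` tag are hypothetical choices made up for illustration:

```python
import struct

# Hypothetical wire format the two parties could fix at parting:
# [4-byte big-endian spec length][spec][payload]. The convention itself
# is the shared prior; the spec contents are chosen later.

def pack(spec: bytes, payload: bytes) -> bytes:
    return struct.pack(">I", len(spec)) + spec + payload

def unpack(msg: bytes):
    (n,) = struct.unpack(">I", msg[:4])
    return msg[4:4 + n], msg[4 + n:]

msg = pack(b"csv;gzip", b"...compressed archive bytes...")
spec, payload = unpack(msg)
```

Even this “format-agnostic” scheme required agreeing in advance on how the spec is delimited, which is exactly the sense in which it is a choice of prior.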
I think this thought experiment is nice because it reveals the pointlessness of a lot of philosophical debates about Solomonoff, Bayes, etc.

Well, the thought experiment doesn’t accomplish that. Solomonoff induction is not necessarily optimal (and most probably isn’t optimal) in your scenario, even and especially for large T. The amount of time it takes for any computable Occamian approximation of S/I to find the optimal encoding is superexponential in the length of the raw source data. So the fact that it will eventually get to a superior or near-superior encoding is little consolation, when Alpha Centauri and Sol will have long burned out before Solomonoff has converged on a solution.
The inferiority of Solomonoff (Occamian) induction, of iterating up from the shortest generating algorithms until the data is matched, is not some metaphysical or philosophical issue; rather, it deals directly with the real-world time constraints that arise in practical situations.
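A hedged back-of-the-envelope makes the blow-up concrete: even before running a single candidate, the count of binary programs a shortest-first search must consider doubles with each extra bit of program length, and Levin-style search charges a further per-candidate time budget on top of that.

```python
# Back-of-the-envelope for shortest-first program search: how many binary
# programs of length <= n bits exist, before even running any of them.
# (Levin search additionally charges roughly 2**len(p) * time(p) per
# candidate, so the raw count understates the real cost.)

def programs_up_to(n):
    return 2 ** (n + 1) - 1            # sum of 2**k for k = 0..n

for n in (10, 20, 40, 80):
    print(n, programs_up_to(n))
```

Since the program needed to match a rich data archive grows with the archive, the search space explodes long before the encoding is found.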
My point is that any practical attempt to incorporate Solomonoff induction must also make use of knowledge of the data’s regularity that was found some other way, which makes it questionable whether Solomonoff induction captures everything we mean by “intelligence”. This incompleteness also raises the question of which this-world-specific methods we actually used to reach the current state of knowledge that makes Bayesian inference effective.