I don’t think it would be a good idea to focus obsessively on AIXI. If you want to study AI, study a broad set of topics. To find reading material, do a set of Google searches on obvious keywords and rank the results by citation count.
Regarding the papers, I have a very strong idiosyncratic belief about the path to AI: to succeed, researchers must study the world and not just algorithms. In my view, all results should be expressible in the form: because the real world data exhibits empirical structure X, algorithm Y succeeds in describing/predicting it. The algorithms presented by the papers you linked to probably work only because they happen to take advantage of some structure present in the toy problems they are tested on.
In my view, all results should be expressible in the form: because the real world data exhibits empirical structure X, algorithm Y succeeds in describing/predicting it.
Even if I know the exact probability distribution over images, there is an algorithmic problem (namely, how to do the inference), so your view is definitely at least a little too extreme.
In fact, this algorithmic difficulty is an issue that many researchers are currently grappling with, so in practice you really shouldn’t expect all results to be making a novel statement about how the world works. Applying this standard to current research would stall progress in the directions I (and I think most serious AI researchers) currently believe are most important to actually reaching AI, especially human-comprehensible AI which might possibly be friendly.
Maybe we are wrong, but the argument you gave, and your implications about how NFL should be applied, are not really relevant to that question.
Even if I know the exact probability distribution over images, there is an algorithmic problem (namely, how to do the inference), so your view is definitely at least a little too extreme.
I don’t dispute that the algorithmic problem is interesting and important. I only claim that the empirical question is equally important.
Applying this standard to current research would stall progress in the directions I (and I think most serious AI researchers) currently believe are most important to actually reaching AI
What you’re really saying is that you think a certain direction of research will be fruitful. That’s fine. I disagree, but I doubt we can resolve the debate. Let’s compare notes again in 2031.
In my view, all results should be expressible in the form: because the real world data exhibits empirical structure X, algorithm Y succeeds in describing/predicting it.
I think you are restating the No Free Lunch theorem, but that isn’t a rare belief, is it?
Sure, many people are aware of the NFL theorem, but they don’t take it seriously. If you don’t believe me, read almost any computer vision paper. Vision researchers study algorithms, not images.
Legg’s thesis says:

Some, such as Edmonds (2006), argue that universal definitions of intelligence are impossible due to Wolpert’s so called “No Free Lunch” theorem (Wolpert and Macready, 1997). However this theorem, or any of the standard variants on it, cannot be applied to universal intelligence for the simple reason that we have not taken a uniform distribution over the space of environments. Instead we have used a highly non-uniform distribution based on Occam’s razor.
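For context, the measure in question (if I recall the thesis correctly) weights each computable environment by an Occam prior rather than uniformly:

\[
\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V^{\pi}_{\mu}
\]

where E is the class of computable environments, K(μ) is the Kolmogorov complexity of μ, and V^π_μ is the expected total reward agent π earns in μ. Simple environments get exponentially more weight, so the uniform-averaging step that drives the standard NFL argument never applies.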
The No Free Lunch theorems seem obviously irrelevant to me. I have never understood why they get cited so much.
Don’t any vision researchers use Bayes? If so, they’d have to be researching the formulation of priors for the true state of the scene, since the likelihood is almost trivial.
I’m not really in the field, but I am vaguely familiar with the literature and this isn’t how it works (though you might get that impression from reading LW).
A vision algorithm might face the following problem: reality picks an underlying physical scene and an image from some joint distribution. The algorithm looks at the image and must infer something about the scene. In this case, you need to integrate over a huge space to calculate likelihoods, which is generally completely intractable and so requires some algorithmic insight. For example, if you want to estimate the probability that there is an apple on the table, you need to integrate over the astronomically many possible scenes in which there is an apple on the table.
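To see the scale of the problem, here is a deliberately tiny toy sketch (the scene variables, priors, and likelihoods are made up for illustration, not taken from any real vision system): even with the exact joint distribution in hand, computing the posterior means summing over every possible scene, and the number of scenes is exponential in the number of scene variables.

```python
import itertools

# Toy generative model: a "scene" is an assignment to n binary attributes,
# with attribute 0 meaning "there is an apple on the table".
n = 16  # 2**16 scenes is still brute-forceable; real scene spaces are not

def p_scene(scene):
    # Made-up prior: each attribute is independently present with prob 0.3.
    p = 1.0
    for bit in scene:
        p *= 0.3 if bit else 0.7
    return p

def p_image_given_scene(image, scene):
    # Made-up likelihood: each "pixel" is a noisy copy of one attribute.
    p = 1.0
    for pixel, bit in zip(image, scene):
        p *= 0.9 if pixel == bit else 0.1
    return p

def posterior_apple(image):
    # Exact Bayesian inference: sum the known joint over all 2**n scenes.
    numerator = evidence = 0.0
    for scene in itertools.product((0, 1), repeat=n):
        joint = p_scene(scene) * p_image_given_scene(image, scene)
        evidence += joint
        if scene[0] == 1:  # scenes in which there is an apple on the table
            numerator += joint
    return numerator / evidence

image = (1,) * 4 + (0,) * (n - 4)
print(posterior_apple(image))  # fine at n = 16; hopeless at realistic sizes
```

Knowing the joint distribution exactly does not make the sum go away; the algorithmic work is in replacing it with something tractable.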
I don’t know if this contradicts you, but this is a problem that biological brain/eye systems have to solve (“inverse optics”), and Steven Pinker has an excellent discussion of it from a Bayesian perspective in his book How the Mind Works. He mentions that the brain does heavily rely on priors that match our environment, which significantly narrows down the possible scenes that could “explain” a given retinal image pair. (You get optical illusions when a scene violates these assumptions.)
There are two parts to the problem: one is designing a model that describes the world well, and the other is using that model to infer things about the world from data. I agree that Bayesian is the correct adjective to apply to this process, but not necessarily that modeling the world is the most interesting part.
I think this paper, entitled “Region competition: Unifying snakes, region growing, and Bayes/MDL for multiband image segmentation”, is indicative of the overall mindset. Even though the title explicitly mentions Bayes and MDL, the paper doesn’t report any compression results—only segmentation results. Bayes/MDL are viewed as tricks to be used to achieve some other purpose, not as the fundamental principle justifying the research.
In my view, all results should be expressible in the form: because the real world data exhibits empirical structure X, algorithm Y succeeds in describing/predicting it.
AIXI does exactly that—it is based on Occam’s razor.
Furthermore, it seems to me that if AGI is the goal, researchers would need to study those features of the world that caused natural general intelligence (instantiated in humans) to arise. So thank goodness they’re not doing that.
I would say researchers need to study the features of the world that make general intelligence possible. What computational structure does the world have that allows us to understand it?
Hutter says:

Ockham’s razor principle has been proven to be invaluable for understanding our world. Indeed, it not only seems a necessary but also sufficient founding principle of science. Until other necessary or sufficient principles are found, it is prudent to accept Ockham’s razor as the foundation of inductive reasoning. So far, all attempts to discredit the universal role of Ockham’s razor have failed.
I don’t think we can resolve this debate, but let me try to clarify the differences in our positions (perhaps confusing to nonspecialists, since we both advocate compression).
Hutter/Legg/Tyler/etc. (algorithmic approach): Compression is the best measure of understanding. Therefore, to achieve general intelligence, we should search for general purpose compressors. It is not interesting to build specialized compressors. To achieve compression in spite of the NFL theorem, one must exploit empirical structure in the data, but the only empirical fact we require is that the world is computable. Because the compressors are general purpose, to demonstrate success it is sufficient to show that they work well on simple benchmark problems. There is no need to study the structure of specific datasets. To achieve good text compression, one simply finds a general purpose compressor and applies it to text. The problem is entirely a problem of mathematics and algorithm design.
Burfoot (empirical approach): Compression is the best measure of understanding. However, general purpose compressors are far out of reach at this stage. Instead, one should develop specialized compressors that target specific data types (text, images, speech, music, etc.). To achieve good compression in spite of the NFL, one must study the empirical structure of the respective data sets and build that knowledge into the compressors. To compress text well, one should study grammar, parsing, word morphology, and related topics in linguistics. To demonstrate success, it is sufficient to show that a new compressor achieves a better compression rate on a standard benchmark. We should expect a good compressor to fail when applied to a data type for which it was not designed. Progress is achieved by obtaining a series of increasingly strong compression results (K-complexity upper bounds) on standard databases, while also adding new databases of greater scope and size.
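To make the evaluation criterion concrete (a minimal sketch, not a research tool): any runnable compressor gives an upper bound on the Kolmogorov complexity of a benchmark file, measured in bits per byte, and progress means driving that number down on fixed, public datasets. The compressors below are the general purpose ones in the Python standard library, and benchmark.txt is a stand-in for whatever standard corpus is used; a specialized text compressor would be scored in exactly the same way.

```python
import bz2
import lzma
import zlib

# Score a compressor on a benchmark file: compressed bits per input byte.
# Any such number is an upper bound on the file's Kolmogorov complexity,
# up to the fixed size of the decompressor itself.
def score(name, compress, data):
    bits_per_byte = 8.0 * len(compress(data)) / len(data)
    print(f"{name}: {bits_per_byte:.3f} bits/byte")

# Hypothetical benchmark file; substitute a standard corpus of your choice.
with open("benchmark.txt", "rb") as f:
    data = f.read()

score("zlib", zlib.compress, data)
score("bz2", bz2.compress, data)
score("lzma", lzma.compress, data)
```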
Again, I don’t think this debate can be resolved, but I think it’s important to clarify the various positions.
Thanks for the attempt at a position summary!

General purpose systems have their attractions. The human brain has done well out of the generality that it has.
However, I do see many virtues in narrower systems. Indeed, if you want to perform some specific task, a narrow expert system focussed on the problem domain will probably do a somewhat better job than a general purpose system. So, I would not say:
It is not interesting to build specialized compressors.
Rather, each specialized compressor encodes a little bit of a more general intelligence.
This is also a bit of a misrepresentation:
but the only empirical fact we require is that the world is computable
Occam’s razor is the critical thing, really. That is an “empirical fact”—and without it we are pretty lost.
We do want general-purpose systems. If we have those, they can build whatever narrow systems we might need.
There are two visions of the path towards machine intelligence—one is of broadening narrow systems, and the other is of general forecasting systems increasing in power: the “forecasting first” scenario. Both seem likely to be important. I tend to promote the second approach partly for technical reasons, but partly because it currently gets so little air time and attention.