As the comments to that post say, if you can actually look at the hypotheses in question, and you’re completely confident in your own judgement of simplicity, that judgement completely screens off how much data was used in formulating them.
I agree that it wouldn’t matter how much data we gave the scientists if they had fixed a method for turning data into a theory beforehand.
And I agree that such a method should settle on the simplest theory among all candidates. It should implement Occam’s razor.
But we shouldn’t expect the scientists to fix such a method before seeing the data. Occam’s razor is not enough. You first have to have a computationally feasible way to generate good candidate theories from which you choose the simplest one. And we have every reason to expect that cosmologists will eventually come up with better methods for turning cosmological data into good candidate theories. Therefore, it doesn’t make sense to force the cosmologists to bind themselves to a method now. They need the freedom to discover better methods than any that they’ve yet found.
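To make the two ingredients concrete, here is a minimal sketch (the names and structure are my own illustration, not something from the post): a "method" maps a batch of evidence to a theory by first running some computationally feasible candidate generator and only then applying Occam's razor, so the razor is just the final selection step.

```python
# Minimal sketch of a "method" in the sense used below: batch -> theory.
# Theories are modelled as boolean predictors over observations (the toy
# setting discussed later in the thread); all names here are illustrative.

def occam_method(batch, generate_candidates, description_length):
    """Generate candidate theories from the batch, keep only those that
    predict every observation in it, and return the simplest survivor."""
    consistent = [theory for theory in generate_candidates(batch)
                  if all(theory(obs) for obs in batch)]
    return min(consistent, key=description_length)
```

On this picture, two methods can both "implement Occam's razor" and still differ enormously in quality, because the hard, fallible work is in `generate_candidates`.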
The requirement of “computational feasibility” means that we can expect to have several candidate methods with no a priori way to judge confidently that one is better than the other. We will need recourse to empirical observations to compare the methods.
In this comment of mine to the post linked above, I showed that if a method, given only a subset of the data, produces a theory that correctly predicts the whole data set, then that method is probably superior to one that was given the whole data set up front. The proof goes through even if we assume that each method has a step where it applies Occam’s razor:
Define a method to be a map that takes in a batch of evidence and returns a theory. We have two assumptions:
ASSUMPTION 1: The theory produced by giving an input batch to a method will at least predict that input. That is, no matter how flawed a method of theory-construction is, it won’t contradict the evidence fed into it. More precisely,
p( M(B) predicts B ) = 1.
[...]
ASSUMPTION 2: If a method M is known to be flawed, then its theories are less likely to make correct predictions of future observations. More precisely, if B2 is not contained in B1, then

p( M(B1) predicts B2 | M is flawed ) < p( M(B1) predicts B2 ).
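The Bayesian step behind this argument can be checked in a couple of lines. The sketch below (the numbers are made up for illustration, not taken from the comment) applies Bayes’ rule: under Assumption 2, seeing M(B1) correctly predict held-out data B2 lowers the probability that M is flawed, whereas seeing M(B) predict its own input B is certain under Assumption 1 and so shifts nothing.

```python
def posterior_flawed(prior_flawed, p_predict_given_flawed, p_predict_given_sound):
    """Bayes' rule: P(M is flawed | M's theory predicted the held-out data)."""
    p_predict = (prior_flawed * p_predict_given_flawed
                 + (1 - prior_flawed) * p_predict_given_sound)
    return prior_flawed * p_predict_given_flawed / p_predict

prior = 0.5  # illustrative prior that the method is flawed

# Assumption 2: flawed methods are less likely to predict data outside their
# input batch, so p_predict_given_flawed < p_predict_given_sound.
print(posterior_flawed(prior, 0.3, 0.9))  # 0.25 < 0.5: evidence the method is sound

# Assumption 1: predicting the input batch itself is certain either way,
# so it moves the posterior nowhere.
print(posterior_flawed(prior, 1.0, 1.0))  # 0.5, unchanged
```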
And I agree that such a method should settle on the simplest theory among all candidate theories. It should implement Occam’s razor.
It’s not quite that simple in practice. There’s a tradeoff here between accuracy in retrospect and theory simplicity. The two extreme pathological cases are:
You demand absolute accuracy in retrospect, i.e. P(observed data | hypothesis) = 1. This is the limit case of overfitting, and yields a GLUT (giant lookup table), which makes either no predictions about the future or completely useless ones.
You demand maximum simplicity. This is the limit case of underfitting, and yields a maximum-entropy distribution.
You want something in between those cases. I don’t know exactly where, but you would have to figure out some way to determine that point if you were, say, building an AGI.
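A crude way to see the tradeoff numerically (this MDL-style scoring is my own illustration, not from the comment) is to charge each hypothesis for both its failures of retrodiction and its description length; the GLUT and maximum-entropy extremes each win one term and lose the other, and something in between can beat both.

```python
import math

# Toy illustration of the retrodiction/simplicity tradeoff.
# Data: 16 coin flips. Each hypothesis assigns a probability to each observed flip.
data = [1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]

def neg_log_likelihood(p_of_obs):
    # "Accuracy in retrospect": bits needed to encode the data under the hypothesis.
    return sum(-math.log2(p) for p in p_of_obs)

# GLUT-like extreme: memorize the exact sequence (probability 1 for each flip).
# Perfect retrodiction, but the description is as long as the data itself.
glut_fit  = neg_log_likelihood([1.0] * len(data))   # 0 bits
glut_cost = len(data)                                # crude stand-in: one bit per memorized flip

# Max-entropy extreme: p = 0.5 for everything. Maximally simple, minimally informative.
maxent_fit  = neg_log_likelihood([0.5] * len(data))  # 16 bits
maxent_cost = 0

# Something in between: a one-parameter biased coin, p = 14/16.
p = sum(data) / len(data)
biased_fit  = neg_log_likelihood([p if x else 1 - p for x in data])
biased_cost = 4   # crude stand-in for the cost of stating one parameter

for name, fit, cost in [("GLUT", glut_fit, glut_cost),
                        ("max-entropy", maxent_fit, maxent_cost),
                        ("biased coin", biased_fit, biased_cost)]:
    print(name, round(fit + cost, 2))   # total cost in bits; lower is better
```

With these made-up costs the in-between hypothesis scores about 12.7 bits against 16 for either extreme; where the optimum sits in general is exactly the open question above.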
I can’t really follow your earlier post. Specifically, I can’t parse your use of “predicts”, which you seem to use as a boolean value. But theories don’t “predict” or “not predict” outcomes in any absolute sense; they just assign probabilities to outcomes. Please explain your use of the phrase.
Sorry, the earlier post was in the context of a toy problem in which predictions were boolean. I should have mentioned that. (I had made this assumption explicit in an earlier comment.)
My argument shows that, in the limiting case of boolean predictions, we should trust successful theories constructed using a subset of the data over theories constructed using all the data, even if all the theories were constructed using Occam’s razor. This at least strongly suggests the same possibility in more realistic cases where the theories assign probability distributions.
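A quick Monte Carlo version of the same point, using the same illustrative numbers as the Bayes sketch above (my own sketch, nothing here is from the original argument): among methods whose theories were built from B1 alone and then correctly predicted the held-out B2, flawed methods are under-represented relative to the base rate, which is why such theories deserve more trust going forward.

```python
import random

random.seed(0)

# Toy model: a method is either sound or flawed. Sound methods' theories predict
# out-of-batch observations with probability 0.9, flawed ones with 0.3 (invented
# numbers). Every theory predicts its own input batch, per Assumption 1.
P_FLAWED, P_SOUND_HIT, P_FLAWED_HIT = 0.5, 0.9, 0.3

trials = 100_000
survivors = survivors_flawed = 0
for _ in range(trials):
    flawed = random.random() < P_FLAWED
    p_hit = P_FLAWED_HIT if flawed else P_SOUND_HIT
    # The theory was built from B1 only; did it predict the held-out B2?
    if random.random() < p_hit:
        survivors += 1
        survivors_flawed += flawed

print("P(flawed) before the test:", P_FLAWED)
print("P(flawed | predicted held-out data):", survivors_flawed / survivors)  # ~0.25
```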
Ok, I think I get your earlier post now. I think you might be overcomplicating things here.
Sure, if you’re not confident about what the correct simplicity prior is, you can get real evidence about which theory is likely to be stronger by observing things like each theory’s ability to correctly predict the outcomes of new experiments. And to the extent that this tells you something about the way the originating scientist generates theories, there should even be some shifting of probability mass regarding the power of other theories produced by the same scientist. But that’s quite a lot of indirection, and there are significant unknown factors that will dilute these shifts.
Attempting this is somewhat like trying to estimate the probability of a scientist being right about a famous problem in their field based on their prestige. There’s a signal, but it’s quite noisy.
If you know what simplicity looks like (and of course that’s uncomputable, but you can always approximate), and how much it’s worth in terms of probability mass, you can make a much better guess as to which hypothesis is stronger by just looking at the actual hypotheses.
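For concreteness, here is what “knowing how much simplicity is worth in probability mass” could look like as a calculation (my own sketch, with invented numbers): weight each hypothesis by a Solomonoff-style prior P(h) ∝ 2^(−description length) together with its likelihood, and compare the hypotheses directly.

```python
def log2_posterior(description_length_bits, log2_likelihood):
    """Unnormalized log2 posterior under a simplicity prior P(h) ∝ 2^(-length)."""
    return -description_length_bits + log2_likelihood

# Invented numbers: a simpler hypothesis that fits the data slightly worse
# can still beat a complex hypothesis that fits it perfectly.
simple_h  = log2_posterior(description_length_bits=20, log2_likelihood=-12.0)
complex_h = log2_posterior(description_length_bits=45, log2_likelihood=-2.0)
print(simple_h, complex_h)   # -32.0 vs -47.0: the simpler hypothesis wins
```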
Looking at things like “how many experimental results did this hypothesis actually predict correctly?” is only informative to the extent that your understanding of simplicity and its value is lacking. Note that “lacking understanding of simplicity” isn’t meant to be especially disparaging; a good understanding of simplicity is hard to come by. There’s a reason the scientific process includes an inelegant workaround instead.