Is model selection really a big problem? I thought there was a conceptually simple way to incorporate it into a model (just add a model-index parameter), though it might be computationally tricky sometimes. As JohnDavidBustard points out below, the real difficulty seems to be model creation. Though I suppose you can frame that as model selection if you have some prior over a broad enough category of models (say, all Turing machines).
It depends on what you mean by model selection. If you mean e.g. figuring out whether to use quadratics or cubics, then the standard solution that people cite is to use Bayesian Occam’s razor, i.e. compute

p(Cubic | Data)/p(Quadratic | Data) = p(Data | Cubic)/p(Data | Quadratic) * p(Cubic)/p(Quadratic)

where we compute the probabilities on the right-hand side by marginalizing over all cubics and quadratics. But the number you get out of this will depend strongly on how quickly the tails decay on your distributions over cubics and quadratics, so I don’t find this particularly satisfying. (I’m not alone in this, although there are people who would disagree with me or propose various methods for choosing the prior distributions appropriately.)
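To illustrate the tail-decay sensitivity concretely, here is a toy sketch of my own (not anything from the thread): Gaussian priors on the polynomial coefficients make the marginal likelihoods available in closed form, and varying the assumed prior scale moves the Bayes factor even though the data are fixed. The prior and noise scales here are arbitrary choices for the demonstration.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Synthetic data from a true quadratic (hypothetical setup for illustration).
x = np.linspace(-1, 1, 30)
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(0, 0.1, size=x.size)

def log_marginal(y, X, prior_scale, noise_scale=0.1):
    """Log evidence of a linear-in-parameters model with a N(0, prior_scale^2 I)
    prior on the coefficients: integrating the coefficients out gives
    y ~ N(0, noise^2 I + prior^2 X X^T)."""
    cov = noise_scale**2 * np.eye(len(y)) + prior_scale**2 * X @ X.T
    return multivariate_normal(mean=np.zeros(len(y)), cov=cov).logpdf(y)

X_quad = np.vander(x, 3)  # design matrix with columns x^2, x, 1
X_cub = np.vander(x, 4)   # design matrix with columns x^3, x^2, x, 1

# The Bayes factor depends on the assumed prior scale, not just the data:
for scale in (1.0, 10.0, 100.0):
    log_bf = log_marginal(y, X_quad, scale) - log_marginal(y, X_cub, scale)
    print(f"prior scale {scale:6.1f}: log BF (quadratic vs cubic) = {log_bf:.2f}")
```

The widening of the prior on the cubic's extra coefficient is penalized roughly as the log of the prior scale, so the comparison is only as meaningful as the choice of prior.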
If you mean something else, like figuring out what specific model to pick out from your entire space (e.g. picking a specific function to fit your data), then you can run into problems like having to compare probability masses to probability densities, or comparing measures with different dimensionality (e.g. densities on the line versus the plane); a more fundamental issue is that picking a specific model potentially ignores other features of your posterior distribution, like how concentrated the probability mass is about that model.
I would say that the most principled way to get a single model out at the end of the day is variational inference, which basically attempts to set parameters so as to minimize the relative entropy between the distribution implied by the parameters and the actual posterior distribution. I don’t know a whole lot about this area beyond a couple of papers I’ve read, but it does seem like a good way to perform inference if you’d like to restrict yourself to considering a single model at a time.
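For what it's worth, here is a minimal sketch of that idea (my own toy example, not a serious implementation): fit a single Gaussian q to an unnormalized log-posterior by minimizing the Monte Carlo estimate of the relative entropy KL(q‖p), using the reparameterization trick with fixed base samples so an off-the-shelf optimizer can be used. The bimodal target is an arbitrary choice to make the "single model summarizing a spread-out posterior" tension visible.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
eps = rng.normal(size=2000)  # fixed base samples (reparameterization trick)

# Hypothetical target: an unnormalized log-posterior over one parameter,
# here a two-component mixture, so no single Gaussian fits it perfectly.
def log_post(theta):
    return np.logaddexp(norm.logpdf(theta, -1.0, 0.5),
                        norm.logpdf(theta, 1.5, 0.5))

def neg_elbo(params):
    """Negative ELBO for q = N(m, s^2); minimizing this minimizes
    KL(q || posterior) up to the (constant) log evidence."""
    m, log_s = params
    s = np.exp(log_s)
    theta = m + s * eps  # samples from q via reparameterization
    entropy = 0.5 * np.log(2 * np.pi * np.e) + log_s  # entropy of N(m, s^2)
    return -(log_post(theta).mean() + entropy)

res = minimize(neg_elbo, x0=np.array([0.0, 0.0]))
m, s = res.x[0], np.exp(res.x[1])
print(f"variational fit: mean={m:.2f}, sd={s:.2f}")
```

Because KL(q‖p) heavily penalizes putting q's mass where the posterior is small, the fitted Gaussian tends to lock onto one mode rather than spread across both, which is exactly the "ignores other features of your posterior" worry above.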
OK, so you’re saying that a big problem in model selection is coming up with good prior distributions for different classes of models, specifically ones with different tail decays (it sounds like you think it could also be that the standard Bayesian framework is missing something). This is an interesting idea which I had heard about before, but didn’t understand till now. Thank you for telling me about it.
I would say that when you have a somewhat dispersed posterior it is simply misleading to pick any specific model+parameters as your fit. The correct thing to do is average over possible models+parameters.
It’s only when you have a relatively narrow posterior, or when the error bars on the estimate you give for some parameter or prediction don’t matter, that it’s OK to select a single model.
I think I basically agree with you on that; whenever feasible the full posterior (as opposed to the maximum-likelihood model) is what you should be using. So instead of using “Bayesian model selection” to decide whether to pick cubics or quadratics, and then fitting the best cubic or the best quadratic depending on the answer, the “right” thing to do is to just look at the posterior distribution over possible functions f, and use that to get a posterior distribution over f(x) for any given x.
The problem is that this is not always feasible for the application you have in mind, and I’m not sure whether we have good general methods for finding the right way to approximate it. But certainly an average over the models is what we should be trying to approximate.
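To make the "posterior over f(x)" point concrete, here is a small sketch of mine (the noise and prior scales are assumptions): a Bayesian linear model's posterior predictive reports a mean and an error bar for f(x) at a test point, rather than a single fitted curve, and you can read it off for each polynomial degree.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 20)
y = np.sin(2 * x) + rng.normal(0, 0.1, size=x.size)
noise, prior = 0.1, 2.0  # assumed noise sd and coefficient prior sd

def posterior_predictive(X, y, Xstar):
    """Gaussian posterior predictive of a Bayesian linear model with
    coefficients ~ N(0, prior^2 I) and observation noise ~ N(0, noise^2)."""
    A = X.T @ X / noise**2 + np.eye(X.shape[1]) / prior**2  # posterior precision
    mean_w = np.linalg.solve(A, X.T @ y) / noise**2         # posterior mean
    mu = Xstar @ mean_w
    # Predictive variance = parameter uncertainty + observation noise.
    var = np.einsum('ij,ji->i', Xstar, np.linalg.solve(A, Xstar.T)) + noise**2
    return mu, var

xstar = np.array([0.5])
results = {}
for degree in (2, 3):
    X = np.vander(x, degree + 1)
    Xs = np.vander(xstar, degree + 1)
    mu, var = posterior_predictive(X, y, Xs)
    results[degree] = (mu[0], var[0])
    print(f"degree {degree}: f(0.5) ≈ {mu[0]:.2f} ± {np.sqrt(var[0]):.2f}")
```

Averaging these per-model predictives with posterior model weights would then give the full model-averaged answer; the point here is just that each model already yields a distribution over f(x), not a single value.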