Cross-Validation vs Bayesian Model Comparison

Suppose we need to predict the outcomes of simulated dice rolls as a project for a machine learning class. We have two models: an “unbiased” model, which assigns equal probability to all outcomes, and a “biased” model, which learns each outcome frequency from the data.

In a machine learning class, how would we compare the performance of these two models?

We’d probably use a procedure like this:

  • Train both models on the first 80% of the data (although training is trivial for the first model).

  • Run both models on the remaining 20%, and keep whichever one performs better.

This method, along with its generalizations, is called cross-validation. It’s simple, it’s intuitive, and it’s widely used.

So far in this sequence, we’ve talked about Bayesian model comparison: to compare two models, calculate the posterior for each model. How do cross-validation and Bayesian model comparison differ?

Biased/Unbiased Die Simulation

Let’s run a simulation. We’ll roll a 100-sided die N times, using both a biased die and an unbiased die. We’ll apply both cross-validation and Bayesian model comparison to the data, and see which model each one picks. Specifics:

  • We’ll use N-fold cross-validation with log-likelihood loss: for each data point $x_i$, we learn the maximum-likelihood parameters $\hat{\theta}_{-i}$ based on all data except $x_i$. To get a final metric, we then sum the log likelihood over all points: $\sum_i \ln P[x_i \mid \hat{\theta}_{-i}]$.

  • For Bayesian model comparison, we’ll compute $P[\text{data} \mid \text{model}]$ for an unbiased model and for a model with a uniform prior on the biases, just like we did for Wolf’s Dice.

  • The simulated biased die has probability $\frac{1}{200}$ on half the faces and $\frac{3}{200}$ on the other half.

We’ll plot the difference in score/evidence/whatever-you-want-to-call-it assigned to each model by each method, as the number of data points N ranges from 1 up to 10000.
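
For concreteness, here is a minimal sketch of how both scores can be computed for a single dataset. It assumes numpy/scipy; the helper names are illustrative rather than taken from the actual simulation code, and the biased-model evidence uses the closed-form Dirichlet-multinomial integral implied by the uniform prior.

```python
import numpy as np
from scipy.special import gammaln

K = 100                                          # number of faces
biased_p = np.array([1/200] * 50 + [3/200] * 50) # biased die from the setup above

def log_evidence_unbiased(counts):
    """log P[sequence | unbiased model] = N * ln(1/K)."""
    return -counts.sum() * np.log(K)

def log_evidence_biased(counts):
    """log P[sequence | biased model, uniform prior on face probabilities].
    Closed form: (K-1)! * prod(n_i!) / (N+K-1)!  (Dirichlet-multinomial)."""
    N = counts.sum()
    return gammaln(K) + gammaln(counts + 1).sum() - gammaln(N + K)

def loo_cv_score_biased(rolls, counts):
    """N-fold (leave-one-out) CV with log-likelihood loss for the biased model:
    ML frequencies are fit on the other N-1 points for each held-out roll.
    (The score is -inf if some observed face was rolled exactly once.)"""
    N = len(rolls)
    loo_prob = (counts[rolls] - 1) / (N - 1)
    return np.log(loo_prob).sum()

def loo_cv_score_unbiased(rolls):
    """LOO CV score for the unbiased model: every held-out roll has probability 1/K."""
    return len(rolls) * np.log(1 / K)

rng = np.random.default_rng(0)
N = 10000
rolls = rng.choice(K, size=N, p=biased_p)        # pass p=None for an unbiased die
counts = np.bincount(rolls, minlength=K)

print("Bayesian evidence difference:", log_evidence_biased(counts) - log_evidence_unbiased(counts))
print("Cross-validation difference :", loo_cv_score_biased(rolls, counts) - loo_cv_score_unbiased(rolls))
```

The full plot just repeats this computation on growing prefixes of the rolls, once with the biased die and once with the unbiased die.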

Here are the results from one run:

First and most important: both cross-validation and Bayesian model comparison assign more evidence to the biased model (i.e. line above zero) when the die is biased, and more evidence to the unbiased model (i.e. line below zero) when the die is unbiased.

The most striking difference between the methods is in the case of an unbiased die: Bayesian model comparison assigns lower and lower probability to the biased model, whereas cross-validation is basically flat. In theory, as $N \to \infty$, the cross-validation metric will be random with roughly zero mean.

Why the difference? Because cross-validation and Bayesian model comparison answer different questions.

Different Questions

Compare:

  • Cross-validation: how accurately will this model predict future data gathered the same way?

  • Bayesian: how likely is this model given the data? Or equivalently, via Bayes’ rule: how well does this model predict the data we’ve seen?
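
To spell out the equivalence in the second bullet, Bayes’ rule says the posterior probability of a model is proportional to how well it predicted the data we’ve already seen, weighted by the prior on the model:

$$P[\text{model} \mid \text{data}] = \frac{P[\text{data} \mid \text{model}]\, P[\text{model}]}{P[\text{data}]} \propto P[\text{data} \mid \text{model}]\, P[\text{model}]$$

With equal priors on the models, comparing posteriors is the same as comparing how well each model predicted the observed data.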

So one is asking how well the model can predict future data, while the other is asking how well the model predicted past data. To see the difference, think about the interesting case from the simulation: biased vs unbiased model running on data from an unbiased die. As $N \to \infty$, the biased model learns the true (unbiased) frequencies, so the two models will make the same predictions going forward. With the same predictions, cross-validation is indifferent between them.

For cross-validation purposes, a model which gave wrong answers early but eventually learned the correct answers is just as good as a model which gave correct answers from the start.

If all we care about is predicting future data, then we don’t really care whether a model made correct predictions from the start or took a while to learn. In that case, cross-validation works great (and it’s certainly much easier computationally than the Bayesian method). On the other hand, if one model made correct predictions from the start and the other took a long time to learn, then that’s Bayesian evidence in favor of the model which was correct from the start.

That difference becomes important in cases like Wolf’s Dice II, where we wanted to deduce the physical asymmetries of a die. In that case, the fully-general model and our final model both make the same predictions about future rolls once they have enough data. But they differ on predictions about what the world looks like aside from the data itself. For instance, they make different predictions about what we would find if we took out some calipers and actually measured the dimensions of Wolf’s white die.

Prediction vs Understanding

Years ago, while I was working in the lab of a computational biologist, he and I got into an argument about the objective of “understanding”. I argued that, once some data can be predicted, there is nothing else left to understand about it. Whether it’s being predicted by a detailed physical simulation, a simple abstract model, or a neural network is not relevant.

Today, I no longer believe that.

Wolf’s Dice II is an excellent counter-example which highlights the problem. If two models always make the same predictions about everything, then sure, there’s no important difference between them. But don’t confuse “make the same predictions about everything” with “make the same predictions about the data” or “make the same predictions about future data of this form”. Even if two models eventually come to the exact same conclusion about the outcome distribution from rolls of a particular die, they can still make different predictions about the physical properties of the die itself.

If two models make different predictions about something out in the world, then it can be useful to evaluate the probabilities of the two models, even if they make the same predictions about future data of the same form as the training data.

Physical properties of a die are one example, but we can extend this to e.g. generalization problems. If we have models which make similar predictions about future data from the training distribution, but make different predictions more generally, then we can apply Bayesian model comparison to (hopefully) avoid generalization error. Of course, Bayesian model comparison is not a guarantee against generalization problems; even in principle, it can only work if there’s any generalization-relevant evidence in the data at all. But it should work in almost any case where cross-validation is sufficient, and many other cases as well. (I’m hedging a bit with “almost any”; it is possible for cross-validation to “get lucky” and outperform sometimes, but that should be rare as long as our priors are reasonably accurate.)

Conclusion?

In summary:

  • Cross-validation tells us how well a model will predict future data of the same form as the training data. If that’s all you need to know, then use cross-validation; it’s much easier computationally than Bayesian model comparison.

  • Bayesian model comparison tells us how well a model predicted past data, and thus the probability of the model given the data. If we want to evaluate models which make different predictions about the world, even if they converge to similar predictions about future data, then use Bayesian model comparison.

One final word of caution unrelated to the main point of this post. One practical danger of cross-validation is that it will overfit if we try to compare too many different models. As an extreme example, imagine using one model for every possible bias of a coin: a whole continuum of models. Bayesian model comparison, in that case, would simply yield the posterior distribution of the bias; the maximum-posterior model would likely be overfit (depending on the prior), but the full distribution would be correct. This is an inherent danger of maximization as an epistemic technique: it’s always an approximation to the posterior, so it will fail whenever the posterior isn’t dominated by the maximal point.
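
To make the coin example concrete, suppose we observe $h$ heads and $t$ tails and put a uniform prior on the bias $\theta$. Then the posterior over the whole continuum of models is

$$P[\theta \mid \text{data}] \propto \theta^h (1-\theta)^t, \qquad \text{i.e. } \theta \mid \text{data} \sim \mathrm{Beta}(h+1,\, t+1)$$

The maximum-posterior point $\hat{\theta} = h/(h+t)$ coincides with the maximum-likelihood (overfit) estimate, while the full Beta distribution keeps the appropriate uncertainty around it.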