Gelman Against Parsimony

In two posts, Bayesian stats guru Andrew Gelman argues against parsimony, though it seems to be favored ’round these parts, in particular in the form of Solomonoff Induction and BIC as imperfect formalizations of Occam’s Razor.
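For reference, BIC (and its cousin AIC, which Gelman mentions below) scores a model by its maximized likelihood minus an explicit penalty on the number of free parameters, which is the sense in which it formalizes Occam’s Razor:

$$\mathrm{BIC} = k \ln n - 2 \ln \hat{L}, \qquad \mathrm{AIC} = 2k - 2 \ln \hat{L},$$

where $k$ is the number of parameters, $n$ the sample size, and $\hat{L}$ the maximized likelihood. Lower scores are preferred, so every added parameter must buy enough extra fit to cover its penalty.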

Gelman says:

I’ve never seen any good general justification for parsimony...

Maybe it’s because I work in social science, but my feeling is: if you can approximate reality with just a few parameters, fine. If you can use more parameters to fold in more information, that’s even better.

In practice, I often use simple models, because they are less effort to fit and, especially, to understand. But I don’t kid myself that they’re better than more complicated efforts!

My favorite quote on this comes from Radford Neal’s book, Bayesian Learning for Neural Networks, pp. 103-104: “Sometimes a simple model will outperform a more complex model... Nevertheless, I believe that deliberately limiting the complexity of the model is not fruitful when the problem is evidently complex. Instead, if a simple model is found that outperforms some particular complex model, the appropriate response is to define a different complex model that captures whatever aspect of the problem led to the simple model performing well.”

...

...ideas like minimum-description-length, parsimony, and Akaike’s information criterion are particularly relevant when models are estimated using least squares, maximum likelihood, or some other similar optimization method.

When using hierarchical models, we can avoid overfitting and get good descriptions without using parsimony: the idea is that the many parameters of the model are themselves modeled. See here for some discussion of Radford Neal’s ideas in favor of complex models, and see here for an example from my own applied research.
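To make the partial-pooling idea concrete, here is a minimal sketch of a two-level hierarchical model in plain NumPy (synthetic data; for simplicity the between-group and within-group variances are assumed known, so each group’s posterior mean has a closed form):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup: J group effects drawn from a population distribution,
# each observed with sampling noise (the classic "eight schools" shape).
J = 8
tau = 5.0     # between-group s.d. (assumed known for this sketch)
sigma = 10.0  # sampling s.d. of each group's estimate (assumed known)
theta = rng.normal(0.0, tau, J)          # true group effects
y = theta + rng.normal(0.0, sigma, J)    # observed estimates

# Unpooled: one free parameter per group, estimated by its own observation.
unpooled = y

# Partial pooling: the group effects are themselves modeled as draws from
# N(mu, tau^2), so each posterior mean shrinks the raw estimate toward the
# grand mean by the factor tau^2 / (tau^2 + sigma^2).
mu_hat = y.mean()                        # empirical-Bayes plug-in for mu
shrink = tau**2 / (tau**2 + sigma**2)
partial_pooled = mu_hat + shrink * (y - mu_hat)

# The hierarchical model keeps all the per-group parameters; it just
# regularizes them. In expectation its error is lower than the unpooled fit.
print("unpooled MSE:      ", np.mean((unpooled - theta) ** 2))
print("partial-pooled MSE:", np.mean((partial_pooled - theta) ** 2))
```

This is the sense in which the many parameters “are themselves modeled”: nothing is pruned for parsimony’s sake, but the population-level distribution keeps the per-group estimates from chasing noise. A full treatment would put priors on mu and tau and estimate them too, rather than plugging in known values.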