From Laplace to BIC

The previous post outlined Laplace approximation, one of the most common tools used to approximate hairy probability integrals. In this post, we'll use Laplace approximation to derive the Bayesian Information Criterion (BIC), a popular complexity penalty method for comparing models with more free parameters to models with fewer free parameters.

The BIC is pretty simple:

  • Start with the maximum log likelihood $\ln P[\text{data}|\theta^*, M]$.

  • Subtract $\frac{k}{2}\ln N$ to penalize for complexity, where $N$ is the number of data points and $k$ is the number of free parameters (i.e. the dimension of $\theta$).

Thus: $\text{BIC} = \ln P[\text{data}|\theta^*, M] - \frac{k}{2}\ln N$. Using this magic number, we compare any two models we like.
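In code, that's a one-liner. Below is a minimal sketch (the function name and the example numbers are made up for illustration); note that the common textbook form $k\ln N - 2\ln\hat{L}$ is just $-2$ times this quantity.

```python
import numpy as np

def bic_score(max_log_likelihood: float, k: int, n: int) -> float:
    """The post's convention: maximum log likelihood minus the (k/2) ln N
    complexity penalty, so higher is better."""
    return max_log_likelihood - 0.5 * k * np.log(n)

# Hypothetical comparison: a 2-parameter model fits slightly better than a
# 1-parameter one, but not by enough to justify the extra parameter.
print(bic_score(max_log_likelihood=-520.3, k=2, n=1000))  # ~ -527.2
print(bic_score(max_log_likelihood=-521.0, k=1, n=1000))  # ~ -524.5, wins
```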

Let's derive that.

BIC Derivation

As usual, we'll start from $P[\text{data}|M]$. (Caution: don't forget that what we really care about is $P[M|\text{data}]$; we can jump to $P[\text{data}|M]$ only as long as our priors are close enough to be swamped by the evidence.) This time, we'll assume that we have $N$ independent data points $\{x_i\}$, all with the same unobserved parameters $\theta$ (e.g. $N$ die rolls with the same unobserved biases). In that case, we have

$$P[\{x_i\}|M] = \int P[\theta|M] \prod_{i=1}^N P[x_i|\theta, M] \, d\theta$$

Next, apply Laplace approximation around the maximizing parameter value $\theta^*$ and take the log:

$$\ln P[\{x_i\}|M] \approx \sum_{i=1}^N \ln P[x_i|\theta^*, M] + \ln P[\theta^*|M] + \frac{k}{2}\ln(2\pi) - \frac{1}{2}\ln\det H$$

where the Hessian matrix $H$ is given by

$$H = -\left.\frac{\partial^2}{\partial\theta \, \partial\theta^T} \sum_{i=1}^N \ln P[x_i|\theta, M]\right|_{\theta = \theta^*}$$
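To make the approximation concrete before we start throwing terms away, here's a rough numerical sketch: find $\theta^*$ by optimization, estimate $H$ with a crude finite-difference formula, and add up the terms above. The function and variable names are my own, and the toy Gaussian-mean model at the bottom is just for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def laplace_log_evidence(log_likelihood, log_prior, theta0, eps=1e-4):
    """Approximate ln P[data|M] via Laplace:
    sum_i ln P[x_i|theta*, M] + ln P[theta*|M] + (k/2) ln(2*pi) - (1/2) ln det H,
    with H the Hessian of the negative log likelihood at theta*."""
    k = len(theta0)
    neg_ll = lambda th: -log_likelihood(th)
    theta_star = minimize(neg_ll, theta0, method="BFGS").x

    # Crude finite-difference Hessian of the negative log likelihood at theta*.
    H = np.zeros((k, k))
    for a in range(k):
        for b in range(k):
            da, db = np.zeros(k), np.zeros(k)
            da[a], db[b] = eps, eps
            H[a, b] = (neg_ll(theta_star + da + db) - neg_ll(theta_star + da)
                       - neg_ll(theta_star + db) + neg_ll(theta_star)) / eps**2

    _, logdet_H = np.linalg.slogdet(H)
    return (log_likelihood(theta_star) + log_prior(theta_star)
            + 0.5 * k * np.log(2 * np.pi) - 0.5 * logdet_H)

# Toy usage: 200 Gaussian draws with unknown mean (k = 1), uniform prior on [-10, 10].
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=200)
log_lik = lambda th: float(np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (x - th[0]) ** 2))
log_prior = lambda th: -np.log(20.0)
print(laplace_log_evidence(log_lik, log_prior, theta0=np.array([0.0])))
```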

Now for the main trick: how does each term scale as the number of data points $N$ increases?

  • The max log likelihood is a sum over data points, so it should scale roughly proportionally to $N$.

  • The log prior density $\ln P[\theta^*|M]$ and the $\frac{k}{2}\ln(2\pi)$ term are constant with respect to $N$.

  • $H$ is another sum over data points, so it should also scale roughly proportionally to $N$.

Let's go ahead and write $H$ as $N\hat{H}$, to pull out the $N$-dependence. Then, if we can remember how determinants scale:

$$\det(N\hat{H}) = N^k \det(\hat{H})$$

so we can re-write our Laplace approximation as

$$\ln P[\{x_i\}|M] \approx \sum_{i=1}^N \ln P[x_i|\theta^*, M] - \frac{k}{2}\ln N + C$$

where $C = \ln P[\theta^*|M] + \frac{k}{2}\ln(2\pi) - \frac{1}{2}\ln\det\hat{H}$ contains all the terms which are roughly constant with respect to $N$. The first two terms are the BIC.
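As a quick sanity check on the determinant-scaling step, an arbitrary positive-definite matrix standing in for $\hat{H}$ confirms that $\ln\det(N\hat{H}) = k\ln N + \ln\det\hat{H}$:

```python
import numpy as np

rng = np.random.default_rng(1)
k, N = 3, 500
A = rng.normal(size=(k, k))
H_hat = A @ A.T + k * np.eye(k)            # arbitrary symmetric positive-definite stand-in

_, logdet_scaled = np.linalg.slogdet(N * H_hat)
_, logdet_hat = np.linalg.slogdet(H_hat)
print(logdet_scaled, k * np.log(N) + logdet_hat)   # the two values match
```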

In other words, the BIC is just the Laplace approximation, but ignoring all the terms which don't scale up as the number of data points increases.
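To see this concretely, here's a small self-contained check (not part of the original derivation; the numbers are arbitrary) on a model where the exact evidence has a closed form: $N$ coin flips with one free bias parameter $\theta$ and a uniform prior. The Laplace value lands close to the exact log evidence, while BIC differs from Laplace only by terms that don't grow with $N$.

```python
import numpy as np
from scipy.special import betaln

# Model M: N coin flips, one free parameter theta = P(heads), uniform prior on [0, 1].
N, heads = 1000, 620
theta_star = heads / N                                   # maximum-likelihood estimate
max_ll = heads * np.log(theta_star) + (N - heads) * np.log(1 - theta_star)

# Exact log evidence: ln of integral of theta^h (1-theta)^(N-h) dtheta = ln B(h+1, N-h+1).
exact = betaln(heads + 1, N - heads + 1)

# Laplace: for this likelihood, H = h/theta*^2 + (N-h)/(1-theta*)^2 = N / (theta*(1-theta*)),
# and ln P[theta*|M] = 0 for the uniform prior.
H = N / (theta_star * (1 - theta_star))
laplace = max_ll + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(H)

# BIC: keep only the terms that grow with N (k = 1 here).
bic = max_ll - 0.5 * 1 * np.log(N)

print(exact, laplace, bic)   # Laplace ~ exact; BIC is off by a small N-independent constant
```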

When Does BIC Work?

What conditions need to hold for BIC to work? Let's go back through the derivation and list out the key assumptions behind our approximations.

  • First, in order to jump from $P[M|\text{data}]$ to $P[\text{data}|M]$, our models should have roughly similar prior probabilities, i.e. within a few orders of magnitude.

  • Second, in order for any point approximation to work, the posterior parameter distribution needs to be pointy and unimodal: most of the posterior probability mass must be near $\theta^*$. In other words, we need enough data to get a precise estimate of the unobserved parameters.

  • Third, we must have $N$ large enough that $\frac{k}{2}\ln N$ (the smallest term we're keeping) is much larger than the terms we're throwing away.

That last condition is the big one. BIC is a large-$N$ approximation, so $N$ needs to be large for it to work. How large? That depends how big the leftover constant $C$ is: $N$ needs to be exponentially large in it. We'll see an example in the next post.
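To spell out what "exponentially large" means here, with $C$ the leftover constant from the rewritten approximation above, the requirement is

$$\frac{k}{2}\ln N \gg |C| \quad\Longleftrightarrow\quad N \gg e^{2|C|/k}$$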

Next post will talk more about relative advantages of BIC, Laplace, and exact calculation for comparing models. We'll see a concrete example of when BIC works/fails.