# Laplace Approximation

The last cou­ple posts com­pared some spe­cific mod­els for 20000 rolls of a die. This post will step back, and talk about more gen­eral the­ory for Bayesian model com­par­i­son.

The main prob­lem is to calcu­late for some model. The model will typ­i­cally give the prob­a­bil­ity of ob­served data (e.g. die rolls) based on some un­ob­served pa­ram­e­ter val­ues (e.g. the ’s in the last two posts), along with a prior dis­tri­bu­tion over . We then need to compute

which will be a hairy high-di­men­sional in­te­gral.

Some spe­cial model struc­tures al­low us to sim­plify the prob­lem, typ­i­cally by fac­tor­ing the in­te­gral into a product of one-di­men­sional in­te­grals. But in gen­eral, we need some method for ap­prox­i­mat­ing these in­te­grals.

The two most com­mon ap­prox­i­ma­tion meth­ods used in prac­tice are Laplace ap­prox­i­ma­tion around the max­i­mum-like­li­hood point, and MCMC (see e.g. here for ap­pli­ca­tion of MCMC to Bayes fac­tors). We’ll mainly talk about Laplace ap­prox­i­ma­tion here—in prac­tice MCMC mostly works well in the same cases, as­sum­ing the un­ob­served pa­ram­e­ters are con­tin­u­ous.

## Laplace Approximation

Here’s the idea of Laplace ap­prox­i­ma­tion. First, pos­te­rior dis­tri­bu­tions tend to be very pointy. This is mainly be­cause in­de­pen­dent prob­a­bil­ities mul­ti­ply, so prob­a­bil­ities tend to scale ex­po­nen­tially with the num­ber of data points. Think of the prob­a­bil­ities we calcu­lated in the last two posts, with val­ues like or - that’s the typ­i­cal case. If we’re in­te­grat­ing over a func­tion with val­ues like that, we can ba­si­cally just pay at­ten­tion to the re­gion around the high­est value—other re­gions will have ex­po­nen­tially small weight.

Laplace’ trick is to use a sec­ond-or­der ap­prox­i­ma­tion within that high-val­ued re­gion. Speci­fi­cally, since prob­a­bil­ities nat­u­rally live on a log scale, we’ll take a sec­ond or­der-ap­prox­i­ma­tion of the log like­li­hood around its max­i­mum point. Thus:

If we as­sume that the prior is uniform (i.e. ), then this looks like a nor­mal dis­tri­bu­tion on with mean and var­i­ance given by the in­verse Hes­sian ma­trix of the log-like­li­hood. (It turns out that, even for non-uniform , we can just trans­form so that the prior looks uniform near , and trans­form it back when we’re done.) The re­sult:

Let’s walk through each of those pieces:

• is the usual max­i­mum like­li­hood: the largest prob­a­bil­ity as­signed to the data by any par­tic­u­lar value of .
• is the prior prob­a­bil­ity den­sity of the max­i­mum-like­li­hood point.
• is that an­noy­ing con­stant fac­tor which shows up when­ever we deal with nor­mal dis­tri­bu­tions; k is the di­men­sion of .
• is the de­ter­mi­nant of the “Fisher in­for­ma­tion ma­trix”; it quan­tifies how wide or skinny the peak is.

A bit more de­tail on that last piece: in­tu­itively, each eigen­value of the Fisher in­for­ma­tion ma­trix tells us the ap­prox­i­mate width of the peak in a par­tic­u­lar di­rec­tion. Since the ma­trix is the in­verse var­i­ance (in one di­men­sion ) of our ap­prox­i­mate nor­mal dis­tri­bu­tion, and “width” of the peak of a nor­mal dis­tri­bu­tion cor­re­sponds to the stan­dard de­vi­a­tion , we use an in­verse square root (i.e. the power of ) to ex­tract a width from each eigen­value. Then, to find how much vol­ume the peak cov­ers, we mul­ti­ply to­gether the widths along each di­rec­tion—thus the de­ter­mi­nant, which is the product of eigen­val­ues.

Why do we need eigen­val­ues? The di­a­gram above shows the gen­eral idea: for the func­tion shown, the two ar­rows would be eigen­vec­tors of the Hes­sian at the peak. Un­der a sec­ond-or­der ap­prox­i­ma­tion, these are prin­ci­pal axes of the func­tion’s level sets (the el­lipses in the di­a­gram). They are the nat­u­ral di­rec­tions along which to mea­sure the width. The eigen­value as­so­ci­ated with each eigen­vec­tor tells us the width, and then tak­ing their product (via the de­ter­mi­nant) gives a vol­ume. In the pic­ture above, the de­ter­mi­nant would be pro­por­tional to the vol­ume of any of the el­lipses.

Al­to­gether, then, the Laplace ap­prox­i­ma­tion takes the height of the peak (i.e. ) and mul­ti­plies by the vol­ume of -space which the peak oc­cu­pies, based on a sec­ond-or­der ap­prox­i­ma­tion of the like­li­hood around its peak.

## Laplace Com­plex­ity Penalty

The Laplace ap­prox­i­ma­tion con­tains our first ex­am­ple of an ex­plicit com­plex­ity penalty.

The idea of a com­plex­ity penalty is that we first find the max­i­mum log like­li­hood , maybe add a term for our -prior , and that’s the “score” of our model. But more gen­eral mod­els, with more free pa­ram­e­ters, will always score higher, lead­ing to overfit. To coun­ter­bal­ance that, we calcu­late some nu­mer­i­cal penalty which is larger for more com­plex mod­els (i.e. those with more free pa­ram­e­ters) and sub­tract that penalty from the raw score.

In the case of Laplace ap­prox­i­ma­tion, a nat­u­ral com­plex­ity penalty drops out as soon as we take the log of the ap­prox­i­ma­tion for­mula:

The last two terms are the com­plex­ity penalty. As we saw above, they give the (log) vol­ume of the like­li­hood peak in -space. The wider the peak, the larger the chunk of -space which ac­tu­ally pre­dicts the ob­served data.

There are two main prob­lems with this com­plex­ity penalty:

• First, there’s the usual is­sues with ap­prox­i­mat­ing a pos­te­rior dis­tri­bu­tion by look­ing at a sin­gle point. Mul­ti­modal dis­tri­bu­tions are a prob­lem, in­suffi­ciently-pointy dis­tri­bu­tions are a prob­lem. Th­ese prob­lems ap­ply to ba­si­cally any com­plex­ity penalty method.

• Se­cond, al­though the log de­ter­mi­nant of the Hes­sian can be com­puted via back­prop­a­ga­tion and lin­ear alge­bra, that com­pu­ta­tion takes . That’s a lot bet­ter than the ex­po­nen­tial time re­quired for high-di­men­sional in­te­grals, but still too slow to be prac­ti­cal for large-scale mod­els with mil­lions of pa­ram­e­ters.

His­tor­i­cally, a third is­sue was the math/​cod­ing work in­volved in calcu­lat­ing a Hes­sian, but mod­ern back­prop tools like Ten­sorflow or au­to­grad make that pretty easy; I ex­pect in the next few years we’ll see a lot more peo­ple us­ing a Laplace-based com­plex­ity penalty di­rectly. The run­time re­mains a se­ri­ous prob­lem for large-scale mod­els, how­ever, and that prob­lem is un­likely to be solved any time soon: a lin­ear-time method for com­put­ing the Hes­sian log de­ter­mi­nant would yield an ma­trix mul­ti­pli­ca­tion al­gorithm.

• For an in­tro­duc­tion to MCMC aimed at a similar level tar­get au­di­ence, I found this ex­pla­na­tion helpful.

• Thanks for this se­quence, I’ve read each post 3 or 4 times to try to prop­erly get it.

Am I right in think­ing that in or­der to re­place we not only re­quire a uniform prior but also that span unit vol­ume?

• Cor­rect. In gen­eral, is the prob­a­bil­ity den­sity of , so if it’s uniform on a unit vol­ume then .

The main ad­van­tage of this no­ta­tion is that it’s pa­ram­e­ter­i­za­tion-in­de­pen­dent. For ex­am­ple: in a coin-flip­ping ex­am­ple, we could have a uniform prior over the fre­quency of heads , so . But then, we could re-write that fre­quency in terms of the odds , so we’d get and

So the prob­a­bil­ity den­sity is equiv­a­lent to the den­sity . (That first step, , is be­cause these two vari­ables con­tain ex­actly the same in­for­ma­tion in two differ­ent forms—that’s the pa­ram­e­ter­i­za­tion in­de­pen­dence. After that, it’s math: sub­sti­tute and differ­en­ti­ate.)

(No­tice that the uniform prior on is not uniform over . This is one of the main rea­sons why “use a uniform prior” is not a good gen­eral-pur­pose rule for choos­ing pri­ors: it de­pends on what pa­ram­e­ters we choose. Carte­sian and po­lar co­or­di­nate give differ­ent “uniform” pri­ors.)

The moral of the story is that, when deal­ing with con­tin­u­ous prob­a­bil­ity den­si­ties, the fun­da­men­tal “thing” is not the den­sity func­tion but the den­sity times the differ­en­tial , which we call . This is im­por­tant mainly when chang­ing co­or­di­nates: if we have some co­or­di­nate change , then , but .

If any­body wants an ex­er­cise with this: try trans­form­ing to a differ­ent co­or­di­nate sys­tem. Ap­ply Laplace’ ap­prox­i­ma­tion in both sys­tems, and con­firm that they yield the same an­swer. (This should mainly in­volve ap­ply­ing the chain rule twice to the Hes­sian; if you get stuck, re­mem­ber that is a max­i­mum point and con­sider what that im­plies.)