Kalman Filter for Bayesians

Summary: the Kalman Filter is Bayesian updating applied to systems that are changing over time, assuming all our distributions are Gaussians and all our transformations are linear.

Preamble—the general Bayesian approach to estimation: the Kalman filter is an approach to estimating moving quantities. When I think about a Bayesian approach to estimation, I think about passing around probability distributions: we have some distribution as our prior, we gather some evidence, and we have a new distribution as our posterior. In general, the mean of our distribution measures our best guess of the underlying value, and the variance represents our uncertainty.

In the Kalman filter, the only distribution we use is the normal/Gaussian distribution. One important property of this is that it can be parameterized completely by the mean and variance (or covariance in the multivariate case). If you know those two values, you know everything about the distribution.

As a result, people often talk about the Kalman filter as though it's estimating means and variances at different points, but I find it easier to think of it as outputting a distribution representing our current knowledge at any point.

The simplest case: taking multiple measurements of a fixed quantity with an accurate but imprecise sensor. For example, say we're trying to measure the temperature with a thermometer that we believe is accurate but has a variance of 5 degrees.

We're very bad at estimating temperatures by hand, so let's say our prior distribution is that the temperature is somewhere around 70 degrees with a variance of 20, or $N(70, 20)$. We take one readout from the thermometer, which (by assumption) yields a normal distribution centered around the true temperature with variance 5: $N(t, 5)$. The thermometer reads 72. What's our new estimate?

Well, it turns out there's a simple rule for combining normal distributions with known variance: if our prior is $N(\mu_1, \sigma_1^2)$ and our observation is $N(\mu_2, \sigma_2^2)$, then the posterior has mean and variance given by

(1) $\mu' = \mu_1 + K(\mu_2 - \mu_1)$

(2) $\sigma'^2 = (1 - K)\,\sigma_1^2$, where

(3) $K = \dfrac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2}$ is called the Kalman gain.

So if our first reading is 72, then $\mu_2 = 72$, $\sigma_2^2 = 5$, and $K = \frac{20}{20 + 5} = 0.8$, so the posterior is $N(70 + 0.8 \cdot (72 - 70),\ (1 - 0.8) \cdot 20) = N(71.6, 4)$. If we take another reading, we'd apply the same set of calculations, except our prior would be $N(71.6, 4)$.
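To make the arithmetic concrete, here's a minimal sketch of the one-dimensional update in Python; the function name and structure are my own, not from any particular library.

```python
def kalman_update_1d(prior_mean, prior_var, obs_mean, obs_var):
    """Combine a Gaussian prior with a Gaussian observation (1-D case)."""
    gain = prior_var / (prior_var + obs_var)                 # equation (3)
    post_mean = prior_mean + gain * (obs_mean - prior_mean)  # equation (1)
    post_var = (1 - gain) * prior_var                        # equation (2)
    return post_mean, post_var

# Prior N(70, 20), thermometer reading of 72 with variance 5:
print(kalman_update_1d(70, 20, 72, 5))   # ≈ (71.6, 4.0)
```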

Some intuition: let's look at the Kalman gain. First, note that its value is always between 0 and 1. Second, note that the gain is close to 0 if $\sigma_2^2$ is large compared to $\sigma_1^2$, and close to 1 in the opposite case. Intuitively, we can think of the Kalman gain as a ratio of how much we trust our new observation relative to our prior, where the variances are a measure of uncertainty.

What happens to the mean? It moves along the line from our prior mean to the observation. If we trust the observation a lot, $K$ is nearly 1, and we move almost all the way. If we trust the prior much more than the observation, we adjust our estimate very little. And if we trust them equally, we take the average of the two.

Also note that the variance always goes down. Once again, if we trust the new information a lot, the variance goes down a bunch. If we trust the new information and our prior equally, then the variance is halved.

Finally, as a last tidbit, it doesn't matter which distribution is the prior and which is the observation in this case—we'll get exactly the same posterior if we switch them around.
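A few quick numerical checks of these claims, using the same made-up variances as above (just back-of-the-envelope arithmetic):

```python
# Gain when the observation is far more precise than the prior: close to 1.
print(20 / (20 + 0.01))    # ≈ 0.9995
# Gain when the observation is far noisier than the prior: close to 0.
print(20 / (20 + 2000))    # ≈ 0.0099
# Equal variances: gain is 0.5, so the mean is the average and the variance halves.
print(20 / (20 + 20))      # 0.5

# Symmetry: updating N(70, 20) with N(72, 5) ...
k = 20 / (20 + 5)
print(70 + k * (72 - 70), (1 - k) * 20)   # ≈ 71.6, 4.0
# ... gives the same posterior as updating N(72, 5) with N(70, 20).
k = 5 / (5 + 20)
print(72 + k * (70 - 72), (1 - k) * 5)    # ≈ 71.6, 4.0
```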

Adding a sensor: none of the math above assumes we're always using the same sensor. As long as we assume all our sensors draw from distributions centered around the true value and with a known (or estimated) variance, we can update on observations from any number of sensors, using the same update rule.
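For instance, continuing the temperature example, we could follow the first thermometer with a reading from a hypothetical second, more precise sensor (variance 2). The sketch below just applies the same update twice, with invented numbers.

```python
def update(mean, var, obs_mean, obs_var):
    k = var / (var + obs_var)
    return mean + k * (obs_mean - mean), (1 - k) * var

mean, var = 70, 20                       # prior belief
mean, var = update(mean, var, 72, 5)     # thermometer A, variance 5 -> ≈ (71.6, 4.0)
mean, var = update(mean, var, 69, 2)     # thermometer B, variance 2
print(mean, var)                         # ≈ (69.87, 1.33)
```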

Measuring multiple quantities: what if we want to measure two or more quantities, such as temperature and humidity? Then we have multivariate normal distributions. While a single-variable Gaussian is parameterized by its mean and variance, an $n$-variable Gaussian is parameterized by a length-$n$ vector of means $\mu$ and an $n \times n$ covariance matrix $\Sigma$: $N(\mu, \Sigma)$.

Our update equations are the multivariate versions of the equations above: given a prior distribution $N(\mu_1, \Sigma_1)$ and a measurement $\mu_2$ from a sensor with covariance matrix $\Sigma_2$, our posterior distribution is $N(\mu', \Sigma')$ with:

(4) $\mu' = \mu_1 + K(\mu_2 - \mu_1)$

(5) $\Sigma' = (I - K)\,\Sigma_1$

(6) $K = \Sigma_1 (\Sigma_1 + \Sigma_2)^{-1}$

These are basically just the matrix versions of equations (1), (2), and (3).
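Here's a minimal NumPy sketch of equations (4)–(6); the prior and sensor covariances for (temperature, humidity) are invented purely for illustration.

```python
import numpy as np

def kalman_update(mu1, Sigma1, mu2, Sigma2):
    """Multivariate Bayesian update: equations (4)-(6)."""
    K = Sigma1 @ np.linalg.inv(Sigma1 + Sigma2)    # (6) Kalman gain
    mu_post = mu1 + K @ (mu2 - mu1)                # (4) posterior mean
    Sigma_post = (np.eye(len(mu1)) - K) @ Sigma1   # (5) posterior covariance
    return mu_post, Sigma_post

# Prior belief about (temperature, humidity) and one noisy joint measurement:
mu1 = np.array([70.0, 40.0])
Sigma1 = np.array([[20.0,  2.0],
                   [ 2.0, 30.0]])
mu2 = np.array([72.0, 45.0])
Sigma2 = np.array([[5.0,  0.0],
                   [0.0, 10.0]])
print(kalman_update(mu1, Sigma1, mu2, Sigma2))
```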


Adding predictable change over time: so far, we've covered Bayesian updates when you're making multiple measurements of some static set of quantities. But what about when things are changing? A classic example is a moving car. For this case, let's assume we're measuring two quantities – position and velocity.

For a bit more detail, say at time $t$ our vector is $x_t = (p_t, v_t)$, where $p_t$ is the position and $v_t$ is the velocity. Then at time $t+1$, we might expect the position to be $p_t + v_t$ (taking the time step to be one unit), and the velocity to be the same on average. We can represent this with a matrix: $x_{t+1} = F x_t$, where $F$ is the matrix $\begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$.

More generally, say our belief at time $t$ is $N(\mu_t, \Sigma_t)$. Then our belief at time $t+1$, before we make any new observations, should be the distribution of $F x$ where $x \sim N(\mu_t, \Sigma_t)$. Fortunately there's a simple formula for this: $N(F \mu_t,\ F \Sigma_t F^T)$.
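A small sketch of just this prediction step for the position/velocity example, assuming a unit time step and made-up numbers:

```python
import numpy as np

F = np.array([[1.0, 1.0],    # new position = old position + velocity (unit time step)
              [0.0, 1.0]])   # velocity unchanged on average

mu_t = np.array([0.0, 10.0])           # belief: position 0, velocity 10
Sigma_t = np.array([[4.0, 0.0],
                    [0.0, 1.0]])

mu_pred = F @ mu_t                     # F mu_t -> [10., 10.]
Sigma_pred = F @ Sigma_t @ F.T         # F Sigma_t F^T -> [[5., 1.], [1., 1.]]
print(mu_pred, Sigma_pred)             # uncertainty grows; position and velocity become correlated
```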

Putting it all together, say our belief at time $t$ is $N(\mu_t, \Sigma_t)$, and at time $t+1$ we measure a value $z_{t+1}$ from a sensor with covariance matrix $\Sigma_z$. Then we perform the Bayesian update with $N(F \mu_t, F \Sigma_t F^T)$ as the prior and $N(\mu_{t+1}, \Sigma_{t+1})$ as the posterior:

(7) $\mu_{t+1} = F \mu_t + K(z_{t+1} - F \mu_t)$

(8) $\Sigma_{t+1} = (I - K)\,F \Sigma_t F^T$

(9) $K = F \Sigma_t F^T \left(F \Sigma_t F^T + \Sigma_z\right)^{-1}$

And that's the main idea! We just adjust our prior by applying a transition function/matrix to it first. In practice, the Kalman filter tends to quickly converge to true values, and is widely used in applications such as GPS tracking.
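To tie it together, here's a minimal end-to-end sketch of the predict-then-update loop for the car example. It assumes the sensor reads position and velocity directly, uses a unit time step, and, to match equations (7)–(9), omits the process-noise term that real implementations usually add to the predicted covariance; all numbers are invented.

```python
import numpy as np

F = np.array([[1.0, 1.0],        # position += velocity each step (unit time step)
              [0.0, 1.0]])
Sigma_z = np.array([[3.0, 0.0],  # sensor covariance for (position, velocity) readings
                    [0.0, 1.0]])

mu = np.array([0.0, 8.0])        # initial belief: position 0, velocity 8
Sigma = np.array([[10.0, 0.0],
                  [ 0.0, 5.0]])

# Invented (position, velocity) readings at t = 1, 2, 3 for a car moving ~10 units per step:
readings = [np.array([9.0, 10.2]), np.array([20.5, 9.8]), np.array([30.2, 10.1])]

for z in readings:
    # Predict: push the current belief through the transition matrix.
    mu_prior = F @ mu
    Sigma_prior = F @ Sigma @ F.T
    # Update: equations (7)-(9), with the predicted belief as the prior.
    K = Sigma_prior @ np.linalg.inv(Sigma_prior + Sigma_z)
    mu = mu_prior + K @ (z - mu_prior)
    Sigma = (np.eye(2) - K) @ Sigma_prior
    print(mu)    # estimates move toward position ~10, ~20, ~30 and velocity ~10
```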