Kalman Filter for Bayesians

Summary: the Kalman Filter is Bayesian updating applied to systems that are changing over time, assuming all our distributions are Gaussians and all our transformations are linear.

Preamble—the general Bayesian approach to estimation: the Kalman filter is an approach to estimating moving quantities. When I think about a Bayesian approach to estimation, I think about passing around probability distributions: we have some distribution as our prior, we gather some evidence, and we have a new distribution as our posterior. In general, the mean of our distribution measures our best guess of the underlying value, and the variance represents our uncertainty.

In the Kalman filter, the only distribution we use is the normal/Gaussian distribution. One important property of this is that it can be parameterized completely by the mean and variance (or covariance in the multivariate case). If you know those two values, you know everything about the distribution.

As a result, people often talk about the Kalman filter as though it's estimating means and variances at different points, but I find it easier to think of it as outputting a distribution representing our current knowledge at any point.

The simplest case: taking multiple measurements of a fixed quantity with an accurate but imprecise sensor. For example, say we're trying to measure the temperature with a thermometer that we believe is accurate but has a variance of 5 degrees.

We're very bad at estimating temperatures by hand, so let's say our prior distribution is that the temperature is somewhere around 70 degrees with a variance of 20, or $N(70, 20)$. We take one readout from the thermometer, which (by assumption) yields a normal distribution centered around the true temperature with variance 5: $N(t, 5)$. The thermometer reads 72. What's our new estimate?

Well, it turns out there's a simple rule for combining normal distributions with known variance: if our prior is $N(\mu_1, \sigma_1^2)$ and our observation is $N(\mu_2, \sigma_2^2)$, then the posterior has mean and variance given by

(1) $\mu' = \mu_1 + K(\mu_2 - \mu_1)$

(2) $\sigma'^2 = (1 - K)\,\sigma_1^2$, where

(3) $K = \dfrac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2}$ is called the Kalman gain.

So if our first reading is 72, then $\mu_2 = 72$, $\sigma_2^2 = 5$, and $K = \frac{20}{20 + 5} = 0.8$, so the posterior is $N(70 + 0.8 \cdot (72 - 70),\ (1 - 0.8) \cdot 20) = N(71.6, 4)$. If we take another reading, we'd apply the same set of calculations, except our prior would be $N(71.6, 4)$.
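To make the arithmetic concrete, here's a minimal sketch of the one-dimensional update in Python; the function name and structure are my own, not from any particular library.

```python
def kalman_update_1d(prior_mean, prior_var, obs_mean, obs_var):
    """Combine a Gaussian prior with a Gaussian observation (1-D case)."""
    gain = prior_var / (prior_var + obs_var)                 # equation (3)
    post_mean = prior_mean + gain * (obs_mean - prior_mean)  # equation (1)
    post_var = (1 - gain) * prior_var                        # equation (2)
    return post_mean, post_var

# Prior N(70, 20), thermometer reading of 72 with variance 5:
print(kalman_update_1d(70, 20, 72, 5))   # ≈ (71.6, 4.0)
```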

Some intuition: let's look at the Kalman gain. First, note that its value is always between 0 and 1. Second, note that the gain is close to 0 if $\sigma_2^2$ is large compared to $\sigma_1^2$, and close to 1 in the opposite case. Intuitively, we can think of the Kalman gain as a ratio of how much we trust our new observation relative to our prior, where the variances are a measure of uncertainty.

What happens to the mean? It moves along the line from our prior mean to the observation. If we trust the observation a lot, $K$ is nearly 1, and we move almost all the way. If we trust the prior much more than the observation, we adjust our estimate very little. And if we trust them equally, we take the average of the two.

Also note that the variance always goes down. Once again, if we trust the new information a lot, the variance goes down a bunch. If we trust the new information and our prior equally, then the variance is halved.

Finally, as a last tidbit, it doesn't matter which distribution is the prior and which is the observation in this case—we'll get exactly the same posterior if we switch them around.
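A few quick numerical checks of these claims, using the same made-up variances as above (just back-of-the-envelope arithmetic):

```python
# Gain when the observation is far more precise than the prior: close to 1.
print(20 / (20 + 0.01))    # ≈ 0.9995
# Gain when the observation is far noisier than the prior: close to 0.
print(20 / (20 + 2000))    # ≈ 0.0099
# Equal variances: gain is 0.5, so the mean is the average and the variance halves.
print(20 / (20 + 20))      # 0.5

# Symmetry: updating N(70, 20) with N(72, 5) ...
k = 20 / (20 + 5)
print(70 + k * (72 - 70), (1 - k) * 20)   # ≈ 71.6, 4.0
# ... gives the same posterior as updating N(72, 5) with N(70, 20).
k = 5 / (5 + 20)
print(72 + k * (70 - 72), (1 - k) * 5)    # ≈ 71.6, 4.0
```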

Adding a sensor: none of the math above assumes we're always using the same sensor. As long as we assume all our sensors draw from distributions centered around the true value and with a known (or estimated) variance, we can update on observations from any number of sensors, using the same update rule.
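For instance, continuing the temperature example, we could follow the first thermometer with a reading from a hypothetical second, more precise sensor (variance 2). The sketch below just applies the same update twice, with invented numbers.

```python
def update(mean, var, obs_mean, obs_var):
    k = var / (var + obs_var)
    return mean + k * (obs_mean - mean), (1 - k) * var

mean, var = 70, 20                       # prior belief
mean, var = update(mean, var, 72, 5)     # thermometer A, variance 5 -> ≈ (71.6, 4.0)
mean, var = update(mean, var, 69, 2)     # thermometer B, variance 2
print(mean, var)                         # ≈ (69.87, 1.33)
```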

Measuring multiple quantities: what if we want to measure two or more quantities, such as temperature and humidity? Then we have multivariate normal distributions. While a single-variable Gaussian is parameterized by its mean and variance, an $n$-variable Gaussian is parameterized by a length-$n$ vector of means $\mu$ and an $n \times n$ covariance matrix $\Sigma$: $N(\mu, \Sigma)$.

Our update equations are the multivariate versions of the equations above: given a prior distribution $N(\mu_1, \Sigma_1)$ and a measurement $\mu_2$ from a sensor with covariance matrix $\Sigma_2$, our posterior distribution is $N(\mu', \Sigma')$ with:

(4) $\mu' = \mu_1 + K(\mu_2 - \mu_1)$

(5) $\Sigma' = (I - K)\,\Sigma_1$

(6) $K = \Sigma_1 (\Sigma_1 + \Sigma_2)^{-1}$

These are basically just the matrix versions of equations (1), (2), and (3).
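Here's a minimal NumPy sketch of equations (4)–(6); the prior and sensor covariances for (temperature, humidity) are invented purely for illustration.

```python
import numpy as np

def kalman_update(mu1, Sigma1, mu2, Sigma2):
    """Multivariate Bayesian update: equations (4)-(6)."""
    K = Sigma1 @ np.linalg.inv(Sigma1 + Sigma2)    # (6) Kalman gain
    mu_post = mu1 + K @ (mu2 - mu1)                # (4) posterior mean
    Sigma_post = (np.eye(len(mu1)) - K) @ Sigma1   # (5) posterior covariance
    return mu_post, Sigma_post

# Prior belief about (temperature, humidity) and one noisy joint measurement:
mu1 = np.array([70.0, 40.0])
Sigma1 = np.array([[20.0,  2.0],
                   [ 2.0, 30.0]])
mu2 = np.array([72.0, 45.0])
Sigma2 = np.array([[5.0,  0.0],
                   [0.0, 10.0]])
print(kalman_update(mu1, Sigma1, mu2, Sigma2))
```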


Adding predictable change over time: so far, we've covered Bayesian updates when you're making multiple measurements of some static set of quantities. But what about when things are changing? A classic example is a moving car. For this case, let's assume we're measuring two quantities – position and velocity.

For a bit more detail, say at time $t$ our vector is $x_t = (p_t, v_t)$, where $p_t$ is the position and $v_t$ is the velocity. Then at time $t+1$, we might expect the position to be $p_t + v_t$ (taking the time step to be one unit), and the velocity to be the same on average. We can represent this with a matrix: $x_{t+1} = F x_t$, where $F$ is the matrix $\begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$.

More generally, say our belief at time $t$ is $N(\mu_t, \Sigma_t)$. Then our belief at time $t+1$, before we make any new observations, should be the distribution of $F x$ where $x \sim N(\mu_t, \Sigma_t)$. Fortunately there's a simple formula for this: $N(F \mu_t,\ F \Sigma_t F^T)$.
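A small sketch of just this prediction step for the position/velocity example, assuming a unit time step and made-up numbers:

```python
import numpy as np

F = np.array([[1.0, 1.0],    # new position = old position + velocity (unit time step)
              [0.0, 1.0]])   # velocity unchanged on average

mu_t = np.array([0.0, 10.0])           # belief: position 0, velocity 10
Sigma_t = np.array([[4.0, 0.0],
                    [0.0, 1.0]])

mu_pred = F @ mu_t                     # F mu_t -> [10., 10.]
Sigma_pred = F @ Sigma_t @ F.T         # F Sigma_t F^T -> [[5., 1.], [1., 1.]]
print(mu_pred, Sigma_pred)             # uncertainty grows; position and velocity become correlated
```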

Putting it all together, say our belief at time $t$ is $N(\mu_t, \Sigma_t)$, and at time $t+1$ we measure a value $z_{t+1}$ from a sensor with covariance matrix $\Sigma_z$. Then we perform the Bayesian update with $N(F \mu_t, F \Sigma_t F^T)$ as the prior and $N(\mu_{t+1}, \Sigma_{t+1})$ as the posterior:

(7) $\mu_{t+1} = F \mu_t + K(z_{t+1} - F \mu_t)$

(8) $\Sigma_{t+1} = (I - K)\,F \Sigma_t F^T$

(9) $K = F \Sigma_t F^T \left(F \Sigma_t F^T + \Sigma_z\right)^{-1}$

And that's the main idea! We just adjust our prior by applying a transition function/matrix to it first. In practice, the Kalman filter tends to quickly converge to true values, and is widely used in applications such as GPS tracking.
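To tie it together, here's a minimal end-to-end sketch of the predict-then-update loop for the car example. It assumes the sensor reads position and velocity directly, uses a unit time step, and, to match equations (7)–(9), omits the process-noise term that real implementations usually add to the predicted covariance; all numbers are invented.

```python
import numpy as np

F = np.array([[1.0, 1.0],        # position += velocity each step (unit time step)
              [0.0, 1.0]])
Sigma_z = np.array([[3.0, 0.0],  # sensor covariance for (position, velocity) readings
                    [0.0, 1.0]])

mu = np.array([0.0, 8.0])        # initial belief: position 0, velocity 8
Sigma = np.array([[10.0, 0.0],
                  [ 0.0, 5.0]])

# Invented (position, velocity) readings at t = 1, 2, 3 for a car moving ~10 units per step:
readings = [np.array([9.0, 10.2]), np.array([20.5, 9.8]), np.array([30.2, 10.1])]

for z in readings:
    # Predict: push the current belief through the transition matrix.
    mu_prior = F @ mu
    Sigma_prior = F @ Sigma @ F.T
    # Update: equations (7)-(9), with the predicted belief as the prior.
    K = Sigma_prior @ np.linalg.inv(Sigma_prior + Sigma_z)
    mu = mu_prior + K @ (z - mu_prior)
    Sigma = (np.eye(2) - K) @ Sigma_prior
    print(mu)    # estimates move toward position ~10, ~20, ~30 and velocity ~10
```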