This post will detail a first-principles derivation of the discrete-time Kalman Filter, beginning with some probabalistic background and ending with a linear-algebraic interpretation. We’ll also take a look at some practical considerations of applying the Kalman Filter - what if the state space is nonlinear? How are rotational states - for exampling, those belonging to \(SO(2)\) or \(SO(3)\) - handled?

The majority of this content comes from my personal notes; it’s a bit denser than other posts, but I find this the most satisfying derivation and insightful interpretation.

We can derive the Kalman Filter in continuous-time from a control theory perspective, but I find this discrete-time, probabalistic derivation to be a little more accessible. The resulting filter update equations are the same as the continuous time version.

Consider the following discrete time model in state-space form.

\[\begin{align*} x_{k+1} &= F_{k}x_{k} + G_{k}w_{k}\\ z_{k} &= y_{k} + v_{k} = H_{k}x_{k} + v_{k} \end{align*}\]where \(w_{k}\) denotes process noise and \(v_{k}\) denotes measurement noise. The
goal is to solve the *filtering* problem; that is, to produce an estimate at
time \(k\) of the state \(x_{k}\) given measurements \(z_{0}, z_{1}, \ldots, z_{k}\)
leading up to time \(k\).

We assume that the processes \(\{w_{k}\}\) and \(\{v_{k}\}\) are

- individually white, meaning that for any \(k \not= l\), \(v_{k}\) and \(v_{l}\) are independent of one another and \(w_{k}\) and \(w_{l}\) are independent of one another.
- individually zero mean, gaussian processes with known covariances \(Q_{k}\) and \(R_{k}\), respectively.
- independent processes

In summary, it is assumed that the processes \(\{w_{k}\}\) and \(\{v_{k}\}\) are zero mean, independent gaussian processes with covariances denoted by \(Q_{k}\) and \(R_{k}\), respectively.

The initial state of the system \(x_{0}\) is assumed to be a gaussian random variable with known mean and covariance given by

\[\begin{align*} E[x_{0}] &= \bar{x}_{0}\\ E\{[x_{0} - \bar{x}_{0}][x_{0} - \bar{x}_{0}]^{T}\} &= P_{0} \end{align*}\]It is also assumed that \(x_{0}\) is independent of \(w_{k}\) and \(v_{k}\) for all \(k\).

In this section, we’ll work through some necessary background material needed for the Kalman Filter derivation.

Let’s derive how means and covariances propagate through the dynamics of our system. The solution to the difference equation given above is

\[x_{k} = \Phi_{k,0}x_{0} + \sum_{l=0}^{k-1}\Phi_{k,l+1}G_{l}w_{l}\]where

\[\Phi_{k,l} = F_{k}F_{k-1}\cdots F_{l} \quad (k>l)\]is the state-transition matrix which, in the (homogeneous) case of no input (noise), transfers the state from \(x_{l}\) to \(x_{k}\). Note that \(\Phi_{k,l} = \Phi_{k,m}\Phi_{m,l}\) and \(\Phi_{k,k} = I\).

Since \(x_{0},w_{0},w_{1},\ldots,w_{k-1}\) are jointly gaussian random vectors, it follows that \(x_{k}\) - which is a linear combination of these vectors - is also gaussian. This is due to the fact that linear transformations of gaussian random variables are gaussian. In addition, note that \(\{x_{k}\}\) is a Markov process.

Taking the expected value of both sides of the solution to the original difference equation yields

\[E[x_{k}] = \Phi_{k,0}\bar{x}_{0}\]and therefore, from the definition of the state-transition matrix,

\[E[x_{k+1}] = F_{k}E[x_{k}] \quad \square\]Additionally, from the measurement model, we have

\[E[z_{k}] = H_{k}E[x_{k}]\]Propogating the covariance through the system is not as straightforward as for the mean. The covariance of \(x_{k}\) for \(k\geq l\) is

\[\begin{align*} P_{k,l} &= E\left\{[x_{k}-\bar{x}_{k}][x_{l}-\bar{x}_{l}]^{T}\right\}\\ &=E\left\{\left[\left(\Phi_{k,0}x_{0} + \sum_{m=0}^{k-1}\Phi_{k,m+1}G_{m}w_{m}\right) - \Phi_{k,0}\bar{x}_{0}\right] \left[\left(\Phi_{l,0}x_{0} + \sum_{n=0}^{l-1}\Phi_{l,n+1}G_{n}w_{n}\right) - \Phi_{l,0}\bar{x}_{0}\right]^{T}\right\}\\ &=E\left\{\left[\Phi_{k,0}(x_{0}-\bar{x}_{0}) + \sum_{m=0}^{k-1}\Phi_{k,m+1}G_{m}w_{m}\right] \left[\Phi_{l,0}(x_{0} - \bar{x}_{0}) + \sum_{n=0}^{l-1}\Phi_{l,n+1}G_{n}w_{n}\right]^{T}\right\} \end{align*}\]Now, since \(x_{0}-\bar{x}_{0},w_{0},\ldots,w_{k-1}\) are all independent, the cross terms of the expression inside the expectation disappear, leaving

\[P_{k,l} = \Phi_{k,0}E\{[x_{0}-\bar{x}_{0}][x_{0}-\bar{x}_{0}]^{T}\}\Phi_{l,0}^{T} + \sum_{m=0}^{l-1}\Phi_{k,m+1}G_{m}E[w_{m}w_{m}^{T}]G_{m}^{T}\Phi_{l,m+1}^{T}\\\]The second term in the above expression is the result of the whiteness of \(\{w_{k}\}\); in the product of the two sums in the original expression, only those terms for which \(m=n\) have nonzero expectations.

Substituting the definitions for \(P_{0}\) and \(Q_{m}\) into the preceding expression and using properties of the state transition matrix yields

\[P_{k,l} = \Phi_{k,l}\left\{\Phi_{l,0}P_{0}\Phi_{l,0}^{T} + \sum_{m=0}^{l-1}\Phi_{l,m+1}G_{m}Q_{m}G_{m}^{T}\Phi_{l,m+1}^{T}\right\}\]This is the general expression for the covariance of the state at different times. When \(k=l\) this simplifies to

\[P_{k} = P_{k,k} = \Phi_{k,0}P_{0}\Phi_{k,0}^{T} + \sum_{m=0}^{k-1}\Phi_{k,m+1}G_{m}Q_{m}G_{m}^{T}\Phi_{k,m+1}^{T}\]which implies that

\[P_{k,l} = \Phi_{k,l}P_{l}, \quad k\geq l\]Note also that \(P_{k,l} = P_{l,k}^{T}\) and thus

\[P_{k,l} = P_{k,k}\Phi_{l,k}^{T}, \quad k\leq l\]We now seek to derive a recursive equation for \(P_{k}=P_{k,k}\). Using the definition of the state-transition matrix, we can write the above as

\[P_{k} = F_{k-1}\left[\Phi_{k-1,0}P_{0}\Phi_{k,0}^{T}\right]F_{k-1}^{T} + \left[\sum_{m=0}^{k-2}\Phi_{k-1,m+1}G_{m}Q_{m}G_{m}^{T}\Phi_{k-1,m+1}^{T}\right] + \Phi_{k,k}G_{k}Q_{k}G_{k}^{T}\Phi_{k,k}^{T}\]where we have factored out the final term from the sum. Again using the definition of \(\Phi\) and the property that \(\Phi_{k,k}=I\) we have

\[P_{k} = F_{k-1}\left[\Phi_{k-1,0}P_{0}\Phi_{k,0}^{T} + \sum_{m=0}^{k-2}\Phi_{k-2,m+1}G_{m}Q_{m}G_{m}^{T}\Phi_{k-2,m+1}^{T}\right]F_{k-1}^{T} + G_{k}Q_{k}G_{k}^{T}\]The expression inside the parentheses is the definition of \(P_{k-1}\), leading to the difference equation

\[P_{k} = F_{k-1}P_{k-1}F_{k-1}^{T} + G_{k}Q_{k}G_{k}^{T}\]which recursively describes the time evolution of the covariance of the state. \(\square\)

The covariance of the measurement is computed similarly as follows.

\[\begin{align*} S_{k,l} &= E\left\{[z_{k}-\bar{z}_{k}][z_{l}-\bar{z}_{l}]^{T}\right\}\\ &= E\left\{[(H_{k}x_{k}+v_{k})-H_{k}\bar{x}_{k}][(H_{l}x_{l}+v_{l})-H_{l}\bar{x}_{l}]^{T}\right\}\\ &= E\left\{[(H_{k}(x_{k}-\bar{x}_{k})+v_{k}][H_{l}(x_{l}-\bar{x}_{l}) +v_{l}]^{T}\right\}\\ \end{align*}\]Note that \(\{v_{k}\}\) is independent of \(\{x_{k}-\bar{x}_{k}\}\) because the latter process is determined entirely by \(X_{0}\) and \(\{w_{k}\}\), both of which were assumed to be independent of \(\{v_{k}\}\). Thus the cross terms in the above expression disappear. This yields

\[S_{k,l} = H_{k}E\left\{[x_{k}-\bar{x}_{k}][x_{l}-\bar{x}_{l}]^{T}\right\}H_{l}^{T} + E[v_{k}v_{l}^{T}]\]and thus

\[S_{k,l} = H_{k}P_{k,l}H_{l}^{T} + R_{k}\delta_{kl}\]In addition, we can compute the covariance between \(x_{k}\) and \(z_{k}\) as

\[\begin{align*} cov(x_{k},z_{k}) &= E\{[x_{k}-\bar{x}_{k}][z_{k}-\bar{z}_{k}]^{T}\}\\ &= E\{[x_{k}-\bar{x}_{k}][H_{k}(x_{k}-\bar{x}_{k})]^{T}\}\\ &= E\{[x_{k}-\bar{x}_{k}][(x_{k}-\bar{x}_{k})]^{T}\}H_{k}^{T}\\ &= P_{k}H_{k}^{T} \end{align*}\]Likewise, we find that

\[cov(z_{k},x_{k}) = H_{k}P_{k}\]We have already stated that \(x_{k}\) is gaussian since it is a linear combination of \(x_{0}\) and \(w_{0},\ldots,w_{k-1}\) (all of which are gaussian); we also know that \(z_{k}\) is gaussian since \(x_{k}\) and \(v_{k}\) are gaussian.

Our goal in filtering, however, is to produce an estimate of \(x_{k}\) by conditioning on \(z_{k}\). That is, we seek to use the information provided by process \(\{z_{k}\}\) to estimate the state. What form does \(x_{k}\) conditioned on \(z_{k}\) have?

Let the pair of vectors \(X\) and \(Y\) be *jointly
gaussian*, ie the vector composed as \(Z=[X^{T} Y^{T}]^{T}\) is gauassian with
mean and covariance

It can be shown that \(X\) conditioned on \(Y\) is also gaussian with
*conditional mean* and *conditional covariance* given by

These parameters describe the distribution \(p(x|y)\) which, of course, is a
function; given a value of \(y\), \(x\) is regarded as a variable. Since we seek to
estimate \(x\) from \(y\), we might logically wish to determine the value of \(x\)
which maximizes this probability given \(Y=y\). This is the *maximum a
posteriori (MAP)* estimate of \(x\). This is not the only (nor necessarily the
best) way to choose \(x\), however!

Instead of seeking the estimate of \(x\) which maximizes the conditional
probability density given above, we wish to find the estimate \(\hat{x}\) which is
closest to the true value of \(x\) on average. That is, we seek the
*minimum variance estimate* \(\hat{x}\) for which

for all other vectors \(z\) which are determined in a different way from \(y\) than was \(x\) (but are not dependent on \(x\)). For any such \(z\) we can write

\[\begin{align*} E\{||X-z||^{2}|Y=y\} &= \int_{-\infty}^{\infty}(x-z)^{T}(x-z)p(x|y)dx\\ &= \int_{-\infty}^{\infty}x^{T}xp(x|y)dx - 2z^{T}\int_{-\infty}^{\infty}xp(x|y)dx + z^{T}z\int_{-\infty}^{\infty}p(x|y)dx\\ &= \left[z^{T} - \int_{-\infty}^{\infty}x^{T}p(x|y)dx\right]\left[z - \int_{-\infty}^{\infty}xp(x|y)dx\right] + \cdots\\ &+ \int_{-\infty}^{\infty}x^{T}xp(x|y)dx + \left|\left|\int_{-\infty}^{\infty}xp(x|y)dx\right|\right|^{2} \end{align*}\]Clearly, the above expression is minimized when the first term disappears; this follows from the choice

\[z = \int_{-\infty}^{\infty}xp(x|y)dx\]Hence, the minimum variance estimate of \(x\) is the conditional mean estimate

\[\hat{x} = E\{X|Y=y\}\]With this choice, the variance becomes

\[\begin{align*} E\{||X-z||^{2}|Y=y\} &= \int_{-\infty}^{\infty}x^{T}xp(x|y)dx + \left|\left|\int_{-\infty}^{\infty}xp(x|y)dx\right|\right|^{2}\\ &= E\{||X||^{2}|Y=y\} - ||\hat{x}||^{2} \end{align*}\]The variance is also given by the trace of the corresponding error covariance matrix which, since the estimate is simply the conditional mean, is the covariance matrix

\[\hat{\Sigma} = \Sigma_{xx} - \Sigma_{xx}\Sigma_{yy}^{-1}\Sigma_{yx}\]associated with the conditional density.

Note that an estimate is termed *unbiased* when the expected value of the
the estimation error \(e = x-\hat{x}\) given \(y\) is zero. Since the minimum
variance estimate is equal to the conditional mean, this estimate must be
unbiased.

The first-principles derivation of the Kalman Filter follows directly from the results of previous sections. We seek to form an estimate \(\hat{x}_{k}\) of the state of the system at time \(k\) from the measurements \(z_{0},\ldots,z_{k}\). We begin with the initial state and use the results of preceding sections to develop general formulas.

The initial state \(x_{0}\) is assumed to have known mean \(\bar{x}_{0} = \hat{x}_{0}^{-}\) and covariance \(P_{0}^{-}\). As noted previously, \(x_{k}\) and \(z_{k}\) are jointly gaussian for all \(k\) with mean and covariance

\[\begin{align*} mean = \begin{pmatrix}\bar{x}_{k}\\ H_{k}\bar{x}_{k} \end{pmatrix} \quad & cov = \begin{pmatrix} P_{k} & P_{k}H_{k}^{T}\\ H_{k}P_{k} & H_{k}P_{k}H_{k}^{T}+R_{k} \end{pmatrix} \end{align*}\]Therefore for \(k=0\), \(x_{0}\) conditioned on \(z_{0}\) is also gaussian with mean

\[\hat{x}_{0}^{+} = \hat{x}_{0}^{-} + P_{0}H_{0}^{T}\left(H_{0}P_{0}^{-}H_{0}^{T}+R_{0}\right)^{-1}(z_{0}-H_{0}\hat{x}_{0}^{-})\]and covariance

\[P_{0}^{+} = P_{0}^{-} - P_{0}^{-}H_{0}^{T}\left(H_{0}P_{0}^{-}H_{0}^{T}+R_{0}\right)^{-1}H_{0}P_{0}^{-}\]Based on how how the mean and covariance of the state propagate through the system, it follows that \(x_{1}\) conditioned on \(z_{0}\) is gaussian with mean and covariance

\[\begin{align*} \hat{x}_{1}^{-} = F_{0}\hat{x}_{0}^{+} \quad & \mbox{and} \quad P_{1}^{-} = F_{0}P_{0}^{+}F_{0}^{T} + G_{0}Q_{0}G_{0}^{T} \end{align*}\]Now, the random variable \([x_{1}^{T} z_{1}^{T}]\) conditioned on \(z_{0}\) is gaussian with mean and covariance

\[\begin{align*} \begin{pmatrix}\hat{x}_{1}^{-}\\ H_{1}\hat{x}_{1}^{-}\end{pmatrix} \quad & \mbox{and} \quad \begin{pmatrix} P_{1}^{-} & P_{1}^{-}H_{1}^{T}\\ H_{1}P_{1}^{-} & H_{1}P_{1}^{-}H_{1}^{T}+R_{1} \end{pmatrix} \end{align*}\]It follows that \(x_{1}\) conditioned on \(z_{0},z_{1}\) then has mean

\[\hat{x}_{1}^{+} = \hat{x}_{1}^{-} + P_{1}H_{1}^{T}\left(H_{1}P_{1}^{-}H_{1}^{T}+R_{1}\right)^{-1}(z_{1}-H_{1}\hat{x}_{1}^{-})\]and covariance

\[P_{1}^{+} = P_{1}^{-} - P_{1}^{-}H_{1}^{T}\left(H_{1}P_{1}^{-}H_{1}^{T}+R_{1}\right)^{-1}H_{1}P_{1}^{-}\]In general, then, we can write in the same way that

\[\begin{align*} \hat{x}_{k}^{-} &= F_{k-1}\hat{x}_{k-1}^{+}\\ P_{k}^{-} &= F_{k-1}P_{k-1}^{+}F_{k-1}^{T} + G_{k-1}Q_{k-1}G_{k-1}^{T}\\ K_{k} &= P_{k}H_{k}^{T}\left(H_{k}P_{k}^{-}H_{k}^{T}+R_{k}\right)^{-1}\\ \hat{x}_{k}^{+} &= \hat{x}_{k}^{-} + K_{k}(z_{k}-H_{k}\hat{x}_{k}^{-})\\ P_{k}^{+} &= P_{k}^{-} - K_{k}H_{k}P_{k}^{-} \end{align*}\]The above are the discrete-time Kalman Filter update equations. \(\square\)

Recall the problem of solving \(Ax=b\) given matrix \(A\in R^{m\times n}\) and vector \(b\in R^{m}\) for the solution vector \(x\in R^{n}\). When \(A\) is *invertible*, we can of course compute \(x=A^{-1}b\) - however, when \(A\) is *singular*, \(A^{-1}\) does not exist and we must seek an alternative solution.

Naturally, we seek a solution vector \(x\) for which \(Ax\) is **as close as possible** to \(b\). In other words, we wish to solve:

where \(\|v\|\) denotes the \(L^{2}\) or *Euclidean* norm of \(v\). As shown in a previous post, we can expand out the expression in the minimization to

We then solve for the \(x\) which minimizes the expression by setting its *gradient* equal to zero:

which leads to the least-squares solution

\[x = (A^{T}A)^{-1}A^{T}b\quad \square\]The matrix \((A^{T}A)^{-1}A^{T}\) thus replaces \(A^{-1}\) in solving \(Ax=b\) - we call this matrix the **left pseudoinverse** of \(A\). This solution is clearly only valid when the matrix \((A^{T}A)^{-1}\) is invertible, which is the case as long as \(A\) has **full column rank**. Typically, this is true of problems where \(m>n\) which we call **overdetermined** because there are more equations than unknowns.

On the other hand, if the matrix \(A\) is has **full row rank** (typically when \(n>m\)) then we call the system **underdetermined** as there are more unknowns than equations. In this case, there exist an *infinite* number of solution vectors \(x\) which satisfy \(Ax=b\) exactly. Which one, then, do we choose? A natural choice is the solution which has the smallest norm \(\|x\|\); for example, if we’re solving for a control input vector then we’d certainly want to use as little effort as possible. The minimization thus becomes:

This itself is a least-squares problem, but with *equality constraints* to be imposed. Equality-constrained least-squares can be solved using a *Lagrange multiplier* vector \(\lambda\); essentially, we solve the simultaneous equations:

for \(x\) (and \(\lambda\)), with the result

\[x = A^{T}(AA^{T})^{-1}b\]Similar to the overconstrained case, in the underconstrained case the matrix \(A^{T}(AA^{T})^{-1}\) is called the **right pseudoinverse** of \(A\).

Returning to the overconstrained case, recall that we chose to use the \(L^{2}\) (Euclidean) norm in the minimization. However, we can choose *any* valid 2-norm instead, for example one which weights different dimensions of the space more or less:

where \(W\) is a positive-definite weighting matrix (often diagonal) such that \(\|v\|_{W}^{2} = v^{T}Wv\). The purpose of this modification is to emphasize different dimensions of the data forming the problem, biasing the least-squares solution to minimize errors in these dimensions over others. Without going into detail, the solution has the form

\[x = (A^{T}WA)^{-1}A^{T}Wb\]Now consider the case in which \(A\) has **neither full row nor column rank** - we seek a least-squares solution, but there are inifinite such approximate solutions. We modify the least-squares problem to have the form

where \(\Lambda\) is a suitable regularization matrix (often chosen as \(\Lambda = \alpha I\)). The solution to this problem has the form

\[x = (A^{T}A + \Lambda^{T}\Lambda)^{-1}A^{T}b\]The purpose of regularization in regression is often to bias the solution vector \(x\) toward zero, which is achieved by shifting the eigenvalues of \(A^{T}A\) to improve its *conditioning*. A poorly-conditioned matrix causes the solution to “blow up” without regularization.

Recall that the Kalman Filter gain matrix has the form

\[K = PH^{T}\left(HPH^{T}+R\right)^{-1}\]which, as we’ve seen above, is a **right pseudoinverse** solving the relationship \(Hx=z\) between the state \(x\) and the measurement \(z\). The solution uses \(P^{1/2}\) as a **weighting matrix** and \(R^{1/2}\) as a **regularization matrix**.

By weighting the least-squares norm using the state covariance matrix \(P\), **states with higher variance have more impact on the gain**; small changes in these states cause larger changes in the measurement.

By regularizing the problem using the measurement noise matrix \(R\), **states associated with measurements having higher noise variance have less impact on the gain**; high-variance measurements thus cause smaller changes in the state during Kalman updates.

Feel free to share!