The most fundamental problem in linear algebra is also seemingly the most innocuous: solve the equation $Ax = b$ for the unknown vector $x$ given a matrix $A$ and a vector $b$.

Yet, by examining this one problem for all different kinds of matrices $A$, you can teach an entire course spanning everything from simple Gaussian elimination to eigenvalues, SVD, PCA and beyond. If you’re unsure what these words mean and haven’t checked out Gilbert Strang’s text and accompanying online course Introduction to Linear Algebra, I *highly* recommend you take the time to work through it. The intuition I gained from it was truly transformative in my robotics studies!

Let’s start simple. *Reaaaally* simple. Consider the following two-dimensional problem:

$$\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}$$

Viewing matrix multiplication as a linear combination of the columns of $A$, we rewrite this as

$$x_1 a_1 + x_2 a_2 = b$$

In other words, our matrix $A$ is composed of column vectors $a_1$ and $a_2$ in $\mathbb{R}^2$ (the two-dimensional vector space of real numbers), and we’re looking for scalar multipliers $x_1$ and $x_2$ which scale and add these column vectors into the vector $b$. This is illustrated geometrically below.

It’s clear that with the right combination of $x_1$ and $x_2$ we can produce $b$; the solution vector $x$ has its effect illustrated below:
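If you want to check a concrete instance numerically, numpy solves this kind of small system directly. The numbers below are my own hypothetical example, not the ones from the figure:

```python
import numpy as np

# Hypothetical example: columns a1 = [1, 0] and a2 = [1, 2]
A = np.array([[1.0, 1.0],
              [0.0, 2.0]])
b = np.array([3.0, 2.0])

# Solve A x = b for the scaling coefficients x1, x2
x = np.linalg.solve(A, b)
print(x)

# Verify: x1 * a1 + x2 * a2 reproduces b
print(x[0] * A[:, 0] + x[1] * A[:, 1])
```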

However, this is just one possible problem for which a graphical solution is easy - but is there always a solution to this form of problem? Take a look at the linear system illustrated below:

In this case, the columns of $A$ are multiples of one another - in linear algebra terms, they are *linearly dependent*. This means that the set of vectors we can reach via a combination of $a_1$ and $a_2$ is limited to the red line. Unless our $b$ happens to fall along this line, there is **no solution**.
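Numerically, linear dependence shows up as a rank-deficient (singular) matrix, and `np.linalg.solve` refuses to solve the system. Again the specific numbers are a hypothetical example:

```python
import numpy as np

# Columns are multiples of one another: a2 = 2 * a1
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])
print(np.linalg.matrix_rank(A))  # 1, not 2: the columns are linearly dependent

b = np.array([1.0, 0.0])  # b does not lie on the line spanned by the columns
try:
    np.linalg.solve(A, b)
except np.linalg.LinAlgError as err:
    print("no solution:", err)
```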

Let’s take it a step further - this situation is the same as if we only had *one* of the column vectors in $A$, making $Ax = b$ become:

$$x a = b$$

as shown below:

We obviously can’t solve $xa = b$ *exactly* because we can’t scale $a$ into this $b$, but we can solve it *approximately* by finding a scalar $\hat{x}$ which gets us to $p = \hat{x}a$, a vector *as close as possible* to $b$ along the direction of $a$.

That begs the question: what does it mean for $p$ to be *as close as possible* to $b$? Naturally, we want the **distance** between the true $b$ and the approximate $p$ to be minimized. This begs another question: how do we define distance between vectors? There’s actually a lot to explain here which we won’t go into, but suffice it to say that there are *infinite* ways to define what’s called a **distance metric** on a vector space - the most popular, which you’re probably familiar with, is the **Euclidean distance**:

$$d(u, v) = \sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2}$$

for our specific case in $\mathbb{R}^2$, or more generally:

$$d(u, v) = \|u - v\|_2 = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}$$

for vectors in $\mathbb{R}^n$. This is also known as the $\ell_2$ distance, the Euclidean norm or simply the 2-norm, and reflects the way we naturally think about distance in a physical space. Since this norm is the square root of the sum of squares of differences in coordinates, minimizing it to solve $Ax = b$ approximately gives us the name of the **least squares** solution.
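As a quick numerical check (with a hypothetical 3-4-5 example), the hand-computed square root of summed squares agrees with numpy's built-in norm, which defaults to the 2-norm:

```python
import numpy as np

u = np.array([3.0, 4.0])
v = np.array([0.0, 0.0])

# Euclidean (2-norm) distance: square root of the sum of squared differences
d_manual = np.sqrt(np.sum((u - v) ** 2))
d_numpy = np.linalg.norm(u - v)  # np.linalg.norm defaults to the 2-norm
print(d_manual, d_numpy)  # both 5.0
```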

The general case is the $\ell_p$ distance/norm:

$$\|v\|_p = \left( \sum_{i=1}^{n} |v_i|^p \right)^{1/p}$$

where the notation $\|v\|_p$ denotes the *p-norm* of the vector $v$. See below for more details on what this general norm means, or skip ahead to the least-squares solution in 2d.

## A detour on vector norms

Remember how I mentioned there are *infinite* possible distance metrics? Well, each metric is defined by its *norm*, which is the distance function for the space and must satisfy a number of conditions to be a proper metric. The class of $\ell_p$ norms is just one infinite set within the infinite possible norms, but certainly the most common. Here, the scalar $p$ is arbitrary - for the Euclidean norm it’s $p = 2$, but for example, the so-called *Manhattan norm* has $p = 1$. The choice of $p$ basically warps the sense of distance in a space, which can be useful for different applications. For example, the $\ell_2$ norm amplifies the distance of outliers in a set of points due to the quadratic term, making it a good choice in clustering over the $\ell_1$ norm - which just takes the sum of absolute values of all vector entries, based on the above general form - for very scattered data. In the example of penalized regression problems, the 1-norm is used to impose an *L1 penalty* which encourages solution vectors having many zeros. And so on.

Other than the 1-norm and 2-norm, the $\ell_\infty$-norm is also common; in the limit $p \to \infty$, this norm is simply a *max* operation:

$$\|v\|_\infty = \max_i |v_i|$$

Below are lines of constant norm for distance in 2d space, for the common 1, 2, and $\infty$ norms.

These three vector norms are likely the only ones you’ll encounter in practice. Without going into detail, though, the concept of a distance metric is more general than just vector norms. How do you compute the distance between two matrices? How about the distance between two rotations? These are also possible, but we’ve detoured far enough for now.
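All three of these norms are available through the `ord` argument of `np.linalg.norm`; here's a quick comparison on a hypothetical vector:

```python
import numpy as np

v = np.array([3.0, -4.0])

n1 = np.linalg.norm(v, 1)        # 1-norm (Manhattan): |3| + |-4| = 7
n2 = np.linalg.norm(v, 2)        # 2-norm (Euclidean): sqrt(9 + 16) = 5
ninf = np.linalg.norm(v, np.inf) # inf-norm (max): max(|3|, |-4|) = 4
print(n1, n2, ninf)  # 7.0 5.0 4.0
```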

Recall that we were looking for an *approximate* solution to the problem $xa = b$ for the case in which $b$ doesn’t lie along the direction of $a$, and argued that the best *approximate* solution $p = \hat{x}a$ minimizes the *distance* to $b$.

Another way to frame the problem takes a geometric approach. The approximate solution $p$ is the *projection* of $b$ along the direction of $a$; minimizing the distance between $p$ and $b$ is the same as minimizing the *projection error*, $e = b - p$.

How do we minimize $e$? Geometrically, you can think of it this way: the projection error should be in a direction *perpendicular* to the direction of $a$. This means the approximate solution captures as much information as possible along the direction of $a$, so whatever’s “left over”, $e = b - \hat{x}a$, must be in a direction perpendicular to $a$, as depicted below:

To write this out mathematically, we’ll need to use the *dot product* between vectors; see below for more details, or skip past to the solution.

## The dot (inner) product

What does it mean for two vectors to be perpendicular, or in the language of linear algebra, *orthogonal*? First we need to introduce the *dot product* (or *inner* product in higher dimensions) between two vectors:

$$u \cdot v = \sum_{i=1}^{n} u_i v_i$$

We can also write the dot product using matrix math:

$$u \cdot v = u^T v$$

The dot product allows us to compute the angle $\theta$ between two vectors:

$$\cos \theta = \frac{u^T v}{\|u\| \|v\|}$$

Since $\cos \theta \rightarrow 0$ as $\theta \rightarrow 90^{\circ}$, we see that the dot product goes to zero when the two vectors are orthogonal.

To find $\hat{x}$ such that the projection error $e = b - \hat{x}a$ is orthogonal to $a$, we require

$$a^T (b - \hat{x} a) = 0 \quad \Rightarrow \quad \hat{x} = \frac{a^T b}{a^T a}$$

Using this approximate solution, we can compute the projected point as follows:

$$p = a \hat{x} = a \frac{a^T b}{a^T a} = \frac{a a^T}{a^T a} b$$

By rearranging the solution, we can form the *projection matrix* $P = \frac{a a^T}{a^T a}$ which projects *any* vector $b$ along $a$! We can thus compute $P$ once and use it for any $b$ that comes along; this is a common “trick” in practice for problems which retain the same structure (matrix $A$) but potentially different *right-hand-side* vectors $b$.
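Here's a small sketch of this (with a hypothetical direction $a$): build $P$ once, then project. A defining property worth checking is that $P$ is idempotent - projecting a second time changes nothing, since $p$ already lies along $a$:

```python
import numpy as np

a = np.array([[1.0], [2.0]])   # column vector: the direction to project onto

# Projection matrix P = a a^T / (a^T a)
P = (a @ a.T) / (a.T @ a)

b = np.array([[3.0], [1.0]])
p = P @ b                      # projection of b along a

print(np.allclose(P @ P, P))   # True: P is idempotent
print(p.ravel())
```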

What does the projection error vector look like in terms of the solution?

$$e = b - p = b - P b = (I - P) b$$

We thus find the *complementary* projection matrix $I - P$ which projects $b$ onto the direction orthogonal to $a$, giving us whatever’s “left over”. These are useful projections to be aware of in practice.
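Continuing the same hypothetical example, the complementary projection produces an error that is orthogonal to $a$, and the two pieces reassemble $b$:

```python
import numpy as np

a = np.array([[1.0], [2.0]])
P = (a @ a.T) / (a.T @ a)      # projects onto the line along a
I = np.eye(2)

b = np.array([[3.0], [1.0]])
e = (I - P) @ b                # the "left over" component of b

print((a.T @ e).item())        # ~0: the error is orthogonal to a
print(np.allclose(P @ b + e, b))  # True: projection + error reassemble b
```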

Now let’s look at the three-dimensional case of $Ax = b$ with $A \in \mathbb{R}^{3 \times 2}$, so we only have two *linearly-independent* column vectors:

$$A x = \begin{bmatrix} a_1 & a_2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = b$$

which multiplies (viewing matrix multiplication as a linear combination of columns) to

$$x_1 a_1 + x_2 a_2 = b$$

This is depicted in the figure below.

As in the two-dimensional case, we seek a linear combination of the columns of $A$ - denoted $a_1$ and $a_2$ - which is equal to the vector $b$. These column vectors are said to *span* the red plane (which extends infinitely in all directions) - this means that every point on the plane can be written as a *linear combination* of $a_1$ and $a_2$. It follows that any point outside the plane - for example, $b$ here - cannot be written as such, meaning that there is no exact solution - only an *approximate* one. In the previous case, we projected a point onto a line defined by a vector $a$; now, we’re projecting a point onto a *plane* defined by the columns of a matrix $A$. You can start to see how this problem scales to higher dimensions.

We seek an approximate solution $\hat{x}$ whose projection error $e = b - A\hat{x}$ is orthogonal to both $a_1$ and $a_2$, and thus orthogonal or *normal* to the plane as depicted. The problem is thus to find $\hat{x}$ such that

$$a_1^T e = 0 \qquad a_2^T e = 0$$

which leads to the simultaneous equations

$$a_1^T (b - A\hat{x}) = 0 \qquad a_2^T (b - A\hat{x}) = 0$$

which we can combine and solve as follows:

$$A^T (b - A\hat{x}) = 0 \quad \Rightarrow \quad A^T A \hat{x} = A^T b \quad \Rightarrow \quad \hat{x} = (A^T A)^{-1} A^T b$$

This form of solution is known as the *normal equations* because the projection error is *normal* to the plane. It’s also known as the *least-squares* solution to $Ax = b$ because it minimizes the sum of squares of the projection error, as shown in the next section.
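A short sketch of the normal equations in action, on a hypothetical $3 \times 2$ system; numpy's own least-squares solver (`np.linalg.lstsq`) produces the same answer, and the projection error indeed comes out normal to the column plane:

```python
import numpy as np

# Hypothetical tall system: 3 equations, 2 unknowns (no exact solution in general)
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 2.0, 0.0])

# Normal equations: solve A^T A x = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# numpy's built-in least-squares solver agrees
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_hat, x_lstsq))  # True

# The projection error b - A x_hat is orthogonal to both columns of A
print(A.T @ (b - A @ x_hat))        # ~[0, 0]
```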

We’ve now derived the normal equations in two dimensions (projection onto a *line*, where $A$ had size $2 \times 1$) and three dimensions (projection onto a *plane*, where $A$ had size $3 \times 2$). What about higher dimensions? In these cases of $A \in \mathbb{R}^{m \times n}$ with $m > n$, we seek a projection onto a *hyperplane* of arbitrary dimension. While we can’t draw these situations, the **solution has exactly the same form**; let’s approach the arbitrary dimension case using calculus to solve it.

Earlier, we discussed how to define distance in a vector space, and settled on the $\ell_2$ or *Euclidean* norm. For the problem $Ax = b$ with arbitrary $A \in \mathbb{R}^{m \times n}$, we thus seek an $x$ which minimizes

$$\|b - Ax\|_2$$

To put the problem in a nicer form, we can instead minimize the *squared* distance, since the $x$ minimizing distance will certainly minimize squared distance as well. We thus wish to minimize:

$$\|b - Ax\|_2^2 = (b - Ax)^T (b - Ax) = b^T b - 2 x^T A^T b + x^T A^T A x$$

where we multiplied out the problem and used the fact that $b^T A x = x^T A^T b$ (a scalar is equal to its own transpose). From calculus, we know that to minimize or maximize a function with respect to $x$, we need to find the point(s) at which its *gradient* (generalized derivative), denoted $\nabla_x$, is zero. We thus set $\nabla_x \|b - Ax\|_2^2 = 0$ and solve for $x$:

Here I’m going to defer to the fantastic Matrix Cookbook to explain how to take the gradients above. The result is

$$\nabla_x \left( b^T b - 2 x^T A^T b + x^T A^T A x \right) = -2 A^T b + 2 A^T A x$$

since $\nabla_x (x^T A^T A x) = 2 A^T A x$ because $A^T A$ is symmetric. Setting the above equal to zero and solving for the minimum $\hat{x}$:

$$A^T A \hat{x} = A^T b \quad \Rightarrow \quad \hat{x} = (A^T A)^{-1} A^T b$$

This is identical to the solution we obtained above by inspecting the geometry of the lower-dimensional case.
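We can verify numerically that the gradient really does vanish at the normal-equations solution; the overdetermined system below is a hypothetical example:

```python
import numpy as np

# Hypothetical overdetermined system (3 equations, 2 unknowns)
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# Normal-equations minimizer
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# The gradient -2 A^T b + 2 A^T A x vanishes at the minimizer
grad = -2 * A.T @ b + 2 * A.T @ A @ x_hat
print(np.allclose(grad, 0.0))  # True
```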

While the previous section shows the typical method for deriving the normal equations, it requires some knowledge of multivariable calculus. I find linear algebra to be much more intuitive when understood geometrically, so I wanted to present the two approaches side-by-side. Again, for a fantastic intuitive introduction to linear algebra which relies heavily on geometry, be sure to read through Gilbert Strang’s textbook and/or follow the accompanying open-source course.

Written on December 5th, 2019 by Nick Rotella