Gottlieb t Freitag

The magic that is least squares

October 29, 2019 - lutz

If you’ve had training in math, statistics, computer science or any other number crunching field you already know linear regression. Linear regression is just one tool that employs a wonderful technique: least squares

Least squares appears in many applications (more about that in posts to come) and I’d love to point out what a marvelous tool it is by the example of linear regression. Once you’re familiar with the underlying least squares approach, you’ll see it everywhere!

Anyways, this post is more on the mathy side… I hope somebody finds it useful.


Befor I I jump into the super mathy derivation, I’d like to showcase the workings of linear regression by means of an example. But right after the example based introduction I get to the nerdy detailed math. Yey!

In a Nutshell

Think about having a function f(x) that produces a number given x . The function has some intrinsic parameters that you would like to tune in order to have the function pass through some points. An example might be:

f(x) = a + bx + cx^2

This function has three parameters: a , b , c . With linear regression we can find optimal choices for those parameters given a set of points P_n = (x_n, y_n) . To do this we need to calculate the derivatives of \frac{d}{d \beta_k}f where \beta_k represents the k -th parameter of f . That means we have to calculate as many derivatives of f as there are parameters. When looking at the example above, this means: \beta_1 = a , \beta_1 = b , \beta_3 = c . We can now arrange the derivatives in a column vector where the k -th column contains the derivative of f by its k -th parameter and call this vector \hat{X} :

% <![CDATA[ \hat{X} = \left( \begin{array}[cccc] {} \frac{d}{d \beta_1}f & \frac{d}{d \beta_2}f & \cdots & \frac{d}{d \beta_K}f\end{array} \right) %]]>

With the example function above this becomes:

% <![CDATA[ \hat{X} = \left( \begin{array}[cccc] {} \frac{d}{d \beta_2}f & \frac{d}{d \beta_2}f & \frac{d}{d \beta_3}f\end{array} \right) = \left( \begin{array}[cccc] {} 1 & x & x^2\end{array} \right) %]]>

For every point we want our function to pass through we can generate an X vector. Lets assume we want to tweak the parameters of our function above in a way that it passes though the points: P_1 = (-1, 1)^T , P_1 = (0, 0)^T , P_1 = (1, 1)^T . For every point P_n we can calculate a corresponding \hat{X}_n and simply vertically stack those together to get X :

% <![CDATA[ X = \left( \begin{array}[c] {} \hat{X}_1 \\ \hat{X}_2 \\ \hat{X}_3 \end{array} \right) = \left( \begin{array}[ccc] {} \frac{d}{da}f(-1) & \frac{d}{db}f(-1) & \frac{d}{dc}f(-1) \\ \frac{d}{da}f( 0) & \frac{d}{db}f( 0) & \frac{d}{dc}f( 0) \\ \frac{d}{da}f( 1) & \frac{d}{db}f( 1) & \frac{d}{dc}f( 1) \end{array} \right) = \left( \begin{array}[ccc] {} 1 & -1 & 1 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \end{array} \right) %]]>

enter the magic formula that will be derived further below

\beta = (X^T X)^{-1}X^T y

The vector \beta contains the parameters of our function, and y is simply all y -components of the Points vertically stacked together: % <![CDATA[ y = \left( \begin{array}[cccc] {} y_1 & y_2 & \cdots & y_n \end{array} \right)^T %]]> In our example this means % <![CDATA[ y = \left( \begin{array}[ccc] {} 1 & 0 & 1 \end{array} \right)^T %]]>

% <![CDATA[ \beta = \left( \left( \begin{array}[ccc] {} 1 & 1 & 1 \\ -1 & 0 & 1 \\ 1 & 0 & 1 \end{array} \right) \left( \begin{array}[ccc] {} 1 & -1 & 1 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \end{array} \right) \right)^{-1} \left( \begin{array}[ccc] {} 1 & 1 & 1 \\ -1 & 0 & 1 \\ 1 & 0 & 1 \end{array} \right) \left( \begin{array}[c] {} 1 \\ 0 \\ 1 \end{array} \right) = \left( \begin{array}[c] {} 0 \\ .5 \\ 1 \end{array} \right) %]]>

This means that the quadratic function that passes through the points P_1 = (-1, 1)^T , P_2 = (0, 0)^T , P_3 = (1, 1)^T is:

f(x) = .5x + x^2

I hope this example driven explanation helps in following the derivation below. However, it should also raise some questions like: Where does that magic formula come from?! So here is the more abstract derivation:

The Proper Derivation

First we need to change the definition of our function f and of the points P_i a bit:

f: \mathbb{R}^i \rightarrow \mathbb{R}^j

x is now a vector and f generates a vector. Likewise the x and y components of the points have to be vectors as well: P_k = \left(\vec{x}, \vec{y} \right) where \vec{x} \in \mathbb{R}^i and accordingly \vec{y} \in \mathbb{R}^j .

The function that passes through the given points, thus minimizing the distance to the points, is the function we are looking for. To express this we introduce a square loss function \mathcal{L}_d(\beta) that gives a value for any parameterized f and set of points by summing the squared distances together where the function should go through ( \vec{y} ) and where it actually went through ( f_{\beta}(\vec{x}) ). We actually want to minimize this square function, hence the name least squares!

\mathcal{L}_d(\beta) = \left\| \vec{y} - f_{\beta}(\vec{x}) \right\|^2_C

The notation \left\| a \right\|^2_B is a shorthand for a^T B a which is the weighted (by B ) sum of squares of the elements of a . For (squared) weighting matrices to make sense, they have to be symmetric and positive semi-definite, i.e., a weighting matrix has to be construcible by squaring two matrices (e.g., B=A^T A where A is a root of B ). The weighting matrix C can be used to weight dimensions along which errors are less acceptable than along others. I’d bet that in at least 99% of all usages of linear regression C will always be I . However, in situations where \beta shall remain in a certain space, C can be employed to express exactly that. More about this further below. Also, I’d like to introduce a similar loss function that minimizes the magnitude of \beta :

\mathcal{L}_{m}(\beta) = \left\| \beta \right\|^2_W

W is the weighting matrix that can be used to “regulate” or “dampen” the calculation of \beta . This is especially necessary when there are more parameters in f than there are points to fit the function. In that case a multitude of solutions for \beta exist. W expresses parts of \beta which can grow big or shall remain small (and combinations thereof).

The sum of both loss functions shall be the loss function from where the derivation shall be done. Note that when combining the loss terms, the weighting matrices also weight against each other. Since the overall goal is to fit f into a set of points, C should be chosen to be of much higher magnitude than W .

% <![CDATA[ \begin{align} \mathcal{L}(\beta) &= \mathcal{L}_{d}(\beta) + \mathcal{L}_{m}(\beta) \\ \end{align} %]]>

In general it is not possible to derive a solution for \beta when starting with the previous equation. E.g., it is not possible to directly find a \beta when \frac{d}{d \beta} f is still dependent on \beta ; when f is non-trivial with respect to \mathcal{L} . In most usages of linear regression this seems almost esoteric but for the sake of generalization and to show the similarity of linear regression to techniques employed in other situations I’d like to derive the linear regression formula where f can be nontrivial. This means that there is no closed form solution for \beta but we have to employ a gradient descent along the locally linearly derived representation of f at \beta : \frac{\delta}{\delta \beta}f(x) . In practice this means for sufficiently small \Delta\beta we assume the following:

f_{\beta + \Delta\beta}(x) = f_{\beta}(x) + \frac{\delta}{\delta \beta}f(x) \Delta\beta

The loss function shall now qualify \Delta\beta instead of \beta . Note that when f is trivial, then \frac{\delta}{\delta \beta}f(x) = \frac{d}{d \beta}f(x) . \frac{\delta}{\delta \beta}f(x) has the form of a matrix and (for a trivial f ) is in fact the X matrix introduced above! However, for a non-trivial f , X is actually dependent on \beta , therefore in the following the matrix X_{\beta} will be used to indicate this dependency. Also note that when f is trivial and \beta=\vec{0} , then solving the loss function for \Delta\beta is actually the same thing as solving for \beta directly. Anyways, solving for \Delta\beta instead of \beta means that instead of asking “what’s \beta ?” we ask: “what change to \beta minimizes the loss?”. The difference is subtle but there is a difference!

% <![CDATA[ \begin{align} \mathcal{L}(\Delta\beta) &= \left\| \vec{y} - f_{\beta+\Delta\beta}(\vec{x}) \right\|^2_C + \left\| \Delta\beta \right\|^2_W \\ &= \left\| \vec{y} - f_{\beta}(x) - \frac{\delta}{\delta \beta}f(x) \Delta\beta \right\|^2_C + \left\| \Delta\beta \right\|^2_W \\ &= \left\| \vec{y} - f_{\beta}(x) - X_{\beta} \Delta\beta \right\|^2_C + \left\| \Delta\beta \right\|^2_W \\ \end{align} %]]>

The term \vec{y} - f_{\beta}(x) can be described as the error vector e . It’s the difference between the desired values (for every Point) of f and the values of f_{\beta} at the x components of the points.

% <![CDATA[ \begin{align} \mathcal{L}(\Delta\beta) &= \left\| e - X_{\beta} \Delta\beta \right\|^2_C + \left\| \Delta\beta \right\|^2_W \\ &= e^T C e - e^T C X_{\beta} \Delta\beta - (X_{\beta} \Delta\beta)^T C e + (X_{\beta} \Delta\beta)^T C X_{\beta} \Delta\beta + \Delta\beta^T W \Delta\beta\\ &= e^T C e - 2 e^T C X_{\beta} \Delta\beta + \Delta\beta^T X_{\beta}^T C X_{\beta} \Delta\beta + \Delta\beta^T W \Delta\beta\\ \end{align} %]]>

Since the weighting matrices are symmetric the following identity holds:

e^T C X_{\beta} \Delta\beta = (X_{\beta} \Delta\beta)^T C e

To find the \Delta\beta that minimizes \mathcal{L}(\Delta\beta) we have to derive \mathcal{L}(\Delta\beta) for \Delta\beta and find the zero crossing. Note that this is actually the very same process you probably did in calculus at school when you had to look for extremes of squared functions.

% <![CDATA[ \begin{align} 0 &= \frac{d}{d \beta} \mathcal{L}(\Delta\beta) \\ 0 &= 2 e^T C X_{\beta} - 2 \Delta\beta^T X_{\beta}^T C X_{\beta} + 2 \Delta\beta^T W \\ 0 &= e^T C X_{\beta} - \Delta\beta^T (X_{\beta}^T C X_{\beta} + W) \\ e^T C X_{\beta} &= \Delta\beta^T (X_{\beta}^T C X_{\beta} + W) \\ X_{\beta}^T C^T e &= (X_{\beta}^T C X_{\beta} + W)^T \Delta\beta\\ X_{\beta}^T C^T e &= (X_{\beta}^T C^T X_{\beta}^T + W^T) \Delta\beta\\ \Delta\beta &= (X_{\beta}^T C^T X_{\beta}^T + W^T)^{-1} X_{\beta}^T C^T e\\ \end{align} %]]>

And that is practically everything there is to it!


At any least squares formula you’ll find an expression that resembles either (A^T A)^{-1}A^T or A^T (A A^T)^{-1} . Those are left and right pseudoinverses respectively. I’ll denote the pseudoinverse of A as A^{\dagger} They have quite cool properties:

also when A is square and of full rank then:

A^{\dagger} = A^{-1}

Where is the magic?

The post’s title indicates there is some magic. Up to this point is has actually been mostly dry-ass math. Apologies for that.

However, I’d like to recap the magically appearing formula from above:

\beta = (X^T X)^{-1}X^T y

This is actually the very same formula as

\Delta\beta = (X_{\beta}^T C^T X_{\beta}^T + W^T)^{-1} X_{\beta}^T C^T e

where f is a trivial function, \beta=\vec{0} therefore e = y , C=I and W=0 . In other words: the above example is a trivial case for the more general expression.

Cool, eh?

There are actually two more details I’d like to emphasize. Those details are what I consider the coolest and most magical part.


The W matrix “regulates” the inverse. In practice this means that for ill-conditioned X matrices a non-zero W leads to computationally stable calculations of pseudoinverses! Also they don’t explode when the singular values of X become small! Further the dampening can be used to manipulate the magnitude of \Delta\beta . The bigger W , the smaller \Delta\beta . When you build something that does a gradient descent, you can control the step width with W and take care of situations where you would shoot over actual solutions.


The C matrix can be chosen in a way that \Delta\beta resides within a certain subspace of the \beta -space. Think about a system where you have a squared polynomial that has to cross through a single point. Also the polynomial shall be chosen as to minimize its distance to a set of three different Points.

To get this done you can calculate a \Delta\beta_1 to solve for the first task. Further, you can calculate a C to constrain the second task. Namely C has to be the nullspace of X of the first task!

C = I - X^{\dagger}X

Note that when solving for the second task the previously calculated \Delta\beta of the first task has to be employed.

When combining dampening and constraining you get the dampened and constrained least squares technique.

Other stuff that uses similar formulas

Maybe I have not pushed it enough: I’m rather fond of the dampened, least squares approach to inverse kinematics. At the heart of the math behind it you’ll find the following formula:

\Delta q = \mathcal{N}J^T (J \mathcal{N} J^T + \epsilon I)^{-1} e

The similarity is quite striking. \mathcal{N} serves a similar purpose as C , just as \epsilon I is very related to W . And J is the local linearization of a function, as is X . The remaining difference is that in inverse kinematics a right pseudoinverse instead of a left pseudoinverse is employed.

Further when you recap the formulas in the update step of the extended Kalman filter, you’ll find something familiar: The Kalman gain K_k calculates like this:

% <![CDATA[ \begin{aligned} S_k &= H_k P_{k|k-1} H^T_k + R \\ K_k &= P_{k|k-1} H^T_k S^{-1}_k \\ K_k &= P_{k|k-1} H^T_k (H_k P_{k|k-1} H^T_k + R)^{-1} \end{aligned} %]]>

Quite clearly the Kalman gain can also be described as a right pseudoinverse of H . The inner weighting matrix is P_{k|k-1} and the regularization R .


Dampened, constrained least squares is the shit. I mean… For real!

This is the shit!