Gottlieb t Freitag

The magic that is least squares

October 29, 2019 - lutz

If you’ve had training in math, statistics, computer science or any other number-crunching field, you already know linear regression. Linear regression is just one tool that employs a wonderful technique: least squares.

Least squares appears in many applications (more about that in posts to come) and I’d love to point out what a marvelous tool it is by the example of linear regression. Once you’re familiar with the underlying least squares approach, you’ll see it everywhere!

Anyways, this post is more on the mathy side… I hope somebody finds it useful.


Before I jump into the super mathy derivation, I’d like to showcase the workings of linear regression by means of an example. But right after the example-based introduction I get to the nerdy detailed math. Yay!

In a Nutshell

Think about having a function $f(x)$ that produces a number given $x$. The function has some intrinsic parameters that you would like to tune in order to have the function pass through some points. An example might be:

$$f(x) = a + bx + cx^2$$

This function has three parameters: $a$, $b$, $c$. With linear regression we can find optimal choices for those parameters given a set of points $P_n = (x_n, y_n)$. To do this we need to calculate the derivatives $\frac{d}{d \beta_k}f$ where $\beta_k$ represents the $k$-th parameter of $f$. That means we have to calculate as many derivatives of $f$ as there are parameters. For the example above, this means: $\beta_1 = a$, $\beta_2 = b$, $\beta_3 = c$. We can now arrange the derivatives in a row vector whose $k$-th entry contains the derivative of $f$ by its $k$-th parameter, and call this vector $\hat{X}$:

$$\hat{X} = \begin{pmatrix} \frac{d}{d \beta_1}f & \frac{d}{d \beta_2}f & \cdots & \frac{d}{d \beta_K}f \end{pmatrix}$$

With the example function above this becomes:

$$\hat{X} = \begin{pmatrix} \frac{d}{d \beta_1}f & \frac{d}{d \beta_2}f & \frac{d}{d \beta_3}f \end{pmatrix} = \begin{pmatrix} 1 & x & x^2 \end{pmatrix}$$

For every point we want our function to pass through we can generate an $\hat{X}$ vector. Let’s assume we want to tweak the parameters of our function above so that it passes through the points $P_1 = (-1, 1)^T$, $P_2 = (0, 0)^T$, $P_3 = (1, 1)^T$. For every point $P_n$ we can calculate a corresponding $\hat{X}_n$ and simply stack those vertically to get $X$:

$$X = \begin{pmatrix} \hat{X}_1 \\ \hat{X}_2 \\ \hat{X}_3 \end{pmatrix} = \begin{pmatrix} \frac{d}{da}f(-1) & \frac{d}{db}f(-1) & \frac{d}{dc}f(-1) \\ \frac{d}{da}f(0) & \frac{d}{db}f(0) & \frac{d}{dc}f(0) \\ \frac{d}{da}f(1) & \frac{d}{db}f(1) & \frac{d}{dc}f(1) \end{pmatrix} = \begin{pmatrix} 1 & -1 & 1 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \end{pmatrix}$$

Enter the magic formula (which will be derived further below):

$$\beta = (X^T X)^{-1}X^T y$$

The vector $\beta$ contains the parameters of our function, and $y$ is simply all $y$-components of the points stacked vertically: $y = \begin{pmatrix} y_1 & y_2 & \cdots & y_n \end{pmatrix}^T$. In our example this means $y = \begin{pmatrix} 1 & 0 & 1 \end{pmatrix}^T$.

$$\beta = \left( \begin{pmatrix} 1 & 1 & 1 \\ -1 & 0 & 1 \\ 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & -1 & 1 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \end{pmatrix} \right)^{-1} \begin{pmatrix} 1 & 1 & 1 \\ -1 & 0 & 1 \\ 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}$$

This means that the quadratic function that passes through the points $P_1 = (-1, 1)^T$, $P_2 = (0, 0)^T$, $P_3 = (1, 1)^T$ is:

$$f(x) = x^2$$
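The worked example can be reproduced in a few lines of numpy (a sketch; the variable names are mine):

```python
import numpy as np

# One row per point; the columns are df/da, df/db, df/dc = (1, x, x^2).
xs = np.array([-1.0, 0.0, 1.0])
X = np.stack([np.ones_like(xs), xs, xs**2], axis=1)
y = np.array([1.0, 0.0, 1.0])

# The magic formula: beta = (X^T X)^{-1} X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)  # -> a = 0, b = 0, c = 1, i.e. f(x) = x^2
```

Using `np.linalg.solve` on the normal equations avoids explicitly inverting $X^T X$.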

I hope this example-driven explanation helps in following the derivation below. However, it should also raise some questions, like: where does that magic formula come from?! So here is the more abstract derivation:

The Proper Derivation

First we need to change the definition of our function $f$ and of the points $P_i$ a bit:

$$f: \mathbb{R}^i \rightarrow \mathbb{R}^j$$

$x$ is now a vector and $f$ produces a vector. Likewise the $x$ and $y$ components of the points have to be vectors as well: $P_k = \left(\vec{x}, \vec{y} \right)$ where $\vec{x} \in \mathbb{R}^i$ and accordingly $\vec{y} \in \mathbb{R}^j$.

The function that passes closest to the given points, thus minimizing the distance to the points, is the function we are looking for. To express this we introduce a square loss function $\mathcal{L}_d(\beta)$ that gives a value for any parameterized $f$ and set of points by summing the squared distances between where the function should pass through ($\vec{y}$) and where it actually passes through ($f_{\beta}(\vec{x})$). We want to minimize this sum of squares, hence the name least squares!

$$\mathcal{L}_d(\beta) = \left\| \vec{y} - f_{\beta}(\vec{x}) \right\|^2_C$$

The notation $\left\| a \right\|^2_B$ is a shorthand for $a^T B a$, which is the weighted (by $B$) sum of squares of the elements of $a$. For weighting matrices to make sense, they have to be symmetric and positive semi-definite, i.e., a weighting matrix has to be constructible as a product $B = A^T A$ (where $A$ can be regarded as a root of $B$). The weighting matrix $C$ can be used to weight dimensions along which errors are less acceptable than along others. I’d bet that in at least 99% of all usages of linear regression $C$ will simply be $I$. However, in situations where $\beta$ shall remain in a certain space, $C$ can be employed to express exactly that; more about this further below. I’d also like to introduce a similar loss function that minimizes the magnitude of $\beta$:

$$\mathcal{L}_{m}(\beta) = \left\| \beta \right\|^2_W$$

$W$ is a weighting matrix that can be used to “regularize” or “dampen” the calculation of $\beta$. This is especially necessary when there are more parameters in $f$ than there are points to fit the function to; in that case a multitude of solutions for $\beta$ exist. $W$ expresses which parts of $\beta$ may grow big and which shall remain small (and combinations thereof).

The sum of both loss functions shall be the loss function from which the derivation proceeds. Note that when combining the loss terms, the weighting matrices also weight against each other. Since the overall goal is to fit $f$ to a set of points, $C$ should be chosen to be of much higher magnitude than $W$.

$$\mathcal{L}(\beta) = \mathcal{L}_{d}(\beta) + \mathcal{L}_{m}(\beta)$$

In general it is not possible to derive a closed-form solution for $\beta$ when starting from the previous equation. E.g., it is not possible to directly find a $\beta$ when $\frac{d}{d \beta} f$ is still dependent on $\beta$, i.e., when $f$ is non-trivial with respect to $\mathcal{L}$. In most usages of linear regression this seems almost esoteric, but for the sake of generalization, and to show the similarity of linear regression to techniques employed in other situations, I’d like to derive the linear regression formula for the case where $f$ can be non-trivial. This means there is no closed-form solution for $\beta$; instead we have to employ a gradient descent along the local linearization of $f$ at $\beta$: $\frac{\delta}{\delta \beta}f(x)$. In practice this means that for sufficiently small $\Delta\beta$ we assume the following:

$$f_{\beta + \Delta\beta}(x) = f_{\beta}(x) + \frac{\delta}{\delta \beta}f(x) \, \Delta\beta$$

The loss function shall now qualify $\Delta\beta$ instead of $\beta$. Note that when $f$ is trivial, then $\frac{\delta}{\delta \beta}f(x) = \frac{d}{d \beta}f(x)$. $\frac{\delta}{\delta \beta}f(x)$ has the form of a matrix and (for a trivial $f$) is in fact the $X$ matrix introduced above! For a non-trivial $f$, however, $X$ actually depends on $\beta$, so in the following the matrix $X_{\beta}$ will be used to indicate this dependency. Also note that when $f$ is trivial and $\beta=\vec{0}$, solving the loss function for $\Delta\beta$ is actually the same thing as solving for $\beta$ directly. Anyway, solving for $\Delta\beta$ instead of $\beta$ means that instead of asking “what is $\beta$?” we ask: “what change to $\beta$ minimizes the loss?”. The difference is subtle, but there is a difference!

$$\begin{aligned} \mathcal{L}(\Delta\beta) &= \left\| \vec{y} - f_{\beta+\Delta\beta}(\vec{x}) \right\|^2_C + \left\| \Delta\beta \right\|^2_W \\ &= \left\| \vec{y} - f_{\beta}(\vec{x}) - \frac{\delta}{\delta \beta}f(x) \, \Delta\beta \right\|^2_C + \left\| \Delta\beta \right\|^2_W \\ &= \left\| \vec{y} - f_{\beta}(\vec{x}) - X_{\beta} \Delta\beta \right\|^2_C + \left\| \Delta\beta \right\|^2_W \end{aligned}$$

The term $\vec{y} - f_{\beta}(\vec{x})$ can be described as the error vector $e$. It’s the difference between the desired values of $f$ (for every point) and the values $f_{\beta}$ actually takes at the $x$-components of the points.

$$\begin{aligned} \mathcal{L}(\Delta\beta) &= \left\| e - X_{\beta} \Delta\beta \right\|^2_C + \left\| \Delta\beta \right\|^2_W \\ &= e^T C e - e^T C X_{\beta} \Delta\beta - (X_{\beta} \Delta\beta)^T C e + (X_{\beta} \Delta\beta)^T C X_{\beta} \Delta\beta + \Delta\beta^T W \Delta\beta \\ &= e^T C e - 2 e^T C X_{\beta} \Delta\beta + \Delta\beta^T X_{\beta}^T C X_{\beta} \Delta\beta + \Delta\beta^T W \Delta\beta \end{aligned}$$

Since the weighting matrices are symmetric the following identity holds:

$$e^T C X_{\beta} \Delta\beta = (X_{\beta} \Delta\beta)^T C e$$

To find the $\Delta\beta$ that minimizes $\mathcal{L}(\Delta\beta)$ we have to differentiate $\mathcal{L}(\Delta\beta)$ with respect to $\Delta\beta$ and find the zero crossing. Note that this is the very same process you probably went through in calculus at school when looking for the extrema of quadratic functions.

$$\begin{aligned} 0 &= \frac{d}{d \Delta\beta} \mathcal{L}(\Delta\beta) \\ 0 &= -2 e^T C X_{\beta} + 2 \Delta\beta^T X_{\beta}^T C X_{\beta} + 2 \Delta\beta^T W \\ e^T C X_{\beta} &= \Delta\beta^T (X_{\beta}^T C X_{\beta} + W) \\ X_{\beta}^T C^T e &= (X_{\beta}^T C X_{\beta} + W)^T \Delta\beta \\ X_{\beta}^T C e &= (X_{\beta}^T C X_{\beta} + W) \Delta\beta \\ \Delta\beta &= (X_{\beta}^T C X_{\beta} + W)^{-1} X_{\beta}^T C e \end{aligned}$$

(The last two steps use the symmetry of $C$, $W$, and $X_{\beta}^T C X_{\beta}$.)
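The final formula translates directly into code. A minimal sketch (function and argument names are mine), where `C` weights the residuals and `W` dampens the step:

```python
import numpy as np

def least_squares_step(X, e, C, W):
    """Dampened, weighted least squares step:
    delta_beta = (X^T C X + W)^{-1} X^T C e"""
    return np.linalg.solve(X.T @ C @ X + W, X.T @ C @ e)

# With C = I and W = 0 this reduces to plain linear regression.
# Reusing the quadratic example from the beginning:
X = np.array([[1.0, -1.0, 1.0],
              [1.0,  0.0, 0.0],
              [1.0,  1.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])
beta = least_squares_step(X, y, C=np.eye(3), W=np.zeros((3, 3)))
```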

And that is practically everything there is to it!
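For a non-trivial $f$ the formula is applied iteratively, re-linearizing at the current $\beta$ each round. A sketch with $C = I$ and $W = 0$ (the model $f_{\beta}(x) = e^{\beta x}$ is my choice for illustration, not from the text):

```python
import numpy as np

# Fit f_beta(x) = exp(beta * x) to points sampled from beta = 0.5.
xs = np.array([0.0, 1.0, 2.0])
ys = np.exp(0.5 * xs)

beta = 0.0  # initial guess
for _ in range(50):
    f = np.exp(beta * xs)
    e = ys - f                                 # error vector e = y - f_beta(x)
    X = (xs * f).reshape(-1, 1)                # local linearization: delta f / delta beta
    delta = np.linalg.solve(X.T @ X, X.T @ e)  # the derived formula with C = I, W = 0
    beta += delta.item()                       # descend along the linearization
```

Each pass recomputes $X_{\beta}$ because it depends on $\beta$; the loop converges to $\beta \approx 0.5$.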


In any least squares formula you’ll find an expression that resembles either $(A^T A)^{-1}A^T$ or $A^T (A A^T)^{-1}$. Those are the left and right pseudoinverses, respectively. I’ll denote the pseudoinverse of $A$ as $A^{\dagger}$. Pseudoinverses have quite cool properties:

For instance, when $A$ has full column rank, the left pseudoinverse satisfies $A^{\dagger} A = I$. Also, when $A$ is square and of full rank:

$$A^{\dagger} = A^{-1}$$
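These properties are easy to check numerically (a sketch using numpy’s `pinv`; the matrices are arbitrary examples of mine):

```python
import numpy as np

# A tall matrix with full column rank: the left pseudoinverse applies.
A = np.array([[1.0, -1.0],
              [1.0,  0.0],
              [1.0,  1.0]])
A_dagger = np.linalg.inv(A.T @ A) @ A.T           # (A^T A)^{-1} A^T
assert np.allclose(A_dagger, np.linalg.pinv(A))   # matches numpy's pseudoinverse
assert np.allclose(A_dagger @ A, np.eye(2))       # left inverse: A† A = I

# A square matrix of full rank: the pseudoinverse is the ordinary inverse.
B = np.array([[2.0, 1.0],
              [1.0, 3.0]])
assert np.allclose(np.linalg.pinv(B), np.linalg.inv(B))
```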

Where is the magic?

The post’s title indicates there is some magic. Up to this point it has actually been mostly dry-ass math. Apologies for that.

However, I’d like to recap the magically appearing formula from above:

$$\beta = (X^T X)^{-1}X^T y$$

This is actually the very same formula as

$$\Delta\beta = (X_{\beta}^T C X_{\beta} + W)^{-1} X_{\beta}^T C e$$

where $f$ is a trivial function and $\beta=\vec{0}$ (therefore $e = y$), $C=I$ and $W=0$. In other words: the above example is a trivial case of the more general expression.

Cool, eh?

There are actually two more details I’d like to emphasize. Those details are what I consider the coolest and most magical part.


The $W$ matrix “regularizes” the inverse. In practice this means that for ill-conditioned $X$ matrices, a non-zero $W$ leads to numerically stable calculations of pseudoinverses! They also don’t explode when the singular values of $X$ become small. Furthermore, the dampening can be used to manipulate the magnitude of $\Delta\beta$: the bigger $W$, the smaller $\Delta\beta$. When you build something that does a gradient descent, you can control the step width with $W$ and handle situations where you would overshoot actual solutions.
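The stabilizing effect is easy to provoke (a sketch; the matrix is a contrived example of mine). Two nearly collinear columns make $X^T X$ almost singular, and a tiny $W = \epsilon I$ keeps the solution bounded:

```python
import numpy as np

# Two nearly collinear columns -> X^T X is ill-conditioned.
X = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-5],
              [1.0, 1.0 - 1e-5]])
e = np.array([1.0, 2.0, 3.0])

# Undamped: entries blow up to the order of tens of thousands.
undamped = np.linalg.solve(X.T @ X, X.T @ e)

# Damped with W = 1e-6 * I: entries stay of moderate size.
damped = np.linalg.solve(X.T @ X + 1e-6 * np.eye(2), X.T @ e)
```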


The $C$ matrix can be chosen such that $\Delta\beta$ resides within a certain subspace of the $\beta$-space. Think about a system where you have a quadratic polynomial that has to pass through a single point, while also being chosen to minimize its distance to a set of three other points.

To get this done you can calculate a $\Delta\beta_1$ that solves the first task. Then you can calculate a $C$ that constrains the second task: namely, $C$ has to be the projector onto the nullspace of the $X$ of the first task!

$$C = I - X^{\dagger}X$$

Note that when solving for the second task the previously calculated Δβ\Delta\beta of the first task has to be employed.
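A quick numerical check of the constraint idea (a sketch; the concrete point is my choice). For the first task, let the quadratic pass through $(2, 4)$, then build $C$ from that task’s $X$:

```python
import numpy as np

# Task 1: f(x) = a + b*x + c*x^2 must pass through the point (2, 4).
X1 = np.array([[1.0, 2.0, 4.0]])   # single row (1, x, x^2) at x = 2
y1 = np.array([4.0])
X1_dagger = np.linalg.pinv(X1)
beta1 = X1_dagger @ y1             # minimum-norm solution of task 1

# Projector onto the nullspace of X1: C = I - X1† X1.
C = np.eye(3) - X1_dagger @ X1

# Any update of the form C @ v is invisible to task 1:
assert np.allclose(X1 @ C, 0.0)
assert np.allclose(X1 @ (beta1 + C @ np.array([1.0, 2.0, 3.0])), y1)
```

So a second-task update projected through $C$ can move $\beta$ freely without breaking the first task’s constraint.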

When combining dampening and constraining you get the dampened and constrained least squares technique.

Other stuff that uses similar formulas

Maybe I have not pushed it enough: I’m rather fond of the dampened least squares approach to inverse kinematics. At the heart of the math behind it you’ll find the following formula:

$$\Delta q = \mathcal{N}J^T (J \mathcal{N} J^T + \epsilon I)^{-1} e$$

The similarity is quite striking. $\mathcal{N}$ serves a similar purpose as $C$, just as $\epsilon I$ is closely related to $W$. And $J$ is the local linearization of a function, as is $X$. The remaining difference is that inverse kinematics employs a right pseudoinverse instead of a left pseudoinverse.

Furthermore, when you recap the formulas in the update step of the extended Kalman filter, you’ll find something familiar. The Kalman gain $K_k$ is calculated like this:

$$\begin{aligned} S_k &= H_k P_{k|k-1} H^T_k + R \\ K_k &= P_{k|k-1} H^T_k S^{-1}_k \\ &= P_{k|k-1} H^T_k (H_k P_{k|k-1} H^T_k + R)^{-1} \end{aligned}$$

Quite clearly the Kalman gain can also be described as a right pseudoinverse of $H$, with inner weighting matrix $P_{k|k-1}$ and regularization $R$.


Dampened, constrained least squares is the shit. I mean… For real!
