The magic that is least squares
October 29, 2019 - lutz
If you’ve had training in math, statistics, computer science, or any other number-crunching field, you already know linear regression. Linear regression is just one tool that employs a wonderful technique: least squares.
Least squares appears in many applications (more about that in posts to come) and I’d love to point out what a marvelous tool it is by the example of linear regression. Once you’re familiar with the underlying least squares approach, you’ll see it everywhere!
Anyways, this post is more on the mathy side… I hope somebody finds it useful.
Derivation
Before I jump into the super mathy derivation, I’d like to showcase the workings of linear regression by means of an example. But right after the example-based introduction I get to the nerdy detailed math. Yey!
In a Nutshell
Think about having a function $f(x)$ that produces a number $y$ given $x$. The function has some intrinsic parameters that you would like to tune in order to have the function pass through some points. An example might be:

$$f(x) = p_1 x^2 + p_2 x + p_3$$

This function has three parameters: $p_1$, $p_2$, $p_3$. With linear regression we can find optimal choices for those parameters given a set of points $(x_i, y_i)$. To do this we need to calculate the derivatives $\frac{\partial f}{\partial p_i}$, where $p_i$ represents the $i$-th parameter of $f$. That means we have to calculate as many derivatives of $f$ as there are parameters. When looking at the example above, this means: $\frac{\partial f}{\partial p_1} = x^2$, $\frac{\partial f}{\partial p_2} = x$, $\frac{\partial f}{\partial p_3} = 1$. We can now arrange the derivatives in a row vector whose $i$-th entry contains the derivative of $f$ by its $i$-th parameter and call this vector $a(x)$:
With the example function above this becomes:

$$a(x) = \begin{pmatrix} x^2 & x & 1 \end{pmatrix}$$
For every point we want our function to pass through we can generate an $a(x_i)$ vector. Let’s assume we want to tweak the parameters of our function above in a way that it passes through three points $(x_1, y_1)$, $(x_2, y_2)$, $(x_3, y_3)$. For every point we can calculate a corresponding $a(x_i)$ and simply vertically stack those together to get $A$:

$$A = \begin{pmatrix} x_1^2 & x_1 & 1 \\ x_2^2 & x_2 & 1 \\ x_3^2 & x_3 & 1 \end{pmatrix}$$
Enter the magic formula that will be derived further below:

$$p = \left(A^T A\right)^{-1} A^T y$$
The vector $p$ contains the parameters of our function, and $y$ is simply all $y$-components of the points vertically stacked together. In our example this means $y = \begin{pmatrix} y_1 & y_2 & y_3 \end{pmatrix}^T$.
This means that the quadratic function that passes through the points $(x_1, y_1)$, $(x_2, y_2)$, $(x_3, y_3)$ is simply $f(x) = p_1 x^2 + p_2 x + p_3$ with the parameters computed by the formula above.
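To make the nutshell version concrete, here is a minimal numpy sketch of exactly this recipe. The three points are an arbitrary choice of mine, purely for illustration:

```python
import numpy as np

# Three illustrative points (arbitrary choice, just for demonstration).
points = np.array([
    [-1.0, 2.0],
    [ 0.0, 1.0],
    [ 2.0, 5.0],
])
x, y = points[:, 0], points[:, 1]

# Each row is a(x_i) = (x_i^2, x_i, 1), the derivatives of f w.r.t. p1, p2, p3.
A = np.column_stack([x**2, x, np.ones_like(x)])

# The "magic formula": p = (A^T A)^{-1} A^T y.
p = np.linalg.solve(A.T @ A, A.T @ y)

print(p)                      # parameters p1, p2, p3
print(np.allclose(A @ p, y))  # the fitted quadratic passes through all three points
```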
I hope this example-driven explanation helps in following the derivation below. However, it should also raise some questions like: Where does that magic formula come from?! So here is the more abstract derivation:
The Proper Derivation
First we need to change the definition of our function and of the points a bit:

$$y = f(x, p), \qquad x \in \mathbb{R}^n, \; y \in \mathbb{R}^m$$

$x$ is now a vector and $f(x, p)$ generates a vector. Likewise the $x$ and $y$ components of the points have to be vectors as well: $(x_i, y_i)$ where $x_i \in \mathbb{R}^n$ and accordingly $y_i \in \mathbb{R}^m$. The parameters are collected in the vector $p$.
The function that passes as closely as possible through the given points, thus minimizing the distance to the points, is the function we are looking for. To express this we introduce a square loss function $E_y$ that gives a value for any parameterized $f$ and set of points by summing the squared distances between where the function should go through ($y_i$) and where it actually went through ($f(x_i, p)$):

$$E_y(p) = \sum_i \left\| y_i - f(x_i, p) \right\|^2_{W_y}$$

We actually want to minimize this square function, hence the name least squares!
The notation $\|v\|^2_W$ is a shorthand for $v^T W v$, which is the weighted (by $W$) sum of squares of the elements of $v$. For (squared) weighting matrices to make sense, they have to be symmetric and positive semi-definite, i.e., a weighting matrix has to be constructible by squaring another matrix (e.g., $W = C^T C$ where $C$ is a root of $W$). The weighting matrix $W_y$ can be used to weight dimensions along which errors are less acceptable than along others. I’d bet that in at least 99% of all usages of linear regression $W_y$ will always be the identity $I$. However, in situations where the solution shall remain in a certain space, a weighting matrix can be employed to express exactly that. More about this further below. Also, I’d like to introduce a similar loss function $E_p$ that minimizes the magnitude of $p$:

$$E_p(p) = \left\| p \right\|^2_{W_p}$$
$W_p$ is the weighting matrix that can be used to “regulate” or “dampen” the calculation of $p$. This is especially necessary when there are more parameters in $p$ than there are points to fit the function to. In that case a multitude of solutions for $p$ exist. $W_p$ expresses which parts of $p$ can grow big and which shall remain small (and combinations thereof).
The sum of both loss functions shall be the loss function from which the derivation is done:

$$E(p) = E_y(p) + E_p(p) = \sum_i \left\| y_i - f(x_i, p) \right\|^2_{W_y} + \left\| p \right\|^2_{W_p}$$

Note that when combining the loss terms, the weighting matrices also weight against each other. Since the overall goal is to fit $f$ into a set of points, $W_y$ should be chosen to be of much higher magnitude than $W_p$.
In general it is not possible to derive a closed-form solution for $p$ when starting with the previous equation. E.g., it is not possible to directly find a $p$ when $\frac{\partial f}{\partial p}$ is still dependent on $p$, i.e., when $f$ is non-trivial (non-linear) with respect to $p$. In most usages of linear regression this seems almost esoteric, but for the sake of generalization, and to show the similarity of linear regression to techniques employed in other situations, I’d like to derive the linear regression formula for the case where $f$ can be non-trivial. This means that there is no closed-form solution for $p$, but we have to employ a gradient descent along the locally linearized representation of $f$ at $p$, the Jacobian $J(p) = \frac{\partial f}{\partial p}$. In practice this means for sufficiently small $\Delta p$ we assume the following:

$$f(x, p + \Delta p) \approx f(x, p) + J(p)\,\Delta p$$
The loss function shall now qualify $\Delta p$ instead of $p$:

$$E(\Delta p) = \sum_i \left\| y_i - f(x_i, p) - J(p)\,\Delta p \right\|^2_{W_y} + \left\| \Delta p \right\|^2_{W_p}$$

Note that when $f$ is trivial, then $f(x, p)$ is simply linear in $p$. $\frac{\partial f}{\partial p}$ has the form of a matrix and (for a trivial $f$) is in fact the matrix $A$ introduced above! However, for a non-trivial $f$, $\frac{\partial f}{\partial p}$ is actually dependent on $p$, therefore in the following the matrix $J(p)$ will be used to indicate this dependency. Also note that when $f$ is trivial and we start at $p = 0$, then solving the loss function for $\Delta p$ is actually the same thing as solving for $p$ directly. Anyways, solving for $\Delta p$ instead of $p$ means that instead of asking “what’s $p$?” we ask: “what change to $p$ minimizes the loss?”. The difference is subtle, but there is a difference!
The term $y_i - f(x_i, p)$ can be described as the error vector $e_i$. It’s the difference between the desired value of $f$ for every point and the value $f$ actually takes at the $x$-component of that point. Stacking all $e_i$ vertically gives the total error vector $e$ (and stacking the per-point Jacobians accordingly gives the total $J(p)$), so the loss can be written compactly as $E(\Delta p) = \left\| e - J(p)\,\Delta p \right\|^2_{W_y} + \left\| \Delta p \right\|^2_{W_p}$.
Since the weighting matrices are symmetric the following identity holds:

$$e^T W_y\, J(p)\,\Delta p = \Delta p^T J(p)^T W_y\, e$$
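Using that identity, the compact loss from above multiplies out to:

$$E(\Delta p) = \big(e - J(p)\,\Delta p\big)^T W_y \big(e - J(p)\,\Delta p\big) + \Delta p^T W_p\,\Delta p$$

$$= e^T W_y\, e \;-\; 2\,\Delta p^T J(p)^T W_y\, e \;+\; \Delta p^T \big(J(p)^T W_y\, J(p) + W_p\big)\,\Delta p$$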
To find the $\Delta p$ that minimizes $E$ we have to derive $E$ for $\Delta p$ and find the zero crossing. Note that this is actually the very same process you probably did in calculus at school when you had to look for extrema of squared functions.

$$\frac{\partial E}{\partial \Delta p} = -2\, J(p)^T W_y\, e + 2\,\big(J(p)^T W_y\, J(p) + W_p\big)\,\Delta p \overset{!}{=} 0$$

$$\Rightarrow \quad \Delta p = \big(J(p)^T W_y\, J(p) + W_p\big)^{-1} J(p)^T W_y\, e$$
And that is practically everything there is to it!
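For a non-trivial $f$ the update has to be iterated: linearize at the current $p$, compute $\Delta p$, apply it, repeat. Here is a minimal sketch of that loop; the exponential model, the synthetic points, the initial guess, and the weighting choices are all my own illustrative assumptions (with a poor starting guess or no dampening, such a naive loop can of course diverge):

```python
import numpy as np

# Illustrative non-trivial model (my own choice): f(x, p) = p0 * exp(p1 * x)
def f(x, p):
    return p[0] * np.exp(p[1] * x)

def jacobian(x, p):
    # One row per point, one column per parameter: df/dp0 and df/dp1.
    return np.column_stack([np.exp(p[1] * x), p[0] * x * np.exp(p[1] * x)])

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.exp(x)                      # synthetic data, so the "true" parameters are (1, 1)

W_y = np.eye(len(x))               # all points weighted equally
W_p = 1e-2 * np.eye(2)             # small dampening on the parameter update

p = np.array([1.0, 0.9])           # reasonably close initial guess
for _ in range(50):
    e = y - f(x, p)                # error vector
    J = jacobian(x, p)             # local linearization at the current p
    # Delta p = (J^T W_y J + W_p)^{-1} J^T W_y e
    dp = np.linalg.solve(J.T @ W_y @ J + W_p, J.T @ W_y @ e)
    p = p + dp

print(p)   # ends up very close to (1.0, 1.0)
```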
Pseudoinverses
At any least squares formula you’ll find an expression that resembles either $\left(A^T A\right)^{-1} A^T$ or $A^T \left(A A^T\right)^{-1}$. Those are left and right pseudoinverses respectively. I’ll denote the pseudoinverse of $A$ as $A^{+}$. They have quite cool properties:

$$\left(A^T A\right)^{-1} A^T\, A = I \qquad \text{and} \qquad A\, A^T \left(A A^T\right)^{-1} = I$$
also when $A$ is square and of full rank then:

$$A^{+} = A^{-1}$$
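A quick numerical sanity check of those properties (the matrix sizes and the random full-rank matrices are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(42)

# Tall matrix (more rows than columns, full column rank): left pseudoinverse.
A = rng.normal(size=(5, 3))
A_left = np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(A_left @ A, np.eye(3)))          # A^+ A = I
print(np.allclose(A_left, np.linalg.pinv(A)))      # matches numpy's pseudoinverse

# Wide matrix (more columns than rows, full row rank): right pseudoinverse.
B = rng.normal(size=(3, 5))
B_right = B.T @ np.linalg.inv(B @ B.T)
print(np.allclose(B @ B_right, np.eye(3)))         # A A^+ = I
print(np.allclose(B_right, np.linalg.pinv(B)))

# Square, full-rank matrix: the pseudoinverse is just the ordinary inverse.
C = rng.normal(size=(4, 4))
print(np.allclose(np.linalg.pinv(C), np.linalg.inv(C)))
```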
Where is the magic?
The post’s title indicates there is some magic. Up to this point it has actually been mostly dry-ass math. Apologies for that.
However, I’d like to recap the magically appearing formula from above:

$$p = \left(A^T A\right)^{-1} A^T y$$
This is actually the very same formula as

$$\Delta p = \big(J(p)^T W_y\, J(p) + W_p\big)^{-1} J(p)^T W_y\, e$$
where $f$ is a trivial function, therefore $J(p) = A$, $W_y = I$, and $W_p = 0$ (and, starting from $p = 0$, $\Delta p = p$ and $e = y$). In other words: the above example is a trivial case for the more general expression.
Cool, eh?
There are actually two more details I’d like to emphasize. Those details are what I consider the coolest and most magical part.
Dampening
The matrix $W_p$ “regulates” the inverse. In practice this means that for ill-conditioned matrices $J(p)$ a non-zero $W_p$ leads to computationally stable calculations of pseudoinverses! Also, they don’t explode when the singular values of $J(p)$ become small! Further, the dampening can be used to manipulate the magnitude of $\Delta p$: the bigger $W_p$, the smaller $\Delta p$. When you build something that does a gradient descent, you can control the step width with $W_p$ and take care of situations where you would shoot over actual solutions.
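A small demonstration of that stabilizing effect (the matrix, its singular values, and the dampening factor are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a 4x3 matrix with singular values (5, 1, 1e-6), i.e. badly conditioned on purpose.
U, _ = np.linalg.qr(rng.normal(size=(4, 4)))
V, _ = np.linalg.qr(rng.normal(size=(3, 3)))
S = np.zeros((4, 3))
S[[0, 1, 2], [0, 1, 2]] = [5.0, 1.0, 1e-6]
A = U @ S @ V.T
y = np.array([1.0, 2.0, 3.0, 4.0])

# Plain left pseudoinverse: the tiny singular value blows the solution up.
p_plain = np.linalg.solve(A.T @ A, A.T @ y)

# Dampened version: W_p = lambda^2 * I keeps the inverse well conditioned and p small.
lam = 1e-3
p_damped = np.linalg.solve(A.T @ A + lam**2 * np.eye(3), A.T @ y)

print(np.linalg.norm(p_plain))    # huge, roughly 1/sigma_min times larger
print(np.linalg.norm(p_damped))   # modest
```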
Constraining
The matrix $W_p$ can be chosen in a way that $\Delta p$ resides within a certain subspace of the $p$-space. Think about a system where you have a quadratic polynomial that has to cross through a single point. Also, the polynomial shall be chosen so as to minimize its distance to a set of three different points.
To get this done you can calculate a $\Delta p_1$ that solves the first task. Further, you can calculate a $W_p$ that constrains the second task. Namely, $W_p$ has to correspond to the nullspace of the $J$ of the first task!
Note that when solving for the second task, the previously calculated $\Delta p_1$ of the first task has to be employed.
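One common way to spell this out explicitly is the classic task-priority scheme via a nullspace projector; this is my own summary of that scheme and not necessarily the exact $W_p$-based construction meant above:

$$\Delta p_1 = J_1^{+}\, e_1, \qquad N_1 = I - J_1^{+} J_1,$$

$$\Delta p = \Delta p_1 + \big(J_2\, N_1\big)^{+} \big(e_2 - J_2\, \Delta p_1\big)$$

The second task then only acts inside the nullspace of the first one, so fitting the three points can never pull the polynomial away from the single point it has to pass through.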
When combining dampening and constraining you get the dampened and constrained least squares technique.
Other stuff that uses similar formulas
Maybe I have not pushed it enough: I’m rather fond of the dampened least squares approach to inverse kinematics. At the heart of the math behind it you’ll find the following formula:

$$\Delta\theta = J^T \big(J\, J^T + \lambda^2 I\big)^{-1} e$$
The similarity is quite striking. $\lambda^2 I$ serves a similar purpose as $W_p$, just as the task-space error $e$ is very related to the error vector $e$ from above. And $J$ is the local linearization of a function (the forward kinematics), as is $J(p)$. The remaining difference is that in inverse kinematics a right pseudoinverse instead of a left pseudoinverse is employed.
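As a sketch of how this looks in practice, here is a dampened least squares IK loop for a toy planar two-link arm; the arm, its link lengths, the target, and the dampening factor are all my own illustrative assumptions:

```python
import numpy as np

# Toy 2-link planar arm (link lengths are an arbitrary choice).
L1, L2 = 1.0, 0.8

def forward(theta):
    # End effector position for joint angles theta = (theta1, theta2).
    t1, t2 = theta
    return np.array([L1 * np.cos(t1) + L2 * np.cos(t1 + t2),
                     L1 * np.sin(t1) + L2 * np.sin(t1 + t2)])

def jacobian(theta):
    t1, t2 = theta
    return np.array([[-L1 * np.sin(t1) - L2 * np.sin(t1 + t2), -L2 * np.sin(t1 + t2)],
                     [ L1 * np.cos(t1) + L2 * np.cos(t1 + t2),  L2 * np.cos(t1 + t2)]])

target = np.array([1.2, 0.7])     # reachable target position
theta = np.array([0.3, 0.3])      # initial joint angles
lam = 0.1                         # dampening factor

for _ in range(100):
    e = target - forward(theta)                       # task-space error
    J = jacobian(theta)                               # local linearization
    # Dampened right pseudoinverse: dtheta = J^T (J J^T + lambda^2 I)^{-1} e
    dtheta = J.T @ np.linalg.solve(J @ J.T + lam**2 * np.eye(2), e)
    theta = theta + dtheta

print(forward(theta), target)     # should be (almost) identical
```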
Further, when you recap the formulas in the update step of the extended Kalman filter, you’ll find something familiar: the Kalman gain calculates like this:

$$K = P\, H^T \big(H\, P\, H^T + R\big)^{-1}$$
Quite clearly the Kalman gain can also be described as a right pseudoinverse of $H$. The inner weighting matrix is $P$ and the regularization is $R$.
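To underline the point: the right-pseudoinverse form of the gain is algebraically the same as the “regularized normal equation” (left-pseudoinverse) form. A small numerical check, where the dimensions and covariances are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 2                      # state and measurement dimensions (arbitrary)
H = rng.normal(size=(m, n))      # measurement matrix
P = 0.5 * np.eye(n)              # state covariance (must be SPD)
R = 0.1 * np.eye(m)              # measurement noise covariance (must be SPD)

# Kalman gain, right-pseudoinverse flavour: K = P H^T (H P H^T + R)^{-1}
K_right = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)

# Same gain, left-pseudoinverse flavour: K = (H^T R^{-1} H + P^{-1})^{-1} H^T R^{-1}
K_left = np.linalg.inv(H.T @ np.linalg.inv(R) @ H + np.linalg.inv(P)) @ H.T @ np.linalg.inv(R)

print(np.allclose(K_right, K_left))   # True: both expressions give the same gain
```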
TL;DR
Dampened, constrained least squares is the shit. I mean… For real!
This is the shit!