You can use any other prior distribution for your parameters to create more interesting regularizations. You can even say that your parameters w are normally distributed but correlated, with some covariance matrix Σ.
Let us assume that Σ is positive-definite, i.e. we are in the non-degenerate case. Otherwise, there is no density p(w).
If you do the math, you will find out that we then have to minimize

|y - Xw|² + |Γw|²

for some matrix Γ. Note: Γ is invertible and we have Σ⁻¹ = ΓᵀΓ. This is also called Tikhonov regularization.
Hint: start with the fact that

p(w) ∝ exp(-½ wᵀΣ⁻¹w)

and remember that positive-definite matrices can be decomposed into a product of some invertible matrix and its transpose.
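If you want to verify this decomposition step numerically, here is a minimal sketch (the covariance matrix Σ below is just made up for illustration): taking Γ as the transpose of the Cholesky factor of Σ⁻¹ turns the quadratic form of the prior into a squared norm.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up positive-definite covariance matrix for illustration.
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3 * np.eye(3)

# Cholesky factor of the precision matrix: Sigma^-1 = L @ L.T,
# so Gamma = L.T satisfies Sigma^-1 = Gamma.T @ Gamma.
L = np.linalg.cholesky(np.linalg.inv(Sigma))
Gamma = L.T

# The quadratic form in the Gaussian prior equals |Gamma w|^2.
w = rng.normal(size=3)
print(np.allclose(w @ np.linalg.inv(Sigma) @ w, np.sum((Gamma @ w) ** 2)))  # True
```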
Great, so we defined our model and know what we want to optimize. But how can we optimize it, i.e. learn the best parameters that minimize the loss function? And when is there a unique solution? Let’s find out.
Ordinary Least Squares
Let us assume that we do not regularize and don't use sample weights. Then, the MSE can be written as

MSE(w) = (1/n) Σᵢ (yᵢ - wᵀxᵢ)².

This is quite abstract, so let us write it differently as

MSE(w) = (1/n) |y - Xw|² = (1/n) (y - Xw)ᵀ(y - Xw).
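As a quick sanity check that both forms really agree, here is a minimal sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 100, 4
X = rng.normal(size=(n, k))   # toy design matrix (the bias column would be one of the k columns)
y = rng.normal(size=n)        # toy targets
w = rng.normal(size=k)        # arbitrary weights

mse_sum = np.mean([(y[i] - w @ X[i]) ** 2 for i in range(n)])   # sum form
mse_matrix = (y - X @ w) @ (y - X @ w) / n                      # matrix form
print(np.allclose(mse_sum, mse_matrix))                         # True
```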
Using matrix calculus, you can take the derivative of this function with respect to w (we assume that the bias term b is included there):

∇MSE(w) = (2/n) Xᵀ(Xw - y).

If you set this gradient to zero, you end up with the normal equations

XᵀXw = Xᵀy.
If the (n × k)-matrix X has a rank of k, so does the (k × k)-matrix XᵀX, i.e. it is invertible. Why? It follows from rank(X) = rank(XᵀX).
In this case, we get the unique solution

w = (XᵀX)⁻¹Xᵀy.
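Here is a minimal numpy sketch of this closed-form solution on made-up data, compared against the least-squares solver np.linalg.lstsq:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, k - 1))])  # bias column included in X
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=n)

# Closed-form OLS solution w = (X^T X)^-1 X^T y.
w_closed = np.linalg.inv(X.T @ X) @ X.T @ y

# The same solution computed by a numerically more stable solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_closed, w_lstsq))  # True
```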
Note: Software packages do not optimize like this but instead use gradient descent or other iterative techniques because that is faster. Still, the formula is nice and gives us some high-level insights about the problem.
But is this really a minimum? We can find out by computing the Hessian, which is XᵀX up to a positive factor. The matrix is positive-semidefinite since wᵀXᵀXw = |Xw|² ≥ 0 for any w. It is even strictly positive-definite since XᵀX is invertible, i.e. 0 is not an eigenvalue, so our optimal w indeed minimizes our problem.
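If you prefer a numerical check: for a full-rank X, all eigenvalues of XᵀX are indeed positive (the X below is again made up):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))               # random X has full column rank with probability 1

eigenvalues = np.linalg.eigvalsh(X.T @ X)  # eigvalsh because X^T X is symmetric
print(np.all(eigenvalues > 0))             # True: all eigenvalues are positive
```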
Perfect Multicollinearity
That was the friendly case. But what happens if X has a rank smaller than k? This might happen if we have two features in our dataset where one is a multiple of the other, e.g. height (in m) and height (in cm). Then we have height (in cm) = 100 * height (in m).
It can also happen if we one-hot encode categorical data and do not drop one of the columns. For example, if we have a feature color in our dataset that can be red, green, or blue, then we can one-hot encode and end up with three columns color_red, color_green, and color_blue. For these features, we have color_red + color_green + color_blue = 1, i.e. they always sum to the bias column, which induces perfect multicollinearity as well.
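Both situations are easy to reproduce. Below is a minimal sketch with a made-up one-hot encoded color feature, showing that X and XᵀX lose rank:

```python
import numpy as np

# A tiny made-up dataset: a one-hot encoded color feature plus a bias column,
# with no one-hot column dropped.
colors = np.array([0, 1, 2, 0, 1, 2])        # 0 = red, 1 = green, 2 = blue
one_hot = np.eye(3)[colors]                  # color_red, color_green, color_blue
X = np.hstack([np.ones((6, 1)), one_hot])    # bias column + one-hot columns

# The one-hot columns always sum to the bias column, so X loses rank.
print(np.linalg.matrix_rank(X))              # 3, although X has k = 4 columns
print(np.linalg.matrix_rank(X.T @ X))        # 3 as well, i.e. X^T X is singular
```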
In these cases, the rank of XᵀX is also smaller than k, so this matrix is not invertible.
End of story.
Or not? Actually, no, because a singular XᵀX can mean one of two things: the system (XᵀX)w = Xᵀy has
- no solution or
- infinitely many solutions.
It turns out that in our case, we can always obtain one solution using the Moore-Penrose inverse. Since the null space of XᵀX is non-trivial, this means that we are in the case of infinitely many solutions, all of them giving us the same (training) mean squared error loss.
If we denote the Moore-Penrose inverse of a matrix A by A⁺, we can solve the linear system of equations as

w = (XᵀX)⁺Xᵀy.
To get the other infinitely many solutions, just add any vector from the null space of XᵀX to this particular solution.
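Here is a minimal sketch of this with a made-up rank-deficient design matrix (the height example from above): the pseudoinverse solution and a solution shifted by a null-space vector yield exactly the same training MSE.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20

# Made-up design matrix with perfect multicollinearity:
# the last column is 100 times the second one (height in m vs. height in cm).
height_m = rng.uniform(1.5, 2.0, size=n)
X = np.column_stack([np.ones(n), height_m, 100 * height_m])
y = 2 + 3 * height_m + rng.normal(scale=0.1, size=n)

# One particular solution of (X^T X) w = X^T y via the Moore-Penrose inverse.
w_pinv = np.linalg.pinv(X.T @ X) @ X.T @ y

# Adding a null-space vector of X^T X gives another solution with the same MSE.
null_vector = np.array([0.0, 100.0, -1.0])    # X @ null_vector = 0
w_other = w_pinv + 0.5 * null_vector


def mse(w):
    return np.mean((y - X @ w) ** 2)


print(np.isclose(mse(w_pinv), mse(w_other)))  # True
```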
Minimization With Tikhonov Regularization
Recall that we could add a prior distribution to our weights. We then had to minimize

|y - Xw|² + |Γw|²

for some invertible matrix Γ. Following the same steps as in ordinary least squares, i.e. taking the derivative with respect to w and setting the result to zero, the solution is

w = (XᵀX + ΓᵀΓ)⁻¹Xᵀy.
The neat part:
XᵀX + ΓᵀΓ is always invertible!
Let us find out why. It suffices to show that the null space of XᵀX + ΓᵀΓ is only {0}. So, let us take a w with (XᵀX + ΓᵀΓ)w = 0. Now, our goal is to show that w = 0.
From (XᵀX + ΓᵀΓ)w = 0 it follows that

0 = wᵀ(XᵀX + ΓᵀΓ)w = |Xw|² + |Γw|²,

which in turn implies |Γw| = 0 and hence Γw = 0. Since Γ is invertible, w has to be 0. Using the same calculation, we can see that the Hessian is also positive-definite.
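To see the closed form in action, here is a minimal sketch for the special choice Γ = √α·I (an assumption for this example, not the general case): with this Γ, Tikhonov regularization reduces to plain ridge regression, and the formula matches scikit-learn's Ridge with fit_intercept=False, since the bias is assumed to be part of w.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(6)
n, k = 100, 4
X = rng.normal(size=(n, k))                                 # made-up data, bias assumed inside X
y = X @ rng.normal(size=k) + rng.normal(scale=0.1, size=n)

alpha = 2.0
Gamma = np.sqrt(alpha) * np.eye(k)                          # special case Gamma = sqrt(alpha) * I

# Closed-form Tikhonov solution w = (X^T X + Gamma^T Gamma)^-1 X^T y.
w_tikhonov = np.linalg.inv(X.T @ X + Gamma.T @ Gamma) @ X.T @ y

# For this Gamma, the objective |y - Xw|^2 + alpha * |w|^2 is plain ridge regression.
ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)
print(np.allclose(w_tikhonov, ridge.coef_))                 # True
```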