Kernel Ridge Regression

A blog post that describes the derivation of Kernel Ridge Regression

This is a post about Kernel Ridge Regression. We will first recall the solutions for linear and ridge regression, and then move on to Kernel Ridge Regression.

We denote the data matrix as \(X \in \mathbb{R}^{n \times p}\), where \(n\) is the number of samples and \(p\) is the data dimension. The labels are denoted as \(Y \in \mathbb{R}^n\).

Linear Regression

We discuss the problem of linear regression with the Mean Square Error (MSE) loss. We wish to solve the following problem: \begin{equation}\label{eq:linear_regression} \beta^{\star} = \mathop{\mathrm{argmin}}\limits_\beta f(\beta), \qquad f(\beta) = ||Y - X\beta||_2^2 \end{equation} Notice we do not need a separate intercept term, as we can append a constant column of ones to \(X\), increasing the data dimension by one, and absorb the intercept into the same matrix notation. The closed-form solution is:

\[\beta^{\star} = (X^TX)^{\dagger}X^T Y\]

where \(\dagger\) denotes the pseudoinverse of the matrix.

Solution details

The solution is found by differentiating \(f(\cdot)\), equating the derivative to zero, and solving:

\[0 = \frac{d}{d\beta}f(\beta) = -2X^T(Y - X\beta)\] \[\Downarrow\]

\[\beta^{\star} = (X^TX)^{\dagger}X^T Y\]

Since \(X^TX\) may not be a full-rank matrix, the solution is written with the pseudoinverse.
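As a quick sanity check, the closed form can be verified numerically; the sketch below (random data, arbitrary sizes) compares the pseudoinverse formula against NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

# Closed-form solution: beta = (X^T X)^+ X^T Y
beta = np.linalg.pinv(X.T @ X) @ X.T @ Y

# Cross-check against NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

Both routes minimize the same MSE objective, so the coefficient vectors agree (up to numerical precision) whenever \(X\) has full column rank.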

Ridge Regression

Now we extend the setting introduced in \(\eqref{eq:linear_regression}\) to include a regularization term: \begin{equation} \beta^{\star} = \mathop{\mathrm{argmin}}\limits_\beta ||Y - X\beta||_2^2 + R(\lambda, \beta) \end{equation}

where \(\lambda\) controls the intensity of the regularizer.

For ridge regression, the regularizer is the squared Euclidean (\(\ell_2\)) norm, i.e.:

\begin{equation} \beta^{\star} = \mathop{\mathrm{argmin}}\limits_\beta h(\beta), \qquad h(\beta) = ||Y - X\beta||_2^2 + \lambda ||\beta||_2^2 \end{equation}

And the solution has a closed form similar to linear regression:

\begin{equation}\label{eq:ridge_solution} \beta^{\star} = (X^TX + \lambda I_p )^{-1} X^T Y \end{equation}

Solution details

We can express the regularizer as a dot product, i.e.

\[h(\beta) = ||Y - X\beta||_2^2 + \lambda \beta^T \beta\]

And now differentiate exactly as we did for linear regression:

\[0 = \frac{d}{d\beta}h(\beta) = -2X^T(Y - X\beta) + 2\lambda \beta\] \[\Downarrow\] \[\beta^{\star} = (X^TX + \lambda I_p )^{-1} X^T Y\]

Notice that here we can use the actual inverse of \(X^TX + \lambda I_p\): since \(X^TX\) is positive semi-definite, \(X^TX + \lambda I_p\) is positive definite for any \(\lambda > 0\), hence full rank and invertible.
Furthermore, notice that \(X^TX\beta + \lambda \beta = (X^TX + \lambda I_p)\beta\), which is the rearrangement used above.
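The closed-form ridge solution can be checked directly against the stationarity condition it was derived from; a minimal NumPy sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
lam = 0.7
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

# Closed-form ridge solution: beta = (X^T X + lam I_p)^{-1} X^T Y
beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Stationarity condition from the derivation: X^T (Y - X beta) = lam * beta
grad_check = X.T @ (Y - X @ beta) - lam * beta
```

`grad_check` is zero up to floating-point error, confirming that the closed form satisfies the first-order condition.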

Kernel Ridge Regression

Now we get to the interesting part. We’ll use matrix identities to transform the ridge regression solution \eqref{eq:ridge_solution} to the dimension of the data (\(n\)) rather than the parameters (\(p\)).

Given matrices \(A\in \mathbb{R}^{p \times n}, B\in \mathbb{R}^{n \times p}\), it holds:

\[(AB+ \lambda I_p)^{-1}A = A(BA + \lambda I_n)^{-1}\]

This is the push-through identity (a relative of the Woodbury matrix identity), and is easily proven: \((AB+ \lambda I_p)^{-1}A = A(BA + \lambda I_n)^{-1}\)

\[\Updownarrow \ \text{multiply on the left by}\ (AB+ \lambda I_p)\] \[A = (AB+ \lambda I_p)A(BA + \lambda I_n)^{-1}\] \[\Updownarrow \ \text{multiply on the right by}\ (BA + \lambda I_n)\] \[A(BA + \lambda I_n) = (AB+ \lambda I_p)A\] \[\Updownarrow\] \[ABA + \lambda A = ABA + \lambda A\]

The last line is a tautology, and every step is reversible, which proves the identity.
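The identity can also be checked numerically on random matrices; a quick NumPy sketch (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 4, 7
lam = 0.3
A = rng.normal(size=(p, n))
B = rng.normal(size=(n, p))

# (AB + lam I_p)^{-1} A, computed as a linear solve for stability
lhs = np.linalg.solve(A @ B + lam * np.eye(p), A)

# A (BA + lam I_n)^{-1}
rhs = A @ np.linalg.inv(B @ A + lam * np.eye(n))
```

Note the two inverses live in different dimensions (\(p \times p\) on the left, \(n \times n\) on the right), which is exactly the dimension swap we exploit next.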

So for our case, if we look at the ridge regression solution and set \(A = X^T\), \(B = X\), we get:

\[\beta^{\star} = (X^TX + \lambda I_p )^{-1} X^T Y = X^T(XX^T + \lambda I_n)^{-1}Y\]

Without getting into the details of the kernel trick and the representer theorem, we can replace \(XX^T\) with a kernel matrix \(K\). A kernel function \(k(\cdot, \cdot)\) measures the similarity between a pair of data points as an inner product of their features under a feature map \(\phi(\cdot)\):

\[k(\cdot, \cdot): \mathbb{R}^p \times \mathbb{R}^p \rightarrow \mathbb{R}\] \[\phi(\cdot): \mathbb{R}^p \rightarrow \mathbb{R}^d\] \[k(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle\] \[K_{i, j} = k(x_i, x_j)\]
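To make the kernel/feature-map correspondence concrete, here is a small illustrative example (not from the post): the homogeneous quadratic kernel on \(\mathbb{R}^2\), whose explicit feature map is known in closed form.

```python
import numpy as np

# Illustrative pair: for x in R^2, the kernel k(x, z) = (x . z)^2
# corresponds to the explicit feature map
# phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), so k(x, z) = <phi(x), phi(z)>.
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, z):
    return float(x @ z) ** 2

x1 = np.array([1.0, 2.0])
x2 = np.array([-0.5, 3.0])

kernel_value = k(x1, x2)            # evaluate the kernel directly
feature_value = phi(x1) @ phi(x2)   # inner product of explicit features
```

The two values coincide; the point of the kernel trick is that \(k\) can be evaluated without ever forming \(\phi\), which matters when \(d\) is large or infinite (as for the Gaussian kernel).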

And so, in the new space, we can express our solution using the feature map and the kernel function.
If we denote the stacked features as:

\[\Phi = \left( \begin{matrix} - \phi(x_1)^T - \\ \vdots \\ - \phi(x_n)^T - \end{matrix} \right)\]

We get:

\[X \rightarrow \Phi\] \[XX^T \rightarrow \Phi\Phi^T = K\]

And so the solution takes this new kernelized shape:

\[\beta^\star = \Phi^T(\Phi\Phi^T + \lambda I_n)^{-1}Y = \Phi^T(K + \lambda I_n)^{-1}Y\]

And so, given a sample \(x \in \mathbb{R}^p\), we map it into feature space and multiply by the coefficients to receive:

\[\hat{y} = \phi(x)^T \beta^\star = \phi(x)^T \Phi^T(K + \lambda I_n)^{-1}Y = \kappa(x)^T (K + \lambda I_n)^{-1}Y\]

where \(\kappa(x) \in \mathbb{R}^n\) with \(\kappa(x)_i = k(x, x_i)\). The prediction requires only kernel evaluations, never \(\phi\) itself.

So the point of all this mess is that we can choose in which dimension to solve our problem: the primal form \eqref{eq:ridge_solution} inverts a \(p \times p\) matrix, while the kernelized form inverts an \(n \times n\) matrix. When \(p \gg n\), or when \(\phi\) maps into a very high (even infinite) dimensional space, the kernel form is the cheaper option, and sometimes the only feasible one.
Additional Resources

  1. Andrew Ng’s lecture notes on SVM