When I was taking classes on machine learning, I learned about SVMs and the kernel trick to get non-linear SVMs. The last slide usually said something along the lines of “you can also kernelize other methods” without giving any more hints as to which other methods could be rewritten in terms of inner products and thus kernelized. So it came as a surprise to me when I began reading Gaussian Processes for Machine Learning and learned that not only is Bayesian linear regression (BLR) one of those methods, but kernelizing it gets you Gaussian Processes (GPs). The Distill journal has a nice visual introduction to GPs.
Even after Rasmussen introduced the kernelization of BLR as the weight-space view of GPs, I was still sceptical and expected to read that this merely gives you a “kernel BLR” which one can generalize further to get GPs. But no, kernelized BLR and GPs are actually identical. To really convince myself of this fact, I filled in the details of the kernelization process that Rasmussen omits.
In Bayesian linear regression we have a point $\vx \in \R^{D}$ and a noisy observation $y \in \R$, and we assume that the observation was generated by the model $y = \vw\T \vx + \epsilon$. Here $\epsilon \sim \N(0, \sigma^2)$ is independent Gaussian noise and $\vw \sim \N(\vzero, \mSigma_p)$ is a latent variable that describes the linear relationship between $\vx$ and $y$. Making a prediction $\hat{y}$ at a new point $\hat{\vx}$ given $n$ known data points $\mathcal{D} = (\mX, \vy) \in \R^{D \times n} \times \R^n$ in the Bayesian setting means marginalizing over the latent variable $\vw$, i.e.
\[\def\p{\mathrm{p}} \p(\hat{y} \mid \hat{\vx}, \mathcal{D}) = \int \p(\hat{y} \mid \hat{\vx}, \vw)\, \p(\vw \mid \mathcal{D}) \,\mathrm{d}\vw.\]
Some quite laborious transformations (or a high-level argument) will get you the result that $\hat{y}$ follows another Gaussian distribution
\[\hat{y} \sim \N\big( \underbrace{\sigma^{-2}\hat{\vx}\T\mA\inv\mX\vy}_{\hat{\mu}}, \underbrace{\sigma^2 + \hat{\vx}\T\mA\inv\hat{\vx}}_{\hat{\sigma}} \big)\]
where $\mA = \sigma^{-2}\mX\mX\T + \mSigma_{p}\inv$.
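To make this concrete, here is a minimal NumPy sketch of the weight-space predictive distribution on made-up toy data; the variable names simply mirror the math above and are not taken from any library.

```python
# A minimal sketch of the weight-space predictive distribution (toy data).
import numpy as np

rng = np.random.default_rng(0)
D, n, sigma = 3, 5, 0.1

X = rng.normal(size=(D, n))      # data points as columns
y = rng.normal(size=n)           # noisy observations
x_new = rng.normal(size=D)       # prediction point \hat{x}
Sigma_p = np.eye(D)              # prior covariance of the weights

A = X @ X.T / sigma**2 + np.linalg.inv(Sigma_p)
A_inv = np.linalg.inv(A)

mu_hat = x_new @ A_inv @ X @ y / sigma**2     # predictive mean
var_hat = sigma**2 + x_new @ A_inv @ x_new    # predictive variance
```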
To kernelize the predictive distribution and thus BLR, we have to rewrite both $\hat{\mu}$ and $\hat{\sigma}$ using only inner products between data points. Let’s begin with the mean.
\[\hat{\mu} = \sigma^{-2}\hat{\vx}\T\mA\inv\mX\vy = \sigma^{-2}\hat{\vx}\T \left(\sigma^{-2}\mX\mX\T + \mSigma_{p}\inv\right) \inv\mX\vy\]
Now we pull out $\mSigma_p$ using $(\mA + \mB\inv)\inv = \mB(\mA\mB + \mI)\inv$.
\[\hat{\mu} = \sigma^{-2}\hat{\vx}\T \mSigma_{p}\left(\sigma^{-2}\mX\mX\T\mSigma_{p} + \mI\right) \inv\mX\vy\]
Next, we move $\mX$ to form inner products using the push-through identity $(\mI + \mA\mB)\inv\mA = \mA(\mI + \mB\mA)\inv$ and push $\sigma$ inside.
\[\hat{\mu} = \hat{\vx}\T \mSigma_{p}\mX \left(\mX\T\mSigma_{p}\mX + \sigma^{2}\mI\right) \inv\vy\]
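If you, like me, only trust matrix identities once a computer has confirmed them, here is a quick numerical spot check of the two identities used above on random matrices (a sanity check, not a proof):

```python
# Spot-check the two identities from the mean derivation on random matrices:
#   (A + B^{-1})^{-1} = B (A B + I)^{-1}      ("pulling out" B)
#   (I + A B)^{-1} A  = A (I + B A)^{-1}      (push-through)
import numpy as np

inv, eye = np.linalg.inv, np.eye
rng = np.random.default_rng(1)

P = rng.normal(size=(4, 4))
Q = rng.normal(size=(4, 4)) + 4 * eye(4)   # keep Q comfortably invertible
assert np.allclose(inv(P + inv(Q)), Q @ inv(P @ Q + eye(4)))

# the push-through identity also holds for rectangular matrices
P = rng.normal(size=(4, 6))
Q = rng.normal(size=(6, 4))
assert np.allclose(inv(eye(4) + P @ Q) @ P, P @ inv(eye(6) + Q @ P))
```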
At this point we can introduce the kernel $k(\vx, \vx') = \langle \phi(\vx), \phi(\vx') \rangle$ with the feature mapping $\phi(\vx) = \mL\T\vx$, where $\mL$ is the Cholesky factor of $\mSigma_{p}$, i.e. $\mL\mL\T = \mSigma_{p}$. Expanding the kernel, we see that $k(\vx, \vx') = (\mL\T\vx)\T(\mL\T\vx') = \vx\T\mL\mL\T\vx' = \vx\T\mSigma_{p}\vx'$ and get the final kernelized mean
\[\hat{\mu} = k(\hat{\vx}, \mX) (k(\mX, \mX) + \sigma^2\mI)\inv \vy.\]
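Continuing the toy snippet from above, we can check numerically that the kernelized mean with $k(\vx, \vx') = \vx\T\mSigma_{p}\vx'$ agrees with the weight-space mean:

```python
# Continuing the first snippet: the kernelized mean matches the
# weight-space mean up to floating point error.
def k(a, b):
    return a.T @ Sigma_p @ b

K = k(X, X)                    # n x n Gram matrix k(X, X)
k_new = k(x_new, X)            # vector k(x_new, X)
mu_kernel = k_new @ np.linalg.inv(K + sigma**2 * np.eye(n)) @ y
assert np.allclose(mu_kernel, mu_hat)
```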
Kernelizing the variance begins in the same way, by pulling out $\mSigma_{p}$.
\[\begin{aligned} \hat{\sigma} & = \sigma^2 + \hat{\vx}\T \left( \mSigma_{p}\inv + \sigma^{-2}\mX\mX\T \right) \inv\hat{\vx}\\ & = \sigma^2 + \hat{\vx}\T\mSigma_{p} \left( \mI + \sigma^{-2}\mX\mX\T\mSigma_{p} \right) \inv\hat{\vx} \end{aligned}\]
This looks quite similar to what we had before, but we are missing an $\mX$ on the right side that would let us form inner products. Instead, we need to invoke the Woodbury matrix identity
\[(\mA + \mU\mC\mV)\inv = \mA\inv - \mA\inv\mU(\mC\inv + \mV\mA\inv\mU)\inv\mV\mA\inv\]
with $\mA = \mI$, $\mU = \mX$, $\mC = \sigma^{-2}\mI$ and $\mV = \mX\T\mSigma_{p}$.
\[\begin{aligned} \hat{\sigma} & = \sigma^2 + \hat{\vx}\T\mSigma_{p} \left( \mI - \mX \left( \sigma^2\mI + \mX\T\mSigma_{p}\mX \right)\inv \mX\T\mSigma_{p} \right) \hat{\vx}\\ & = \sigma^2 + \hat{\vx}\T\mSigma_{p}\hat{\vx} - \hat{\vx}\T\mSigma_{p}\mX \left( \sigma^2\mI + \mX\T\mSigma_{p}\mX \right)\inv \mX\T\mSigma_{p}\hat{\vx} \end{aligned}\]
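The same kind of numerical spot check works for the Woodbury step with these substitutions, reusing the toy matrices from the first snippet:

```python
# Spot-check the Woodbury step with A = I, U = X, C = sigma^{-2} I,
# V = X^T Sigma_p, reusing X, Sigma_p, sigma, D, n from the first snippet.
lhs = np.linalg.inv(np.eye(D) + X @ X.T @ Sigma_p / sigma**2)
rhs = (np.eye(D)
       - X @ np.linalg.inv(sigma**2 * np.eye(n) + X.T @ Sigma_p @ X) @ X.T @ Sigma_p)
assert np.allclose(lhs, rhs)
```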
At this point, the expression admits the same kernel formulation as the mean did before and we get
\[\hat{\sigma} = \sigma^2 + k(\hat{\vx}, \hat{\vx}) - k(\hat{\vx}, \mX) \left( \sigma^2\mI + k(\mX, \mX) \right)\inv k(\mX, \hat{\vx}).\]
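And, continuing the toy snippet once more, the kernelized variance agrees with the weight-space variance:

```python
# Continuing the first snippet (reusing k, K and k_new from the mean check):
# the kernelized variance matches the weight-space variance.
var_kernel = (sigma**2 + k(x_new, x_new)
              - k_new @ np.linalg.inv(K + sigma**2 * np.eye(n)) @ k(X, x_new))
assert np.allclose(var_kernel, var_hat)
```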
Compare the equations for $\hat{\mu}$ and $\hat{\sigma}$ with equations (2.22)-(2.24) in GPML for GP prediction and you will find them equivalent, except that our predictive variance carries the additional observation noise $\sigma^2$ because we predict the noisy observation $\hat{y}$ rather than the latent function value.
It is known that GPs are equivalent to deep neural networks in the infinite-width limit. (Apparently, GPs are also equivalent to spline smoothing, but that looks to be more of a theoretical result.) Speaking informally, this shows that “Bayesianization” and kernelization can generalize linear regression into something as powerful as deep learning. Goes to show that I underestimated the increase in “power” that kernelizing a method brings.