When I was taking classes on machine learning, I learned about SVMs and the kernel trick
to get non-linear SVMs. The last slide usually said something along the lines of "you can
also kernelize other methods" without giving any more hints as to which methods this refers to, that is, which methods can be rewritten in terms of inner products and thus kernelized. So it came as a
surprise to me when I began reading Gaussian Processes for Machine Learning and
learned that not only is Bayesian linear regression (BLR) one of those methods, but
kernelizing it gets you Gaussian Processes (GPs).
Even after Rasmussen introduced the kernelization of BLR as the weight-space view of GPs, I was
still sceptical and expected to read that this gives you kernel BLR, which one can then generalize
further to get GPs. But no, kernelized BLR and GPs are actually identical. To really
convince myself of this fact, I filled in the details of the kernelization that
Rasmussen omits.
In Bayesian linear regression we have a point $x \in \mathbb{R}^D$ and a noisy
observation $y \in \mathbb{R}$, and we assume that the observation was generated by the model
$y = w^\top x + \epsilon$. Here $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is
independent Gaussian noise and $w \sim \mathcal{N}(0, \Sigma_p)$ is a latent
variable that describes the linear relationship between $x$ and $y$.

Making a prediction $\hat{y}$ at a new point $\hat{x}$ given $n$
known data points $\mathcal{D} = (X, y) \in (\mathbb{R}^{D \times n}, \mathbb{R}^n)$
in the Bayesian setting means marginalizing over the latent variables, i.e.

$$p(\hat{y} \mid \hat{x}, \mathcal{D}) = \int p(\hat{y} \mid \hat{x}, w)\, p(w \mid \mathcal{D})\, \mathrm{d}w.$$

Some quite laborious transformations (or a high-level argument) will get you the result
that $\hat{y}$ follows another Gaussian distribution

$$\hat{y} \sim \mathcal{N}\Big(\underbrace{\sigma^{-2} \hat{x}^\top A^{-1} X y}_{\hat{\mu}},\ \underbrace{\sigma^2 + \hat{x}^\top A^{-1} \hat{x}}_{\hat{\sigma}}\Big) \quad \text{where } A = \sigma^{-2} X X^\top + \Sigma_p^{-1}.$$
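To make the weight-space formulas concrete, here is a minimal NumPy sketch of the predictive distribution above; the dimensions, the prior $\Sigma_p = I$ and the noise level $\sigma^2 = 0.1$ are made-up example values, not anything prescribed by the derivation.

```python
import numpy as np

# Minimal sketch of the weight-space predictive distribution (made-up example data).
rng = np.random.default_rng(0)
D, n = 3, 20
sigma2 = 0.1                 # observation noise variance sigma^2
Sigma_p = np.eye(D)          # prior covariance of the weights w
X = rng.normal(size=(D, n))  # columns are the n training points
w = rng.normal(size=D)       # a "true" weight vector to generate data from
y = X.T @ w + np.sqrt(sigma2) * rng.normal(size=n)
x_new = rng.normal(size=D)   # test point \hat{x}

# A = sigma^{-2} X X^T + Sigma_p^{-1}
A = X @ X.T / sigma2 + np.linalg.inv(Sigma_p)
A_inv = np.linalg.inv(A)

mu_hat = x_new @ A_inv @ X @ y / sigma2    # predictive mean \hat{mu}
var_hat = sigma2 + x_new @ A_inv @ x_new   # predictive variance \hat{sigma}, incl. noise
print(mu_hat, var_hat)
```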
To kernelize the predictive distribution and thus BLR, we have to rewrite both
$\hat{\mu}$ and $\hat{\sigma}$ using only inner products between data
points. Let's begin with the mean.

$$\hat{\mu} = \sigma^{-2} \hat{x}^\top A^{-1} X y = \sigma^{-2} \hat{x}^\top \left( \sigma^{-2} X X^\top + \Sigma_p^{-1} \right)^{-1} X y$$

Now we pull out $\Sigma_p$ using $(A + B^{-1})^{-1} = B (A B + I)^{-1}$.

$$\hat{\mu} = \sigma^{-2} \hat{x}^\top \Sigma_p \left( \sigma^{-2} X X^\top \Sigma_p + I \right)^{-1} X y$$

Next, we move $X$ to form inner products using the push-through identity
$(I + A B)^{-1} A = A (I + B A)^{-1}$ (which follows from $A (I + B A) = (I + A B) A$) and push $\sigma^{-2}$ into the inverse.

$$\hat{\mu} = \hat{x}^\top \Sigma_p X \left( X^\top \Sigma_p X + \sigma^2 I \right)^{-1} y$$

At this point we can introduce the kernel $k(x, y) = \langle \phi(x), \phi(y) \rangle$ with the feature mapping $\phi(x) = L^\top x$, where
$L L^\top = \Sigma_p$ is the Cholesky decomposition of
$\Sigma_p$. Expanding the kernel, we see that $k(x, y) = (L^\top x)^\top (L^\top y) = x^\top L L^\top y = x^\top \Sigma_p y$ and
get the final kernelized mean

$$\hat{\mu} = k(\hat{x}, X) \left( k(X, X) + \sigma^2 I \right)^{-1} y.$$
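As a quick sanity check on the matrix identities, the kernelized mean can be compared numerically to the weight-space mean. This continues the NumPy sketch above; the kernel function `k` below is just the linear kernel $k(a, b) = a^\top \Sigma_p b$ from the derivation.

```python
# Continuing the sketch above: the kernel induced by phi(x) = L^T x with L L^T = Sigma_p.
def k(a, b):
    return a.T @ Sigma_p @ b   # works for single points and for matrices of column vectors

K = k(X, X)                                          # n x n Gram matrix
mu_kernel = k(x_new, X) @ np.linalg.solve(K + sigma2 * np.eye(n), y)
print(np.allclose(mu_hat, mu_kernel))                # True: both forms agree
```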
Kernelizing the variance begins in the same way, by pulling out $\Sigma_p$.

$$\hat{\sigma} = \sigma^2 + \hat{x}^\top \left( \Sigma_p^{-1} + \sigma^{-2} X X^\top \right)^{-1} \hat{x} = \sigma^2 + \hat{x}^\top \Sigma_p \left( I + \sigma^{-2} X X^\top \Sigma_p \right)^{-1} \hat{x}$$

This looks quite similar to what we had before, but we are missing an $X$ on the
right side, which prevents us from forming inner products. Instead we need to invoke the
Woodbury matrix identity

$$(A + U C V)^{-1} = A^{-1} - A^{-1} U \left( C^{-1} + V A^{-1} U \right)^{-1} V A^{-1}$$

with $A = I$, $U = X$, $C = \sigma^{-2} I$ and $V = X^\top \Sigma_p$.

$$\hat{\sigma} = \sigma^2 + \hat{x}^\top \Sigma_p \left( I - X \left( \sigma^2 I + X^\top \Sigma_p X \right)^{-1} X^\top \Sigma_p \right) \hat{x} = \sigma^2 + \hat{x}^\top \Sigma_p \hat{x} - \hat{x}^\top \Sigma_p X \left( \sigma^2 I + X^\top \Sigma_p X \right)^{-1} X^\top \Sigma_p \hat{x}$$

At this point, the expression admits the same kernel formulation as the mean did before and we get

$$\hat{\sigma} = \sigma^2 + k(\hat{x}, \hat{x}) - k(\hat{x}, X) \left( \sigma^2 I + k(X, X) \right)^{-1} k(X, \hat{x}).$$

Compare the equations for $\hat{\mu}$ and $\hat{\sigma}$ with equations
(2.22)-(2.24) in GPML for GP prediction and you will find them equivalent, except that our
$\hat{\sigma}$ carries the additional observation noise $\sigma^2$, because we predict the noisy observation $\hat{y}$ rather than the latent function value.
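The same numerical check works for the variance. Continuing the sketch above, the kernelized variance agrees with the weight-space variance, which is just the GP predictive variance under this linear kernel plus the observation noise.

```python
# Continuing the sketch: kernelized variance vs. the weight-space variance.
var_kernel = sigma2 + k(x_new, x_new) - k(x_new, X) @ np.linalg.solve(
    sigma2 * np.eye(n) + K, k(X, x_new)
)
print(np.allclose(var_hat, var_kernel))   # True: identical up to numerical error
```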
It is known that deep Bayesian neural networks become equivalent to GPs in the infinite-width
limit. Speaking informally, this shows that "Bayesianization" and kernelization can
generalize linear regression into something as powerful as deep learning. Goes to show
that I underestimated the increase in "power" that kernelizing a method brings.