
Gaussian Processes are Bayesian Linear Regression

When I was taking classes on machine learning, I learned about SVMs and the kernel trick to get non-linear SVMs. The last slide usually said something along the lines of "you can also kernelize other methods" without giving any more hints as to which methods could be rewritten in terms of inner products and thus kernelized. So it came as a surprise to me when I began reading Gaussian Processes for Machine Learning and learned that not only is Bayesian linear regression (BLR) one of those methods, but kernelizing it gets you Gaussian Processes (GPs).1

Even after Rasmussen introduced the kernelization of BLR as the weight-space view of GPs, I was still sceptical and expected to read that this is just kernelized BLR and that one can generalize it further to get GPs. But no, kernelized BLR and GPs are actually identical. To really convince myself of this fact, I filled in the details of the kernelization process that Rasmussen omits.

In Bayesian linear regression we have a point $\bm{x} \in \mathbb{R}^{D}$ and a noisy observation $y \in \mathbb{R}$, and we assume that the observation was generated by the model $y = \bm{w}^\intercal \bm{x} + \epsilon$.
Here $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is independent Gaussian noise and $\bm{w} \sim \mathcal{N}(\bm{0}, \bm{\Sigma}_{p})$ is a latent variable that describes the linear relationship between $\bm{x}$ and $y$.
Making a prediction $\hat{y}$ at a new point $\hat{\bm{x}}$ given $n$ known data points $\mathcal{D} = (\bm{X}, \bm{y}) \in (\mathbb{R}^{D \times n}, \mathbb{R}^{n})$ in the Bayesian setting means to marginalize over the latent variables, i.e.

$$\mathrm{p}(\hat{y} \mid \hat{\bm{x}}, \mathcal{D}) = \int \mathrm{p}(\hat{y} \mid \hat{\bm{x}}, \bm{w})\, \mathrm{p}(\bm{w} \mid \mathcal{D}) \,\mathrm{d}\bm{w}.$$

Some quite laborious transformations (or a high-level argument) will get you the result that $\hat{y}$ follows another Gaussian distribution

$$\hat{y} \sim \mathcal{N}\Big( \underbrace{\sigma^{-2}\hat{\bm{x}}^\intercal\bm{A}^{-1}\bm{X}\bm{y}}_{\hat{\mu}},\ \underbrace{\sigma^2 + \hat{\bm{x}}^\intercal\bm{A}^{-1}\hat{\bm{x}}}_{\hat{\sigma}} \Big)$$

where $\bm{A} = \sigma^{-2}\bm{X}\bm{X}^\intercal + \bm{\Sigma}_{p}^{-1}$.
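
To make this concrete, here is a minimal NumPy sketch of the predictive distribution above. Everything in it is made up for illustration: the sizes, the noise level `sigma`, the prior covariance `Sigma_p` and the data `X`, `y`, `x_new`.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n, sigma = 3, 20, 0.1                          # illustrative sizes and noise level

Sigma_p = np.diag(rng.uniform(0.5, 2.0, size=D))  # prior covariance of the weights
X = rng.normal(size=(D, n))                       # columns are the training inputs
y = X.T @ rng.normal(size=D) + sigma * rng.normal(size=n)  # noisy observations
x_new = rng.normal(size=D)                        # test input

# A = sigma^-2 X X^T + Sigma_p^-1
A = sigma**-2 * X @ X.T + np.linalg.inv(Sigma_p)
A_inv = np.linalg.inv(A)

mu_hat = sigma**-2 * x_new @ A_inv @ X @ y        # predictive mean
sigma_hat = sigma**2 + x_new @ A_inv @ x_new      # predictive variance
print(mu_hat, sigma_hat)
```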

To kernelize the predictive distribution and thus BLR, we have to rewrite both $\hat{\mu}$ and $\hat{\sigma}$ using only inner products between data points. Let's begin with the mean.

$$\hat{\mu} = \sigma^{-2}\hat{\bm{x}}^\intercal\bm{A}^{-1}\bm{X}\bm{y} = \sigma^{-2}\hat{\bm{x}}^\intercal \left(\sigma^{-2}\bm{X}\bm{X}^\intercal + \bm{\Sigma}_{p}^{-1}\right)^{-1}\bm{X}\bm{y}$$

Now we pull out $\bm{\Sigma}_{p}$ using $(\bm{A} + \bm{B}^{-1})^{-1} = \bm{B}(\bm{A}\bm{B} + \bm{I})^{-1}$.

$$\hat{\mu} = \sigma^{-2}\hat{\bm{x}}^\intercal \bm{\Sigma}_{p}\left(\sigma^{-2}\bm{X}\bm{X}^\intercal\bm{\Sigma}_{p} + \bm{I}\right)^{-1}\bm{X}\bm{y}$$
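
If you do not want to take the identity on faith, a quick numerical check on random symmetric positive definite matrices (with arbitrary sizes) is easy to do:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4
A = rng.normal(size=(D, D))
A = A @ A.T + D * np.eye(D)   # random symmetric positive definite matrix
B = rng.normal(size=(D, D))
B = B @ B.T + D * np.eye(D)

# (A + B^-1)^-1 = B (A B + I)^-1
lhs = np.linalg.inv(A + np.linalg.inv(B))
rhs = B @ np.linalg.inv(A @ B + np.eye(D))
assert np.allclose(lhs, rhs)
```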

Next, we move $\bm{X}$ to form inner products using the push-through identity $(\bm{I} + \bm{A}\bm{B})^{-1}\bm{A} = \bm{A}(\bm{I} + \bm{B}\bm{A})^{-1}$ and push $\sigma$ inside.

$$\hat{\mu} = \hat{\bm{x}}^\intercal \bm{\Sigma}_{p}\bm{X} \left(\bm{X}^\intercal\bm{\Sigma}_{p}\bm{X} + \sigma^{2}\bm{I}\right)^{-1}\bm{y}$$
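
The push-through identity can be checked in the same way. Note that it also holds for rectangular matrices, which is exactly how we used it here since $\bm{X}$ is $D \times n$; the sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
D, n = 3, 5
A = rng.normal(size=(D, n))   # plays the role of X
B = rng.normal(size=(n, D))   # plays the role of sigma^-2 X^T Sigma_p

lhs = np.linalg.inv(np.eye(D) + A @ B) @ A   # note: the two identity matrices
rhs = A @ np.linalg.inv(np.eye(n) + B @ A)   # have different sizes
assert np.allclose(lhs, rhs)
```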

At this point we can introduce the kernel $k(\bm{x}, \bm{y}) = \langle \phi(\bm{x}), \phi(\bm{y}) \rangle$ with the feature mapping $\phi(\bm{x}) = \bm{L}^\intercal\bm{x}$, where $\bm{L}\bm{L}^\intercal = \bm{\Sigma}_{p}$ is the Cholesky decomposition of $\bm{\Sigma}_{p}$.
Expanding the kernel, we see that $k(\bm{x}, \bm{y}) = (\bm{L}^\intercal\bm{x})^\intercal(\bm{L}^\intercal\bm{y}) = \bm{x}^\intercal\bm{L}\bm{L}^\intercal\bm{y} = \bm{x}^\intercal\bm{\Sigma}_{p}\bm{y}$ and get the final kernelized mean

$$\hat{\mu} = k(\hat{\bm{x}}, \bm{X}) \left(k(\bm{X}, \bm{X}) + \sigma^2\bm{I}\right)^{-1} \bm{y}.$$
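
As a sanity check on the derivation so far, the following sketch compares the kernelized mean against the weight-space mean on random data. All names and numbers are illustrative; the kernel is the linear one induced by $\bm{\Sigma}_{p}$, i.e. $k(\bm{a}, \bm{b}) = \bm{a}^\intercal\bm{\Sigma}_{p}\bm{b}$.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n, sigma = 3, 20, 0.1
Sigma_p = np.diag(rng.uniform(0.5, 2.0, size=D))
X = rng.normal(size=(D, n))
y = X.T @ rng.normal(size=D) + sigma * rng.normal(size=n)
x_new = rng.normal(size=D)

# weight-space form of the mean
A = sigma**-2 * X @ X.T + np.linalg.inv(Sigma_p)
mu_weight = sigma**-2 * x_new @ np.linalg.inv(A) @ X @ y

# kernelized form with k(a, b) = a^T Sigma_p b
K = X.T @ Sigma_p @ X                 # k(X, X), the n x n Gram matrix
k_new = x_new @ Sigma_p @ X           # k(x_new, X), a length-n vector
mu_kernel = k_new @ np.linalg.solve(K + sigma**2 * np.eye(n), y)

assert np.isclose(mu_weight, mu_kernel)
```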

Kernelizing the variance begins in the same way, by pulling out $\bm{\Sigma}_{p}$.

$$\begin{aligned}
\hat{\sigma} &= \sigma^2 + \hat{\bm{x}}^\intercal \left( \bm{\Sigma}_{p}^{-1} + \sigma^{-2}\bm{X}\bm{X}^\intercal \right)^{-1}\hat{\bm{x}}\\
&= \sigma^2 + \hat{\bm{x}}^\intercal\bm{\Sigma}_{p} \left( \bm{I} + \sigma^{-2}\bm{X}\bm{X}^\intercal\bm{\Sigma}_{p} \right)^{-1}\hat{\bm{x}}
\end{aligned}$$

This looks quite similar to what we had before, but we are missing an $\bm{X}$ on the right side, which prevents us from forming inner products. Instead, we need to invoke the Woodbury matrix identity

$$(\bm{A} + \bm{U}\bm{C}\bm{V})^{-1} = \bm{A}^{-1} - \bm{A}^{-1}\bm{U}(\bm{C}^{-1} + \bm{V}\bm{A}^{-1}\bm{U})^{-1}\bm{V}\bm{A}^{-1}$$

with $\bm{A} = \bm{I}$, $\bm{U} = \bm{X}$, $\bm{C} = \sigma^{-2}\bm{I}$ and $\bm{V} = \bm{X}^\intercal\bm{\Sigma}_{p}$.

$$\begin{aligned}
\hat{\sigma} &= \sigma^2 + \hat{\bm{x}}^\intercal\bm{\Sigma}_{p} \left( \bm{I} - \bm{X} \left( \sigma^2\bm{I} + \bm{X}^\intercal\bm{\Sigma}_{p}\bm{X} \right)^{-1} \bm{X}^\intercal\bm{\Sigma}_{p} \right) \hat{\bm{x}}\\
&= \sigma^2 + \hat{\bm{x}}^\intercal\bm{\Sigma}_{p}\hat{\bm{x}} - \hat{\bm{x}}^\intercal\bm{\Sigma}_{p}\bm{X} \left( \sigma^2\bm{I} + \bm{X}^\intercal\bm{\Sigma}_{p}\bm{X} \right)^{-1} \bm{X}^\intercal\bm{\Sigma}_{p}\hat{\bm{x}}
\end{aligned}$$
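
Here is a quick numerical check of this Woodbury step with exactly the choice of $\bm{A}$, $\bm{U}$, $\bm{C}$ and $\bm{V}$ above, on made-up data:

```python
import numpy as np

rng = np.random.default_rng(3)
D, n, sigma = 3, 5, 0.1
Sigma_p = np.diag(rng.uniform(0.5, 2.0, size=D))
X = rng.normal(size=(D, n))

# invert I + sigma^-2 X X^T Sigma_p directly ...
lhs = np.linalg.inv(np.eye(D) + sigma**-2 * X @ X.T @ Sigma_p)
# ... and via Woodbury with A = I, U = X, C = sigma^-2 I, V = X^T Sigma_p
rhs = (np.eye(D)
       - X @ np.linalg.inv(sigma**2 * np.eye(n) + X.T @ Sigma_p @ X) @ X.T @ Sigma_p)
assert np.allclose(lhs, rhs)
```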

At this point, the expression admits the same kernel formulation as the mean did before and we get

$$\hat{\sigma} = \sigma^2 + k(\hat{\bm{x}}, \hat{\bm{x}}) - k(\hat{\bm{x}}, \bm{X}) \left( \sigma^2\bm{I} + k(\bm{X}, \bm{X}) \right)^{-1} k(\bm{X}, \hat{\bm{x}}).$$
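
And the corresponding sanity check for the variance, again with illustrative data and the linear kernel $k(\bm{a}, \bm{b}) = \bm{a}^\intercal\bm{\Sigma}_{p}\bm{b}$:

```python
import numpy as np

rng = np.random.default_rng(0)
D, n, sigma = 3, 20, 0.1
Sigma_p = np.diag(rng.uniform(0.5, 2.0, size=D))
X = rng.normal(size=(D, n))
x_new = rng.normal(size=D)

# weight-space form of the variance
A = sigma**-2 * X @ X.T + np.linalg.inv(Sigma_p)
var_weight = sigma**2 + x_new @ np.linalg.inv(A) @ x_new

# kernelized form with k(a, b) = a^T Sigma_p b
K = X.T @ Sigma_p @ X
k_new = x_new @ Sigma_p @ X
var_kernel = (sigma**2 + x_new @ Sigma_p @ x_new
              - k_new @ np.linalg.solve(K + sigma**2 * np.eye(n), k_new))

assert np.isclose(var_weight, var_kernel)
```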

Compare the equations for $\hat{\mu}$ and $\hat{\sigma}$ with equations (2.22)-(2.24) in GPML for GP prediction and you will find them equivalent, except that our predictive variance includes the additional observation noise $\sigma^2$.

It is known that GPs are equivalent to deep neural networks in the infinite-width limit.2 Speaking informally, this shows that "Bayesianization" and kernelization can generalize linear regression into something as powerful as deep learning. Goes to show that I underestimated the increase in "power" that kernelizing a method brings.

1 The Distill journal has a nice visual introduction to GPs.

2 Apparently, GPs are also equivalent to spline smoothing but that looks to be more of a theoretical result.
