Gaussian Processes are Bayesian Linear Regression
Jan 23, 2021
When I was taking classes on machine learning, I learned about SVMs and the kernel trick
to get non-linear SVMs. The last slide usually said something along the lines of "you can
also kernelize other methods" without giving anymore hints as to which methods this refers
to that could be rewritten in terms of inner products and thus kernelized. So it came as a
surprise to me when I began reading Gaussian Processes for Machine Learning and
learned that not only is Bayesian linear regression (BLR) one of those methods but
kernelizing it gets you Gaussian Processes (GPs).
Even after Rasmussen introduced the kernelization of BLR as the weight-view of GPs, I was
still sceptical and expected to read that this is kernel-BLR and one can generalize it
further to get GPs. But no, kernelized BLR and GPs are actually identical. To really
convince myself of this fact, I filled in the details in the kernelization process that
Rasmussen omits.
In Bayesian linear regression we have a point x ∈ R D \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\v{x} \in \R^{D} x ∈ R D and a noisy
observation y ∈ R \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
y \in \R y ∈ R and we assume that the observation was generated by the model
y = w ⊺ x + ϵ \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
y = \v{w}\T \v{x} + \epsilon y = w ⊺ x + ϵ . Here ϵ ∼ N ( 0 , σ 2 ) \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\epsilon \sim \N(0, \sigma^2) ϵ ∼ N ( 0 , σ 2 ) is
independent Gaussian noise and w ∼ N ( 0 , Σ p ) \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\v{w} \sim \N(\v{0}, \m{\Sigma}_p) w ∼ N ( 0 , Σ p ) is a latent
variable that describes the linear relationship between x \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\v{x} x and y \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
y y .
Making a prediction y ^ \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\pred{y} y ^ at a new point x ^ \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\pred{\v{x}} x ^ given n \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
n n
known data points D = ( X , y ) ∈ ( R D × n , R n ) \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\data = (\obs{\m{X}}, \obs{\v{y}}) \in (\R^{D \times n}, \R^n) D = ( X , y ) ∈ ( R D × n , R n )
in the Bayesian setting means to marginalize over the latent variables, i.e.
p ( y ^ ∣ x ^ , D ) = ∫ p ( y ^ ∣ x ^ , w ) p ( w ∣ D ) d w . \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\p(\pred{y} \mid \pred{\v{x}}, \data) = \int \p(\pred{y} \mid \pred{\v{x}}, \v{w})\, \p(\v{w} \mid \data) \,\d\v{w}. p ( y ^ ∣ x ^ , D ) = ∫ p ( y ^ ∣ x ^ , w ) p ( w ∣ D ) d w . Some quite laborious transformations (or a high-level argument) will get you the result
that y ^ \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\pred{y} y ^ follows another Gaussian distribution
y ^ ∼ N ( σ − 2 x ^ ⊺ A − 1 X y ⏟ μ ^ , σ 2 + x ^ ⊺ A − 1 x ^ ⏟ σ ^ ) \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\pred{y} \sim \N\big( \underbrace{\sigma^{-2}\pred{\v{x}}\T\m{A}\inv\obs{\m{X}}\obs\v{y}}_{\pred{\mu}}, \underbrace{\sigma^2 + \pred{\v{x}}\T\m{A}\inv\pred{\v{x}}}_{\pred{\sigma}} \big) y ^ ∼ N ( μ ^ σ − 2 x ^ ⊺ A − 1 X y , σ ^ σ 2 + x ^ ⊺ A − 1 x ^ ) where A = σ − 2 X X ⊺ + Σ p − 1 \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\m{A} = \sigma^{-2}\obs{\m{X}}\obs{\m{X}}\T + \m{\Sigma}_{p}\inv A = σ − 2 X X ⊺ + Σ p − 1 .
To kernelize the predictive distribution and thus BLR, we have to rewrite both
μ ^ \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\pred{\mu} μ ^ and σ ^ \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\pred{\sigma} σ ^ using only inner products between data
points. Let's begin with the mean.
μ ^ = σ − 2 x ^ ⊺ A − 1 X y = σ − 2 x ^ ⊺ ( σ − 2 X X ⊺ + Σ p − 1 ) − 1 X y \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\pred{\mu} = \sigma^{-2}\pred{\v{x}}\T\m{A}\inv\obs{\m{X}}\obs\v{y} = \sigma^{-2}\pred{\v{x}}\T \left(\sigma^{-2}\obs{\m{X}}\obs{\m{X}}\T + \m{\Sigma}_{p}\inv\right) \inv\obs{\m{X}}\obs\v{y} μ ^ = σ − 2 x ^ ⊺ A − 1 X y = σ − 2 x ^ ⊺ ( σ − 2 X X ⊺ + Σ p − 1 ) − 1 X y Now we pull out Σ p \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\m{\Sigma_p} Σ p using ( A + B − 1 ) − 1 = B ( A B + I ) − 1 \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
(\m{A} + \m{B}\inv)\inv =
\m{B}(\m{A}\m{B} + \m{I})\inv ( A + B − 1 ) − 1 = B ( A B + I ) − 1 .
μ ^ = σ − 2 x ^ ⊺ Σ p ( σ − 2 X X ⊺ Σ p + I ) − 1 X y \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\pred{\mu} = \sigma^{-2}\pred{\v{x}}\T \m{\Sigma}_{p}\left(\sigma^{-2}\obs{\m{X}}\obs{\m{X}}\T\m{\Sigma}_{p} + \m{I}\right) \inv\obs{\m{X}}\obs\v{y} μ ^ = σ − 2 x ^ ⊺ Σ p ( σ − 2 X X ⊺ Σ p + I ) − 1 X y Next, we move X \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\m{X} X to form inner products using the push-through identity
( I + A B ) − 1 A = A ( I + B A ) − 1 \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
(\m{I} + \m{A}\m{B})\inv\m{A} = \m{A}(\m{I} + \m{B}\m{A})\inv ( I + A B ) − 1 A = A ( I + B A ) − 1 and push σ \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\sigma σ inside.
μ ^ = x ^ ⊺ Σ p X ( X ⊺ Σ p X + σ 2 I ) − 1 y \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\pred{\mu} = \pred{\v{x}}\T \m{\Sigma}_{p}\obs{\m{X}} \left(\obs{\m{X}}\T\m{\Sigma}_{p}\obs{\m{X}} + \sigma^{2}\m{I}\right) \inv\obs\v{y} μ ^ = x ^ ⊺ Σ p X ( X ⊺ Σ p X + σ 2 I ) − 1 y At this point we can introduce the kernel k ( x , y ) = ⟨ ϕ ( x ) , ϕ ( y ) ⟩ \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
k(\v{x}, \v{y}) = \langle \phi(\v{x}),
\phi(\v{y}) \rangle k ( x , y ) = ⟨ ϕ ( x ) , ϕ ( y )⟩ with the feature mapping ϕ ( x ) = L ⊺ x \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\phi(\v{x}) = \m{L}\T\v{x} ϕ ( x ) = L ⊺ x where
L L ⊺ = Σ p \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\m{L}\m{L}\T = \m{\Sigma}_{p} L L ⊺ = Σ p is the Cholesky decomposition of
Σ p \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\m{\Sigma}_{p} Σ p . Expanding the kernel, we see that k ( x , y ) = ( L ⊺ x ) ⊺ ( L ⊺ y ) = x L L ⊺ y = x Σ p y \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
k(\v{x}, \v{y}) =
(\m{L}\T\v{x})\T(\m{L}\T\v{y}) = \v{x}\m{L}\m{L}\T\v{y} = \v{x}\m{\Sigma}_{p}\v{y} k ( x , y ) = ( L ⊺ x ) ⊺ ( L ⊺ y ) = x L L ⊺ y = x Σ p y and
get the final kernelized mean
μ ^ = k ( x ^ , X ) ( k ( X , X ) + σ 2 I ) − 1 y . \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\pred{\mu} = k(\pred{\v{x}}, \obs{\m{X}}) (k(\obs{\m{X}}, \obs{\m{X}}) + \sigma^2\m{I})\inv \obs{\v{y}}. μ ^ = k ( x ^ , X ) ( k ( X , X ) + σ 2 I ) − 1 y . Kernelizing the variance begins in the same way with pulling out Σ p \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\m{\Sigma}_{p} Σ p .
σ ^ = σ 2 + x ^ ⊺ ( Σ p − 1 + σ − 2 X X ⊺ ) − 1 x ^ = σ 2 + x ^ ⊺ Σ p ( I + σ − 2 X X ⊺ Σ p ) − 1 x ^ \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\begin{aligned}
\pred{\sigma} & = \sigma^2 + \pred{\v{x}}\T \left( \m{\Sigma}_{p}\inv + \sigma^{-2}\obs{\m{X}}\obs{\m{X}}\T \right) \inv\pred{\v{x}}\\
& = \sigma^2 + \pred{\v{x}}\T\m{\Sigma}_{p} \left( \m{I} + \sigma^{-2}\obs{\m{X}}\obs{\m{X}}\T\m{\Sigma}_{p} \right) \inv\pred{\v{x}}
\end{aligned} σ ^ = σ 2 + x ^ ⊺ ( Σ p − 1 + σ − 2 X X ⊺ ) − 1 x ^ = σ 2 + x ^ ⊺ Σ p ( I + σ − 2 X X ⊺ Σ p ) − 1 x ^ This looks quite similar to what we had before but we are missing an X \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\m{X} X on the
right side that prevents us from forming inner products. Instead we need to invoke the
Woodbury matrix identity
( A + U C V ) − 1 = A − 1 − A − 1 U ( C − 1 + V A − 1 U ) − 1 V A − 1 \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
(\m{A} + \m{U}\m{C}\m{V})\inv = \m{A}\inv - \m{A}\inv\m{U}(\m{C}\inv + \m{V}\m{A}\inv\m{U})\inv\m{V}\m{A}\inv ( A + U C V ) − 1 = A − 1 − A − 1 U ( C − 1 + V A − 1 U ) − 1 V A − 1 with A = I \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\m{A} = \m{I} A = I , U = X \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\m{U} = \obs{\m{X}} U = X , C = σ − 2 I \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\m{C} = \sigma^{-2}\m{I} C = σ − 2 I and
V = X ⊺ Σ p \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\m{V} = \obs{\m{X}}\T\m{\Sigma}_{p} V = X ⊺ Σ p .
σ ^ = σ 2 + x ^ ⊺ Σ p ( I − X ( σ 2 I + X ⊺ Σ p X ) − 1 X ⊺ Σ p ) x ^ = σ 2 + x ^ ⊺ Σ p x ^ − x ^ ⊺ Σ p X ( σ 2 I + X ⊺ Σ p X ) − 1 X ⊺ Σ p x ^ \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\begin{aligned}
\pred{\sigma} & = \sigma^2 + \pred{\v{x}}\T\m{\Sigma}_{p} \left( \m{I} - \obs{\m{X}} \left( \sigma^2\m{I} + \obs{\m{X}}\T\m{\Sigma}_{p}\obs{\m{X}} \right)\inv \obs{\m{X}}\T\m{\Sigma}_{p} \right) \pred{\v{x}}\\
& = \sigma^2 + \pred{\v{x}}\T\m{\Sigma}_{p}\pred{\v{x}} - \pred{\v{x}}\T\m{\Sigma}_{p}\obs{\m{X}} \left( \sigma^2\m{I} + \obs{\m{X}}\T\m{\Sigma}_{p}\obs{\m{X}} \right)\inv \obs{\m{X}}\T\m{\Sigma}_{p}\pred{\v{x}}
\end{aligned} σ ^ = σ 2 + x ^ ⊺ Σ p ( I − X ( σ 2 I + X ⊺ Σ p X ) − 1 X ⊺ Σ p ) x ^ = σ 2 + x ^ ⊺ Σ p x ^ − x ^ ⊺ Σ p X ( σ 2 I + X ⊺ Σ p X ) − 1 X ⊺ Σ p x ^ At this point, the expression admits the same kernel formulation as the mean did before and we get
σ ^ = σ 2 + k ( x ^ , x ^ ) − k ( x ^ , X ) ( σ 2 I + k ( X , X ) ) − 1 k ( X , x ^ ) . \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\pred{\sigma} = \sigma^2 + k(\pred{\v{x}}, \pred{\v{x}}) - k(\pred{\v{x}}, \obs{\m{X}}) \left( \sigma^2\m{I} + k(\obs{\m{X}}, \obs{\m{X}}) \right)\inv k(\obs{\m{X}}, \pred{\v{x}}). σ ^ = σ 2 + k ( x ^ , x ^ ) − k ( x ^ , X ) ( σ 2 I + k ( X , X ) ) − 1 k ( X , x ^ ) . Compare the equations for μ ^ \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\pred{\mu} μ ^ and σ ^ \newcommand{\nth}[2]{#1^{\left(#2\right)}}
\newcommand{\el}[2]{#1_{#2}}
\newcommand{\col}[2]{#1_{:,#2}}
\newcommand{\row}[2]{#1_{#2,:}}
\renewcommand{\v}[1]{{\bm{#1}}}
\newcommand{\m}[1]{{\bm{#1}}}
\newcommand{\s}[1]{{\mathcal{#1}}}
\renewcommand{\set}[1]{{\left\{#1\right\}}}
% Symbols
\def\d{{\mathrm{d}}}
\def\R{{\mathbb{R}}}
\def\Rplus{{\pos{\R}}}
\def\N{{\mathcal{N}}}
% Operators
\def\E{\operatorname*{\mathbb{E}}}
\def\T{^{\intercal}}
\def\inv{^{-1}}
% Functions
\def\sign{\operatorname{sign}}
\def\Tr{\operatorname{Tr}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\sp{\operatorname{sp}}
\def\p{\mathrm{p}}
\def\data{\mathcal{D}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\obs}[1]{#1}
\pred{\sigma} σ ^ with equations
(2.22)-(2.24) in GPML for GP prediction and you will find them equivalent except for
additional prediction noise on our part.
It is known that GPs are equivalent to deep neural networks in the infinite-width
limit. Speaking informally, this shows that "Bayesianization" and kernelizing can
generalize linear regression into something as powerful as deep learning. Goes to show
that I underestimated the increase in "power" that kernelizing a method brings.