A Derivation of the MSE bias-variance decomposition

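For brevity, we write \(f = f(\mathbf{x})\) for the true function and \(\hat{f} = \hat{f}(\mathbf{x})\) for its estimate fitted on a training set \(\mathcal{T}\). Throughout, we assume \(\mathrm{Y} = f + \epsilon\), where the noise \(\epsilon\) has zero mean and variance \(\sigma^2\) and is independent of \(\hat{f}\).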

\[\begin{aligned} \label{eq:bias-var} \mathbb{E}_\mathcal{T}[(\mathrm{Y} - \hat{f})^2] & = \mathbb{E}_\mathcal{T} [\mathrm{Y}^2 - 2\mathrm{Y}\hat{f} + \hat{f}^2] \\ \nonumber & = \mathbb{E}_\mathcal{T}[\mathrm{Y}^2] + \mathbb{E}_\mathcal{T}[\hat{f}^2] - 2\mathbb{E}_\mathcal{T}[\mathrm{Y}\hat{f}].\end{aligned}\]

Recalling the following properties of the variance and expectation: \[\begin{aligned} & Var(\mathrm{X}) = \mathbb{E}(\mathrm{X}^2) - \mathbb{E}^2(\mathrm{X}),& \\ & \mathbb{E}(\mathrm{X} \mathrm{Y}) = \mathbb{E}(\mathrm{X})\mathbb{E}(\mathrm{Y}) + Cov(\mathrm{X},\mathrm{Y}),& \\ & Var(\mathrm{X} + \mathrm{Y}) = Var(\mathrm{X}) + Var(\mathrm{Y}) + 2Cov(\mathrm{X},\mathrm{Y}),& \\ & Var(\mathrm{X} - \mathrm{Y}) = Var(\mathrm{X}) + Var(\mathrm{Y}) - 2Cov(\mathrm{X},\mathrm{Y}),& \\ & Cov(\mathrm{X},\mathrm{Y}) = 0 \text{ if $\mathrm{X}$ and $\mathrm{Y}$ are independent},& \end{aligned}\] and applying the first one to \(\mathbb{E}_\mathcal{T}[\mathrm{Y}^2]\) and \(\mathbb{E}_\mathcal{T}[\hat{f}^2]\) in the expansion above, with \(\mathrm{Y} = f + \epsilon\) in the cross term, we get:

\[\begin{equation} \begin{aligned} \mathbb{E}_\mathcal{T}[(\mathrm{Y} - \hat{f})^2] = Var(\mathrm{Y}) + \mathbb{E}^2_\mathcal{T}(\mathrm{Y}) + Var(\hat{f}) + \mathbb{E}^2_\mathcal{T}(\hat{f}) - 2\mathbb{E}_\mathcal{T}[(f+\epsilon) \hat{f}]. \end{aligned} \tag{A.1} \end{equation}\]

Expanding the cross term: \[\begin{aligned} 2\mathbb{E}_\mathcal{T}[(f+\epsilon) \hat{f}] & = 2\mathbb{E}_\mathcal{T}(f\hat{f}) + 2\mathbb{E}_\mathcal{T}(\hat{f}\epsilon) \\ & = 2\mathbb{E}_\mathcal{T}(f\hat{f}) + 2[\mathbb{E}_\mathcal{T}(\hat{f})\underbrace{\mathbb{E}_\mathcal{T}(\epsilon)}_{=0} + \underbrace{Cov(\hat{f},\epsilon)}_{=0}] \\ & = 2[\mathbb{E}_\mathcal{T}(f)\mathbb{E}_\mathcal{T}(\hat{f}) + Cov(f,\hat{f})],\end{aligned}\] and substituting \(Var(\mathrm{Y}) = \underbrace{Var(f)}_{=0} + Var(\epsilon) + \underbrace{2Cov(f,\epsilon)}_{=0} = \sigma^2\) in (A.1), we get:

\[\begin{equation} \begin{aligned} \mathbb{E}_\mathcal{T}[(\mathrm{Y} - \hat{f})^2] & = Var(\mathrm{Y}) + \mathbb{E}^2_\mathcal{T}(\mathrm{Y}) + Var(\hat{f}) + \mathbb{E}^2_\mathcal{T}(\hat{f}) - 2[\mathbb{E}_\mathcal{T}(f)\mathbb{E}_\mathcal{T}(\hat{f}) + \underbrace{Cov(f,\hat{f})}_{=0}] \\ \nonumber & = Var(\hat{f}) + \sigma^2 + \mathbb{E}^2_\mathcal{T}(\mathrm{Y}) + \mathbb{E}^2_\mathcal{T}(\hat{f}) - 2\mathbb{E}(f)\mathbb{E}_\mathcal{T}(\hat{f}) \end{aligned} \tag{A.2} \end{equation}\]

Since \(\mathbb{E}_\mathcal{T}(\epsilon) = 0\), we have \(\mathbb{E}_\mathcal{T}^2(\mathrm{Y}) = \mathbb{E}_\mathcal{T}^2(f + \epsilon) = \mathbb{E}^2(f)\). Substituting this into (A.2), we finally obtain:

\[\begin{aligned} \label{eq:bias-var4} \mathbb{E}_\mathcal{T}[(\mathrm{Y} - \hat{f})^2] & = Var(\hat{f}) + \sigma^2 + \mathbb{E}^2(f) + \mathbb{E}_\mathcal{T}^2(\hat{f}) - 2\mathbb{E}(f)\mathbb{E}_\mathcal{T}(\hat{f}) \\ \nonumber & = \underbrace{Var(\hat{f})}_{\text{Variance}} + \underbrace{[\mathbb{E}(f) - \mathbb{E}_\mathcal{T}(\hat{f})]^2}_{\text{Bias}^2} + \underbrace{\sigma^2}_{\text{Noise}}\end{aligned}\]
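The decomposition can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the derivation itself: it assumes a hypothetical true function \(f(x) = \sin(2\pi x)\), Gaussian noise with standard deviation `sigma`, a straight-line least-squares estimator as \(\hat{f}\), and a fixed query point `x0`. The function name `bias_variance_check` and all parameter values are illustrative choices. Many training sets \(\mathcal{T}\) are drawn, \(\hat{f}(x_0)\) is fitted on each, and the empirical mean squared error is compared with the sum of the three terms.

```python
import numpy as np


def bias_variance_check(n_train_sets=20000, n_train=20, sigma=0.5, x0=0.3, seed=0):
    """Monte Carlo estimate of the MSE and of its three components at x0."""
    rng = np.random.default_rng(seed)

    # Hypothetical true function (an assumption made for this illustration)
    def f(x):
        return np.sin(2 * np.pi * x)

    # One training set per row: inputs uniform on [0, 1], targets f(x) + noise
    X = rng.uniform(0.0, 1.0, size=(n_train_sets, n_train))
    Y = f(X) + rng.normal(0.0, sigma, size=X.shape)

    # Fit a straight line by least squares on every training set
    x_mean = X.mean(axis=1, keepdims=True)
    y_mean = Y.mean(axis=1, keepdims=True)
    slope = ((X - x_mean) * (Y - y_mean)).sum(axis=1) / ((X - x_mean) ** 2).sum(axis=1)
    intercept = y_mean[:, 0] - slope * x_mean[:, 0]
    f_hat = intercept + slope * x0  # prediction at x0, one value per training set

    # Fresh response at x0, drawn independently of the training sets
    y0 = f(x0) + rng.normal(0.0, sigma, size=n_train_sets)

    mse = np.mean((y0 - f_hat) ** 2)         # left-hand side
    variance = np.var(f_hat)                 # Var(f_hat)
    bias_sq = (f(x0) - np.mean(f_hat)) ** 2  # [E(f) - E_T(f_hat)]^2
    return mse, variance, bias_sq, sigma ** 2


if __name__ == "__main__":
    mse, variance, bias_sq, noise = bias_variance_check()
    print("MSE:", mse)
    print("Variance + Bias^2 + noise:", variance + bias_sq + noise)
```

The two printed values should agree up to Monte Carlo error, which shrinks as `n_train_sets` grows. A linear fit to a sine curve is deliberately misspecified here so that the bias term is clearly nonzero.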