watch: AML 13 | Validation

Video 17 Validation 2021-12-08

Outline:

The validation set
Model selection
Cross validation

Review of Lec12 (Lecture 12 is not covered on the final exam)

Regularization: add an overfit (complexity) penalty term that depends on model complexity, and use this penalty to estimate the out-of-sample error.

Two approaches to regularization:

constrained regularization: restrict the hypothesis set to certain types of hypotheses
unconstrained regularization: instead of minimizing $E_{in}$ alone, minimize $E_{\rm augment}(\mathbf w) = E_{in}(\mathbf w) + \underbrace{\frac{\lambda}{N} \mathbf w^T \mathbf w}_{\text{penalty term}}$

Choose a regularizer to estimate the penalty term: $E_{\rm augment}(\mathbf w) = E_{in}(\mathbf w) + \frac{\lambda}{N} \Omega(h)$

where $\Omega(h)$ is the regularizer and $\lambda$ is the regularization parameter.
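As a minimal sketch of the augmented error with weight decay, assume a one-parameter model $h(x) = wx$ with squared error (this model, the data, and the function names are illustrative assumptions, not from the lecture). Setting the derivative of $E_{\rm augment}$ to zero gives $w = \sum x_n y_n / (\sum x_n^2 + \lambda)$:

```python
def fit_weight_decay(xs, ys, lam):
    """Minimise E_aug(w) = (1/N) * sum((w*x - y)^2) + (lam/N) * w^2 for h(x) = w*x."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def augmented_error(w, xs, ys, lam):
    """E_in(w) plus the weight-decay penalty (lam / N) * w^2."""
    n = len(xs)
    e_in = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    return e_in + lam / n * w * w


xs = [-1.0, -0.5, 0.0, 0.5, 1.0]   # made-up inputs
ys = [-2.1, -0.9, 0.2, 1.1, 1.9]   # made-up noisy targets, roughly y = 2x

for lam in (0.0, 1.0, 10.0):
    w = fit_weight_decay(xs, ys, lam)
    print(f"lambda={lam:5.1f}  w={w:.3f}  E_aug={augmented_error(w, xs, ys, lam):.4f}")
```

A larger $\lambda$ pulls the weight toward zero, which is the sense in which $\lambda$ controls how much regularization is applied.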

$\Omega(h)$ is chosen heuristically; weight decay is the usual choice, which favors a smooth, simple $h$.

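$\lambda$ determines how much regularization is applied. With the right $\lambda$, the unknown target function can be approximated well. Finding a suitable $\lambda$ is one of the jobs of validation.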

Validation vs Regularization

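During learning, $E_{out}(h)$ is unknown (because the target function is unknown), but it equals $E_{in}(h)$ plus an overfit penalty. $E_{in}$ is known (the error between the predictions and the true values on the training samples), so what remains is the overfit penalty. There are therefore two ways to get at $E_{out}$: regularization estimates the overfit penalty, whereas validation estimates $E_{out}$ directly.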

$$ \begin{aligned} \rm Regularization: E_{out}(h) = E_{in}(h) + \underbrace{\text{overfit penalty}}_{\mathclap{\text{regularization estimates this quantity}}} \\ \\ \rm Validation: \underbrace{E_{out}(h)}_{\mathclap{\text{validation estimates this quantity}}} = E_{in}(h) + \text{overfit penalty} \end{aligned} $$

Analyzing the estimate

An out-of-sample point is a point that was not used during training.

The error on an out-of-sample point $(\mathbf x,y)$ is $\mathbf e(h(\mathbf x),y)$. The error function takes different forms depending on the problem:
$$ \begin{aligned} \text{Regression, squared error:} & (h(\mathbf x)-y)^2 \\ \text{Classification, binary error:} & [\![ h(\mathbf x)\neq y]\!] \end{aligned} $$
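- The expected error of $h$ over the out-of-sample distribution is $E_{out}(h)$: $\mathbb E[\mathbf e(h(\mathbf x),y)] = E_{out}(h)$
- The variance of the error of $h$ over the out-of-sample distribution is $\sigma^2$: $\operatorname{var}[\mathbf e(h(\mathbf x),y)] = \sigma^2$
From one point to a set of points: set aside K independently drawn points that are not used for training to form a validation set $(\mathbf x_1,y_1), \cdots, (\mathbf x_K, y_K)$. The error on the validation set is $E_{\text{val}}(h) = \frac{1}{K} \sum_{k=1}^K \mathbf e(h(\mathbf x_k), y_k)$
- Expectation of the validation error over validation sets: $\mathbb E[E_{\text{val}}(h)] = \frac{1}{K} \sum_{k=1}^K \mathbb E[\mathbf e(h(\mathbf x_k), y_k)] = E_{out}(h)$ (move the expectation inside the sum; each term is $E_{out}$)
- Variance of the validation error over validation sets: $\operatorname{var} [E_{\text{val}}(h)] = \frac{1}{K^2} \sum_{k=1}^K \operatorname{var}[\mathbf e(h(\mathbf x_k), y_k)] = \frac{\sigma^2}{K}$ (the points are independent, so every off-diagonal entry of the covariance matrix is zero)
The validation error equals $E_{out}$ plus a deviation of order $\frac{1}{\sqrt{K}}$ (the standard deviation):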
$$ E_{\text{val}}(h) = E_{\text{out}}(h) \pm O(\frac{1}{\sqrt{K}}) $$
Increasing the number of validation points K shrinks this deviation, so the validation error gets closer to $E_{out}$.
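The $\frac{\sigma^2}{K}$ variance can be checked with a small simulation. This is only a sketch under assumed conditions: a fixed hypothesis with binary error that misclassifies each point independently with probability 0.3 (so $E_{out}(h) = 0.3$ and $\sigma^2 = 0.3 \times 0.7$). The probability, trial count, and function names are illustrative choices.

```python
import random
import statistics


def validation_error(p_err, k, rng):
    """Binary-error E_val on k independent points, each misclassified with probability p_err."""
    return sum(rng.random() < p_err for _ in range(k)) / k


def simulate(p_err, k, trials=2000, seed=0):
    """Mean and variance of E_val over many independently drawn validation sets."""
    rng = random.Random(seed)
    e_vals = [validation_error(p_err, k, rng) for _ in range(trials)]
    return statistics.mean(e_vals), statistics.pvariance(e_vals)


p_err = 0.3                      # assumed E_out(h) of the fixed hypothesis
sigma2 = p_err * (1 - p_err)     # variance of the binary error on a single point
for k in (10, 100, 1000):
    mean, var = simulate(p_err, k)
    print(f"K={k:4d}  mean E_val={mean:.3f}  var={var:.5f}  sigma^2/K={sigma2 / k:.5f}")
```

The mean should stay near 0.3 for every K, while the empirical variance should track $\sigma^2 / K$.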
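For a data set $\mathcal D = (\mathbf x_1, y_1), \cdots, (\mathbf x_N, y_N)$
- Pick K points as the validation set: $\mathcal D_{\rm val}$
- The remaining N-K points form the training set: $\mathcal D_{\rm train}$
- For the deviation term $O(\frac{1}{\sqrt{K}})$: a small K leaves $E_{val}$ far from $E_{out}$, while a large K leaves too few training points, so the learned hypothesis itself gets worse. K is therefore a tradeoff.
  
  Previously the whole data set was used for training, giving $g$. Now only part of the data (a reduced data set) is used, giving $g^-$, whose $E_{out}$ is typically larger than that of $g$. The error of $g^-$ on the validation set, $E_{val}(g^-)$, then serves as an approximation of $E_{out}$. If K is very large, $E_{val}(g^-)$ tracks $E_{out}(g^-)$ closely, but $g^-$ is trained on so few points that $E_{out}(g^-)$ is a poor stand-in for $E_{out}(g)$. Rule of thumb: $K= \frac{N}{5}$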

The validation set is not the test set

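$E_{val}(g^-)$ is not $E_{out}$ either. The test set plays no part in training (unbiased), whereas the validation set helps choose hyperparameters during training and thereby influences the learning process (optimistic bias).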

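For example, take two hypotheses $h_1$ and $h_2$ whose true out-of-sample errors are both 0.5: $E_{out}(h_1) = E_{out}(h_2) = 0.5$, although this is unknown to us. Their errors on the validation set are $\mathbf e_1,\ \mathbf e_2$, and we keep the one with the smaller error: $\mathbf e = \min(\mathbf e_1, \mathbf e_2)$. Each of $\mathbf e_1$ and $\mathbf e_2$ is an unbiased estimate of 0.5, but the minimum of the two has expectation $\mathbb E(\mathbf e) < 0.5$, because picking the smaller value favors whichever estimate happened to come out low. The error reported by validation is therefore optimistically biased.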
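The two-hypothesis example can be simulated directly. The sketch below assumes binary error, K = 10 validation points, and independent validation errors for the two hypotheses; these settings and the function names are illustrative assumptions.

```python
import random


def e_val(k, rng, p_err=0.5):
    """Validation error of a hypothesis whose true E_out is p_err, measured on k points."""
    return sum(rng.random() < p_err for _ in range(k)) / k


def average_errors(k=10, trials=5000, seed=0):
    """Average of e1 alone, and of min(e1, e2), over many validation sets."""
    rng = random.Random(seed)
    single_total, min_total = 0.0, 0.0
    for _ in range(trials):
        e1, e2 = e_val(k, rng), e_val(k, rng)
        single_total += e1
        min_total += min(e1, e2)
    return single_total / trials, min_total / trials


mean_single, mean_min = average_errors()
print(f"E[e1]          ~ {mean_single:.3f}   (unbiased, close to 0.5)")
print(f"E[min(e1, e2)] ~ {mean_min:.3f}   (optimistic, below 0.5)")
```

Neither hypothesis is actually better than 0.5; the gap comes only from selecting on the same data that reports the error.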

Model selection

Suppose we want to solve a classification problem and have M hypothesis sets: $\mathcal H_1,\cdots, \mathcal H_M$ (for example, an SVM kernel can be linear, polynomial, or RBF; which one is best?).

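Using the (reduced) training set, pick the best hypothesis from each hypothesis set (the finalist models). Then compute $E_{val}$ for each finalist on the validation set. From these M values of $E_{val}$, pick the smallest one and the corresponding hypothesis set $\mathcal H_{m^*}$. Finally, use the whole data set to find the best hypothesis $g_{m^*}$ within that hypothesis set.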
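The procedure can be sketched as follows, assuming two hypothesis sets (constant and linear), squared error, and synthetic data from a made-up target $y = 2x + 1$ with noise. The helper names and data are hypothetical; only the train / validate / retrain flow follows the notes.

```python
import random


def fit_constant(points):
    """Best constant under squared error: the mean of the training targets."""
    mean_y = sum(y for _, y in points) / len(points)
    return lambda x: mean_y


def fit_linear(points):
    """Least-squares line y = a + b*x, using the closed form for one input variable."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)


def squared_error(g, points):
    return sum((g(x) - y) ** 2 for x, y in points) / len(points)


def select_model(data, models, k):
    """Train each finalist g_m^- on N-K points, pick the smallest E_val, retrain on all N."""
    d_val, d_train = data[:k], data[k:]
    e_val = {name: squared_error(fit(d_train), d_val) for name, fit in models.items()}
    best = min(e_val, key=e_val.get)
    return best, models[best](data), e_val


rng = random.Random(0)
data = []
for _ in range(50):                       # N = 50 synthetic points
    x = rng.uniform(-1, 1)
    data.append((x, 2 * x + 1 + rng.gauss(0, 0.1)))

models = {"constant": fit_constant, "linear": fit_linear}
best, g_final, e_val = select_model(data, models, k=10)   # K = N / 5
print(best, e_val)
```

Note that the returned hypothesis is trained on all N points, while the reported $E_{val}$ values belong to the reduced-data finalists.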

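When $\mathcal D_{\rm val}$ is used to select the best hypothesis set $\mathcal H_{m^*}$, $E_{\rm val}(g_{m^*}^-)$ becomes a biased estimate of $E_{out}(g_{m^*}^-)$: the same validation data that reports the error was also used to choose the winner, so the reported error is optimistic.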

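The relationship between the size of the validation set and the expected bias is shown in the figure below:

The more points K in the validation set, the fewer samples remain for training and the worse $E_{out}$ becomes; at the same time $O(\frac{1}{\sqrt{K}})$ shrinks, so $E_{\rm val}$ gets closer to $E_{\rm out}$.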

How much bias

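For M hypothesis sets $\mathcal H_1, \cdots ,\mathcal H_M$, we have picked M finalist models $H_{\rm val} = \{ g_1^-, g_2^-,\cdots, g_M^- \}$. The validation set $\mathcal D_{\rm val}$ is then used to "train" on them, that is, to find the best minus hypothesis $g_{m^\star}^-$ among them (the one with the smallest $E_{\rm val}$).

Treating this selection as a "training" process over M hypotheses, Hoeffding's inequality holds for $g_{m^\star}^-$: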

$$ E_{out} (g_{m^\star}^-) \leq E_{val}(g_{m^\star}^-) + O \left( \sqrt{\frac{\ln M}{K}} \right) $$
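One way to see where the $\sqrt{\ln M / K}$ term comes from is a union bound over the M finalists. The sketch below assumes a pointwise error bounded in $[0,1]$ (for example, binary error) and introduces a confidence parameter $\delta$ that does not appear in the notes:

$$ \begin{aligned} \mathbb P\left[\, |E_{val}(g_m^-) - E_{out}(g_m^-)| > \epsilon \,\right] &\leq 2e^{-2\epsilon^2 K} && \text{(Hoeffding, for each fixed } m\text{)} \\ \mathbb P\left[\, \exists\, m:\ |E_{val}(g_m^-) - E_{out}(g_m^-)| > \epsilon \,\right] &\leq 2Me^{-2\epsilon^2 K} && \text{(union bound over } M \text{ finalists)} \\ 2Me^{-2\epsilon^2 K} = \delta \ &\Rightarrow\ \epsilon = \sqrt{\frac{1}{2K}\ln\frac{2M}{\delta}} \end{aligned} $$

So with probability at least $1-\delta$, $E_{out}(g_{m^\star}^-) \leq E_{val}(g_{m^\star}^-) + \sqrt{\frac{1}{2K}\ln\frac{2M}{\delta}}$, which is $O\left(\sqrt{\frac{\ln M}{K}}\right)$ for fixed $\delta$.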

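With infinitely many hypothesis sets (for example, infinitely many values of the regularization parameter, since $\lambda$ is continuous), the bound $O \left( \sqrt{\frac{\ln M}{K}} \right)$ is no longer useful.

To bound M, bring in the VC dimension as before. For example, what matters is not how many values the regularization parameter $\lambda$ can take, but how many parameters (degrees of freedom) we have. There is only one parameter, $\lambda$, so the VC dimension is 1.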

Data contamination

How much each data set was used during training
$E_{in}$, $E_{out}$ ($E_{test}$), $E_{\rm val}$
Contamination: Optimistic (deceptive) bias in estimating Eout
- Training set: totally contaminated
- Validation set: slightly contaminated (acts as a "test", but is also used in training)
- Test set: totally 'clean' (used only for testing)

Cross validation

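Split the training set into n folds. Each time, train on n-1 folds and compute the accuracy on the remaining fold; the average of the n accuracies is the performance of that set of hyperparameters.
The test set is not used, yet performance on the test set can be estimated.
The goal is to choose the best hyperparameters; accuracy on the training set cannot be used to judge them.
As the hyperparameters vary, the CV accuracy follows approximately the same trend as the test set accuracy.
The K dilemma: $g^-$ is the best hypothesis found on the reduced training set. The smaller K is, the more data is used for training and the closer $E_{out}(g^-)$ is to $E_{out}(g)$; but by Hoeffding's inequality, $E_{\rm val}(g^-)$ needs a large K to approximate $E_{out}(g^-)$ well.
$$ E_{\rm out}(g) \underset{\mathclap{\substack{\\ \text{needs small K}}}}{\approx} E_{\rm out}(g^-) \underset{\mathclap{\substack{\\ \text{needs large K}}}}{\approx} E_{\rm val}(g^-) $$
$E_{out}$ is the ultimate goal, but only the validation error $E_{\rm val}(g^-)$ is known.
Cross validation aims to have K both small and large.
Two cross validation methods: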
1. Leave One Out
  
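  K=1: each iteration holds out 1 sample for validation and trains on the remaining N-1 samples. The training set with the n-th sample removed is $\mathcal D_n$: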
  $$ \mathcal D_n = (\mathbf x_1,y_1),\cdots,(\mathbf x_{n-1},y_{n-1}),\sout{(\mathbf x_n, y_n)},(\mathbf x_{n+1},y_{n+1}),\cdots,(\mathbf x_N, y_N) $$
  The hypothesis learned from $\mathcal D_n$ is $g_n^-$, with validation error $\mathbf e_n = E_{\rm val}(g_n^-) = \mathbf e(g_n^- (\mathbf x_n),y_n)$
  
  Compute the validation error for each held-out point and average them to get the cross validation error:
  $$ E_{CV} = \frac{1}{N} \sum_{n=1}^N \mathbf e_n $$
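  With 3 points, hold out one at a time for validation and train on the other two. For linear regression on two samples, the minimum-error linear hypothesis is the straight line through the two points.
  
  For the constant hypothesis, the minimum-error fit to two samples is the horizontal line at the mean of their $y$ values.
  
  Comparing $E_{CV}$, the constant model has the smaller cross validation error, so the constant model is chosen in the end.
  
  A data set of N samples needs N iterations, each training on N-1 samples. With a thousand samples that means a thousand training runs, which is computationally expensive.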
2. Leave More Out
  
  Split the data set into several parts. With 10 parts, $K = \frac{N}{10}$, and only 10 ($\frac{N}{K}$) iterations are needed, each training on N-K points.
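Both schemes above can be sketched with one function, assuming squared error, a constant model, and a number of folds that divides N evenly (so that averaging per-fold errors equals averaging per-point errors). The data and function names are made up for illustration.

```python
def fit_constant(points):
    """Best constant under squared error: the mean of the training targets."""
    mean_y = sum(y for _, y in points) / len(points)
    return lambda x: mean_y


def squared_error(g, points):
    return sum((g(x) - y) ** 2 for x, y in points) / len(points)


def cross_validation_error(data, fit, n_folds):
    """Average validation error over n_folds disjoint folds.

    n_folds == len(data) gives leave-one-out (K = 1);
    smaller n_folds gives leave-more-out with K = N / n_folds.
    """
    folds = [data[i::n_folds] for i in range(n_folds)]
    errors = []
    for i, d_val in enumerate(folds):
        # Train on every fold except the i-th, then validate on the i-th.
        d_train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        errors.append(squared_error(fit(d_train), d_val))
    return sum(errors) / n_folds


data = [(0, 1.0), (1, 2.0), (2, 1.0), (3, 2.0), (4, 1.0), (5, 2.0)]  # made-up points

e_loo = cross_validation_error(data, fit_constant, n_folds=len(data))  # N training runs
e_3fold = cross_validation_error(data, fit_constant, n_folds=3)        # 3 training runs
print(e_loo, e_3fold)
```

Leave-one-out costs N training runs, while the 3-fold version costs only 3, which is the saving described under Leave More Out.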

Cross validation in action

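Digit classification task: the 2 features (symmetry and average intensity) are nonlinearly transformed into a 20-dimensional space using polynomials up to degree 5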

$$ \left(1, x_{1}, x_{2}\right) \rightarrow\left(1, x_{1}, x_{2}, x_{1}^{2}, x_{1} x_{2}, x_{2}^{2}, x_{1}^{3}, x_{1}^{2} x_{2}, \ldots, x_{1}^{5}, x_{1}^{4} x_{2}, x_{1}^{3} x_{2}^{2}, x_{1}^{2} x_{2}^{3}, x_{1} x_{2}^{4}, x_2^{5}\right) $$
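The transform can be written as a short loop over total degree. This is a sketch; the function name and the sample inputs are illustrative, and the ordering follows the formula above (by total degree, with the power of $x_2$ increasing within each degree).

```python
def polynomial_features(x1, x2, max_degree=5):
    """Return all monomials x1^a * x2^b with 0 <= a + b <= max_degree."""
    features = []
    for degree in range(max_degree + 1):
        for b in range(degree + 1):
            features.append(x1 ** (degree - b) * x2 ** b)
    return features


z = polynomial_features(2.0, 3.0)
print(len(z))   # 21 entries: the constant 1 plus 20 non-constant features
print(z[:6])    # 1, x1, x2, x1^2, x1*x2, x2^2
```

Truncating the returned list after the first few entries gives the smaller models that cross validation compares.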

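The more features are used, the more complex the model and the smaller $E_{in}$ becomes (after many iterations). $E_{out}$ first decreases and then increases, which is overfitting. $E_{CV}$ follows the same trend as $E_{out}$. Since $E_{out}$ is unknown and $E_{CV}$ approximates it, $E_{CV}$ can be used to decide how many features to use. The minima of $E_{CV}$ occur at 5 and 7 features, so the model with 6 features is a reasonable choice.

Without validation, using the 20-feature model directly gives a complex model that overfits (fits the noise), with $E_{in}$ equal to zero. With validation, only 6 features are used; the model is simpler and $E_{out}$ is smaller.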

Exercise

Given three two-dimensional data examples $x_1 = (-1,1)$, $x_2=(0,2)$, and $x_3=(1,1)$, perform the leave-one-out cross validation for a linear fit using these data examples. What is $E_{CV}$?

$$ E_{CV} = \frac{1}{N} \sum_{n=1}^N \varepsilon_n $$

where $\varepsilon_n = (y_n - g(x_n))^2$

Note: The line passing through two-dimensional data points $(x_1, y_1)$ and $(x_2,y_2)$ can be obtained as follows: $y-y_1 = \frac{y_2 - y_1}{x_2-x_1} \times (x-x_1)$

GA answer:

Hold out $x_1$ for validation and train on $x_2, x_3$:

$g:\ y-2 = \frac{1-2}{1-0}(x-0) \Rightarrow y=-x+2$

$\varepsilon_1 = (1-g(-1))^2 = (1-3)^2 = 4$
Hold out $x_2$ for validation and train on $x_1, x_3$:

$g:\ y-1 = \frac{1-1}{1+1}(x+1) \Rightarrow y=1$

$\varepsilon_2 = (2-g(0))^2 = (2-1)^2 = 1$
Hold out $x_3$ for validation and train on $x_1, x_2$:

$g:\ y-1 = \frac{2-1}{0+1}(x+1) \Rightarrow y=x+2$

$\varepsilon_3 = (1-g(1))^2 = (1-3)^2 = 4$

$E_{CV} = \frac{1}{3}(4+1+4) = 3$
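The arithmetic above can be checked with a few lines of Python. This is a sketch that hard-codes the three points from the exercise and uses the two-point line formula from the note; the helper name `line_through` is made up.

```python
def line_through(p, q):
    """Line through two points, using y - y1 = (y2 - y1) / (x2 - x1) * (x - x1)."""
    (x1, y1), (x2, y2) = p, q
    slope = (y2 - y1) / (x2 - x1)
    return lambda x: y1 + slope * (x - x1)


points = [(-1, 1), (0, 2), (1, 1)]

errors = []
for n, (x_n, y_n) in enumerate(points):
    train = points[:n] + points[n + 1:]   # leave the n-th point out
    g = line_through(*train)              # fit on the remaining two points
    errors.append((y_n - g(x_n)) ** 2)    # squared error on the held-out point

e_cv = sum(errors) / len(errors)
print(errors, e_cv)   # [4.0, 1.0, 4.0] 3.0
```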
