Video 9 - Error and noise 10-18-2021
Outline
- Error measures
- Noisy targets
- Preamble to the theory
Review of Lec 3
Linear Models
- Use the "signal" to both classify and regress
- Signal:
$$ \text{signal} = \sum_{i=0}^d w_i x_i = \mathbf{w}^T \mathbf{x} $$
- Linear classification: $h(\mathbf x) = \operatorname{sign}(\mathbf{w}^T \mathbf{x})$ (the signal is passed through a threshold; learned iteratively by PLA or Pocket)
- Linear regression: $h(\mathbf x) = \mathbf{w}^T \mathbf{x}$ (the signal is used directly without a threshold; one-shot learning via the pseudo-inverse)
$$ \mathbf w = (\mathrm X^T \mathrm X)^{-1} \mathrm X^T \mathbf y $$
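As a concrete illustration, here is a minimal NumPy sketch of both models; the dataset, dimensions, and weights are invented for illustration, not taken from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: N points in d dimensions, with x_0 = 1 prepended for the bias term.
N, d = 100, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # shape (N, d+1)
w_true = np.array([0.5, -1.0, 2.0])                         # hypothetical target weights
y = X @ w_true + 0.1 * rng.normal(size=N)                   # noisy real-valued labels

# Linear regression, one-shot: w = (X^T X)^{-1} X^T y
# (np.linalg.solve is numerically safer than forming the inverse explicitly).
w = np.linalg.solve(X.T @ X, X.T @ y)

# Linear classification: pass the same signal through a threshold.
h = np.sign(X @ w)
```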
Error measures
- Quantify the dissimilarity between the output of hypothesis $h$ and the output of the unknown target function $f$.
- Almost all error measures are pointwise
Evaluate $h$ and $f$ on individual points $\mathbf x$ using a pointwise error $e(h(\mathbf x), f(\mathbf x))$:
Binary error: $e(h(\mathbf x), f(\mathbf x)) = [\![ h(\mathbf x) \neq f(\mathbf x) ]\!]$ (1 when they disagree, 0 when they agree) (classification)
Squared error: $e(h(\mathbf x), f(\mathbf x)) = (h(\mathbf x) - f(\mathbf x))^2$ (the actual distance between the outputs) (regression)
- In-sample error: the average pointwise error of $h$ against $f$ over the sample points
$$ E_{in}(h) = \frac{1}{N} \sum_{n=1}^N e(h(\mathbf x_n), f(\mathbf x_n)) $$
Out-of-sample error: the expected pointwise error of $h$ against $f$ over the whole input space
$$ E_{out}(h) = \mathbb E_{\mathbf x} \left[ e(h(\mathbf x), f(\mathbf x)) \right] $$
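The pointwise errors and $E_{in}$ translate directly into code; a minimal sketch, where the arrays `h_out`/`f_out` are invented placeholders:

```python
import numpy as np

def binary_error(h_x, f_x):
    # [[h(x) != f(x)]]: 1 where the outputs disagree, 0 where they agree
    return (h_x != f_x).astype(float)

def squared_error(h_x, f_x):
    # (h(x) - f(x))^2: squared distance between the two outputs
    return (h_x - f_x) ** 2

def in_sample_error(h_x, f_x, e):
    # E_in(h) = (1/N) * sum_n e(h(x_n), f(x_n))
    return float(np.mean(e(h_x, f_x)))

h_out = np.array([+1, -1, +1, +1])
f_out = np.array([+1, +1, +1, -1])
print(in_sample_error(h_out, f_out, binary_error))   # 0.5: half the points disagree
```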
- How to choose the error measure
False accept and false reject
Confusion matrix:
$$ \begin{array}{c|cc} h \backslash f\ (\text{unknown}) & +1 & -1 \\ \hline +1 & \text{no error} & \text{false accept} \\ -1 & \text{false reject} & \text{no error} \end{array} $$
The right error measure depends on the application, since different applications assign different penalties to the two error types: a supermarket discount system mostly penalizes false rejects (annoyed customers), while a security system mostly penalizes false accepts (intruders let in).
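A minimal sketch of an application-weighted error measure; the cost values are hypothetical, just to show how the two error types can be penalized differently:

```python
import numpy as np

def weighted_error(h_out, f_out, false_accept_cost=1.0, false_reject_cost=1.0):
    false_accept = (h_out == +1) & (f_out == -1)   # h says +1, truth is -1
    false_reject = (h_out == -1) & (f_out == +1)   # h says -1, truth is +1
    return float(np.mean(false_accept_cost * false_accept
                         + false_reject_cost * false_reject))

h_out = np.array([+1, -1, +1, -1])
f_out = np.array([-1, -1, +1, +1])
# A security-style setting: charge false accepts 1000x more than false rejects.
print(weighted_error(h_out, f_out, false_accept_cost=1000.0))  # 250.25
```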
Noisy targets
- A noisy target decomposes into a deterministic part $f(\mathbf x) = \mathbb E[y|\mathbf x]$ plus noise $y - f(\mathbf x)$.
- Sometimes the same input corresponds to different labels, so the underlying relation is not a "function" $y = f(\mathbf x)$ but a distribution $P(y|\mathbf x)$.
$\mathbf x$ is drawn from the space $\mathcal X$ according to some unknown distribution $P(\mathbf x)$, and the label $y$ follows the distribution $P(y|\mathbf x)$, so each example $(\mathbf x, y)$ is generated by the joint distribution $P(\mathbf x)\, P(y|\mathbf x) = P(\mathbf x, y)$.
A deterministic target is the special case of a noisy target where $P(y|\mathbf x)$ is zero except at $y = f(\mathbf x)$; the noise is then zero and $y = f(\mathbf x)$.
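A minimal sketch of sampling from a noisy target, first $\mathbf x \sim P(\mathbf x)$ and then $y \sim P(y|\mathbf x)$; the target and the 10% label-flip probability are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, flip_prob=0.1):
    x = rng.uniform(-1, 1, size=(n, 2))      # x drawn from an (assumed) P(x)
    f = np.sign(x[:, 0] + x[:, 1])           # noiseless labels: the deterministic part
    flip = rng.random(n) < flip_prob         # noise: label disagrees with the deterministic part
    y = np.where(flip, -f, f)                # y drawn from P(y|x)
    return x, y

x, y = sample(1000)
# flip_prob = 0 recovers the deterministic special case y = f(x).
```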
Preamble to the theory
- Learning is feasible in a probabilistic sense: $E_{out}(g) \approx E_{in}(g)$.
- We need $g \approx f$, which means $E_{out}(g) \approx 0$. This splits into two conditions:
- $E_{out}(g) \approx E_{in}(g)$ (guaranteed probabilistically by the Hoeffding inequality)
- $E_{in}(g) \approx 0$ (achieved by the learning algorithm: PLA, Pocket, linear classification/regression)
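To make the Hoeffding guarantee concrete, here is a minimal Monte Carlo sketch checking that $P[\,|E_{in} - E_{out}| > \epsilon\,] \le 2e^{-2\epsilon^2 N}$ for a single hypothesis; the out-of-sample error `mu`, sample size `N`, and tolerance `eps` are invented values for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, N, eps, trials = 0.3, 500, 0.05, 10_000   # mu plays the role of E_out

# Each trial: draw N pointwise binary errors (1 where g errs), compute E_in,
# and record how far it deviates from the true out-of-sample error mu.
e_in = rng.random((trials, N)) < mu
deviations = np.abs(e_in.mean(axis=1) - mu)

empirical = np.mean(deviations > eps)
bound = 2 * np.exp(-2 * eps**2 * N)
print(f"empirical: {empirical:.4f}  Hoeffding bound: {bound:.4f}")
# The empirical frequency stays below the (loose) Hoeffding bound.
```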