watch: AML | Classification

Video 16 Classification 2021-12-06

supervised learning practical diagram

K-means 对三螺线数据训练，分类结果不好，因为它是用于聚类的

Fuzzy C Means 是另一种改进的聚类算法，分类效果也不好

Decision Tree 分类器对三螺线的训练样本分类效果很好

K nearest neighbor 是非常强大的 lazy learner

Multilayer neural network 强大

分类器预测类别标签有多准确？选择哪个分类器（模型）更合适？
Remember:
1. The data have to be used both for training and testing.
2. More training data → better generalization.
3. More test data → better estimation for the classification error probability.
4. Do not evaluate performance on training data → the conclusion would be optimistically biased. (否则会偏向训练集)
Methods for estimating a classifier’s accuracy:
1. Holdout method (reserve 2/3 for training and 1/3 for testing)
2. random subsampling (iterative holdout)
3. Cross-validation (partition the data into k folders
4. Stratified oversampling and undersampling (保持类别比例)
5. Bootstrap (sampling with replacement)
Comparing classifiers:
1. Confidence intervals
2. Cost-benefit analysis and ROC Curves
Once evaluation is finished, all the available data can be used to train the final classifier. (知道了最佳参数后，再用全部的数据训练最佳假设)

Given data is randomly partitioned into two independent sets
比如：2/3 作为Training set 去构建模型，1/3作为Test set 去估计准确率
Random sampling:
It is a variation of holdout method.
Repeat the method k times, accuracy is estimated as average of obtained accuracies.

Evaluation measures

Measure	Formula
Accuracy, recognition rate	(TP+TN)/all
Error rate, misclassification rate	(FP+FN)/all
Sensitivity, true positive rate, recall	TP/(TP+FN)
Specificity, true negative rate	TN/(TN+FP)
Precision	TP/(TP+FP)
F, F1,F-score, Harmonic mean of precision and recall	$\frac{2 \times \rm Precision \times recall}{\rm Precision + recall}$
$F_\beta$ where $\beta$ is a none negative real number	$\frac{(1+\beta)^2 \times \rm Precision \times recall}{\beta^2 \times \rm Precision +recall}$

Accuracy/recognition rate: the proportion of the total number of predictions that were correct.
Error rate: 1- accuracy
Precision: what % of tuples that the classifier labeled as positive are actually positive (查准率)
Recall: what % of positive tuples did the classifier label as positive? (查全率)

Represent a relation between sensitivity and specificity for a given classifier.
- The area under the curve is the measure of the accuracy of the classifier.
- The perfect accuracy is equal to one.
- The closer to red line, the less accurate model
如果模型的准确率显著低于红线，是不能接受的(有错)。ROC 曲线上升越快，越接近1，越好 (Learner 1最好)
It can be used for visual comparison of classification models.
ROC space:
Two dimensional:
- FP rate on X axis → FPR=FP/(TN+FP)
- TP rate on Y axis → TPR=TP/(TP+FN) （灵敏度）
- FPR=1-SPC （= 1- 特异度）

Model selection criteria is always based on a compromise between the complexity of the model and its prediction accuracy on the training data

Given a dataset, basically we are looking for the simplest model that attains highest accuracy.

合奏系统

每次抽取不同的样本(子集)，训练多个模型，然后聚合（aggregation）起来，误差可能更小

也可以训练不同种类的分类器：感知机，DT,kNN,SVM…

Ensemble 系统有两Key Component: 分类算法（注意训练集样本的多样性）和融合方法（简单：多数票）

Ensemble 适合用于很大容量数据，也可以用于很小容量数据。

Large volume data:
Small size data:

数据很少，用复杂的模型可能导致过拟合，所以第一次使用比较弱的感知机，有3个点分错了，增加它们的权重，使它们更可能被抽取到作为下一次的训练样本。第二次分类后，再强调分错的2个蓝点。第三次分类，就只有1个红点分错了。Ensemble 使得模型不复杂，更准确

其他优势：处理复杂的决策边界，非线性情况，实时…

Table of contents