watch: AML | Clustering

Video 15 Clustering 2021-11-22

Outline:

Cluster is a set of data objects that
- similar to one another within the same group
- dissimilar to the objects in other groups
High quality clusters:
- High intra-class similarity;
- Low inter-class similarity

典型应用:
- As a stand-alone tool to get insight into data distribution
- As a preprocessing step for other algorithms (classifier, regressor)
聚类方法的质量的相关因素：
1. the similarity measure used by the method
2. its implementation
3. its ability to discover some or all of the hidden patterns
聚类分析的考虑因素
- Partitioning criteria
  Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
- Separation of clusters
  Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class) 是否专属于1类
- Similarity measure
  Distance-based (e.g., Euclidian, road network, vector) vs. connectivity-based (e.g., density or contiguity)
- Clustering space
  Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)
聚类的挑战和要求
- Quality
  - Ability to deal with different type of attributes (不同属性)
  - Discovery of clusters with arbitrary shape (任意形状)
  - Ability to deal with noisy data (噪声)
- Interpretability and usability
- Constraint based clustering
- Scalability
  - Constraint based clustering
  - High dimensionality
  - Incremental clustering and insensitivity to input order

对于两个样本点的第i个维度: $x_i$和$y_i$，两者的相似性可以用一个距离函数表达：$\rm d(x_i, y_i)$
Similarity measure are usually different based on type of data: interval-scaled, boolean, categorical, ordinal ratio, and vector variables.
距离种类：
1. Euclidean: $\sqrt{\sum_{i=1}^{k}\left(x_{i}-y_{i}\right)^{2}}$ 两点所有属性间的距离
2. Manhattan: $\sum_{i=1}^{k}\left|x_{i}-y_{i}\right|$
3. Minkowski: $\left(\sum_{i=1}^{k}\left(\left|x_{i}-y_i\right|\right)^{2}\right)^{1 / q}$

Partitioning approaches (分区):
They create various partitions and then evaluate them by some criterion
e.g., minimizing the sum of square errors
typical methods: k-means, k-medoids, CLARANS
Hierarchical approaches (分层):
They create a hierarchical decomposition of the set of data (or objects) using some criterion
typical methods: Diana, Agnes, BIRCH, CHAMELEON
Density-based approaches:
They are based on connectivity and density functions
typical methods: DBSACN, OPTICS, DenClue
Grid-based approaches:
They are based on a multiple-level granularity structure
typical methods: STING, WaveCluster, CLIQUE
Model-based approaches:
A model is hypothesized for each of the clusters and then aim to find the best fit of that model to each other
typical methods: EM, SOM, COBWEB
Frequent pattern-based:
They are based on the analysis of frequent patterns
typical methods: p-Cluster

etc…

K nearest neighbors

supervised, efficient, clustering, classification and regression learning algorithm