Table of contents
Video 15 Clustering 2021-11-22
Outline:
- Clustering
- K-means
- K nearest neighbor
Cluster
- Cluster is a set of data objects that
- similar to one another within the same group
- dissimilar to the objects in other groups
- High quality clusters:
- High intra-class similarity;
- Low inter-class similarity
Cluster analysis
- an unsupervised learning (unlabeled)
- 给定一组数据对象
- 找到数据对象之间的相似性
- 把相似的数据对象归到 clusters
-
典型应用:
- As a stand-alone tool to get insight into data distribution
- As a preprocessing step for other algorithms (classifier, regressor)
-
聚类方法的质量的相关因素:
- the similarity measure used by the method
- its implementation
- its ability to discover some or all of the hidden patterns
-
聚类分析的考虑因素
- Partitioning criteria Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
- Separation of clusters Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class) 是否专属于1类
- Similarity measure Distance-based (e.g., Euclidian, road network, vector) vs. connectivity-based (e.g., density or contiguity)
- Clustering space Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)
-
聚类的挑战和要求
- Quality
- Ability to deal with different type of attributes (不同属性)
- Discovery of clusters with arbitrary shape (任意形状)
- Ability to deal with noisy data (噪声)
- Interpretability and usability
- Constraint based clustering
- Scalability
- Constraint based clustering
- High dimensionality
- Incremental clustering and insensitivity to input order
- Quality
Similarity measure
- 对于两个样本点的第i个维度: $x_i$和$y_i$,两者的相似性可以用一个距离函数表达:$\rm d(x_i, y_i)$
- Similarity measure are usually different based on type of data: interval-scaled, boolean, categorical, ordinal ratio, and vector variables.
- 距离种类:
- Euclidean: $\sqrt{\sum_{i=1}^{k}\left(x_{i}-y_{i}\right)^{2}}$ 两点所有属性间的距离
- Manhattan: $\sum_{i=1}^{k}\left|x_{i}-y_{i}\right|$
- Minkowski: $\left(\sum_{i=1}^{k}\left(\left|x_{i}-y_i\right|\right)^{2}\right)^{1 / q}$
Major approaches
-
Partitioning approaches (分区):
They create various partitions and then evaluate them by some criterion
e.g., minimizing the sum of square errors
typical methods: k-means, k-medoids, CLARANS -
Hierarchical approaches (分层):
They create a hierarchical decomposition of the set of data (or objects) using some criterion
typical methods: Diana, Agnes, BIRCH, CHAMELEON -
Density-based approaches:
They are based on connectivity and density functions
typical methods: DBSACN, OPTICS, DenClue -
Grid-based approaches:
They are based on a multiple-level granularity structure
typical methods: STING, WaveCluster, CLIQUE -
Model-based approaches:
A model is hypothesized for each of the clusters and then aim to find the best fit of that model to each other
typical methods: EM, SOM, COBWEB -
Frequent pattern-based:
They are based on the analysis of frequent patterns
typical methods: p-Clusteretc…
Partitioning method
K-means 可视化:Visualizing K-Means Clustering
K nearest neighbors
- supervised, efficient, clustering, classification and regression learning algorithm