watch: AML | Clustering

Video 15 Clustering 2021-11-22

Outline:

  1. Clustering
  2. K-means
  3. K nearest neighbor

Cluster

  • Cluster is a set of data objects that
    • similar to one another within the same group
    • dissimilar to the objects in other groups
  • High quality clusters:
    • High intra-class similarity;
    • Low inter-class similarity

Cluster analysis

  • an unsupervised learning (unlabeled)
  1. 给定一组数据对象
  2. 找到数据对象之间的相似性
  3. 把相似的数据对象归到 clusters
  • 典型应用:

    • As a stand-alone tool to get insight into data distribution
    • As a preprocessing step for other algorithms (classifier, regressor)
  • 聚类方法的质量的相关因素:

    1. the similarity measure used by the method
    2. its implementation
    3. its ability to discover some or all of the hidden patterns
  • 聚类分析的考虑因素

    • Partitioning criteria
      Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
    • Separation of clusters
      Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class) 是否专属于1类
    • Similarity measure
      Distance-based (e.g., Euclidian, road network, vector) vs. connectivity-based (e.g., density or contiguity)
    • Clustering space
      Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)
  • 聚类的挑战和要求

    • Quality
      • Ability to deal with different type of attributes (不同属性)
      • Discovery of clusters with arbitrary shape (任意形状)
      • Ability to deal with noisy data (噪声)
    • Interpretability and usability
    • Constraint based clustering
    • Scalability
      • Constraint based clustering
      • High dimensionality
      • Incremental clustering and insensitivity to input order

Similarity measure

  • 对于两个样本点的第i个维度: $x_i$和$y_i$,两者的相似性可以用一个距离函数表达:$\rm d(x_i, y_i)$
  • Similarity measure are usually different based on type of data: interval-scaled, boolean, categorical, ordinal ratio, and vector variables.
  • 距离种类:
    1. Euclidean: $\sqrt{\sum_{i=1}^{k}\left(x_{i}-y_{i}\right)^{2}}$ 两点所有属性间的距离
    2. Manhattan: $\sum_{i=1}^{k}\left|x_{i}-y_{i}\right|$
    3. Minkowski: $\left(\sum_{i=1}^{k}\left(\left|x_{i}-y_i\right|\right)^{2}\right)^{1 / q}$

Major approaches

  • Partitioning approaches (分区):
    They create various partitions and then evaluate them by some criterion
    e.g., minimizing the sum of square errors
    typical methods: k-means, k-medoids, CLARANS

  • Hierarchical approaches (分层):
    They create a hierarchical decomposition of the set of data (or objects) using some criterion
    typical methods: Diana, Agnes, BIRCH, CHAMELEON

  • Density-based approaches:
    They are based on connectivity and density functions
    typical methods: DBSACN, OPTICS, DenClue

  • Grid-based approaches:
    They are based on a multiple-level granularity structure
    typical methods: STING, WaveCluster, CLIQUE

  • Model-based approaches:
    A model is hypothesized for each of the clusters and then aim to find the best fit of that model to each other
    typical methods: EM, SOM, COBWEB

  • Frequent pattern-based:
    They are based on the analysis of frequent patterns
    typical methods: p-Cluster

    etc…

Partitioning method

K-means 可视化:Visualizing K-Means Clustering

K nearest neighbors

  • supervised, efficient, clustering, classification and regression learning algorithm