Clustering

  • process of grouping a set of physical or abstract objects into classes of similar objects

  • unsupervised classification

  • Quality factors

    • similarity measure

    • representation of cluster chosen

    • clustering algorithm

  • Similarity

    • Distance Measures

      • Edit distance between “Peter” and “Piotr” : 3

      • Distance between two fingerprints : 342.7 (see the distance sketch after this outline)

  • Typical Applications of Clustering Analysis

    • Pattern Recognition

    • market segmentation

    • Biology

    • Geography

    • Insurance

  • Requirements of Clustering

    • Scalability

    • Multiple classes

    • Deal with noisy data

    • High dimensionality

    • Interpretability and usability

  • Issues

    • Outlier handling

      • Outliers can drastically change the extent of a cluster

    • Dynamic Data

    • Interpreting results (centroid meaning)

    • Number of clusters (k)
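
A minimal sketch of the two distances mentioned in the Similarity bullet above: a Levenshtein edit distance for strings, and a Euclidean distance for numeric feature vectors. The fingerprint vectors below are made-up numbers, used only for illustration.

import numpy as np

def edit_distance(a: str, b: str) -> int:
    # Levenshtein distance: minimum number of insertions, deletions,
    # and substitutions needed to turn string a into string b
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

print(edit_distance("Peter", "Piotr"))        # 3

# Euclidean distance between two hypothetical fingerprint feature vectors
f1 = np.array([10.2, 33.0, 7.5])
f2 = np.array([210.0, 300.1, 12.0])
print(np.linalg.norm(f1 - f2))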

Type of Clusters

Well-Separated

  • A cluster is a set of points

  • any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster

Center-based

  • A cluster is a set of objects

  • an object in a cluster is closer (more similar) to the “center” of its cluster than to the center of any other cluster

  • compute the centers first, then compare each object's distance to every centroid

Contiguity-Based

  • Nearest neighbor or Transitive

  • A cluster is a set of points

  • a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster

Conceptual

  • Finds clusters that share some common property

Density-based

  • A cluster is a dense region of points

  • Separated by low-density regions

  • Used when the clusters are irregular or intertwined

    • and when noise and outliers are present

Clustering Algorithms

  • Partitioning algorithms

    • Construct various partitions and then evaluate them by some criterion.

  • Hierarchy algorithms

    • Create a hierarchical decomposition of the set of data (or objects) using some criterion.

  • Density-based

    • based on connectivity and density functions

  • Grid-based

    • based on a multiple-level granularity structure

  • Model-based

    • A model is hypothesized for each of the clusters

    • the idea is to find the best fit of the data to the hypothesized model

Partition-based Clustering

  • Construct a partition of a database D of n objects into a set of k clusters

    • Given a k, find a partition of k clusters

    • optimizes the chosen partitioning criterion

      • Global optimum : enumerate all possible partitions (NP-hard)

      • Heuristic methods

        • k-means

          • each cluster is represented by the center of the cluster

        • k-medoids (Partition Around Medoids, PAM)

          • each cluster is represented by one of the objects in the cluster.

          • Preferred when the data contains many outliers

K-means

  • partition objects into k nonempty subsets

  • compute the centroid (mean) of each partition

  • relocate all objects to the nearest clusters

  • Repeat compute centroids and relocate until no more relocation

Select K points as the initial centroids

repeat
    Form K clusters by assigning all points to the closest centroid
    Recompute the centroid of each cluster

until The centroids don't change
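
A minimal NumPy sketch of the loop above; it is not an optimized implementation, and an empty cluster simply keeps its previous centroid.

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    # X: (n, d) array of points, k: number of clusters
    rng = np.random.default_rng(seed)
    # Select K points as the initial centroids (chosen randomly)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Form K clusters by assigning every point to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute the centroid (mean) of each cluster;
        # an empty cluster keeps its previous centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Stop when the centroids don't change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids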

Details

  • Initial centroids are often chosen randomly

  • centroid is typically the mean of the points in the cluster

  • closeness or similarity is measured by :

    • Euclidean distance

    • Cosine similarity

    • Correlation

    • etc.

  • K-means will converge eventually

    • Most of the convergence happens in the first few iterations

    • Very fast

    • the stopping condition is often relaxed to “until only a few points change clusters”

Comments

  • Pros

    • Efficient, O(nKId)

      • n = # points

      • K = # clusters

      • I = # iterations

      • d = # attributes

    • often terminates at a local optimum that is good enough in practice

  • Cons

    • Applicable only when mean is defined

    • Need to specify k

    • Unable to handle noisy data and outliers

  • Variations

    • k-modes : Handling categorical data

      • Replacing the means of clusters with modes, using a simple 0/1 matching distance for categorical attributes
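
A tiny sketch of the 0/1 matching distance used by k-modes; the attribute values below are invented for illustration.

def matching_dissimilarity(a, b):
    # k-modes style 0/1 distance: count the attributes on which two records differ
    return sum(x != y for x, y in zip(a, b))

print(matching_dissimilarity(("red", "S", "cotton"), ("red", "M", "wool")))  # 2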

Problem with selecting initial points

  • Suppose there are K “real” clusters, all of the same size n

  • The probability that K randomly chosen initial centroids contain exactly one point from each cluster is

    \begin{aligned} P &= \frac{\text{ways to pick one centroid from each cluster}}{\text{ways to pick } K \text{ centroids}} = \frac{K!\,n^K}{(Kn)^K} = \frac{K!}{K^K} \\ \text{If } K \text{ is } 10 &: \quad P = \frac{10!}{10^{10}} \approx 0.00036 \end{aligned}
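
A quick numeric check of the formula:

from math import factorial

K = 10
p = factorial(K) / K**K   # chance that K random initial centroids hit one cluster each
print(p)                  # ~0.00036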

Problem with clusters of different sizes

Problem with clusters of different densities

Problem with non-globular clusters

Solution

  • One solution is to use many clusters

    • This finds pieces of the natural clusters, which then need to be put back together

    • i.e., run the algorithm with a large k and merge the resulting clusters afterwards

K-medoid

  • The k-means algorithm is sensitive to outliers

  • Instead of taking the mean value, use an actual object in the cluster (the medoid, the most centrally located object) as the cluster representative

  • This makes the result much less sensitive to outliers
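
A simplified k-medoids sketch, assuming the alternating assignment / medoid-update variant rather than PAM's full swap search; it is meant only to illustrate the idea of using actual objects as centers.

import numpy as np

def k_medoids(X, k, max_iters=100, seed=0):
    # Simplified k-medoids (Voronoi-iteration variant, not full PAM)
    rng = np.random.default_rng(seed)
    medoid_idx = rng.choice(len(X), size=k, replace=False)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    for _ in range(max_iters):
        # assign each object to its nearest medoid
        labels = dist[:, medoid_idx].argmin(axis=1)
        new_idx = medoid_idx.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # new medoid = the member object minimizing total distance to the others
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_idx[j] = members[within.argmin()]
        if np.array_equal(np.sort(new_idx), np.sort(medoid_idx)):
            break
        medoid_idx = new_idx
    return labels, X[medoid_idx]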

Methods

  • PAM (Partitioning Around Medoids, 1987)

  • CLARA (Kaufmann & Rousseeuw, 1990)

  • CLARANS (Ng & Han, 1994): Randomized sampling

  • Focusing + spatial data structure (Ester et al., 1995)

PAM

  • More robust than k-means in the presence of noise and outliers

    • A medoid is less influenced by outliers

  • Works efficiently for small data sets

  • But does not scale well for large data sets

    • => Sampling based method (CLARA)

CLARA (Clustering Large Applications)

  • Partitioning method designed for large databases

  • Steps

    • draws multiple samples of the data set

    • applies PAM on each sample

    • gives the best clustering as the output

  • Pros

    • deal with larger data sets than PAM

  • Cons

    • efficiency depends on the sample size

    • a good clustering of the samples is not necessarily a good clustering of the whole data set (e.g., if the samples are biased)
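
A CLARA-style sketch that reuses the k_medoids sketch defined earlier: cluster several random samples, then keep the medoids that give the lowest total cost on the full data. The sample count and sample size are illustrative values, not the ones from the original paper.

import numpy as np

def clara(X, k, n_samples=5, sample_size=40, seed=0):
    rng = np.random.default_rng(seed)
    best_cost, best_medoids, best_labels = np.inf, None, None
    for s in range(n_samples):
        # draw a random sample and apply the k-medoids sketch to it
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        _, medoids = k_medoids(X[idx], k, seed=seed + s)
        # evaluate this sample's medoids on the whole data set
        d = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        cost = d.min(axis=1).sum()
        if cost < best_cost:
            best_cost, best_medoids, best_labels = cost, medoids, labels
    return best_labels, best_medoids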

Hierarchical Clustering

  • Produces a set of nested clusters organized as a hierarchical tree

  • Can be visualized as a dendrogram (a tree-shaped diagram)

    • Can record the sequences of merges or splits

  • No need to assume number k of clusters

  • Hierarchy may correspond to meaningful taxonomies

  • Two main types of hierarchical clustering

    • Agglomerative

      • Start with the points as individual clusters

      • At each step, merge the closest pair of clusters until only one cluster (or k clusters) remains

    • Divisive

      • Start with one, all-inclusive cluster

      • At each step, split a cluster until each cluster contains a single point (or there are k clusters)

Agglomerative Clustering Algorithm

  • More popular

  • Key operation is the computation of the proximity of two clusters

Compute the proximity matrix
Let each data point be a cluster

Repeat
    Merge the two closest clusters
    Update the proximity matrix

Until only a single cluster remains

  • How to define the similarity between clusters to decide which pair to merge ? (see the SciPy sketch after this list)

    • MIN

      • Can handle non-elliptical shapes

      • But sensitive to noise and outliers

    • MAX

      • Less susceptible to noise and outliers

      • But tends to break large clusters and is biased toward globular shapes

    • Group Average

    • Distance Between Centroids

    • Ward’s Method

      • Based on the increase in squared error
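
The proximity definitions above map directly onto SciPy's agglomerative clustering; a small usage sketch on random data, purely for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))

# method selects the cluster-proximity definition listed above:
# 'single' = MIN, 'complete' = MAX, 'average' = group average,
# 'centroid' = distance between centroids, 'ward' = Ward's method
Z = linkage(X, method='ward')

labels = fcluster(Z, t=3, criterion='maxclust')  # cut the dendrogram into 3 clusters
dendrogram(Z)                                    # draws the merge sequence (needs matplotlib)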

Requirements and Limitations

  • Requirements

    • space : O(N^2)

    • time : O(N^3)

  • Limitations

    • merges cannot be undone

    • no objective function

    • different schemes have different problems

      • sensitivity to noise and outliers

      • difficulty handling clusters of different sizes and non-globular shapes

      • breaking large clusters

Divisive Clustering Algorithm

  • Build MST (Minimum Spanning Tree)

    • build minimum spanning tree

    • cut the longest edges

Compute the MST for the proximity graph

Repeat
    Create a new cluster by breaking the link corresponding to the largest distance

Until Only singleton clusters remain
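
A sketch of this MST-based divisive procedure using SciPy, assuming Euclidean distances and a desired number of clusters k.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mst_divisive(X, k):
    # Build the MST of the proximity graph, break the k-1 largest edges,
    # and return the connected components as clusters
    dist = squareform(pdist(X))                  # full distance (proximity) matrix
    mst = minimum_spanning_tree(dist).toarray()  # MST as a weighted adjacency matrix
    for _ in range(k - 1):
        # remove the remaining link with the largest distance
        i, j = np.unravel_index(np.argmax(mst), mst.shape)
        mst[i, j] = 0.0
    _, labels = connected_components(mst, directed=False)
    return labels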

Integration of hierarchical clustering & distance-based

  • Clustering algorithms have been designed to handle very large datasets

    • BIRCH (1996)

    • CURE (1998)

BIRCH

  • Main idea

    • use an in-memory R-tree to store points that are being clustered

    • Insert points one at a time into the R-tree

      • merging the new point into an existing cluster if its distance to that cluster is less than some threshold

    • If there are more leaf nodes than fit in memory, merge existing clusters that are close to each other

    • At the end of first pass we get a large number of clusters at the leaves of the R-tree

      • Merge clusters to reduce the number of clusters

  • Balanced Iterative Reducing and Clustering using Hierarchies

  • Incrementally construct a CF (Clustering Feature) tree

    • a hierarchical data structure for multiphase clustering

1. scan DB to build an initial in-memory CF tree

2. use any clustering algorithm to cluster the leaf nodes of the CF-tree
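
scikit-learn ships a BIRCH implementation that follows the same two-phase idea; a minimal usage sketch with illustrative parameter values.

import numpy as np
from sklearn.cluster import Birch

X = np.random.default_rng(0).normal(size=(1000, 2))

# Phase 1: scan the data once and build the in-memory CF tree (threshold, branching_factor)
# Phase 2: cluster the CF-tree leaf entries into n_clusters groups
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = birch.fit_predict(X)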

CURE

  • Clustering Using Representatives

  • Stops the creation of a cluster hierarchy if a level consists of k clusters

  • Use many points to represent a cluster instead of only one

    • These are called the representative points

  • The representative points are chosen to be well scattered across the cluster

1. Create Clusters with Representative Points

2. Merge Clusters with Closest Points

3. Shrink Representative Points
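
A tiny sketch of the shrink step (step 3), assuming a cluster's representative points are stored as a NumPy array and alpha is the shrink factor.

import numpy as np

def shrink_representatives(reps, alpha=0.2):
    # Move each representative point a fraction alpha of the way toward the
    # cluster centroid, which reduces the influence of outliers at the border
    centroid = reps.mean(axis=0)
    return reps + alpha * (centroid - reps)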

DBSCAN

  • Density-based Algorithm

    • Density = number of points within a specified radius (Eps)

    • core point

      • it has more than a specified number of points (MinPts) within Eps

      • These are points that are at the interior of a cluster

    • border point

      • has fewer than MinPts within Eps

      • but is in the neighborhood of a core point

    • noise point

      • any point that is not a core point or a border point

  • DBSCAN can eliminate noise points

  • Works well

    • Resistant to Noise

    • Can handle clusters of different shapes and sizes

  • Does not work well

    • Varying densities

    • High-dimensional data

  • Determining Eps and MinPts

    • Idea : points inside a cluster have their k-th nearest neighbors at roughly the same distance, while noise points have theirs farther away

    • Plot each point's distance to its k-th nearest neighbor in sorted order and pick Eps at the knee of the curve, with MinPts = k (sketch below)
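
A sketch of the k-distance heuristic plus a DBSCAN call with scikit-learn; the eps value below is a placeholder that would normally be read off the knee of the sorted k-distance curve.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(0).normal(size=(500, 2))

# k-distance curve: each point's distance to its k-th nearest neighbor, sorted
k = 4
dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
k_dist = np.sort(dist[:, -1])   # column 0 is the point itself, so take the last column

labels = DBSCAN(eps=0.3, min_samples=k).fit_predict(X)   # label -1 marks noise points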
