> For the complete documentation index, see [llms.txt](https://sejkai.gitbook.io/academic/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://sejkai.gitbook.io/academic/ncku-data-mining/clustering.md).

# Clustering

## Clustering

* process of grouping a set of physical or abstract objects into classes of **similar objects**
* **unsupervised classification**
* Quality factors&#x20;
  * similarity measure
  * representation of cluster chosen
  * clustering algorithm
* Similarity
  * Distance Measures
    * Peter and Piotr : 3
    * Two fingerprint : 342.7
* Typical Applications of Clustering Analysis
  * Pattern Recognition
  * market segmentation
  * Biology
  * Geography
  * Insurance
* Requirements of Clustering
  * Scalability
  * Multiple classes
  * Deal with noisy data
  * High dimensionality
  * Interpretability and usability
* Issues
  * Outlier handling
    * Outlier 可以徹底改變 cluster 圈的範圍
  * Dynamic Data
  * Interpreting results (centroid meaning)
  * Number of clusters (k)

## Type of Clusters

### Well-Seperated

* A cluster is a set of points
* point in a cluster is closer (or more similar) to every other point

![](/files/-Lv0okdat9d1O4XA9c3p)

### Center-based

* A cluster is a set of objects
* an object in a cluster is closer (more similar) to the **“center”** of a cluster
* calculate center first, and compare distance from centroid to each group

![](/files/-Lv0okdcXxlT38AdZbXn)

### Contiguity-Based

* Nearest neighbor or Transitive
* A cluster is a set of points
* a point in a cluster is closer (or more similar) to one or more other points

![](/files/-Lv0okdesOKcGK_Ssk77)

### Conceptual

* Finds clusters that share some common property

### Density-based

* A cluster is a dense region of points
* Separated by low-density regions
* Used when the clusters are irregular or intertwined
  * and when noise and outliers are present

![](/files/-Lv0okdg-aAfLsADGOP6)

## Clustering Algorithms

* Partitioning algorithms
  * Construct various **partitions** and then **evaluate** them by some criterion.
* Hierarchy algorithms
  * Create a hierarchical **decomposition** of the set of data (or objects) using some criterion.
* Density-based
  * based on **connectivity** and **density** functions
* Grid-based
  * based on a **multiple-level** granularity structure
* Model-based
  * A model is **hypothesized** for each of the clusters
  * the idea is to find the **best fit** of that model to each other.&#x20;

## Partition-based Clustering

* Construct a partition of a **database D of n objects** into a set of **k clusters**
  * Given a k, find a partition of k clusters
  * **optimizes** the chosen partitioning criterion
    * Global optimal : Iterate all partitions (NP-Hard)
    * Heuristic methods
      * **k-means**
        * each cluster is represented by the **center** of the cluster
      * **k-medoids** (Partition Around Medoids, PAM)
        * each cluster is represented by one of the objects in the cluster.
        * Used when full of outliers

### K-means

* partition objects into **k** nonempty subsets
* compute the centroids (mean) of each partitions
* relocate all objects to the nearest clusters
* Repeat compute centroids and relocate until no more relocation

```
Select K points as the initial centroids

repeat
    Form K clusters by assigning all points to the closest centroid
    Recompute the centroid of each cluster

until The centroids don't change
```

![](/files/-Lv0okdi0oeiOKi1vT5a)

#### Details

* Initial centroids are often **chosen randomly**&#x20;
* centroid is typically the mean of the points in the cluster
* closeness or similarity is measured by :
  * Euclidean distance
  * Cosine similarity
  * Correlation
  * etc.
* K-means will converge eventually
  * Most of the convergence happens in the first few iterations
  * Very fast
  * stopping condition is often changed to **特定的幾個點變動為止**

#### Comments

* Pros
  * Efficient, $$O(nKId)$$
    * n = # points
    * K = # clusters
    * I = # iterations
    * d = # attributes
  * terminates at a local optimum (useful local optimum)
* Cons
  * **Applicable** only when mean is defined
  * Need to specify **k**
  * Unable to handle **noisy data** and **outliers**
* Variations
  * k-modes : Handling categorical data
    * Replacing means of clusters with modes (distance = 0/1)

#### Problem with selecting initial points

* Given Large **K** clusters, and clusters are the same size **n**&#x20;
* The probability of one specific initial points&#x20;

  $$
  \begin{aligned}
  P &= \frac{K!}{K^K} \\
  \text{If K is } 10 &= \frac{10!}{10^{10}} = 0.00036\
  \end{aligned}
  $$

#### Problem with clusters different size

![](/files/-Lv0okdlfSR89mT0K0Ko)

#### Problem with clusters densities

![](/files/-Lv0okdnE9r36H7LRCyl)

#### Problem with clusters non-globular

![](/files/-Lv0okdpr0FPzDJLkTj4)

#### Solution

* One solution is to use many clusters
  * Find parts of clusters, but need to put together
  * Use many k and merge

### K-medoid

* The k-means algorithm is sensitive to outliers
* Instead of taking the mean value, we find the actual point closest to mean
* Then we can get out of outliers (maybe)

![](/files/-Lv0okdraekrG64RBgMb)

#### Methods

* PAM (Partitioning Around Medoids, 1987)
* CLARA (Kaufmann & Rousseeuw, 1990)
* CLARANS (Ng & Han, 1994): Randomized sampling
* Focusing + spatial data structure (Ester et al., 1995)

#### PAM

* Robust than k-means in the presence of noise and outliers
  * A medoid is less influenced by outliers
* Works efficiently for small data sets&#x20;
* But does not **scale well** for large data sets
  * \=> Sampling based method (CLARA)

#### CLARA (Clustering Large Applications)

* Partition for large database
* Steps
  * draws multiple samples of the data set
  * applies PAM on each sample
  * gives the best clustering as the output
* Pros
  * deal with larger data sets than PAM
* Cons
  * efficiency depends on the sample size
  * good clustering on samples means good on whole dataset ?

## Hierarchical Clustering

* Produces a set of nested clusters organized as a **hierarchical tree**
* Can be visualized as a dendrogram (樹枝狀圖)
  * Can record the sequences of merges or splits
* No need to assume number k of clusters
* Hierarchy may correspond to meaningful taxonomies
* Two main types of hierarchical clustering
  * **Agglomerative**
    * Start with the points as individual clusters
    * At each step, merge the closest pair of clusters until only k cluster left
  * **Divisive**
    * Start with one, all-inclusive cluster
    * At each step, split a cluster until each cluster contains a point

![](/files/-Lv0okdtElTxv8vz0ZhV)

### Agglomerative Clustering Algorithm

* More popular
* Key operation is the computation of the proximity of two clusters&#x20;

```
Compute the proximity matrix
Let each data point be a cluster

Repeat
    Merge the two closest clusters
    Update the proximity matrix

Until only a single cluster remains
```

![](/files/-Lv0okdvaZBRlQOglml_)

* How to define similarity to merge ?
  * MIN
    * Can handle non-elliptical shapes
    * But sensitive to noise and outliers
  * MAX
    * Less susceptible to noise and outliers
    * But tends to break large clusters and biased to globular
  * Group Average
  * Distance Between Centroids
  * Ward’s Method
    * Based on the increase in squared error

#### Requirements and Limitations

* Requirements
  * space : $$O(N^2)$$
  * time : $$O(N^3)$$
* Limitations
  * merges cannot be undone
  * no objective function
  * different schemes have different problems
    * sensitivity to noises
    * difficulty to cluster shapes
    * breaking large clusters

### Divisive Clustering Algorithm

* Build MST (Minimum Spanning Tree)
  * build minimum spanning tree
  * cut the longest edges

```
Compute the MST for the proximity graph

Repeat
    Create a new cluster by breaking the link corresponding to the largest disance

Until Only singleton clusters remain
```

## Integration of hierarchical clustering & distance-based

* Clustering algorithms have been designed to handle **very large datasets**
  * BIRCH (1996)
  * CURE (1998)

### BIRCH

* Main idea
  * use an in-memory R-tree to store points that are being clustered
  * Insert points one at a time into the R-tree
    * merging a new point with an existing cluster if is less than some distance
  * If there are more leaf nodes than fit in memory, merge existing clusters that are close to each other
  * At the end of first pass we get a large number of clusters at the leaves of the R-tree
    * Merge clusters to reduce the number of clusters
* Balanced Iterative Reducing and Clustering using Hierarchies
* Incrementally construct a **CF (Clustering Feature) tree**
  * a hierarchical data structure for multiphase clustering

```
1. scan DB to build an initial in-memory CF tree

2. use any clustering algorithm to cluster the leaf nodes of the CF-tree
```

### CURE

* Clustering Using Representatives
* Stops the creation of a cluster hierarchy if a level consists of k clusters
* Use **many points** to represent a cluster instead of only one
  * Call **multiple representative points**
* Points will be well scattered

```
1.Create Clusters with Representative Points

2.Merge Clusters with Closest Points

3.Shrink Representative Points
```

## DBSCAN

* Density-based Algorithm
  * Density = number of points within a specified radius (Eps)
  * **core point**&#x20;
    * it has more than a specified number of points (MinPts) within Eps
    * These are points that are at the interior of a cluster
  * **border point**
    * has fewer than MinPts within Eps
    * but is in the neighborhood of a core point
  * **noise point**
    * any point that is not a core point or a border point

![](/files/-Lv0okdxZ2t_qZy_bIx6)

* DBSCAN can eliminate noise points

![](/files/-Lv0okdzPyKgNA1xPkZ3)

* Works well
  * Resistant to Noise
  * Can handle clusters of different shapes and sizes
* Works fail
  * Varying densities
  * High-dimensional data
* Determining Eps and MinPts
  * Consider k nearest neighbors are at roughly the same distance
