# Classification

## Classification

* Given a collection of records (the **training set**), where each record consists of a set of **attributes** and a **class** label
* Find a **model** that expresses the class attribute as a function of the values of the other attributes
* Goal: **previously unseen** records should be assigned a class as accurately as possible

### Data Preparation

* Data cleaning
* Relevance analysis
* Data transformation

### Classification Evaluation

* accuracy
* speed & scalability
* robustness
  * handling noise and missing values
* interpretability
* goodness of rules
  * decision tree size
  * compactness of classification rules

### Techniques for classification

* Decision Tree
* Memory-based reasoning (k-NN)
* Regression
* Neural Networks
* Naïve Bayes
* Support Vector Machines
* Random Forest

### Issues of Classification

* Missing Values
* Costs of Classification
* Underfitting and Overfitting
  * Address Overfitting
    * Pre-Pruning (Early Stopping Rule)
      * Stop the algorithm before it becomes a fully-grown tree
    * Post-pruning
      * Grow decision tree to its entirety
      * Trim the nodes of the decision tree in a bottom-up fashion
      * If generalization error improves after trimming, replace sub-tree by a leaf node
      * Class label of leaf node is determined from majority class of instances in the sub-tree
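
A minimal post-pruning sketch, assuming a simple `Node` structure and a held-out validation set for estimating generalization error (both the structure and the helper names are illustrative, not from the notes):

```python
class Node:
    """Hypothetical decision-tree node: either a leaf or an internal test on `attr`."""
    def __init__(self, attr=None, children=None, label=None):
        self.attr = attr                 # attribute tested at this node (None for a leaf)
        self.children = children or {}   # attribute value -> child Node
        self.label = label               # majority class of training instances at this node

    def is_leaf(self):
        return not self.children

def predict(node, record):
    while not node.is_leaf():
        child = node.children.get(record.get(node.attr))
        if child is None:                # unseen attribute value: fall back to majority class
            break
        node = child
    return node.label

def validation_errors(node, validation):
    return sum(predict(node, x) != y for x, y in validation)

def post_prune(node, validation):
    """Bottom-up pruning: prune the children first, then try to collapse this node."""
    if node.is_leaf():
        return node
    for value, child in node.children.items():
        node.children[value] = post_prune(child, validation)
    leaf = Node(label=node.label)        # candidate leaf = majority class of the sub-tree
    if validation_errors(leaf, validation) <= validation_errors(node, validation):
        return leaf                      # generalization error did not get worse -> trim
    return node
```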

## Decision tree

* Algorithms
  * Hunt's Algorithm
  * CART
  * ID3, C4.5, C5.0, See5
  * SLIQ, SPRINT
* Tree Induction
  * Greedy strategy
    * Split the records based on an attribute test that optimizes a certain criterion
  * Issues
    * How to split the records
    * When to stop splitting
* Test Condition
  * Depend on attribute types
    * Nominal: `{hot, cool, cold}`
    * Ordinal: `{A, B, C}, {small, medium, large}`
    * Continuous: `{1.25, 2, -3}`
      * Binary split
        * `x > 80 -> Yes, No`
      * Multi-way split
        * `[10, 25], [25, 50], [50, 80]`
  * Depends on number of ways to split
    * binary split
      * Divides values into two subsets.
      * Need to find optimal partitioning.
        * `CarType -> {Sports, Luxury}, {Family}`
    * Multi-way split
      * Use as many partitions as distinct values.
        * `CarType -> Family, Sports, Luxury`
* How to Determine the Best Split
  * Need a measure of node impurity
  * Non-homogeneous, high degree of impurity
    * `a:5, b:5`
  * Homogeneous, low degree of impurity
    * `a:9, b:1`
* Measures of Node Impurity
  * Gini Index
  * Entropy
  * Misclassification error
* Advantages
  * Inexpensive to construct
  * Extremely fast at classifying unknown records
  * Easy to interpret for small-sized trees
  * Accuracy is comparable to other classification techniques for many simple data sets
  * Good at dealing with symbolic features
  * Easy to comprehend

### GINI

$$
GINI(t) = 1 - \sum_j [p(j\mid t)]^2
$$

* Maximum : when records are equally distributed among all classes, implying least interesting information
* Minimum : when all records belong to one class, implying most interesting information
* A smaller GINI value means a better split
  * Worst case 0.5 (for two classes), best case 0

![](/files/-LtEQcoDvd7ywTpIk1l-)
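
A small sketch of how the GINI value follows from the class counts at a node; the `a:5, b:5` and `a:9, b:1` nodes are the ones mentioned above:

```python
def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, computed from the class counts at node t."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([5, 5]))   # 0.5   -> maximum impurity for two classes
print(gini([9, 1]))   # ~0.18 -> more homogeneous, closer to 0
print(gini([10, 0]))  # 0.0   -> pure node
```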

#### GINI with Multiple Attributes

![](/files/-LtEQcoGUO3cIbg_QybL)

* Taking the left split in the figure as an example:

$$
\begin{aligned}
1-(\frac{3}{5})^2 - (\frac{2}{5})^2 &= 0.48\\
1-(\frac{1}{5})^2 - (\frac{4}{5})^2 &= 0.32\\
0.48 \times \frac{5}{10} + 0.32 \times \frac{5}{10} &= 0.4
\end{aligned}
$$
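
The same computation as a sketch, where `gini_split` is the size-weighted average of the child-node GINI values (the counts 3/2 and 1/4 come from the worked example above):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted average of the child GINI values; weights are child size / parent size."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

print(gini([3, 2]))                   # ~0.48
print(gini([1, 4]))                   # ~0.32
print(gini_split([[3, 2], [1, 4]]))   # ~0.4
```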

#### GINI with Continuous Attributes

* The Gini index needs to be computed for each candidate split
  * There are several choices for the splitting value
* For efficient computation:
  * Sort the attribute on values
  * Linearly scan these values
    * Updating the count matrix and computing gini index
  * Choose the split position that has the least gini index

![](/files/-LtEQcoKvMIxCmtIYjgK)
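
A sketch of the sort-and-scan procedure for a continuous attribute; the class counts on each side are updated incrementally, so each candidate threshold costs O(1). The values at the bottom are a hypothetical income attribute, used only to illustrate the mechanics:

```python
def gini(counts):
    n = sum(counts.values())
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_threshold(values, labels):
    """Sort once, then scan thresholds between consecutive values, updating class counts."""
    pairs = sorted(zip(values, labels))
    classes = set(labels)
    left = {c: 0 for c in classes}
    right = {c: labels.count(c) for c in classes}
    n = len(pairs)
    best_gini, best_split = float("inf"), None
    for i in range(n - 1):
        _, y = pairs[i]
        left[y] += 1                      # move one record from the right side to the left
        right[y] -= 1
        if pairs[i][0] == pairs[i + 1][0]:
            continue                      # no valid threshold between two equal values
        weighted = (i + 1) / n * gini(left) + (n - i - 1) / n * gini(right)
        if weighted < best_gini:
            best_gini = weighted
            best_split = (pairs[i][0] + pairs[i + 1][0]) / 2
    return best_split, best_gini

print(best_threshold([60, 70, 75, 85, 90, 95, 100, 120, 125, 220],
                     ["N", "N", "N", "Y", "Y", "Y", "N", "N", "N", "N"]))  # ~(97.5, 0.3)
```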

### Entropy

$$
Entropy(t) = -\sum_j p(j\mid t)\log_2 p(j\mid t)
$$

* NOTE: $$p(j\mid t)$$ is relative frequency of class j at node t
* Maximum : when records are equally distributed among all classes, implying least information
* Minimum : when all records belong to one class, implying most information
* Smaller entropy means a better split
  * Entropy-based computations are similar to the GINI index computations
  * Worst case 1 (for two classes), best case 0
* The base-2 logarithm corresponds to 2 classes; with 10 classes a base-10 logarithm can be used

![](/files/-LtEQcoNfZL4HDFG0fdH)
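
The corresponding sketch for entropy, using the same node counts as in the GINI example:

```python
from math import log2

def entropy(counts):
    """Entropy(t) = -sum_j p(j|t) * log2 p(j|t), with 0 * log(0) treated as 0."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

print(entropy([5, 5]))   # 1.0    -> maximum for two classes
print(entropy([9, 1]))   # ~0.469
print(entropy([10, 0]))  # 0.0    -> pure node
```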

#### Information Gain

$$
GAIN_{\text{split}} = Entropy(p) - \left(\sum_{i=1}^k \frac{n_i}{n}Entropy(i)\right)
$$

* Measures the reduction in entropy achieved because of the split
* Choose the split that achieves the most reduction (maximizes GAIN)
* Used in ID3 and C4.5
* Disadvantage
  * Tends to prefer splits that result in a large number of partitions, each being small but pure.
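
A sketch of the gain computation for a hypothetical split of a 10-record node with class counts 4/6 into children with counts 3/2 and 1/4:

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def info_gain(parent, children):
    """GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(child_i)."""
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

print(info_gain([4, 6], [[3, 2], [1, 4]]))  # ~0.125
```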

#### Gain Ratio

$$
GainRATIO_\text{split} = \frac{GAIN_\text{split}}{SplitINFO}
$$

$$
SplitINFO = -\sum_{i=1}^k \frac{n_i}{n}\log_2\frac{n_i}{n}
$$

* Adjusts Information Gain by the entropy of the partitioning (SplitINFO).
* Higher entropy partitioning (large number of small partitions) is penalized!
* Used in C4.5
* Designed to overcome the disadvantage of Information Gain
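
A sketch of SplitINFO and the gain ratio; the two splits below show how a many-way split into small partitions is penalized by a larger SplitINFO (the 0.125 gain reuses the information-gain example above):

```python
from math import log2

def split_info(child_sizes):
    """SplitINFO = -sum_i (n_i / n) * log2(n_i / n)."""
    n = sum(child_sizes)
    return -sum(s / n * log2(s / n) for s in child_sizes if s > 0)

def gain_ratio(gain, child_sizes):
    return gain / split_info(child_sizes)

print(split_info([5, 5]))                  # 1.0    -> 2-way split
print(split_info([2, 2, 2, 2, 2]))         # ~2.32  -> 5-way split, larger penalty
print(gain_ratio(0.125, [5, 5]))           # 0.125
print(gain_ratio(0.125, [2, 2, 2, 2, 2]))  # ~0.054
```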

### Classification error

$$
Error(t) = 1 - \max_i P(i\mid t)
$$

* Maximum : when records are equally distributed among all classes, implying least interesting information
* Minimum : when all records belong to one class, implying most interesting information
* A smaller error means a better split
* Worst case 0.5 (for two classes), best case 0

![](/files/-LtEQcoPJcdMRPgFdvDR)

### Splitting Criteria Comparison

![](/files/-LtEQcoRHE95ahvBfXni)
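
For a binary node with positive-class fraction `p`, all three measures are 0 for a pure node and peak at `p = 0.5`; a quick sketch comparing them:

```python
from math import log2

def gini(p):      return 1 - p ** 2 - (1 - p) ** 2
def entropy(p):   return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))
def cls_error(p): return 1 - max(p, 1 - p)

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"p={p:.1f}  gini={gini(p):.3f}  entropy={entropy(p):.3f}  error={cls_error(p):.3f}")
```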

## Nearest Neighbor Classifiers

* Requires three things
  * The set of stored records
  * **Distance Metric** to compute distance between records
  * **The value of k**, the number of nearest neighbors to retrieve
* To classify an unknown record
  * Compute distance to other training records
  * Identify k nearest neighbors
  * Use class labels of nearest neighbors to determine the class label of unknown record

![](/files/-LtEQcoTs50K9gdOUV_G)

* k-NN classifiers are lazy learners
* Compute distance between two points

  $$
  d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}
  $$
* Determine the class from nearest neighbor list
* Choosing the value of k
  * k is too small, sensitive to noise points
  * k is too large, neighborhood may include points from other classes
* Scaling issues
  * Attributes may have to be scaled to prevent the distance measure from being dominated by one attribute
    * e.g. height, weight, and income measured in very different units
* distance measure issues
  * High dimensional data may cause **curse of dimensionality**
  * Can produce counter-intuitive results
  * Solution
    * Normalize the vectors to unit length
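
A minimal k-NN sketch with Euclidean distance and a majority vote; the 2-D training data is hypothetical, and as noted above the attributes should already be scaled to comparable ranges:

```python
from collections import Counter
from math import sqrt

def euclidean(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label); majority vote among the k nearest records."""
    nearest = sorted(train, key=lambda rec: euclidean(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.1), "a"), ((1.2, 0.9), "a"), ((5.0, 5.2), "b"), ((4.8, 5.1), "b")]
print(knn_predict(train, (1.1, 1.0), k=3))  # "a"
```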

## Bayesian Classifiers

* A probabilistic framework for solving classification problems
* Bayes theorem

  $$
  P(C\mid A) = \frac{P(A\mid C)P(C)}{P(A)}
  $$
* Consider each attribute and the class label as random variables
  * Given a record with attributes $$(A_1, A_2, \cdots, A_n)$$
  * Goal is to predict class $$C$$
* Estimate $$P(C\mid A_1 , A_2 ,\cdots,A_n)$$ directly from data
  * compute the posterior probability for all values of C

$$
P(C\mid A_1 A_2\cdots A_n) = \frac{P(A_1 A_2\cdots A_n\mid C)P(C)}{P(A_1 A_2\cdots A_n)}
$$

## Naïve Bayes Classifier

* Assume independence among attributes

  $$
  P(A_1 , A_2 ,\cdots,A_n\mid C) = P(A_1\mid C) P(A_2\mid C)\cdots P(A_n\mid C)
  $$
* Estimate all $$P(A_i \mid C_j)$$
* Maximize $$P(C_j) \prod_i P(A_i\mid C_j)$$
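
A minimal Naïve Bayes sketch for categorical attributes, estimating P(C) and P(A_i | C) by counting (no smoothing; the toy data and attribute names are hypothetical):

```python
from collections import Counter, defaultdict

def train_nb(records, labels):
    """Estimate class priors P(C) and conditionals P(A_i | C) from categorical data."""
    class_counts = Counter(labels)
    cond_counts = defaultdict(Counter)        # (attribute index, class) -> value counts
    for rec, c in zip(records, labels):
        for i, value in enumerate(rec):
            cond_counts[(i, c)][value] += 1
    return class_counts, cond_counts

def predict_nb(record, class_counts, cond_counts):
    n = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, count in class_counts.items():
        score = count / n                                  # P(C)
        for i, value in enumerate(record):
            score *= cond_counts[(i, c)][value] / count    # P(A_i | C), no smoothing
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical toy data: (outlook, windy) -> play
X = [("sunny", "no"), ("sunny", "yes"), ("rainy", "yes"), ("rainy", "no"), ("sunny", "no")]
y = ["yes", "no", "no", "yes", "yes"]
priors, conditionals = train_nb(X, y)
print(predict_nb(("sunny", "no"), priors, conditionals))   # "yes"
```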

### Example

![](/files/-LtEQcoWjZYuV4grYs95)

* Summary
  * Robust to **isolated noise points**
  * Handle missing values by ignoring the instance during probability estimate calculations
  * Robust to irrelevant attributes
  * **Independence** assumption may not hold for some attributes

## Support Vector Machines

* Find a linear hyperplane (decision boundary) that will separate the data
* Find the hyperplane that maximizes the margin

![](/files/-LtEQcoY8Tf3YyuBAeYI)

$$
\text{Margin} = \frac{2}{\lVert \vec{w}\rVert}
$$

* This is a constrained optimization problem
  * Numerical approaches to solve it
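
Maximizing the margin is equivalent to minimizing $$\lVert \vec{w}\rVert^2$$, which gives the standard primal form of the constrained problem referred to above:

$$
\min_{\vec{w},\, b}\ \frac{\lVert \vec{w}\rVert^2}{2}
\quad \text{subject to} \quad
y_i\,(\vec{w}\cdot\vec{x}_i + b) \ge 1,\quad i = 1,\dots,N
$$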

## Ensemble Methods

* Construct a set of classifiers from the training data
* Predict class label of previously unseen records by aggregating predictions made by multiple classifiers

![](/files/-LtEQco_VDYUUmpi3m7U)

* Why does it work?
  * Suppose there are 25 base classifiers
    * Each classifier has error rate $$\epsilon = 0.35$$
    * Assume classifiers are independent
    * Probability that the ensemble classifier makes a wrong prediction (verified numerically in the sketch after this list)

      $$
      \sum_{i=13}^{25}\binom{25}{i}\epsilon^i(1-\epsilon)^{25-i} \approx 0.06
      $$
* Methods
  * Bagging
  * Boosting
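
The 0.06 figure above can be checked numerically: a majority vote of 25 independent base classifiers, each with error $$\epsilon = 0.35$$, is wrong only when 13 or more of them err.

```python
from math import comb

eps, n = 0.35, 25
p_wrong = sum(comb(n, i) * eps ** i * (1 - eps) ** (n - i) for i in range(13, n + 1))
print(round(p_wrong, 2))   # 0.06 -> far below the 0.35 error of each base classifier
```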

### Boosting

* Training
  * Produce a sequence of classifiers
  * Each classifier is dependent on the previous one
    * and focuses on the previous one’s errors
  * Examples that are **incorrectly** predicted in previous classifiers are given **higher weights**
* Testing
  * the results of the series of classifiers are combined to determine the final class of the test case
* Records that are wrongly classified will have their weights increased
* Records that are classified correctly will have their weights decreased

#### AdaBoost

![](/files/-LtEQcobfQbN3iZXpsUy)

* The actual performance of boosting depends on the data and the base learner.
* Boosting seems to be **susceptible to noise**.
  * When the number of outliers is very large, the emphasis placed on the hard examples can hurt the performance.
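
A minimal sketch of one boosting round's weight update; the `alpha` and the exponential update follow the usual AdaBoost formulas, while everything else (including the toy data) is illustrative:

```python
from math import exp, log

def reweight(weights, correct):
    """One AdaBoost-style round: raise weights of misclassified records, lower the rest."""
    err = sum(w for w, ok in zip(weights, correct) if not ok) / sum(weights)
    alpha = 0.5 * log((1 - err) / err)            # importance of this round's classifier
    updated = [w * exp(alpha if not ok else -alpha) for w, ok in zip(weights, correct)]
    z = sum(updated)
    return [w / z for w in updated]               # renormalize to a distribution

weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]               # the 4th record was misclassified
print(reweight(weights, correct))                 # ~[0.167, 0.167, 0.167, 0.5]
```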

## Semi-supervised learning

* Assigning labels to training set
  * expensive, time consuming
* Abundance of unlabeled data
  * LU learning
    * Learning with a small set of labeled examples and a large set of unlabeled examples
  * PU learning
    * Learning with positive and unlabeled examples

| Usage          | Supervised | Semi-Supervised | Unsupervised |
| -------------- | ---------- | --------------- | ------------ |
| labeled data   | yes        | yes             | no           |
| unlabeled data | no         | yes             | yes          |

* Label propagation
  * Each node in the graph is an example
    * Two examples are labeled
    * Most examples are unlabeled
  * Compute the similarity between examples
  * Connect examples to their most similar examples

![](/files/-LtEQcodq6ZLm70InlYm)
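
A much-simplified propagation sketch on a small hypothetical graph: labels spread from the two labeled seeds to their connected (most similar) neighbours by repeated majority voting. Real label propagation uses weighted iterative updates over the similarity graph; this only illustrates the idea.

```python
from collections import Counter

def propagate(labels, neighbors, iterations=10):
    """labels: node -> class or None; neighbors: node -> list of most-similar nodes."""
    labels = dict(labels)
    for _ in range(iterations):
        changed = False
        for node, nbrs in neighbors.items():
            if labels[node] is not None:
                continue
            votes = Counter(labels[n] for n in nbrs if labels[n] is not None)
            if votes:
                labels[node] = votes.most_common(1)[0][0]
                changed = True
        if not changed:
            break
    return labels

# Two labeled seeds ("a" is +, "e" is -) on a chain a-b-c and a pair d-e
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": ["e"], "e": ["d"]}
labels = {"a": "+", "b": None, "c": None, "d": None, "e": "-"}
print(propagate(labels, neighbors))  # b and c become "+", d becomes "-"
```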

* Application of PU Learning
  * Given the ICML proceedings (positive examples), find all machine learning papers in the AAAI, IJCAI, and KDD collections
  * Given a person's bookmarks (positive documents), identify other documents of interest to him/her from Web sources
  * A company has a database with details of its customers (positive examples), but no information on people who are not its customers
    * In each of these collections there is no labeling of negative examples

### PU Learning 2-step strategy

1. Identify a set of **reliable negative** documents from the unlabeled set.
   * Reliable negatives: unlabeled documents that can be treated as negative with near certainty, as if they had been labeled negative by a human
2. Build a sequence of classifiers by iteratively applying a classification algorithm and then selecting a good classifier.

![](/files/-LtEQcogrf7S7QwzRmU2)

## Co-training Algorithm

* Given
  * labeled data L
  * unlabeled data U
* Compute
  * Train **h1** using L
  * Train **h2** using L
  * Allow h1 to label p positive, n negative examples from U
  * Allow h2 to label p positive, n negative examples from U
  * Add these most confident self-labeled examples to L
  * Loop
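
A sketch of the loop above, assuming scikit-learn-style binary classifiers (with `fit` / `predict_proba`), two feature views `X1` and `X2` of the same examples, labels in {0, 1}, and a mutable label array `y`; all of these interface details are assumptions, not from the source:

```python
import numpy as np
from sklearn.base import clone

def co_train(h1, h2, X1, X2, y, labeled, unlabeled, p=1, n=1, rounds=10):
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        h1 = clone(h1).fit(X1[labeled], y[labeled])       # train h1 on view 1 of L
        h2 = clone(h2).fit(X2[labeled], y[labeled])       # train h2 on view 2 of L
        for h, X in ((h1, X1), (h2, X2)):
            if not unlabeled:
                break
            proba = h.predict_proba(X[unlabeled])[:, 1]   # confidence of the positive class
            order = np.argsort(proba)
            picks = set(order[-p:]) | set(order[:n])      # p most-positive, n most-negative
            for j in sorted(picks, reverse=True):         # delete from the end to keep indices valid
                idx = unlabeled[j]
                y[idx] = int(proba[j] >= 0.5)             # self-label with h's prediction
                labeled.append(idx)
                del unlabeled[j]
    return h1, h2
```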

### Example

Consider the task of classifying web pages into two categories: a category for students and a category for professors.

Two aspects of web pages should be considered

* Content of web pages
  * "I am currently the second year Ph.D. student …"
* Hyperlinks
  * "My advisor is …"

#### Scheme

1. Train a content-based classifier using labeled web pages
2. Apply the content-based classifier to classify unlabeled web pages
3. Label the web pages that have been confidently classified
4. Train a hyperlink-based classifier using the web pages that **are initially labeled and labeled by the content-based classifier**
5. Apply the hyperlink-based classifier to classify the unlabeled web pages
6. Label the web pages that have been confidently classified

