Solving the Problem of Overfitting


The Problem of Overfitting

When we look for a hypothesis to fit our training set, three situations can occur:

  1. Underfit (High bias)

    • The hypothesis does not fit the example data well

  2. Just right

    • The hypothesis fits the example data well

  3. Overfit (High variance)

    • The hypothesis tries too hard to pass through every single training example

We can define overfitting as follows:

with too many features, the learned hypothesis scores very well on the training data (cost function J close to 0),

but fails to generalize to the test set or to real-world data.

The same thing can happen in logistic regression.

Overfitting is caused by using too many features,

while not knowing which of them can safely be removed (they might still be useful for training).

There are two ways to address it:

  1. Reduce the number of features

    • Manually select which features to discard

    • Use a model selection algorithm (covered later in another course)

  2. Regularization

    • Keep all the features, but reduce the magnitude of the parameters $\theta$

    • This way we never have to worry about removing a feature that is actually useful for training

Cost Function

If we want to reduce the overfitting caused by redundant (or possibly redundant) features,

we simply make those features more expensive in the cost function, so that their influence on the hypothesis shrinks.

Take the hypothesis $\theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3 + \theta_4x^4$ as an example.

We adjust its cost function to:

$$\min_\theta \frac{1}{2m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2 + 1000\cdot\theta_3^2 + 1000\cdot\theta_4^2$$

The new cost function reduces the influence of $\theta_3$ and $\theta_4$ on the hypothesis

without removing them entirely.

The constant 1000 can be any other fairly large value, but it must not be too large,

otherwise $\theta_3$ and $\theta_4$ are driven essentially to zero.

The purple curve (top right in the course figure) is the new hypothesis function after implementing regularization.

Since we generally do not know which particular feature causes the overfitting,

we apply regularization to every $\theta_j$ except $\theta_0$, which is left unchanged,

and obtain the new formula:

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+\lambda\sum_{j=1}^n \theta_j^2\right]$$
  • $\lambda$ is the regularization parameter

    • It controls how much the parameters $\theta$ contribute to the overall cost function

  • Simply adding this extra term is enough to reduce the effect of overfitting (see the Octave sketch after this list)

  • Be careful, though: if $\lambda$ is extremely large,

    • every $\theta_j$ is pushed toward 0,

    • and we are back to an underfit model
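
As a minimal Octave sketch (not from the course itself), the regularized linear-regression cost could be computed like this, assuming a design matrix `X` with a leading column of ones, a label vector `y`, a parameter vector `theta`, and a chosen `lambda`:

```octave
% Sketch: regularized cost for linear regression
% X is m x (n+1) with a bias column, y is m x 1, theta is (n+1) x 1
function J = regularizedCost(X, y, theta, lambda)
  m = length(y);                          % number of training examples
  h = X * theta;                          % h_theta(x) for every example
  sq_err = sum((h - y) .^ 2);             % sum of squared errors
  reg = lambda * sum(theta(2:end) .^ 2);  % penalty; theta(1) (= theta_0) is skipped
  J = (sq_err + reg) / (2 * m);           % J = 1/(2m) [ squared errors + lambda * sum theta_j^2 ]
end
```

Remember that Octave indexing starts at 1, so `theta(1)` corresponds to $\theta_0$ and is left out of the penalty.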

Regularized Linear Regression

Now we plug the regularization term into linear regression and logistic regression,

starting with gradient descent for linear regression.

Linear regression - gradient descent

We update every $\theta_j$, but $\theta_0$ keeps its original (unregularized) update:

$$\begin{aligned}
&\text{Repeat \{}\\
&\quad \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m\left(h_\theta(x^{(i)})- y^{(i)}\right)x_0^{(i)}\\
&\quad \theta_j := \theta_j - \alpha \left[\left(\frac{1}{m} \sum_{i=1}^m\left(h_\theta(x^{(i)})- y^{(i)}\right)x_j^{(i)}\right)+\frac{\lambda}{m}\theta_j\right] \qquad j \in \{1, 2, \dots, n\}\\
&\text{\}}
\end{aligned}$$

The update for $\theta_j$ in the second line can be rewritten as:

$$\theta_j := \theta_j\left(1-\alpha\frac{\lambda}{m}\right)-\alpha\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

Notice that $1-\alpha\frac{\lambda}{m}$ is a positive number slightly less than 1 (for the usual small values of $\alpha$ and $\lambda$),

so the first term shrinks $\theta_j$ by a fixed factor on every iteration,

while the second term is exactly the same as in the original gradient descent.
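
As a rough vectorized sketch of one such update step in Octave (assuming `X`, `y`, `theta`, `alpha`, and `lambda` are already defined, with `theta(1)` playing the role of $\theta_0$):

```octave
% Sketch: one regularized gradient-descent step for linear regression
m = length(y);
h = X * theta;                         % predictions for all m examples
grad = (X' * (h - y)) / m;             % unregularized gradient, one entry per theta_j
reg = (lambda / m) * theta;            % lambda/m * theta_j term
reg(1) = 0;                            % theta_0 is not regularized
theta = theta - alpha * (grad + reg);  % simultaneous update of all parameters
```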

Linear regression - Normal equation

To add regularization to the normal equation,

we simply add a term $\lambda \cdot L$, where $L$ is a matrix that looks like the identity matrix except that its first entry is 0.

$L$ is an $(n+1)\times(n+1)$ matrix (one row and column per parameter, including $\theta_0$).

The single 0 is there because we do not want to touch the value of $\theta_0$.

$$\theta = \left(X^TX + \lambda \cdot \begin{bmatrix} 0 \\ & 1 \\ && 1 \\ &&& \ddots \\ &&&& 1 \end{bmatrix} \right)^{-1}X^Ty$$

Plugging regularization into the normal equation has another benefit.

We saw earlier that when there are more features than training examples ($m \le n$),

$X^TX$ may be non-invertible.

But with the extra term $\lambda \cdot L$ (and $\lambda > 0$),

$X^TX + \lambda \cdot L$ is guaranteed to be invertible!
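
A minimal Octave sketch of this regularized normal equation, again assuming `X` is the m x (n+1) design matrix and `y` the label vector:

```octave
% Sketch: regularized normal equation
n = size(X, 2) - 1;                        % number of features (X already has the bias column)
L = eye(n + 1);                            % (n+1) x (n+1) identity
L(1, 1) = 0;                               % zero the entry for theta_0
theta = (X' * X + lambda * L) \ (X' * y);  % solve the regularized system directly
```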

Regularized Logistic Regression

We apply regularization to logistic regression in the same way,

turning the overfitted decision boundary (the blue curve in the course figure) into a more reasonable one (the purple curve).

Again, we take the logistic cost function

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right) + (1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right]$$

and append the regularization term at the end, giving:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right) + (1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2$$

Note that, as before, we do not want to regularize $\theta_0$,

so the regularization summation starts at $j = 1$!

Plugging this into gradient descent gives:

$$\begin{aligned}
&\text{Repeat \{}\\
&\quad \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m\left(h_\theta(x^{(i)})- y^{(i)}\right)x_0^{(i)}\\
&\quad \theta_j := \theta_j - \alpha \left[\left(\frac{1}{m} \sum_{i=1}^m\left(h_\theta(x^{(i)})- y^{(i)}\right)x_j^{(i)}\right)+\frac{\lambda}{m}\theta_j\right] \qquad j \in \{1, 2, \dots, n\}\\
&\text{\}}
\end{aligned}$$

This looks identical to the linear regression update,

except that now $h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}$ (the sigmoid), as in the sketch below.
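
For a vectorized implementation, only the way the hypothesis is computed changes; a small Octave sketch of the sigmoid hypothesis:

```octave
% Sketch: sigmoid hypothesis for logistic regression
function h = hypothesis(X, theta)
  h = 1 ./ (1 + exp(-X * theta));  % element-wise sigmoid of X * theta
end
```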

Advanced optimization

When implementing advanced optimization for logistic regression in Octave,

how should this code incorporate regularization?

$$\begin{aligned}
&\text{function [jVal, gradient] = costFunction(theta)} \\
&\quad \text{jVal = [code to compute } J(\theta)\text{];} \\
&\qquad \rightarrow J(\theta) = -\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right) + (1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2 \\
&\quad \text{gradient(1) = [code to compute } \tfrac{d}{d\theta_0}J(\theta)\text{];} \\
&\qquad \rightarrow \frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})- y^{(i)}\right)x_0^{(i)} \\
&\quad \text{gradient(2) = [code to compute } \tfrac{d}{d\theta_1}J(\theta)\text{];} \\
&\qquad \rightarrow \frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})- y^{(i)}\right)x_1^{(i)} + \frac{\lambda}{m}\theta_1 \\
&\quad \text{gradient(3) = [code to compute } \tfrac{d}{d\theta_2}J(\theta)\text{];} \\
&\qquad \rightarrow \frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})- y^{(i)}\right)x_2^{(i)} + \frac{\lambda}{m}\theta_2 \\
&\quad \vdots \\
&\quad \text{gradient(n+1) = [code to compute } \tfrac{d}{d\theta_n}J(\theta)\text{];}
\end{aligned}$$
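
A vectorized Octave sketch of such a costFunction is given below. Unlike the single-argument version above, this sketch passes `X` (the m x (n+1) design matrix), `y`, and `lambda` in explicitly; that calling convention is an assumption, not part of the original notes.

```octave
% Sketch: regularized logistic-regression cost and gradient, for use with fminunc
function [jVal, gradient] = costFunction(theta, X, y, lambda)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                            % sigmoid hypothesis
  jVal = (-1 / m) * (y' * log(h) + (1 - y)' * log(1 - h)) ...
         + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);      % regularized cost
  gradient = (X' * (h - y)) / m;                             % unregularized gradient
  gradient(2:end) = gradient(2:end) + (lambda / m) * theta(2:end);  % add penalty, skipping theta_0
end
```

It could then be minimized with fminunc, for example:

```octave
options = optimset('GradObj', 'on', 'MaxIter', 400);
initialTheta = zeros(size(X, 2), 1);
[optTheta, costVal] = fminunc(@(t) costFunction(t, X, y, lambda), initialTheta, options);
```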