Advice for Applying Machine Learning


Evaluating a Learning Algorithm

Debugging a learning algorithm

When the cost of your trained hypothesis is unacceptably high no matter what you do, what should you try next?

  • Get more training examples

  • Try a smaller set of features

  • Try additional features

  • Try adding polynomial features

  • Increase $\lambda$

  • Decrease $\lambda$

If you just pick one of these remedies at random, you could easily spend months on a fix that does not address the real problem.

Instead, we should run a machine learning diagnostic.

A diagnostic can take quite a lot of time to implement,

but it gives us guidance and insight into what the learning algorithm is actually doing.

Evaluating a Hypothesis

To evaluate a hypothesis and guard against overfitting,

we split the training examples into two sets:

70% as the training set and 30% as the test set (ideally the split is random).

The learning procedure now becomes:

  1. Learn the $\Theta$ that minimizes $J_\text{train}(\Theta)$ using the training set

  2. Compute the corresponding test set error $J_\text{test}(\Theta)$

In linear regression, the test set error is:

$$
J_\text{test}(\Theta) = \frac{1}{2m_\text{test}}\sum_{i=1}^{m_\text{test}}\left(h_\Theta(x_\text{test}^{(i)}) - y_\text{test}^{(i)}\right)^2
$$

In logistic regression, we instead define the misclassification error (0/1 misclassification error):

$$
err(h_\Theta(x), y) =
\begin{cases}
1 & \text{if } h_\Theta(x) \ge 0.5 \text{ and } y = 0,\ \text{or } h_\Theta(x) < 0.5 \text{ and } y = 1 \\
0 & \text{otherwise}
\end{cases}
$$

The average test error then tells us what fraction of the test set was misclassified:

$$
\text{Test Error} = \frac{1}{m_\text{test}}\sum_{i=1}^{m_\text{test}} err\left(h_\Theta(x_\text{test}^{(i)}),\, y_\text{test}^{(i)}\right)
$$
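
To make the 70/30 evaluation concrete, here is a minimal numpy sketch (the synthetic data and variable names are illustrative assumptions, not from the course): fit ordinary least squares on the training split and report the squared-error cost on both splits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y is a noisy linear function of x
m = 100
x = rng.uniform(0, 10, m)
y = 2.0 + 1.5 * x + rng.normal(0, 1, m)

# Random 70/30 split into training and test sets
perm = rng.permutation(m)
train, test = perm[:70], perm[70:]

X = np.column_stack([np.ones(m), x])                  # add the intercept term
theta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

def cost(theta, X, y):
    """Squared-error cost J = (1 / 2m) * sum((h(x) - y)^2)."""
    r = X @ theta - y
    return r @ r / (2 * len(y))

print("J_train =", cost(theta, X[train], y[train]))
print("J_test  =", cost(theta, X[test], y[test]))

# For a classifier, the 0/1 misclassification test error would instead be
# np.mean(predictions != y_test), i.e. the fraction of the test set misclassified.
```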

Model Selection

To take the overfitting problem a step further, we can use model selection:

list several models of different polynomial degree and test them all at once.

$$
\begin{aligned}
&d=1, && h_\theta(x) = \theta_0 + \theta_1x && \rightarrow \theta^{(1)} \rightarrow J_\text{test}(\theta^{(1)}) \\
&d=2, && h_\theta(x) = \theta_0 + \theta_1x + \theta_2x^2 && \rightarrow \theta^{(2)} \rightarrow J_\text{test}(\theta^{(2)}) \\
&& \vdots \\
&d=10, && h_\theta(x) = \theta_0 + \theta_1x + \cdots + \theta_{10}x^{10} && \rightarrow \theta^{(10)} \rightarrow J_\text{test}(\theta^{(10)})
\end{aligned}
$$

First compute $\theta$ for each model on the training data,

then plug every $\theta$ into $J_\text{test}(\theta)$ and pick the model with the lowest error.

The problem is that we have now used the test set prematurely, as the data for choosing the model.

Are we really going to use the same test set again for the final evaluation?

Cross Validation (CV) Set

To fix this, we split the data into three parts instead,

adding a validation set that serves as the data for model selection:

  • Training set: 60%

  • Validation set: 20%

  • Test set: 20%

We now go through three steps and compute separate error values for the train, CV, and test sets:

  1. Use the training set to find the best $\theta$ for each degree model

  2. Use the validation set to find the degree model with the lowest error

    • d = degree of the polynomial with the lowest validation error

  3. Run the final evaluation of the chosen model on the test set, $J_\text{test}(\theta^{(d)})$

This way, the problem of "peeking" at the test set no longer occurs!
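
Here is a minimal numpy sketch of the three-step procedure, assuming polynomial models on synthetic data (the degrees, split sizes, and names are illustrative): fit each degree on the training split, choose the degree by validation error, and only then touch the test set.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: a noisy cubic
m = 200
x = rng.uniform(-3, 3, m)
y = 1 - 2 * x + 0.5 * x**3 + rng.normal(0, 2, m)

# 60/20/20 split into train / validation / test
perm = rng.permutation(m)
tr, cv, te = perm[:120], perm[120:160], perm[160:]

def cost(coeffs, x, y):
    """Squared-error cost J = (1 / 2m) * sum((h(x) - y)^2)."""
    r = np.polyval(coeffs, x) - y
    return r @ r / (2 * len(y))

# 1. Fit theta for each degree on the training set only
models = {d: np.polyfit(x[tr], y[tr], d) for d in range(1, 11)}

# 2. Pick the degree with the lowest validation error
best_d = min(models, key=lambda d: cost(models[d], x[cv], y[cv]))

# 3. Report the generalization error on the untouched test set
print("chosen degree:", best_d)
print("J_cv  :", cost(models[best_d], x[cv], y[cv]))
print("J_test:", cost(models[best_d], x[te], y[te]))
```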

Bias vs. Variance

To recognize whether a given degree underfits or overfits,

we first need to know what bias and variance are:

high bias simply means underfitting, and high variance means overfitting.

We know that, overfitting or not, the training error keeps decreasing as the degree increases.

Because of overfitting, however, the validation and test errors, computed on data that was not in the training set,

eventually start to increase as the degree grows too large.

  • As d increases:

    • training error decreases

    • validation error eventually increases

    • test error eventually increases

From this pattern we can tell whether the cost function suffers from high bias or from high variance:

High bias (underfit): $J_\text{train}(\theta)$ and $J_\text{CV}(\theta)$ are both high, and $J_\text{train}(\theta) \approx J_\text{CV}(\theta)$

High variance (overfit): $J_\text{train}(\theta)$ is low, but $J_\text{CV}(\theta)$ is much higher
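
As a rough illustration of these two signatures, the sketch below (synthetic cubic data, illustrative degrees) fits a degree-1 and a degree-12 polynomial and prints $J_\text{train}$ and $J_\text{CV}$ for each; the degree-1 fit should show both errors high and close together (high bias), while the degree-12 fit typically shows a low training error with a much larger CV error (high variance).

```python
import numpy as np

rng = np.random.default_rng(2)

# Data generated from a cubic, split into train and CV
x = rng.uniform(-3, 3, 60)
y = 1 - 2 * x + 0.5 * x**3 + rng.normal(0, 2, 60)
x_tr, y_tr = x[:40], y[:40]
x_cv, y_cv = x[40:], y[40:]

def cost(coeffs, x, y):
    r = np.polyval(coeffs, x) - y
    return r @ r / (2 * len(y))

for d in (1, 12):                       # underfitting degree vs. overfitting degree
    theta = np.polyfit(x_tr, y_tr, d)
    print(f"degree {d:2d}: J_train = {cost(theta, x_tr, y_tr):7.2f}, "
          f"J_cv = {cost(theta, x_cv, y_cv):7.2f}")
```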

Regularization with bias/variance

We know that regularization can fix overfitting.

But how should we set $\lambda$? Can we find the best $\lambda$ automatically?

Observe how the fit changes for different values of $\lambda$:

  • $\lambda = 10000$: every $\theta \approx 0$, so the model becomes high bias (underfit)

  • $\lambda = 0$: there is effectively no regularization, so the model becomes high variance (overfit)

In other words, the smaller $\lambda$ is, the lower the training cost (overfit) and therefore the higher the CV cost;

the larger $\lambda$ is, the higher the training cost (underfit), so the CV cost stays high as well.

We can search for the best $\lambda$ with a procedure much like model selection (a minimal sketch follows this list):

  1. Define a list of $\lambda$ values (for example doubling each time: 0, 0.01, 0.02, 0.04, ..., 10.24)

  2. For each $\lambda$, learn $\min_\theta J(\theta)$ to obtain a different $\theta$

  3. Evaluate each learned $\theta$ on the CV cost function $J_\text{CV}(\theta)$ computed without regularization

  4. Pick the model with the lowest error in the CV test

  5. Apply the best $\lambda$ and $\theta$ to $J_\text{test}(\theta)$ for the final check
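
Below is a minimal numpy sketch of that search, assuming regularized polynomial regression solved with the normal equation $(X^\top X + \lambda I)^{-1} X^\top y$ on synthetic data (the feature map, split, and $\lambda$ list are illustrative, and for simplicity the intercept is regularized too, unlike the usual convention). The CV cost used for the comparison contains no regularization term.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data and a fixed degree-8 polynomial feature map
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + rng.normal(0, 0.2, 60)

def features(x, degree=8):
    return np.column_stack([x**p for p in range(degree + 1)])  # includes the intercept column

X, Y = features(x), y
X_tr, Y_tr = X[:40], Y[:40]
X_cv, Y_cv = X[40:], Y[40:]

def fit(X, y, lam):
    """Regularized normal equation: (X'X + lam * I)^-1 X'y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

def cost(theta, X, y):
    """Unregularized squared-error cost, used for the CV comparison."""
    r = X @ theta - y
    return r @ r / (2 * len(y))

lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]
thetas = {lam: fit(X_tr, Y_tr, lam) for lam in lambdas}
best_lam = min(lambdas, key=lambda lam: cost(thetas[lam], X_cv, Y_cv))
print("best lambda:", best_lam, " J_cv:", cost(thetas[best_lam], X_cv, Y_cv))
```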

Learning Curves

A tool we can use to check for bias or variance is the learning curve.

Suppose we have a hypothesis that is a fixed quadratic curve,

and we train it on m = 1, 2, 3, ... training examples.

At first the training error is very small, but as the size m grows the error becomes large,

because a single quadratic curve has trouble fitting more and more data.

High bias experience

High bias means underfitting.

  • When the training set is small:

    • $J_\text{train}(\theta)$ is very small (the model was trained on these few points)

    • $J_\text{CV}(\theta)$ is very large (it is data the model never saw, and the model has learned from only a handful of examples)

  • When the training set is large:

    • $J_\text{train}(\theta)$ keeps growing (because of the underfit)

    • $J_\text{CV}(\theta)$ decreases, but stays large (again because of the underfit)

    • $J_\text{train}(\theta)$ and $J_\text{CV}(\theta)$ converge

So if the hypothesis has a high bias problem,

the learning curves end up looking like this:

  • both curves flatten out, and both stay worse than the desired performance

In other words, when high bias is the problem, getting more training data will not (by itself) help much.

High variance experience

High variance means overfitting.

  • When the training set is small:

    • same as in the high bias case:

      • $J_\text{train}(\theta)$ is very small (the model was trained on these few points)

      • $J_\text{CV}(\theta)$ is very large (it is unseen data, and the model has learned from only a few examples)

  • When the training set is large:

    • $J_\text{train}(\theta)$ keeps growing, but this growth is a good sign (the larger the training set, the harder it becomes to keep overfitting it)

    • $J_\text{CV}(\theta)$ keeps decreasing and gets closer and closer to the desired performance

    • $J_\text{train}(\theta)$ and $J_\text{CV}(\theta)$ likewise converge

So when there is a high variance problem, as the training set grows

the learning curves end up looking like this:

  • the two curves converge toward the desired performance

So when a high variance problem occurs, increasing the training set size is likely to be a good remedy.
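
A minimal numpy sketch of a learning curve, assuming the fixed quadratic hypothesis from above and synthetic data (names and sizes are illustrative): for each training-set size m, fit on the first m examples and record $J_\text{train}$ on those m points and $J_\text{CV}$ on a held-out set.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data that a quadratic cannot fit perfectly
x = rng.uniform(-3, 3, 120)
y = np.sin(2 * x) + 0.3 * x**2 + rng.normal(0, 0.3, 120)
x_tr, y_tr = x[:80], y[:80]
x_cv, y_cv = x[80:], y[80:]

def cost(coeffs, x, y):
    r = np.polyval(coeffs, x) - y
    return r @ r / (2 * len(y))

# Learning curve for a fixed degree-2 hypothesis
for m in range(3, 81, 11):
    theta = np.polyfit(x_tr[:m], y_tr[:m], 2)       # train on the first m examples only
    print(f"m = {m:2d}  J_train = {cost(theta, x_tr[:m], y_tr[:m]):.3f}  "
          f"J_cv = {cost(theta, x_cv, y_cv):.3f}")
```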

Summary

So once you have diagnosed and located the problem, you can apply the appropriate fix:

  • When you have a high bias problem

    • add features

    • add polynomial features $(x_1x_2, x_1^2, x_2^2, \cdots)$

    • decrease $\lambda$

  • When you have a high variance problem

    • add more training examples

    • try a smaller set of features

    • increase $\lambda$ (i.e. correct the overfitting through regularization)

Diagnose Neural Networks

  • Small neural networks tend to underfit

    • but they are computationally cheaper

  • Large neural networks tend to overfit

    • but they are computationally more expensive

By default a neural network uses a single hidden layer,

but you can also try several hidden layers and use a cross validation set to pick the best architecture.
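
One possible way to run that comparison is sketched below with scikit-learn's MLPClassifier (the dataset, candidate layer sizes, and split are illustrative assumptions, not part of the original notes): train a few architectures and keep the one with the best validation accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic classification data, split into train / validation
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Candidate architectures: one small hidden layer up to two larger ones
candidates = [(5,), (25,), (50, 50)]

scores = {}
for hidden in candidates:
    net = MLPClassifier(hidden_layer_sizes=hidden, max_iter=2000, random_state=0)
    net.fit(X_tr, y_tr)
    scores[hidden] = net.score(X_val, y_val)   # validation accuracy
    print(hidden, "validation accuracy:", round(scores[hidden], 3))

best = max(scores, key=scores.get)
print("selected architecture:", best)
```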

Model Complexity Effects

  • When the hypothesis has a low polynomial degree

    • it has high bias and low variance

    • it fits both the training set and the test set poorly

  • When the hypothesis has a high polynomial degree

    • it has low bias on the training set but high variance

    • it fits the training set perfectly

    • but fits the test set poorly

The model we want is somewhere in between, one that fits all the sets reasonably well!
