Logistic Regression Model

Cost Function

When we use logistic regression for classification, the cost function cannot be the same one used for linear regression:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^m \frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

The reason is as follows. Abbreviate the term inside the sum as:

$$\text{Cost}(h_\theta(x), y) = \frac{1}{2}\left(h_\theta(x) - y\right)^2$$

For logistic regression, h(x) is the sigmoid, so substituting it in gives:

$$\text{Cost}(h_\theta(x), y) = \frac{1}{2}\left(\frac{1}{1 + e^{-\theta^T x}} - y\right)^2$$

This complicated nonlinear function makes the cost function non-convex: it has many local optima, so gradient descent may fail to converge to the global minimum.

So logistic regression uses a different cost function:

$$
\begin{aligned}
J(\theta) &= \frac{1}{m}\sum_{i=1}^m \text{Cost}(h_\theta(x^{(i)}), y^{(i)})\\
\text{Cost}(h_\theta(x), y) &= -\log(h_\theta(x)) && \text{if } y = 1\\
\text{Cost}(h_\theta(x), y) &= -\log(1 - h_\theta(x)) && \text{if } y = 0
\end{aligned}
$$

Just as the logistic function was used to express the classification hypothesis, here we use the shape of the log function, with a negative sign, to define the cost.

When y = 1 and h(x) is also 1, the prediction is exactly right, so the cost should be 0; when h(x) is 0 instead, the prediction is completely wrong, so the cost goes to ∞.

Likewise, when y = 0 and h(x) is also 0, the prediction is exactly right and the cost is 0; when h(x) is 1, the prediction is completely wrong and the cost again goes to ∞.

We now have a cost function J that is convex, so gradient descent is guaranteed to converge to its global minimum:

$$
\begin{aligned}
&\text{Cost}(h_\theta(x), y) = 0 \text{ if } h_\theta(x) = y\\
&\text{Cost}(h_\theta(x), y) \rightarrow \infty \text{ if } y = 0 \text{ and } h_\theta(x) \rightarrow 1\\
&\text{Cost}(h_\theta(x), y) \rightarrow \infty \text{ if } y = 1 \text{ and } h_\theta(x) \rightarrow 0
\end{aligned}
$$
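
As a quick numerical check of these limits, the following Octave snippet (just a sketch; the h values are arbitrary) evaluates the per-example cost at a few hypothesis values:

% Per-example cost at a few hypothesis values (illustrative numbers only)
h = [0.99 0.5 0.01];
cost_if_y1 = -log(h)       % ~ [0.01 0.69 4.61]: near 0 when h -> 1, blows up as h -> 0
cost_if_y0 = -log(1 - h)   % the mirror image for y = 0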

Simplified Cost Function and Gradient Descent

The two-case cost function can be compressed into a single line:

$$\text{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$$
  • When y = 1, the second term cancels out:

    $$\text{Cost}(h_\theta(x), y) = -\log(h_\theta(x))$$
  • When y = 0, the first term cancels out:

    $$\text{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x))$$

So the complete cost function J is:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^m \left[y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))\right]$$

This can also be written in vectorized form:

$$
\begin{aligned}
h &= g(X\theta)\\
J(\theta) &= \frac{1}{m}\left(-y^T\log(h) - (1 - y)^T\log(1 - h)\right)
\end{aligned}
$$
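
As a sanity check, here is a minimal Octave sketch of this vectorized cost; the design matrix X, labels y, and θ below are made-up placeholders just to show the shapes:

% Vectorized logistic regression cost (placeholder data)
sigmoid = @(z) 1 ./ (1 + exp(-z));     % g(z)
X = [1 1; 1 2; 1 3];                   % m x (n+1) design matrix, first column is the bias term
y = [0; 0; 1];                         % labels in {0, 1}
theta = zeros(2, 1);                   % initial parameters

h = sigmoid(X * theta);                                        % h = g(X*theta)
J = (1 / length(y)) * (-y' * log(h) - (1 - y)' * log(1 - h))   % prints ~0.6931 for theta = 0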

As before, we can use gradient descent to find the θ that minimizes the cost function. The generic gradient descent update looks like this:

$$
\begin{aligned}
&\text{Repeat } \{\\
&\quad \theta_j := \theta_j - \alpha \frac{d}{d\theta_j}J(\theta)\\
&\}
\end{aligned}
$$

Substituting the derivative of the new cost function J gives:

$$
\begin{aligned}
&\text{Repeat } \{\\
&\quad \theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}_j\\
&\}
\end{aligned}
$$

This new gradient descent update looks exactly the same as the one for linear regression, but the hypothesis h(x) inside it is different:

$$
\begin{aligned}
\text{Old: } h(x) &= \theta^T x\\
\text{New: } h(x) &= \frac{1}{1 + e^{-\theta^T x}}
\end{aligned}
$$

All θ_j must still be updated simultaneously within each iteration; the vectorized implementation is:

$$\theta := \theta - \alpha \frac{1}{m} X^T (g(X\theta) - \vec{y})$$
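
Put together, a minimal batch gradient descent loop in Octave might look like the sketch below; the learning rate and iteration count are arbitrary, and X, y, and sigmoid are the placeholders from the cost sketch above:

% Batch gradient descent for logistic regression (sketch; reuses X, y, sigmoid from above)
alpha = 0.1;                   % learning rate, chosen arbitrarily for this sketch
m = length(y);
theta = zeros(size(X, 2), 1);

for iter = 1:1000
    grad  = (1 / m) * X' * (sigmoid(X * theta) - y);   % (1/m) * X' * (g(X*theta) - y)
    theta = theta - alpha * grad;                      % simultaneous update of every theta_j
end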

Advanced Optimization

There are more advanced algorithms for optimizing θ, for example:

  • Conjugate gradient

  • BFGS

  • L-BFGS

These methods are usually faster and more efficient than a hand-written gradient descent, and they do not require manually choosing a learning rate. They are quite complex to implement, but we can run them directly through existing library routines.

In Octave we call fminunc() to do this.

We need to provide code that evaluates the following two quantities:

$$J(\theta) \quad \text{and} \quad \frac{d}{d\theta_j}J(\theta)$$

Here we define a single costFunction that computes both at once:

function [jVal, gradient] = costFunction(theta)
    jVal = [...code to compute J(theta)...];
    gradient = [...code to compute derivative of J(theta)...];
end

Next we provide an optimset structure (the options for the function) together with an initial value of θ.

The Octave code is as follows:

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);

[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);

Example

A simple example: find the θ that minimizes the cost function below (by inspection, both components are obviously 5).

$$
\begin{aligned}
\theta &= \begin{bmatrix}\theta_1\\\theta_2\end{bmatrix}\\
J(\theta) &= (\theta_1 - 5)^2 + (\theta_2 - 5)^2\\
\frac{d}{d\theta_1}J(\theta) &= 2(\theta_1 - 5)\\
\frac{d}{d\theta_2}J(\theta) &= 2(\theta_2 - 5)
\end{aligned}
$$

Implementation:

  • Cost function

function [jVal, gradient] = costFunction(theta)
    jVal = (theta(1)-5)^2 + (theta(2)-5)^2;
    gradient = zeros(2, 1);
    gradient(1) = 2*(theta(1)-5);
    gradient(2) = 2*(theta(2)-5);
end
  • Calling fminunc

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);

[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
  • Output

optTheta =
    5.0000
    5.0000

functionVal = 1.5777e-030  % i.e. the cost is essentially 0
exitFlag = 1               % indicates that the optimization converged

Multiclass Classification: One-vs-all

When a classification problem has more than two classes, i.e. y = {0, 1} is extended to y = {0, 1, ..., n}, we use the one-vs-all (one-vs-rest) approach.

The idea is to run one binary classification per class: each run focuses on a single class and treats all the remaining classes as one combined negative class.

$$
\begin{aligned}
&y \in \{0, 1, \cdots, n\}\\
&h_\theta^{(0)}(x) = P(y = 0 \mid x; \theta)\\
&h_\theta^{(1)}(x) = P(y = 1 \mid x; \theta)\\
&h_\theta^{(2)}(x) = P(y = 2 \mid x; \theta)\\
&\quad\vdots\\
&h_\theta^{(n)}(x) = P(y = n \mid x; \theta)\\
&\text{prediction} = \max_{i}\left(h_\theta^{(i)}(x)\right)
\end{aligned}
$$

Then, for each example, we check which of the h(x) classifiers outputs the highest probability; that classifier's class is the prediction. For instance, when a new example comes in and we want to know whether it is class 1, 2, or 3, we run it through all three h(x) and pick the class with the highest probability.
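
A rough Octave sketch of one-vs-all training and prediction is shown below. It reuses fminunc in the same style as above; lrCostFunction is a hypothetical helper that returns [jVal, gradient] for binary logistic regression on (X, y_c), and the class labels are assumed to be 1..num_labels:

% One-vs-all sketch: lrCostFunction is a hypothetical binary logistic regression
% cost function in the same style as costFunction above, not an Octave built-in.
num_labels = 3;
all_theta = zeros(num_labels, size(X, 2));
initial_theta = zeros(size(X, 2), 1);
options = optimset('GradObj', 'on', 'MaxIter', 100);

for c = 1:num_labels
    y_c = (y == c);      % class c becomes the positive class, everything else is negative
    all_theta(c, :) = fminunc(@(t) lrCostFunction(t, X, y_c), initial_theta, options)';
end

% Prediction: evaluate every classifier and pick the class with the highest probability
probs = sigmoid(X * all_theta');       % m x num_labels, column c holds h^(c)(x)
[~, prediction] = max(probs, [], 2);   % index of the winning class for each example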
