Logistic Regression as a Neural Network


Binary Classification

We start by feeding a 64x64 image into a neural network to decide whether it is a cat.

1 means it is a cat, 0 means it is not.

How do we feed it in? We unroll the 64x64 pixels of each of the three RGB channels into one very long feature vector x.

That gives $64 \times 64 \times 3 = 12288$ features in total.

$$x = \begin{bmatrix} 255\\ 231\\ \vdots\\ 255\\ 134\\ \vdots\\ 255\\ 134\\ \vdots \end{bmatrix}$$

We use $n_x$ to denote the dimension of the feature vector x, so $n = n_x = 12288$.
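
A minimal NumPy sketch of this flattening step; the random image below is an illustrative stand-in, and any fixed ordering of the 12288 values works as long as it is used consistently:

import numpy as np

# hypothetical 64x64 RGB image with integer pixel values in [0, 255]
image = np.random.randint(0, 256, size=(64, 64, 3))

# unroll all pixels of all three channels into one long column vector
x = image.reshape(-1, 1)

print(x.shape)  # (12288, 1) -> n_x = 12288 features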

Notation

Each training example is a pair $(x, y)$ with

$$x \in \mathbb{R}^{n_x}, \quad y \in \{0, 1\}$$

There are m training examples in total:

$$\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})\}$$

We also stack all the training examples and all the labels into two large matrices.

This convention is slightly different from the one used in the Machine Learning course.

In the matrix $X$, each column $i$ is one training example $x^{(i)}$, and each row $j$ holds the $j$-th feature of every $x$, so $X$ has shape $(n_x, m)$.

In the matrix $Y$, each column $i$ is one result label $y^{(i)}$, so $Y$ has shape $(1, m)$.
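
A small sketch of this stacking, assuming a list of (x, y) pairs is already available; the sizes and random data are only for illustration:

import numpy as np

n_x, m = 12288, 100  # assumed sizes for illustration
examples = [(np.random.rand(n_x, 1), np.random.randint(0, 2)) for _ in range(m)]

# place the column vectors x^(i) side by side, and the labels y^(i) into one row
X = np.hstack([x for x, _ in examples])   # shape (n_x, m)
Y = np.array([[y for _, y in examples]])  # shape (1, m)

print(X.shape, Y.shape)  # (12288, 100) (1, 100)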

Logistic Regression

We first define $x$ and $\hat{y}$:

$$\text{Given } x \in \mathbb{R}^{n_x}, \text{ want } \hat{y} = P(y=1 \mid x), \quad 0 \le \hat{y} \le 1$$

The parameters are $w \in \mathbb{R}^{n_x}$ and $b \in \mathbb{R}$.

This $w$ is essentially the $\theta$ from before, a vector,

and $b$ is the bias term $\theta_0$ from before, a real number.

The output $\hat{y}$ we are after is the old $h_\theta(x)$,

where $\sigma$ is the sigmoid function $g$ from before:

$$\hat{y} = \sigma(w^T x + b)$$
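
A minimal sketch of this forward computation; the dimension, weights, and input below are made-up values:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x = 4                      # small illustrative dimension
w = np.zeros((n_x, 1))       # weight vector, plays the role of theta
b = 0.0                      # bias, plays the role of theta_0
x = np.random.rand(n_x, 1)   # one input example

y_hat = sigmoid(w.T @ x + b)  # predicted probability that y = 1
print(y_hat.item())           # 0.5 when w and b are all zeros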

Loss Function

We define a loss function that measures the error on a single training example:

$$\mathcal{L}(\hat{y}, y) = -\big(y \log \hat{y} + (1-y) \log (1-\hat{y})\big)$$

Here is an intuition for why the loss function can be defined this way:

$$\begin{aligned} &\text{If } y = 1 : && \mathcal{L}(\hat{y}, y) = -\log \hat{y} \\ &\text{If } y = 0 : && \mathcal{L}(\hat{y}, y) = -\log(1 - \hat{y}) \end{aligned}$$

That is, when $y = 1$ we want $\hat{y}$ to be as large as possible;

the larger $\hat{y}$ is, the closer it is to $y$, and the smaller the loss, so $-\log \hat{y} \approx 0$.

When $y = 0$, we want $\hat{y}$ to be as small as possible,

so that $-\log(1 - \hat{y})$ is close to 0.
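
A quick numeric check of that intuition; the probabilities 0.9 and 0.1 are arbitrary example values:

import numpy as np

def loss(y_hat, y):
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(loss(0.9, 1))  # ~0.105: y = 1 and y_hat is large -> small loss
print(loss(0.1, 1))  # ~2.303: y = 1 but y_hat is small -> large loss
print(loss(0.1, 0))  # ~0.105: y = 0 and y_hat is small -> small loss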

Cost Function

The cost function is the average of the loss function over the entire training set.

We collect the loss applied to each training example into the cost function $J(w, b)$:

$$\begin{aligned} J(w,b) &= \frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) \\ &= -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log\hat{y}^{(i)} + (1- y^{(i)}) \log(1-\hat{y}^{(i)})\Big] \end{aligned}$$

The superscript $i$ denotes the $i$-th training example.
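
A vectorized sketch of this average, reusing the loss above on hypothetical predictions and labels:

import numpy as np

# hypothetical predictions and labels for m = 4 examples
Y_hat = np.array([[0.9, 0.2, 0.7, 0.4]])
Y     = np.array([[1,   0,   1,   0  ]])

losses = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
J = np.mean(losses)  # cost = average loss over the m examples
print(J)             # about 0.3 for these made-up values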

Gradient Descent

The gradient descent here is largely the same as what we learned in the Machine Learning course,

except that now $b$ is updated together with $w$:

$$\begin{aligned} \text{Repeat: } \quad & w := w - \alpha \frac{\partial J(w, b)}{\partial w}\\ & b := b - \alpha \frac{\partial J(w, b)}{\partial b} \end{aligned}$$
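
One update step as a sketch, assuming the gradients dw and db have already been computed (all values below are illustrative):

import numpy as np

alpha = 0.01                     # learning rate (illustrative value)
w, b = np.zeros((2, 1)), 0.0     # toy parameters
dw = np.array([[0.3], [-0.1]])   # assumed gradient dJ/dw
db = 0.05                        # assumed gradient dJ/db

w = w - alpha * dw
b = b - alpha * db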

Derivatives intuition

This section covers the bit of calculus that gradient descent needs.

It really just amounts to finding the slope of the cost function at the current point.

The slope is simply $\frac{\text{height}}{\text{width}}$, so for a linear function $f(a) = 3a$:

$$\begin{aligned} &a = 2 && f(a) = 6\\ &a = 2.001 && f(a) = 6.003\\ &\text{Slope of } f(a) \text{ at } a = 2 \text{ is } \frac{0.003}{0.001} = 3 = \frac{\partial f(a)}{\partial a} \end{aligned}$$

Wherever you nudge $a$, the computed slope is 3.
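
The same nudge-and-measure check in a few lines of Python; the step size 0.001 follows the example above:

def f(a):
    return 3 * a

a, eps = 2, 0.001
slope = (f(a + eps) - f(a)) / eps  # height / width
print(slope)  # approximately 3.0, and the same at any other a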

Changing Derivatives

Other functions have a slope that keeps changing, for example the graph of $f(a) = a^2$.

When $a = 2$:

$$\begin{aligned} &a = 2 && f(a) \approx 4\\ &a = 2.001 && f(a) \approx 4.004\\ &\text{Slope } \frac{df(a)}{da} = 4 \text{ when } a = 2 \end{aligned}$$

When $a = 5$:

$$\begin{aligned} &a = 5 && f(a) \approx 25\\ &a = 5.001 && f(a) \approx 25.010\\ &\text{Slope } \frac{df(a)}{da} = 10 \text{ when } a = 5 \end{aligned}$$

The small triangles drawn at different points give different slopes.

This is exactly what the first derivative captures.

Some common derivatives:

$$\begin{aligned} &f(a) = a^2 && \frac{d f(a)}{da} = 2a\\ &f(a) = a^3 && \frac{d f(a)}{da} = 3a^2\\ &f(a) = \ln(a) && \frac{d f(a)}{da} = \frac{1}{a} \end{aligned}$$
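
The same finite-difference check confirms the changing slope of $f(a) = a^2$, using the step size from the example above:

def f(a):
    return a ** 2

eps = 0.001
for a in (2, 5):
    slope = (f(a + eps) - f(a)) / eps
    print(a, round(slope, 3))  # ~4.001 at a = 2, ~10.001 at a = 5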

Computation Graph

Before explaining backpropagation, let's walk through the flow with a simple example.

Suppose we want to minimize $J(a,b,c) = 3(a + bc)$.

This function involves three computations:

$$\begin{aligned} u &= bc\\ v &= a+u\\ J &= 3v \end{aligned}$$

which can be drawn as a computation graph.

Backpropagation then amounts to going from right to left, taking the derivative of $J$ with respect to each quantity along the way.

Once we have them, we can run gradient descent on each of these quantities.

Why go from right to left? Because we need the chain rule.

Below, $\frac{dJ}{dvar}$ is written as $dvar$, where var stands for any variable;

for example, $\frac{dJ}{da} = da$ and $\frac{dJ}{db} = db$.

This is the common way these variables are named in Python code.

Going one step back, we get $dv = 3$,

because when $v$ is nudged slightly, $J$ changes by three times as much:

$$\begin{aligned} &v = 1 && J = 3\\ &v = 1.001 && J = 3.003 \\ &\text{Slope } = \frac{0.003}{0.001} = 3 \end{aligned}$$

Going one step further back, we get $da = 3$,

because when $a$ is nudged, $J$ likewise changes by three times as much.

But we compute it through the chain: $a$ changes $v$, and $v$ changes $J$:

$$\frac{dJ}{da} = \frac{dJ}{dv} \frac{dv}{da} = 3 \cdot \frac{dv}{da}$$

We already know $\frac{dJ}{dv}$.

Based on that, we find that when $a$ is nudged, $v$ changes by exactly the same amount (a factor of 1):

$$\begin{aligned} &a = 5 && v = 11\\ &a = 5.001 && v = 11.001 \\ &\text{Slope } = \frac{0.001}{0.001} = 1 \end{aligned}$$

So $da = \frac{dJ}{da} = 3 \cdot \frac{dv}{da} = 3 \cdot 1 = 3$.

With the same chain-rule approach we compute $du$, and then $db$ and $dc$,

which gives us the slope of $J$ with respect to every quantity.
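
A small sketch of this forward and backward pass; the concrete values b = 3 and c = 2 are assumptions chosen so that v = 11 when a = 5, as in the example above:

# forward pass through the computation graph J = 3(a + bc)
a, b, c = 5, 3, 2   # a = 5 as in the text; b and c are assumed values
u = b * c           # u = 6
v = a + u           # v = 11
J = 3 * v           # J = 33

# backward pass (right to left), writing dJ/dvar as dvar
dv = 3              # J = 3v, so nudging v changes J three times as much
da = dv * 1         # v = a + u, so dv/da = 1  ->  da = 3
du = dv * 1         # v = a + u, so dv/du = 1  ->  du = 3
db = du * c         # u = bc, so du/db = c     ->  db = 6
dc = du * b         # u = bc, so du/dc = b     ->  dc = 9

print(da, db, dc)   # 3 6 9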

Logistic Regression Gradient Descent as NN

We convert the logistic regression cost computation into a computation graph.

The derivative with respect to $a$ is $da = \frac{d\mathcal{L}(a, y)}{da} = -\frac{y}{a} + \frac{1-y}{1-a}$.

The derivative with respect to $z$ is $dz = \frac{d\mathcal{L}(a, y)}{dz} = \frac{d\mathcal{L}}{da} \cdot \frac{da}{dz} = a - y$,

where we already know $\frac{d\mathcal{L}}{da} = -\frac{y}{a} + \frac{1-y}{1-a}$,

and the new piece is $\frac{da}{dz} = a(1-a)$.

From there we can compute the derivatives with respect to $w_1$, $w_2$, and $b$:

$$dw_1 = \frac{d\mathcal{L}}{dw_1} = x_1 \cdot dz$$

$$dw_2 = \frac{d\mathcal{L}}{dw_2} = x_2 \cdot dz$$

$$db = \frac{d\mathcal{L}}{db} = dz$$

Then we run gradient descent on these three quantities:

$$w_1 := w_1 - \alpha\, dw_1$$

$$w_2 := w_2 - \alpha\, dw_2$$

$$b := b - \alpha\, db$$
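
A sketch of this single-example backward pass, using the sigmoid from earlier; the values of x, w, b, and y are made up:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# one training example with two features (illustrative values)
x1, x2, y = 1.0, 2.0, 1
w1, w2, b = 0.1, -0.2, 0.0

# forward: z, a
z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)

# backward: dz = a - y, then dw1, dw2, db
dz = a - y
dw1 = x1 * dz
dw2 = x2 * dz
db = dz

# one gradient descent step with learning rate alpha
alpha = 0.01
w1, w2, b = w1 - alpha * dw1, w2 - alpha * dw2, b - alpha * db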

Gradient Descent on m Examples

Next we put everything above together and perform one gradient descent step over all m training examples.

The pseudo code looks like this:

J = dw1 = dw2 = db = 0

For i = 1 to m:
    % forward propagation
    z(i) = w'x(i) + b  % z for each training example
    a(i) = sigmoid(z(i))  % a for each training example
    J += -[y(i) * log(a(i)) + (1 - y(i)) * log(1 - a(i))]  % accumulate the cost over all training examples

    % backward propagation
    dz(i) = a(i) - y(i)  % dz for each training example

    % with many features, this part becomes a loop as well
    dw1 += x1(i) * dz(i)  % add each example's dw1 into a running total
    dw2 += x2(i) * dz(i)  % add each example's dw2 into a running total
    db  += dz(i)  % add each example's db into a running total

% after the loop, take the average of each accumulator
J /= m
dw1 /= m
dw2 /= m
db /= m
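
For reference, a runnable NumPy sketch of the same loop, assuming X of shape (n_x, m) and Y of shape (1, m) as in the Notation section; the data here is random and only for illustration:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x, m, alpha = 2, 100, 0.01
X = np.random.rand(n_x, m)                 # columns are training examples x^(i)
Y = np.random.randint(0, 2, size=(1, m))   # labels y^(i)
w, b = np.zeros((n_x, 1)), 0.0

J, dw, db = 0.0, np.zeros((n_x, 1)), 0.0
for i in range(m):
    x_i, y_i = X[:, i:i+1], Y[0, i]
    # forward propagation
    z_i = (w.T @ x_i).item() + b
    a_i = sigmoid(z_i)
    J += -(y_i * np.log(a_i) + (1 - y_i) * np.log(1 - a_i))
    # backward propagation
    dz_i = a_i - y_i
    dw += x_i * dz_i      # stands in for the dw1, dw2, ... feature loop
    db += dz_i

J, dw, db = J / m, dw / m, db / m          # average over the m examples
w, b = w - alpha * dw, b - alpha * db      # one gradient descent step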

This approach ends up using two loops: one over the m training examples and one over the n input features (dw1, dw2, ..., dwn).

That is why we turn to the spirit of vectorization to solve this problem.
