Logistic Regression as a Neural Network

Binary Classification

To start, we want to feed a 64x64 image into a neural network (nn) to recognize whether or not it is a cat.

A label of 1 means it is a cat; 0 means it is not.

How do we feed it in? We unroll the 64x64 pixels of each of the three RGB channels into one very long feature vector $x$.

That gives $64 \times 64 \times 3 = 12288$ features in total.

$$x = \begin{bmatrix} 255 \\ 231 \\ \vdots \\ 255 \\ 134 \\ \vdots \\ 255 \\ 134 \\ \vdots \end{bmatrix}$$

We use $n_x$ to denote the dimension of the feature vector $x$, so $n = n_x = 12288$.
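
As a small illustration (a minimal NumPy sketch using a hypothetical random image, not code from the course), the unrolling can be done with a single reshape:

import numpy as np

image = np.random.randint(0, 256, size=(64, 64, 3))   # a stand-in 64x64 RGB image
x = image.reshape(-1, 1)                               # unroll into a (12288, 1) column vector
n_x = x.shape[0]                                       # n_x = 64 * 64 * 3 = 12288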

Notation

Each training example is a pair $(x, y)$ with

$$x \in \mathbb{R}^{n_x}, \quad y \in \{0, 1\}$$

There are $m$ training examples in total:

$$\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})\}$$

We then stack all the training examples and their labels into two big matrices $X$ and $Y$.

This convention is slightly different from the one used in the Machine Learning course.

In the matrix $X$:

  • Each column $i$ is a different training example $x^{(i)}$

  • Each row $j$ holds the $j$-th feature of every $x$

In the matrix $Y$:

  • Each column $i$ is the corresponding label $y^{(i)}$
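
A minimal NumPy sketch of this stacking, assuming hypothetical flattened examples of shape (n_x, 1) and integer labels:

import numpy as np

n_x, m = 12288, 3                                      # small hypothetical sizes
examples = [np.random.rand(n_x, 1) for _ in range(m)]  # m flattened examples
labels = [1, 0, 1]                                     # m labels

X = np.hstack(examples)                                # shape (n_x, m): one example per column
Y = np.array(labels).reshape(1, m)                     # shape (1, m):   one label per column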

Logistic Regression

We define $x$ and $\hat{y}$ as follows:

$$\text{Given } x \in \mathbb{R}^{n_x}, \text{ want } \hat{y} = P(y = 1 \mid x), \quad 0 \le \hat{y} \le 1$$

The parameters are $w \in \mathbb{R}^{n_x}$ and $b \in \mathbb{R}$.

This $w$ is the vector that played the role of $\theta$ before,

and $b$ is the real-valued bias, the former $\theta_0$.

The output $\hat{y}$ we want is what used to be $h_\theta(x)$,

where $\sigma$ is the same sigmoid function $g$ as before:

$$\hat{y} = \sigma(w^T x + b)$$
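
A minimal sketch of this forward computation in NumPy (the zero-initialized w and b are placeholders, not values from the course):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x = 12288
w = np.zeros((n_x, 1))                                 # placeholder parameters
b = 0.0
x = np.random.rand(n_x, 1)                             # a hypothetical example

y_hat = sigmoid(w.T @ x + b)                           # estimate of P(y = 1 | x)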

Loss Function

We define a loss function that measures the error on a single training example:

$$\mathcal{L}(\hat{y}, y) = -\big(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\big)$$

Here is an intuition for why the loss function can be defined this way:

$$\begin{aligned} &\text{If } y = 1 : && \mathcal{L}(\hat{y}, y) = -\log \hat{y} \\ &\text{If } y = 0 : && \mathcal{L}(\hat{y}, y) = -\log(1 - \hat{y}) \end{aligned}$$

When $y = 1$, we want $\hat{y}$ to be as large as possible.

The larger $\hat{y}$ is, the closer it is to $y$ and the smaller the loss, so $-\log \hat{y} \approx 0$.

When $y = 0$, we want $\hat{y}$ to be as small as possible,

so that $-\log(1 - \hat{y})$ is close to 0.
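
A quick numerical check of this intuition (a sketch with hand-picked predictions):

import numpy as np

def loss(y_hat, y):
    # cross-entropy loss for a single example
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(loss(0.99, 1))                                   # ~0.01: confident and correct, tiny loss
print(loss(0.99, 0))                                   # ~4.61: confident but wrong, large loss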

Cost Function

We can then collect the loss over every training example into the cost function $J(w, b)$.

The cost function is the average of the loss over the entire training set:

$$\begin{aligned} J(w, b) &= \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) \\ &= -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log\big(1 - \hat{y}^{(i)}\big) \Big] \end{aligned}$$

The superscript $(i)$ refers to the $i$-th training example.
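
A minimal NumPy sketch of the cost, assuming the predictions and labels are stored as (1, m) row vectors as in the Notation section:

import numpy as np

def cost(Y_hat, Y):
    # average cross-entropy over the m columns of Y_hat and Y
    m = Y.shape[1]
    return -np.sum(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat)) / m

Y_hat = np.array([[0.9, 0.2, 0.7]])
Y = np.array([[1, 0, 1]])
print(cost(Y_hat, Y))                                  # average loss over the 3 examples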

Gradient Descent

Gradient descent here is much the same as in the Machine Learning course,

except that now we also update $b$ on every iteration:

$$\begin{aligned} \text{Repeat}: \quad & w := w - \alpha \frac{\partial J(w, b)}{\partial w} \\ & b := b - \alpha \frac{\partial J(w, b)}{\partial b} \end{aligned}$$
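
One update step looks like this in NumPy (a sketch with placeholder gradients; computing dw and db is covered below):

import numpy as np

n_x = 12288
alpha = 0.01                                           # learning rate
w, b = np.zeros((n_x, 1)), 0.0
dw, db = np.ones((n_x, 1)), 1.0                        # placeholder gradients of J

w = w - alpha * dw                                     # w := w - alpha * dJ/dw
b = b - alpha * db                                     # b := b - alpha * dJ/db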

Derivatives intuition

Here is a quick look at the bit of calculus that gradient descent relies on.

All we really need is the slope of the cost function at the current point.

The slope is just $\frac{\text{height}}{\text{width}}$.

So for a linear function $f(a) = 3a$,

nudging $a$ anywhere gives a slope of 3:

$$\begin{aligned} &a = 2 && f(a) = 6 \\ &a = 2.001 && f(a) = 6.003 \\ &\text{Slope of } f(a) \text{ at } a = 2 \text{ is } \frac{0.003}{0.001} = 3 = \frac{d f(a)}{da} \end{aligned}$$

Changing Derivatives

Other functions have a slope that keeps changing.

For example, on the graph of $f(a) = a^2$:

When $a = 2$:

$$\begin{aligned} &a = 2 && f(a) \approx 4 \\ &a = 2.001 && f(a) \approx 4.004 \\ &\text{Slope } \frac{d f(a)}{da} = 4 \text{ when } a = 2 \end{aligned}$$

When $a = 5$:

$$\begin{aligned} &a = 5 && f(a) \approx 25 \\ &a = 5.001 && f(a) \approx 25.010 \\ &\text{Slope } \frac{d f(a)}{da} = 10 \text{ when } a = 5 \end{aligned}$$

The little triangles drawn at different points give different slopes.

That is exactly what the first derivative means.

Some common derivatives:

$$\begin{aligned} &f(a) = a^2 && \frac{d f(a)}{da} = 2a \\ &f(a) = a^3 && \frac{d f(a)}{da} = 3a^2 \\ &f(a) = \ln(a) && \frac{d f(a)}{da} = \frac{1}{a} \end{aligned}$$
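
These slopes are easy to check numerically; a short sketch reproducing the two triangles above for $f(a) = a^2$:

def f(a):
    return a ** 2

eps = 0.001
for a in (2, 5):
    slope = (f(a + eps) - f(a)) / eps
    print(a, slope)                                    # ~4.001 at a = 2, ~10.001 at a = 5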

Computation Graph

Before explaining backpropagation, let's walk through the idea with a simple example.

Suppose we want to minimize $J(a, b, c) = 3(a + bc)$.

Evaluating this function takes three steps:

$$\begin{aligned} u &= bc \\ v &= a + u \\ J &= 3v \end{aligned}$$

These steps can be drawn as a computation graph.

Backpropagation then works from right to left, taking the derivative of $J$ with respect to each quantity.

Once we have those derivatives, we can apply gradient descent to each quantity.

Why go from right to left? Because we need the chain rule.

Below, $\frac{dJ}{d\mathrm{var}}$ is abbreviated as $d\mathrm{var}$,

where var is any variable.

For example $\frac{dJ}{da} = da$ and $\frac{dJ}{db} = db$.

This is also the common way these variables are named in Python code.

Now we can work backwards and find $dv = 3$,

because whenever $v$ is nudged, $J$ changes by three times as much:

$$\begin{aligned} &v = 1 && J = 3 \\ &v = 1.001 && J = 3.003 \\ &\text{Slope} = \frac{0.003}{0.001} = 3 \end{aligned}$$

Next we step further back and find $da = 3$,

because nudging $a$ also changes $J$ by three times as much.

But we compute it through the chain: $a$ changes $v$, and $v$ changes $J$:

$$\frac{dJ}{da} = \frac{dJ}{dv} \frac{dv}{da} = 3 \cdot \frac{dv}{da}$$

We already know $\frac{dJ}{dv}$.

Building on that, we find that nudging $a$ changes $v$ by exactly the same amount:

$$\begin{aligned} &a = 5 && v = 11 \\ &a = 5.001 && v = 11.001 \\ &\text{Slope} = \frac{0.001}{0.001} = 1 \end{aligned}$$

So $da = \frac{dJ}{da} = 3 \cdot \frac{dv}{da} = 3 \cdot 1 = 3$.

Applying the same chain-rule procedure to $u$, and then to $b$ and $c$,

gives us the slope of $J$ with respect to every quantity in the graph.
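
A minimal sketch of the whole forward and backward pass, assuming $b = 3$ and $c = 2$ so that $u = 6$ and $v = 11$ as in the example above:

a, b, c = 5, 3, 2

# forward pass
u = b * c                                              # u = 6
v = a + u                                              # v = 11
J = 3 * v                                              # J = 33

# backward pass, right to left with the chain rule
dv = 3                                                 # dJ/dv
da = dv * 1                                            # dJ/da = dJ/dv * dv/da = 3
du = dv * 1                                            # dJ/du = dJ/dv * dv/du = 3
db = du * c                                            # dJ/db = dJ/du * du/db = 3 * 2 = 6
dc = du * b                                            # dJ/dc = dJ/du * du/dc = 3 * 3 = 9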

Logistic Regression Gradient Descent as NN

We now turn the logistic regression loss computation into a computation graph, with $z = w^T x + b$ and $a = \hat{y} = \sigma(z)$, and walk backwards through it (a small worked sketch follows this list).

  1. The derivative with respect to $a$ is $da = \frac{d\mathcal{L}(a, y)}{da} = -\frac{y}{a} + \frac{1-y}{1-a}$

  2. The derivative with respect to $z$ is $dz = \frac{d\mathcal{L}(a, y)}{dz} = \frac{d\mathcal{L}}{da} \cdot \frac{da}{dz} = a - y$

    • where we already know $\frac{d\mathcal{L}}{da} = -\frac{y}{a} + \frac{1-y}{1-a}$

    • and the new factor is $\frac{da}{dz} = a(1 - a)$

  3. We can then compute the derivatives with respect to $w_1$, $w_2$, and $b$

    • $dw_1 = \frac{d\mathcal{L}}{dw_1} = x_1 \cdot dz$

    • $dw_2 = \frac{d\mathcal{L}}{dw_2} = x_2 \cdot dz$

    • $db = \frac{d\mathcal{L}}{db} = dz$

  4. Finally we take a gradient descent step on these three parameters

    • $w_1 := w_1 - \alpha \, dw_1$

    • $w_2 := w_2 - \alpha \, dw_2$

    • $b := b - \alpha \, db$
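
The worked sketch referred to above: one gradient step for a single hypothetical example with two features (all numbers are made up for illustration):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x1, x2, y = 1.0, 2.0, 1                                # a made-up example
w1, w2, b = 0.1, -0.2, 0.0                             # made-up parameters
alpha = 0.01

# forward pass
z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)

# backward pass
dz = a - y
dw1, dw2, db = x1 * dz, x2 * dz, dz

# gradient descent step
w1 -= alpha * dw1
w2 -= alpha * dw2
b -= alpha * db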

Gradient Descent on m Examples

Now we put everything together and run one iteration of gradient descent over all $m$ training examples.

The pseudocode looks like this:

J = dw1 = dw2 = db = 0

For i = 1 to m:
    % forward propagation
    z(i) = w'x(i) + b  % z for the i-th training example
    a(i) = sigmoid(z(i))  % a for the i-th training example
    J += -[y(i) * log(a(i)) + (1 - y(i)) * log(1 - a(i))]  % accumulate the cost over all examples

    % backward propagation
    dz(i) = a(i) - y(i)  % dz for the i-th training example

    % with many features, this part becomes another loop
    dw1 += x1(i) * dz(i)  % add this example's dw1 to the running total
    dw2 += x2(i) * dz(i)  % add this example's dw2 to the running total
    db  += dz(i)  % add this example's db to the running total

% after the loop, divide each accumulator by m to take the average
J /= m
dw1 /= m
dw2 /= m
db /= m

This version needs two explicit loops: one over the $m$ training examples and one over the $n$ input features (dw1, dw2, ..., dwn).

That is why we need the idea of vectorization to solve the problem efficiently.
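
As a preview of that idea, here is a minimal NumPy sketch of the same computation with both loops replaced by matrix operations (assuming X has shape (n_x, m) and Y has shape (1, m) as in the Notation section; the random data is only a placeholder):

import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

n_x, m = 12288, 100                                    # placeholder sizes
X = np.random.rand(n_x, m)                             # placeholder data
Y = np.random.randint(0, 2, size=(1, m))               # placeholder labels
w, b, alpha = np.zeros((n_x, 1)), 0.0, 0.01

# forward pass over all m examples at once
Z = w.T @ X + b                                        # shape (1, m)
A = sigmoid(Z)
J = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m

# backward pass over all m examples at once
dZ = A - Y                                             # shape (1, m)
dw = X @ dZ.T / m                                      # shape (n_x, 1)
db = np.sum(dZ) / m

# gradient descent step
w -= alpha * dw
b -= alpha * db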
