Regularizing your Neural Network

Regularization

Regularization was already covered in the machine learning notes.
Here we revisit it and apply it to neural networks.
The logistic regression cost $J(w, b)$ with regularization is
- $J(w, b) = \frac{1}{m}\sum_{i=1}^m\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\lVert w\rVert^2_2$
The added regularization term comes in two variants, L2 and L1
- L2 is the one written above
  - $\frac{\lambda}{2m}\lVert w\rVert^2_2 = \frac{\lambda}{2m}\sum_{j=1}^{n_x}w_j^2 = \frac{\lambda}{2m} w^Tw$
- L1 is the version without the square
  - $\frac{\lambda}{2m}\lVert w\rVert_1 = \frac{\lambda}{2m}\sum_{j=1}^{n_x}\lvert w_j\rvert$
- L1 tends to make $w$ sparse (many entries become 0)
- In practice, L2 is the usual choice

In a neural network, the L2 penalty uses the squared Frobenius norm of each weight matrix (marked with a subscript F)
- This is the sum of the squares of every entry of $w^{[l]}$, whose shape is $(n^{[l-1]}, n^{[l]})$
- $\lVert w^{[l]}\rVert_F^2 = \sum_{i=1}^{n^{[l-1]}}\sum_{j=1}^{n^{[l]}}(w^{[l]}_{ij})^2$
So the full neural network cost with regularization is
- $J(w^{[1]},b^{[1]},\cdots,w^{[L]},b^{[L]}) = \frac{1}{m}\sum_{i=1}^m\mathcal{L}(\hat{y}^{(i)},y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^L \lVert w^{[l]}\rVert_F^2$
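The regularized cost above can be sketched in a few lines of NumPy. This is only an illustration: `frobenius_penalty`, `regularized_cost`, the two small weight matrices and the per-example losses are all hypothetical, and the loss values are assumed to have been computed elsewhere.

```python
import numpy as np

def frobenius_penalty(weights, lam, m):
    """(lambda / 2m) * sum over layers of ||W[l]||_F^2."""
    return lam / (2 * m) * sum(float(np.sum(np.square(W))) for W in weights)

def regularized_cost(losses, weights, lam):
    """Mean per-example loss plus the Frobenius penalty over all layers."""
    m = len(losses)
    return float(np.mean(losses)) + frobenius_penalty(weights, lam, m)

# Hypothetical two-layer network and per-example losses (illustration only).
W1 = np.array([[1.0, -1.0], [2.0, 0.0]])
W2 = np.array([[0.5, 0.5]])
losses = np.array([0.2, 0.4, 0.6, 0.8])

cost = regularized_cost(losses, [W1, W2], lam=0.8)
print(cost)
```

Note that the biases are left out of the penalty, matching the formula, which sums over the $w^{[l]}$ only.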

With the regularization term, the gradient of each $W^{[l]}$ gains an extra $\frac{\lambda}{m}W^{[l]}$, and every update first shrinks $W^{[l]}$ by the factor $(1-\frac{\alpha\lambda}{m})$. This is why L2 regularization is also called weight decay.

$$
\begin{aligned}
dW^{[l]} &= \frac{\partial \mathcal{L}}{\partial w^{[l]}} + \frac{\lambda}{m}W^{[l]} \\
&\text{insert into gradient descent} \\
W^{[l]} &:= W^{[l]} - \alpha\, dW^{[l]} \\
&:= W^{[l]} - \alpha\left[\frac{\partial \mathcal{L}}{\partial w^{[l]}} + \frac{\lambda}{m}W^{[l]}\right] \\
&:= W^{[l]} - \alpha\frac{\lambda}{m}W^{[l]} - \alpha\frac{\partial \mathcal{L}}{\partial w^{[l]}} \\
&:= \left(1-\frac{\alpha\lambda}{m}\right)W^{[l]} - \alpha\frac{\partial \mathcal{L}}{\partial w^{[l]}}
\end{aligned}
$$
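As a quick numerical check of the derivation, the sketch below applies the update in both forms and compares the results. The function names and the random `W` and `dW_backprop` (a stand-in for $\frac{\partial \mathcal{L}}{\partial w^{[l]}}$) are assumptions for illustration.

```python
import numpy as np

def update_with_l2(W, dW_backprop, alpha, lam, m):
    """Gradient step using dW = dW_backprop + (lambda / m) * W."""
    dW = dW_backprop + (lam / m) * W
    return W - alpha * dW

def update_as_weight_decay(W, dW_backprop, alpha, lam, m):
    """Same step written as: shrink W, then take the unregularized step."""
    return (1 - alpha * lam / m) * W - alpha * dW_backprop

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
dW_backprop = rng.standard_normal((3, 4))  # stand-in for the backprop gradient

a = update_with_l2(W, dW_backprop, alpha=0.1, lam=0.7, m=50)
b = update_as_weight_decay(W, dW_backprop, alpha=0.1, lam=0.7, m=50)
print(np.allclose(a, b))  # the two forms agree up to floating-point rounding
```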

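Inverted dropout applied to the activations al of one layer:

```python
import numpy as np

keep_prob = 0.8  # probability of keeping each neuron
# al: float activations of layer l, computed earlier in forward propagation
dl = np.random.rand(al.shape[0], al.shape[1]) < keep_prob  # mask, True = keep
al = np.multiply(al, dl)  # zero out the dropped neurons
al /= keep_prob           # inverted dropout: rescale the surviving neurons
```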

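This works like a mask: al is multiplied element-wise by the 0/1 mask dl.
The final al /= keep_prob is the key idea of inverted dropout
- It keeps the input to the next layer on the same scale
- Some neurons in al have been zeroed, so on average only 80% of the activation is left
- Dividing by keep_prob = 0.8 scales the surviving neurons up by 1.25, which restores the expected value of al
Also, do not use dropout at test time, because it would make the predictions random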
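One possible way to package these steps is a helper with a training switch, so the same forward code can run with dropout during training and without it at test time. This is a sketch under assumptions: `dropout_forward`, its `training` flag and the all-ones activations are hypothetical and not part of the original notes.

```python
import numpy as np

def dropout_forward(al, keep_prob, rng, training=True):
    """Inverted dropout; returns al unchanged when training is False."""
    if not training or keep_prob >= 1.0:
        return al
    dl = rng.random(al.shape) < keep_prob  # boolean mask, True = keep
    return al * dl / keep_prob             # drop, then rescale the survivors

rng = np.random.default_rng(0)
al = np.ones((100, 1000))  # hypothetical activations, all equal to 1

train_out = dropout_forward(al, 0.8, rng, training=True)
test_out = dropout_forward(al, 0.8, rng, training=False)

print(np.unique(train_out))  # surviving entries are scaled up by 1 / keep_prob
print(train_out.mean())      # stays close to the original mean of 1.0
print(np.array_equal(test_out, al))
```

Because the rescaling happens during training, nothing needs to be scaled at test time.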

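keep_prob can be set differently for each layer
Layers with few hidden units usually use keep_prob = 1
- The input layer usually uses 1 as well
Layers with more units (and therefore a larger weight matrix) are more prone to overfitting, so they get a smaller keep_prob
Dropout has one big drawback
- The cost function $J$ is no longer well defined, because different neurons are dropped on every iteration
- So you cannot check during training whether the cost is decreasing as it should
- The usual workaround is to set every keep_prob back to 1 first and watch the cost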
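The sketch below illustrates both points: a separate keep_prob per hidden layer, and a cost that is only repeatable once every keep_prob is 1. Everything here is an assumption made for illustration: the tiny ReLU network, the `(units_out, units_in)` weight layout, the cross-entropy `cost` helper and the particular keep_prob values.

```python
import numpy as np

def forward(X, params, keep_probs, rng):
    """ReLU hidden layers with inverted dropout, sigmoid output layer.

    params: list of (W, b); W is stored as (units_out, units_in) in this sketch.
    keep_probs: one value per hidden layer; 1.0 disables dropout for that layer.
    """
    a = X
    for (W, b), keep_prob in zip(params[:-1], keep_probs):
        a = np.maximum(0, W @ a + b)
        if keep_prob < 1.0:
            mask = rng.random(a.shape) < keep_prob
            a = a * mask / keep_prob
    W, b = params[-1]
    return 1 / (1 + np.exp(-(W @ a + b)))  # no dropout on the output layer

def cost(y_hat, y):
    """Averaged cross-entropy loss."""
    return float(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 20))
y = (rng.random((1, 20)) < 0.5).astype(float)
params = [(rng.standard_normal((6, 4)) * 0.5, np.zeros((6, 1))),
          (rng.standard_normal((3, 6)) * 0.5, np.zeros((3, 1))),
          (rng.standard_normal((1, 3)) * 0.5, np.zeros((1, 1)))]

train_keep = [0.7, 1.0]  # hypothetical: more dropout on the wider layer
check_keep = [1.0, 1.0]  # dropout off while monitoring the cost

noisy = [cost(forward(X, params, train_keep, rng), y) for _ in range(3)]
clean = [cost(forward(X, params, check_keep, rng), y) for _ in range(3)]
print(noisy)  # a different mask is drawn on each call
print(clean)  # identical values, since no randomness is involved
```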
