Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization (Outline)

Aug 29, 2017

Week One

Regularizing your neural network

  • L2 regularization
  • Dropout
  • Early stopping
  • Momentum
  • Adam

Setting up your optimization problem

Normalizing inputs

Normalizing the inputs makes optimization faster. Subtract the mean and normalize the variance: $ x := \frac{x - \mu}{\sigma}, \mu = \frac{1}{m}\sum_{i=1}^{m}{x^{(i)}}, \sigma^2 = \frac{1}{m}\sum_{i=1}^{m}{(x^{(i)} - \mu)^2} $
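
A minimal sketch of input normalization in plain Python: subtract the per-feature mean and divide by the per-feature standard deviation. The helper names `fit_normalizer` and `normalize`, the small `eps` guard, and the toy data are assumptions made for illustration.

```python
import math

def fit_normalizer(rows):
    """Return the per-feature mean and standard deviation of a list of samples."""
    m = len(rows)
    n = len(rows[0])
    mu = [sum(row[j] for row in rows) / m for j in range(n)]
    var = [sum((row[j] - mu[j]) ** 2 for row in rows) / m for j in range(n)]
    sigma = [math.sqrt(v) for v in var]
    return mu, sigma

def normalize(rows, mu, sigma, eps=1e-8):
    """Apply x := (x - mu) / sigma per feature; eps (an assumption) guards against sigma == 0."""
    return [[(x - mu_j) / (s_j + eps) for x, mu_j, s_j in zip(row, mu, sigma)]
            for row in rows]

# Toy training set: two features on very different scales.
train = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
mu, sigma = fit_normalizer(train)
train_norm = normalize(train, mu, sigma)
```

The same `mu` and `sigma` computed on the training set should be reused to normalize the dev and test sets, so that all data goes through the same transformation.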

Vanishing / Exploding gradients

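As the number of layers grows, the activations and gradients, which involve products of the per-layer weights $W$, can become exponentially small or exponentially large.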
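
A toy illustration of the effect, assuming a deep linear network in which every layer multiplies its input by the same scalar weight `w` (a deliberate simplification; `output_scale` is a hypothetical helper):

```python
def output_scale(w, num_layers):
    """Factor by which a deep linear network with identical scalar weights w scales its input."""
    return w ** num_layers

# Weights slightly below 1 shrink the signal; weights slightly above 1 blow it up.
shrinking = output_scale(0.9, 100)
growing = output_scale(1.1, 100)
```

With 100 layers, a weight of 0.9 shrinks the signal by more than four orders of magnitude, while a weight of 1.1 grows it by more than four orders of magnitude.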

Weight Initialization For Deep Networks

Scale the randomly initialized weights by a constant: $ w*C $
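
The outline does not say which constant is used. One common choice, assumed here, is $C = \sqrt{2 / n_{in}}$ for ReLU layers (He initialization) or $C = \sqrt{1 / n_{in}}$ for tanh layers, where $n_{in}$ is the number of inputs to the layer. A rough sketch with a hypothetical `init_weights` helper:

```python
import math
import random

def init_weights(n_out, n_in, rng, relu=True):
    """Draw standard-normal weights and scale them by C = sqrt(k / n_in).

    k = 2 (ReLU) or k = 1 (tanh) is an assumed choice of the constant C.
    """
    scale = math.sqrt((2.0 if relu else 1.0) / n_in)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_in)]
            for _ in range(n_out)]

rng = random.Random(0)  # fixed seed so the example is reproducible
W = init_weights(4, 500, rng)
```

Scaling by $n_{in}$ keeps the variance of each unit's weighted sum roughly independent of the layer width, which helps against vanishing and exploding gradients.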

Numerical approximation of gradients

Gradient Checking

  • Don’t use in training - only to debug
  • If algorithm fails grad check, look at components to try to identify bug.
  • Remember to include the regularization term in $J$ and in its gradient
  • Doesn’t work with dropout. (hard to compute $J$)
  • Run at random initialization; perhaps again after some training.
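
A minimal sketch of gradient checking: approximate each partial derivative with a two-sided difference and compare the result to the analytic gradient through a relative error. The toy `cost` function, the helper names, and the step `eps=1e-7` are assumptions for illustration.

```python
import math

def numerical_gradient(f, theta, eps=1e-7):
    """Two-sided difference: (f(theta + eps) - f(theta - eps)) / (2 * eps), one coordinate at a time."""
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += eps
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2 * eps))
    return grad

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def relative_error(a, b):
    """||a - b|| / (||a|| + ||b||); small values mean the two gradients agree."""
    return l2_norm([x - y for x, y in zip(a, b)]) / (l2_norm(a) + l2_norm(b))

def cost(theta):
    # Toy cost J(theta) = theta_0^2 + 3 * theta_0 * theta_1
    return theta[0] ** 2 + 3.0 * theta[0] * theta[1]

def analytic_gradient(theta):
    return [2.0 * theta[0] + 3.0 * theta[1], 3.0 * theta[0]]

def buggy_gradient(theta):
    # Deliberately drops the 3 * theta_1 term to show what a failed check looks like.
    return [2.0 * theta[0], 3.0 * theta[0]]

theta = [1.5, -0.5]
approx = numerical_gradient(cost, theta)
err_ok = relative_error(approx, analytic_gradient(theta))
err_bug = relative_error(approx, buggy_gradient(theta))
```

Here `err_ok` is tiny while `err_bug` is large; comparing the individual components of `approx` with the analytic gradient then points at the term that is wrong.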

Week Two


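  • Mini-batch size is between 1 and m
    • Vectorization
    • Don’t need to see all the samples before taking a step
    • If the training set is small, use batch gradient descent; otherwise use a mini-batch size of 64, 128, 256, 512, …, 1024 and make sure a mini-batch fits in CPU/GPU memory
  • Batch gradient descent is the case where the mini-batch size is m
    • takes a long time per iteration
  • Stochastic gradient descent is the case where the mini-batch size is 1
    • loses the speed-up from vectorization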
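
A minimal sketch of how a training set might be split into mini-batches: shuffle the sample indices once per epoch, then cut them into consecutive chunks. The function name `make_mini_batches` and the toy data are assumptions for illustration.

```python
import random

def make_mini_batches(X, Y, batch_size, rng):
    """Shuffle (X, Y) together and split them into mini-batches of at most batch_size samples."""
    indices = list(range(len(X)))
    rng.shuffle(indices)
    batches = []
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]
        batches.append(([X[i] for i in chunk], [Y[i] for i in chunk]))
    return batches

# 10 samples with one feature each; the label is the parity of the feature.
X = [[float(i)] for i in range(10)]
Y = [i % 2 for i in range(10)]
batches = make_mini_batches(X, Y, batch_size=4, rng=random.Random(0))
```

Setting `batch_size=len(X)` gives batch gradient descent and `batch_size=1` gives stochastic gradient descent; the last mini-batch is smaller when the batch size does not divide m.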

Exponentially Weighted Average

Bias Correction
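
A short sketch of an exponentially weighted average with optional bias correction, assuming the usual recurrence $v_t = \beta v_{t-1} + (1 - \beta)\theta_t$ with $v_0 = 0$ and the corrected estimate $v_t / (1 - \beta^t)$. The function name `ewa` and the toy sequence are illustrative.

```python
def ewa(values, beta=0.9, bias_correction=True):
    """Exponentially weighted average of a sequence, starting from v_0 = 0."""
    v = 0.0
    out = []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1.0 - beta) * theta
        # Without correction the early estimates are biased toward 0.
        out.append(v / (1.0 - beta ** t) if bias_correction else v)
    return out

temps = [10.0, 10.0, 10.0, 10.0]
raw = ewa(temps, bias_correction=False)
corrected = ewa(temps)
```

On this constant sequence the uncorrected average starts near 1.0 and climbs slowly, while the corrected one stays at about 10.0 from the first step.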

Gradient descent with momentum


RMSprop + Momentum + Bias Correction (Adam)
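
A rough sketch of one Adam update, combining a momentum-style average of the gradients, an RMSprop-style average of the squared gradients, and bias correction of both. The helper `adam_step`, the `state` dictionary, and the toy quadratic cost are assumptions for illustration; the defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ are the commonly used values.

```python
import math

def adam_step(w, grad, state, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return updated parameters; state holds the step count t and the averages v and s."""
    state["t"] += 1
    t = state["t"]
    new_w = []
    for i, (w_i, g_i) in enumerate(zip(w, grad)):
        # Momentum part: exponentially weighted average of gradients.
        state["v"][i] = beta1 * state["v"][i] + (1 - beta1) * g_i
        # RMSprop part: exponentially weighted average of squared gradients.
        state["s"][i] = beta2 * state["s"][i] + (1 - beta2) * g_i * g_i
        # Bias correction of both averages.
        v_hat = state["v"][i] / (1 - beta1 ** t)
        s_hat = state["s"][i] / (1 - beta2 ** t)
        new_w.append(w_i - alpha * v_hat / (math.sqrt(s_hat) + eps))
    return new_w

# Toy problem: minimize J(w) = w_0^2 + w_1^2, whose gradient is 2 * w.
w = [1.0, -2.0]
state = {"t": 0, "v": [0.0, 0.0], "s": [0.0, 0.0]}
for _ in range(2000):
    grad = [2.0 * x for x in w]
    w = adam_step(w, grad, state, alpha=0.01)
```

Because of the bias correction, the very first step has magnitude close to `alpha` regardless of the gradient's scale.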


Learning rate decay

Week Three

Hyperparameter tuning

Tuning Process

  • $\alpha$
  • $\beta$
  • number of hidden units
  • mini-batch size
  • number of layers
  • learning rate decay

Using an appropriate scale to pick hyperparameters

  • $\alpha$: search on a log scale
  • $\beta = 0.9 \ldots 0.999$, so $1 - \beta = 0.1\ (10^{-1}) \ldots 0.001\ (10^{-3})$; sample $r \in [-3, -1]$ uniformly and set $\beta = 1 - 10^{r}$
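
A small sketch of sampling both hyperparameters on a log scale. The range for $\beta$ follows the $r \in [-3, -1]$ rule above; the exponent range $[-4, 0]$ for $\alpha$ and the function names are assumptions for illustration.

```python
import random

def sample_alpha(rng, low_exp=-4.0, high_exp=0.0):
    """Sample a learning rate uniformly on a log scale: alpha = 10^r, r in [low_exp, high_exp]."""
    r = rng.uniform(low_exp, high_exp)
    return 10.0 ** r

def sample_beta(rng, low_exp=-3.0, high_exp=-1.0):
    """Sample beta so that 1 - beta is uniform on a log scale: beta = 1 - 10^r."""
    r = rng.uniform(low_exp, high_exp)
    return 1.0 - 10.0 ** r

rng = random.Random(0)  # fixed seed so the example is reproducible
alphas = [sample_alpha(rng) for _ in range(5)]
betas = [sample_beta(rng) for _ in range(5)]
```

Sampling $1 - \beta$ rather than $\beta$ on the log scale spends as many trials between 0.99 and 0.999 as between 0.9 and 0.99, which matters because results are very sensitive to $\beta$ when it is close to 1.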

Batch Normalization

Normalizing activations in a network

Fitting batch norm into a neural network

Why does batch norm work?

  • Makes the inputs to each hidden layer more stable, so later layers are less affected by changes in earlier layers
  • Each mini-batch is scaled by the mean/variance computed on just that mini-batch
  • This adds some noise to the values $Z^{[l]}$ within that mini-batch, so, similar to dropout, it adds some noise to each hidden layer’s activations
  • This has a slight regularization effect
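
A minimal sketch of the batch norm forward pass for a single hidden unit, to make concrete that the mean and variance come from the current mini-batch only. The function name and the learnable scale `gamma` and shift `beta` follow the usual batch norm formulation; this `beta` is unrelated to the momentum hyperparameter $\beta$, and `eps` is an assumed stabilizer.

```python
import math

def batch_norm_forward(z_batch, gamma, beta, eps=1e-8):
    """Normalize one unit's pre-activations over a mini-batch, then scale and shift."""
    m = len(z_batch)
    mu = sum(z_batch) / m                          # mean of this mini-batch only
    var = sum((z - mu) ** 2 for z in z_batch) / m  # variance of this mini-batch only
    z_norm = [(z - mu) / math.sqrt(var + eps) for z in z_batch]
    z_tilde = [gamma * zn + beta for zn in z_norm]
    return z_tilde, mu, var

z_tilde, mu, var = batch_norm_forward([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)
```

Since `mu` and `var` change from one mini-batch to the next, the same input value is normalized slightly differently each time, which is the source of the noise mentioned above.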

Batch Norm at test time

Estimate $\mu$ and $\sigma^2$ using an exponentially weighted average across the mini-batches seen during training.
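
A rough sketch of that idea for a single hidden unit: keep running averages of the mini-batch statistics during training, then use them in place of batch statistics at test time. The helper names, the `running` dictionary, its initial values, and the averaging factor `momentum=0.9` are assumptions for illustration.

```python
import math

def update_running_stats(running, batch_mu, batch_var, momentum=0.9):
    """Exponentially weighted average of the per-mini-batch mean and variance."""
    running["mu"] = momentum * running["mu"] + (1 - momentum) * batch_mu
    running["var"] = momentum * running["var"] + (1 - momentum) * batch_var
    return running

def batch_norm_inference(z, running, gamma, beta, eps=1e-8):
    """Normalize a single example with the running statistics instead of batch statistics."""
    z_norm = (z - running["mu"]) / math.sqrt(running["var"] + eps)
    return gamma * z_norm + beta

# Training: pretend every mini-batch reported mean 2.5 and variance 1.25.
running = {"mu": 0.0, "var": 1.0}
for _ in range(200):
    update_running_stats(running, batch_mu=2.5, batch_var=1.25)

# Test time: a single example can be processed without any batch.
out = batch_norm_inference(2.5, running, gamma=2.0, beta=0.5)
```

An example equal to the running mean is mapped to a normalized value of 0, so the output is just the shift `beta`.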

Multiple-class classification

Softmax Regression

Training a softmax classifier