Aug 29, 2017

# Week One

• L2 regularization
• Dropout
• Early stopping
• Momentum

## Setting up your optimization problem

### Gradient Checking

• Don’t use in training - only to debug
• If the algorithm fails grad check, look at individual components to try to identify the bug.
• Remember regularization (include the regularization term in $J$)
• Doesn’t work with dropout (hard to compute $J$)
• Run at random initialization; perhaps again after some training.
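
The check above can be sketched as a two-sided numerical gradient compared against the analytic one (a minimal sketch; `J`, `theta`, and `grad` are assumed names for a cost function, a flat parameter vector, and its analytic gradient):

```python
import numpy as np

def grad_check(J, theta, grad, eps=1e-7):
    """Compare the analytic gradient to a two-sided numerical gradient of J at theta."""
    num_grad = np.zeros_like(theta)
    for i in range(theta.size):
        plus = theta.copy();  plus[i] += eps
        minus = theta.copy(); minus[i] -= eps
        num_grad[i] = (J(plus) - J(minus)) / (2 * eps)
    # Relative difference: ~1e-7 is great, > 1e-3 signals a likely bug
    diff = np.linalg.norm(grad - num_grad) / (np.linalg.norm(grad) + np.linalg.norm(num_grad))
    return diff, num_grad
```

When the check fails, the per-component `num_grad` makes it possible to see which parameters (e.g. only the biases of one layer) disagree.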

# Week Two

## Mini-batch

• Mini-batch size ranges from 1 to m
• Vectorization
• Don’t need to see all the samples
• If the training set is small, just use batch gradient descent; otherwise pick a mini-batch size that is a power of two (64, 128, 256, 512, …, 1024) and make sure the mini-batch fits in CPU/GPU memory
• Batch gradient descent is mini-batch size m
• takes a long time per iteration
• Stochastic gradient descent is mini-batch size 1
• loses the speed-up from vectorization
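
The partitioning can be sketched as follows (a minimal sketch, assuming the course’s column convention: `X` is $n_x \times m$, `Y` is $1 \times m$):

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the m columns of (X, Y) together, then split into mini-batches."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)          # same permutation for X and Y
    X, Y = X[:, perm], Y[:, perm]
    batches = []
    for k in range(0, m, batch_size):  # last batch may be smaller than batch_size
        batches.append((X[:, k:k + batch_size], Y[:, k:k + batch_size]))
    return batches
```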

## Exponentially Weighted Average

$V_t = \beta V_{t-1} + (1-\beta)\theta_t$, averaging over $\approx \frac{1}{1-\beta}$ days
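
The recurrence can be sketched in a few lines (initializing $V_0 = 0$, which is what makes the bias correction below useful):

```python
def ewa(series, beta=0.9):
    """Exponentially weighted average: V_t = beta * V_{t-1} + (1 - beta) * theta_t."""
    v, out = 0.0, []
    for theta in series:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return out
```

Note that with $V_0 = 0$ the first values are biased low: `ewa([1.0, 1.0, 1.0], beta=0.5)` gives `[0.5, 0.75, 0.875]` rather than all ones.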

## Bias Correction

$V_t := \frac{V_t}{1-\beta^t}$ (corrects the initial low bias, i.e. faster warm-up)

## Momentum

$V_{dw} := \beta V_{dw} + (1-\beta)dw \\ V_{db} := \beta V_{db} + (1-\beta)db \\ w := w - \alpha V_{dw} \\ b := b - \alpha V_{db}$
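
The momentum update above can be sketched per parameter (a minimal sketch; the caller supplies the gradient `dw` each step):

```python
def momentum_step(w, dw, v, beta=0.9, lr=0.01):
    """One gradient-descent-with-momentum update for a single parameter."""
    v = beta * v + (1 - beta) * dw   # velocity: EWA of the gradients
    w = w - lr * v
    return w, v
```

On a simple quadratic cost $J(w) = w^2$ (so $dw = 2w$), iterating this update drives $w$ toward 0.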

## RMSprop

$S_{dw} := \beta S_{dw} + (1-\beta) dw^2 \\ S_{db} := \beta S_{db} + (1-\beta) db^2 \\ w := w - \alpha \frac{dw}{\sqrt{S_{dw}} + \epsilon} \\ b := b - \alpha \frac{db}{\sqrt{S_{db}} + \epsilon}$
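
The RMSprop update above, sketched per parameter ($dw^2$ is element-wise; $\epsilon$ guards against division by zero):

```python
import numpy as np

def rmsprop_step(w, dw, s, beta=0.999, lr=0.01, eps=1e-8):
    """One RMSprop update: divide the gradient by the root of an EWA of squared gradients."""
    s = beta * s + (1 - beta) * dw**2
    w = w - lr * dw / (np.sqrt(s) + eps)
    return w, s
```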

## Hyperparameters (Adam)

$\alpha = \text{learning rate (needs tuning)}, \beta_1 = 0.9 \ (dw), \beta_2 = 0.999 \ (dw^2), \epsilon = 10^{-8}$
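
These are Adam’s hyperparameters: Adam combines the momentum term ($\beta_1$) with the RMSprop term ($\beta_2$) and applies bias correction to both. A minimal per-parameter sketch (step counter `t` starts at 1):

```python
import numpy as np

def adam_step(w, dw, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: bias-corrected momentum divided by bias-corrected RMS."""
    v = beta1 * v + (1 - beta1) * dw       # EWA of gradients (momentum)
    s = beta2 * s + (1 - beta2) * dw**2    # EWA of squared gradients (RMSprop)
    v_hat = v / (1 - beta1**t)             # bias correction
    s_hat = s / (1 - beta2**t)
    w = w - lr * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s
```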

# Week Three

## Hyperparameters tuning

### Tuning Process

• $\alpha$
• $\beta$
• number of hidden units
• mini-batch size
• number of layers
• learning rate decay

### Using an appropriate scale to pick hyperparameters

• $\alpha$ log scale search
• $\beta = 0.9 \ldots 0.999$, so $1 - \beta = 0.1 \ (10^{-1}) \ldots 0.001 \ (10^{-3})$; sample $r \in [-3, -1]$ and set $\beta = 1 - 10^{r}$
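
Both searches can be sketched together (the $\alpha$ range $[10^{-4}, 1]$ is an assumed example; the $\beta$ range follows the note above):

```python
import numpy as np

rng = np.random.default_rng(0)

# alpha: sample the exponent uniformly, e.g. between 1e-4 and 1e0
r = rng.uniform(-4, 0)
alpha = 10 ** r

# beta in [0.9, 0.999]: sample 1 - beta on a log scale instead,
# because beta's effect is very sensitive near 1
r = rng.uniform(-3, -1)
beta = 1 - 10 ** r
```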

## Batch Normalization

### Why does batch norm work?

• Makes the input values to each hidden layer more stable
• Each mini-batch is scaled by the mean/variance computed on just that mini-batch
• This adds some noise to the values $Z^{[l]}$ within that mini-batch, so, similar to dropout, it adds some noise to each hidden layer’s activations
• This has a slight regularization effect

### Batch Norm at test time

Estimate $\mu$ and $\sigma^2$ at test time with an exponentially weighted average computed across mini-batches during training.
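
A minimal sketch of both phases, assuming `z` holds one layer’s pre-activations as an $n \times m$ array (rows = units, columns = examples) and learnable `gamma`/`beta` of shape $(n, 1)$:

```python
import numpy as np

def bn_train_step(z, gamma, beta, run_mu, run_var, momentum=0.9, eps=1e-8):
    """Batch-norm forward on one mini-batch; update running mu/var by EWA."""
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    z_norm = (z - mu) / np.sqrt(var + eps)
    run_mu = momentum * run_mu + (1 - momentum) * mu
    run_var = momentum * run_var + (1 - momentum) * var
    return gamma * z_norm + beta, run_mu, run_var

def bn_test(z, gamma, beta, run_mu, run_var, eps=1e-8):
    """At test time, normalize with the running estimates instead of batch statistics."""
    return gamma * (z - run_mu) / np.sqrt(run_var + eps) + beta
```

At test time there may be only a single example, so the running estimates stand in for the per-mini-batch statistics.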