Notes For Deep Learning

2017-09-05 12:05:54 +0800

1 Neural Network framework:

Initialization of each parameters
Repeat:
- Forward propagation (Compute z, a, yPred, loss)
- Backward propagation (Compute gradients - dz, da; dw, db), and update parameters (gradient descent)

Tips:

The weights $W^{[l]}$ should be initialized randomly to break symmetry.
If all $W^{[l]} <1$ or $W^{[l]} >1$ even slightly, the gradients will vanish/explode. We can reduce this problem by controlling the variance of $w$ during the initialization. For sigmoid, set $Var(w_i) = \frac{1}{n}$. For ReLu, set $Var(w_i) = \frac{2}{n^{[l-1]}}$ (He Initialization). For tanh, set $Var(w_i) =\sqrt{\frac{1}{n^{[l-1]}}} $ or $\sqrt{\frac{2}{n^{[l-1]} + n^{[l]}}}$.
Use Gradient Checking to debug. (See Details Course2 Week1)

Activation function: A Non-linear function

tanh function always works better than sigmoid function. tanh ( $tanh(z) = \frac{e^z - e^{-z}}{e^z +e^{-z}}$ ) function is a shifted version of sigmoid, but goes cross (0,0). The mean of its output is closer to zero, and so it centers the data better for the next layer.
When z is very small/large, both sigmoid & tanh has small slope. -> slow down gradient descent. -> Use RELU (rectified linear unit) (a = max(0,z)).
One disadvantage of RELU is that the derivative is equal to zero when z is negative (In practice, it works just fine. another version is Leaky ReLu (a = max(0.01z, z)), often works better, but not used widely).

Tips:

If you are doing binary classification, usually use sigmoid function as output layer.
On other layers RELU is often the default choice of activation function.

Gradient descent

Mini-batch Gradient Descent

Gradient Descent use all data to update gradient, mini-batch gradient descent use part and stochastic gradient descent uses only 1.
Tune a learning rate hyperparameter $\alpha$.
With a well-turned mini-batch size, usually it outperforms either gradient descent or stochastic gradient descent (particularly when the training set is large). (Powers of two are often chosen to be the mini-batch size, e.g., 16, 32, 64, 128.)

Because mini-batch gradient descent makes a parameter update after seeing just a subset of examples, the direction of the update has some variance, and so the path taken by mini-batch gradient descent will “oscillate” toward convergence. Using momentum can reduce these oscillations. Momentum takes into account the past gradients to smooth out the update. We will store the ‘direction’ of the previous gradients in the variable $v$. Formally, this will be the exponentially weighted average of the gradient on previous steps.

$\begin{cases} v_{dW^{[l]}} = \beta v_{dW^{[l]}} + (1 - \beta) dW^{[l]} \\ W^{[l]} = W^{[l]} - \alpha v_{dW^{[l]}} \end{cases}$

where L is the number of layers, $\beta$ is the momentum and $\alpha$ is the learning rate. The larger the momentum $\beta$ is, the smoother the update because the more we take the past gradients into account. But if $\beta$ is too big, it could also smooth out the updates too much. Common values for $\beta$ range from 0.8 to 0.999 (default 0.9). In some place, people might ignore $(1-\beta)$. In that case, need to tune $\beta$ and $\alpha$ together.

Adam optimization algorithm combine momentum and RMSprop (root mean square prop) together.

$\begin{cases} v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1 - \beta_1) \frac{\partial \mathcal{J} }{ \partial W^{[l]} } \\ v^{corrected}_{dW^{[l]}} = \frac{v_{dW^{[l]}}}{1 - (\beta_1)^t} \\ s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1 - \beta_2) (\frac{\partial \mathcal{J} }{\partial W^{[l]} })^2 \\ s^{corrected}_{dW^{[l]}} = \frac{s_{dW^{[l]}}}{1 - (\beta_1)^t} \\ W^{[l]} = W^{[l]} - \alpha \frac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}} + \varepsilon} \end{cases}$

where t counts the number of steps taken of Adam, L is the number of layers, $\beta_1$ and $\beta_2$ are hyperparameters that control the two exponentially weighted averages, $\alpha$ is the learning rate, $\varepsilon$ is a very small number to avoid dividing by zero.

Regularization of Neural Network

Regularization is usually used to reduce overfitting.

L2-regularization (with Frobenius Norm, also known as “weight decay”). Its cost function will be:

$J(w^{[1]}, b^{[1]}, \dots, w^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L} \sum_{l=1}^{L} ||w^{[l]}||^2_{F}$

where,

$||w^{[l]}||^2_{F} = \sum_{i=1}^{n^{[l-1]}} \sum_{j=1}^{n^{[l]}} (w_{ij}^{[l]})^2$

By penalizing the square values of the weights in the cost function, it drives all the weights to smaller values. This leads to a smoother model in which the output changes more slowly as the input changes.

Dropout Regularization: Go through each layer of the Neural Network, randomly eliminate the nodes with a given probability. [!$a^[i]$ need to be updated by dividing the probability, to keep the expected value; Dropout only implemented on training set, not testing set.] The idea behind drop-out is that at each iteration, you train a different model that uses only a subset of your neurons. With dropout, your neurons thus become less sensitive to the activation of one other specific neuron, because that other neuron might be shut down at any time.
Other methods: Data augmentation, Early stopping.

Tuning Parameters

In deep learning, we usually recommend that you: [From Coursera: Neural Networks and Deep Learning - HW1 ]
- Choose the learning rate that better minimizes the cost function.
- If your model overfits, use other techniques to reduce overfitting.