Notes For Deep Learning

2017-09-05 12:05:54 +0800

1 Neural Network framework:

  1. Initialize each parameter
  2. Repeat:
    • Forward propagation (Compute z, a, yPred, loss)
    • Backward propagation (Compute gradients - dz, da; dw, db), and update parameters (gradient descent)
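
The loop above can be sketched for the smallest possible network, a single sigmoid unit with one scalar input. This is only an illustration: the function name `train_logistic_unit`, the toy data, the cross-entropy loss and the hyperparameter values are assumptions made for this example, not something fixed by these notes.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_unit(xs, ys, alpha=0.5, steps=200):
    """Train one sigmoid unit on scalar inputs with batch gradient descent."""
    w, b = 0.0, 0.0                       # 1. initialize parameters
    m = len(xs)
    losses = []
    for _ in range(steps):                # 2. repeat
        # Forward propagation: z, a (= yPred), loss
        zs = [w * x + b for x in xs]
        a = [sigmoid(z) for z in zs]
        # Clip predictions slightly so log() never receives 0
        clipped = [min(max(p, 1e-12), 1 - 1e-12) for p in a]
        loss = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                    for y, p in zip(ys, clipped)) / m
        losses.append(loss)
        # Backward propagation: for sigmoid + cross-entropy, dz = a - y
        dz = [p - y for p, y in zip(a, ys)]
        dw = sum(d * x for d, x in zip(dz, xs)) / m
        db = sum(dz) / m
        # Gradient descent update
        w -= alpha * dw
        b -= alpha * db
    return w, b, losses


# Toy data (assumed): negative inputs are class 0, positive inputs are class 1
w, b, losses = train_logistic_unit([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1])
```

The recorded `losses` should shrink from step to step, which is a quick sanity check that the forward and backward passes agree with each other.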

Tips:

Activation function: A Non-linear function

  1. The tanh function ( $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$ ) almost always works better than the sigmoid function in hidden layers. It is a shifted and rescaled version of the sigmoid, $\tanh(z) = 2\sigma(2z) - 1$, that passes through (0,0). The mean of its output is closer to zero, so it centers the data better for the next layer. The sigmoid is still the usual choice for the output layer in binary classification, where the output must lie in (0, 1).
  2. When z is very negative or very positive, both sigmoid and tanh have a small slope, which slows down gradient descent. A common fix is to use ReLU (rectified linear unit), a = max(0, z).
  3. One disadvantage of ReLU is that its derivative is zero when z is negative. In practice it works just fine. A variant, Leaky ReLU (a = max(0.01z, z)), often works better but is not widely used.
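
To make the slope argument concrete, here is a small sketch of the four activations and their derivatives for a scalar z. The function names and the `slope` parameter are chosen for this example only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    return math.tanh(z)


def relu(z):
    return max(0.0, z)


def leaky_relu(z, slope=0.01):
    # Valid for 0 < slope < 1: picks z when z > 0, slope * z otherwise
    return max(slope * z, z)


def d_sigmoid(z):
    a = sigmoid(z)
    return a * (1.0 - a)


def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2


def d_relu(z):
    # The derivative at exactly z = 0 is undefined; 0 is used here by convention
    return 1.0 if z > 0 else 0.0


def d_leaky_relu(z, slope=0.01):
    return 1.0 if z > 0 else slope
```

Evaluating `d_sigmoid(10.0)` or `d_tanh(10.0)` gives a value very close to zero, while `d_relu(10.0)` stays at 1. This is the saturation effect described in item 2.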

Tips:

Gradient descent

Mini-batch Gradient Descent

Because mini-batch gradient descent makes a parameter update after seeing just a subset of examples, the direction of the update has some variance, and so the path taken by mini-batch gradient descent will “oscillate” toward convergence. Using momentum can reduce these oscillations. Momentum takes into account the past gradients to smooth out the update. We will store the ‘direction’ of the previous gradients in the variable $v$. Formally, this will be the exponentially weighted average of the gradient on previous steps.

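For each layer $l = 1, \dots, L$:

$$v_{dW^{[l]}} = \beta v_{dW^{[l]}} + (1 - \beta) dW^{[l]}, \qquad W^{[l]} = W^{[l]} - \alpha v_{dW^{[l]}}$$

$$v_{db^{[l]}} = \beta v_{db^{[l]}} + (1 - \beta) db^{[l]}, \qquad b^{[l]} = b^{[l]} - \alpha v_{db^{[l]}}$$

where L is the number of layers, $\beta$ is the momentum and $\alpha$ is the learning rate. The larger the momentum $\beta$, the smoother the update, because more of the past gradients are taken into account. But if $\beta$ is too big, it can smooth out the updates too much. Common values for $\beta$ range from 0.8 to 0.999 (0.9 is a common default). Some formulations omit the $(1-\beta)$ factor; in that case, $\beta$ and $\alpha$ need to be tuned together.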
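
A minimal sketch of one momentum update, assuming the parameters, gradients and velocities are stored as flat lists of floats. The name `momentum_step` and the alternating toy gradients are made up for this illustration.

```python
def momentum_step(params, grads, velocity, alpha=0.01, beta=0.9):
    """One gradient descent with momentum update, done in place."""
    for i, g in enumerate(grads):
        # Exponentially weighted average of past gradients
        velocity[i] = beta * velocity[i] + (1 - beta) * g
        # Step along the smoothed direction instead of the raw gradient
        params[i] -= alpha * velocity[i]
    return params, velocity


# Toy run (assumed data): the raw gradient flips between +1 and -1,
# mimicking the oscillation of mini-batch updates.
params, velocity = [1.0], [0.0]
history = []
for step in range(6):
    grad = [1.0 if step % 2 == 0 else -1.0]
    momentum_step(params, grad, velocity)
    history.append(velocity[0])
```

In this toy run the raw gradient has magnitude 1 at every step, but the values stored in `history` never exceed 0.1 in magnitude, which shows how the velocity damps the oscillation.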

The Adam optimization algorithm combines momentum and RMSprop (root mean square propagation).

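For each layer $l = 1, \dots, L$:

$$v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1 - \beta_1) dW^{[l]}, \qquad v^{corrected}_{dW^{[l]}} = \frac{v_{dW^{[l]}}}{1 - \beta_1^t}$$

$$s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1 - \beta_2) (dW^{[l]})^2, \qquad s^{corrected}_{dW^{[l]}} = \frac{s_{dW^{[l]}}}{1 - \beta_2^t}$$

$$W^{[l]} = W^{[l]} - \alpha \frac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}} + \varepsilon}$$

where t counts the number of Adam steps taken, L is the number of layers, $\beta_1$ and $\beta_2$ are hyperparameters that control the two exponentially weighted averages, $\alpha$ is the learning rate, and $\varepsilon$ is a very small number that avoids division by zero. The square and the square root are taken element-wise, and the biases $b^{[l]}$ are updated in the same way.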
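
A minimal sketch of one Adam step in its standard form, assuming flat lists of floats for the parameters and for the two running averages. The name `adam_step` and the default values (the commonly quoted $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$) are assumptions of this example.

```python
import math


def adam_step(params, grads, v, s, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update, done in place. t is the 1-based step counter."""
    for i, g in enumerate(grads):
        # Momentum part: running average of the gradient
        v[i] = beta1 * v[i] + (1 - beta1) * g
        # RMSprop part: running average of the squared gradient
        s[i] = beta2 * s[i] + (1 - beta2) * g * g
        # Bias correction, which matters most during the first steps
        v_hat = v[i] / (1 - beta1 ** t)
        s_hat = s[i] / (1 - beta2 ** t)
        params[i] -= alpha * v_hat / (math.sqrt(s_hat) + eps)
    return params, v, s


params, v, s = adam_step([1.0], [4.0], [0.0], [0.0], t=1)
```

On the very first step the bias-corrected averages equal the gradient and its square, so the parameter moves by almost exactly $\alpha$ regardless of the gradient's magnitude.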

Regularization of Neural Network

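Regularization is usually used to reduce overfitting. With L2 regularization, a penalty on the weights is added to the cost function:

$$J_{regularized} = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L} \lVert W^{[l]} \rVert_F^2$$

where,

$$\lVert W^{[l]} \rVert_F^2 = \sum_{k} \sum_{j} \left( W^{[l]}_{k,j} \right)^2$$

m is the number of training examples and $\lambda$ is the regularization strength.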

By penalizing the squared values of the weights in the cost function, regularization drives all the weights toward smaller values. This leads to a smoother model in which the output changes more slowly as the input changes.
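
As a rough sketch of the squared-weight penalty, assume each layer's weights are stored as a list of rows, the penalty is scaled by `lam / (2 * m)`, `lam` is the regularization strength and `m` is the number of training examples. The function names and this particular scaling are assumptions of the example.

```python
def l2_penalty(weights, lam, m):
    """Penalty added to the cost: (lam / (2m)) * sum of all squared weights."""
    total = sum(w * w for W in weights for row in W for w in row)
    return lam / (2 * m) * total


def l2_grad_term(W, lam, m):
    """Extra term added to dW for one layer: (lam / m) * W."""
    return [[lam / m * w for w in row] for row in W]


# Two toy layers (assumed values)
weights = [[[1.0, -2.0], [0.5, 0.0]], [[3.0]]]
penalty = l2_penalty(weights, lam=0.7, m=10)
extra_dW2 = l2_grad_term(weights[1], lam=0.7, m=10)
```

Because the extra gradient term is proportional to the weight itself, every gradient descent step shrinks each weight slightly toward zero, which is why L2 regularization is also called weight decay.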

Tuning Parameters