Activation Function - ReLU

Course Summary - 4. Introduction to TensorFlow

Activation Function

Activation Function for Non-linearity
→ A linear model : output of the form y = w1 * x1 + w2 * x2 + w3 * x3
→ Substituting each group of weights for a new weight gives exactly the same linear model as before, despite adding a hidden layer of neurons.
- The first neuron of the hidden layer takes weighted inputs from all three input nodes (see the sketch below).
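A minimal NumPy sketch (hypothetical weights, not from the course) of why adding a hidden layer without an activation function collapses back into a single linear model:

```python
import numpy as np

# Hypothetical weights: 3 inputs -> 4 hidden neurons -> 1 output, no activation.
x = np.array([1.0, 2.0, 3.0])      # x1, x2, x3
W1 = np.random.randn(4, 3)         # input -> hidden weights
W2 = np.random.randn(1, 4)         # hidden -> output weights

y_two_layers = W2 @ (W1 @ x)       # hidden layer with no (i.e. linear) activation
W_combined = W2 @ W1               # "substitute each group of weights for a new weight"
y_one_layer = W_combined @ x

print(np.allclose(y_two_layers, y_one_layer))  # True: exactly the same linear model
```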

Sigmoid/Tanh → Vanishing Gradient → ReLU → Dying ReLU → its variants

  • Nonlinear activation functions, with sigmoid and hyperbolic tangent (a scaled and shifted sigmoid) being some of the earliest.
    Both saturate, which leads to the vanishing gradient problem:
    with gradients near zero, the model's weights don't update and training halts (see the gradient sketch after this list).
  • ReLU: Rectified Linear Unit is one of our favorites because it’s simple and works well.
    • Networks with ReLU hidden activations often train about 10 times faster than networks with sigmoid hidden activations.
    • In the positive domain : it is linear → no saturation.
    • In the negative domain : the output is always zero → ReLU layers can end up dying.
    • When inputs fall in the negative domain, the activation's output is zero, which doesn't help the next layer push its inputs into the positive domain: this compounds and creates a lot of zero activations.
    • During backpropagation, weight updates multiply the error's derivative by the activation's gradient; with a gradient of zero the weight update is zero, so the weights don't change and training fails for that layer.
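A small sketch (plain NumPy, hypothetical input values) comparing the sigmoid gradient, which saturates toward zero, with the ReLU gradient, which is 1 in the positive domain and 0 in the negative domain:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # saturates: ~0 for large |x| -> vanishing gradient

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # 1 in the positive domain, 0 in the negative domain

for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"x={x:6.1f}  sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.0f}")
```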
Many different ReLU variants slightly modify the ReLU to avoid the "dying ReLU" problem (see the sketch below).
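A minimal sketch (hypothetical input values) of ReLU variants available in tf.keras that keep a small, nonzero response in the negative domain, which helps avoid dying ReLU layers:

```python
import tensorflow as tf

x = tf.constant([[-2.0, -0.5, 0.0, 0.5, 2.0]])

print(tf.nn.relu(x))                   # plain ReLU: zero for all negative inputs
print(tf.nn.leaky_relu(x, alpha=0.2))  # Leaky ReLU: small negative slope instead of zero
print(tf.nn.elu(x))                    # ELU: smooth exponential curve for x < 0

# As a hidden-layer activation in a Keras model:
layer = tf.keras.layers.Dense(8, activation="elu")
```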

3 common failure modes for gradient descent

Problem                  | Insight                                                         | Solution
1. Gradients can vanish  | Each additional layer can successively reduce signal vs. noise  | Using ReLU instead of sigmoid/tanh can help
2. Gradients can explode | Learning rates are important here                               | Batch normalization (a useful knob) can help
3. ReLU layers can die   | Monitor the fraction of zero weights in TensorBoard             | Lower your learning rate
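A minimal Keras sketch (hypothetical model and learning rate, not from the course) showing two of the fixes from the table: a BatchNormalization layer as a knob against exploding gradients, and a lowered learning rate to reduce the chance of ReLU layers dying:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.BatchNormalization(),   # useful knob when gradients explode
    tf.keras.layers.Dense(1),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # lowered learning rate
    loss="mse",
)
```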