Activation Function
Activation Function for Non-linearity |
---|
→ A linear model produces output of the form y = w1 * x1 + w2 * x2 + w3 * x3. |
→ Stacking another linear layer on top only substitutes each group of weights for a new weight. |
→ The result is exactly the same linear model as before, despite adding a hidden layer of neurons. |
- The first neuron of the hidden layer on the left takes weighted inputs from all three input nodes.
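A minimal NumPy sketch (my own illustration, not from the notes) of this collapse: without a non-linearity, the two weight matrices of a two-layer network multiply into a single weight matrix, so the network is still one linear model.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))          # 5 samples, 3 input features

W1 = rng.normal(size=(3, 4))         # input -> hidden (no activation)
W2 = rng.normal(size=(4, 1))         # hidden -> output

two_layer = x @ W1 @ W2              # "deep" model without a non-linearity
one_layer = x @ (W1 @ W2)            # single linear model with the combined weights

print(np.allclose(two_layer, one_layer))   # True: the hidden layer added nothing

# Adding a non-linearity (e.g. ReLU) between the layers breaks this collapse:
nonlinear = np.maximum(x @ W1, 0) @ W2
print(np.allclose(two_layer, nonlinear))   # False: the model is now genuinely non-linear
```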
Sigmoid/Tanh → Vanishing Gradient → ReLU → Dying ReLU → ReLU variants
- Nonlinear activation functions introduce this non-linearity, with sigmoid and hyperbolic tangent (a scaled and shifted sigmoid) being some of the earliest.
→ Both saturate, which leads to the vanishing gradient problem: with near-zero gradients, the model's weights don't update and training halts.
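A quick sketch of saturation (my own, assuming the standard sigmoid s(x) = 1 / (1 + e^-x) with derivative s(x) * (1 - s(x))): for large |x| the gradient is essentially zero, and in a deep network these tiny factors multiply layer by layer, which is what stalls the weight updates.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)             # derivative of the sigmoid

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}  gradient = {sigmoid_grad(x):.6f}")
# x =   0.0  gradient = 0.250000
# x =   2.0  gradient = 0.104994
# x =   5.0  gradient = 0.006648
# x =  10.0  gradient = 0.000045   <- saturated: almost no signal left to backpropagate
```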
ReLU: Rectified Linear Unit
- One of our favorites because it's simple and works well.
- Networks with ReLU hidden activations often train about 10 times faster than networks with sigmoid hidden activations.
- In the positive domain it is linear, so it doesn't saturate.
- In the negative domain its output is 0, which is why ReLU layers can end up dying:
  - When a neuron's inputs fall in the negative domain, its activation is zero, which doesn't help the next layer get its inputs into the positive domain; this compounds and creates a lot of zero activations.
  - During backpropagation, when updating the weights we multiply the error's derivative by the activation (which is zero), so we end up with a gradient of zero, a weight update of 0, and thus the weights don't change and training fails for that layer.
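A small sketch of the dying-ReLU mechanism described above (my own illustration): once a unit's pre-activations are all negative, both its output and its gradient are zero, so the unit's weights stop receiving updates.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    return (x > 0).astype(float)     # 1 in the positive domain, 0 elsewhere

# Pre-activations of one "dead" unit across a batch: all negative.
z = np.array([-3.2, -0.7, -1.5, -4.1])

print(relu(z))        # [0. 0. 0. 0.]  -> only zeros reach the next layer
print(relu_grad(z))   # [0. 0. 0. 0.]  -> zero gradient, so the weight update is zero
```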
Many different ReLU variants |
---|
They slightly modify the ReLU so the gradient is not zero in the negative domain, avoiding the "dying ReLU" problem (e.g. Leaky ReLU). |
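As a sketch of one such variant, Leaky ReLU (mentioned in this post's title) keeps a small slope in the negative domain; the slope value 0.01 below is just a common default, not something specified in these notes.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # small slope alpha in the negative domain instead of a hard zero
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

z = np.array([-3.2, -0.7, 1.5, 4.1])
print(leaky_relu(z))        # [-0.032 -0.007  1.5    4.1  ]
print(leaky_relu_grad(z))   # [0.01 0.01 1.   1.  ]  -> negative inputs still pass a gradient
```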
3 common failure modes for gradient descent
Problem | Insight | Solution |
---|---|---|
1. Gradients can vanish | Each additional layer can successively reduce signal vs. noise | Using ReLU instead of sigmoid/tanh can help |
2. Gradients can explode | Learning rates are important here | Batch normalization (a useful knob) can help |
3. ReLU layers can die | Monitor the fraction of zero weights in TensorBoard | Lower your learning rate |
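For the third failure mode, the statistic the table suggests watching is easy to compute yourself. This is a rough sketch (my own; the helper name and threshold are assumptions) of the fraction of ReLU units that are zero for an entire batch, which is the kind of number you would log to TensorBoard.

```python
import numpy as np

def dead_fraction(activations, tol=1e-7):
    """Fraction of units whose ReLU output is (near) zero for every example in the batch."""
    dead_per_unit = np.all(np.abs(activations) < tol, axis=0)  # shape: (num_units,)
    return dead_per_unit.mean()

rng = np.random.default_rng(1)
z = rng.normal(loc=-3.0, scale=1.0, size=(256, 64))  # pre-activations strongly biased negative
a = np.maximum(z, 0.0)                               # ReLU activations of one hidden layer

print(f"dead units: {dead_fraction(a):.1%}")
# With pre-activations this skewed, a large fraction of units never fires;
# if this number keeps climbing during training, lowering the learning rate is the usual fix.
```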