RNNs and the vanishing/exploding gradient problem
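By the chain rule, the gradient that reaches layers far from the output (in an RNN, the earlier time steps) is a product of many factors, including the derivatives of the activation functions at every layer in between. When those factors are small or zero, the product shrinks toward zero and the gradient vanishes. When they are larger than 1, the product can grow exponentially and the gradient explodes. In both cases it becomes very hard to compute a useful weight update.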
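For a recurrent network, this product can be written out explicitly. The following is only a sketch under an assumed standard formulation that the text does not spell out: a hidden state $h_t = \phi(a_t)$ with pre-activation $a_t = W h_{t-1} + U x_t + b$. The symbols $W$, $U$, $b$, $\phi$, and the bound $\gamma$ are notation introduced here for illustration.

```latex
\frac{\partial h_T}{\partial h_k}
  = \prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}}
  = \prod_{t=k+1}^{T} \operatorname{diag}\!\big(\phi'(a_t)\big)\, W,
\qquad
\left\| \frac{\partial h_T}{\partial h_k} \right\|
  \le \big(\gamma \, \|W\|\big)^{T-k}
  \quad \text{if } |\phi'(z)| \le \gamma \text{ for all } z .
```

The matrix factors are ordered from $t = T$ on the left down to $t = k+1$ on the right. If $\gamma \|W\| < 1$, the bound shrinks exponentially in $T - k$, so the gradient must vanish. If $\gamma \|W\| > 1$, the bound no longer rules out exponential growth, which is the exploding case.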
Let's explain them in more detail:
- If the weights are small, the gradient signal can become so small that learning either becomes very slow or stops working altogether. This is referred to as vanishing gradients.
- If the weights are large, the gradient signal can become so large that it can cause ...
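The two cases in the list above can be reproduced with a small numerical experiment. This is a minimal sketch, not a real RNN: it assumes a purely linear recurrence (no activation function) and a weight matrix built as `scale` times a random orthogonal matrix, so that each backward step multiplies the gradient norm by exactly `scale`. The function name `backprop_gradient_norms` and its parameters are hypothetical, chosen only for this illustration.

```python
import numpy as np


def backprop_gradient_norms(scale, steps=50, size=16, seed=0):
    """Return the gradient norm after each backward step through time.

    Assumption for illustration: a linear recurrence h_t = W h_{t-1},
    with W = scale * Q and Q orthogonal, so the norm changes by a
    factor of `scale` per step.
    """
    rng = np.random.default_rng(seed)
    # QR decomposition of a random matrix gives an orthogonal Q.
    q, _ = np.linalg.qr(rng.standard_normal((size, size)))
    w = scale * q

    grad = np.ones(size)  # gradient arriving at the last time step
    norms = []
    for _ in range(steps):
        # One step of backpropagation through time: multiply by W^T.
        grad = w.T @ grad
        norms.append(float(np.linalg.norm(grad)))
    return norms


if __name__ == "__main__":
    for scale in (0.5, 1.0, 1.5):
        final_norm = backprop_gradient_norms(scale)[-1]
        print(f"scale={scale}: gradient norm after 50 steps = {final_norm:.3e}")
```

With `scale` below 1 the norm collapses toward zero within a few dozen steps, with `scale` above 1 it grows by many orders of magnitude, and only `scale` equal to 1 keeps it stable. A real RNN adds the activation derivative to every factor, which in general shrinks the product further.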