We just learned how BPTT works, and we saw how the gradient of the loss can be computed with respect to all the weights in an RNN. But here we encounter a problem known as vanishing and exploding gradients.
While computing the derivatives of the loss with respect to the weights W and U, we saw that we have to traverse all the way back to the first hidden state, as each hidden state at a time step t depends on its previous hidden state at time step t-1. Because the chain rule contributes one derivative term per time step, the overall gradient becomes a long product of such terms: when those terms are consistently smaller than one the product shrinks toward zero (the gradient vanishes), and when they are consistently larger than one it grows without bound (the gradient explodes).
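To see this numerically, here is a minimal NumPy sketch (not from the original text) that mimics one BPTT step of a vanilla RNN with h_t = tanh(U x_t + W h_{t-1}): at each step the incoming gradient is multiplied by the tanh derivative and by W transposed. The hidden size, number of steps, and the scale values used to initialize W are all illustrative assumptions, chosen only to show the three regimes.

```python
import numpy as np

np.random.seed(0)

hidden_size = 50
T = 50  # number of time steps to backpropagate through

# Hypothetical recurrent weight matrices W at three scales, to show
# the vanishing, roughly stable, and exploding regimes.
for scale in (0.5, 1.0, 2.0):
    W = np.random.randn(hidden_size, hidden_size) * scale / np.sqrt(hidden_size)
    grad = np.random.randn(hidden_size)  # gradient arriving at the last hidden state

    for t in range(T):
        h = np.random.randn(hidden_size)  # stand-in pre-activation at this step
        # One BPTT step for h_t = tanh(U x_t + W h_{t-1}):
        # multiply by diag(1 - tanh(h)^2) (the tanh derivative), then by W^T.
        grad = W.T @ ((1 - np.tanh(h) ** 2) * grad)

    print(f"scale={scale}: gradient norm after {T} steps = {np.linalg.norm(grad):.3e}")
```

Running this, the small-scale W drives the gradient norm toward zero while the large-scale W blows it up, which is exactly the behavior the repeated chain-rule product predicts.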