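Now, we will compute the gradient of the loss with respect to the hidden to hidden layer weights, $W$. Similar to $V$, the final gradient is the sum of the gradients at all time steps:

$$\frac{\partial L}{\partial W} = \frac{\partial L_0}{\partial W} + \frac{\partial L_1}{\partial W} + \cdots + \frac{\partial L_{T-1}}{\partial W}$$

So, we can write:

$$\frac{\partial L}{\partial W} = \sum_{j=0}^{T-1} \frac{\partial L_j}{\partial W}$$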
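The claim that the final gradient is the sum of the per-time-step gradients can be checked numerically. The sketch below is illustrative only and is not the model used in this text: it assumes a scalar RNN with hidden state `h_t = tanh(u * x_t + w * h_{t-1})`, output `v * h_t`, and a squared-error loss at each step, where `w` plays the role of the hidden to hidden weight. All function and variable names are hypothetical.

```python
import math

# Assumed toy model (not from the text): scalar RNN with
#   h_t = tanh(u * x_t + w * h_{t-1}),  y_hat_t = v * h_t,
#   L_t = 0.5 * (y_hat_t - y_t)^2
def forward(w, u, xs, h_init=0.0):
    """Return the list of hidden states, one per time step."""
    hs, h_prev = [], h_init
    for x in xs:
        h_prev = math.tanh(u * x + w * h_prev)
        hs.append(h_prev)
    return hs

def total_loss(w, u, v, xs, ys):
    """Total loss: the sum of the per-step losses."""
    hs = forward(w, u, xs)
    return sum(0.5 * (v * h - y) ** 2 for h, y in zip(hs, ys))

def grad_step_loss_wrt_w(w, u, v, xs, ys, j, h_init=0.0):
    """Gradient of the step-j loss with respect to the shared weight w."""
    hs = forward(w, u, xs, h_init)
    delta = (v * hs[j] - ys[j]) * v              # dL_j / dh_j
    grad = 0.0
    for k in range(j, -1, -1):                   # walk back from step j to step 0
        h_before = hs[k - 1] if k > 0 else h_init
        d_pre = delta * (1.0 - hs[k] ** 2)       # back through tanh at step k
        grad += d_pre * h_before                 # direct use of w at step k
        delta = d_pre * w                        # pass the gradient on to h_{k-1}
    return grad

w, u, v = 0.5, 0.8, 1.2
xs = [1.0, -0.5, 0.3, 0.9]
ys = [0.2, 0.1, -0.4, 0.6]

# One gradient per time step, then their sum.
per_step_grads = [grad_step_loss_wrt_w(w, u, v, xs, ys, j) for j in range(len(xs))]
total_grad = sum(per_step_grads)

# Central finite difference on the total loss, for comparison.
eps = 1e-6
numeric_grad = (total_loss(w + eps, u, v, xs, ys)
                - total_loss(w - eps, u, v, xs, ys)) / (2 * eps)

print("per-step gradients:", per_step_grads)
print("summed gradient:   ", total_grad)
print("finite difference: ", numeric_grad)
```

In this toy setup, the summed per-step gradient should agree with the finite-difference estimate up to numerical error. The first per-step gradient is zero because the initial hidden state is assumed to be zero, so the loss at the first step does not depend on `w`.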
First, let's compute the gradient of the loss $L_0$ with respect to $W$, that is, $\frac{\partial L_0}{\partial W}$.
We cannot compute derivative ...