We minimize the loss function using the gradient descent algorithm: we backpropagate through the network, compute the gradient of the loss function with respect to the weights, and update the weights according to the weight update rule.
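The weight update rule can be sketched on a toy one-parameter problem; the loss L(w) = (w − 3)² and the learning rate alpha are illustrative assumptions, not from the text, but the update w ← w − α·∂L/∂w is exactly the rule described above.

```python
# Minimal sketch of gradient descent on a one-parameter loss
# L(w) = (w - 3)^2, whose gradient dL/dw = 2*(w - 3) is known in closed form.
# The loss and learning rate here are illustrative choices.

def grad(w):
    # Gradient of the illustrative loss L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

alpha = 0.1   # learning rate (hypothetical value)
w = 0.0       # initial weight

for _ in range(100):
    # Weight update rule: move against the gradient
    w = w - alpha * grad(w)

# After enough steps, w approaches the minimizer w* = 3.
```

Each step shrinks the distance to the minimum by a constant factor (here 1 − 2α = 0.8), which is why the iterate converges toward w* = 3.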
First, we compute the gradient of the loss with respect to the hidden-to-output layer weights. We cannot calculate the derivative of the loss with respect to these weights directly, as the loss function has no explicit term containing them, so we apply the chain rule.
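The chain-rule decomposition for the hidden-to-output weights can be sketched as follows. The names (`h`, `W_hy`, `y`), the sigmoid output, and the squared-error loss are illustrative assumptions; the point is that the gradient factors as ∂L/∂W = ∂L/∂ŷ · ∂ŷ/∂z · ∂z/∂W, and we can verify the result against a numerical gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
h = rng.normal(size=(1, 4))      # hidden-layer activations (assumed given)
W_hy = rng.normal(size=(4, 1))   # hidden-to-output weights
y = np.array([[1.0]])            # target

z = h @ W_hy                     # output pre-activation
y_hat = sigmoid(z)               # network output

# Chain rule: dL/dW_hy = dL/dy_hat * dy_hat/dz * dz/dW_hy
dL_dyhat = y_hat - y             # from L = 0.5 * (y_hat - y)^2
dyhat_dz = y_hat * (1 - y_hat)   # sigmoid derivative
grad_W_hy = h.T @ (dL_dyhat * dyhat_dz)

# Sanity check: finite-difference estimate for one entry of the gradient
loss = lambda W: 0.5 * float((sigmoid(h @ W) - y) ** 2)
eps = 1e-6
W_pert = W_hy.copy()
W_pert[0, 0] += eps
num_grad = (loss(W_pert) - loss(W_hy)) / eps
```

The analytic entry `grad_W_hy[0, 0]` and the finite-difference estimate `num_grad` agree to within the discretization error, which is the usual way to check a hand-derived chain-rule gradient.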