The weights w of a neural network are found by running a gradient-based optimization algorithm, such as stochastic gradient descent, that iteratively minimizes the loss or error (L) incurred by the network when making predictions over the training data. Mean squared error (MSE) and mean absolute error (MAE), and sometimes mean absolute percentage error (MAPE), are frequently used for regression tasks, while binary and categorical log loss are common loss functions for classification problems. Since time series forecasting is typically framed as a regression problem, MSE and MAE are apt choices for training the neural models.
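To make these loss functions concrete, here is a minimal NumPy sketch of MSE, MAE, and binary log loss. The function names, the `eps` clipping constant, and the sample values are illustrative assumptions, not part of any particular library.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)


def mae(y_true, y_pred):
    """Mean absolute error: average of absolute residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))


def binary_log_loss(y_true, p_pred, eps=1e-12):
    """Binary log loss for labels in {0, 1} and predicted probabilities.

    eps is an assumed small constant that keeps log() away from zero.
    """
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))


# Hypothetical actual and forecast values for four time steps
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

print(mse(y_true, y_pred))  # 1.3125
print(mae(y_true, y_pred))  # 0.875
```

In this example the single large residual (2.0) contributes 4.0 of the 5.25 total squared error but only 2.0 of the 3.5 total absolute error, which illustrates why MSE penalizes large errors more heavily than MAE.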
Gradient descent algorithms work by moving the weights, over iterations i, in the direction of the negative gradient of the loss. The gradient is the partial derivative of the loss function L with respect ...