The vanishing gradient problem can be addressed in several ways:
- One method that overcomes this problem to some extent is the ReLU activation function. It computes f(x) = max(0, x), i.e., it simply thresholds its input at zero. Its derivative is exactly 1 for every positive input, so, unlike saturating activations such as the sigmoid, it does not shrink the gradient as it passes back through active units.
- Another way to overcome this problem is to pre-train each layer separately with unsupervised learning and then fine-tune the entire network through backpropagation, as done by Jürgen Schmidhuber in his study of multi-level hierarchy in neural networks. The link to this paper is provided in the following section.
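- A third solution to this problem is the use of long short-term memory (LSTM) networks, whose gated cell state lets gradients flow across many time steps.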
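
To make the ReLU point above concrete, here is a minimal sketch in plain Python. It is a toy illustration under strong assumptions, not a real network: every layer is assumed to have a weight of 1 and to see the same pre-activation value, so the backpropagated gradient reduces to a product of identical local derivatives. The helper name `chained_gradient` and the depth of 20 are illustrative choices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid; its maximum value is 0.25, reached at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

def chained_gradient(grad_fn, pre_activation, depth):
    """Multiply `depth` identical local derivatives together.

    Toy assumption: all weights are 1 and every layer sees the same
    pre-activation, so only the activation derivatives matter.
    """
    g = 1.0
    for _ in range(depth):
        g *= grad_fn(pre_activation)
    return g

depth = 20
# Best case for the sigmoid (x = 0): the product is still 0.25 ** 20.
sigmoid_chain = chained_gradient(sigmoid_grad, 0.0, depth)
# An active ReLU unit (x > 0) passes the gradient through unchanged.
relu_chain = chained_gradient(relu_grad, 1.0, depth)

print(f"sigmoid chain over {depth} layers: {sigmoid_chain:.3e}")
print(f"ReLU chain over {depth} layers:    {relu_chain:.3e}")
```

Even in the sigmoid's most favourable case the gradient shrinks by a factor of four per layer, while an active ReLU unit leaves it unchanged. A ReLU unit whose input is not positive has a derivative of zero, which is why the method helps only to some extent.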