From the preceding explanation of the vanishing gradient problem, it follows that its root cause is choosing the sigmoid function as the activation function. A similar problem arises when tanh is chosen as the activation function.
To counter this scenario, the ReLU function comes to the rescue:
ReLU(x) = max(0, x)
If the input is less than or equal to zero, the function outputs zero. If the input is greater than zero, the output is equal to the input.
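As a minimal sketch (assuming NumPy, which is not shown in the text itself), the function can be implemented in a single line:

```python
import numpy as np

def relu(x):
    # Element-wise max(0, x): non-positive inputs become 0,
    # positive inputs pass through unchanged
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0.  0.  0.  1.5 3. ]
```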
Let's take the derivative of this function and see what happens:
Case 1: x < 0: the output is constant at 0, so the derivative is 0.
Case 2: x > 0: the output equals the input, so the derivative is 1.
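A short sketch of this piecewise derivative (again assuming NumPy; the name relu_derivative is ours, not from the text, and the point x = 0 is conventionally assigned a gradient of 0 here):

```python
import numpy as np

def relu_derivative(x):
    # 1 where x > 0, 0 where x <= 0
    return (x > 0).astype(float)

print(relu_derivative(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0. 0. 0. 1. 1.]
```

Because the derivative is exactly 1 for every positive input, repeated multiplication of these gradients during backpropagation does not shrink them toward zero the way repeated sigmoid derivatives (which are at most 0.25) do.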
If we ...