Overcoming vanishing gradient

From the preceding explanation of the vanishing gradient, it follows that the root cause of the problem is the choice of the sigmoid function as the activation function. A similar problem arises when tanh is chosen as the activation function.
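
To make this concrete, here is a minimal NumPy sketch (not from the book; the function names are illustrative) showing why the sigmoid is problematic: its derivative never exceeds 0.25, so the product of many such factors during backpropagation shrinks toward zero.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # peaks at 0.25 when x = 0

# Even in the best case, a chain of 10 layers multiplies the gradient
# by at most 0.25 ** 10, which is roughly 1e-6.
print(sigmoid_derivative(0.0))        # 0.25
print(sigmoid_derivative(0.0) ** 10)  # ~9.5e-07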

To counter this scenario, the ReLU (rectified linear unit) function comes to the rescue:

ReLU(x) = max(0, x)

If the input is negative (less than zero), the function outputs zero. If the input is greater than zero, the output is equal to the input.
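
A minimal sketch of this forward pass in NumPy (an illustrative implementation, not the book's code):

import numpy as np

def relu(x):
    # Element-wise max(0, x): negative inputs become 0,
    # positive inputs pass through unchanged.
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))
# [0.  0.  0.  1.5 3. ]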

Let's take the derivative of this function and see what happens:

Case 1: x < 0: ReLU'(x) = 0

Case 2: x > 0: ReLU'(x) = 1

If we put the two cases together, the derivative of ReLU is either 0 or 1. For every unit that receives a positive input, the local gradient is exactly 1, so repeated multiplication during backpropagation no longer shrinks the gradient toward zero and the vanishing gradient problem is avoided.
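
The following sketch (again illustrative; relu_derivative is an assumed helper name) shows the derivative values and why the gradient survives a deep chain of active ReLU units:

import numpy as np

def relu_derivative(x):
    # 1 for positive inputs, 0 otherwise (the point x = 0 is
    # conventionally assigned 0 here).
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.5, 3.0])
print(relu_derivative(x))   # [0. 0. 1. 1.]

# For active units the local gradient is exactly 1, so a chain of
# 10 layers still passes the gradient through at full strength:
print(1.0 ** 10)            # 1.0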
