Because neural networks must model nonlinearity and greater complexity, the activation function should satisfy the following requirements:
- It should be differentiable; we will see why differentiation is needed in backpropagation (see the sketch after this list). It should not cause gradients to vanish.
- It should be simple and fast to compute.
- It should be zero-centered.
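As a minimal sketch of why differentiability matters, the snippet below shows the activation's derivative entering the backpropagation chain rule; the names `z` and `upstream_grad` are illustrative placeholders, not from the text:

```python
import numpy as np

# During backpropagation, the gradient arriving at a layer is multiplied
# by the derivative of that layer's activation (chain rule), so the
# activation must be differentiable.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # d/dz sigmoid(z)

z = np.array([-2.0, 0.0, 2.0])             # pre-activation values of a layer
upstream_grad = np.array([0.1, 0.1, 0.1])  # gradient from the layer above

# Chain rule: dL/dz = dL/da * f'(z)
local_grad = upstream_grad * sigmoid_grad(z)
print(local_grad)  # each component is scaled by sigmoid'(z)
```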
The sigmoid is the most widely used activation function, but it suffers from the following drawbacks:
- Because it is based on the logistic function, it requires evaluating an exponential, which is relatively expensive to compute.
- It causes gradients to vanish: for saturated neurons, almost no gradient signal passes backward (demonstrated in the sketch after this list).
- It converges slowly.
- It is not zero-centered; its outputs are always positive.
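To see the vanishing-gradient problem concretely, here is a small sketch (assuming NumPy) that evaluates the sigmoid derivative: it peaks at 0.25 and decays toward zero for saturated inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative is largest at z = 0 and shrinks toward zero as |z|
# grows, so saturated neurons pass almost no gradient backward.
for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}  sigmoid'(z) = {sigmoid_grad(z):.6f}")
# z =   0.0  sigmoid'(z) = 0.250000
# z =  10.0  sigmoid'(z) = 0.000045
```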
These drawbacks are solved by ReLU. ...
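As a minimal illustrative sketch (again assuming NumPy), ReLU and its gradient avoid the exponential entirely and do not saturate for positive inputs:

```python
import numpy as np

def relu(z):
    # A single elementwise comparison: no exponential to evaluate.
    return np.maximum(0.0, z)

def relu_grad(z):
    # The gradient is 1 for positive inputs and 0 otherwise, so positive
    # activations pass the upstream gradient through unattenuated.
    return (z > 0).astype(z.dtype)

z = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu(z))       # negative inputs are clamped to 0
print(relu_grad(z))  # 0 for the negative inputs, 1 for the positive ones
```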