Because of the form of the ReLU activation function, it returns a value of exactly zero much more often than the sigmoid function. We consider this behavior a type of sparsity. This sparsity tends to speed up convergence, but at the cost of less well-controlled gradients. The sigmoid function, on the other hand, has well-controlled gradients and does not risk the extreme output values that ReLU does, as summarized in the following table:
| Activation function | Advantages | Disadvantages |
| --- | --- | --- |
| Sigmoid | Less extreme outputs | Slower convergence |
| ReLU | Converges quicker | Extreme output values possible |
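As a quick illustration of the sparsity and output-range differences, here is a minimal NumPy sketch; the standard-normal sampling and sample size are illustrative assumptions, not part of the original text:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

# Sample pre-activations from a standard normal distribution (illustrative choice).
x = np.random.randn(10000)

# ReLU zeroes out roughly half of these inputs, producing sparse activations;
# sigmoid squashes everything into (0, 1), so no output is exactly zero.
print("Fraction of zeros, ReLU:   ", np.mean(relu(x) == 0.0))     # ~0.5
print("Fraction of zeros, sigmoid:", np.mean(sigmoid(x) == 0.0))  # 0.0

# ReLU is unbounded above, so extreme output values are possible;
# sigmoid output never exceeds 1.
print("Max output, ReLU:   ", relu(x).max())
print("Max output, sigmoid:", sigmoid(x).max())
```

Running this shows ReLU returning zero for about half of the sampled inputs while sigmoid never does, and ReLU's maximum output growing with the largest input while sigmoid stays below 1, matching the advantages and disadvantages listed above.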