But which one should we use?
Each of these activation functions has its uses. However, ReLU combines the most useful properties of the group and is cheap to compute, so it should be your default choice most of the time.
It can be a good idea to switch to leaky ReLU if you frequently run into stuck gradients. However, you can usually lower the learning rate to help prevent this, or use leaky ReLU only in the earlier layers rather than in every layer, which preserves ReLU's advantage of having fewer activations overall across the network.
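To make the stuck-gradient issue concrete, here is a minimal sketch in plain Python comparing the gradients of ReLU and leaky ReLU. The function names and the slope `alpha=0.01` are assumptions chosen for illustration, not values taken from the text.

```python
# Illustrative sketch: names and alpha=0.01 are assumptions, not from the text.

def relu(x):
    """ReLU: passes positive inputs through, clamps the rest to zero."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: zero for non-positive inputs, so no update flows back."""
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: keeps a small slope (alpha) for non-positive inputs."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Gradient of leaky ReLU: never exactly zero, so the unit can still learn."""
    return 1.0 if x > 0 else alpha

inputs = [-2.0, -0.5, 0.5, 2.0]
relu_grads = [relu_grad(x) for x in inputs]
leaky_grads = [leaky_relu_grad(x) for x in inputs]

for x, g_relu, g_leaky in zip(inputs, relu_grads, leaky_grads):
    print(f"x={x:+.1f}  relu'={g_relu}  leaky_relu'={g_leaky}")
```

For negative inputs the ReLU gradient is exactly zero, which is how a unit can get stuck, while leaky ReLU still passes a small gradient. The trade-off is that leaky ReLU outputs are never exactly zero for negative inputs, so fewer activations are switched off.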
Sigmoid is most valuable in the output layer, particularly when the output should be read as a probability. The tanh function can also be valuable, for example, where we would like layers to constantly adjust values ...
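As a rough illustration of why sigmoid suits a probability output, the sketch below evaluates sigmoid and tanh on a few sample values using only Python's standard `math` module. The sample inputs and variable names are assumptions made for this example.

```python
import math

def sigmoid(x):
    """Sigmoid squashes any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical raw scores from a final layer, chosen only for illustration.
logits = [-4.0, 0.0, 4.0]

probs = [sigmoid(z) for z in logits]        # each value lies in (0, 1)
tanh_vals = [math.tanh(z) for z in logits]  # each value lies in (-1, 1)

for z, p, t in zip(logits, probs, tanh_vals):
    print(f"z={z:+.1f}  sigmoid={p:.4f}  tanh={t:+.4f}")
```

Sigmoid maps every input to a value between 0 and 1, which can be interpreted as a probability, while tanh is centred on zero and produces both negative and positive outputs.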