Weight initialization

One key component of training deep networks is random weight initialization. It matters because activation functions such as sigmoid and ReLU produce meaningful outputs and gradients only if their inputs fall within a certain range.

A famous example is the vanishing gradient problem. To understand it, let's take an FC layer with sigmoid activation (this example is also valid for tanh). We saw the sigmoid's graph (blue) and its derivative (green) in the Activation functions section. If the weighted sum of the inputs falls roughly outside the (-5, 5) range, the sigmoid activation will be effectively 0 or 1. In essence, it saturates. This becomes visible during the backward pass, where we differentiate the sigmoid: its derivative, σ'(x) = σ(x)(1 − σ(x)), is close to 0 whenever the activation saturates, so almost no gradient flows back through that unit, and the weights of the preceding layers barely get updated.
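The following is a minimal numerical sketch (not from the book) of how the scale of the initial weights controls saturation. The layer sizes, the random seed, and the two standard deviations compared (a naive std of 1 versus a variance-scaled std of 1/sqrt(n_in)) are illustrative assumptions:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # derivative of the sigmoid

rng = np.random.default_rng(0)
n_in, n_out = 1024, 1024
x = rng.standard_normal((1, n_in))  # a single input sample

# Compare a naive initialization with a variance-scaled one
for std in (1.0, 1.0 / np.sqrt(n_in)):
    w = rng.standard_normal((n_in, n_out)) * std
    z = x @ w  # pre-activations (weighted sums of the inputs)
    saturated = np.mean(np.abs(z) > 5)    # fraction outside the (-5, 5) range
    mean_grad = np.mean(sigmoid_grad(z))  # average local gradient
    print(f"std={std:.4f}  saturated={saturated:.2%}  mean sigmoid'={mean_grad:.4f}")

With std=1, the weighted sums have a standard deviation of roughly sqrt(n_in) = 32, so almost all of them fall outside (-5, 5) and the average sigmoid derivative is near 0; scaling the weights by 1/sqrt(n_in) keeps the pre-activations in the non-saturated region, which is the intuition behind variance-scaling initialization schemes.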
