Let's suppose we have a two-class classification problem; that is, we need to predict the value of a binary outcome variable, y. In probabilistic terms, the outcome, y, is Bernoulli-distributed conditioned on the feature, x. The neural network needs to predict the probability P(y = 1 | x). For the output of the network to be a valid probability, it must lie in [0, 1]. We achieve this with a sigmoid activation function, which gives us a non-linear logistic unit.
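As a minimal sketch of such a logistic unit, the snippet below applies a sigmoid to a weighted sum of the features; the particular weight vector `w`, bias `b`, and input `x` are illustrative values, not taken from the text:

```python
import numpy as np

def sigmoid(z):
    # Squash any real-valued input into (0, 1), so the output
    # can be interpreted as a probability P(y = 1 | x)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and input for illustration only
w = np.array([0.5, -1.2])  # weights of the logistic unit
b = 0.1                    # bias term
x = np.array([2.0, 0.5])   # a single feature vector

# The logistic unit: linear combination followed by the sigmoid
p = sigmoid(np.dot(w, x) + b)
print(p)  # always strictly between 0 and 1
```

Note that the sigmoid never outputs exactly 0 or 1 for finite inputs, which keeps the probabilities well-defined when we later take logarithms in the cost function.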
To learn the weights of the logistic unit, we first need a cost function and its derivatives. From a probabilistic point of view, the cross-entropy loss arises as the natural cost function if we want to maximize ...