A single-layer non-linear unit still has limited capacity for the input-output transformations it can learn. This can be illustrated with the XOR problem, in which we want a neural network model to learn the XOR function. The XOR function takes two Boolean inputs and outputs 1 if they differ, or 0 if they are identical.
We can think of it as a pattern-classification problem with input patterns X = {(0, 0), (0, 1), (1, 0), (1, 1)}. The first and fourth patterns are in class 0 and the other two are in class 1. Let's treat this as a regression problem with Mean Squared Error (MSE) loss and try to model it with a linear unit. Solving it analytically, we arrive at the desired weights : and bias: ...
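To make the analytic solution concrete, here is a minimal sketch that fits a linear unit to the four XOR patterns by solving the least-squares normal equations exactly. The helper names `solve_linear_system` and `fit_linear_unit` are hypothetical and introduced only for illustration, and the normal-equation approach is one assumed way of "solving it analytically", not necessarily the derivation intended above.

```python
from fractions import Fraction

# XOR training set: the four input patterns and their targets.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]


def solve_linear_system(A, b):
    """Solve A x = b by Gauss-Jordan elimination using exact fractions.

    Hypothetical helper; assumes A is square and invertible.
    """
    n = len(A)
    # Build the augmented matrix [A | b].
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and swap it up.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        # Scale the pivot row so the pivot entry becomes 1.
        pivot_value = M[col][col]
        M[col] = [v / pivot_value for v in M[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [row[n] for row in M]


def fit_linear_unit(X, y):
    """Least-squares fit of f(x) = w . x + b via the normal equations."""
    # Append a constant 1 to each input so the bias is learned as a weight.
    rows = [list(x) + [1] for x in X]
    d = len(rows[0])
    # Normal equations: (A^T A) theta = A^T y.
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    Aty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(d)]
    *w, b = solve_linear_system(AtA, Aty)
    return w, b


w, b = fit_linear_unit(X, y)
predictions = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in X]
mse = sum((p - t) ** 2 for p, t in zip(predictions, y)) / len(y)

print("weights:", w, "bias:", b)
print("predictions:", predictions, "MSE:", mse)
```

Under these assumptions the fit returns zero for both weights and 1/2 for the bias, so the linear unit predicts 1/2 for every pattern and cannot separate the two classes.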