A machine learning problem can also be analyzed in terms of information transfer or exchange. Our dataset is composed of *n* features, which are considered independent (an assumption made for simplicity, even though it is not always realistic) and drawn from *n* different statistical distributions. Therefore, there are *n* probability density functions *p_i(x)* which must be approximated through *n* other functions *q_i(x)*. In any machine learning task, it's very important to understand how two corresponding distributions diverge and how much information we lose when approximating the original dataset.
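
To make this concrete, the quantity commonly used to measure how a distribution *q* diverges from the distribution *p* it approximates is the Kullback-Leibler divergence. The following minimal NumPy sketch (the probability values are made up purely for illustration) computes it in bits for two discrete distributions:

```python
import numpy as np

# Hypothetical true distribution p and its approximation q
# (illustrative values; both sum to 1 over the same discrete support).
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

# Kullback-Leibler divergence D(p || q) in bits: the average number of
# extra bits needed to encode samples drawn from p with a code optimized for q.
kl = np.sum(p * np.log2(p / q))
print(kl)  # ~0.12 bits lost by using q in place of p
```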

The most useful measure is called **entropy**:

$$H(X) = -\sum_{x} p(x) \log_2 p(x)$$

This value is proportional to the uncertainty of *X* and it's measured in **bits** (if the logarithm is taken in base 2; with a different base, the unit changes accordingly).
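
As a minimal sketch of the formula above, the following NumPy snippet (using an illustrative uniform distribution over four outcomes) computes the entropy in bits:

```python
import numpy as np

# Probability mass function of a discrete random variable X
# (illustrative values: a uniform distribution over 4 outcomes).
p = np.array([0.25, 0.25, 0.25, 0.25])

# H(X) = -sum_x p(x) * log2(p(x)); the base-2 logarithm gives bits.
h = -np.sum(p * np.log2(p))
print(h)  # 2.0 -> two bits encode one of four equally likely outcomes
```

The same result can be obtained with `scipy.stats.entropy(p, base=2)`, which also normalizes `p` when it doesn't sum to 1.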