6Equipartition and Universality
When we deal with empirical data, such as texts in natural language, and we wish to predict them, often we do not know their exact probability distribution in advance. Consequently, we may wish to apply some statistical procedure to infer this distribution incrementally from the data while making predictions. It is not guaranteed in general, however, whether the loss that we incur in this case is similar to the loss that we would incur if we used the true distribution of the data‐generating process. It turns out that for stationary processes, there exist indeed effective procedures, called universal distributions, which are capable of approximating any unknown process with the minimal achievable loss – equal to the entropy rate of the respective ergodic component of the process. This fact is a consequence of the Birkhoff ergodic theorem.
The existence of universal distributions is of a great practical importance since, as it will become clear from Section 7.1, these distributions can be used as general purpose data‐compressors. From this perspective, computationally simpler universal procedures such as the Lempel–Ziv code (Ziv and Lempel, 1977) and irreducible grammar‐based codes (Kieffer and Yang, 2000) attract more attention in practice. While studying natural language, however, we would like to support a different point of view into universal distributions, namely, whether they admit some kind of a combinatorial interpretation resembling the ...
Get Information Theory Meets Power Laws now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.