Anyone who’s played the game “telephone” knows how important context is to the human brain. Taken by themselves, the words “cup” and “cop” are about equally likely to occur in most situations. However, if you’re at a loud party and you hear a word that you believe is either “cup” or “cop,” your brain will use the preceding context to decide which one it was. For example, if your new friend said, “Wash the,” the next word is most likely “cup.” However, if they said, “Run from the,” it might be “cop.”1
This is the basic concept behind multicontext encoders. They take into account the last few observed symbols in order to identify the ideal number of bits for encoding the current symbol.
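To make the idea tangible, here is a minimal sketch in Python that compares the ideal code length of a symbol with and without one symbol of context. It is an illustration under stated assumptions, not a real encoder: the function names `order0_bits` and `order1_bits` and the sample sentence are made up for this example, and "ideal number of bits" is taken to mean the information content, -log2 of the symbol's probability.

```python
import math
from collections import Counter, defaultdict


def order0_bits(text):
    """Ideal bits per symbol when context is ignored: -log2 P(symbol)."""
    counts = Counter(text)
    total = len(text)
    return {sym: -math.log2(n / total) for sym, n in counts.items()}


def order1_bits(text):
    """Ideal bits per symbol given the previous symbol: -log2 P(symbol | prev)."""
    following = defaultdict(Counter)
    # Count how often each symbol follows each possible previous symbol.
    for prev, cur in zip(text, text[1:]):
        following[prev][cur] += 1
    bits = {}
    for prev, counts in following.items():
        total = sum(counts.values())
        bits[prev] = {sym: -math.log2(n / total) for sym, n in counts.items()}
    return bits


# Hypothetical sample text, chosen only so that "th" and "qu" occur often.
sample = "the quick thief thanked the queen then quit"

plain = order0_bits(sample)
ctx = order1_bits(sample)

print("bits for 'h' with no context:", round(plain["h"], 2))
print("bits for 'h' right after 't':", round(ctx["t"]["h"], 2))
print("bits for 'u' with no context:", round(plain["u"], 2))
print("bits for 'u' right after 'q':", round(ctx["q"]["u"], 2))
```

Because the sample is tiny, the context-conditioned probabilities are exaggerated compared with real text. The point is only the direction of the change: a symbol that is well predicted by its predecessor needs fewer bits than the same symbol costs when context is ignored.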
Perhaps a more concrete example is how symbol pairs influence the probability of subsequent letters in the English language.
For example, in “typical” English text, we expect to see the letter “h” about 5% of the time, on average. However, if the current symbol is the letter “t”, there is a much higher probability, about 30%, that the next symbol will be “h”, because the pair “th” is common in English. Similarly, the letter “u” has a general probability of about 2%. When a “q” is encountered, however, the probability is more than 99% that the next letter will be a “u”. In this case, the current symbol “q” predicts that the next letter will be “u”, so that “u” can be assigned fewer bits. This type of adjacency, based on statistical observance, has also dubbed this group of ...