January 2019
Intermediate to advanced
386 pages
11h 13m
English
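Directly estimating the probability of a long sequence, say w1, ..., wm, is typically infeasible. The joint probability P(w1, ..., wm) can be decomposed by applying the following chain rule:

$$P(w_1, \ldots, w_m) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_m \mid w_1, \ldots, w_{m-1}) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1})$$
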
The probabilities of the later words given all the earlier words are especially difficult to estimate from the data, because any particular long history occurs rarely, if at all, in a corpus. That's why this joint probability is typically approximated by an independence assumption: the ith word depends only on the n-1 previous words. We then only model the joint probabilities of combinations of n sequential words, called n-grams. For example, in the phrase the quick brown fox, we have the following n-grams: ...
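
To make the n-gram idea concrete, here is a minimal Python sketch. It is an illustration rather than the book's own code: the helper names `ngrams` and `bigram_probability`, the whitespace tokenization, and the toy corpus are all assumptions made for this example. It extracts n-grams from a token list and estimates a bigram probability from raw counts (a maximum-likelihood estimate, with no smoothing).

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bigram_probability(tokens, prev, word):
    """Estimate P(word | prev) as count(prev, word) / count(prev as a context)."""
    bigram_counts = Counter(ngrams(tokens, 2))
    # Only positions that are followed by another token can act as a context.
    context_counts = Counter(tokens[:-1])
    if context_counts[prev] == 0:
        return 0.0  # unseen context; a real model would apply smoothing here
    return bigram_counts[(prev, word)] / context_counts[prev]


# Assumption: simple whitespace tokenization of the example phrase.
tokens = "the quick brown fox".split()
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams

# Toy corpus invented for this sketch: "quick" is followed once by "brown"
# and once by "red".
corpus = "the quick brown fox saw the quick red fox".split()
print(bigram_probability(corpus, "quick", "brown"))  # 0.5
```

With n = 2, each word is conditioned only on its immediate predecessor, so the probability of a sequence becomes a product of such bigram estimates. Counts for larger n become sparse quickly, which is why the choice of n trades context against reliable estimation.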