Next, let's investigate the tf-idf weighting for a few terms to illustrate how the commonality or rarity of a term affects its weight.
First, we can compute the minimum and maximum tf-idf weights across the entire corpus:
val minMaxVals = tfidf.map { v =>
  val sv = v.asInstanceOf[SV]
  (sv.values.min, sv.values.max)
}
val globalMinMax = minMaxVals.reduce { case ((min1, max1), (min2, max2)) =>
  (math.min(min1, min2), math.max(max1, max2))
}
println(globalMinMax)
As we can see, the minimum tf-idf is zero, while the maximum is significantly larger:
(0.0,66155.39470409753)
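A likely explanation for the zero minimum is the inverse document frequency factor. The following is a minimal sketch in plain Scala (no Spark required), assuming the smoothed formula idf(t) = log((m + 1) / (df(t) + 1)) described in the MLlib documentation, where m is the number of documents and df(t) is the number of documents containing term t. The corpus size, term frequencies, and helper names (idf, tfIdf) are hypothetical and chosen only for illustration. It can be pasted into the Scala REPL or the Spark shell:

```scala
// Illustrative sketch only: idf and tfIdf are hypothetical helpers,
// not part of the MLlib API. Assumes the smoothed IDF formula
// idf(t) = log((m + 1) / (df(t) + 1)).
def idf(numDocs: Long, docFreq: Long): Double =
  math.log((numDocs + 1.0) / (docFreq + 1.0))

// tf-idf weight of a term within a single document
def tfIdf(termFreq: Double, numDocs: Long, docFreq: Long): Double =
  termFreq * idf(numDocs, docFreq)

val numDocs = 1000L // hypothetical corpus size

// A term that appears in every document: the ratio is 1, so the weight is 0
val common = tfIdf(termFreq = 5.0, numDocs = numDocs, docFreq = 1000L)

// A term that appears in only one document gets a much larger weight
val rare = tfIdf(termFreq = 5.0, numDocs = numDocs, docFreq = 1L)

println(s"common term: $common")
println(s"rare term:   $rare")
```

Under this assumption, a term that occurs in every document receives a weight of exactly zero regardless of how often it appears, while the weight of a rare term grows with both its term frequency and its rarity. This is consistent with the wide range between the minimum and maximum seen above.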
We will now explore the tf-idf weight attached to various terms. In the previous section on stop words, we filtered out many common terms that occur ...