It is also common practice to exclude terms during tokenization when their overall occurrence in the corpus is very low. For example, let's examine the least frequent terms in the corpus (notice the different ordering used here, which negates the count so that the results are returned in ascending order):
val orderingAsc = Ordering.by[(String, Int), Int](-_._2)
println(tokenCountsFilteredSize.top(20)(orderingAsc).mkString("\n"))
You will get the following results:
As we can see, there are many terms that ...
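The negated-count ordering used above can be tried out on a small in-memory collection, without a Spark cluster. The following is a minimal sketch rather than the pipeline from the text: the toy counts, the local `top` helper (a stand-in for the RDD method of the same name), and the `minCount` cutoff are all assumptions made for illustration.

```scala
object RareTermsSketch {
  // Hypothetical toy term counts; not taken from the corpus in the text.
  val tokenCounts: Seq[(String, Int)] = Seq(
    ("spark", 120), ("cluster", 45), ("zyx", 1), ("qwerty", 2), ("model", 80)
  )

  // Negating the count reverses the natural order, so the "largest" elements
  // under this ordering are the terms with the smallest counts.
  val orderingAsc: Ordering[(String, Int)] =
    Ordering.by[(String, Int), Int](-_._2)

  // Local stand-in (assumption) for RDD.top(n)(ord): returns the n largest
  // elements under ord, largest first.
  def top[T](xs: Seq[T], n: Int)(ord: Ordering[T]): Seq[T] =
    xs.sorted(ord.reverse).take(n)

  // The two rarest terms, in ascending order of count.
  val rarest: Seq[(String, Int)] = top(tokenCounts, 2)(orderingAsc)

  // Assumed cutoff: keep only terms that occur at least minCount times.
  val minCount = 2
  val frequentEnough: Seq[(String, Int)] =
    tokenCounts.filter { case (_, count) => count >= minCount }

  def main(args: Array[String]): Unit = {
    println(rarest.mkString("\n"))
    println(frequentEnough.map(_._1).mkString(", "))
  }
}
```

Because `top` returns the largest elements under the supplied ordering, negating the count is what turns it into a "least frequent first" query. A suitable cutoff depends on the corpus, so the value of `minCount` here is only a placeholder.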