Aggregation at Scale

In a higher-level idiom, without the constraints on scale, we could easily sum up the per-year counts for each word with a map or dictionary. For example, in more verbose Scala we could do something like this:

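 import scala.collection.mutable

 val data: Seq[NGramData] = ???
 // Running total of counts, keyed by word
 val m = mutable.Map[String, Int]()
 for (d <- data) {
   if (m.contains(d.word)) {
     // Word already seen: add to its running total
     m(d.word) += d.count
   } else {
     // First occurrence: start the total at this count
     m(d.word) = d.count
   }
 }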

Or in an even more functional style, admittedly at some further cost in efficiency, we could do this:

 val g = data.groupBy(_.word).map { case (word, items) => (word, items.map(_.count).sum) }
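
To make the comparison concrete, the two styles can be placed side by side in a small self-contained sketch. The NGramData case class and its fields (word, year, count) are assumptions made for illustration, as are the names AggregationSketch, sumWithMutableMap, and sumWithGroupBy; the record type used elsewhere in the text may differ.

```scala
import scala.collection.mutable

// Hypothetical record type: the field names are assumed for illustration
final case class NGramData(word: String, year: Int, count: Int)

object AggregationSketch {
  // Imperative version: one pass, updating a mutable map in place
  def sumWithMutableMap(data: Seq[NGramData]): Map[String, Int] = {
    val totals = mutable.Map[String, Int]()
    for (d <- data) {
      // getOrElse supplies 0 the first time a word is seen
      totals(d.word) = totals.getOrElse(d.word, 0) + d.count
    }
    totals.toMap
  }

  // Functional version: group records by word, then sum each group
  def sumWithGroupBy(data: Seq[NGramData]): Map[String, Int] =
    data.groupBy(_.word).map { case (word, items) => (word, items.map(_.count).sum) }

  def main(args: Array[String]): Unit = {
    // Made-up sample records, not real n-gram data
    val sample = Seq(
      NGramData("scala", 2009, 3),
      NGramData("native", 2009, 1),
      NGramData("scala", 2010, 5)
    )
    println(sumWithMutableMap(sample))
    println(sumWithGroupBy(sample))
  }
}
```

Both functions should produce the same totals. The groupBy version builds an intermediate collection for every word before summing it, which is one reason it tends to cost more than the single-pass mutable map.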

However, when scale is a concern, planning this sort of bulk data-processing job can be a genuinely hard problem, and Scala frameworks like Spark[18] often do a great job of it. A common technique for ...
