We start with a standard preamble. Every Spark program needs a context to work with; the context is where settings such as the number of threads are defined. Here we simply use the defaults. It's worth noting that Spark automatically makes use of the underlying CPU cores as needed, without any specific intervention on our part.
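A minimal sketch of that preamble in PySpark, assuming a default configuration (the application name here is just an illustrative choice):

```python
from pyspark import SparkConf, SparkContext

# Default configuration; Spark picks up the available local cores on its own.
conf = SparkConf().setAppName("WordCount")  # app name is an assumption for illustration
sc = SparkContext(conf=conf)
```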
Then we load the text file into memory using a standard method available in Spark. If we were reading from a database instead, we might be able to parallelize the read by splitting it over different ranges of the primary key.
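Loading the file might look like the following sketch; the file path is a hypothetical placeholder:

```python
# textFile returns an RDD with one element per line of the input file.
lines = sc.textFile("input.txt")  # path is an assumption for illustration
```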
Once the file is loaded, we split each line into words and use a lambda function to tick off each occurrence of a word. The code is really creating a new record for ...
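One common way to express that split-and-count step is sketched below; the variable names are assumptions, not necessarily those used in the book's listing:

```python
# Split each line into words, emit a (word, 1) pair for every occurrence,
# then sum the counts per word.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# Pull a small sample back to the driver to inspect the results.
for word, count in counts.take(10):
    print(word, count)
```

Because `reduceByKey` combines values locally on each partition before shuffling, the per-word counts are aggregated in parallel across whatever cores Spark has available.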