Now that we have seen some of the functionality, let's explore further. We can use a similar script to count the word occurrences in a file, as follows:
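import pyspark

# Reuse the SparkContext if one already exists in this notebook
if 'sc' not in globals():
    sc = pyspark.SparkContext()

# Read the file as a dataset of lines
text_file = sc.textFile("Spark File Words.ipynb")

# Split lines into words, emit (word, 1) per occurrence, then sum per word
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

# Bring the results back to the driver and print each (word, count) pair
for x in counts.collect():
    print(x)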
We have the same preamble as in the previous script. Then we load the text file with sc.textFile, which gives us a dataset containing the lines of the file.
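To see what each stage of the pipeline produces, it can help to mimic it in plain Python, without a Spark installation. The following is only a sketch under stated assumptions: the two sample lines are made up, and the standard-library constructs merely stand in for flatMap, map, and reduceByKey. They are not how Spark implements them.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical sample input standing in for the lines of the text file
lines = ["the cat sat", "the dog sat"]

# Stand-in for flatMap: split every line and flatten into one list of words
words = list(chain.from_iterable(line.split(" ") for line in lines))

# Stand-in for map: one (word, 1) record per word occurrence
pairs = [(word, 1) for word in words]

# Stand-in for reduceByKey: combine the values of equal keys with a + b
counts = defaultdict(int)
for word, n in pairs:
    counts[word] = counts[word] + n

# Print each (word, count) pair, as the Spark script does after collect()
for item in counts.items():
    print(item)
```

In this sketch every stage is a list held in local memory, whereas Spark distributes the same steps across its workers.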
Once the file is loaded, we split each line into words and use a lambda function to tick off each occurrence of a word. The code actually creates a new record for each word occurrence. If a word appears in the stream, a record ...