April 2018
Beginner
238 pages
7h 13m
English
We can run a small Spark script that reads in a file and sums up the line lengths. We use lambda functions to map and reduce the lengths in MapReduce fashion:
import pyspark

# Create a SparkContext if one is not already running
if 'sc' not in globals():
    sc = pyspark.SparkContext()

# Read the file as an RDD of lines, map each line to its length,
# then reduce the lengths to a single sum
lines = sc.textFile("B09656_02 Spark Sample.ipynb")
lineLengths = lines.map(lambda s: len(s))
totalLengths = lineLengths.reduce(lambda a, b: a + b)
print(totalLengths)
That results in a screen that looks like this:

Note that we are running a Python 2 notebook that uses the Spark (pyspark) library.
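To see what the map and reduce steps do without a Spark cluster, the same logic can be sketched in plain Python. This is an illustrative sketch, not part of the Spark example: the sample strings below are hypothetical stand-ins for the file's lines, and `functools.reduce` plays the role of Spark's `reduce`:

```python
from functools import reduce

# Hypothetical sample lines standing in for the contents of the file
sample_lines = ["line one", "a longer second line", "third"]

# map step: compute the length of each line
line_lengths = map(len, sample_lines)

# reduce step: sum the lengths pairwise, as Spark's reduce does
total = reduce(lambda a, b: a + b, line_lengths)
print(total)  # 8 + 20 + 5 = 33
```

The difference is that Spark distributes the map and reduce steps across the cluster's workers, while this version runs in a single process.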