March 2022
Beginner to intermediate
456 pages
This chapter covers
groupby and a simple aggregate function
spark-submit to launch your program in batch mode
Chapter 2 dealt with all the data preparation work for our word frequency program. We read the input data, tokenized each word, and cleaned our records to only keep lowercase words. If we bring out our outline, we only have steps 4 and 5 to complete:
[DONE] Read: Read the input data (we’re assuming a plain text file).
[DONE] Token: Tokenize each word.
[DONE] Clean: Remove any punctuation and/or tokens ...