For this example, I am using the text of an Atlantic Monthly article titled "The World Might Be Better Off Without College for Everyone", available at https://www.theatlantic.com/magazine/archive/2018/01/whats-college-good-for/546590/.
I am using this script:
import pyspark

# Create a Spark context if one is not already running
if 'sc' not in globals():
    sc = pyspark.SparkContext()

# Read the article, merge each partition's lines into a single string,
# then split the text into sentences on periods
sentences = sc.textFile('B09656_09_article.txt') \
    .glom() \
    .map(lambda x: " ".join(x)) \
    .flatMap(lambda x: x.split("."))
print(sentences.count(), "sentences")

# Split each sentence into words and emit every adjacent word pair with a count of 1
bigrams = sentences.map(lambda x: x.split()) \
    .flatMap(lambda x: [((x[i], x[i+1]), 1) for i in range(0, len(x)-1)])
print(bigrams.count(), "bigrams")

# Sum the counts for each bigram and sort in descending order of frequency
frequent_bigrams = bigrams.reduceByKey(lambda x, y: x + y) \
    .map(lambda x: (x[1], x[0])) \
    .sortByKey(False)
...
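The glom() and join() steps merge each partition's lines back into one string, so sentences that span line breaks are not cut in half before the split on periods. Once frequent_bigrams has been built, a minimal sketch along the following lines will print the most frequent pairs for a quick check; the take(10) call and the cutoff of ten pairs are illustrative choices, not part of the script above:

# Illustrative only: pull back the ten most frequent bigrams for inspection
# (the cutoff of ten is an arbitrary choice, not taken from the script above)
for count, pair in frequent_bigrams.take(10):
    print(count, pair)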