Creating RDDs
There are two ways to create an RDD in PySpark. You can either .parallelize(...) a collection (a list or an array of some elements):
data = sc.parallelize([('Amber', 22), ('Alfred', 23), ('Skye', 4), ('Albert', 12), ('Amber', 9)])
Or you can reference a file (or files) located either locally or somewhere externally:
data_from_file = sc.textFile('/Users/drabast/Documents/PySpark_Data/VS14MORT.txt.gz', 4)
Note
We downloaded the Mortality dataset file, VS14MORT.txt, from ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/mortality/mort2014us.zip (accessed on July 31, 2016); the record schema is explained in this document: http://www.cdc.gov/nchs/data/dvs/Record_Layout_2014.pdf. We selected this dataset on purpose: The encoding of the ...