February 2017
Intermediate to advanced
274 pages
5h 58m
English
There are two ways to create an RDD in PySpark. You can either .parallelize(...) a collection (a list or an array of elements):
data = sc.parallelize(
    [('Amber', 22), ('Alfred', 23), ('Skye', 4), ('Albert', 12),
     ('Amber', 9)])
Or you can reference a file (or files) located either locally or somewhere externally:
data_from_file = sc.textFile(
    '/Users/drabast/Documents/PySpark_Data/VS14MORT.txt.gz',
    4)
We downloaded the Mortality dataset file VS14MORT.txt (accessed on July 31, 2016) from ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/mortality/mort2014us.zip; the record schema is explained in http://www.cdc.gov/nchs/data/dvs/Record_Layout_2014.pdf. We selected this dataset on purpose: The encoding of the ...