August 2016
Beginner to intermediate
334 pages
8h 27m
English
Now that we have examined several algorithms for fitting classifier models in the scikit-learn library, let us look at how we might implement a similar model in PySpark. We can use the same census dataset from earlier in this chapter, and start by loading the data using a textRdd after starting the spark context:
>>> censusRdd = sc.textFile('census.data')
Next we need to split the data into individual fields, and strip whitespace
>>> censusRddSplit = censusRdd.map(lambda x: [e.strip() for e in x.split(',')])
Now, as before, we need to determine which of our features are categorical and need to be re-encoded using one-hot encoding. We do this by taking a single row and asking whether the string in ...
Read now
Unlock full access