Spark MLlib examples
Now let's look at how to use the algorithms. Each algorithm needs a suitable dataset to work on, so we will pair every example in the next section with an appropriate one. The book text uses Scala, but I have also included IPython notebooks with the algorithm examples in Python.
The code and data files are available in the GitHub repository at https://github.com/xsankar/fdps-vii. We'll keep it updated with corrections.
Let's read the car mileage data into an RDD and then compute some basic statistics. We will use a simple parse class to parse each line of data. This approach works when you know the column types and the structure of your CSV file in advance. We will use this technique for the examples ...
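As a rough illustration of the parse-class idea, here is a minimal sketch. It is not the code from the repository: the three-column layout (`mpg`, `displacement`, `horsepower`), the file path `data/car-mileage.csv`, the assumption of a single header row, and the names `Car`, `CarStats`, and `parseCar` are all hypothetical and chosen only for this example. The real data file may have different columns.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Hypothetical record type: the real file's columns may differ.
case class Car(mpg: Double, displacement: Double, horsepower: Double)

object CarStats {
  // Parse one comma-separated line into a typed record.
  // Assumes every field is present and numeric; a malformed
  // line will throw an exception rather than be skipped.
  def parseCar(line: String): Car = {
    val fields = line.split(",").map(_.trim)
    Car(fields(0).toDouble, fields(1).toDouble, fields(2).toDouble)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CarStats").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Placeholder path: point this at your own copy of the data.
    val lines = sc.textFile("data/car-mileage.csv")

    // Assumption: the first line is a header row, so drop it.
    val header = lines.first()
    val cars = lines.filter(_ != header).map(parseCar)

    // colStats expects an RDD of MLlib vectors, one per row.
    val vectors = cars.map(c => Vectors.dense(c.mpg, c.displacement, c.horsepower))
    val summary = Statistics.colStats(vectors)

    println(s"count    = ${summary.count}")
    println(s"mean     = ${summary.mean}")
    println(s"variance = ${summary.variance}")
    println(s"min      = ${summary.min}")
    println(s"max      = ${summary.max}")

    sc.stop()
  }
}
```

Keeping the parsing in a plain function, separate from the Spark calls, means it can be checked on a single string before it is mapped over the whole RDD.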