We will now use the ideas introduced in the previous section to write a basic Spark program that manipulates a dataset. We will start with Scala and then write the same program in Java and Python. The program explores data from an online store that records which users have purchased which products. The data is contained in a comma-separated values (CSV) file called UserPurchaseHistory.csv, which is expected to be in the data directory.
The contents are shown in the following snippet. The first column of the CSV is the username, the second column is the product name, and the final column is the price:
John,iPhone Cover,9.99
John,Headphones,5.49
Jack,iPhone Cover,9.99
Jill,Samsung ...
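To make the record layout concrete, here is a minimal sketch in plain Python (not Spark, and not the program this section builds) that parses rows in the username, product, price format. It uses only the three complete records visible in the snippet. The names SAMPLE and parse_purchases, and the choice of summaries computed at the end, are assumptions made purely for illustration.

```python
import csv
import io

# The three complete records visible in the snippet; the remainder of the
# file is not reproduced here. SAMPLE is a hypothetical stand-in for the file.
SAMPLE = """John,iPhone Cover,9.99
John,Headphones,5.49
Jack,iPhone Cover,9.99
"""


def parse_purchases(text):
    """Parse CSV text into (user, product, price) tuples.

    Assumes every row has exactly three columns and a numeric price.
    """
    reader = csv.reader(io.StringIO(text))
    return [(user, product, float(price)) for user, product, price in reader]


purchases = parse_purchases(SAMPLE)

# A few simple summaries over the parsed records (illustrative only).
num_purchases = len(purchases)
unique_users = sorted({user for user, _, _ in purchases})
total_revenue = round(sum(price for _, _, price in purchases), 2)

print(num_purchases, unique_users, total_revenue)
```

Splitting each line into a tuple of typed fields is the same first step a Spark version would typically perform on each record before aggregating.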