Let's see how we can create an RDD in Apache Spark and run distributed processing on it across the cluster:
To do this, we first need to create a new Spark session, as follows:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('cloudanum').getOrCreate()
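Note that getOrCreate() returns an existing Spark session if one is already active (for example, in an interactive notebook); otherwise, it creates a new one.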
Once we have created a Spark session, we can use a CSV file as the source of the RDD. Then, we run the following code, which creates an RDD abstracted as a DataFrame called df. The ability to abstract an RDD as a DataFrame was added in Spark 2.0, and it makes the data easier to process:
# Read the CSV file into a DataFrame, inferring the schema and treating the first row as the header
df = spark.read.csv('taxi2.csv', inferSchema=True, header=True)
Let's look into the columns of the DataFrame. A minimal way to do this, assuming the df DataFrame created above, is to use the columns attribute:
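# List the column names of the DataFrame
df.columns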
Next, ...