What is parallelization?
The best way to understand Spark, or any tool, is to read its documentation. Spark's documentation states that the textFile function we used last time reads a text file from HDFS, a local file system, or any other Hadoop-supported file system URI.
The definition of parallelize, on the other hand, says that it creates an RDD by distributing a local Scala collection (in PySpark, a local Python collection).
So, the main difference between using parallelize to create an RDD and using textFile to create an RDD is where the data comes from: parallelize starts from a collection that already sits in the driver program's memory, while textFile starts from a file in storage.
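To make that difference concrete, here is a minimal sketch of the two ways of creating an RDD. The helper names, the sample file, and the application name are our own illustrative assumptions and are not part of Spark; only parallelize, textFile, and SparkContext.getOrCreate come from the PySpark API. The demo function assumes a local PySpark installation.

```python
import os
import tempfile


def rdd_from_local_collection(sc, values, num_slices=2):
    """Create an RDD from data that already lives in the driver's memory."""
    # parallelize distributes the local collection across the cluster,
    # splitting it into `num_slices` partitions.
    return sc.parallelize(values, num_slices)


def rdd_from_text_file(sc, path):
    """Create an RDD whose records are the lines of a text file."""
    # textFile sources its data from storage (HDFS, a local path, or
    # another Hadoop-supported URI) rather than from driver memory.
    return sc.textFile(path)


def demo():
    # Imported here so the helpers above can be read without Spark installed.
    from pyspark import SparkConf, SparkContext

    # Reuses the running context (for example, `sc` in the PySpark shell)
    # or starts a local one. The app name is a hypothetical placeholder.
    conf = SparkConf().setMaster("local[*]").setAppName("parallelize-vs-textfile")
    sc = SparkContext.getOrCreate(conf)

    numbers = rdd_from_local_collection(sc, [1, 2, 3, 4, 5])
    print(numbers.collect())

    # Hypothetical sample file, written only so textFile has something to read.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        with open(path, "w") as f:
            f.write("first line\nsecond line\n")
        lines = rdd_from_text_file(sc, path)
        # Run the action while the file still exists, since RDDs are lazy.
        print(lines.count())
```

Calling demo() from a script or from the PySpark shell runs both paths. In each case the result is an ordinary RDD, so the same transformations and actions apply afterwards; only the origin of the data differs.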
Let's look at how this works in practice. We'll go back to the PySpark installation screen, where we left off previously. There, we imported urllib and used urllib.request to retrieve some data from the internet, ...