Summary
RDDs are the backbone of Spark; these schema-less data structures are the most fundamental ones that we will deal with in Spark.
In this chapter, we presented two ways to create RDDs: by means of the .parallelize(...) method and by reading data from text files. We also showed some ways of processing unstructured data.
Transformations in Spark are lazy: they are only applied when an action is called. In this chapter, we discussed and presented the most commonly used transformations and actions; the PySpark documentation contains many more (http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD).
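To make the idea of lazy transformations concrete, here is a minimal plain-Python sketch. It is a toy model and not how Spark is implemented: the ToyRDD class is hypothetical and only borrows the method names .map(...), .filter(...), and .collect() to mimic the shape of the RDD API. The transformations merely record what should happen; the work runs only when the action is called.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD (an assumption for illustration, not PySpark)."""

    def __init__(self, data, steps=None):
        self._data = list(data)
        self._steps = steps or []  # transformations recorded so far

    def map(self, fn):
        # Transformation: record the step and return a new object; nothing runs yet.
        return ToyRDD(self._data, self._steps + [("map", fn)])

    def filter(self, pred):
        # Transformation: also only recorded.
        return ToyRDD(self._data, self._steps + [("filter", pred)])

    def collect(self):
        # Action: apply every recorded step, in order, and return the results.
        rows = iter(self._data)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


calls = []  # filled in whenever double() actually executes


def double(x):
    calls.append(x)
    return x * 2


pipeline = ToyRDD([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = list(calls)  # snapshot taken before any action is called

result = pipeline.collect()  # the action triggers the recorded transformations
```

In this sketch, calls_before_action is still empty after the pipeline is defined, because double() has not been invoked; only the call to .collect() runs the recorded steps and produces a result.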
One major distinction between Scala and Python RDDs is speed: Python RDDs can be much slower than ...