Spark programming starts with one or more datasets, usually residing in some form of distributed, persistent storage such as HDFS. The typical RDD programming model that Spark provides can be described as follows:
- Starting from an environment variable, the Spark context creates an initial RDD object that references the data. (The Spark shell provides you with a Spark context, or you can create your own; this is described later in this chapter.)
- Transform the initial RDD to create more RDD objects following the functional programming style (to be discussed later on).
- Send the code, algorithms, or applications from the driver program to the cluster manager nodes. Then, the cluster manager provides a copy to each computing node. ...
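The first two steps above can be pictured without a cluster. What follows is only a toy, single-process sketch in plain Python, not Spark's API: `ToyContext` and `ToyRDD` are hypothetical classes invented for this illustration, and their method names (`parallelize`, `map`, `filter`, `collect`) are loosely modeled on the Spark methods of the same names. The sketch assumes nothing about distribution or storage; it only shows a context creating an initial data reference and each transformation producing a new object rather than modifying the old one.

```python
from typing import Callable, Iterable, List


class ToyRDD:
    """Hypothetical, immutable stand-in for an RDD (not Spark's class)."""

    def __init__(self, records: Iterable):
        # Keep a frozen copy so the records are never modified in place.
        self._records = tuple(records)

    def map(self, fn: Callable) -> "ToyRDD":
        # A transformation returns a new ToyRDD; the parent is untouched.
        return ToyRDD(fn(r) for r in self._records)

    def filter(self, pred: Callable) -> "ToyRDD":
        # Keep only the records for which the predicate holds.
        return ToyRDD(r for r in self._records if pred(r))

    def collect(self) -> List:
        # Hand the records back to the caller (the driver, in Spark terms).
        return list(self._records)


class ToyContext:
    """Hypothetical stand-in for the Spark context (the entry point)."""

    def parallelize(self, data: Iterable) -> ToyRDD:
        # Create the initial data reference from an in-memory collection.
        return ToyRDD(data)


sc = ToyContext()
initial = sc.parallelize([1, 2, 3, 4, 5])            # initial RDD-like object
squared = initial.map(lambda x: x * x)               # derived object
even_squares = squared.filter(lambda x: x % 2 == 0)  # derived from a derived object

print(even_squares.collect())  # [4, 16]
print(initial.collect())       # [1, 2, 3, 4, 5] (the original is unchanged)
```

Passing small functions such as the lambdas above to `map` and `filter` is the functional style the second step refers to. In real Spark the context would typically read from storage such as HDFS and the work would run on the computing nodes, neither of which this sketch attempts to model.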