This recipe explains the fundamentals of the Spark programming model. It covers the basics of RDDs: Spark provides a Resilient Distributed Dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. It also covers how to create RDDs and perform transformations and actions on them.
Let's try a transformation such as filter, and a few actions such as top, and so on, in the Spark shell:
scala> val data = Array(1, 2, 3, 4, 5)

scala> val rddData = sc.parallelize(data)

scala> val mydata = rddData.filter(ele => ele%2==0)
mydata: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD at filter at ...
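
To round off the example, a couple of the actions mentioned above can be run on the same RDDs. The following is a minimal sketch continuing the shell session, assuming the rddData and mydata RDDs defined previously; collect and top are standard RDD actions, and the shell output shown is indicative:

scala> mydata.collect()
res0: Array[Int] = Array(2, 4)

scala> rddData.top(2)
res1: Array[Int] = Array(5, 4)

Note that transformations such as filter are lazy and only describe a new RDD; it is the actions, such as collect and top, that trigger the actual computation and return results to the driver.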