Fast Data Processing with Spark 2 - Third Edition by Krishna Sankar

Loading data into an RDD

In this chapter, we will examine the different sources you can use for your RDD. If you decide to run through the examples in the Spark shell, you can call .first() on the RDDs you generate to check whether the data can be loaded; .cache() on its own only marks an RDD for caching and does not trigger a read. In Chapter 2, Using the Spark Shell, you learned how to load text data from a file and from S3. In this chapter, we will look at the different data formats (text file and CSV) and the different sources (filesystem and HDFS) supported.
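
As a quick illustration of the load check described above, here is a minimal sketch, assuming a local Spark 2.x installation. The object name FirstLineCheck, the helper firstLine, and the temporary sample file are hypothetical and exist only for this example; in the Spark shell the sc value is already provided, so only the textFile and first calls would be needed.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not taken from the book.
object FirstLineCheck {

  // Reads a text file into an RDD and returns its first line.
  // first() is an action, so it forces Spark to actually open the source.
  def firstLine(sc: SparkContext, uri: String): String =
    sc.textFile(uri).first()

  def main(args: Array[String]): Unit = {
    // Local-mode session so the sketch runs without a cluster.
    val spark = SparkSession.builder()
      .appName("first-line-check")
      .master("local[*]")
      .getOrCreate()

    // Create a small throwaway file so the example is self-contained.
    val path = Files.createTempFile("sample", ".txt")
    Files.write(path, "alpha\nbeta\n".getBytes(StandardCharsets.UTF_8))

    println(firstLine(spark.sparkContext, path.toUri.toString)) // prints "alpha"

    spark.stop()
  }
}
```

If the path is wrong or unreadable, the first() call is where the failure surfaces, which is what makes it a convenient check.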

One of the easiest ways to create an RDD is to take an existing Scala collection and convert it into an RDD. The SparkContext object provides a function called parallelize that takes a Scala collection and converts it into an RDD of the same type as ...
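
The use of parallelize mentioned above can be sketched as follows. This is a minimal sketch, assuming a local Spark 2.x installation; the object name ParallelizeSketch and the helper doubled are hypothetical names chosen for this example. In the Spark shell, sc is predefined and sc.parallelize can be called directly.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not taken from the book.
object ParallelizeSketch {

  // Turns a local Scala collection into an RDD, doubles each element,
  // and brings the results back to the driver as an array.
  def doubled(sc: SparkContext, xs: Seq[Int]): Array[Int] = {
    val rdd: RDD[Int] = sc.parallelize(xs) // a Seq[Int] yields an RDD[Int]
    rdd.map(_ * 2).collect()
  }

  def main(args: Array[String]): Unit = {
    // Local-mode session so the sketch runs without a cluster.
    val spark = SparkSession.builder()
      .appName("parallelize-sketch")
      .master("local[*]")
      .getOrCreate()

    val result = doubled(spark.sparkContext, Seq(1, 2, 3, 4, 5))
    println(result.mkString(", ")) // prints "2, 4, 6, 8, 10"

    spark.stop()
  }
}
```

Because collect() copies every element to the driver, this pattern suits small collections used for experimentation rather than large datasets.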
