Fast Data Processing with Spark by Holden Karau

Loading data into an RDD

This chapter now examines the different sources you can use to create an RDD. If you decide to run through the examples in the Spark shell, you can call .cache() or .first() on the RDDs you generate to verify that the data can be loaded: .cache() marks an RDD to be kept in memory once computed, while .first() forces evaluation and returns the first element. In Chapter 2, Using the Spark Shell, you learned how to load text data from a file and from the S3 storage system; here we will look at the different formats of data and the different sources that are supported.

One of the easiest ways of creating an RDD is to take an existing Scala collection and convert it into an RDD. The Spark context provides a function called parallelize; it takes a Scala collection and turns it into an RDD whose element type matches that of the input collection.

  • Scala:
    val dataRDD = sc.parallelize(List(1, 2, 4))
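As a slightly fuller sketch, assuming a running Spark shell where the SparkContext is bound to `sc` (the collection and variable names here are illustrative, not from the original text):

```scala
// Assumes a Spark shell session, where the SparkContext is available as `sc`.
val nums = List(1, 2, 4)

// parallelize distributes the local collection across the cluster,
// producing an RDD[Int] -- the element type matches the input collection.
val dataRDD = sc.parallelize(nums)

// first() forces evaluation and returns the first element,
// confirming that the data can be loaded.
dataRDD.first()   // Int = 1

// cache() marks the RDD to be kept in memory after it is first computed,
// so later actions on it avoid recomputation.
dataRDD.cache()
```

Note that parallelize is mainly useful for small test datasets and experiments; the rest of the chapter covers loading larger data from external sources.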
