7. Ingestion from files

This chapter covers

  • Common behaviors of parsers
  • Ingesting from CSV, JSON, XML, and text files
  • Understanding the difference between one-line and multiline JSON records
  • Understanding the need for big data-specific file formats

Ingestion is the first step of your big data pipeline. You will have to onboard the data in your instance of Spark, whether it is in local mode or cluster mode. As you know by now, data in Spark is transient, meaning that when you shut down Spark, it’s all gone. You will learn how to import data from standard files including CSV, JSON, XML, and text.

In this chapter, after learning about common behaviors among various parsers, you’ll use made-up datasets to illustrate specific cases, as well as ...

Get Spark in Action, Second Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.