Addressing errors in ETL pipelines

ETL tasks are generally considered complex, expensive, slow, and error-prone. In this section, we examine typical challenges in ETL processes and how Spark SQL features help address them.

Spark can automatically infer the schema from a JSON file. For example, for the following JSON data, the inferred schema contains all the keys and their data types based on the content. Here, because every value in the input data is a whole number, the data types of all the elements default to long:

test1.json

{"a":1, "b":2, "c":3}{"a":2, "d":5, "e":3}{"d":1, "c":4, "f":6}{"a":7, "b":8}{"c":5, "e":4, "d":3}{"f":3, "e":3, "d":4}{"a":1, "b":2, "c":3, "f":3, "e":3, "d":4}

You can print the schema to verify the data types, as shown:

scala> spark.read.json("file:///Users/aurobindosarkar/Downloads/test1.json").printSchema() ...
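
As a quick sanity check, the following is a minimal sketch of the same inspection for spark-shell, where the spark session is predefined; the path is the one used above. Given the seven records in test1.json, the inferred schema should list the six keys a through f, each typed long:

// Sketch for spark-shell, where `spark` (a SparkSession) is predefined.
// Spark samples the JSON records and creates a field for every key it sees;
// whole-number values are inferred as LongType by default.
val df = spark.read.json("file:///Users/aurobindosarkar/Downloads/test1.json")

df.printSchema()
// Expected output for the seven records above:
// root
//  |-- a: long (nullable = true)
//  |-- b: long (nullable = true)
//  |-- c: long (nullable = true)
//  |-- d: long (nullable = true)
//  |-- e: long (nullable = true)
//  |-- f: long (nullable = true)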

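Inference also means the types are chosen for you. If a pipeline needs control over them, for instance, integers rather than the default longs, one option is to pass an explicit schema to the reader instead. The snippet below is a sketch of that approach, not taken from the original example: the field names mirror test1.json, and keys missing from a given record simply come back as null.

import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical explicit schema: the same six fields as test1.json, but typed
// as integers instead of the inferred longs.
val explicitSchema = StructType(
  Seq("a", "b", "c", "d", "e", "f")
    .map(name => StructField(name, IntegerType, nullable = true))
)

// Supplying a schema skips inference entirely, which also avoids the extra
// pass over the data that inference requires.
val typedDf = spark.read
  .schema(explicitSchema)
  .json("file:///Users/aurobindosarkar/Downloads/test1.json")

typedDf.printSchema()
// root
//  |-- a: integer (nullable = true)
//  |-- b: integer (nullable = true)
//  ... and likewise for c through f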