In this chapter, we will look at how to write a typical batch processing data pipeline using .NET for Apache Spark. We will show how a typical data processing job reads the source data, parses it while dealing with any oddities the source files may have, and then writes the files out to a common format that other consumers of the data can use.
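To make the shape of such a job concrete, the sketch below shows a minimal .NET for Apache Spark batch job that reads a CSV file and writes it back out as Parquet. The file paths, the application name, and the choice of CSV input and Parquet output are assumptions for illustration only; the real pipeline in this chapter will add parsing and clean-up steps between the read and the write.

```csharp
using Microsoft.Spark.Sql;

namespace BatchPipeline
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create (or reuse) the Spark session for this batch job.
            SparkSession spark = SparkSession
                .Builder()
                .AppName("csv-to-parquet-batch") // hypothetical app name
                .GetOrCreate();

            // Read the raw source data; a headered CSV file is assumed here.
            DataFrame source = spark
                .Read()
                .Option("header", "true")
                .Option("inferSchema", "true")
                .Csv("/data/incoming/source.csv"); // hypothetical input path

            // Any parsing and tidying of imperfect data would happen here,
            // before the cleaned DataFrame is written out.

            // Write the data to a common format (Parquet) for downstream consumers.
            source
                .Write()
                .Mode(SaveMode.Overwrite)
                .Parquet("/data/processed/source"); // hypothetical output path

            spark.Stop();
        }
    }
}
```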
Imperfect Source Data
It is rare, when we are working with data sources, for the files to be in perfect condition for processing; we often have to do some work to tidy the data, and in the example we will use in this chapter, this is as ...